TERCOM algorithm - porting from single thread to multithreaded CUDA

2023-09-11 07:05:26 Author: ┊非卖品┊

I'm currently working on porting a TERCOM algorithm from using only one thread to using multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm will choose the best match, i.e. the lowest Mean Absolute Difference (MAD), and return the position.

The code is working perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation the same way as the single-threaded version. Does anyone know what I am doing wrong?

Here's the code using for-loops

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min) 
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0;                  //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

This is my attempt to use multiple threads

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]); 
        }

        if(MAD<min) 
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0;                  //Reset MAD
    }
    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

To launch the kernel

dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;

kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);

Solution

The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.

As best as I can tell, the use of f as scratch space isn't even necessary at all and the main computational section of the kernel could be written as something like:

if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval); 
    }
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}

[disclaimer: written in browser, use at own risk]

At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory (i.e. the output must have enough space for each thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. That could be done on the host, in device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.