TERCOM algorithm - porting from single thread to multithreaded CUDA

2023-09-11 07:05:26 Author: ┊非卖品┊

I'm currently working on porting a TERCOM algorithm from using only one thread to using multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm will choose the best match, i.e. the lowest Mean Absolute Difference (MAD), and return the position.

The code is working perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation the same way as the single-threaded version. Does anyone know what I am doing wrong?

Here's the code using for-loops

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min) 
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0;                  //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

This is my attempt to use multiple threads

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]); 
        }

        if(MAD<min) 
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0;                  //Reset MAD
    }
    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

To launch the kernel

dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;

kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);

Solution

The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.

As best as I can tell, the use of f as scratch space isn't even necessary at all and the main computational section of the kernel could be written as something like:

if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval); 
    }
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}

[disclaimer: written in browser, use at own risk]

At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory (i.e. the output must have enough space for each thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. That could be done on the host, in device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.