如何改善这种实现的基数排序的？基数

2023-09-11 07:05:42 作者：丿弹指间的念想

我实施2个字节的基数排序。这个概念是使用计数排序，向整数，则高16位的低16位排序。这使我在2次迭代运行排序。第一个概念，我不得不试图弄清楚如何处理的底片。因为符号位将被翻转为负数，则在十六进制形式中，这将使底片大于阳性。为了解决这个问题我翻转的符号位时，它是正的，以使[0，2亿元）= [128 000 000 000 255 255 ...）。并且当它是负我翻转的所有位，以使其范围为（000 000 ..，127 255 ..）。这网站帮我处理这些信息。要完成它，我会分裂成整数的顶部或基于通低16位。以下是code允许我这样做。

I'm implementing a 2-byte Radix Sort. The concept is to use Counting Sort, to sort the lower 16 bits of the integers, then the upper 16 bits. This allows me to run the sort in 2 iterations. The first concept I had was trying to figure out how to handle negatives. Since the sign bit would be flipped for negative numbers, then in hex form, that would make negatives greater than the positives. To combat this I flipped the sign bit when it was positive, in order to make [0, 2 bil) = [128 000 000 000, 255 255...). And when it was negative I flipped all the bits, to make it range from (000 000 .., 127 255 ..). This site helped me with that information. To finish it off, I would split the integer into either the top or bottom 16-bits based on the pass. The following is the code allowing me to do that.

static uint32_t position(int number, int pass) {
    int mask;
    if (number <= 0) mask = 0x80000000;
    else mask = (number >> 31) | 0x80000000;
    uint32_t out = number ^ mask;
    return pass == 0 ? out & 0xffff : (out >> 16) & 0xffff;
}

要开始实际的基数排序，我需要形成规模65536元的直方图。我跑过的问题是，当输入元件的数量是非常大的。这将需要一段时间来创建直方图，所以我并行实现它，使用进程和共享内存。我划分了数组的大小/ 8小节然后通过共享内存数组大小65536 * 8，我有每个进程创建自己的直方图。后来，我把它概括在一起，形成一个单一的柱状图。以下是code为：

To start the actual Radix Sort, I needed to form a histogram of size 65536 elements. The problem I ran across was when the number of elements inputted was very large. It would take a while to create the histogram, so I implemented it in parallel, using processes and shared memory. I partitioned the array into subsections of size / 8. Then over an array of shared memory sized 65536 * 8, I had each process create its own histogram. Afterwards, I summed it all together to form a single histogram. The following is the code for that:

for (i=0;i<8;i++) {
    pid_t pid = fork();
    if (pid < 0) _exit(0);
    if (pid == 0) {
        const int start = (i * size) >> 3;
        const int stop  = i == 7 ? size : ((i + 1) * size) >> 3;
        const int curr  = i << 16;
        for (j=start;j<stop;++j)
            hist[curr + position(array[j], pass)]++;
        _exit(0);
    }
}
for (i=0;i<8;i++) wait(NULL);

for (i=1;i<8;i++) {
    const int pos = i << 16;
    for (j=0;j<65536;j++)
        hist[j] += hist[pos + j];
}

接下来的部分是在那里我度过了我的大部分时间来分析如何缓存影响了preFIX森的表现。用一个8位和11位通基数排序，所有的直方图将适合于L1高速缓存。随着16位，所以只适合在二级高速缓存。最终16位直方图跑的总和最快的，因为我只跑2次迭代它。我也跑并行preFIX款项按照CUDA网站的建议。在2.5亿元，这比跑的16位整数慢1.5秒。所以，我的preFIX款项最终看上去是这样的：

The next part was where I spent most of my time analyzing how cache affected the performance of the prefix-sum. With an 8-bit and 11-bit pass Radix Sort, all of the histogram would fit within L1 cache. With 16-bits, it would only fit within L2 cache. In the end the 16-bit histogram ran the sum the fastest, since I only had to run 2 iterations with it. I also ran the prefix sum in parallel as per the CUDA website recommendations. At 250 million elements, this ran about 1.5 seconds slower than the 16-bit integer. So my prefix sum ended up looking like this:

for (i=1;i<65536;i++)
    hist[i] += hist[i-1];

剩下的唯一的事情是向后遍历数组，并把所有的元素融入到他们的临时数组中的各个点。因为我只有去那里而不是从临时回阵列复制，并再次运行code到两次。我跑了第一次使用数组作为输入，并临时作为输出的排序。然后运行它使用温度作为输入和阵列作为输出的第二时间。这使我从MEM-复制回阵两次。在code看起来像这样的实际排序：

The only thing left was to traverse backwards through the array and put all the elements into their respective spots in the temp array. Since I only had to go through twice, instead of copying from the temp back to array, and running the code again. I ran the sort first using array as the input, and temp as the output. Then ran it the second time using temp as the input and array as the output. This kept me from mem-copying back to array both times. The code looks like this for the actual sort:

histogram(array, size, 0, hist);
for (i=size-1;i>=0;i--)
    temp[--hist[position(array[i], 0)]] = array[i];

memset(hist, 0, arrSize);
histogram(temp, size, 1, hist);
for (i=size-1;i>=0;i--)
    array[--hist[position(temp[i], 1)]] = temp[i];

此链接包含完整code，我有这么远。我跑针对快速排序测试，并且它跑得更快之间5倍和10倍带整数和浮点数，和大约5倍的速度用8字节的数据类型。有没有一种方法来改善呢？

This link contains the full code that I have so far. I ran a test against quicksort, and it ran between 5 and 10 times faster with integers and floats, and about 5 times faster with 8-byte data types. Is there a way to improve on this?

如何改善这种实现的基数排序的？基数

推荐答案