最快的排序算法以百万计的UINT64 RGBZ图形像素算法、像素、图形、最快

2023-09-11 22:43:59 作者:故与放纵

我整理10+亿UINT64s与.RAW文件中的RGB数据和我的C程序的时间79%都用在的qsort。我期待为这个特定的数据类型更快的排序。

I am sorting 10+ million UINT64s with RGB data from .RAW files and 79% of my C program time is spent in qsort. I am looking for a faster sort for this specific data type.

作为原始图形数据,该数字是很随机的,约80%是独一无二的。没有部分排序或排序的数据运行可期。该UINT64内的4 UINT16s为R,G,B和零(可能是一个小型的计数&其中; =〜20)。

Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 UINT16s inside the UINT64 are R, G, B and zero (possibly a small count <= ~20).

我有最简单的比较功能,我能想到使用无符号长long(有你不能只是相减)的:

I have the simplest comparison function I can think of using UNSIGNED long longs (you CAN NOT just subtract them):

qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); 
...
int comp_uint64(const void *a, const void *b)  {
    if(*((uint64_t *)a) > *((uint64_t *)b))  return(+1);
    if(*((uint64_t *)a) < *((uint64_t *)b))  return(-1);
    return(0);
}  // End Comp_uint64().

有一个非常有趣;在StackExchange编程拼图及放code高尔夫,但他们使用的花车。然后还有的qsort,RecQuick,堆,跟屁虫,树,麦冬,...

There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix, ...

在斯文森/排序看上去有趣,但为我的数据类型,UINT64没有(明显)的支持。而快速排序时间是最好的。有消息说,系统的qsort可以是任何东西,不一定是快速排序。

The swenson/sort looked interesting but had no (obvious) support for my datatype, UINT64. And the "quick sort" time was the best. Some sources say the system "QSORT" can be anything, not necessarily "Quick Sort".

一个C ++类绕过空指针的一般铸造和超过C.性能实现了很大的改进必须有一个优化的方法,通过以异常的速度在64位处理器大满贯U8s。

A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.

我目前使用草莓P​​erl的海湾合作​​委员会 gcc版本4.9.2(x86_64的-POSIX的sjlj,由strawberryperl.com建 英特尔2700K的Sandy Bridge处理器,32GB DDR3 窗户7/64亲

I am currently using the GCC with Strawberry Perl gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com Intel 2700K Sandy Bridge CPU, 32GB DDR3 windows 7/64 pro

GCC -D__USE_MINGW_ANSI_STDIO -O 4 -ffast,数学-m64 -Ofast -march = corei7-AVX -mtune = corei7 -Ic:/斌/ xxHash主-Lc:/斌/ xxHash主C:/斌/ STDDEV .C -oc:/bin/stddev.g6.exe

gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe

=============================================== ================== 第一次尝试一个更好的qsort和qSort()! 试图用迈克尔·托卡列夫的内联的qsort。

================================================================= FIRST ATTEMPT at a better qsort, QSORT()! Tried to use Michael Tokarev's inline qsort.

在一些准备使用的例子: * 在整数数组排序: 无效int_qsort(INT *改编,无符号N){

定义int_lt(A,B)((* A)≤(* B))

*的qsort(INT,编曲,N,int_lt);

Several ready-to-use examples: * Sorting array of integers: void int_qsort(int *arr, unsigned n) {

define int_lt(a,b) ((*a)<(*b))

* QSORT(int, arr, n, int_lt);

从变更类型为INT到uint64_t中 在TYPE编译错误???

Change from type "int" to "uint64_t" compile error on TYPE???

c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
  QSORT(uint64_t, hpidx, num_pix, islt);

我无法找到一个真正的,编制,工作程序示例中,只用了一般概念

I can't find a real, compiling, working example program, just comments with the "general concept"

#define QSORT_TYPE uint64_t 
#define islt(a,b) ((*a)<(*b))

uint64_t *QSORT_BASE; 
int QSORT_NELT;

hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t));  // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix;  // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);  
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT);  // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS

准备使用的示例使用类型的诠释,字符*和结构ELT。是不是uint64_t中一种类型?尝试长长

The "ready-to-use" examples use types of "int", "char *" and "struct elt". Isn't uint64_t a type?? Try "long long"

QSORT(long long, hpidx, num_pix, islt); 
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
 QSORT(long long, hpidx, num_pix, islt);
 ==========================================================

下次尝试RADIXSORT:

NEXT ATTEMPT RADIXSORT:

结果:RADIX_SORT激进

Results: RADIX_SORT is RADICAL!

  I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort       
 << 1.40 sec average
4) Time=1.411 sec    = 49.61%, Event RADIX_SORT        , hits=1
4) Time=1.396 sec    = 49.13%, Event RADIX_SORT        , hits=1
4) Time=1.392 sec    = 49.15%, Event RADIX_SORT        , hits=1
16) Time=1.414 sec    = 49.12%, Event RADIX_SORT        , hits=1

I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort 
 << 5.525 sec average  = 3.95 time slower
4) Time=5.538 sec    = 86.34%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.41%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.02%, Event QSort             , hits=1
4) Time=5.563 sec    = 79.49%, Event QSort             , hits=1
4) Time=5.684 sec    = 79.83%, Event QSort             , hits=1
4) Time=5.509 sec    = 79.30%, Event QSort             , hits=1

3.94倍,比任何一种的qsort开箱即用使用了速度更快!

3.94 times faster than whatever sort "qsort" out of the box uses!

和,更重要的是,有实际的,工作code,你需要通过一些大师谁假设你知道他们知道的一切,并在另外20%可以填补给予什么不只是80%。

And, even more importantly, there was actual, working code, not just 80% of what you need given by some Guru who assumes you know everything they know and can fill in the other 20%.

神奇的解决方案!感谢路易斯·里奇!

Fantastic solution! Thanks Louis Ricci!

推荐答案

我会用基数排序与一个8位的基数。对于64位值一个很好的优化基数排序将不得不遍历列表9次(一至precalculate计数和偏移量和8 64位/ 8位)。 9 * N的时间和2 ​​ N空间(使用阴影阵列)。

I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).

下面是一个优化的基数排序是什么样子。

Here's what an optimized radix sort would look like.

typedef union {
    struct {
        uint32_t c8[256];
        uint32_t c7[256];
        uint32_t c6[256];
        uint32_t c5[256];
        uint32_t c4[256];
        uint32_t c3[256];
        uint32_t c2[256];
        uint32_t c1[256];
    };
    uint32_t counts[256 * 8];
} rscounts_t;

uint64_t * radixSort(uint64_t * array, uint32_t size) {
    rscounts_t counts;
    memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
    uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
    uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
    uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
    uint32_t x;
    // calculate counts
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        t7 = (array[x] >> 8) & 0xff;
        t6 = (array[x] >> 16) & 0xff;
        t5 = (array[x] >> 24) & 0xff;
        t4 = (array[x] >> 32) & 0xff;
        t3 = (array[x] >> 40) & 0xff;
        t2 = (array[x] >> 48) & 0xff;
        t1 = (array[x] >> 56) & 0xff;
        counts.c8[t8]++;
        counts.c7[t7]++;
        counts.c6[t6]++;
        counts.c5[t5]++;
        counts.c4[t4]++;
        counts.c3[t3]++;
        counts.c2[t2]++;
        counts.c1[t1]++;
    }
    // convert counts to offsets
    for(x = 0; x < 256; x++) {
        t8 = o8 + counts.c8[x];
        t7 = o7 + counts.c7[x];
        t6 = o6 + counts.c6[x];
        t5 = o5 + counts.c5[x];
        t4 = o4 + counts.c4[x];
        t3 = o3 + counts.c3[x];
        t2 = o2 + counts.c2[x];
        t1 = o1 + counts.c1[x];
        counts.c8[x] = o8;
        counts.c7[x] = o7;
        counts.c6[x] = o6;
        counts.c5[x] = o5;
        counts.c4[x] = o4;
        counts.c3[x] = o3;
        counts.c2[x] = o2;
        counts.c1[x] = o1;
        o8 = t8; 
        o7 = t7; 
        o6 = t6; 
        o5 = t5; 
        o4 = t4; 
        o3 = t3; 
        o2 = t2; 
        o1 = t1;
    }
    // radix
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        cpy[counts.c8[t8]] = array[x];
        counts.c8[t8]++;
    }
    for(x = 0; x < size; x++) {
        t7 = (cpy[x] >> 8) & 0xff;
        array[counts.c7[t7]] = cpy[x];
        counts.c7[t7]++;
    }
    for(x = 0; x < size; x++) {
        t6 = (array[x] >> 16) & 0xff;
        cpy[counts.c6[t6]] = array[x];
        counts.c6[t6]++;
    }
    for(x = 0; x < size; x++) {
        t5 = (cpy[x] >> 24) & 0xff;
        array[counts.c5[t5]] = cpy[x];
        counts.c5[t5]++;
    }
    for(x = 0; x < size; x++) {
        t4 = (array[x] >> 32) & 0xff;
        cpy[counts.c4[t4]] = array[x];
        counts.c4[t4]++;
    }
    for(x = 0; x < size; x++) {
        t3 = (cpy[x] >> 40) & 0xff;
        array[counts.c3[t3]] = cpy[x];
        counts.c3[t3]++;
    }
    for(x = 0; x < size; x++) {
        t2 = (array[x] >> 48) & 0xff;
        cpy[counts.c2[t2]] = array[x];
        counts.c2[t2]++;
    }
    for(x = 0; x < size; x++) {
        t1 = (cpy[x] >> 56) & 0xff;
        array[counts.c1[t1]] = cpy[x];
        counts.c1[t1]++;
    }
    free(cpy);
    return array;
}

修改这个实现是基于一个JavaScript版本的Fastest办法在JavaScript中32位有符号整数数组排序?

EDIT this implementation was based on a JavaScript version Fastest way to sort 32bit signed integer arrays in JavaScript?

下面是IDEONE对于C基数排序 http://ideone.com/JHI0d9

Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9