Efficiently find binary strings with low Hamming distance in a large set

2023-09-10 23:44:11 Author: 星可摘你不不可得

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

The actual data structure used to hold the list is open; performance requirements dictate an in-memory solution. The cost to build the data structure is secondary; a low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
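For what it's worth, that bit-flip idea is easy to sketch, assuming the list is kept as a sorted array probed with the standard C bsearch (the helper names cmp_u32 and query_dist1 are mine, not from the question):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Comparator for bsearch over uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Print every member of the sorted array `list` within Hamming distance 1
   of `input`: one exact probe, then one probe per flipped bit. */
static void query_dist1(const uint32_t *list, size_t n, uint32_t input)
{
    if (bsearch(&input, list, n, sizeof *list, cmp_u32))
        printf("%08" PRIx32 " (distance 0)\n", input);
    for (int bit = 0; bit < 32; bit++) {
        uint32_t probe = input ^ (UINT32_C(1) << bit);  /* flip one bit */
        if (bsearch(&probe, list, n, sizeof *list, cmp_u32))
            printf("%08" PRIx32 " (distance 1)\n", probe);
    }
}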

How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?

Solution

Question: What do we know about the Hamming distance d(x,y)?

Answer:

It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.

Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
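To make the pruning concrete, here is a minimal BK-tree lookup sketch in C. The node layout (one child pointer per possible distance 0..32) is one common BK-tree representation, not something this answer prescribes; the key step is that the triangle inequality restricts matches to children whose edge distance k satisfies |k - d| ≤ dmax, where d is the distance from the query to the node's value.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One common BK-tree layout: child[k] roots the subtree of points whose
   Hamming distance from `value` is exactly k (0..32 for 32-bit keys). */
struct bk_node {
    uint32_t value;
    struct bk_node *child[33];
};

/* Print all values in the subtree within `dmax` of `query`. */
static void bk_query(const struct bk_node *node, uint32_t query, int dmax)
{
    if (node == NULL)
        return;
    int d = __builtin_popcount(node->value ^ query);
    if (d <= dmax)
        printf("%08" PRIx32 " (distance %d)\n", node->value, d);
    /* Triangle inequality: any match lies at distance d-dmax..d+dmax
       from this node, so every other child can be skipped. */
    int lo = d - dmax < 0 ? 0 : d - dmax;
    int hi = d + dmax > 32 ? 32 : d + dmax;
    for (int k = lo; k <= hi; k++)
        bk_query(node->child[k], query, dmax);
}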

Footnote: If you are comparing the Hamming distance of fixed-width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

static inline int distance(unsigned x, unsigned y)
{
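    /* XOR marks each differing bit; popcount counts them. */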
    return __builtin_popcount(x^y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple of opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
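For reference, the "usual mask/shift/add code" alluded to is presumably the classic SWAR popcount, along these lines:

#include <stdint.h>

/* Classic SWAR population count: sum bits in progressively wider fields. */
static inline int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555);                 /* 2-bit sums */
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0f;                 /* 8-bit sums */
    return (int)((x * 0x01010101) >> 24);            /* add the 4 bytes */
}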

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
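Under the same assumptions as the BK sketch above, a VP node and its pruned search might look like this (the field names radius/inside/outside are mine):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* VP-tree node: one radius splits the points into two thick "shells",
   versus the many thin per-distance shells of a BK node. */
struct vp_node {
    uint32_t center;
    int radius;                   /* inside:  d(center, p) <= radius */
    struct vp_node *inside;
    struct vp_node *outside;      /* outside: d(center, p) >  radius */
};

/* Print all values in the subtree within `dmax` of `query`. */
static void vp_query(const struct vp_node *node, uint32_t query, int dmax)
{
    if (node == NULL)
        return;
    int d = __builtin_popcount(node->center ^ query);
    if (d <= dmax)
        printf("%08" PRIx32 " (distance %d)\n", node->center, d);
    /* Triangle inequality prunes whole halves of the tree: */
    if (d - dmax <= node->radius)   /* matches may lie inside the sphere  */
        vp_query(node->inside, query, dmax);
    if (d + dmax > node->radius)    /* matches may lie outside the sphere */
        vp_query(node->outside, query, dmax);
}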

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query

                -- BK Tree --   -- VP Tree --   -- Linear --
Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
1          0.90 3800     0.048% 4200     0.048%
2         11     300     0.68%   330     0.65%
3        130      56     3.8%     63     3.4%
4        970      18    12%       22    10%
5       5700       8.5  26%       10    22%
6       2.6e4      5.2  42%        6.0  37%
7       1.1e5      3.7  60%        4.1  54%
8       3.5e5      3.0  74%        3.2  70%
9       1.0e6      2.6  85%        2.7  82%
10      2.5e6      2.3  91%        2.4  90%
any                                             2.2     100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
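A minimal sketch of such a leaf, again with hypothetical names:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Leaf node: just a flat, contiguous array of values. */
struct leaf {
    size_t count;
    uint32_t values[];            /* flexible array member, allocated inline */
};

/* Scan the whole leaf; for ~1000 ints this is one streamed pass over ~4 KB. */
static void leaf_query(const struct leaf *leaf, uint32_t query, int dmax)
{
    for (size_t i = 0; i < leaf->count; i++)
        if (__builtin_popcount(leaf->values[i] ^ query) <= dmax)
            printf("%08" PRIx32 "\n", leaf->values[i]);
}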