找到对位串用的最多的一套共同的位最多

2023-09-11 02:39:14 作者：じ小女子、难养也、

我想找到一个算法来找到对位串中具有数量最多的公共集位（阵列中所有对之间）的数组。我知道这是可能通过比较所有对阵列中的位串要做到这一点，但这是O（ N 的 2 ）。有没有更有效的算法？理想情况下，我想的算法在每次迭代处理一个进来的比特串增量工作。

例如，假设我们有位串（长度为8）的数组：

  B1：01010001
B2：01101010
B3：01101010
B4：11001010
B5：00110001

这里最好的一对是 B2 和 B3 ，其中有四组常用位。

我发现，似乎说明这样的算法（S.泰勒和放纸; T.德拉蒙德（2011年）;二进制直方图化强的修补高效，稳定匹配的诠释J. COMPUT可见的94：241-265），但我不明白这一点从252页的说明：

这可以在每次迭代中作为唯一的[位串]进行增量更新重叠的需要重新计算是指那些为新父特征和任何其他[位串]根的最重叠的特点入选两个中的一个组合。这避免了需要的O（N 2 ）重叠在每次迭代比较，并允许一个森林700特征的典型大小的数据库中进行下面的第二内置的。

解决方案

据我所知，泰勒和放大器;德拉蒙德（2011）无意给一个O（ N 的）算法的查找的一对位串中拥有最多的一套共同位数的数组。他们勾画一个参数，最好这种对的记录可的更新的在O（ N 的）后，一个新的比特串已被添加到阵列（和两位老位串删除）。

当然，算法的252页上的解释不是很清楚，我认为他们的素描论点，即记录可以在O更新（ N 的）是不完整的，在最好的，所以我能看到为什么你感到困惑。

总之，这里是从纸张解释算法1我最好的尝试。

算法

该算法采用位串的数组，并构建了一个的查找树的。的查找树是一个二进制林（组二进制树），其叶子是原始位串从阵列，其内部节点是新的位串，并且其中如果节点A是节点B的父，那么A和B = A（即，在A的所有组的位也被设置在B）的

例如，如果输入是位串的这个阵列

那么输出为查找树：

如在纸张前进描述如下的算法：

让研究的是初始设置位串（根设置的）。

的

对于每个位串的 F1 的中的研究的，有没有合作伙伴的研究的，发现并记录它的合作伙伴（位串的 F2 的中的研究的 - { F1 的}，它拥有最多的公共集位与的 F1 ），并记录比特它们具有共同的数目

如果没有对位串中的研究的任何公共集位，停止。

让 F1 和 F2 的是对位串中的研究的数量最多组公共位的。

让 P = F1 的＆放大器; F2 的是 F1 的的的父和 F2 的。

删除 F1 和 F2 的距离的研究的;添加的 P 到研究的。

转到步骤2。

分析

假设数组包含的 N 的固定长度的位串。然后，当所描述的算法是O（ N 的 3 ），因为第2步是O（ N 的 2 ），并有O（ N 的）的迭代，因为每次迭代，我们删除两个位串从研究的再加一。

该文件包含了第2步是Ω一个参数（ N 的 2 ）只在第一时间周围循环，而对其他的迭代是O（ N 的），因为我们只需要找到的 P ，并在其他位串的研究的，其合作伙伴是选择组合两个中的一个。然而，这种说法没有说服力对我说：这是不明确的，目前只有O（1）其他类似位串。（也许有一个更好的理由？）

我们可以把算法向下到O（ N 的 2 ）通过存储每一对位串之间共同组的比特数。这需要O（ N 的 2 ）额外的空间。

参考

秒。泰勒和放大器; T.德拉蒙德（2011年）。二进制直方图化强的修补高效，稳定匹配。诠释。 J. COMPUT。。可见的94：241-265

I want to find an algorithm to find the pair of bitstrings in an array that have the largest number of common set bits (among all pairs in the array). I know it is possible to do this by comparing all pairs of bitstrings in the array, but this is O(n2). Is there a more efficient algorithm? Ideally, I would like the algorithm to work incrementally by processing one incoming bitstring in each iteration.

For example, suppose we have this array of bitstrings (of length 8):

B1:01010001 
B2:01101010
B3:01101010
B4:11001010
B5:00110001

The best pair here is B2 and B3, which have four common set bits.

I found a paper that appears to describe such an algorithm (S. Taylor & T. Drummond (2011); "Binary Histogrammed Intensity Patches for Efficient and Robust Matching"; Int. J. Comput. Vis. 94:241–265), but I don't understand this description from page 252:

This can be incrementally updated in each iteration as the only [bitstring] overlaps that need recomputing are those for the new parent feature and any other [bitstrings] in the root whose "most overlapping feature" was one of the two selected for combination. This avoids the need for the O(N2) overlap comparison in every iteration and allows a forest for a typically-sized database of 700 features to be built in under a second.

解决方案

As far as I can tell, Taylor & Drummond (2011) do not purport to give an O(n) algorithm for finding the pair of bitstrings in an array with the largest number of common set bits. They sketch an argument that a record of the best such pairs can be updated in O(n) after a new bitstring has been added to the array (and two old bitstrings removed).

Certainly the explanation of the algorithm on page 252 is not very clear, and I think their sketch argument that the record can be updated in O(n) is incomplete at best, so I can see why you are confused.

Anyway, here's my best attempt to explain Algorithm 1 from the paper.

Algorithm

The algorithm takes an array of bitstrings and constructs a lookup tree. A lookup tree is a binary forest (set of binary trees) whose leaves are the original bitstrings from the array, whose internal nodes are new bitstrings, and where if node A is a parent of node B, then A & B = A (that is, all the set bits in A are also set in B).

For example, if the input is this array of bitstrings:

then the output is the lookup tree:

The algorithm as described in the paper proceeds as follows:

Let R be the initial set of bitstrings (the root set).

For each bitstring f1 in R that has no partner in R, find and record its partner (the bitstring f2 in R − {f1} which has the largest number of set bits in common with f1) and record the number of bits they have in common.

If there is no pair of bitstrings in R with any common set bits, stop.

Let f1 and f2 be the pair of bitstrings in R with the largest number of common set bits.

Let p = f1 & f2 be the parent of f1 and f2.

Remove f1 and f2 from R; add p to R.

Go to step 2.

Analysis

Suppose that the array contains n bitstrings of fixed length. Then the algorithm as described is O(n3) because step 2 is O(n2), and there are O(n) iterations, because at each iteration we remove two bitstrings from R and add one.

The paper contains an argument that step 2 is Ω(n2) only on the first time around the loop, and on other iterations it is O(n) because we only have to find the partner of p "and any other bitstrings in R whose partner was one of the two selected for combination." However, this argument is not convincing to me: it is not clear that there are only O(1) other such bitstrings. (Maybe there's a better argument?)

We could bring the algorithm down to O(n2) by storing the number of common set bits between every pair of bitstrings. This requires O(n2) extra space.