Finding the single nearest neighbor in O(1) using a prefix tree?

2023-09-11 05:03:50 Author: 暴富解忧

I'm reading a paper where they mention that they were able to find the single nearest neighbor in O(1) using a prefix tree. I will describe the general problem, then the classical solution, and finally the solution proposed in the paper:

Problem: given a list of bit vectors L (all vectors have the same length) and a query bit vector q, we would like to find the nearest neighbor of q. The distance metric is the Hamming distance (how many bits are different). The naive approach would be to go through the list and calculate the Hamming distance between q and each vector in the list, which takes O(N). However, given that we will have millions of bit vectors, that is very expensive, so we would like to reduce it.
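
For concreteness, here is a minimal sketch of that naive O(N) scan (assuming the bit vectors are stored as Python integers of equal bit length; the names are my own, not from the paper):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance: number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")


def nearest_neighbor_naive(L: list[int], q: int) -> int:
    """O(N) scan: compare q against every vector in the list."""
    return min(L, key=lambda v: hamming(v, q))
```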

Classical solution: the classical solution to this problem is to use an approximation to find the nearest neighbor, achieving O(log N). The way to do this is to first sort L lexicographically so that similar bit vectors end up close to each other. Then, given q, we apply binary search on the sorted list to find the position where q could be, take the vectors just above and below that position in the list (since the sorting puts similar vectors together), calculate the distance between them and q, and pick the one with the lowest Hamming distance. However, with just one sorting we will still miss many similar vectors, so to cover as many similar vectors as possible we use P lists and P jumbling functions, one jumbling function per list. We then insert each bit vector into each of the P lists after jumbling its bits with the corresponding jumbling function. So we end up with P lists, each containing the bit vectors but with the bits in a different order, and we again sort each of the P lists lexicographically. Now, given q, we apply the same binary search to each of the P lists, but first we apply that list's jumbling function to q. In this step we get P candidate vectors similar to q, from which we finally pick the one most similar to q. This way we cover as many of the similar vectors as we can. Ignoring the time required for sorting, the time needed to locate the nearest neighbor is O(log N), which is the time for the binary search on each list.
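
A rough sketch of this classical scheme, under the assumption that the "jumbling functions" are random bit permutations and the vectors are small Python integers. The window of entries inspected around the binary-search position is a tunable parameter, and every name here is illustrative rather than taken from the paper:

```python
import random
from bisect import bisect_left


def make_permutations(P: int, num_bits: int, seed: int = 0) -> list[list[int]]:
    """One random bit permutation ("jumbling function") per list."""
    rng = random.Random(seed)
    return [rng.sample(range(num_bits), num_bits) for _ in range(P)]


def permute(v: int, perm: list[int]) -> int:
    """Reorder the bits of v: bit perm[i] of v becomes bit i of the result."""
    out = 0
    for new_pos, old_pos in enumerate(perm):
        if (v >> old_pos) & 1:
            out |= 1 << new_pos
    return out


def build_index(L: list[int], perms: list[list[int]]):
    """P lists, each sorted by the permuted vectors (kept alongside the originals)."""
    return [sorted((permute(v, p), v) for v in L) for p in perms]


def query(index, perms, q: int, window: int = 4) -> int:
    """Binary-search each sorted list around the permuted q, inspect nearby
    entries, and return the candidate with the smallest Hamming distance to q."""
    best, best_dist = None, float("inf")
    for sorted_list, perm in zip(index, perms):
        pq = permute(q, perm)
        i = bisect_left(sorted_list, (pq, -1))
        for _, v in sorted_list[max(0, i - window): i + window]:
            d = bin(v ^ q).count("1")
            if d < best_dist:
                best, best_dist = v, d
    return best


# Example usage with 8-bit vectors:
# perms = make_permutations(P=4, num_bits=8)
# index = build_index([0b10110010, 0b10110011, 0b01001100], perms)
# query(index, perms, q=0b10110000)
```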

Proposed solution: the solution proposed in the paper (but without any explanation) says that we can get the nearest neighbor in O(1) time using prefix trees. In the paper they say that they use P prefix trees and P jumbling functions, where each jumbling function corresponds to one tree. They then insert the bit vectors into each tree after jumbling the bits of each vector with the corresponding jumbling function. Given q, we apply the jumbling function corresponding to each tree to q and retrieve the most similar vector to q from that tree. We thus end up with P bit vectors retrieved from the trees. In the paper they say that just getting the most similar vector to q from a prefix tree is O(1). I really don't understand this at all, as I know searching a prefix tree is O(M), where M is the length of the bit vector. Does anybody understand why it is O(1)?
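
The paper does not spell out how the per-tree lookup works, so the following is only my reading of Section 3.3.2: a greedy descent that walks the trie one level per bit, following the branch that matches the query bit when it exists and otherwise taking whichever branch is available. Note that it visits M levels for M-bit vectors, which is exactly why the lookup looks like O(M) rather than O(1) to me. All names are hypothetical:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a bit character, "0" or "1", to a child node
        self.vector = None   # the stored bit string, set at the final level


def insert(root: TrieNode, bits: str) -> None:
    """Insert a bit string (e.g. "10110") into the prefix tree."""
    node = root
    for b in bits:
        node = node.children.setdefault(b, TrieNode())
    node.vector = bits


def greedy_nearest(root: TrieNode, bits: str) -> str:
    """Descend one level per query bit, preferring the branch that matches it."""
    node = root
    for b in bits:
        if b in node.children:
            node = node.children[b]
        else:
            node = next(iter(node.children.values()))  # mismatch: take the branch that exists
    return node.vector
```

Note that this greedy walk is not guaranteed to return the closest vector stored in the tree; an early mismatch can send it down the wrong subtree, which is presumably why the paper uses P jumbled trees rather than one.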

This is the paper I'm referring to (Section 3.3.2): Content-Based Crowd Retrieval on the Real-Time Web

http://students.cse.tamu.edu/kykamath/papers/cikm2012/fp105-kamath.pdf

I would also appreciate it if you could answer my other question related to this one: How to look up the most similar bit vector in a prefix tree for NN-search?

Recommended Answer

This is O(1) only in a very loosely defined sense. In fact I would go so far as to challenge their usage in this case.

From their paper, "To determine the nearest neighbor to a user, u":

- "We first calculate its signature, u": this can be O(1), depending on the "signature".
- "Then for every prefix tree in P": uh oh, not sounding very O(1); O(P) would be more correct.
- The iterative part of step 2, "... we find the nearest signature in the prefix tree, by iterating through the tree one level at a time ...": best case O(d), where d is the depth of the tree, i.e. the length of the word. (This is generous, as finding the nearest point in a prefix tree can take more than this.)
- "After doing this ... we end up with |P| signatures ... of which the smallest hamming distance is picked": so another P distance calculations, each over the length of the word, i.e. O(Pd).

More correctly, the total runtime is O(1) + O(P) + O(Pd) + O(Pd) = O(Pd).
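
To make that tally concrete, here is a small back-of-the-envelope helper (my own illustration, not code from the paper) that counts the per-query operations implied by the steps above:

```python
def query_operations(P: int, d: int) -> int:
    """Rough operation count for one query over P prefix trees of depth d."""
    signature = 1        # step 1: computing the query signature, taken as O(1)
    descents = P * d     # step 2: one d-level descent in each of the P trees
    final_pick = P * d   # step 3: P Hamming distances over d-bit signatures
    return signature + descents + final_pick   # grows as O(P*d), not O(1)


# e.g. query_operations(P=10, d=64) -> 1281 operations, dominated by the P*d terms
```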

I believe that @mcdowella is correct in his analysis of how they try to make this O(1), but from what I've read they haven't convinced me.