Two algorithms for finding nearest neighbors with Locality-Sensitive Hashing: which one?

2023-09-11 04:53:22, by 雕刻美男゛

Currently I'm studying how to find a nearest neighbor using Locality-Sensitive Hashing (LSH). However, while reading papers and searching the web, I found two algorithms for doing this:

1- Use L hash tables with L random LSH functions, thus increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance they will get the same signature from one LSH function. However, if we use multiple LSH functions, then there's a higher chance that the documents get the same signature from at least one of them. This method is explained on Wikipedia, and I hope my understanding is correct:

http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
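
To make this concrete, here is a minimal sketch of the scheme as I understand it, using the standard p-stable (Gaussian projection) hash for L2. All names and parameter values (L, k, w, dim) are illustrative, not taken from any particular paper:

```python
import math
import random
from collections import defaultdict

L, k, w, dim = 10, 4, 4.0, 8   # tables, hashes per table, bucket width, dimensionality

def make_hash():
    """One p-stable hash: h(v) = floor((a . v + b) / w) with Gaussian a."""
    a = [random.gauss(0.0, 1.0) for _ in range(dim)]
    b = random.uniform(0.0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def make_g():
    """A composite function g = (h_1, ..., h_k) keying one hash table."""
    hs = [make_hash() for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)

gs = [make_g() for _ in range(L)]
tables = [defaultdict(list) for _ in range(L)]

def insert(idx, v):
    """Store point idx in the bucket g(v) of every table."""
    for g, table in zip(gs, tables):
        table[g(v)].append(idx)

def candidates(q):
    """Union of all points colliding with q in any of the L tables."""
    out = set()
    for g, table in zip(gs, tables):
        out.update(table[g(q)])
    return out
```

Each point goes into all L tables, and a query only examines the points it collides with, so more tables means more chances to catch a true neighbor.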

2- The other algorithm uses a method from a paper (section 5) called "Similarity Estimation Techniques from Rounding Algorithms" by Moses S. Charikar. It's based on using one LSH function to generate the signature, then applying P permutations to it and sorting the resulting lists. Actually, I don't understand this method very well, and I hope someone can clarify it.

My main question is: why would anyone use the second method rather than the first? The first seems easier and faster to me.

I really hope someone can help!

Edit: Actually, I'm not sure if @Raff.Edward is mixing up the "first" and the "second" method. Only the second method uses a radius; the first just uses a new hash family g composed of functions from the hash family F. Please check the Wikipedia link. They use many g functions to generate different signatures, and each g function has a corresponding hash table. To find the nearest neighbor of a point, you just run the point through the g functions and check the corresponding hash tables for collisions. That's how I understood it: more functions means more chances for a collision.

I didn't find any mention of a radius for the first method.

For the second method, they generate only one signature for each feature vector and then apply P permutations to it. Now we have P lists of permuted signatures, each containing n signatures, and each of the P lists is sorted. Given a query point q, they generate its signature, apply the P permutations to it, and then use binary search on each permuted and sorted list to find the signatures most similar to q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast for finding the Hamming distance!
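Here is a minimal sketch of that procedure as I read it; the parameters (b, P, K) and all names are illustrative, and signatures are packed into plain integers:

```python
import random
from bisect import bisect_left

b = 32   # signature length in bits (illustrative)
P = 8    # number of random permutations (illustrative)
K = 4    # entries to examine on each side of the binary-search position

def permute_bits(sig, perm):
    """Rearrange the bits of an integer signature according to perm."""
    out = 0
    for new_pos, old_pos in enumerate(perm):
        out |= ((sig >> old_pos) & 1) << new_pos
    return out

def build_index(signatures):
    """For each permutation, store (permuted signature, original index) sorted."""
    perms = [random.sample(range(b), b) for _ in range(P)]
    tables = [sorted((permute_bits(s, perm), i) for i, s in enumerate(signatures))
              for perm in perms]
    return perms, tables

def query(q_sig, signatures, perms, tables):
    """Binary-search each sorted list, then check nearby entries by Hamming distance."""
    best, best_dist = None, b + 1
    for perm, entries in zip(perms, tables):
        pos = bisect_left(entries, (permute_bits(q_sig, perm), -1))
        for _, idx in entries[max(0, pos - K): pos + K]:
            d = bin(q_sig ^ signatures[idx]).count("1")
            if d < best_dist:
                best, best_dist = idx, d
    return best, best_dist
```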

For me, I would simply do the following to find the nearest neighbor of a query point q. Given a list of N signatures, I would generate the signature for q, then scan the list and compute the Hamming distance between each element and q's signature. That gives me the nearest neighbor of q, and it takes O(N)!
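In code, the brute-force version I have in mind is just the following (signatures again packed into integers):

```python
def linear_scan(q_sig, signatures):
    """Return the index of the signature closest to q_sig in Hamming distance."""
    return min(range(len(signatures)),
               key=lambda i: bin(q_sig ^ signatures[i]).count("1"))
```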

Recommended answer

Your understanding of the first one is a little off. The probability of a collision is not proportional to the similarity; what matters is whether the distance is less than the pre-defined radius. The goal is that anything within the radius has a high chance of colliding, and anything outside radius * (1 + eps) has a low chance of colliding (the region in between is a little murky).
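For reference, the usual textbook formulation of this guarantee (not from this answer, but consistent with it) is that a hash family H is (R, cR, p1, p2)-sensitive if Pr[h(p) = h(q)] >= p1 whenever d(p, q) <= R, and Pr[h(p) = h(q)] <= p2 whenever d(p, q) >= cR, with p1 > p2. The band between R and cR is exactly the murky region just mentioned.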

The first algorithm is actually fairly difficult to implement well, but it can get good results. In particular, the first algorithm is for the L1 and L2 metrics (and technically a few more).

The second algorithm is very simple to implement, though a naive implementation may use too much memory to be useful, depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for cosine similarity (or distance metrics based on a transform of that similarity).
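Concretely, the known result for the random-hyperplane family in Charikar's paper is Pr[h(x) = h(y)] = 1 - theta(x, y)/pi, where theta is the angle between x and y, which is why the collision probability tracks the cosine similarity directly.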

So which one you would use depends primarily on which distance metric you are using for nearest neighbor search (or whatever other application).

The second one is actually much easier to understand and implement than the first; the paper is just very wordy.

The short version: take a random vector V and give each index an independent random unit-normal value. Create as many such vectors as you want the signature length to be. The signature is the sign of each index when you do a matrix-vector product. Now the Hamming distance between any two signatures is related to the cosine similarity between the respective data points.
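A minimal sketch of that construction, with illustrative names (b random hyperplanes of dimension dim; the signature is packed into one integer rather than an int array):

```python
import random

def make_planes(b, dim):
    """b random hyperplanes: each entry drawn from an independent unit normal."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(b)]

def signature(v, planes):
    """Pack the sign of each projection into the bits of one integer."""
    sig = 0
    for i, plane in enumerate(planes):
        if sum(p * x for p, x in zip(plane, v)) >= 0:
            sig |= 1 << i
    return sig
```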

Because you can encode the signature into an int array and use XOR with a bit-count instruction to compute the Hamming distance very quickly, you can get approximate cosine similarity scores very quickly.
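A minimal sketch of that distance computation; b is the signature length from the sketch above, and the angle estimate (theta is approximately pi * d / b) follows from the collision probability given earlier:

```python
import math

def hamming(sig_a, sig_b):
    """XOR leaves a 1 bit wherever the signatures disagree; count those bits."""
    return bin(sig_a ^ sig_b).count("1")   # or (sig_a ^ sig_b).bit_count() on Python 3.10+

def approx_cosine(sig_a, sig_b, b):
    """Estimate the angle as pi * d / b, then take its cosine."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / b)
```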

LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and I'm still working on fully understanding them both.

Edit: In response to your edit. The Wikipedia article is not great for LSH. If you read the original paper (http://people.csail.mit.edu/indyk/nips-nn.ps) ...