Two algorithms for finding nearest neighbors with Locality-Sensitive Hashing: which one?

2023-09-11 04:53:22, by 雕刻美男゛

Currently I'm studying how to find a nearest neighbor using Locality-Sensitive Hashing (LSH). However, while reading papers and searching the web, I found two algorithms for doing this:

1- Use L hash tables with L random LSH functions, thus increasing the chance that two similar documents get the same signature. For example, if two documents are 80% similar, then there's an 80% chance they will get the same signature from one LSH function. However, if we use multiple LSH functions, then there's a higher chance that the documents get the same signature from at least one of them. This method is explained on Wikipedia, and I hope my understanding is correct:

http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
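
To make this concrete, here is a minimal sketch of the scheme as I understand it, using the standard p-stable (Gaussian projection) hash for L2. All names and parameter values (L, k, w, dim) are illustrative, not taken from any particular paper:

```python
import math
import random
from collections import defaultdict

L, k, w, dim = 10, 4, 4.0, 8   # tables, hashes per table, bucket width, dimensionality

def make_hash():
    """One p-stable hash: h(v) = floor((a . v + b) / w) with Gaussian a."""
    a = [random.gauss(0.0, 1.0) for _ in range(dim)]
    b = random.uniform(0.0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def make_g():
    """A composite function g = (h_1, ..., h_k) keying one hash table."""
    hs = [make_hash() for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)

gs = [make_g() for _ in range(L)]
tables = [defaultdict(list) for _ in range(L)]

def insert(idx, v):
    """Store point idx in the bucket g(v) of every table."""
    for g, table in zip(gs, tables):
        table[g(v)].append(idx)

def candidates(q):
    """Union of all points colliding with q in any of the L tables."""
    out = set()
    for g, table in zip(gs, tables):
        out.update(table[g(q)])
    return out
```

Each point goes into all L tables, and a query only examines the points it collides with, so more tables means more chances to catch a true neighbor.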

2- The other algorithm uses a method from a paper (section 5) called "Similarity Estimation Techniques from Rounding Algorithms" by Moses S. Charikar. It's based on using one LSH function to generate the signature, then applying P permutations to it and sorting the resulting lists. Actually, I don't understand this method very well, and I hope someone can clarify it.

My main question is: why would anyone use the second method rather than the first? The first seems easier and faster to me.

I really hope someone can help!

Edit: Actually, I'm not sure if @Raff.Edward is mixing up the "first" and the "second" method. Only the second method uses a radius; the first just uses a new hash family g composed of functions from the hash family F. Please check the Wikipedia link. They use many g functions to generate different signatures, and each g function has a corresponding hash table. To find the nearest neighbor of a point, you just run the point through the g functions and check the corresponding hash tables for collisions. That's how I understood it: more functions means more chances for a collision.

I didn't find any mention of a radius for the first method.

For the second method, they generate only one signature for each feature vector and then apply P permutations to it. Now we have P lists of permuted signatures, each containing n signatures, and each of the P lists is sorted. Given a query point q, they generate its signature, apply the P permutations to it, and then use binary search on each permuted and sorted list to find the signatures most similar to q. I concluded this after reading many papers about it, but I still don't understand why anyone would use such a method, because it doesn't seem fast for finding the Hamming distance!
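Here is a minimal sketch of that procedure as I read it; the parameters (b, P, K) and all names are illustrative, and signatures are packed into plain integers:

```python
import random
from bisect import bisect_left

b = 32   # signature length in bits (illustrative)
P = 8    # number of random permutations (illustrative)
K = 4    # entries to examine on each side of the binary-search position

def permute_bits(sig, perm):
    """Rearrange the bits of an integer signature according to perm."""
    out = 0
    for new_pos, old_pos in enumerate(perm):
        out |= ((sig >> old_pos) & 1) << new_pos
    return out

def build_index(signatures):
    """For each permutation, store (permuted signature, original index) sorted."""
    perms = [random.sample(range(b), b) for _ in range(P)]
    tables = [sorted((permute_bits(s, perm), i) for i, s in enumerate(signatures))
              for perm in perms]
    return perms, tables

def query(q_sig, signatures, perms, tables):
    """Binary-search each sorted list, then check nearby entries by Hamming distance."""
    best, best_dist = None, b + 1
    for perm, entries in zip(perms, tables):
        pos = bisect_left(entries, (permute_bits(q_sig, perm), -1))
        for _, idx in entries[max(0, pos - K): pos + K]:
            d = bin(q_sig ^ signatures[idx]).count("1")
            if d < best_dist:
                best, best_dist = idx, d
    return best, best_dist
```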

For me, I would simply do the following to find the nearest neighbor of a query point q. Given a list of N signatures, I would generate the signature for q, then scan the list and compute the Hamming distance between each element and q's signature. That gives me the nearest neighbor of q, and it takes O(N)!
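In code, the brute-force version I have in mind is just the following (signatures again packed into integers):

```python
def linear_scan(q_sig, signatures):
    """Return the index of the signature closest to q_sig in Hamming distance."""
    return min(range(len(signatures)),
               key=lambda i: bin(q_sig ^ signatures[i]).count("1"))
```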

Recommended answer

Your understanding of the first one is a little off. The probability of a collision is not proportional to the similarity; what matters is whether the distance is less than the pre-defined radius. The goal is that anything within the radius has a high chance of colliding, and anything outside radius * (1 + eps) has a low chance of colliding (the region in between is a little murky).
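For reference, the usual textbook formulation of this guarantee (not from this answer, but consistent with it) is that a hash family H is (R, cR, p1, p2)-sensitive if Pr[h(p) = h(q)] >= p1 whenever d(p, q) <= R, and Pr[h(p) = h(q)] <= p2 whenever d(p, q) >= cR, with p1 > p2. The band between R and cR is exactly the murky region just mentioned.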

The first algorithm is actually fairly difficult to implement well, but it can get good results. In particular, the first algorithm is for the L1 and L2 metrics (and technically a few more).

The second algorithm is very simple to implement, though a naive implementation may use too much memory to be useful, depending on your problem size. In this case, the probability of collision is proportional to the similarity of the inputs. However, it only works for cosine similarity (or distance metrics based on a transform of that similarity).
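Concretely, the known result for the random-hyperplane family in Charikar's paper is Pr[h(x) = h(y)] = 1 - theta(x, y)/pi, where theta is the angle between x and y, which is why the collision probability tracks the cosine similarity directly.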

So which one you would use depends primarily on which distance metric you are using for nearest neighbor search (or whatever other application).

The second one is actually much easier to understand and implement than the first; the paper is just very wordy.

The short version: take a random vector V and give each index an independent random unit-normal value. Create as many such vectors as you want the signature length to be. The signature is the sign of each index when you do a matrix-vector product. Now the Hamming distance between any two signatures is related to the cosine similarity between the respective data points.
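A minimal sketch of that construction, with illustrative names (b random hyperplanes of dimension dim; the signature is packed into one integer rather than an int array):

```python
import random

def make_planes(b, dim):
    """b random hyperplanes: each entry drawn from an independent unit normal."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(b)]

def signature(v, planes):
    """Pack the sign of each projection into the bits of one integer."""
    sig = 0
    for i, plane in enumerate(planes):
        if sum(p * x for p, x in zip(plane, v)) >= 0:
            sig |= 1 << i
    return sig
```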

Because you can encode the signature into an int array and use XOR with a bit-count instruction to compute the Hamming distance very quickly, you can get approximate cosine similarity scores very quickly.
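A minimal sketch of that distance computation; b is the signature length from the sketch above, and the angle estimate (theta is approximately pi * d / b) follows from the collision probability given earlier:

```python
import math

def hamming(sig_a, sig_b):
    """XOR leaves a 1 bit wherever the signatures disagree; count those bits."""
    return bin(sig_a ^ sig_b).count("1")   # or (sig_a ^ sig_b).bit_count() on Python 3.10+

def approx_cosine(sig_a, sig_b, b):
    """Estimate the angle as pi * d / b, then take its cosine."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / b)
```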

LSH algorithms don't have a lot of standardization, and the two papers (and others) use different definitions, so it's all a bit confusing at times. I only recently implemented both of these algorithms in JSAT, and I'm still working on fully understanding them both.

Edit: In response to your edit. The Wikipedia article is not great for LSH. If you read the original paper (http://people.csail.mit.edu/indyk/nips-nn.ps) ...