Nearest neighbors in high-dimensional data?

I asked a question a few days back on how to find the nearest neighbors for a given vector. My vector now has 21 dimensions, and before I proceed further, because I am not from the domain of Machine Learning or Math, I am beginning to ask myself some fundamental questions:

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options? In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure out this value? Previously, I was suggested to use kd-trees, but the Wikipedia page clearly says that for high dimensions, the kd-tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest neighbors in a million-point dataset efficiently?

Can someone please clarify some (or all) of the above questions?

Answer

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.

You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
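
To see that degradation concretely, here is a minimal sketch using scikit-learn (a library choice of mine, not named in the answer; the data sizes are made up). Both searches return the same exact neighbors, but in 21 dimensions a kd-tree prunes very little, so its query cost approaches brute force:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 21))  # hypothetical database: 100k points, 21-d
query = rng.standard_normal((1, 21))

# Both algorithms return identical exact neighbors; in 21 dimensions the
# kd-tree prunes few points, so its query time approaches linear search.
for algo in ("kd_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    dist, idx = nn.kneighbors(query)
    print(algo, idx[0], dist[0])
```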

Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
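
To make "locality-sensitive" concrete, here is a minimal sketch of a single such hash function for the Euclidean case, following the p-stable construction of Datar et al. [1]; the bucket width w and all names are illustrative choices, not something the answer specifies:

```python
import numpy as np

def make_lsh_function(dim, w=4.0, rng=None):
    """One hash h(v) = floor((a.v + b) / w), per Datar et al. [1].
    a is Gaussian (2-stable), so points that are close in Euclidean
    distance are likely to land in the same integer bin; w is the
    bin width (an illustrative parameter)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(dim)   # random 2-stable projection direction
    b = rng.uniform(0.0, w)        # random offset within one bin
    return lambda v: int(np.floor((a @ v + b) / w))

h = make_lsh_function(dim=21)
v = np.ones(21)
print(h(v), h(v + 0.01))  # a point and a tiny perturbation: likely the same bin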

LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
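
The whole pipeline then looks roughly like the sketch below. Practical implementations concatenate k hashes per table and keep L independent tables to trade precision against recall; the names k, L, w and all sizes here are hypothetical, and the final ranking step by exact distance is a standard refinement, not spelled out in the answer:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Toy Euclidean LSH index: L tables, each keyed by k concatenated
    p-stable hashes. k, L, and w are illustrative parameters."""

    def __init__(self, dim, k=4, L=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = [rng.standard_normal((k, dim)) for _ in range(L)]  # k directions per table
        self.offs = [rng.uniform(0.0, w, size=k) for _ in range(L)]    # k offsets per table
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = None

    def _key(self, t, v):
        # The k concatenated bin indices form the bucket key for table t.
        return tuple(np.floor((self.proj[t] @ v + self.offs[t]) / self.w).astype(int))

    def fit(self, X):
        # "Compute the hash for all points, then make a hash table from them."
        self.points = np.asarray(X)
        for i, v in enumerate(self.points):
            for t in range(len(self.tables)):
                self.tables[t][self._key(t, v)].append(i)
        return self

    def query(self, q, n_neighbors=5):
        # Collect every point that shares a bucket with q in any table...
        cand = {i for t in range(len(self.tables))
                for i in self.tables[t].get(self._key(t, q), [])}
        cand = np.fromiter(cand, dtype=int, count=len(cand))
        # ...then rank only those candidates by exact Euclidean distance.
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        order = np.argsort(d)[:n_neighbors]
        return cand[order], d[order]

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 21))            # hypothetical 10k-point database
idx, dist = LSHIndex(dim=21).fit(X).query(X[0])  # query with a database point
print(idx, dist)                                 # index 0 should appear at distance 0
```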

Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.

Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
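
For instance, the cosine-similarity variant mentioned above can be sketched with sign random projections (a standard hyperplane construction, assumed here rather than taken from the answer): each random hyperplane contributes one bit, and vectors pointing in nearly the same direction agree on most bits.

```python
import numpy as np

def hyperplane_hash(v, R):
    """Cosine-similarity LSH via sign random projections: each row of R
    is a random hyperplane normal, and each bit records which side of
    that hyperplane v lies on. Vectors at a small angle share most bits."""
    return tuple((R @ v > 0).astype(int))

rng = np.random.default_rng(0)
R = rng.standard_normal((16, 21))        # 16 random hyperplanes in 21-d
v = rng.standard_normal(21)
u = v + 0.05 * rng.standard_normal(21)   # nearly the same direction as v
print(hyperplane_hash(v, R) == hyperplane_hash(u, R))  # likely True
```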

A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].

LSH has been applied seemingly everywhere. You may want to give it a try.

[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.

[2] Weber, Schek, Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces," 1998.

[3] Gionis, Indyk, Motwani, "Similarity Search in High Dimensions via Hashing," 1999.

[4] Slaney, Casey, "Locality-Sensitive Hashing for Finding Nearest Neighbors," 2008.