如何有效地找到k最近邻居高维数据?有效地、邻居、最近、数据

2023-09-11 03:20:10 作者:渲染一世の傷

所以我约16000 75维的数据点,并为每一个点,我想找到它的k个最近的邻居(使用欧氏距离​​,目前K = 2,如果这使得它easiser)

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using euclidean distance, currently k=2 if this makes it easiser)

我的第一个念头是使用kd树这一点,但事实证明,他们变得​​相当低效的,因为尺寸的数量增长。在我的示例实现,它只是略微快于穷举搜索。

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimension grows. In my sample implementation, its only slightly faster than exhaustive search.

我的下一个想法是使用PCA(主成分分析法),以减少维数,但我想知道:是否有一些聪明的算法和数据结构来解决这个问题正是在合理的时间

My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?

推荐答案

有关KD树维基百科的文章有一个链接到的安库:

The Wikipedia article for kd-trees has a link to the ANN library:

ANN是用C ++编写,一个图书馆,   支持的数据结构和   为精确算法和   近似​​最邻近搜索   在任意高的尺寸。

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

  执行非常有效的点   台大小不等,从几千到   几十万,而在   尺寸高达20 。 (对于显著更高的应用   尺寸,结果是相当   参差不齐,但你可以试试也无妨。)

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

至于算法/数据结构而言:

As far as algorithm/data structures are concerned:

该库实现了一些   不同的数据结构,是根据   KD树和中分解树的,   并采用两个不同的   检索策略。

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

我想尝试它第一次直接,如果没有产生令人满意的结果我想使用它与应用PCA / ICA(因为它是不太可能你将要结束了几家有足够的尺寸为KD后的数据集 - 树处理)。

I'd try it first directly and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely your going to end up with few enough dimensions for a kd-tree to handle).