DBSCAN - setting an upper limit on the maximum cluster span

2023-09-11 04:14:01 Author: 孤独是毒

By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and, because DBSCAN takes into account density-reachability and not direct density-reachability when finding clusters, end up with a cluster in which the maximum distance between any two points is > 100 meters. In a more extreme case, it seems possible that you could set an epsilon of 100 meters and end up with a cluster spanning 1 kilometer: see [2][6] in this array of images from scikit-learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am misunderstanding DBSCAN, if that's what's happening here.)
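For example, here is a minimal sketch using scikit-learn (the coordinates, eps, and min_samples below are illustrative assumptions) showing that a chain of points can end up in a single DBSCAN cluster whose span is far larger than eps:

```python
# Illustrative sketch: a chain of points spaced 80 m apart with eps = 100 m
# ends up in one DBSCAN cluster spanning ~800 m, because cluster membership
# only requires density-reachability, not pairwise proximity.
import numpy as np
from sklearn.cluster import DBSCAN

# 11 points on a line, 80 m apart -> total span of 800 m
points = np.array([[80.0 * i, 0.0] for i in range(11)])

labels = DBSCAN(eps=100.0, min_samples=2).fit_predict(points)
print(labels)                                    # one shared cluster label, e.g. [0 0 ... 0]
print(points[:, 0].max() - points[:, 0].min())   # 800.0 -> far larger than eps
```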

Is there an algorithm that is density-based like DBSCAN but takes into account some kind of threshold on the maximum distance between any two points in a cluster?

Recommended answer

DBSCAN indeed does not impose a total size constraint on the cluster.

The epsilon value is best interpreted as the size of the gap separating two clusters (a gap that may contain at most minpts-1 objects).

I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. The structure can be simple (such as the partitions found by k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and DBSCAN).

You might instead be looking for vector quantization (reducing a data set to a smaller set of representatives) or set cover (finding the optimal cover for a given set).
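As an illustration of the vector-quantization/set-cover idea, here is a hedged sketch of a greedy, leader-style pass (the function name leader_quantize and the radius parameter are hypothetical, not a library API): every point is assigned to a representative within radius, so no group's diameter exceeds 2*radius. Like canopy preclustering discussed below, the result depends on the order of the points.

```python
# Hedged sketch of greedy, leader-style vector quantization.
# Guarantee: every point lies within `radius` of its representative,
# hence each group's diameter is at most 2 * radius.
import numpy as np

def leader_quantize(points: np.ndarray, radius: float):
    """One greedy, order-dependent pass over the data."""
    reps = []                                     # chosen representatives
    assignment = np.full(len(points), -1, dtype=int)
    for i, p in enumerate(points):
        for j, r in enumerate(reps):
            if np.linalg.norm(p - r) <= radius:   # reuse an existing representative
                assignment[i] = j
                break
        else:                                     # nothing close enough: open a new group
            reps.append(p)
            assignment[i] = len(reps) - 1
    return np.array(reps), assignment

pts = np.random.RandomState(0).rand(200, 2) * 1000.0   # toy coordinates in "meters"
reps, labels = leader_quantize(pts, radius=100.0)
print(len(reps), labels[:10])
```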

However, I also have the impression that you aren't really sure what you need and why.

A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an efficient algorithm for discovering this structure.

Direct density-reachability, however, does not define a useful (partitioning) structure: it simply does not partition the data into disjoint partitions.

If you don't need this kind of strong structure (i.e. you aren't doing clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way but depends heavily on the ordering of your data. If you then preprocess it appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering is, however, at least as expensive as a full DBSCAN run. Due to the loose requirements (in particular, "clusters" may overlap, and objects are expected to belong to multiple "clusters"), it is easier to parallelize.
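As a rough illustration of the canopy idea described above (not a scikit-learn API; the two thresholds t1 > t2, the helper name, and the toy data are assumptions), a single canopy pass might look like this:

```python
# Hedged sketch of canopy preclustering: two thresholds t1 (loose) > t2 (tight).
# Canopies may overlap, and the result depends on the ordering of the data.
import numpy as np

def canopy(points: np.ndarray, t1: float, t2: float):
    assert t1 > t2, "loose threshold t1 must exceed tight threshold t2"
    remaining = list(range(len(points)))
    canopies = []                                  # each canopy is a list of point indices
    while remaining:
        center = remaining[0]                      # pick an arbitrary remaining point
        dists = np.linalg.norm(points[remaining] - points[center], axis=1)
        canopies.append([i for i, d in zip(remaining, dists) if d < t1])
        # points within the tight threshold are removed from further consideration
        remaining = [i for i, d in zip(remaining, dists) if d >= t2]
    return canopies

pts = np.random.RandomState(1).rand(300, 2) * 1000.0   # toy coordinates in "meters"
print([len(c) for c in canopy(pts, t1=200.0, t2=100.0)])
```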

Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance between any two objects. The only problem is that hierarchical clustering is usually O(n^3), i.e. it doesn't scale to large data sets. DBSCAN runs in O(n log n) in good implementations (with index support).
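For example, a minimal sketch with SciPy (the 100 m cut height and the toy data are assumptions) that cuts a complete-linkage dendrogram so that no cluster contains two points farther apart than the threshold:

```python
# Complete-linkage clustering cut at 100 m: within each resulting cluster, the
# maximum pairwise distance is at most 100 m. Note the O(n^2) distance matrix
# and roughly O(n^3) time, so this does not scale to very large data sets.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

pts = np.random.RandomState(2).rand(500, 2) * 1000.0    # toy coordinates in "meters"

Z = linkage(pdist(pts), method="complete")               # complete-linkage merges
labels = fcluster(Z, t=100.0, criterion="distance")      # cut the dendrogram at 100 m

# sanity check of the guarantee: no intra-cluster pair is farther apart than 100 m
for lab in np.unique(labels):
    members = pts[labels == lab]
    if len(members) > 1:
        assert pdist(members).max() <= 100.0
print(labels.max(), "clusters")
```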