Partition an array of floats into similar segments (clustering)

Tags: array, cluster, partition, segment

2023-09-11 02:23:44 Author: 问世间谁能敌我

I have an array of floats like this:

[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]

Now, I want to partition the array like this:

[[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]

// [200] will be considered an outlier because it has too little cluster support

I have to find this kind of segmentation for several arrays, and I don't know what the partition size should be. I tried hierarchical (agglomerative) clustering, and it gives satisfactory results. However, I was advised not to use clustering algorithms for a one-dimensional problem, since there is no theoretical justification for doing so (they are meant for multidimensional data).

I spent a lot of time looking for a solution. However, the suggestions differ considerably, for example: this and this vs. this and this and this.

I found another suggestion that is not clustering, namely natural breaks optimization. However, this also requires declaring the number of partitions up front, as K-means does (right?).

It is quite confusing, especially because I have to perform this kind of segmentation on several arrays and it is impossible to know the optimal number of partitions in advance.

Are there any ways to find partitions (so that the variance within partitions is reduced and the variance between partitions is maximized) with some theoretical justification?
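To make the criterion in the question concrete, here is a minimal sketch of how the within-partition and between-partition sums of squares could be computed for a candidate partition. The function names (group_mean, within_ss, between_ss) are made up for illustration and do not come from any library; the sketch only scores a given partition, it does not search for one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: arithmetic mean of a non-empty group.
double group_mean(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Sum of squared deviations of every value from the mean of its own group.
// Smaller means tighter partitions.
double within_ss(const std::vector<std::vector<double>>& groups) {
    double total = 0.0;
    for (const auto& g : groups) {
        if (g.empty()) continue;
        const double m = group_mean(g);
        for (double x : g) total += (x - m) * (x - m);
    }
    return total;
}

// Size-weighted sum of squared deviations of each group mean from the
// overall mean. Larger means better separated partitions.
double between_ss(const std::vector<std::vector<double>>& groups) {
    std::vector<double> all;
    for (const auto& g : groups) all.insert(all.end(), g.begin(), g.end());
    if (all.empty()) return 0.0;
    const double grand = group_mean(all);
    double total = 0.0;
    for (const auto& g : groups) {
        if (g.empty()) continue;
        const double d = group_mean(g) - grand;
        total += static_cast<double>(g.size()) * d * d;
    }
    return total;
}
```

For a fixed data set the two quantities always add up to the total sum of squares around the overall mean, so reducing one is the same as increasing the other. Comparing candidate partitions with the same number of groups is meaningful; comparing across different group counts is not, because the within-group value only shrinks as groups are added.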

Any pointers to articles or papers with some theoretical justification (and a C/C++/Java implementation, if available) would be very helpful.

Recommended answer

I think I'd sort the data (if it's not already sorted), then take adjacent differences. Divide each difference by the smaller of the two numbers it lies between to get a percentage change. Set a threshold, and when the change exceeds that threshold, start a new "cluster".

Edit: quick demo code in C++:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <numeric>
#include <functional>

int main() {
    std::vector<double> data{ 
        1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200 
    };

    // sort the input data
    std::sort(data.begin(), data.end());

    // find the difference between each number and its predecessor
    std::vector<double> diffs;
    std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs));

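    // convert differences to relative changes; note that this divides each
    // difference by the current (larger) value, and that diffs[0] holds
    // data[0] itself, so it becomes 1
    std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(),
        std::divides<double>());

    // print out the results
    for (std::size_t i = 0; i < data.size(); i++) {

        // if a difference exceeds 40%, start a new group
        // (skip i == 0, which has no predecessor):
        if (i > 0 && diffs[i] > 0.4)
            std::cout << "\n";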

        // print out an item:
        std::cout << data[i] << "\t";
    }

    return 0;
}

Result:

1.91    2.87    3.61
10.91   11.91   12.82
100.71  100.73  101.89
200
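The demo only prints the groups. As a rough sketch, the same idea could be wrapped in a function that returns them, with a second step that sets aside small groups as outliers, as the question wants for [200]. The names split_by_relative_gap and drop_small_groups, and the min_size parameter, are made up for illustration. This version divides by the smaller neighbor, as the answer's text describes, and assumes all values are positive. With that divisor the gap between 1.91 and 2.87 is about 50%, so a 40% threshold would split them; a threshold of 0.6 reproduces the grouping shown above for this data.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sorts a copy of the input and starts a new group
// whenever the gap to the previous value, divided by that previous (smaller)
// value, exceeds `threshold`. Assumes all values are positive.
std::vector<std::vector<double>> split_by_relative_gap(std::vector<double> data,
                                                       double threshold) {
    std::vector<std::vector<double>> groups;
    if (data.empty()) return groups;

    std::sort(data.begin(), data.end());
    groups.push_back(std::vector<double>{data[0]});

    for (std::size_t i = 1; i < data.size(); ++i) {
        const double rel = (data[i] - data[i - 1]) / data[i - 1];
        if (rel > threshold) groups.emplace_back();  // open a new group
        groups.back().push_back(data[i]);
    }
    return groups;
}

// Hypothetical helper: keeps groups with at least `min_size` members and
// appends the members of smaller groups to `outliers`.
std::vector<std::vector<double>> drop_small_groups(
        const std::vector<std::vector<double>>& groups,
        std::size_t min_size,
        std::vector<double>& outliers) {
    std::vector<std::vector<double>> kept;
    for (const auto& g : groups) {
        if (g.size() >= min_size) {
            kept.push_back(g);
        } else {
            outliers.insert(outliers.end(), g.begin(), g.end());
        }
    }
    return kept;
}
```

Calling split_by_relative_gap on the example array with a threshold of 0.6 and then drop_small_groups with min_size 2 leaves the three three-element groups and reports 200 as an outlier. Both the threshold and the minimum group size are tuning choices that depend on the data; nothing here picks them automatically.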