Fast algorithm for computing percentiles to remove outliers

2023-09-11 02:08:19 Author: 純情⑦鈅


I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.

More information:

The data set contains on the order of up to 100,000 floating point numbers, and is assumed to be "reasonably" distributed - there are unlikely to be duplicates, nor huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate, since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate. An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid. Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%. The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine. An entirely different way of removing outliers would be fine too, as long as it's reasonable.

Update: It seems I'm looking for an approximate selection algorithm.


Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a data structure as was done for this question.
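For concreteness, the sort-based baseline described above can be sketched in C++; the function name and the 5%/95% cutoffs are illustrative, not from the original program:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Baseline: sort a copy of the data, then pick the elements at the
// 5% and 95% ranks. O(n log n) per call - this sort is the blip
// that shows up in the profile.
std::pair<double, double> percentileBoundsBySort(std::vector<double> data) {
    std::sort(data.begin(), data.end());
    std::size_t lo = static_cast<std::size_t>(0.05 * data.size());
    std::size_t hi = static_cast<std::size_t>(0.95 * data.size());
    return {data[lo], data[hi]};
}
```

Anything below the first value or above the second is then treated as an outlier and dropped before further processing.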


Using the Wikipedia selection algorithm, as suggested by Gronim, reduced this part of the run-time by about a factor of 20.


Since I couldn't find a C# implementation, here's what I came up with. It's faster than Array.Sort even for small inputs; at 1000 elements it's 25 times faster.

public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
//When this returns, list[i] <= list[k] iff i <= k (for startI <= i < endI).
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = startI + (endI - startI) / 2; //arbitrary, but good if sorted; written this way to avoid int overflow
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //k == splitI
            return list[k];
    }
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;

    int storeI = startI + 1;//no need to store @ pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so elem @storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now list[startI..newPivotI] are <= pivotValue, and list[newPivotI] == pivotValue.
    return newPivotI;
}
//Note: as an extension method, this must be declared in a non-nested static class.
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}


Thanks, Gronim, for pointing me in the right direction!

Recommended answer


The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile, set k = 0.05n and find the k largest elements.
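On the C++ side, the standard library already ships such a selection algorithm as std::nth_element (introselect, average O(n)), so no hand-rolled quickselect is needed there. A minimal sketch computing both cutoffs, with illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Selection-based version: std::nth_element places the element at the
// requested rank into its sorted position and partitions the rest
// around it - no full sort needed. Average O(n) per call.
std::pair<double, double> percentileBoundsBySelect(std::vector<double> data) {
    std::size_t lo = static_cast<std::size_t>(0.05 * data.size());
    std::size_t hi = static_cast<std::size_t>(0.95 * data.size());
    std::nth_element(data.begin(), data.begin() + lo, data.end());
    double p5 = data[lo];
    // After the first call everything below index lo is <= data[lo],
    // so the second selection only needs to scan the upper partition.
    std::nth_element(data.begin() + lo, data.begin() + hi, data.end());
    double p95 = data[hi];
    return {p5, p95};
}
```

Note that std::nth_element reorders the data in place, which is why this sketch takes the vector by value.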

Reference:

http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements

 