"在线" (迭代器)算法,估算统计位数,众数,偏度,峰度?在线、位数、算法、迭代

2023-09-10 22:46:17 作者:抢我辣条还想跑ヘ

有一个算法来估算位数,众数,偏度,和/组值的或峰度,但这并不需要存储在内存中的所有值一次?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?

我想计算的基本统计信息:

I'd like to calculate the basic statistics:

的意思是:算术平均值 变化:从的均值偏差的平方 标准差:方差的平方根 位数:值从较小的半分隔更大数目的一半 模式:在设置中最常见的值 偏度:TL;博士 峰度:TL;博士 mean: arithmetic average variance: average of squared deviations from the mean standard deviation: square root of the variance median: value that separates larger half of the numbers from the smaller half mode: most frequent value found in the set skewness: tl; dr kurtosis: tl; dr

基本公式计算任何这些是等级学校的算术,我也认识他们。有迹象表明,实现了他们很多的统计库,也是如此。

The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.

我的问题是值的,我处理组的大量(十亿):使用Python,我不能只是做一个列表或哈希与数十亿的元素。即使我用C写了这个,数十亿单元阵列不太现实。

My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

中的数据进行排序。它产生随机上即时,由其他进程。各组的大小是高度可变的,以及大小不会提前知道。

The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

我已经想通了如何处理的均值和方差pretty的同时,通过设置任意顺序每个值进行迭代。 (其实,在我的情况,我把他们在他们正在生成的顺序。)下面是我使用的算法,礼貌http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

在初始化三个变量:计数,总和,并sum_of_squares 对于每一个值: 增计数。 添加值来概括。 添加值的平方sum_of_squares。 Initialize three variables: count, sum, and sum_of_squares For each value: Increment count. Add the value to sum. Add the square of the value to sum_of_squares.

本上线算法的弱点(如精度问题sum_of_squares快速增长不是整数的范围或浮动precision大),但它基本上给我我需要什么,而不必存储在每个每个值集。

This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.

但我不知道类似的技术是否用于估计其他统计量(中位数,众偏度,峰度)存在。我可以忍受一个偏估计,甚至是一种方法,妥协的准确性在一定程度上,只要处理的N值所需要的内存比O(N)。

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

指点我到现有的统计库将帮助也一样,如果库中有函数来计算一个上线,这些或多个操作。

Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".

推荐答案

偏度和峰度

有关上线算法偏度和峰度(沿方差线),看到在同一个Wiki页面here并行算法更高时刻的统计数据。

For the on-line algorithms for Skewness and Kurtosis (along the lines of the variance), see in the same wiki page here the parallel algorithms for higher-moment statistics.

中位数

中位数是没有排序的数据坚韧。如果你知道,你有多少个数据点,理论上你只需要部分地排序,如通过使用选择算法。然而,这并不能帮助太多数十亿值。我会建议使用频率计数,请参见下一节。

Median is tough without sorted data. If you know, how many data points you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help too much with billions of values. I would suggest using frequency counts, see the next section.

中位数,众与频率计数

如果是整数,我也要算 频率的,可能是切断超过一定价值的最高值和最低值的地方我相信它不再是相关的。对于花车(或太多的整数),我可能会创造桶/间隔,然后用同样的方法为整数。 (近似)模式和中位数计算比变得容易,基于频率表。

If it is integers, I would count frequencies, probably cutting off the highest and lowest values beyond some value where I am sure that it is no longer relevant. For floats (or too many integers), I would probably create buckets / intervals, and then use the same approach as for integers. (Approximate) mode and median calculation than gets easy, based on the frequencies table.

通常分布的随机变量

如果它是正态分布的,我会用人口抽样意味着,的