＆QUOT;在线＆QUOT; （迭代器）算法，估算统计位数，众数，偏度，峰度？在线、位数、算法、迭代

2023-09-10 22:46:17 作者：抢我辣条还想跑ヘ

有一个算法来估算位数，众数，偏度，和/组值的或峰度，但这并不需要存储在内存中的所有值一次？

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?

我想计算的基本统计信息：

I'd like to calculate the basic statistics:

的意思是：算术平均值变化：从的均值偏差的平方标准差：方差的平方根位数：值从较小的半分隔更大数目的一半模式：在设置中最常见的值偏度：TL;博士峰度：TL;博士 mean: arithmetic average variance: average of squared deviations from the mean standard deviation: square root of the variance median: value that separates larger half of the numbers from the smaller half mode: most frequent value found in the set skewness: tl; dr kurtosis: tl; dr

基本公式计算任何这些是等级学校的算术，我也认识他们。有迹象表明，实现了他们很多的统计库，也是如此。

The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.

我的问题是值的，我处理组的大量（十亿）：使用Python，我不能只是做一个列表或哈希与数十亿的元素。即使我用C写了这个，数十亿单元阵列不太现实。

My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

中的数据进行排序。它产生随机上即时，由其他进程。各组的大小是高度可变的，以及大小不会提前知道。

The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

我已经想通了如何处理的均值和方差pretty的同时，通过设置任意顺序每个值进行迭代。（其实，在我的情况，我把他们在他们正在生成的顺序。）下面是我使用的算法，礼貌http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

在初始化三个变量：计数，总和，并sum_of_squares 对于每一个值：增计数。添加值来概括。添加值的平方sum_of_squares。 Initialize three variables: count, sum, and sum_of_squares For each value: Increment count. Add the value to sum. Add the square of the value to sum_of_squares.

本上线算法的弱点（如精度问题sum_of_squares快速增长不是整数的范围或浮动precision大），但它基本上给我我需要什么，而不必存储在每个每个值集。

This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.

但我不知道类似的技术是否用于估计其他统计量（中位数，众偏度，峰度）存在。我可以忍受一个偏估计，甚至是一种方法，妥协的准确性在一定程度上，只要处理的N值所需要的内存比O（N）。

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

指点我到现有的统计库将帮助也一样，如果库中有函数来计算一个上线，这些或多个操作。

Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".

＆QUOT;在线＆QUOT; （迭代器）算法，估算统计位数，众数，偏度，峰度？在线、位数、算法、迭代

推荐答案