的实时数据采集百分百分、数据采集、实时

2023-09-10 23:45:31 作者:你的眼中没有我@

我在寻找一种算法,确定百分实时数据采集。

I am looking for an algorithm that determines percentiles for live data capture.

例如,假设一个服务器应用程序的开发。

For example, consider the development of a server application.

服务器可能具有响应时间如下: 17毫秒 33毫秒 52毫秒 60毫秒 55毫秒 等等。

The server might have response times as follows: 17 ms 33 ms 52 ms 60 ms 55 ms etc.

这是非常有用的报告90个百分点的响应时间,80百分位的响应时间等。

It is useful to report the 90th percentile response time, 80th percentile response time, etc.

天真的算法是每一个响应时间插入到一个列表。当统计要求,对列表进行排序,并获取值的正确位置上。

The naive algorithm is to insert each response time into a list. When statistics are requested, sort the list and get the values at the proper positions.

内存用法呈线性变化关系的请求数量。

Memory usages scales linearly with the number of requests.

有一个算法,产生给出有限的内存使用大概百分统计?例如,假设我要解决的,我处理数百万请求的一种方式这个问题,但只是想用说一个内存千字节百分跟踪(丢弃跟踪旧的要求是不是一种选择,因为百分都应该对于所有的请求)。

Is there an algorithm that yields "approximate" percentile statistics given limited memory usage? For example, let's say I want to solve this problem in a way that I process millions of requests but only want to use say one kilobyte of memory for percentile tracking (discarding the tracking for old requests is not an option since the percentiles are supposed to be for all requests).

另外要求,没有分布的先验知识。例如,我不希望指定水桶提前任何范围。

Also require that there is no a priori knowledge of the distribution. For example, I do not want to specify any ranges of buckets ahead of time.

推荐答案

我相信有这个问题很多很好的近似算法。成绩优秀的切割方法是简单地用一个固定大小的数组(比如1K有价值的数据)。修正了一些概率p。对于每个请求,以概率p,写入其响应时间到阵列(代替最早的时间在那里)。由于数组是实时流的二次抽样,自二次取样preserves的分布,这样做阵列上的数据将会给你的全部,实时流的统计数据近似。

I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.

该方法有几个优点:它不需要先验信息,并可以很容易地code。您可以快速构建,并通过实验确定,为你的特定服务器,在什么时候增大缓冲区对答案只有微不足道的影响。这是点的近似足够precise。

This approach has several advantages: it requires no a-priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.

如果你发现你需要太多的内存,给你的统计数据是precise不够,那么你就必须进一步深入研究。好的关键字是:流计算,流统计,当然还有百分。您也可以尝试愤怒和诅咒的做法。

If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try "ire and curses"'s approach.