How do I keep a random subset of a stream of data? (tags: subset, data stream)

2023-09-11 03:31:33 Author: 我会慢慢长大

I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.

So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?
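(For context, the textbook answer to exactly this problem is reservoir sampling, e.g. Algorithm R. A minimal sketch, not part of the original question; the function name and signature here are illustrative:)

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Below the cap: keep everything.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1);
            # if it survives, it evicts a uniformly chosen old item.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```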

Thanks

Recommended answer

For each element ei in your input stream, generate a random number ri. Add the pair (ri, ei) to a set. When the set exceeds your sample size n, throw out the pair with the smallest r. (A heap is a convenient data structure for doing this.)

At the end of the procedure, you now have a sample of elements paired with the largest n random numbers (hence chosen uniformly from all samples of that length).

For example, in Python, you would write it like this:

import heapq, random

def sample(s, n):
    """
    Generate a random sample of n elements from the sequence s.
    """
    # Tag every element with a uniform random key...
    pairs = ((random.random(), e) for e in s)
    # ...and keep the n elements with the largest keys.
    for r, e in heapq.nlargest(n, pairs):
        yield e

or rather, you would if it weren't for the fact that random.sample is already a function in the standard library!
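A usage sketch of the standard-library function (note that random.sample needs a population with a known length, so it fits a stored sequence; for a truly unbounded stream, the heap-based approach above still applies):

```python
import random

population = range(1_000_000)            # any sequence with a known length
subset = random.sample(population, 10)   # 10 distinct items, chosen uniformly

print(len(subset))        # 10
print(len(set(subset)))   # 10 -- sampling is without replacement
```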