Iterative or Lazy Reservoir Sampling

2023-09-11 03:59:38 Author: 你深入我心

I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since one has to allow items in the entirety of the sequence the opportunity to replace previously encountered items to achieve a uniform sample.
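For reference, the single-pass approach described above is usually written along these lines. This is only a minimal sketch of the classic reservoir scheme (often called Algorithm R); the name `reservoir_sample` and the parameter `k` are illustrative choices, not from any library. It makes the limitation visible: nothing can be returned until the loop has consumed the whole iterable.

```python
import random

def reservoir_sample(iterable, k):
    """Return up to k items chosen uniformly at random from an iterable
    whose length is not known in advance (hypothetical helper)."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in 0..i inclusive; the new item replaces an old
            # one only if the slot falls inside the reservoir.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    # Any later item could still have evicted an earlier one, so the
    # result is only final once the input is exhausted.
    return reservoir

sample = reservoir_sample(iter(range(1000)), 10)
```

Because every later item has a chance to evict an earlier one, yielding an item early would commit to a choice that the algorithm might still need to undo.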

Is there a way to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with Python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!

Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.

Clearly there is a seeming contradiction in not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?

Recommended answer

If you know in advance the total number of items that an iterable population will yield, it is possible to yield the items of a sample as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible, because the probability of any given item being in the sample can't be calculated.

Here's a quick generator that does this:

import random

def sample_given_size(population, population_size, sample_size):
    # Keep each item with probability
    # (items still needed) / (items still remaining).
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

Note that the generator yields items in the order they appear in the population (not in random order, as random.sample or most reservoir sampling implementations do), so a slice of the sample will not be a random subsample!
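To make the ordering caveat concrete, here is a small sketch. The generator is repeated so the block runs on its own, and it assumes Python 3 division; the variable names `in_order`, `shuffled`, and `subsample` are just illustrative. If a random subsample is needed, one option is to collect the sample and shuffle it before slicing, at the cost of giving up laziness for that step.

```python
import random

def sample_given_size(population, population_size, sample_size):
    # Same generator as in the answer, repeated so this block is self-contained.
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

# Draw 10 items from a population of 100 consecutive integers.
sample = list(sample_given_size(iter(range(100)), 100, 10))

# Items come out in population order, so for this input the list is sorted.
in_order = sample == sorted(sample)

# Shuffling a copy of the collected sample puts it in random order,
# so a slice of the shuffled list is a random subsample.
shuffled = sample[:]
random.shuffle(shuffled)
subsample = shuffled[:3]
```

Taking `sample[:3]` directly would instead always favor the items nearest the start of the population.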

 