Possible bug in Knuth's pseudocode for reservoir sampling

2023-09-11 07:05:35 Author: 灬永杬菂神奇寳寳

Below is the pseudocode from Knuth for reservoir sampling (how to select k numbers from a set of N numbers so that every number has the same probability of being chosen).

Init: a reservoir of size k.

for i = k+1 to N
    M = random(1, i);

    if (M < k) // should this be if (M <= k)
       SWAP the Mth value and ith value
    end if    
end for

From this code, the probability of M < k is (k-1)/i, not k/i, so I think the if statement in the loop body should be if (M <= k). I tried to test the difference between the two, but I didn't get anywhere.

Recommended answer

You are right. However, your code does not correctly implement Algorithm R. The bug is yours (or whoever wrote this code), not Knuth's ;-)
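The difference between the two conditions can be checked empirically. The sketch below is an illustration, not part of the original question: it assumes the N values sit in one array whose first k slots act as the reservoir, it reads random(1, i) as a uniform integer in 1..i inclusive, and the names `sample_by_swapping` and `inclusion_rate` are hypothetical.

```python
import random

def sample_by_swapping(n_items, k, inclusive, rng):
    """Run the loop from the question on the values 1..n_items.

    The first k slots of `values` play the role of the reservoir.
    `inclusive` selects `M <= k` (True) or the questioned `M < k` (False).
    """
    values = list(range(1, n_items + 1))
    for i in range(k + 1, n_items + 1):
        m = rng.randint(1, i)  # uniform on 1..i, both ends included
        if (m <= k) if inclusive else (m < k):
            # 1-based positions m and i map to 0-based m-1 and i-1
            values[m - 1], values[i - 1] = values[i - 1], values[m - 1]
    return values[:k]

def inclusion_rate(item, n_items, k, inclusive, trials=20000, seed=0):
    """Fraction of trials in which `item` ends up in the sample."""
    rng = random.Random(seed)
    hits = sum(item in sample_by_swapping(n_items, k, inclusive, rng)
               for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    n_items, k = 10, 3
    for inclusive in (False, True):
        label = "M <= k" if inclusive else "M <  k"
        # Compare the value sitting in slot k with the last value read.
        rates = [inclusion_rate(x, n_items, k, inclusive) for x in (k, n_items)]
        print(label, [round(r, 3) for r in rates])
```

With `M < k`, slot k can never be swapped out, so the value there is selected every time, while a late value enters with probability (k-1)/i rather than k/i. With `M <= k`, both rates should land near k/N = 0.3.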

Quoting from Knuth (The Art of Computer Programming, Vol. 2, 3rd ed., 1998, p. 144):

... A problem arises if we don't know the value of N in advance, since the precise value of N is crucial in Algorithm S. Suppose we want to select n items at random from a file, without knowing exactly how many are present in that file. We could first go through and count the records, then take a second pass to select them; but it is generally better to sample m ≥ n of the original items on the first pass, where m is much less than N, so that only m items must be considered on the second pass. The trick, of course, is to do this in such a way that the final result is a truly random sample of the original file.

Since we don't know when the input is going to end, we must keep track of a random sample of the input records seen so far, thus always being prepared for the end. As we read the input we will construct a "reservoir" that contains only the records that have appeared among the previous samples. The first n records always go into the reservoir. When the (t + 1)st record is being input, for t ≥ n, we will have in memory a table of n indices pointing to the records that we have chosen from among the first t. The problem is to maintain this situation with t increased by one, namely to find a new random sample from among the t + 1 records now known to be present. It is not hard to see that we should include the new record in the new sample with probability n/(t + 1), and in such a case it should replace a random element of the previous sample.
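The claim in the last sentence of the quote can be checked with a short induction on t. This is a sketch rather than Knuth's own argument: S_t is notation introduced here for the sample held after t records, and the assumption is that each of the first t records is in S_t with probability n/t.

```latex
\begin{align*}
\Pr[\text{record } t+1 \in S_{t+1}] &= \frac{n}{t+1}, \\
\Pr[\text{record } j \in S_{t+1}]
  &= \Pr[\text{record } j \in S_t]\,\Bigl(1 - \frac{n}{t+1}\cdot\frac{1}{n}\Bigr)
   = \frac{n}{t}\cdot\frac{t}{t+1}
   = \frac{n}{t+1}
   \qquad (1 \le j \le t).
\end{align*}
```

An old record drops out only when the new record is accepted (probability n/(t+1)) and happens to replace that record's slot (probability 1/n), so every record seen so far stays in the sample with the same probability n/(t+1). The question's M < k test accepts with probability (k-1)/i instead, which breaks this step.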

Thus, the following procedure does the job:

Algorithm R (Reservoir sampling). To select n records at random from a file of unknown size ≥ n, given n > 0. An auxiliary file called the "reservoir" contains all records that are candidates for the final sample. The algorithm uses a table of distinct indices I[j] for 1 ≤ j ≤ n, each of which points to one of the records in the reservoir.

R1. [Initialize.] Input the first n records and copy them to the reservoir. Set I[j] ← j for 1 ≤ j ≤ n, and set t ← m ← n. (If the file being sampled has fewer than n records, it will of course be necessary to abort the algorithm and report failure. During this algorithm, indices I[1], ..., I[n] point to the records in the current sample; m is the size of the reservoir; and t is the number of input records dealt with so far.)

R2. [End of file?] If there are no more records to be input, go to step R6.

R3. [Generate and test.] Increase t by 1, then generate a random integer M between 1 and t (inclusive). If M > n, go to R5.

R4. [Add to reservoir.] Copy the next record of the input file to the reservoir, increase m by 1, and set I[M] ← m. (The record previously pointed to by I[M] is being replaced in the sample by the new record.) Go back to R2.

R5. [Skip.] Skip over the next record of the input file (do not include it in the reservoir), and return to step R2.

R6. [Second pass.] Sort the I table entries so that I[1] < ... < I[n]; then go through the reservoir, copying the records with these indices into the output file that is to hold the final sample.

Pseudocode for Algorithm R would look something like this:

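// R1: the first n records always go into the reservoir
for j = 1 to n
    Reservoir[j] = File.GetNext()
    I[j] = j

t = n // number of input records so far
m = n // size of the reservoir

// R2-R5: process the rest of the file
while not File.EOF()
    x = File.GetNext()
    t++
    M = Random(1..t) // uniform integer in 1..t, inclusive
    if (M <= n)
        m++
        Reservoir[m] = x
        I[M] = m // the new record replaces sample entry M

// R6: sort the indices, then copy out the final sample
Sort(I[1..n])

for j = 1 to n
    Output[j] = Reservoir[I[j]]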
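For readers who want something runnable, here is a minimal Python sketch of the same steps. It assumes the input is any Python iterable and keeps the reservoir in a list instead of an auxiliary file. The function name `algorithm_r` and the `rng` parameter are illustrative choices, not Knuth's.

```python
import random

def algorithm_r(records, n, rng=random):
    """Select n records uniformly at random from an iterable of unknown length.

    Follows steps R1-R6: the reservoir only grows, and the table `index`
    records which reservoir entries form the current sample.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(records)

    # R1: the first n records always go into the reservoir.
    reservoir = []
    for _ in range(n):
        try:
            reservoir.append(next(it))
        except StopIteration:
            raise ValueError("input has fewer than n records") from None
    index = list(range(n))  # 0-based counterpart of I[j] = j
    t = n                   # number of records read so far

    # R2-R5: read until the input is exhausted.
    for record in it:
        t += 1
        m = rng.randint(1, t)   # R3: uniform on 1..t, both ends included
        if m <= n:              # R4: happens with probability n/t
            reservoir.append(record)
            index[m - 1] = len(reservoir) - 1
        # R5: otherwise the record is skipped

    # R6: sort the indices so the sample comes out in input order.
    index.sort()
    return [reservoir[i] for i in index]
```

When everything fits in memory, a common simplification drops the index table and overwrites slot M of an n-element list directly. That variant gives up the input-order output of step R6.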
 