Possible error in reservoir sampling pseudocode from Knuth

2023-09-11 07:05:35 Author: 灬永杬菂神奇寳寳

Below is the pseudocode from Knuth for reservoir sampling (how to select k numbers from a set of n numbers, making sure that every number has the same probability of being selected).
Init: a reservoir of size k.
for i = k+1 to N
    M = random(1, i);

    if (M < k) // should this be if (M <= k)
       SWAP the Mth value and ith value
    end if    
end for



From this code, the probability of M < k is (k-1)/i, not k/i, so I think the if statement in the body of the loop should be if (M <= k). I tried to test the difference between them, but I didn't get anywhere.
Recommended answer
You are right. However, your code does not correctly implement Algorithm R. The bug is yours (or whoever wrote this code), not Knuth's ;-)
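To see the difference between the two tests empirically, here is a minimal sketch in Python. It is an illustration, not the asker's original program: the function name `inclusion_counts` and its parameters are made up here, and "SWAP the Mth value and ith value" is modeled as overwriting reservoir slot M with item i. With `M <= k` every item should end up in the sample with frequency close to k/N. With `M < k` slot k is never overwritten, so item k is always kept, and items after k are included with probability (k-1)/N instead.

```python
import random


def inclusion_counts(n_items, k, trials, strict, seed=0):
    """Count how often each item 1..n_items ends up in the reservoir.

    strict=True uses the questioned test `M < k`;
    strict=False uses the corrected test `M <= k`.
    """
    rng = random.Random(seed)
    counts = [0] * (n_items + 1)  # index 0 is unused (items are 1-based)
    for _ in range(trials):
        reservoir = list(range(1, k + 1))  # the first k items fill the reservoir
        for i in range(k + 1, n_items + 1):
            m = rng.randint(1, i)  # inclusive on both ends
            replace = (m < k) if strict else (m <= k)
            if replace:
                reservoir[m - 1] = i  # item i takes slot m
        for item in reservoir:
            counts[item] += 1
    return counts[1:]


if __name__ == "__main__":
    trials = 20000
    for strict in (True, False):
        counts = inclusion_counts(10, 3, trials, strict)
        label = "M <  k" if strict else "M <= k"
        print(label, [round(c / trials, 3) for c in counts])
```

For N = 10 and k = 3, the `M < k` variant should show frequencies near 0.3 for items 1 and 2, exactly 1.0 for item 3, and near 0.2 for items 4 to 10, while the `M <= k` variant should show roughly 0.3 everywhere.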
Quoting from Knuth (The Art of Computer Programming, Vol. 2, 3rd ed., 1998, p. 144):
  ... A problem arises if we don't know the value of N in advance, since the precise value of N is crucial in Algorithm S. Suppose we want to select n items at random from a file, without knowing exactly how many are present in that file. We could first go through and count the records, then take a second pass to select them; but it is generally better to sample m ≥ n of the original items on the first pass, where m is much less than N, so that only m items must be considered on the second pass. The trick, of course, is to do this in such a way that the final result is a truly random sample of the original file.
Since we don't know when the input is going to end, we must keep track of a random sample of the input records seen so far, thus always being prepared for the end. As we read the input we will construct a "reservoir" that contains only the records that have appeared among the previous samples. The first n records always go into the reservoir. When the (t + 1)st record is being input, for t ≥ n, we will have in memory a table of n indices pointing to the records that we have chosen from among the first t. The problem is to maintain this situation with t increased by one, namely to find a new random sample from among the t + 1 records now known to be present. It is not hard to see that we should include the new record in the new sample with probability n/(t + 1), and in such a case it should replace a random element of the previous sample.
Thus, the following procedure does the job:
Algorithm R (Reservoir sampling). To select n records at random from a file of unknown size ≥ n, given n > 0. An auxiliary file called the "reservoir" contains all records that are candidates for the final sample. The algorithm uses a table of distinct indices I[j] for 1 ≤ j ≤ n, each of which points to one of the records in the reservoir.
R1. [Initialize.] Input the first n records and copy them to the reservoir. Set I[j] ← j for 1 ≤ j ≤ n, and set t ← m ← n. (If the file being sampled has fewer than n records, it will of course be necessary to abort the algorithm and report failure. During this algorithm, indices I[1], ..., I[n] point to the records in the current sample; m is the size of the reservoir; and t is the number of input records dealt with so far.)
R2. [End of file?] If there are no more records to be input, go to step R6.
R3. [Generate and test.] Increase t by 1, then generate a random integer M between 1 and t (inclusive). If M > n, go to R5.
R4. [Add to reservoir.] Copy the next record of the input file to the reservoir, increase m by 1, and set I[M] ← m. (The record previously pointed to by I[M] is being replaced in the sample by the new record.) Go back to R2.
R5. [Skip.] Skip over the next record of the input file (do not include it in the reservoir), and return to step R2.
R6. [Second pass.] Sort the I table entries so that I[1] < ... < I[n]; then go through the reservoir, copying the records with these indices into the output file that is to hold the final sample.
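The "not hard to see" claim in the quoted passage can be checked with a short induction. This is a sketch added for illustration, not part of Knuth's text. Assume that after t records each one is in the sample with probability n/t, that the new record is included with probability n/(t+1), and that when it is included it replaces one of the n sample slots uniformly at random. Then an old record stays in the sample with probability

$$
\frac{n}{t}\left(1 - \frac{n}{t+1}\cdot\frac{1}{n}\right)
= \frac{n}{t}\cdot\frac{t}{t+1}
= \frac{n}{t+1},
$$

which matches the inclusion probability of the new record, so all t + 1 records are equally likely to be in the sample. The test M ≤ n in step R3 gives exactly the inclusion probability n/(t+1) (with t already increased); a test M < n would give (n-1)/(t+1) and break the argument.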
Pseudocode for Algorithm R would look something like:
for j= 1 to n
    Reservoir[j]= File.GetNext()
    I[j]= j

t=n // number of input records so far
m=n // size of the reservoir

while not File.EOF()
    x= File.GetNext()
    t++
    M= Random(1..t)
    if (M<=n)
        m++
        Reservoir[m]= x
        I[M]= m

Sort(I[1..n])

for j= 1 to n
    Output[j]= Reservoir[I[j]]
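The pseudocode above can be turned into runnable Python fairly directly. The following is a minimal sketch under a few assumptions: the function name `algorithm_r` and the optional `rng` parameter are made up for this example, the input is any Python iterable rather than a file, and the index table is 0-based internally. As in steps R1 to R6, the reservoir only grows and the index table decides which reservoir entries form the final sample.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def algorithm_r(records: Iterable[T], n: int,
                rng: Optional[random.Random] = None) -> List[T]:
    """Select n records at random from an input of unknown length >= n."""
    if n <= 0:
        raise ValueError("n must be positive")
    rng = rng or random.Random()
    it = iter(records)

    # R1: the first n records always go into the reservoir.
    reservoir: List[T] = []
    for _ in range(n):
        try:
            reservoir.append(next(it))
        except StopIteration:
            raise ValueError("input has fewer than n records") from None
    index = list(range(n))  # index[j] points into the reservoir (0-based)
    t = n                   # number of input records dealt with so far

    # R2 to R5: read until the input is exhausted.
    for x in it:
        t += 1
        big_m = rng.randint(1, t)      # random integer in 1..t, inclusive
        if big_m <= n:                 # note: <=, not <
            reservoir.append(x)        # R4: the reservoir grows by one
            index[big_m - 1] = len(reservoir) - 1
        # otherwise R5: skip the record

    # R6: sort the index table and copy out the chosen records.
    index.sort()
    return [reservoir[i] for i in index]


if __name__ == "__main__":
    print(algorithm_r(range(1, 101), 5, random.Random(42)))
```

Because the reservoir is appended to in input order and the index table is sorted in R6, the sample comes out in the order the records appeared in the input. If memory matters more than fidelity to Knuth's description, the reservoir can be kept at size n by overwriting slot M directly, which is the variant the question's code was aiming for.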



                
                