C#:如何让阿特金的筛增量增量、阿特

2023-09-10 22:57:43 作者:待人相惜

我不知道这是可能的或没有,但我得问问。那种我的数学和算法技术在这里没有我:P

I don't know if this is possible or not, but I just gotta ask. My mathematical and algorithmic skills are kind of failing me here :P

关键是我现在有这个类,生成素数达到一定的限制:

The thing is I now have this class that generates prime numbers up to a certain limit:

public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
            {
                var s = n * n;
                for (ulong k = s; k <= limit; k += s)
                    isPrime[k] = false;
            }

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }


    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();

        foreach (var p in primes)
            yield return p;
    }


    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

现在,我想是摆脱的限制,使该序列将永远不会停止(理论上)。

Now, what I would like is to get rid of the limit so that the sequence would never stop (theoretically).

我想这可能去是这样的:

I am thinking it could go something like this:

在开始与一些琐碎的限制 找到所有的素数到了极限 收益的所有新发现的素数 增加限制(通过加倍或者平方旧的限制或类似的东西) 转到第2步

和优化这两个步骤中应该只有旧的限制和新的工作。换句话说,它不应该一次又一次地找到最低的素数。

And optimally that step two should only have to work between the old limit and the new one. In other words it shouldn't have to find the lowest primes again and again.

有没有一种方法可以做到这一点?我的主要问题是,我不太明白什么 X 例如是这样的算法研究。就像,我能不能使用相同的算法种,但设置 X oldLimit (最初1),然后运行它以 newLimit ?或如何将这项工作?任何明亮的头脑用一些轻阐明这一点?

Is there a way this can be done? My main problem is that I don't quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit? Or how would that work? Any bright minds with some light to shed on this?

这样做的问题是,这样我就不必设置限制。所以,我可以如使用LINQ和刚以()然而,许多素数我需要的,而不是担心如果限制是足够高,等等。

The point of this is so that I won't have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

推荐答案

下面是一个解决方案,以无界素筛分在C#中,可以采用以下方式实现埃拉托色尼(SOE)或阿特金的筛(SOA)算法的筛;但是,我认为这是不值得的比实际环境状况给出一个优化的SOA解决方案的极端复杂性提供了有关不作为的复杂性相同的性能。因此,这也许是只有部分答案,同时展示了如何实现更好的SOA算法,并展示了如何运用出售国有企业实现一个无界序列,它只是提示怎么把这些写了一个相当有效的SOA。

Here is a solution to unbounded prime sieving in C#, which can be implemented using the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithms; however, I maintain that it is hardly worth the extreme complexity of an optimized SoA solution given than the true SoE gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these to write a reasonably efficient SoA.

请注意,如果对这些想法非常最快的实现只讨论需要,跳转到这个答案的底部。

首先,我们应该对这项工作的角度发表评论生产的素数的无限序列使用IEnumerable的扩展方法,如采取(),TakeWhile(),其中(),COUNT(),等等许可,因为这些方法白送由于方法调用的水平增加一些性能,为每次通话至少增加28个机器周期,返回列举一个值,并增加几个级别的每个功能的方法调用;这么说,有一个(有效)无限序列仍然是有用的,即使人使用的枚举的结果更有必要的过滤技术更好的速度。

First we should comment on the point of this exercise in producing an unbounded sequence of primes to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), etcetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) infinite sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.

请注意,在简单的例子下面的,我有范围限制为无符号32位数字(UINT)之多过去的范围基本环境状况和SOA的实现是不是真的适合作为提高效率,一个需要切换到一个桶或其他形式的用于筛分效率的一部分增量筛的;的然而,对于完全优化页面分段筛中最快的实现的范围内,增大到64位范围虽然作为写入一个可能的不希望筛过去约一百兆(10至第十四功率)由于运行时间将长达几百年的全系列。

Note, in the the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren't really appropriate as to efficiency and one needs to switch over to a "bucket" or other form of incremental sieve for part of the sieving for efficiency; however, for a fully optimized page segmented sieve as in the fastest implementation, the range is increased to the 64-bit range although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time would take up to hundreds of years for the full range.

由于问题选择了SOA的大概在第一个错误的原因,误以为审判庭(TD)黄金筛一个真正的环境状况,然后使用一个天真的SOA在TD筛子,让我们建立真正的边界环境状况,可实现在几行作为方法(可以转换为一个类按问​​题的实现编码风格)如下:

As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) prime sieve for a true SoE and then using a naive SoA over the TD sieve, let's establish the true bounded SoE which can be implemented in a few lines as a method (which could be converted to a class as per the question's implementation coding style) as follows:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  var BFLMT = (top_number - 3u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  for (var i = 0u; i <= BFLMT; ++i) if (buf[(int)i]) {
      var p = 3u + i + i; if (i <= SQRTLMT) {
        for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
          buf[(int)j] = false; } yield return p; } }

该计算质数为2万美元的约16毫秒上i7-2700K(3.5千兆赫)和203280221素数最多的32位数字范围4294967291,在同一台机器上约67秒(给定一个备用256兆RAM内存)。

This calculates the primes to 2 million in about 16 milliseconds on an i7-2700K (3.5 GHz) and the 203,280,221 primes up to 4,294,967,291 in the 32-bit number range in about 67 seconds on the same machine (given a spare 256 MegaBytes of RAM memory).

现在,使用上面的版本来比较SOA是不公平的真正的SOA自动忽略的2的倍数,3和5,从而引入轮分解为国有企业做同样产生以下code

Now, using the version above to compare to the SoA is hardly fair as the true SoA automatically ignores the multiples of 2, 3, and 5 so introducing wheel factorization to do the same for the SoE yields the following code:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  yield return 5u; if (top_number < 7u) yield break;
  var BFLMT = (top_number - 7u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
  for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
    if (buf[(int)i]) { var p = 7u + i + i; if (i <= SQRTLMT) {
        var pX2 = p + p; uint[] pa = { p, pX2, pX2 + p };
        for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                               j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
          buf[(int)j] = false; } yield return p; } }

以上code计算在大约10毫秒的素数为2万元,而素数到32位数字范围内大约40秒的同一台机器上上面。

The above code calculates the primes to 2 million in about 10 milliseconds and the primes to the 32-bit number range in about 40 seconds on the same machine as above.

接下来,让我们建立一个版本,我们很可能会在这里写的SOA是否真的有任何好处相比,国有企业按照上述code至于执行速度去。下面的code遵循国有企业的上述模型和优化SOA伪code从的维基百科文章以调整的范围为X,并使用为单个二次解决方案独立循环的该文章建议,解决了二次方程(和方免费冲销)只为奇数的解决方案'Y'变量,结合3 * X ^ 2二次解决同时为正和负中的同一通第二方面,消除了计算上昂贵的模运算,以产生code表示比幼稚版本快三倍以上的伪code贴那里,在这个问题用在这里。在code为界SOA是为随后如下:

Next, let's establish whether a version of the SoA that we are likely to write here actually has any benefit as compared to the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article as to tuning the ranges of the 'x' and 'y' variables using separate loops for the individual quadratic solutions as suggested in that article, solving the quadratic equations (and the square free eliminations) only for odd solutions, combining the "3*x^2" quadratic to solve for both the positive and negative second terms in the same pass, and eliminating the computationally expensive modulo operations, to produce code that is over three times faster than the naive version of the pseudo-code posted there and as used in the question here. The code for the bounded SoA is as then follows:

static IEnumerable<uint> primesSoA(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  var BFLMT = (top_number - 3u) / 2u; var buf = new BitArray((int)BFLMT + 1, false);
  var SQRT = (uint)(Math.Sqrt((double)top_number)); var SQRTLMT = (SQRT - 3u) / 2u;
  for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1) {
    for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1) {
      if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u; }
    mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u; dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u; }
  for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1) {
    int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
    for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1) {
      if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12; }
    if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; } else { n += 1; md12 += 2; dmd12 = 8; }
    if (md12 >= 12) md12 -= 12; for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1) {
      if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12; }
    mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12; dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12; }
  for (var i = 0u; i<=BFLMT; ++i) if (buf[(int)i]) { var p = 3u+i+i; if (i<=SQRTLMT) { var sqr = p*p;
        for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr) buf[(int)j] = false; } yield return p; } }

这是仍超过两倍的发布,由于没有完全优化的一些操作的轮分解环境状况算法慢。进一步优化,可到了SOA code在使用模60操作作为原始(非伪code)算法,并采用位打包,只处理倍数不包括3和5,以获得$ C $Ç相当接近的性能国有企业,甚至超过它略有下降,但我们越来越接近到了伯恩斯坦实现这个功能表现的复杂性。我的观点是:为什么SOA,当一个人的作品很难得到大致相同的性能随着国有企业?

This is still over twice as slow as the wheel factorization SoE algorithm posted due to the not fully optimized number of operations. Further optimizations can be made to the SoA code as in using modulo 60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5 to get the code fairly close in performance to SoE or even exceed it slightly, but we get closer and closer to the complexity of the Berstein implementation to achieve this performance. My point is "Why SoA, when one works very hard to get about the same performance as SoE?".

现在的无界素数序列,要做到这一点很简单的方法就是定义Uint32.MaxValue的常量top_number和消除primesSoE或primesSoA方法的参数。这是在该剔除仍然在整个数范围内进行不管实际素值被处理,这使得该判定为素数的小范围需要更长的时间比它应该有点低效。还有一些其他的原因去了素数的分段版本筛比这和极端内存使用等:一,使用过程更快,因为更高效的内存存取数据,主要是在CPU的L1或L2数据缓存算法,和其次是因为分割使得作业能够容易地分割成使用多核心处理器的高速化,可以是正比于所用磁芯的数量可以在后台被剔除的页面。

Now for the unbounded primes sequence, the very easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument in the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination for small ranges of primes take much longer than it should. There are also other reasons to go to a segmented version of the primes sieve other than this and extreme memory use: First, algorithms that use data that is primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and secondly because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors for a speed-up that can be proportional to the number of cores used.

有关效率,应该选择在CPU的L1或L2高速缓存大小是为了避免高速缓存抖动,因为我们宰杀素数的复合材料从阵列,这意味着BitArray至少16千字节为最现代化的CPU的阵列大小可以具有八倍大(每字节8位)或128千比特的尺寸。因为我们只需要筛奇素数,这再presents一系列数字较一季度万美元,这意味着一个分段的版本将只使用八段筛至200万为10一可以节省所需的欧拉问题从最后段的结果,并继续从该点,但是这将preclude适应此code键多核情况,其中一个遇事推诿剔除到多个线程的背景,以充分利用多芯的处理器。在C#$ C $下一个(单线程)无界环境状况如下:

For efficiency, we should choose an array size of the CPU L1 or L2 cache size which is at least 16 Kilobytes for most modern CPU's in order to avoid cache thrashing as we cull composites of primes from the array, meaning that the BitArray can have a size of eight times that large (8 bits per byte) or 128 Kilobits. Since we only need to sieve odd primes, this represents a range of numbers of over a quarter million, meaning that a segmented version will only use eight segments to sieve to 2 million as required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case where one relegates culling to the background on multiple threads to take full advantage of multi-core processors. The C# code for an (single thread) unbounded SoE is as follows:

static IEnumerable<uint> primesSoE() { yield return 2u; yield return 3u; yield return 5u;
  const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits...
  const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations
  var buf = new BitArray((int)BUFSZ);
  const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range
  var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u;
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position
                    0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow
  byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow...
                      15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 };
  uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true);
  for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array
      for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3];
  for (uint i = 0u, w = 0, si = 0; i <= MAXNDX;
        i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) {
    if (si == 0) { buf.SetAll(true);
      for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u)
        if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j;
          if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) {
            k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p);
              sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i;
          for (; k<BUFSZ; k+=pa[WHLPTRN[sw]-1], sw=(sw<7u) ? ++sw : 0u) buf[(int)k]=false; } }
    if (buf[(int)si]) yield return 7u + i + i; } }

以上code大约需要16毫秒筛素数最多两万元,约30秒,以筛完整的32位号码范围。这code是比用阶乘轮类似的版本速度更快,而不分割为众多的范围,因为,即使code是为计算更复杂,大量的时间被保存在没有​​颠簸的CPU缓存。大部分的复杂性是每段基总理在新的起点指标,可能已经减少了每段但上面的code节约每互素的状态可以很容易地转换,以便运行的计算在单独的后台线程剔除进程进一步四次加快一台机器具有四个核心,更是有八个核心。不使用BitArray类和寻址通过位屏蔽的各个位的位置将提供一个进一步加快由两个左右的一个因素,阶乘轮的进一步的增加将提供关于在时间至约三分之二的另一个减小。更好的包装位中淘汰指标由车轮分解消除会略有增加效益为大范围的值,但是也有些加入到了位操作的复杂性。然而,所有这些环境状况的优化会不会接近伯恩斯坦SOA实施的极端复杂性,但将运行在大致相同的速度。

The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds to sieve the full 32-bit number range. This code is faster than the similar version using the factorial wheel without segmentation for large number ranges because, even though the code is more complex as to computations, a great deal of time is saved in not thrashing the CPU caches. Much of the complexity is in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per prime per segment but the above code can easily be converted so as to run the culling processes on separate background threads for a further four times speed up for a machine with four cores, even more with eight cores. Not using the BitArray class and addressing the individual bit locations by bit masks would provide a further speed up by about a factor of two, and further increases of the factorial wheel would provide about another reduction in time to about two thirds. Better packing of the bit array in eliminated indexes for values eliminated by the wheel factorization would slightly increase efficiency for large ranges, but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would not approach the extreme complexity of the Berstein SoA implementation, yet would run at about the same speed.

要转换成上述code SOE的向SOA,我们只需要改变对SOA剔除code从有限的情况下,但与起始地址被重新计算每个页面段类似的修改的起始地址被计算为环境状况的情况下,但即使稍微中的操作,因为二次方程式预先通过的数量,而不是由素数的简单倍数平方复杂。我会留下必要的调整,以学生,但我实在不明白鉴于SOA与合理的优化点并不比国有企业的当前版本速度更快,复杂得多;)

To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case but with the modification that the starting addresses are recalculated for every page segment something like as the starting address is calculated for the SoE case, but even somewhat more complex in operation as the quadratics advance by squares of numbers rather than by simple multiples of primes. I'll leave the required adaptations to the student, although I don't really see the point given that the SoA with reasonable optimizations is not faster than the current version of the SoE and is considerably more complex ;)

EDIT_ADD:

注意:以下code已被纠正,因为虽然原贴code是正确的,因为所提供的素数序列,它比它需要是因为它剔除了一半又慢低于该范围的平方根,而不是只在发现碱素数可达的范围内的平方根的所有数字。

的最有效和最简单的优化退居每段页背景拣出操作后台线程以便,给予足够的CPU核心,在列举的素数的主要限制是枚举循环本身,在这种情况下,所有的素数中的32比特数的范围可以在大约10秒内列举的上述参考机器上(在C#),无其他优化,与所有其他的优化​​,包括写了IEnumerable接口实现,而不是使用内置的迭代器减少这种为大约六在32位数字范围内的所有203280221素数秒(1.65秒到1十亿),再次用大部分的时间仅仅花了列举的结果。下面code有许多这些修改包括使用从任务所使用的DOTNET框架4线程池后台任务,使用状态机实现为查找表来实施进一步的位打包,使素数枚举更快,并重新对写算法采用滚你自己的迭代器提高效率的枚举类:

The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself, in which case all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations, with all other optimizations including writing the IEnumerable interface implementations rather than using the built-in iterators reducing this to about six seconds for all the 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), again with most of the time spent just enumerating the results. The following code has many of those modifications including using background tasks from the Dotnet Framework 4 ThreadPool used by Tasks, using a state machine implemented as a look up table to implement further bit packing to make the primes enumeration quicker, and re-writing the algorithm as an enumerable class using "roll-your-own" iterators for increased efficiency:

class fastprimesSoE : IEnumerable<uint>, IEnumerable {
  struct procspc { public Task tsk; public uint[] buf; }
  struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; }
  static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u;
  const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes...
  const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk
  const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks)
  static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position
  static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow...
                                      17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up
  const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u;
  static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64];
  static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt);
    for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT;
      j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? bw >>= 4 : 0x31321212u) {
      var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break;
      if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) {
            var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } }
        else k -= i; var pd = p / 15;
        for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u];
               l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw];
          b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } }
      if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } }
  static fastprimesSoE() {
    for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15;
      for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15;
        var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m,
                                                                     xtr = (byte)(i / 15),
                                                                     nxt = (byte)(8 * x + WHLNDX[i % 15]) }; }
    } cullpg(0u, bpbuf, 0, bpbuf.Length);  } //init baseprimes
  class nmrtr : IEnumerator<uint>, IEnumerator, IDisposable {
    procspc[] ps = new procspc[NUMPROCS]; uint[] buf;
    Task dlycullpg(uint i, uint[] buf) {  return Task.Factory.StartNew(() => {
        for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); }
    public nmrtr() {
      for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] };
      for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf;  }
    public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
    uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0;
    public bool MoveNext() {
      if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; }
        else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; }
      do {  i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) {
          if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = ps[prc + 1];
            ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS;
            ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf);
            ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
      while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true;  }
    public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
    public void Dispose() { }
  }
  public IEnumerator<uint> GetEnumerator() { return new nmrtr(); }
  IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } }

请注意,此code是不小于最后一个版本为素数的小范围中至多一两百万更快由于设置和初始化数组的开销,但是要快于较大范围可达4个十亿加。它比这个问题的实现阿特金筛快8倍左右,但进入到20〜50倍的速度进行更大范围的多达四十亿加。有在code定义的常量设置基地缓存大小以及有多少,这些每线程(CHNKSZ),可能会略有调整性能扑杀。一些轻微的进一步优化可以尝试可能由向上减少的执行时间大素数的两个如进一步位打包作为2,3,5,7车轮的一个因素,而不是仅仅在2,3,5轮的约25%在包装(或许切割的执行时间到三分之二)和$ P $对 - 灭杀页段由轮阶乘高达17为进一步降低约20%的因子的减少,但这些都是关于什么可以做纯C#code比为C code,可采取独特的本土code优化利用的程度。

Note that this code isn't faster than the last version for small ranges of primes as in up to one or two million due to the overhead of setting up and initializing arrays but is considerably faster for larger ranges up to four billion plus. It is about 8 times faster than the question's implementation of the Atkin sieve, but goes to from 20 to 50 times as fast for larger ranges up to four billion plus. There are defined constants in the code setting the base cache size and how many of these to cull per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two such as further bit packing as for the 2,3,5,7 wheel rather than just the 2,3,5 wheel for a reduction of about 25% in packing (perhaps cutting the execution time to two thirds) and pre-culling the page segments by the wheel factorial up to the factor of 17 for a further reduction of about 20%, but these are about the extent of what can be done in pure C# code as compared to C code which can take advantage of unique native code optimizations.

在任何情况下,它是几乎不值得进一步优化,如果打算用IEnumberable接口输出的问题,希望为约三分之二的时间用来只是列举发现素数,只有约三分之一的时间是在剔除了合数花。更好的办法是写在类的方法来实现所期望的结果,如CountTo,的ElementAt,SumTo,等等,以避免枚举全部。

At any rate, it is hardly worth further optimizations if one intends to use the IEnumberable interface for output as the question desires as about two thirds of the time is used just enumerating the found primes and only about one third of the time is spent in culling the composite numbers. A better approach would be to write methods on the class to implement the desired results such as CountTo, ElementAt, SumTo, etcetera so as to avoid enumeration entirely.

但是,我没有做额外的优化作为一个额外的答案这表明额外的好处,尽管额外的复杂性,并进一步说明为什么人们不希望使用SOA,因为一个完全优化的环境状况实际上是更好的。

But I did do the extra optimizations as an additional answer which shows additional benefits in spite of additional complexity, and which further shows why one doesn't want to use the SoA because a fully optimized SoE is actually better.

 
精彩推荐
图片推荐