蛮力,单线程因式分解因式分解、单线程、蛮力

2023-09-10 23:26:39 作者:久伴不如酒伴

向上为考虑的是以下功能,可用于(相对快)因子一个64位无符号整数到其首要因素。需要注意的是保理不probabalistic(即它是准确的)。该算法已经足够快地发现,一个数是素或者有几个非常大的因素,在几秒钟的事,在现代硬件上。

Up for consideration is the following function which can be used to (relatively quickly) factor a 64-bit unsigned integer into its prime factors. Note that the factoring is not probabalistic (i.e., it is exact). The algorithm is already fast enough to find that a number is prime or has few very large factors in a matter of several seconds, on modern hardware.

问题: 可以任何改进作出算法presented,同时保持它单线程的,以便它可以因子(任意的)非常大的无符号64位整数更快,preferably而不使用probabalistic方法(例如,米勒-Rabin)确定素性?

The question: Can any improvements be made to the algorithm presented, while keeping it single-threaded, so that it can factor (arbitrary) very large unsigned 64-bit integers faster, preferably without using a probabalistic approach (e.g., Miller-Rabin) for determining primality?

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  // Num has to be at least 2 to contain "prime" factors
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2; // Will be used to skip multiples of 3, later

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  // If all of the factors were 2s and 3s, done...
  if (workingNum==1)
    return;

  // sqrtNum is the (inclusive) upper bound of our search for factors
  ulong sqrtNum=(ulong) sqrt(double(workingNum+0.5));

  // Factor out potential factors that are greate than or equal to 5
  // The variable n represents the next potential factor to be tested
  for (ulong n=5;n<=sqrtNum;)
  {
    // Is n a factor of the current working number?
    if (workingNum%n==0)
    {
      // n is a factor, so add it to the list of factors
      factors.push_back(n);

      // Divide current working number by n, to get remaining number to factor
      workingNum/=n;

      // Check if we've found all factors
      if (workingNum==1)
        return;

      // Recalculate the new upper bound for remaining factors
      sqrtNum=(ulong) sqrt(double(workingNum+0.5));

      // Recheck if n is a factor of the new working number, 
      // in case workingNum contains multiple factors of n
      continue;
    }

    // n is not or is no longer a factor, try the next odd number 
    // that is not a multiple of 3
    n+=nextOffset;
    // Adjust nextOffset to be an offset from n to the next non-multiple of 3
    nextOffset=(nextOffset==2UL ? 4UL : 2UL);
  }

  // Current workingNum is prime, add it as a factor
  factors.push_back(workingNum);
}

感谢

编辑:我已经添加了更多的评论。一个矢量是通过引用传递的原因,是允许该载体调用之间重复使用中,避免动态分配。矢量没有在功能空的原因,是为了允许在矢量追加当前数词的因素的因素已经奇要求。

I've added even more comments. The reason that a vector is passed in by reference, is to allow for the vector to be reused in between calls and avoid dynamic allocations. The reason the vector is not emptied in the function, is to allow for the odd requirement of appending the current "num's" factors to factors already in the vector.

函数本身并不pretty的,可重构,但问题是如何使算法快。所以,请,没有关于如何使功能更pretty的,可读,或C ++十岁上下的建议。这是孩子们的游戏。提高这个算法,以便它可以找到(证明)的因素更快是困难的部分。

The function itself is not pretty and can be refactored, but the question is about how to making the algorithm faster. So, please, no suggestions about how to make the function more pretty, readable, or C++ish. That's child's play. Improving this algorithm, so that it can find (proven) factors faster is the difficult part.

更新: Potatoswatter 有一些优秀的解决方案,到目前为止,一定要检查了他的MMX解决方案的底部附近,以及

Update: Potatoswatter has some excellent solutions so far, be sure to check out his MMX solution near the bottom, as well.

推荐答案

比较这种方法为(pre-生成)筛。模是昂贵的,所以这两种方法主要做两件事情:产生的潜在因素,并进行模运算。无论哪种方案应合理地产生更少的周期一个新的候选系数比模​​需要,所以无论是程序模界。

Compare such an approach to a (pre-generated) sieve. Modulo is expensive, so both approaches essentially do two things: generate potential factors, and perform modulo operations. Either program should reasonably generate a new candidate factor in less cycles than modulo takes, so either program is modulo bound.

给定的方法筛选出的所有整数,2和3,或75%即倍数的恒定比例。四分之一的(如给定的)的数字被用作一个参数模运算符。我称之为跳跃过滤器。

The given approach filters out a constant proportion of all integers, namely the multiples of 2 and 3, or 75%. One in four (as given) numbers is used as an argument to the modulo operator. I'll call it a skip filter.

在另一方面,一个筛只使用素数作为参数的模运算符,和连续的素数之间的平均差由素数定理的管辖为1 / LN(N)。例如,电子的^ 20是不到500万元,使500多万个号码都在被总理5%的几率。如果所有的数字达到2 ^ 32后认为,5%是一个很好的经验法则。

On the other hand, a sieve uses only primes as arguments to the modulo operator, and the average difference between successive primes is governed by the prime number theorem to be 1/ln(N). For example, e^20 is just under 500 million, so numbers over 500 million have under a 5% chance of being prime. If all numbers up to 2^32 are considered, 5% is a good rule of thumb.

因此​​,筛子将花5倍的时间在 DIV 的操作,你的跳跃的过滤器。接下来要考虑的因素是其筛生产的素数,即从内存或磁盘中读取它们的速度。如果取一体的绝佳比4 DIV 速度比较快,那么筛更快。根据我的表 DIV 吞吐量在我的酷睿2最多只有一个每12个周期。这将是很难划分的问题,让我们保守预算每个主要50个周期。对于一个2.5 GHz处理器,这是20纳秒。

Therefore, a sieve will spend 5 times less time on div operations as your skip filter. The next factor to consider is the speed at which the sieve produces primes, i.e. reads them from memory or disk. If fetching one prime is faster than 4 divs, then the sieve is faster. According to my tables div throughput on my Core2 is at most one per 12 cycles. These will be hard division problems, so let's conservatively budget 50 cycles per prime. For a 2.5 GHz processor, that's 20 nanoseconds.

在20纳秒,50 MB /秒的硬盘驱动器可以读取约一个字节。最简单的解决方法是使用每个黄金4个字节,因此变频器会慢一些。但是,我们可以更聪明。如果我们要连接code的所有素数的顺序,我们只要连接code分歧。再次,预期的差为1 / LN(N)。此外,他们都均匀,从而节省额外位。他们是永远不会为零,这使得扩展不含多字节编码。因此,使用每素一个字节,差别高达512可以存储在一个字节,它获取我们高达303371455241根据,维基百科文章。

In 20 ns, a 50 MB/sec hard drive can read about one byte. The simple solution is to use 4 bytes per prime, so the drive will be slower. But, we can be more clever. If we want to encode all the primes in order, we can just encode their differences. Again, the expected difference is 1/ln(N). Also, they're all even, which saves an extra bit. And they are never zero, which makes extension to a multibyte encoding free. So using one byte per prime, differences up to 512 can be stored in one byte, which gets us up to 303371455241 according to that Wikipedia article.

因此​​,根据硬盘驱动器上,质数的存储的名单应约等于在速度核实素性。如果它可以存储在RAM(这是203 MB,因此后续运行可能会撞到磁盘高速缓存),然后问题消失完全是,因为FSB速度通常由一个因素不同于处理器速度小于前端总线宽度字节 - 也就是说,外频可以将每个周期超过一个素数。然后改进的因素是降低除法运算,即五次。证实了这一点通过以下的实验结果。

Therefore, depending on the hard drive, a stored list of primes should be about equal in speed at verifying primality. If it can be stored in RAM (it's 203 MB, so subsequent runs will probably hit the disk cache), then the problem goes away entirely, as the FSB speed typically differs from the processor speed by a factor less than the FSB width in bytes — i.e., the FSB can transfer more than one prime per cycle. Then factor of improvement is the reduction in division operations, i.e. five times. This is borne out by the experimental results below.

当然,再有就是多线程。或者素数或跳过过滤候选范围可以被分配到不同的线程,使得任一方法易并行。有没有优化,不涉及增加平行分频电路的数目,除非你以某种方式消除模。

Of course, then there is multithreading. Ranges of either primes or skip-filtered candidates can be assigned to different threads, making either approach embarrassingly parallel. There are no optimizations that don't involve increasing the number of parallel divider circuits, unless you somehow eliminate the modulo.

下面是这样一个程序。它的模板,所以你可以添加大数。

Here is such a program. It's templated so you could add bignums.

/*
 *  multibyte_sieve.cpp
 *  Generate a table of primes, and use it to factorize numbers.
 *
 *  Created by David Krauss on 10/12/10.
 *
 */

#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;

char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };

template< typename It >
unsigned decode_gap( It &stream ) {
    unsigned gap = static_cast< unsigned char >( * stream ++ );

    if ( gap ) return 2 * gap; // only this path is tested

    gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
    return gap + decode_gap( stream ); // shallow recursion
}

template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
    unsigned len = 0, bytes[4];

    gap /= 2;
    do {
        bytes[ len ++ ] = gap % encoding_base;
        gap /= encoding_base;
    } while ( gap );

    while ( -- len ) { // loop not tested
        * stream ++ = 0;
        * stream ++ = bytes[ len + 1 ];
    }
    * stream ++ = bytes[ 0 ];
}

template< size_t lim >
void generate_primes() {
    auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
    bitset< lim / 2 > &sieve = * sieve_p;

    ofstream out_f( primes_filename, ios::out | ios::binary );
    ostreambuf_iterator< char > out( out_f );

    size_t count = 0;

    size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
    for ( ; x != last; ++ x ) {
        if ( sieve[ x ] ) continue;
        size_t n = x * 2 + 1; // translate index to number
        for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    for ( ; x != lim / 2; ++ x ) {
        if ( sieve[ x ] ) continue;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    cout << prev * 2 + 1 << endl;
}

template< typename I >
void factorize( I n ) {
    ifstream in_f( primes_filename, ios::in | ios::binary );
    if ( ! in_f ) {
        cerr << "Could not open primes file.\n"
                "Please generate it with 'g' command.\n";
        return;
    }

    while ( n % 2 == 0 ) {
        n /= 2;
        cout << "2 ";
    }
    unsigned long factor = 1;

    for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
        factor += decode_gap( in );

        while ( n % factor == 0 ) {
            n /= factor;
            cout << factor << " ";
        }

        if ( n == 1 ) goto finish;
    }

    cout << n;
finish:
    cout << endl;
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) goto print_help;

    unsigned long n;

    if ( argv[1][0] == 'g' ) {
        generate_primes< (1ul<< 32) >();
    } else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit )
        factorize( n );
    } else goto print_help;

    return 0;

print_help:
    cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
            "\t" << argv[0] << " g -- generate primes file in current directory.\n";
}

这是一个2.2 GHz的性能MacBook Pro的:

Performance on a 2.2 GHz MacBook Pro:

dkrauss$ time ./multibyte_sieve g
4294967291

real    2m8.845s
user    1m15.177s
sys    0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279 

real    0m5.405s
user    0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real    0m25.147s
user    0m24.170s
sys 0m0.096s