Compression of an integer sequence with random access

2023-09-11 04:46:40 Author: 痴骨ら


I have a sequence of n integers in a small range [0,k), and all the integers have the same frequency f (so the size of the sequence is n = f*k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer?). Random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.

I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.

Any help or pointers would be appreciated.

Thanks in advance.

PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.

Solution

If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.
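As a rough illustration of the fixed-width approach, here is a minimal Python sketch. The helper names (bits_needed, pack_fixed, get_fixed) are made up for this example and are not from any library. Each lookup touches only ceil(log2(k)) bits, so it takes constant time for a fixed k.

```python
def bits_needed(k: int) -> int:
    """Bits per value for values in [0, k): ceil(log2(k)), but at least 1."""
    return max(1, (k - 1).bit_length())


def pack_fixed(values, k):
    """Pack values into a bytearray using a fixed number of bits per value."""
    width = bits_needed(k)
    buf = bytearray((len(values) * width + 7) // 8)
    for i, v in enumerate(values):
        if not 0 <= v < k:
            raise ValueError(f"value {v} outside [0, {k})")
        base = i * width
        for b in range(width):
            if (v >> b) & 1:
                pos = base + b
                buf[pos >> 3] |= 1 << (pos & 7)
    return buf, width


def get_fixed(buf, width, i):
    """Read the i-th value back; cost depends only on width, not on len(buf)."""
    base = i * width
    v = 0
    for b in range(width):
        pos = base + b
        v |= ((buf[pos >> 3] >> (pos & 7)) & 1) << b
    return v
```

For k = 3 this spends 2 bits per integer, noticeably more than log2(3) ≈ 1.585, which is the waste the next paragraph addresses.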

If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.
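A sketch of the base-k grouping, assuming each group is stored in one 8-bit word (the word size and the helper names digits_per_word, pack_base_k and get_base_k are choices made for this example, not something from the question):

```python
def digits_per_word(k, word_bits=8):
    """Largest g such that k**g fits in word_bits bits."""
    if k < 2:
        raise ValueError("k must be at least 2")
    g = 0
    while k ** (g + 1) <= 1 << word_bits:
        g += 1
    return g


def pack_base_k(values, k, word_bits=8):
    """Combine every g consecutive values into one base-k number."""
    g = digits_per_word(k, word_bits)
    words = []
    for start in range(0, len(values), g):
        word = 0
        # Digit j within the group gets weight k**j.
        for j, v in enumerate(values[start:start + g]):
            word += v * k ** j
        words.append(word)
    return words, g


def get_base_k(words, g, k, i):
    """Random access: pick the word, then extract one base-k digit."""
    return (words[i // g] // k ** (i % g)) % k
```

With k = 3, five values fit in a byte (3^5 = 243 ≤ 256), which works out to 1.6 bits per integer, compared with 2 bits for plain fixed-width packing and log2(3) ≈ 1.585 as the lower bound.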

If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into conveniently sized chunks, each of which can be decompressed sequentially, so that the time for one access is bounded by the chunk size.
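The chunked scheme might look like the sketch below. It uses zlib purely as a stand-in for whatever variable-rate coder you would actually pick, assumes the values fit in a byte, and keeps the chunks in a Python list that serves as the index (in a file you would store byte offsets instead). The function names are invented for this example.

```python
import zlib


def compress_chunked(values, chunk_size=1024):
    """Compress each chunk independently; values must be in range(256)."""
    return [
        zlib.compress(bytes(values[start:start + chunk_size]), 9)
        for start in range(0, len(values), chunk_size)
    ]


def get_chunked(chunks, chunk_size, i):
    """Decompress only the chunk that holds index i, then index into it."""
    block = zlib.decompress(chunks[i // chunk_size])
    return block[i % chunk_size]
```

Larger chunks usually compress better but make each lookup slower, which is the trade-off the question says it is willing to make.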

If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.

The entropy of n independent, uniformly random numbers in the range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.

The entropy of a uniformly chosen distinguishable permutation of f copies of each of k elements (where n = f*k) is:

log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)

Applying Stirling's approximation (which is good here only if n and f are large), yields:

~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)

What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.
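To get a feel for the numbers, the exact count log2(n!/(f!)^k) can be evaluated with math.lgamma and compared with n log2(k). This is only a numerical check of the derivation above; the function names are made up for the example.

```python
import math


def log2_factorial(m):
    """log2(m!) computed via the log-gamma function."""
    return math.lgamma(m + 1) / math.log(2)


def exact_bits(f, k):
    """log2( n! / (f!)^k ) with n = f*k: bits needed when exact frequencies are known."""
    n = f * k
    return log2_factorial(n) - k * log2_factorial(f)


def naive_bits(f, k):
    """n * log2(k): bits needed when the exact frequencies are ignored."""
    return f * k * math.log2(k)


if __name__ == "__main__":
    for f, k in [(1000, 4), (10, 4), (1, 4)]:
        saved = naive_bits(f, k) - exact_bits(f, k)
        print(f"f={f} k={k}: saving per number = {saved / (f * k):.4f} bits")
```

For f = 1000 and k = 4 the saving is well under 1% of the total, while for very small f it becomes a noticeable fraction of a bit per number.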

The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.
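For reference, the error term can be made explicit by keeping the next term of Stirling's formula. This is the standard expansion, written in the same symbols as above, and like the approximation itself it is only meaningful when f is not too small:

```latex
\begin{align*}
\log_2(m!) &= m\log_2 m - m\log_2 e + \tfrac{1}{2}\log_2(2\pi m) + O(1/m) \\
\log_2\!\frac{n!}{(f!)^k} &= n\log_2 k + \tfrac{1}{2}\log_2(2\pi n) - \tfrac{k}{2}\log_2(2\pi f) + O(1/n + k/f) \\
\text{saving} &= n\log_2 k - \log_2\!\frac{n!}{(f!)^k} \approx \tfrac{k}{2}\log_2(2\pi f) - \tfrac{1}{2}\log_2(2\pi n) \\
\text{saving per number} &\approx \frac{\log_2(2\pi f)}{2f} - \frac{\log_2(2\pi n)}{2n}
\end{align*}
```

The per-number saving shrinks roughly like log2(f)/f, so it only matters when each distinct value has a small number of copies.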

 