In simple terms, how is compression typically implemented?

2023-09-11 22:47:52 Author: 你是我心中的一首歌ゞ

So I've been thinking lately about how compression might be implemented, and what I've postulated so far is that it might be using a sort of HashTable of 'byte signature' keys with memory location values where that 'byte signature' should be replaced upon expansion of the compressed item in question.

Is this far from the truth?

How is compression typically implemented? No need for a page worth of answer, just in simple terms is fine.

Recommended answer

Compression algorithms try to find repeated subsequences and replace them with a shorter representation.

Let's take the 25-byte string "Blah blah blah blah blah!" (200 bit), from An Explanation of the Deflate Algorithm, as an example.

A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need codes of length ceil(log2(7)) = 3 bit. Our code words could then look like this:

000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used

Now we can encode our string as follows:

000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 110
B   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   !

This needs just 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%).
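The fixed-length scheme above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and it assumes each dictionary entry is stored as one 8-bit character, as in the arithmetic above.

```python
from math import ceil, log2

text = "Blah blah blah blah blah!"

# Distinct characters in order of first appearance: B, l, a, h, space, b, !
alphabet = list(dict.fromkeys(text))

# Every code word has the same width: ceil(log2(7)) = 3 bit.
width = ceil(log2(len(alphabet)))
codes = {ch: format(i, f"0{width}b") for i, ch in enumerate(alphabet)}

encoded = "".join(codes[ch] for ch in text)

# Assumption: the dictionary costs one byte per stored character.
dictionary_bits = 8 * len(alphabet)

print(len(encoded), dictionary_bits, len(encoded) + dictionary_bits)  # 75 56 131
```

Numbering the characters by first appearance happens to reproduce the code table shown above (000 for "B" up to 110 for "!").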

Or for sequences:

00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used

The encoded word:

01 00    00    00    00    10
B  lah b lah b lah b lah b lah!

Now we need just 6·2 bit = 12 bit for the encoded word, 10·8 bit = 80 bit for the dictionary entries, and 3·8 bit = 24 bit for the length of each entry, thus 116 bit (58.0%).
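As a rough sketch of the sequence-based variant, the snippet below encodes the string with the hand-picked table from above, using a greedy longest-match parse. The parse strategy and the helper names are assumptions for illustration; real compressors discover their phrases automatically.

```python
text = "Blah blah blah blah blah!"

# Hand-picked dictionary, copied from the table above.
table = {"00": "lah b", "01": "B", "10": "lah!"}
by_phrase = {phrase: code for code, phrase in table.items()}

def encode(s):
    """Greedy parse: at each position take the longest matching phrase."""
    out, i = [], 0
    while i < len(s):
        match = max((p for p in by_phrase if s.startswith(p, i)), key=len)
        out.append(by_phrase[match])
        i += len(match)
    return out

def decode(code_words):
    return "".join(table[c] for c in code_words)

code_words = encode(text)
payload_bits = sum(len(c) for c in code_words)   # 6 code words of 2 bit each
phrase_bits = 8 * sum(len(p) for p in by_phrase) # 10 characters stored
length_bits = 8 * len(by_phrase)                 # one length byte per phrase

print(code_words)
print(payload_bits + phrase_bits + length_bits)  # 116
```

Note that `encode` raises a `ValueError` if no phrase matches at some position, which is acceptable for a sketch tied to this one string.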

A Huffman code encodes more frequent characters/substrings with shorter codes than less frequent ones:

5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"

// or for sequences

4 × "lah b"
1 × "B", "lah!"
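These counts are easy to reproduce. The sketch below uses `collections.Counter` for the characters and a regular expression for the sequences; the regex simply hard-codes the three phrases chosen in this answer, so it is an assumption rather than a general phrase finder.

```python
import re
from collections import Counter

text = "Blah blah blah blah blah!"

# Per-character frequencies.
char_counts = Counter(text)

# Per-sequence frequencies, splitting into the three phrases used above.
phrases = re.findall(r"lah b|lah!|B", text)
phrase_counts = Counter(phrases)

print(char_counts.most_common())
print(phrase_counts.most_common())
```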

A possible Huffman code for that is:

0      → "l"
10     → "a"
110    → "h"
1110   → " "
11110  → "b"
111110 → "B"
111111 → "!"

Or for sequences:

0  → "lah b"
10 → "B"
11 → "lah!"
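For comparison, here is a minimal sketch of Huffman's algorithm itself, built on `heapq`. The function name `huffman_code` is made up for this example. Run on the character counts, it should give a total of 67 bit, somewhat shorter than the 78 bit of the hand-written table above (which is still a valid prefix code); for the sequence counts it gives the same 8 bit. The exact bit patterns depend on how ties are broken, so only the totals are compared.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bitstring} built with Huffman's algorithm."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

def total_bits(freqs, code):
    return sum(freqs[s] * len(code[s]) for s in freqs)

char_freqs = Counter("Blah blah blah blah blah!")
seq_freqs = {"lah b": 4, "B": 1, "lah!": 1}

char_total = total_bits(char_freqs, huffman_code(char_freqs))
seq_total = total_bits(seq_freqs, huffman_code(seq_freqs))
print(char_total, seq_total)  # 67 8
```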

Now our Blah blah blah blah blah! can be encoded to:

111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B      l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   !

Or for sequences:

10 0     0     0     0     11
B  lah b lah b lah b lah b lah!
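Because no code word is a prefix of another, the bit stream can be decoded without separators. The following sketch encodes and decodes with the per-character table given above; the function names are illustrative.

```python
# Per-character code table copied from the answer above.
code = {
    "l": "0", "a": "10", "h": "110", " ": "1110",
    "b": "11110", "B": "111110", "!": "111111",
}
inverse = {bits: ch for ch, bits in code.items()}

def encode(text):
    return "".join(code[ch] for ch in text)

def decode(bits):
    """Read bit by bit; emit a character as soon as the buffer is a code word."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

text = "Blah blah blah blah blah!"
bits = encode(text)
print(len(bits))             # 78
print(decode(bits) == text)  # True
```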

Now our first code needs just 78 bit and our second just 8 bit, instead of the 25·8 = 200 bit of our initial string. But we still need to add the dictionary where our characters/sequences are stored. For our per-character example we would need 7 additional bytes (7·8 bit = 56 bit), and our per-sequence example would again need 10 bytes for the sequences plus 3 bytes for the length of each sequence (80 bit + 24 bit = 104 bit). That would result in:

 56 + 78 = 134 bit (67.0%)
104 +  8 = 112 bit (56.0%)

The actual numbers may not be correct. Please feel free to edit/correct them.