Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides

2023-09-11 04:17:08 Author: `〃諷欥乷瞇錑旒涙

I noticed that Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, mention an integer compression method called "group varint encoding". It is said to be much faster than the 7-bits-per-byte integer encoding (about 2X). I am very interested in this and am looking for an implementation, or any further details that could help me implement it myself.

I am not a pro and am new to this, so any help is welcome!

Recommended answer

That's referring to "variable integer encoding", where the number of bytes used to store a serialized integer is not fixed at 4. There is a good description of varint in the protocol buffer documentation.

It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.

CodedOutputStream contains the exact encoding function, WriteVarint32FallbackToArrayInline:

inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >>  7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}

The cascading ifs only add extra bytes onto the end of the target array if the magnitude of value warrants them. Each byte written is OR'ed with 0x80, which sets its highest bit, and the value is shifted down by 7 bits for the next byte. From what I can tell, the 0x7F mask marks the "last byte of encoding": OR'ing with 0x80 always sets the highest bit to 1, and the last byte then clears that bit by AND'ing with 0x7F. So, when reading varints, you read until you get a byte with a zero in the highest bit.
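To make the reading side concrete, here is a minimal sketch of a decoder for the same format. ReadVarint32 is a hypothetical helper written for this answer, not a function from the protocol buffer source; it simply keeps reading 7-bit chunks until it sees a byte whose highest bit is zero.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not taken from the protocol buffer source).
// Decodes one base-128 varint from [p, end). Returns a pointer just past the
// last byte consumed, or nullptr if the input is truncated or the varint
// runs longer than 5 bytes.
const uint8_t* ReadVarint32(const uint8_t* p, const uint8_t* end,
                            uint32_t* value) {
  uint32_t result = 0;
  for (int shift = 0; shift < 35 && p != end; shift += 7) {
    const uint8_t byte = *p++;
    // The low 7 bits carry payload, least significant chunk first.
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    // A clear high bit marks the last byte of the encoding.
    if ((byte & 0x80) == 0) {
      *value = result;
      return p;
    }
  }
  return nullptr;
}
```

For example, the two bytes 0xAC 0x02 decode to 300, because 0xAC contributes the low 7 bits (0x2C) and 0x02 contributes the next 7 bits.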

I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code is for basic VarInt encoding, which is the 7-bits-per-byte scheme that the slides compare Group VarInt against. The basic idea looks similar. Unfortunately, Group VarInt is not what protocol buffers use to store 64-bit numbers. I wouldn't be surprised if an implementation were open sourced somewhere, though.

Using the ideas from varint and the diagrams of "Group varint" from the slides, it shouldn't be too hard to cook up your own :)
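As a starting point, an encoder might look roughly like the sketch below. It is only an assumption about the layout, chosen to match the decoder quoted later in this answer: one selector byte holding four 2-bit fields (byte length minus one, first value in the lowest bits), followed by each value in 1 to 4 little-endian bytes. The names BytesNeeded and EncodeGroupVarInt are made up for illustration, and the sketch stores the values as given rather than as deltas.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of bytes (1..4) needed to hold v.
inline int BytesNeeded(uint32_t v) {
  if (v < (1u << 8)) return 1;
  if (v < (1u << 16)) return 2;
  if (v < (1u << 24)) return 3;
  return 4;
}

// Hypothetical encoder: appends one group of four values to *out.
// Assumed layout: selector byte, then each value in little-endian order
// using only as many bytes as it needs.
void EncodeGroupVarInt(const uint32_t values[4], std::vector<uint8_t>* out) {
  const size_t selector_pos = out->size();
  out->push_back(0);  // placeholder, filled in once all lengths are known
  uint8_t selector = 0;
  for (int i = 0; i < 4; ++i) {
    const int len = BytesNeeded(values[i]);
    // Two bits per value: (length - 1), first value in the lowest bits.
    selector |= static_cast<uint8_t>((len - 1) << (2 * i));
    for (int b = 0; b < len; ++b) {
      out->push_back(static_cast<uint8_t>(values[i] >> (8 * b)));
    }
  }
  (*out)[selector_pos] = selector;
}
```

With this layout, the four values 1, 15, 511 and 131071 take 8 bytes in total (one selector byte plus 1 + 1 + 2 + 3 payload bytes). The diagram in the slides may order the 2-bit fields differently, so treat the bit order here as a choice rather than a specification.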

Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately, they allude to publicly available implementations but don't provide references.

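#include <stdint.h>
#include <string.h>

typedef uint8_t byte;

// Loads 4 bytes from a possibly unaligned address without undefined
// behavior. Assumes a little-endian host.
static inline uint32_t Load32(const byte* p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}

// Decodes groups of four delta-encoded values. Each group starts with a
// selector byte holding four 2-bit fields (byte length - 1, first value in
// the lowest bits). `size` must cover whole groups, and the buffer needs at
// least 3 readable bytes past its end because every load reads 4 bytes.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
  const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
  const byte* limit = compressed + size;
  uint32_t current_value = 0;  // running sum: the stream stores deltas
  while (compressed != limit) {
    const uint32_t selector = *compressed++;
    const uint32_t selector1 = (selector & 3);
    current_value += Load32(compressed) & MASK[selector1];
    *uncompressed++ = current_value;
    compressed += selector1 + 1;
    const uint32_t selector2 = ((selector >> 2) & 3);
    current_value += Load32(compressed) & MASK[selector2];
    *uncompressed++ = current_value;
    compressed += selector2 + 1;
    const uint32_t selector3 = ((selector >> 4) & 3);
    current_value += Load32(compressed) & MASK[selector3];
    *uncompressed++ = current_value;
    compressed += selector3 + 1;
    const uint32_t selector4 = (selector >> 6);
    current_value += Load32(compressed) & MASK[selector4];
    *uncompressed++ = current_value;
    compressed += selector4 + 1;
  }
}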
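The decoder above trades safety for speed: it always loads 4 bytes per value and trusts that the input holds whole groups. While experimenting, a slower but bounds-checked variant can be handy for validating data. The sketch below is one possible shape for it; DecodeGroupChecked is a hypothetical name, and it assumes the same selector layout as the code above. Unlike that code, it returns the four stored values as they are and leaves any delta accumulation to the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes a single group from in[0..size).
// Returns the number of bytes consumed, or 0 if the input is truncated.
// Reads byte by byte, so it needs no padding and no particular endianness.
size_t DecodeGroupChecked(const uint8_t* in, size_t size, uint32_t out[4]) {
  if (size == 0) return 0;
  const uint8_t selector = in[0];
  size_t pos = 1;
  for (int i = 0; i < 4; ++i) {
    // Each 2-bit field stores (length - 1) for the i-th value.
    const size_t len = ((selector >> (2 * i)) & 3) + 1;
    if (size - pos < len) return 0;  // not enough bytes left for this value
    uint32_t v = 0;
    for (size_t b = 0; b < len; ++b) {
      v |= static_cast<uint32_t>(in[pos + b]) << (8 * b);
    }
    out[i] = v;
    pos += len;
  }
  return pos;
}
```

The per-byte loop gives up most of the speed advantage that makes group varint attractive, so this is better suited to tests and debugging than to a hot decoding path.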