一个字节数组的基础-N编码数组、字节、基础

2023-09-11 00:10:07 作者:笑衬心酸

这是前两天我碰到这codeReview 为基础-36编码的字节数组。然而,随后没有触及解码回一个字节数组,或者可能重用的答案来执行不同的基地(基数)的编码问题的答案。

A couple of days ago I came across this CodeReview for Base-36 encoding a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or possibly reusing the answer to perform encodings of different bases (radix).

有关链接的问题的回答使用的BigInteger。因此,尽可能实现得好,基础和数字,可以参数化。

The answer for the linked question uses BigInteger. So as far as implementation goes, the base and its digits could be parametrized.

使用的BigInteger的问题虽然是我们对待我们的输入作为一个假设的整数。然而,我们的输入,一个字节数组,只是一个不透明的值系列。

The problem with BigInteger though, is that we're treating our input as an assumed integer. However, our input, a byte array, is just an opaque series of values.

如果字节数组中一系列的零字节,如{0xFF的,地址为0x7F,0x00,0x00},这些字节将答案用算法(在丢失结束只会恩code {0xFF时,0x7F的} 如果所述最后非零字节的符号位设置,那么程序零字节被消耗,因为它是当作BigInt有氏征分隔符。所以{0xFF时,为0xFF,0x00,0x00}将连接code只能作为{0xFF的,0xFF时,0×00}。

怎么可能一个.NET程序员使用的BigInteger创造一个合理的高效和基数无关恩codeR,与解码的支持,再加上处理字节次序的能力,并有能力来变通结束零字节丢失?的

推荐答案

修改:我添加了一个HACK来恩code()来处理情况,在最后一个字节输入签署(如果你要转换到sbyte)。唯一明智的解决办法我能想到的,现在是刚刚调整()数组一个。其他单元测试这种情况下通过了,但我没有重新运行PERF code考虑到这样的情况。如果你能帮助它,总是有你的输入连接code()包括一个虚拟的0字节结束,以避免额外拨款。

edit: I've added a 'HACK' to Encode() to handle cases where the last byte in the input is signed (if you were to convert to sbyte). Only sane solution I could think of right now is to just Resize() the array by one. Additional unit tests for this case passed, but I didn't rerun perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid additional allocations.

我创建了一个RadixEncoding类(code部分中找到),它初始化与三个参数:

I've created a RadixEncoding class (found in the "Code" section) which initializes with three parameters:

在该基数的数字作为一个字符串(长度决定,当然实际的基数), 输入字节数组的假设字节顺序(端), 用户以及是否想要连接code /德code逻辑来确认结束零字节。

要创建一个基地 - 36编码,与小端输入,并给出关于结束零字节:

To create a Base-36 encoding, with little-endian input, and with respect given to ending zero bytes:

const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

然后到实际执行编码/解码:

And then to actually perform encoding/decoding:

const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);

性能

定时与Diagnostics.Stopwatch,采用了一种I7 860 @ 2.80GHz的。时序EXE跑本身,而不是下一个调试器。

Performance

Timed with Diagnostics.Stopwatch, ran on an i7 860 @2.80GHz. Timing EXE ran by itself, not under a debugger.

编码与相同的 k_base36_digits 从上面的字符串初始化,EndianFormat.Little,并用结束零字节承认(即使UTF8字节没有任何多余的结束零字节)

Encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes)

要连接code中的UTF8字节测试1234100万次需要2.6567905secs 要取消code相同的字符串的次数相同数量的需要3.3916248secs

To encode the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905secs To decode the same string the same amount of times takes 3.3916248secs

要连接code的UTF8字节MADE稍大测试1234! 10万次需要1.1577325secs 要取消code相同的字符串的次数相同数量的需要1.244326secs

To encode the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325secs To decode the same string the same amount of times takes 1.244326secs

如果您还没有一个 codeContracts发电机,你将不得不重新实现的合同,如果/扔code。

If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.

using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;

public enum EndianFormat
{
    /// <summary>Least Significant Bit order (lsb)</summary>
    /// <remarks>Right-to-Left</remarks>
    /// <see cref="BitConverter.IsLittleEndian"/>
    Little,
    /// <summary>Most Significant Bit order (msb)</summary>
    /// <remarks>Left-to-Right</remarks>
    Big,
};

/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
/// 
/// <p>Encode and Decode take a <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with our BigInteger implementation.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the 
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss in precision absolutely cannot occur pass true to <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
/// or bytes (decoding).</p>
/// <p>Note: doing this for decoding <b>may</b> add an extra byte more than what was originally 
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
    const int kByteBitCount = 8;

    readonly string kDigits;
    readonly double kBitsPerDigit;
    readonly BigInteger kRadixBig;
    readonly EndianFormat kEndian;
    readonly bool kIncludeProceedingZeros;

    /// <summary>Numerial base of this encoding</summary>
    public int Radix { get { return kDigits.Length; } }
    /// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
    public EndianFormat Endian { get { return kEndian; } }
    /// <summary>True if we want ending zero bytes to be encoded</summary>
    public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }

    public override string ToString()
    {
        return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
    }

    /// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
    /// <param name="digits">Digits to use for the radix-encoded string</param>
    /// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
    /// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
    public RadixEncoding(string digits,
        EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
    {
        Contract.Requires<ArgumentNullException>(digits != null);
        int radix = digits.Length;

        kDigits = digits;
        kBitsPerDigit = System.Math.Log(radix, 2);
        kRadixBig = new BigInteger(radix);
        kEndian = bytesEndian;
        kIncludeProceedingZeros = includeProceedingZeros;
    }

    // Number of characters needed for encoding the specified number of bytes
    int EncodingCharsCount(int bytesLength)
    {
        return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
    }
    // Number of bytes needed to decoding the specified number of characters
    int DecodingBytesCount(int charsCount)
    {
        return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
    }

    /// <summary>Encode a byte array into a radix-encoded string</summary>
    /// <param name="bytes">byte array to encode</param>
    /// <returns>The bytes in encoded into a radix-encoded string</returns>
    /// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
    public string Encode(byte[] bytes)
    {
        Contract.Requires<ArgumentNullException>(bytes != null);
        Contract.Ensures(Contract.Result<string>() != null);

        // Don't really have to do this, our code will build this result (empty string),
        // but why not catch the condition before doing work?
        if (bytes.Length == 0) return string.Empty;

        // if the array ends with zeros, having the capacity set to this will help us know how much
        // 'padding' we will need to add
        int result_length = EncodingCharsCount(bytes.Length);
        // List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
        var result = new List<char>(result_length);

        // HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set, 
        // we need to pad the input with an extra 0 (ie, make it positive)
        if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
            Array.Resize(ref bytes, bytes.Length+1);

        var dividend = new BigInteger(bytes);
        // IsZero's computation is less complex than evaluating "dividend > 0"
        // which invokes BigInteger.CompareTo(BigInteger)
        while (!dividend.IsZero)
        {
            BigInteger remainder;
            dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
            int digit_index = System.Math.Abs((int)remainder);
            result.Add(kDigits[digit_index]);
        }

        if (kIncludeProceedingZeros)
            for (int x = result.Count; x < result.Capacity; x++)
                result.Add(kDigits[0]); // pad with the character that represents 'zero'

        // orientate the characters in big-endian ordering
        if (kEndian == EndianFormat.Little)
            result.Reverse();
        // If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array, 
        // so nothing wasted
        return new string(result.ToArray());
    }

    void DecodeImplPadResult(ref byte[] result, int padCount)
    {
        if (padCount > 0)
        {
            int new_length = result.Length + DecodingBytesCount(padCount);
            Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
        }
    }
    #region Decode (Little Endian)
    byte[] DecodeImpl(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = startIndex; x < chars.Length; x++)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = 0; x < chars.Length; x++, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImpl(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    #region Decode (Big Endian)
    byte[] DecodeImplReversed(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplReversedWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImplReversed(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    /// <summary>Decode a radix-encoded string into a byte array</summary>
    /// <param name="radixChars">radix string</param>
    /// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
    /// <remarks>
    /// If <paramref name="radixChars"/> is an empty string, returns a zero length array
    /// 
    /// Using <paramref name="IncludeProceedingZeros"/> has the potential to return a buffer with an
    /// additional zero byte that wasn't in the input. So a 4 byte buffer was encoded, this could end up
    /// returning a 5 byte buffer, with the extra byte being null.
    /// </remarks>
    public byte[] Decode(string radixChars)
    {
        Contract.Requires<ArgumentNullException>(radixChars != null);

        if (kEndian == EndianFormat.Big)
            return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
        else
            return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
    }
};

基本的单元测试

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

static bool ArraysCompareN<T>(T[] input, T[] output)
    where T : IEquatable<T>
{
    if (output.Length < input.Length) return false;
    for (int x = 0; x < input.Length; x++)
        if(!output[x].Equals(input[x])) return false;

    return true;
}
static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
{
    string encoded = encoding.Encode(bytes);
    byte[] decoded = encoding.Decode(encoded);

    return ArraysCompareN(bytes, decoded);
}
[TestMethod]
public void TestRadixEncoding()
{
    const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
    var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);

    byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
    byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
    byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");

    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
    Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
}