Is there a standard technique for packing binary data into a UTF-16 string?

2023-09-04 00:11:46 Author: 孤枕

(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now, I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more-or-less a single character.

The two "obvious" answers don't meet all the criteria:

string base64 = System.Convert.ToBase64String(bytes)

doesn't make very efficient use of the string since it only uses 64 characters out of roughly 60,000 available (my storage is a System.String). Going with

string utf16 = System.Text.Encoding.Unicode.GetString(bytes)

makes better use of the string, but it won't work for data that contains invalid Unicode characters (say, mismatched surrogate pairs). This MSDN article shows this exact (poor) technique.

Let's look at a simple example:

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00};
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);

In this case bytes and utf16_bytes are the same, because the original bytes were a UTF-16 string. Doing this same procedure with base64 encoding gives a 16-member base64_bytes array.
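To make the size comparison concrete, here is a minimal sketch that measures both encodings of the same four bytes. The class and method names (Base64SizeDemo, StorageBytes) are made up for illustration and are not part of the question.

```csharp
using System;
using System.Text;

public static class Base64SizeDemo
{
    // Bytes needed to hold the string's UTF-16 code units (2 per char).
    public static int StorageBytes(string s) => Encoding.Unicode.GetByteCount(s);

    public static void Main()
    {
        byte[] bytes = { 0x41, 0x00, 0x31, 0x00 };

        // Direct reinterpretation as UTF-16: 2 chars, 4 bytes.
        string utf16 = Encoding.Unicode.GetString(bytes);

        // Base64: 4 input bytes become 8 chars, i.e. 16 bytes as UTF-16.
        string base64 = Convert.ToBase64String(bytes);

        Console.WriteLine($"utf16:  {utf16.Length} chars, {StorageBytes(utf16)} bytes");
        Console.WriteLine($"base64: {base64.Length} chars, {StorageBytes(base64)} bytes");
    }
}
```

For larger inputs the ratio settles at roughly 4 base64 characters per 3 input bytes, so a System.String holding base64 spends about 8 bytes of storage on every 3 bytes of payload.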

Now, repeat the procedure with invalid UTF-16 data:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8};

You'll find that utf16_bytes does not match the original data.
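The mismatch can be reproduced with a short round-trip check. This is only a sketch: the helper name RoundTrips is invented here, and it relies on the default behaviour of Encoding.Unicode, which substitutes U+FFFD for invalid input rather than throwing.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf16RoundTripDemo
{
    // True if bytes -> string -> bytes gives back exactly the original bytes.
    public static bool RoundTrips(byte[] bytes)
    {
        string s = Encoding.Unicode.GetString(bytes);
        return Encoding.Unicode.GetBytes(s).SequenceEqual(bytes);
    }

    public static void Main()
    {
        byte[] valid = { 0x41, 0x00, 0x31, 0x00 };   // "A1"
        byte[] invalid = { 0x41, 0x00, 0x00, 0xD8 }; // 'A' followed by a lone high surrogate U+D800

        Console.WriteLine(RoundTrips(valid));   // True
        Console.WriteLine(RoundTrips(invalid)); // False: the lone surrogate is replaced with U+FFFD
    }
}
```

Constructing the encoding as new UnicodeEncoding(false, false, true) makes the decoder throw instead of substituting, which is where the DecoderFallbackException mentioned below comes from.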

I've written code that uses U+FFFD as an escape before invalid Unicode characters; it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. Not to mention, I don't like catching the DecoderFallbackException as the way of detecting invalid characters.

I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow Shawn Steele's advice and pass around byte[].

I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one who came close to suggesting a "standard technique".

Edit: base16k looks even better. Jim Beveridge has an implementation.

Recommended answer

I stumbled onto Base16k after reading your question. Not strictly a standard but it seems to work well and was easy enough to implement in C#.
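For readers who want to see the general idea, below is a simplified sketch of packing 14 bits into each character. It is not the Base16k format itself and is not compatible with any existing implementation. The class name Pack14, the 0x4E00 offset (chosen so every character lands in the CJK Unified Ideographs block, well away from the surrogate range) and the two-character length prefix are all assumptions made for this illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Pack14
{
    private const int Offset = 0x4E00;            // assumed base code point; 0x4E00..0x8DFF are all valid BMP chars
    private const int Mask = 0x3FFF;              // 14 payload bits per char
    private const int MaxLength = (1 << 28) - 1;  // length prefix holds 2 x 14 bits

    public static string Encode(byte[] data)
    {
        if (data.Length > MaxLength) throw new ArgumentException("Input too large.", nameof(data));
        var sb = new StringBuilder(2 + (int)((data.Length * 8L + 13) / 14));
        // Length prefix, so the decoder knows how many padding bits to discard.
        sb.Append((char)(Offset + (data.Length >> 14)));
        sb.Append((char)(Offset + (data.Length & Mask)));
        int acc = 0, bits = 0;
        foreach (byte b in data)
        {
            acc = (acc << 8) | b;
            bits += 8;
            if (bits >= 14)
            {
                bits -= 14;
                sb.Append((char)(Offset + ((acc >> bits) & Mask)));
                acc &= (1 << bits) - 1;
            }
        }
        if (bits > 0) // flush the remaining bits, zero-padded on the right
            sb.Append((char)(Offset + ((acc << (14 - bits)) & Mask)));
        return sb.ToString();
    }

    public static byte[] Decode(string packed)
    {
        if (packed.Length < 2) throw new FormatException("Missing length prefix.");
        int length = (Value(packed[0]) << 14) | Value(packed[1]);
        var result = new byte[length];
        int acc = 0, bits = 0, pos = 0;
        for (int i = 2; i < packed.Length && pos < length; i++)
        {
            acc = (acc << 14) | Value(packed[i]);
            bits += 14;
            while (bits >= 8 && pos < length)
            {
                bits -= 8;
                result[pos++] = (byte)(acc >> bits);
                acc &= (1 << bits) - 1;
            }
        }
        if (pos != length) throw new FormatException("Packed data is truncated.");
        return result;
    }

    private static int Value(char c)
    {
        int v = c - Offset;
        if (v < 0 || v > Mask) throw new FormatException("Character outside the packing range.");
        return v;
    }

    public static void Main()
    {
        byte[] bytes = { 0x41, 0x00, 0x00, 0xD8 }; // the invalid UTF-16 example from the question
        string packed = Encode(bytes);
        Console.WriteLine(Decode(packed).SequenceEqual(bytes)); // True
    }
}
```

With 14 bits per character, packed.Length comes out at roughly bytes.Length × 8/14 (about 0.57 × bytes.Length) plus the two prefix characters, which is close to the bytes.Length/2 target in the question while keeping every character a valid, non-surrogate code point.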