Encoding.UTF8.GetString doesn't take the preamble / BOM into account

2023-09-02 23:52:21 Author: 白骨画颜

In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

I know I can use a TextReader to digest the BOM as needed, but I thought that the GetString method should be some kind of a macro that makes our code shorter.

Am I missing something? Is this behavior intentional?

Here is code that reproduces the issue:

using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string s1 = "abc";

        // UTF8Encoding(true) makes StreamWriter emit the preamble (BOM) first.
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }

        // UTF8Encoding(false) writes the text without a preamble.
        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }

        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc

        // GetString decodes the three BOM bytes into the single character U+FEFF.
        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // U+FEFF followed by abc (the console may show ?abc)
    }

    // Formats each byte as lowercase hex, separated by ", ".
    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
}

Solution

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

It doesn't look like it "ignores" it at all: it faithfully converts it to the BOM character (U+FEFF). That's what it is, after all.

If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
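
As a rough illustration of the "do it yourself" option, the sketch below compares the start of the byte array with Encoding.UTF8.GetPreamble() and, if it matches, decodes from just past it. The class name BomAwareDecoder and the method DecodeUtf8SkippingBom are hypothetical names chosen for this example; they are not part of .NET.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the .NET libraries.
static class BomAwareDecoder
{
    // Decodes UTF-8 bytes, skipping the preamble (EF BB BF) when the data starts with it.
    public static string DecodeUtf8SkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        int offset = StartsWith(bytes, preamble) ? preamble.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Returns true when data begins with every byte of prefix, in order.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With the arrays from the question, BomAwareDecoder.DecodeUtf8SkippingBom(abcWithBom) should give a three-character "abc", and input without a BOM is decoded unchanged.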

Note that if you use Encoding.GetBytes followed by Encoding.GetString, or StreamWriter followed by StreamReader, each pairing is self-consistent: it either never produces the BOM, or produces it and then swallows it. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
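
The three pairings can be compared side by side. This is a minimal sketch, assuming UTF8Encoding(true) on the writer as in the question; the class RoundTripDemo and its method names are hypothetical and exist only for this example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class for illustration only.
static class RoundTripDemo
{
    // GetBytes never emits the preamble, so GetString has no BOM to decode.
    public static string ViaEncoding(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    // StreamWriter writes the preamble; StreamReader consumes it again.
    public static string ViaStreams(string text)
    {
        byte[] bytes = WriteWithStreamWriter(text);
        using (var sr = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return sr.ReadToEnd();
        }
    }

    // Mixed: the preamble is written but decoded as an ordinary character (U+FEFF).
    public static string Mixed(string text)
    {
        return Encoding.UTF8.GetString(WriteWithStreamWriter(text));
    }

    private static byte[] WriteWithStreamWriter(string text)
    {
        var ms = new MemoryStream();
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(text);
        }
        // MemoryStream.ToArray still works after the stream has been closed.
        return ms.ToArray();
    }
}
```

For the input "abc", ViaEncoding and ViaStreams should both return three characters, while Mixed returns four, with U+FEFF in front.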

 