How can I best guess the encoding when the BOM (Byte Order Mark) is missing?

2023-09-10 22:43:50 Author: 渺小的秘密

My program has to read files that use various encodings. They may be ANSI, UTF-8, or UTF-16 (big- or little-endian).

When the BOM (Byte Order Mark) is present, I have no problem: I can tell whether the file is UTF-8, UTF-16 BE, or UTF-16 LE.
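The BOM check described above can be sketched in a few lines. This is a minimal illustration in Python rather than Delphi, and the function name `sniff_bom` is hypothetical. It assumes only the three BOMs named in the question matter. UTF-32 is deliberately ignored, although its little-endian BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# BOMs for the three Unicode encodings named in the question.
# Assumption: UTF-32 files are not expected, so FF FE always means UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no known BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically skip `bom_length` bytes before decoding the rest of the file with the returned encoding.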

I wanted to assume that a file without a BOM was ANSI. But I have found that the files I am dealing with are often missing their BOM. So a missing BOM may mean the file is ANSI, UTF-8, UTF-16 BE, or UTF-16 LE.

When the file has no BOM, what is the best way to scan part of the file and guess the encoding as accurately as possible? I'd like to be right close to 100% of the time if the file is ANSI, and in the high 90s if it is a UTF format.
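One common heuristic for this kind of guess can be sketched as follows. It is not taken from any answer in this thread, and it is not how ChsDet or chardet work. The function name `guess_encoding`, the thresholds, the 4096-byte sample size, and the `cp1252` default for "ANSI" are all assumptions made for illustration. The idea is that UTF-16 text containing mostly Latin characters has a zero byte in every other position, and that ANSI text with accented characters is very rarely valid UTF-8.

```python
def guess_encoding(data: bytes, ansi: str = "cp1252") -> str:
    """Guess the encoding of BOM-less data: utf-16-le, utf-16-be, utf-8, or `ansi`."""
    sample = data[:4096]  # assumed sample size for the zero-byte statistics
    if not sample:
        return ansi

    # Count zero bytes at even and odd offsets within the sample.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    pairs = max(len(sample) // 2, 1)

    # Mostly-Latin UTF-16 LE stores the zero high byte second (odd offsets);
    # UTF-16 BE stores it first (even offsets). Thresholds are arbitrary.
    if odd_zeros / pairs > 0.4 and even_zeros / pairs < 0.1:
        return "utf-16-le"
    if even_zeros / pairs > 0.4 and odd_zeros / pairs < 0.1:
        return "utf-16-be"

    # Pure 7-bit ASCII decodes identically as ANSI or UTF-8; report ANSI.
    if data.isascii():
        return ansi

    # Bytes >= 0x80 are present: a strict decode tells UTF-8 from ANSI.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return ansi
```

The zero-byte test will miss UTF-16 text that is mostly non-Latin (for example CJK), so a statistical detector is still the safer choice for such files.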

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009, which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.

Answer:

ShreevatsaR's answer led me to search Google for "universal encoding detector delphi", and I was surprised to find this post listed in the #1 position after being live for only about 45 minutes! That is fast googlebotting! It is also amazing that Stack Overflow gets into first place so quickly.

The second entry in Google was a blog post by Fred Eaker on character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet Charset Detector on SourceForge, which is written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stack Overflow, for helping me find my answer in less than an hour!

Recommended answer

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that is used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.
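A script of the kind suggested here might look roughly like the sketch below. It assumes the third-party `chardet` package is installed (for example via `pip install chardet`). The helper name `detect_encoding` is hypothetical, and returning `None` when the package is missing is an illustrative choice, not part of chardet.

```python
try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None


def detect_encoding(data: bytes):
    """Return (encoding, confidence) as reported by chardet, or None if chardet is unavailable."""
    if chardet is None:
        return None
    # chardet.detect returns a dict with 'encoding' and 'confidence' keys.
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]
```

A calling program could run such a script on a file, print the detected encoding name to standard output, and read that output back. The confidence value can be used to decide when to fall back to ANSI.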