How can I best guess the encoding when the BOM (Byte Order Mark) is missing?

2023-09-10 22:43:50 Author: 渺小的秘密

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have no problem: I can tell whether the file is UTF-8, UTF-16 BE or UTF-16 LE.
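The BOM check described above can be sketched as follows. This is a minimal, language-agnostic illustration written in Python rather than Delphi; the function name `detect_bom` and the returned encoding labels are assumptions made for the example, not part of any library mentioned in this post.

```python
import codecs

# Known BOM byte sequences paired with the encoding they indicate.
# UTF-32 is left out because the question only covers UTF-8 and UTF-16;
# note that a UTF-32 LE BOM (FF FE 00 00) would match UTF-16 LE here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The returned length lets the caller skip the BOM bytes before decoding the rest of the file.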

I wanted to assume that a file with no BOM was ANSI. But I have found that the files I am dealing with are often missing their BOM. So a missing BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or UTF-16 LE.

When the file has no BOM, what would be the best way to scan part of the file and guess the encoding as accurately as possible? I'd like to be right close to 100% of the time if the file is ANSI, and in the high 90s if it is a UTF format.
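One simple heuristic for the BOM-less case is sketched below. It is not the method of any detector named in this post, just an illustration under stated assumptions: UTF-16 text that is mostly Latin has a zero byte in every other position, valid multi-byte UTF-8 is unlikely to occur by accident in ANSI text, and everything else is treated as ANSI. The function name `guess_encoding`, the 4096-byte sample size, and the 0.4 / 0.1 thresholds are arbitrary choices for the example.

```python
import codecs


def guess_encoding(data: bytes, sample_size: int = 4096) -> str:
    """Guess 'utf-16-le', 'utf-16-be', 'utf-8' or 'ansi' for BOM-less data."""
    sample = data[:sample_size]
    if not sample:
        return "ansi"

    # Count zero bytes at even and odd offsets. Mostly-Latin UTF-16 LE puts
    # zeros at odd offsets, UTF-16 BE at even offsets. (Assumption: text
    # dominated by non-Latin scripts will not be caught by this test.)
    even_nulls = sample[0::2].count(0)
    odd_nulls = sample[1::2].count(0)
    pairs = max(len(sample) // 2, 1)
    if odd_nulls / pairs > 0.4 and even_nulls / pairs < 0.1:
        return "utf-16-le"
    if even_nulls / pairs > 0.4 and odd_nulls / pairs < 0.1:
        return "utf-16-be"

    # Pure 7-bit ASCII reads the same as ANSI or UTF-8; report it as ANSI.
    if all(b < 0x80 for b in sample):
        return "ansi"

    # Strict UTF-8 check. The incremental decoder with final=False tolerates
    # a multi-byte character cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(sample, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        return "ansi"
```

A heuristic like this will misclassify some files, which is why statistical detectors that model character frequencies per language and code page tend to do better on real-world input.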

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009, which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.

Answer:

ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi", and I was surprised to find this post listed in the #1 position after being alive for only about 45 minutes! That is fast googlebotting! It is also amazing that Stack Overflow gets into 1st place so quickly.

The 2nd entry in Google was a blog entry by Fred Eaker on character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet Charset Detector at SourceForge, written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stack Overflow, for helping me find my answer in less than an hour!


Recommended answer