C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

2023-09-11 04:06:57 Author: 孤独与你

I've seen some very clever code out there for converting between Unicode codepoints and UTF-8 so I was wondering if anybody has (or would enjoy devising) this.

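Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string? Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8. Full UTF-16 with surrogates must be supported.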

Specifically I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.

The best UTF-8 to codepoint code I've seen uses vectorizing techniques so I wonder if that's also possible here.
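For what it's worth, a word-at-a-time (SWAR) count does look possible, because only two facts per byte matter: whether it is a continuation byte (`10xxxxxx`) and whether it is a 4-byte lead (`11110xxx`). Below is a rough sketch, not a tuned implementation. The function name `utf16_bytes_swar` and the explicit `len` parameter are my own assumptions (a known length avoids reading past the terminator), and the input is assumed to be validated UTF-8 as described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)

/* x has at most bit 7 set in each byte; return how many bytes are flagged. */
static size_t count_flags(uint64_t x)
{
    /* Move each flag to bit 0 of its byte, then sum all bytes into the top byte. */
    return (size_t)(((x >> 7) * UINT64_C(0x0101010101010101)) >> 56);
}

/* Hypothetical helper: bytes of UTF-16 needed for `len` bytes of validated
 * UTF-8, not counting any terminator. Processes 8 bytes per iteration. */
size_t utf16_bytes_swar(const char *s, size_t len)
{
    size_t units = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);  /* safe unaligned load */

        /* Continuation byte: bit 7 set, bit 6 clear. */
        uint64_t cont = w & ~(w << 1) & HIGH_BITS;
        /* 4-byte lead: bits 7, 6, 5 and 4 all set. */
        uint64_t four = w & (w << 1) & (w << 2) & (w << 3) & HIGH_BITS;

        /* One unit per non-continuation byte, one extra per 4-byte lead. */
        units += 8 - count_flags(cont) + count_flags(four);
    }

    /* Leftover bytes, one at a time. */
    for (; i < len; ++i) {
        unsigned char b = (unsigned char)s[i];
        units += (b & 0xC0) != 0x80;
        units += b >= 0xF0;
    }
    return units * 2;
}
```

The same two masks should carry over to SSE/NEON compares, but I have not measured whether that beats the plain byte loop on short strings.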

Recommended answer

Efficiency is always a speed vs size tradeoff. If speed is favored over size then the most efficient way is just to guess based on the length of the source string.

There are 4 cases to consider; simply take the worst case as the final buffer size:

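- U+0000-U+007F: encoded as 1 byte in UTF-8 and 2 bytes per character in UTF-16. (1:2 = x2)
- U+0080-U+07FF: encoded as a 2-byte UTF-8 sequence, or 2 bytes per character in UTF-16. (2:2 = x1)
- U+0800-U+FFFF: stored as a 3-byte UTF-8 sequence, but still fits in a single UTF-16 code unit. (3:2 = x0.67)
- U+10000-U+10FFFF: stored as a 4-byte UTF-8 sequence, or a UTF-16 surrogate pair. (4:4 = x1)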
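If an exact size is wanted instead of a guess, the four cases above suggest a shortcut that never decodes a codepoint: every non-continuation byte starts a character that needs at least one UTF-16 code unit, and a lead byte of 0xF0 or above (a 4-byte sequence) needs one more for the surrogate pair. Here is a minimal sketch, assuming the input is validated, null-terminated UTF-8; the name `utf16_bytes_exact` is just for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: exact number of bytes the UTF-16 encoding of a
 * validated, null-terminated UTF-8 string needs, excluding any terminator. */
size_t utf16_bytes_exact(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t units = 0;

    for (; *p != 0; ++p) {
        /* Lead bytes and ASCII are anything that is not 10xxxxxx. */
        if ((*p & 0xC0) != 0x80)
            units++;
        /* A 4-byte lead (0xF0..0xF4 in valid input) maps to a surrogate pair. */
        if (*p >= 0xF0)
            units++;
    }
    return units * 2;  /* two bytes per UTF-16 code unit */
}
```

For example, "A" gives 2, a 3-byte sequence such as U+20AC gives 2, and a 4-byte sequence such as U+1F600 gives 4.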

The worst-case expansion factor occurs when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, bytewise, merely has to be twice as large as the source string (plus two bytes if the UTF-16 output should be null-terminated too). Every other Unicode codepoint results in an equal or smaller bytewise allocation when encoded as UTF-16 than as UTF-8.
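In code, the guess amounts to a single `strlen`. The sketch below is only an illustration of that bound; the helper names are made up, and the extra `sizeof(uint16_t)` is an assumption that the converted string should carry its own terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: upper bound, in bytes, on the UTF-16 encoding of a
 * null-terminated UTF-8 string (terminator not included). */
size_t utf16_worst_case_bytes(const char *utf8)
{
    return 2 * strlen(utf8);  /* all-ASCII input is the worst case */
}

/* Hypothetical helper: allocate a buffer big enough for any conversion
 * result plus a UTF-16 terminator. The caller frees it. */
uint16_t *utf16_alloc_worst_case(const char *utf8)
{
    return malloc(utf16_worst_case_bytes(utf8) + sizeof(uint16_t));
}
```

The trade-off is the one described above: no scan of the contents beyond finding the length, at the cost of over-allocating by up to 3x for text made mostly of 3-byte sequences.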