Non-English characters are decoded incorrectly by HtmlCleaner on Android

2023-09-06 05:50:21 Author: 硬汉°丶走你i

I'm using HtmlCleaner to scrape an ISO-8859-1 encoded website on Android.

I've implemented this in an external JAR file that I import into my Android app.

When I run the unit tests in Eclipse, the Norwegian letters (æ, ø, å) are handled correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.

If I attach the debugger to my Android app, I can see that these letters are wrong in exactly the places where they were correct when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.

When I copy the text from the debuggers, I get these results:

Java Process (Unit Test): «Blårek», «Benny»

Android Process (In emulator): «Bl�rek», «Benny»

I would expect these Strings to be equal, but notice how the "å" is replaced by an inverted question mark on Android.

I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure whether that would have made a difference.

Here is the code I run:

HtmlCleaner htmlCleaner = new HtmlCleaner();

// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );

// navigate through some TagNodes, getting the ContentNode 
ContentNode cn = rootNode... 

// This String contains the incorrectly decoded characters on Android. 
// Good in Oracle JVM though..
String value = cn.toString().trim();

Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.

Thanks, Geir

Recommended answer

HtmlCleaner can't tell which encoding to use; you are passing only the body of the response in the InputStream, but the encoding is declared in the "content-type" header.
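To make the symptom concrete, here is a minimal sketch using only the JDK (no HtmlCleaner involved). It assumes the page sends "å" as the single ISO-8859-1 byte 0xE5, and that the Android side ends up decoding those bytes as UTF-8 (Android's default charset is UTF-8, while a desktop JVM may default to a Latin-1 compatible charset). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DecodeMismatchDemo {
    // "Blårek" as it would arrive from an ISO-8859-1 page:
    // the letter å is the single byte 0xE5.
    static final byte[] WIRE_BYTES = {'B', 'l', (byte) 0xE5, 'r', 'e', 'k'};

    // Decoding with the charset the server actually used.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decoding with the wrong charset: 0xE5 followed by 'r' is not a valid
    // UTF-8 sequence, so the decoder substitutes U+FFFD for that byte.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsLatin1(WIRE_BYTES));
        System.out.println(decodeAsUtf8(WIRE_BYTES));
    }
}
```

The second string contains the Unicode replacement character where "å" should be, which matches what the debugger shows on Android.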

You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the charset parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
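A rough sketch of the first option, parsing the charset parameter out of the content-type header, might look like the following. It uses only the JDK; the class name, the helper names, and the UTF-8 fallback are assumptions made for illustration, and the header string and byte array in main stand in for URLConnection.getContentType() and getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    /** Returns the charset named in a Content-Type header value, or the fallback. */
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    /** Reads the whole stream, decoding with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String header = "text/html; charset=ISO-8859-1";
        InputStream body = new ByteArrayInputStream(
                new byte[] {'B', 'l', (byte) 0xE5, 'r', 'e', 'k'});

        Charset charset = fromHeader(header, StandardCharsets.UTF_8);
        System.out.println(readAll(body, charset));
        // With HtmlCleaner, charset.name() would be handed over together with
        // the stream (an assumption: check the Javadoc of your version for a
        // clean(InputStream, String) overload or a charset property).
    }
}
```

In the question's code, the header value would come from the same URLConnection that supplies the InputStream, so the connection should be kept in a variable instead of being chained straight into getInputStream().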