What is the fastest way to scrape an HTML web page in Android?

2023-09-12 21:36:26 Author: 致命的勾引°

I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id.

<table> 
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> 
</table>

Should I use:

Pattern matching?
A BufferedReader to extract the information?

Or is there a faster way to get that information?

Recommended answer

I think in this case it makes no sense to look for a fast way to extract the information, as there is virtually no performance difference between the methods already suggested in the answers when you compare it to the time it takes to download the HTML.

So assuming that by fastest you mean most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract data using XPathExpressions:

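// Parse the HTML string (it must be well-formed XML) into a DOM document.
Document doc = DocumentBuilderFactory.newInstance()
  .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

// Select the second <td> that follows the cell whose text is "Description".
XPathExpression xpath = XPathFactory.newInstance()
  .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

// Evaluate as a string to get the text of the matched cell.
String result = (String) xpath.evaluate(doc, XPathConstants.STRING);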

If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..)) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach suggested in other answers.
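As a rough sketch of the "isolate the relevant portion" idea, the following cuts the first table out of an otherwise invalid page and parses only that fragment. The class and method names (TableFragmentExtractor, isolateFirstTable, descriptionValue) are made up for illustration, and the sketch assumes the table itself is well-formed, is not nested inside another table, and contains no HTML-only entities.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class TableFragmentExtractor {

    // Returns the first "<table ...>...</table>" block, or null if none is found.
    // Assumption: tables are not nested, so the first closing tag ends the block.
    static String isolateFirstTable(String html) {
        int start = html.indexOf("<table");
        if (start < 0) {
            return null;
        }
        int end = html.indexOf("</table>", start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end + "</table>".length());
    }

    // Returns the text of the second cell after the "Description" cell,
    // or an empty string if the page has no table or no such cell.
    static String descriptionValue(String html) {
        String table = isolateFirstTable(html);
        if (table == null) {
            return "";
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new InputSource(new StringReader(table)));
            XPathExpression xpath = XPathFactory.newInstance().newXPath()
                .compile("//td[text()=\"Description\"]/following-sibling::td[2]");
            return (String) xpath.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            // The fragment was still not well-formed; the caller may fall back
            // to pattern matching at this point.
            throw new IllegalArgumentException("Table fragment could not be parsed", e);
        }
    }
}
```

Parsing only the fragment means unclosed tags elsewhere on the page (a stray <br> or <p>, for example) never reach the XML parser.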

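Note

XPath has been available since API level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.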
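For what it's worth, here is a minimal sketch of that DOM-only navigation, without any XPath classes. The names DomCellFinder, secondCellAfter and directText are hypothetical, and the input is assumed to be a well-formed fragment such as the table from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class DomCellFinder {

    // Concatenates the direct text children of a node using basic DOM calls only.
    static String directText(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Finds the <td> whose text equals label and returns the text of the
    // second <td> sibling after it, or an empty string if there is none.
    static String secondCellAfter(String xml, String label) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if (!label.equals(directText(cell))) {
                continue;
            }
            int seen = 0; // number of following <td> siblings visited so far
            for (Node n = cell.getNextSibling(); n != null; n = n.getNextSibling()) {
                // Skip whitespace text nodes and any non-td elements.
                if (n.getNodeType() == Node.ELEMENT_NODE && "td".equals(n.getNodeName())) {
                    seen++;
                    if (seen == 2) {
                        return directText(n);
                    }
                }
            }
        }
        return "";
    }
}
```

The sibling loop plays the role of following-sibling::td[2] in the XPath version: it counts only td elements and stops at the second one.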