How to extract article text content from an HTML page, like Pocket (Read It Later) or Readability?
Tags: readability, article text, Pocket, page

2023-09-02 01:56:50 Author: 心痛为谁、不解释

I am looking for an open-source framework or algorithm that extracts the article text from any HTML page by cleaning up the HTML and removing the clutter, similar to what Pocket (aka Read It Later) does.

Pocket official website: http://getpocket.com/

This question has already been asked under the title "How to extract text contents from html like Read it later or InstaPaper Iphone app?", but my requirement is a bit different. I want to clean the HTML and extract the main content together with its images, preserving the fonts and styles (CSS).

Recommended answer

I would suggest NReadability, together with HtmlAgilityPack.

After NReadability has transcoded the page, the main text is always in the div with id readInner.

// Replace this with any URL.
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

// Download the page and transcode it into NReadability's cleaned-up HTML.
var transcoder = new NReadability.NReadabilityWebTranscoder();
bool mainContentExtracted;
string page = transcoder.Transcode(url, out mainContentExtracted);

if (mainContentExtracted)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

    // SelectSingleNode returns null when nothing matches, so guard each lookup.
    var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    var imgUrl = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']")?.GetAttributeValue("content", "");
    var mainText = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']")?.InnerText;
}
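
The snippet above reads InnerText, which drops all markup, including the images the question asks to keep. Here is a minimal sketch of one alternative: read InnerHtml from the same div and collect the image URLs. It assumes the transcoded page still carries its img tags inside the readInner div. The class and method names (ReadInnerExtractor, ExtractMainHtml, ExtractImageUrls) are hypothetical and exist only for illustration; the HtmlAgilityPack calls are the library's own.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of NReadability or HtmlAgilityPack.
public static class ReadInnerExtractor
{
    // Returns the markup inside <div id="readInner">, or null if the div is missing.
    public static string ExtractMainHtml(string transcodedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(transcodedHtml);

        var main = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
        return main?.InnerHtml;
    }

    // Returns the src value of every <img> inside the readInner div, in document order.
    public static List<string> ExtractImageUrls(string transcodedHtml)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(transcodedHtml);

        var main = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
        if (main == null)
        {
            return urls;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = main.SelectNodes(".//img[@src]");
        if (images == null)
        {
            return urls;
        }

        foreach (var img in images)
        {
            urls.Add(img.GetAttributeValue("src", ""));
        }
        return urls;
    }
}
```

With the page string from the answer, ReadInnerExtractor.ExtractMainHtml(page) would give the article markup and ReadInnerExtractor.ExtractImageUrls(page) the image addresses. Relative src values would still need to be resolved against the original page URL before use.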
 