Recommend an HTML readability (content extraction) library for .NET

2023-09-03 04:03:39 · Author: 青丝变白发

Background

I'm trying to read and analyze content from web pages, with focus on the main content of the page - without menus, sidebars, scripts, and other HTML clutter.

What have I tried?

I've tried NReadability, but it throws exceptions and fails in too many cases. Other than that, it is a good solution. HTML Agility Pack is not what I need here, because I do want to get rid of non-content code.

Edit: I'm looking for a library that actually sifts through content and gives me only the "relevant" text from the page (i.e. for this page, the words "review", "chat", "meta", "about", and "faq" from the top bar will not show, nor will "user contributions licensed under").

So, do you know any other stable .NET library for extracting content from websites?

Answer

I don't know if this is still relevant, but this is an interesting question I run into a lot, and I haven't seen much material on the web that covers it.

I've implemented a tool that does this myself over the span of several months. Due to contract obligations, I cannot share this tool freely. However, I'm free to share some advice about what you can do.

I can assure you that we tried every option before undertaking the task of creating a readability tool ourselves. At the moment, no such tool exists that was satisfactory for what we needed.

Great! You will need a few things:

- A tool for handling the page's HTML. I use CsQuery, which is what Jamie suggested in the answer above. It works great for selecting elements.
- A programming language (that's C# in this example; any .NET language will do!).
- A tool that lets you download the pages themselves. CsQuery can do this on its own with createFromUrl. If you want to pre-process the page and get finer-grained control over the headers, you can create your own helper class for downloading it (try playing with the user agent, looking for mobile versions, etc.); a sketch of such a helper follows below.
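For illustration, here is a minimal sketch of such a download helper, fetching the HTML with HttpClient so you control the headers and then handing the markup to CsQuery. The user-agent string and URL are placeholders, not values from the original answer.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using CsQuery;

class PageLoader
{
    // Download the raw HTML ourselves so we control the headers, then let
    // CsQuery parse it. Some sites serve a much leaner page to mobile agents.
    static async Task<CQ> LoadAsync(string url)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)");
        string html = await client.GetStringAsync(url);
        return CQ.Create(html);
    }

    static async Task Main()
    {
        CQ dom = await LoadAsync("https://example.com/some-article");
        Console.WriteLine(dom["title"].Text());
    }
}
```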

Ok, I'm all set up, what's next?

There is surprisingly little research in the field of content extraction. A piece that stands out is Boilerplate Detection using Shallow Text Features. You can also read this answer from the paper's author on StackOverflow to see how Readability works and what some of the approaches are.

Here are some more papers I enjoyed:

- Extracting Article Text from the Web with Maximum Subsequence Segmentation
- Text Extraction from the Web via Text-to-Tag Ratio
- The Easy Way to Extract Text from HTML

From my experience, the following are good strategies for extracting content:

Simple heuristics: filtering <header> and <nav> tags, removing lists that contain only links, removing the entire <head> section, and giving elements a positive/negative score based on their name, then removing the ones with the lowest score (for example, a div with a class name containing "navigation" might get a lower score). This is how Readability works.

Meta-content: analyzing the density of links relative to text. This is a powerful tool on its own: you can compare the amount of link text to the overall text and work with that; blocks with lots of text and few links are usually where the content is. CsQuery lets you easily compare the amount of text in an element to the amount of text in its nested link tags.

Templating: crawl several pages on the same website and analyze the differences between them; what stays constant is usually the page layout, navigation, and ads, and you can filter based on those similarities. This template-based approach is very effective. The trick is to come up with an efficient algorithm to keep track of templates and detect the template itself.

Natural language processing: this is probably the most advanced approach here. With natural language processing tools it is relatively simple to detect paragraphs and text structure, and thus where the actual content starts and ends.

Learning: learning is a very powerful concept for this sort of task. In its most basic form this involves creating a program that "guesses" which HTML elements to remove, checks itself against a set of pre-defined results for a website, and learns which patterns are OK to remove. From my experience, this approach works best with one trained model per site.

Fixed list of selectors: surprisingly, this is extremely potent, and people tend to forget about it. If you are scraping from a specific few sites, using selectors and manually extracting the content is probably the fastest thing to do. Keep it simple if you can :)

Mix and match: a good solution usually involves more than one strategy, combining a few. We ended up with something quite complex because we use it for a complex task. In practice, content extraction is a really complicated task. Don't try to create something very general; stick to the content you need to scrape. Test a lot; unit tests and regression tests are very important for this sort of program. Always compare with, and read, the code of Readability; it's pretty simple and it'll probably get you started.

Best of luck, let me know how this goes.