What algorithm can I use to identify the content on a web page?

2023-09-11 03:20:17 Author: 後世續前緣

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.

Recommended answer

This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
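
For a rough feel of the problem before reaching for VIPS, here is a minimal sketch of a much simpler heuristic. It is not VIPS and uses no visual layout information: it only ranks container elements by how much non-link text sits directly inside them, on the assumption that menus and footers are mostly links. The tag list, the `BlockScorer` class, and the `rank_blocks` function are all names made up for this illustration, and it parses an HTML string with Python's standard library rather than a live browser DOM.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate "blocks"; the list is illustrative.
CONTAINER_TAGS = {"body", "div", "article", "section", "main",
                  "nav", "header", "footer", "aside", "td"}
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Credits each run of text to its nearest open container element."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict per container seen, in document order
        self.stack = []       # indices into self.blocks for currently open containers
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CONTAINER_TAGS:
            ident = dict(attrs).get("id")
            label = f"{tag}#{ident}" if ident else tag
            self.blocks.append({"label": label, "tag": tag, "text": 0, "link": 0})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CONTAINER_TAGS:
            # Close the nearest open container with this tag (tolerates sloppy HTML).
            for i in range(len(self.stack) - 1, -1, -1):
                if self.blocks[self.stack[i]]["tag"] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.skip_depth or not self.stack:
            return
        block = self.blocks[self.stack[-1]]
        block["text"] += n
        if self.link_depth:
            block["link"] += n


def rank_blocks(html):
    """Return (label, score) pairs sorted by non-link text length, highest first."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    scored = [(b["label"], b["text"] - b["link"])
              for b in parser.blocks if b["text"] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_blocks(page_source)` gives the sorted list of candidate blocks the question asks for. Because the score ignores position and size on screen, it can be fooled by pages with long link-free sidebars; that is exactly the kind of case where a vision-based segmentation such as VIPS is likely to do better.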