How can Nutch crawl web pages whose dynamic content is loaded using AJAX? Web pages, content, dynamic, Nutch

2023-09-11 01:29:41 Author: 快乐是选择

I am using Apache Nutch 1.10 to crawl web pages and extract their content. Some of the links lead to pages whose dynamic content is loaded through AJAX calls. Nutch is unable to crawl and extract this AJAX-loaded content. How can I solve this? Is there a solution? If so, please help me with your answers.

Thanks in advance.

Recommended answer

Most web crawler libraries do not offer JavaScript rendering out of the box. You usually have to plug in another library or product that provides JS rendering, such as Selenium or PhantomJS.

Here is a tutorial using Nutch and Selenium.
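
As a rough illustration of what plugging in a JS renderer can look like on the Nutch side, the fragment below is a minimal sketch of a `nutch-site.xml` override. It assumes your Nutch build ships the `protocol-selenium` plugin and that a browser Selenium can drive is installed on the crawl machine; neither is confirmed by the answer above. The rest of the plugin list is only an illustrative assumption and should be adapted to whatever your existing `plugin.includes` value already contains.

```xml
<!-- nutch-site.xml: illustrative override, not taken from the answer above -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: protocol-selenium is swapped in for protocol-http so that
         pages are fetched through a real browser and AJAX content is rendered
         before parsing. The other plugins listed here are placeholders. -->
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
```

Fetching through a browser is considerably slower than plain HTTP fetching, so it may be worth limiting this setup to the sites that actually depend on AJAX.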