Where to store web crawler data? (crawler, data, web)

2023-09-11 03:11:10 Author: 天涯离梦残月幽梦


I have a simple web crawler that starts at a root (given URL), downloads the HTML of the root page, then scans it for hyperlinks and crawls them. I currently store the HTML pages in an SQL database. I am currently facing two problems:


1 - It seems like the crawling reaches a bottleneck and isn't able to crawl any faster. I've read somewhere that making multithreaded HTTP requests for pages can make the crawler crawl faster, but I am not sure how to do this.


2 - I need an efficient data structure to store the HTML pages and be able to run data mining operations on them (currently using an SQL database; I would like to hear other recommendations).


I am using the .NET Framework, C#, and MS SQL.

Recommended answer


So first and foremost, I wouldn't worry about getting into distributed crawling and storage, because as the name suggests, it requires a decent number of machines for you to get good results. Unless you have a farm of computers, you won't really benefit from it. You can build a crawler that fetches 300 pages per second and run it on a single computer with a 150 Mbps connection.


The next thing on the list is to determine where your bottleneck is.


Try to eliminate MS SQL:

Load up a list of, say, 1000 URLs you want to crawl. Benchmark how fast you can crawl them (see the sketch after this list).


If 1000 URLs doesn't give you a large enough crawl, then get 10000 URLs or 100k URLs (or if you're feeling brave, then get the Alexa top 1 million). In any case, try to establish a baseline with as many variables excluded as possible.
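As a point of reference, a baseline measurement could look something like the sketch below (the file name urls.txt and the 10-second timeout are arbitrary assumptions). It fetches the list sequentially and deliberately skips parsing and storage, so MS SQL is out of the equation entirely.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class CrawlBaseline
{
    static async Task Main()
    {
        // Hypothetical input file: one URL per line (e.g. your 1000 seed URLs).
        string[] urls = File.ReadAllLines("urls.txt");

        using (var client = new HttpClient())
        {
            client.Timeout = TimeSpan.FromSeconds(10);

            int ok = 0, failed = 0;
            long bytes = 0;
            var watch = Stopwatch.StartNew();

            foreach (string url in urls)
            {
                try
                {
                    // Download only: no parsing, no database writes, so the
                    // measurement isolates raw fetch speed.
                    string html = await client.GetStringAsync(url);
                    bytes += html.Length;
                    ok++;
                }
                catch (Exception)
                {
                    failed++;
                }
            }

            watch.Stop();
            Console.WriteLine(
                $"{ok} ok / {failed} failed in {watch.Elapsed.TotalSeconds:F1}s " +
                $"({ok / watch.Elapsed.TotalSeconds:F1} pages/s, {bytes / 1048576.0:F1} MB)");
        }
    }
}
```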


After you have your baseline for the crawl speed, try to determine what's causing your slowdown. Furthermore, you will need to start using multithreading, because you're I/O bound and you have a lot of spare time between fetching pages that you can spend extracting links and doing other things, like working with the database.
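One common way to get that concurrency in C# is asynchronous requests with a bounded degree of parallelism. The minimal sketch below uses HttpClient for brevity rather than the raw asynchronous sockets recommended further down, and the cap of 50 in-flight requests is an arbitrary assumption to tune against your connection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ConcurrentFetcher
{
    static readonly HttpClient Client = new HttpClient();

    // Arbitrary cap on in-flight requests; raise it until bandwidth,
    // not the request count, is what limits you.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(50);

    static async Task<string> FetchAsync(string url)
    {
        await Gate.WaitAsync();
        try
        {
            return await Client.GetStringAsync(url);
        }
        catch (Exception)
        {
            return null; // failed fetches are simply dropped in this sketch
        }
        finally
        {
            Gate.Release();
        }
    }

    static async Task Main()
    {
        // Hypothetical seed list; in the real crawler this would come from
        // the frontier/queue of discovered links.
        var urls = new List<string> { "http://example.com/" };

        // All requests are started up front; the semaphore keeps only 50 in
        // flight at once, and link extraction or storage work can run while
        // other fetches are still pending.
        string[] pages = await Task.WhenAll(urls.Select(FetchAsync));

        Console.WriteLine($"{pages.Count(p => p != null)} pages fetched");
    }
}
```

The semaphore is just one way to bound concurrency; the point is to keep enough requests in flight that the network, not your code, becomes the limiting factor.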


How many pages per second are you getting now? You should try and get more than 10 pages per second.


Obviously the next step is to tweak your crawler as much as possible:

Try to speed up your crawler so it hits the hard limits, such as your bandwidth. I would recommend using asynchronous sockets, since they're MUCH faster than blocking sockets, WebRequest/HttpWebRequest, etc.

Use a faster HTML parsing library: start with HtmlAgilityPack and, if you're feeling brave, then try the Majestic12 HTML parser.

Use an embedded database rather than an SQL database, and take advantage of the key/value storage: hash the URL for the key and store the HTML and other relevant data as the value (see the sketch after this list).
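To make the last two points concrete, here is a rough sketch of link extraction with HtmlAgilityPack plus hashing the URL into a storage key. The Dictionary is only a stand-in for whichever embedded key/value store you choose, and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class PageProcessor
{
    // Stand-in for an embedded key/value store; swap in your chosen
    // embedded database (key = URL hash, value = HTML).
    static readonly Dictionary<string, string> Store = new Dictionary<string, string>();

    static string HashUrl(string url)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    static IEnumerable<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the page has no matching anchors.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) yield break;

        foreach (var a in anchors)
            yield return a.GetAttributeValue("href", "");
    }

    static void Process(string url, string html)
    {
        Store[HashUrl(url)] = html;          // hashed URL as the key
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);          // feed these back into the frontier
    }
}
```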


If you've mastered all of the above, then I would suggest you try to go pro! It's important that you have a good selection algorithm that mimics PageRank in order to balance freshness and coverage: OPIC is pretty much the latest and greatest in that respect (AKA Adaptive Online Page Importance Computation). If you have the above tools, then you should be able to implement OPIC and run a fairly fast crawler.
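For a sense of what the cash-distribution idea behind OPIC looks like in code, here is a deliberately simplified sketch. It is not the full algorithm from the paper (the virtual page for dangling links and the time-windowed history are omitted), and all names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Rough sketch of OPIC's cash idea: every page holds some cash; crawling a
// page banks its cash into a history total and splits the cash evenly among
// its outlinks; pages with the most cash are crawled first.
class OpicSketch
{
    readonly Dictionary<string, double> cash = new Dictionary<string, double>();
    readonly Dictionary<string, double> history = new Dictionary<string, double>();

    // In the paper the total cash (1.0) is split equally among the seed
    // pages; the caller passes that share in here.
    public void AddSeed(string url, double initialCash) => cash[url] = initialCash;

    // Greedy policy: crawl the page holding the most cash next.
    public string NextUrlToCrawl() =>
        cash.OrderByDescending(kv => kv.Value).First().Key;

    public void OnPageCrawled(string url, IList<string> outlinks)
    {
        double c = cash.TryGetValue(url, out var current) ? current : 0.0;
        history[url] = (history.TryGetValue(url, out var h) ? h : 0.0) + c;
        cash[url] = 0.0;

        if (outlinks.Count == 0) return;   // dangling-page handling omitted
        double share = c / outlinks.Count;
        foreach (string link in outlinks)
            cash[link] = (cash.TryGetValue(link, out var e) ? e : 0.0) + share;
    }

    // Importance estimate: cash banked so far plus cash currently held.
    public double Importance(string url) =>
        (history.TryGetValue(url, out var h) ? h : 0.0) +
        (cash.TryGetValue(url, out var c) ? c : 0.0);
}
```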


If you're flexible on the programming language and don't want to stray too far from C#, then you can try Java-based, enterprise-level crawlers such as Nutch. Nutch integrates with Hadoop and all kinds of other highly scalable solutions.