Multithreaded web scraping? (multithreading, web)

2023-09-08 09:39:03 Author: 葑 吢 絕 纞

I've been thinking about making my web scraper multithreaded, but not with ordinary threads (e.g. Thread scrape = new Thread(Function);). I'd like something like a thread pool, where there can be a very large number of threads.

My scraper works by using a for loop to scrape pages:

for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)

So how could I multithread the function that contains the loop with something like a thread pool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.

I've modified my loop into this:

int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
Parallel.For(min, max, pOptions, i =>{
    //Scraping
});

Would that work, or have I got something wrong?
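One detail worth checking: Parallel.For treats its second argument as an exclusive upper bound, while the original loop used i <= pagesMax.Value. A minimal sketch of the adjusted call (ScrapePage and the bounds are placeholders):

```csharp
using System;
using System.Threading.Tasks;

class ScraperSketch
{
    static void ScrapePage(int page)
    {
        // Placeholder for the real scraping work on one page.
        Console.WriteLine($"Scraping page {page}");
    }

    static void Main()
    {
        int min = 1, max = 10; // stand-ins for (int)pagesMin.Value / (int)pagesMax.Value

        var pOptions = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        // Parallel.For's upper bound is EXCLUSIVE, unlike the original
        // "i <= max" loop, so max + 1 is needed to cover the last page.
        Parallel.For(min, max + 1, pOptions, i => ScrapePage(i));
    }
}
```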

Recommended Answer

It's better to go with the TPL, namely Parallel.ForEach using the overload that takes a Partitioner. It manages the workload automatically.

FYI: you should understand that more threads doesn't mean faster. I'd advise you to run some tests comparing an unparameterized Parallel.ForEach against a user-defined one.
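A rough way to run that comparison is to time the same loop under different degrees of parallelism with a Stopwatch. This is only a sketch: the Thread.Sleep stands in for a page scrape, and the page count and thread counts are arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class DegreeOfParallelismTest
{
    static long TimeScrape(ParallelOptions options, int pages)
    {
        var sw = Stopwatch.StartNew();
        Parallel.For(0, pages, options, i =>
        {
            Thread.Sleep(10); // stand-in for one page's scrape (mostly I/O wait)
        });
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        int pages = 100;

        // Library-chosen degree of parallelism.
        Console.WriteLine($"default: {TimeScrape(new ParallelOptions(), pages)} ms");

        // User-defined caps, as suggested above.
        foreach (int dop in new[] { 2, 8, 32 })
        {
            var opts = new ParallelOptions { MaxDegreeOfParallelism = dop };
            Console.WriteLine($"dop={dop}: {TimeScrape(opts, pages)} ms");
        }
    }
}
```

For an I/O-bound scraper, a cap well above the CPU count often helps; for CPU-bound work it usually doesn't.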

Update

    // Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
    public void ParallelScraper(int fromInclusive, int toExclusive,
                                Action<int> scrape, int desiredThreadsCount)
    {
        // Ceiling division: split the range into desiredThreadsCount chunks.
        int chunkSize = (toExclusive - fromInclusive +
            desiredThreadsCount - 1) / desiredThreadsCount;
        ParallelOptions pOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = desiredThreadsCount
        };

        // Partitioner.Create hands each worker a contiguous [Item1, Item2) range.
        Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, chunkSize),
            pOptions,
            rng =>
            {
                for (int i = rng.Item1; i < rng.Item2; i++)
                    scrape(i);
            });
    }
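Calling it from the original code might look like this. A sketch only: the enclosing Scraper class and the thread count are assumptions, and the + 1 accounts for toExclusive being exclusive while the original loop used i <= max.

```csharp
// Assuming ParallelScraper above lives on some class, here called Scraper.
var s = new Scraper();
int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;

s.ParallelScraper(min, max + 1, page =>
{
    // Scraping work for one page goes here.
    Console.WriteLine($"Scraping page {page}");
}, desiredThreadsCount: 8);
```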

Note that you could use async, which would suit your situation better.
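For what the async route might look like, one common pattern is HttpClient with a SemaphoreSlim capping concurrency and Task.WhenAll gathering the results. This is a sketch under assumptions: the URL pattern, page range, and concurrency limit are all placeholders.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class AsyncScraperSketch
{
    static readonly HttpClient http = new HttpClient();

    static async Task ScrapeAllAsync(int fromInclusive, int toExclusive, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = Enumerable.Range(fromInclusive, toExclusive - fromInclusive)
            .Select(async page =>
            {
                await gate.WaitAsync(); // cap in-flight requests
                try
                {
                    // Hypothetical URL pattern; substitute the real site.
                    string html = await http.GetStringAsync($"https://example.com/page/{page}");
                    Console.WriteLine($"page {page}: {html.Length} chars");
                }
                finally
                {
                    gate.Release();
                }
            });

        await Task.WhenAll(tasks);
    }

    static async Task Main() => await ScrapeAllAsync(1, 11, maxConcurrency: 8);
}
```

Because scraping is I/O-bound, this avoids tying up pool threads while requests are in flight, which is why async tends to scale better here than Parallel.For.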