Parallel scraping in .NET

2023-09-03 07:53:14 Author: 十八闲客

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine, and I was tasked with writing the scraper. Some of the sites run on old hardware and cannot take much punishment, while others can handle a massive number of simultaneous users.

I need to be able to say: use 5 parallel requests for site A, 2 for site B, and 1 for site C.

I know I can use threads, mutexes, semaphores, etc. to accomplish this, but it would be quite complicated. Are any of the higher-level frameworks, like TPL, async/await, or TPL Dataflow, powerful enough to do this in a simpler manner?

Recommended answer

I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

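// Requires: using System.Linq; using System.Net.Http; using System.Threading; using System.Threading.Tasks;

// Allows at most 5 downloads to run at the same time.
private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
private readonly HttpClient _client = new HttpClient();

private async Task<string> DownloadStringAsync(string url)
{
  // Wait asynchronously for a free slot.
  await _mutex.WaitAsync();
  try
  {
    return await _client.GetStringAsync(url);
  }
  finally
  {
    // Free the slot even if the request throws.
    _mutex.Release();
  }
}

IEnumerable<string> urls = ...;
var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));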
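
The question asks for a different limit per site (5 for site A, 2 for site B, 1 for site C), while the snippet above uses one semaphore for everything. One possible way to extend it, sketched here under assumptions, is to keep one SemaphoreSlim per host name. The HostThrottle class, its RunAsync and DownloadAllAsync methods, and the defaultLimit parameter are hypothetical names invented for this illustration; they are not part of the original answer or of any library.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: one semaphore per host, each with its own limit.
public sealed class HostThrottle
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates =
        new ConcurrentDictionary<string, SemaphoreSlim>(StringComparer.OrdinalIgnoreCase);
    private readonly int _defaultLimit;

    // limits maps a host name to its maximum number of parallel requests.
    // Hosts missing from the map fall back to defaultLimit (an assumption).
    public HostThrottle(IDictionary<string, int> limits, int defaultLimit = 1)
    {
        _defaultLimit = defaultLimit;
        foreach (var pair in limits)
            _gates[pair.Key] = new SemaphoreSlim(pair.Value);
    }

    // Runs the action once a slot for the given host is free.
    public async Task<T> RunAsync<T>(string host, Func<Task<T>> action)
    {
        var gate = _gates.GetOrAdd(host, _ => new SemaphoreSlim(_defaultLimit));
        await gate.WaitAsync();
        try
        {
            return await action();
        }
        finally
        {
            // Free the slot even if the action throws.
            gate.Release();
        }
    }

    // Downloads every URL, throttled by the host part of each URL.
    public Task<string[]> DownloadAllAsync(HttpClient client, IEnumerable<string> urls)
    {
        return Task.WhenAll(urls.Select(url =>
            RunAsync(new Uri(url).Host, () => client.GetStringAsync(url))));
    }
}
```

With a map such as { "a.example": 5, "b.example": 2, "c.example": 1 } (placeholder host names), a single DownloadAllAsync call can mix URLs from all sites while each host stays within its own limit.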

Alternatively, you can use TPL Dataflow and set MaxDegreeOfParallelism to do the throttling.
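
The TPL Dataflow option could look roughly like the sketch below: a TransformBlock whose ExecutionDataflowBlockOptions.MaxDegreeOfParallelism caps how many items are processed at once. This assumes the System.Threading.Tasks.Dataflow NuGet package is referenced. The DataflowThrottle class and its ProcessAsync method are hypothetical names made up for this illustration, not something the original answer provides.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

// Hypothetical helper that runs 'work' over 'items' with bounded parallelism.
public static class DataflowThrottle
{
    public static async Task<List<TOut>> ProcessAsync<TIn, TOut>(
        IEnumerable<TIn> items, int maxParallel, Func<TIn, Task<TOut>> work)
    {
        var block = new TransformBlock<TIn, TOut>(
            work,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        // Queue all inputs, then signal that no more will arrive.
        foreach (var item in items)
            block.Post(item);
        block.Complete();

        // Drain the outputs until the block is finished and empty.
        var results = new List<TOut>();
        while (await block.OutputAvailableAsync())
        {
            while (block.TryReceive(out var result))
                results.Add(result);
        }

        // Surfaces any exception thrown by 'work'.
        await block.Completion;
        return results;
    }
}
```

For the per-site case, one call per site with its own limit, for example ProcessAsync(siteAUrls, 5, url => client.GetStringAsync(url)), could be combined with Task.WhenAll across sites; siteAUrls and client are placeholders here.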