I need to download about 2 million files from the SEC website. Each file has a unique URL and is about 10 KB on average. This is my current implementation:
List<string> urls = new List<string>();
// ... initialize urls ...
WebBrowser browser = new WebBrowser();
foreach (string url in urls)
{
browser.Navigate(url);
while (browser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
StreamReader sr = new StreamReader(browser.DocumentStream);
StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)); // file named after the last URL segment
sw.Write(sr.ReadToEnd());
sr.Close();
sw.Close();
}
The projected time is about 12 days... Is there a faster way?
Edit: BTW, the local file handling takes only 7% of the time.
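For scale: even before parallelizing, dropping the `WebBrowser` control (which renders every page) in favor of a direct HTTP download removes most of the per-file overhead. A minimal sequential sketch with `WebClient`, assuming `urls` is populated elsewhere and files are saved to the current directory:

```csharp
using System.Collections.Generic;
using System.Net;

class SequentialDownloader
{
    static void Main()
    {
        // Assumed: populated elsewhere, as in the question.
        List<string> urls = new List<string>();

        using (var client = new WebClient())
        {
            foreach (string url in urls)
            {
                // Save under the last segment of the URL.
                string fileName = url.Substring(url.LastIndexOf('/') + 1);
                client.DownloadFile(url, fileName);
            }
        }
    }
}
```

This avoids the page-rendering cost, but the bulk of the speedup comes from running the downloads concurrently, as the solution below shows.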
Edit: this is my final implementation:
void Main(void)
{
ServicePointManager.DefaultConnectionLimit = 10000;
List<string> urls = new List<string>();
// ... initialize urls ...
int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
}
public int downloadFile(string url)
{
int retries = 0;
retry:
try
{
HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
webrequest.Timeout = 10000;
webrequest.ReadWriteTimeout = 10000;
webrequest.Proxy = null;
webrequest.KeepAlive = false;
HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
using (Stream sr = webresponse.GetResponseStream())
using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/') + 1)))
{
sr.CopyTo(sw);
}
}
catch (Exception ee)
{
if (ee.Message != "The remote server returned an error: (404) Not Found." && ee.Message != "The remote server returned an error: (403) Forbidden.")
{
if (ee.Message.StartsWith("The operation has timed out") || ee.Message == "Unable to connect to the remote server" || ee.Message.StartsWith("The request was aborted: ") || ee.Message.StartsWith("Unable to read data from the transport connection: ") || ee.Message == "The remote server returned an error: (408) Request Timeout.") retries++;
else MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
goto retry;
}
}
return retries;
}
Solution
Execute the downloads concurrently instead of sequentially, and set a sensible MaxDegreeOfParallelism; otherwise you will try to make too many simultaneous requests, which will look like a DoS attack:
public static void Main(string[] args)
{
var urls = new List<string>();
Parallel.ForEach(
urls,
new ParallelOptions{MaxDegreeOfParallelism = 10},
DownloadFile);
}
public static void DownloadFile(string url)
{
using (var sr = new StreamReader(WebRequest.Create(url).GetResponse().GetResponseStream()))
using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)))
{
sw.Write(sr.ReadToEnd());
}
}
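On newer frameworks the same throttled-concurrency pattern can be sketched with `HttpClient` and a `SemaphoreSlim` in place of `Parallel.ForEach`. This is a sketch, assuming .NET 4.5 or later and that `urls` is populated elsewhere; the limit of 10 mirrors the `MaxDegreeOfParallelism` above:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledDownloader
{
    // One shared HttpClient reuses connections across all requests.
    static readonly HttpClient Client = new HttpClient();

    static async Task DownloadAllAsync(IEnumerable<string> urls, int maxConcurrency)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await throttle.WaitAsync(); // cap in-flight requests
                try
                {
                    byte[] data = await Client.GetByteArrayAsync(url);
                    string fileName = url.Substring(url.LastIndexOf('/') + 1);
                    File.WriteAllBytes(fileName, data);
                }
                finally
                {
                    throttle.Release();
                }
            });
            await Task.WhenAll(tasks);
        }
    }

    static void Main()
    {
        var urls = new List<string>(); // assumed: populated elsewhere
        DownloadAllAsync(urls, 10).Wait();
    }
}
```

Unlike `Parallel.ForEach`, this does not block a thread-pool thread per download; the semaphore throttles only the number of in-flight requests, which suits I/O-bound work like this.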