I need to download about 2 million files from the SEC website. Each file has a unique URL and is about 10 KB on average. This is my current implementation:
List<string> urls = new List<string>();
// ... initialize urls ...
WebBrowser browser = new WebBrowser();
foreach (string url in urls)
{
browser.Navigate(url);
while (browser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
StreamReader sr = new StreamReader(browser.DocumentStream);
StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)); // file named after the last URL segment
sw.Write(sr.ReadToEnd());
sr.Close();
sw.Close();
}
The projected time is about 12 days... Is there a faster way?
Edit: BTW, the local file handling takes only 7% of the time.
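For scale: even before parallelizing, dropping the `WebBrowser` control (which renders every page) in favor of a direct HTTP download removes most of the per-file overhead. A minimal sequential sketch with `WebClient`, assuming `urls` is populated elsewhere and files are saved to the current directory:

```csharp
using System.Collections.Generic;
using System.Net;

class SequentialDownloader
{
    static void Main()
    {
        // Assumed: populated elsewhere, as in the question.
        List<string> urls = new List<string>();

        using (var client = new WebClient())
        {
            foreach (string url in urls)
            {
                // Save under the last segment of the URL.
                string fileName = url.Substring(url.LastIndexOf('/') + 1);
                client.DownloadFile(url, fileName);
            }
        }
    }
}
```

This avoids the page-rendering cost, but the bulk of the speedup comes from running the downloads concurrently, as the solution below shows.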
Edit: this is my final implementation:
void Main(void)
{
ServicePointManager.DefaultConnectionLimit = 10000;
List<string> urls = new List<string>();
// ... initialize urls ...
int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
}
public int downloadFile(string url)
{
int retries = 0;
retry:
try
{
HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
webrequest.Timeout = 10000;
webrequest.ReadWriteTimeout = 10000;
webrequest.Proxy = null;
webrequest.KeepAlive = false;
HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
using (Stream sr = webresponse.GetResponseStream())
using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/') + 1)))
{
sr.CopyTo(sw);
}
}
catch (Exception ee)
{
if (ee.Message != "The remote server returned an error: (404) Not Found." && ee.Message != "The remote server returned an error: (403) Forbidden.")
{
if (ee.Message.StartsWith("The operation has timed out") || ee.Message == "Unable to connect to the remote server" || ee.Message.StartsWith("The request was aborted: ") || ee.Message.StartsWith("Unable to read data from the transport connection: ") || ee.Message == "The remote server returned an error: (408) Request Timeout.") retries++;
else MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
goto retry;
}
}
return retries;
}
Solution
Execute the downloads concurrently instead of sequentially, and set a sensible MaxDegreeOfParallelism; otherwise you will try to make too many simultaneous requests, which will look like a DoS attack:
public static void Main(string[] args)
{
var urls = new List<string>();
Parallel.ForEach(
urls,
new ParallelOptions{MaxDegreeOfParallelism = 10},
DownloadFile);
}
public static void DownloadFile(string url)
{
using (var sr = new StreamReader(WebRequest.Create(url).GetResponse().GetResponseStream()))
using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)))
{
sw.Write(sr.ReadToEnd());
}
}
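On newer frameworks the same throttled-concurrency pattern can be sketched with `HttpClient` and a `SemaphoreSlim` in place of `Parallel.ForEach`. This is a sketch, assuming .NET 4.5 or later and that `urls` is populated elsewhere; the limit of 10 mirrors the `MaxDegreeOfParallelism` above:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledDownloader
{
    // One shared HttpClient reuses connections across all requests.
    static readonly HttpClient Client = new HttpClient();

    static async Task DownloadAllAsync(IEnumerable<string> urls, int maxConcurrency)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await throttle.WaitAsync(); // cap in-flight requests
                try
                {
                    byte[] data = await Client.GetByteArrayAsync(url);
                    string fileName = url.Substring(url.LastIndexOf('/') + 1);
                    File.WriteAllBytes(fileName, data);
                }
                finally
                {
                    throttle.Release();
                }
            });
            await Task.WhenAll(tasks);
        }
    }

    static void Main()
    {
        var urls = new List<string>(); // assumed: populated elsewhere
        DownloadAllAsync(urls, 10).Wait();
    }
}
```

Unlike `Parallel.ForEach`, this does not block a thread-pool thread per download; the semaphore throttles only the number of in-flight requests, which suits I/O-bound work like this.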