How to split a huge file into words?

2023-09-04 10:52:10 Author: 不倾世丶只倾你一人

How can I read a very long string from a text file and then process it (split it into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently, the lines in my file are extremely long. This is my code for reading the file:

using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
    }
}

And this is the code that splits a line into words:

var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();

Solution

You could read the file character by character, building up words as you go, and use yield return to make the enumeration deferred, so you never have to hold the entire file (or an entire line) in memory at once:

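using System.Collections.Generic;
using System.IO;
using System.Text;

// Lazily yields the words of a file one at a time, without loading whole lines.
private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();

            // Mimics the regex \w class - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                // A non-word character ends the current word, if there is one.
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }

        // Return the last word, unless the file ended on a non-word character.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}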

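The code reads the file character by character and collects word characters in a StringBuilder. When it encounters a non-word character, it yields the word built up so far and clears the builder. Because the builder is then empty, only the first non-word character after a word triggers a yield; any further consecutive non-word characters are skipped. At the end of the file, whatever is left in the builder is returned as the last word.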
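To illustrate how the deferred sequence might be consumed, here is a minimal sketch that counts word frequencies. The class name WordCounter and the CountWords helper are hypothetical and not part of the original answer, and ReadWords is rewritten here to take a TextReader (an assumption made so the same code works with a file or an in-memory string). Lower-casing with ToLowerInvariant() follows the word-splitting code from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordCounter
{
    // Same character-by-character technique as above, but over any TextReader,
    // so it accepts a StreamReader (file) or a StringReader (in-memory text).
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var builder = new StringBuilder();
        int next;

        // TextReader.Read() returns -1 once the input is exhausted.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }

        // Emit the last word if the input did not end on a non-word character.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }

    // Counts lower-cased words. Only the current word and the dictionary of
    // distinct words are held in memory, never the whole input.
    public static Dictionary<string, int> CountWords(TextReader reader)
    {
        var counts = new Dictionary<string, int>();

        foreach (var word in ReadWords(reader))
        {
            var key = word.ToLowerInvariant();
            counts.TryGetValue(key, out int current); // current is 0 if the key is missing
            counts[key] = current + 1;
        }

        return counts;
    }
}
```

For a real file, the reader could be opened with File.OpenText(path) inside a using block and passed to CountWords. Memory use then depends on the number of distinct words rather than on the length of any line.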

Char.IsLetterOrDigit() accepts nearly the same characters as the regex word character class \w, but \w also matches the underscore (among others), which is why the code checks for '_' explicitly. If your input contains other characters that you want to treat as part of a word, you'll have to alter the if() condition.
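One possible way to make that condition adjustable is to pass the word-character test in as a delegate. The sketch below is an illustration only: the class name CustomWordReader, the isWordChar parameter, and the choice to treat apostrophes and hyphens as word characters are all assumptions, not something the original answer prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CustomWordReader
{
    // The rule used in the answer: letters, digits and the underscore.
    public static bool IsDefaultWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    // Hypothetical extended rule: also keep apostrophes and hyphens in words.
    public static bool IsExtendedWordChar(char c)
    {
        return IsDefaultWordChar(c) || c == '\'' || c == '-';
    }

    // Reads words lazily; isWordChar decides which characters belong to a word.
    public static IEnumerable<string> ReadWords(TextReader reader, Func<char, bool> isWordChar)
    {
        var builder = new StringBuilder();
        int next;

        // TextReader.Read() returns -1 once the input is exhausted.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (isWordChar(c))
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }

        // Emit the last word if the input did not end on a non-word character.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
```

With the default rule, a contraction such as don't is split at the apostrophe; with the extended rule it stays one word. Which characters to include depends on the input, so the extended rule here is only an example.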