How can I read a very long string from a text file and then process it (split it into words)?
I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently, my lines are extremely long.
This is my code for reading the file:
using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
    }
}
And the code that splits a line into words:
var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();
Solution
You could read the file character by character, building up words as you go, and use yield return to make the method deferred so you don't have to read the entire file at once:
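// Required at the top of the file:
using System.Collections.Generic;
using System.IO;
using System.Text;

private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();
        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();
            // Mimics the regex \w class - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                // A non-word character ends the current word, if there is one.
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }
        // Emit the last word only if the file did not end on a separator;
        // otherwise an empty string would be returned.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}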
The code reads the file one character at a time. When it encounters a non-word character, it yield returns the word built up so far. Because of the builder.Length > 0 check, this happens only for the first non-word character after a word, so runs of separators do not produce empty strings. The code uses a StringBuilder to build each word.
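As a rough sketch of how the deferred sequence might be consumed, the example below counts lower-cased words one at a time, mirroring the ToLowerInvariant() call in the question. The class name WordCounter and the method CountWords are hypothetical names introduced for illustration. ReadWords is repeated here in condensed form and, as an assumption made for this sketch, takes a TextReader instead of a filename so that it works with a StringReader as well as a StreamReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordCounter
{
    // Condensed variant of ReadWords; takes a TextReader (an assumption made
    // here so the sketch can be fed from a StringReader or a StreamReader).
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var builder = new StringBuilder();
        int next;
        // TextReader.Read() returns -1 at the end of the input.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }

    // Counts lower-cased words as they are produced. Only the current word
    // and the dictionary of distinct words are held in memory.
    public static Dictionary<string, int> CountWords(TextReader reader)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in ReadWords(reader))
        {
            var key = word.ToLowerInvariant();
            counts.TryGetValue(key, out var current); // current is 0 if the key is absent
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

With a file, this could be called as WordCounter.CountWords(File.OpenText(_filePath)) inside a using block. Calling ToList() on the sequence, as in the question, would still work but would bring every word into memory at once.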
char.IsLetterOrDigit() behaves much like the regex word character class \w, but \w also matches underscores (among other characters, such as other connector punctuation and combining marks), which is why the code tests for '_' explicitly. If your input contains other characters you wish to treat as part of a word, you'll have to alter the if() condition.
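One possible way to make that condition adjustable, sketched here under assumptions, is to pass the word-character test in as a delegate. The class name WordSplitter, the helper IsDefaultWordChar, and the isWordChar parameter are hypothetical names, and the method reads from a TextReader rather than a filename so it can be tried on an in-memory string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordSplitter
{
    // The same test as the if() condition in the answer's code.
    public static bool IsDefaultWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    // Splits the input into words lazily; isWordChar decides which characters
    // belong to a word, everything else acts as a separator.
    public static IEnumerable<string> ReadWords(TextReader reader, Func<char, bool> isWordChar)
    {
        var builder = new StringBuilder();
        int next;
        // TextReader.Read() returns -1 at the end of the input.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (isWordChar(c))
            {
                builder.Append(c);
            }
            else if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
```

For example, passing c => WordSplitter.IsDefaultWordChar(c) || c == '\'' || c == '-' keeps "don't" and "well-known" as single words, whereas the default test splits them at the apostrophe and the hyphen.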