Efficient way to read a specific line number from a file (Bonus: Python manual misprint)

2023-09-03 11:27:38 · Author: 把妹


I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before solving this issue I would like to see this line, but alas my favorite method of

import linecache
print linecache.getline(filename, linenumber)


is throwing a MemoryError. Interestingly, the manual says that "This function will never throw an exception." On this large file it throws one as soon as I try to read line number 1, even though I have about 6 GB of free RAM...


I would like to know what is the most elegant method to get to that unreachable line. Available tools are Python 2, Python 3 and C# 4 (Visual Studio 2010). Yes, I understand that I can always do something like

var line = 0;
using (var stream = new StreamReader(File.OpenRead(@"s:\source\transactions.dat")))
{
     while (++line < 219506324) stream.ReadLine(); //waste some cycles
     Console.WriteLine(stream.ReadLine());
}

It would work, but I doubt it's the most elegant way.


EDIT: I'm waiting to close this thread because the hard drive containing the file is currently being used by another process. I'm going to test both suggested methods and report timings. Thank you all for your suggestions and comments.


The results are in! I implemented Gabe's and Alex's methods to see which one was faster. If I'm doing anything wrong, do tell. I'm going for the 10 millionth line in my 100 GB file, first using the method Gabe suggested and then the method Alex suggested, which I loosely translated into C#... The only thing I added myself is reading a 300 MB file into memory beforehand, just to clear the HDD cache.

const string file = @"x:\....dat"; // 100 GB file
const string otherFile = @"x:\....dat"; // 300 MB file
const int linenumber = 10000000;

ClearHDDCache(otherFile);
GabeMethod(file, linenumber);  //Gabe's method

ClearHDDCache(otherFile);
AlexMethod(file, linenumber);  //Alex's method

// Results
// Gabe's method: 8290 (ms)
// Alex's method: 13455 (ms)
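
ClearHDDCache isn't shown in the post; the name and body below are assumptions based on the description above. A minimal sketch would simply read the unrelated 300 MB file end to end, so that its bytes, rather than the 100 GB file's, occupy the OS disk cache before each timed run:

static void ClearHDDCache(string path)
{
    // Read the whole (different) file and discard the result; its contents
    // displace whatever the disk cache was holding from the previous run.
    File.ReadAllBytes(path);
}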


The implementation of Gabe's method is as follows:

var gabe = new Stopwatch();
gabe.Start();
var data = File.ReadLines(file).ElementAt(linenumber - 1);
gabe.Stop();
Console.WriteLine("Gabe's method: {0} (ms)",  gabe.ElapsedMilliseconds);


While Alex's method is slightly trickier:

var alex = new Stopwatch();
alex.Start();
const int buffersize = 100 * 1024; //bytes
var buffer = new byte[buffersize];
var counter = 0;
using (var filestream = File.OpenRead(file))
{
    while (true) // Cutting corners here: assumes the target line exists...
    {
        var bytesRead = filestream.Read(buffer, 0, buffersize);
        //At this point we could probably launch an async read into the next chunk...
        var linesread = buffer.Take(bytesRead).Count(b => b == 10); //10 is ASCII linebreak. Count only the bytes actually read.
        if (counter + linesread >= linenumber) break;
        counter += linesread;
    }
    }
}
//The downside of this method is that we have to assume that the line fit into the buffer, or do something clever...er
var data = new ASCIIEncoding().GetString(buffer).Split('\n').ElementAt(linenumber - counter - 1);
alex.Stop();
Console.WriteLine("Alex's method: {0} (ms)", alex.ElapsedMilliseconds);


So unless Alex cares to comment, I'll mark Gabe's solution as accepted.

Accepted Answer


Here's my elegant version in C#:

Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));


or, more generally:

Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));
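
This works on a 100 GB file because File.ReadLines enumerates lazily (unlike File.ReadAllLines, which loads the entire file into memory), so ElementAt reads only as far as the requested line before stopping.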


Of course, you may want to show some context before and after the given line:

Console.Write(string.Join("\n",
              File.ReadLines(filename).Skip(linenumber - 5).Take(10)));

or, in a more fluent style:

File
.ReadLines(filename)
.Skip(linenumber - 5)
.Take(10)
.AsObservable()
.Do(Console.WriteLine);
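
Note that this variant assumes the Reactive Extensions are referenced, and that Do only attaches the side effect: nothing is printed until the resulting observable is subscribed to, e.g. with a trailing Subscribe() call.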


BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:

    fp = open(fullname, 'rU')
    lines = fp.readlines()
    fp.close()


In other words, it's trying to fit the whole 100 GB file into 6 GB of RAM! What the manual should perhaps say is: "This function will never throw an exception if it can't access the file."