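Performance issue with reading integers from a binary file at specific locations in F# (tags: integers, performance, position, binary files)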

2023-09-03 21:37:12 Author: 优雅的

This morning I asked here why my Python code was (a lot) slower than my F# version, but now I'm wondering whether the F# version itself can be made faster. Any ideas on how I could create a faster version of the code below, which reads values at a sorted list of unique indexes from a binary file of 32-bit integers? Note that I tried two approaches, one based on BinaryReader and the other based on MemoryMappedFile (and some more on GitHub).

module SimpleRead =
    open System
    open System.IO
    let readValue (reader:BinaryReader) cellIndex = 
        // set stream to correct location
        reader.BaseStream.Position <- cellIndex*4L
        match reader.ReadInt32() with
        | Int32.MinValue -> None
        | v -> Some(v)

    let readValues fileName indices = 
        use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
        let values = List.map (readValue reader) (List.ofSeq indices)
        values

module MemoryMappedSimpleRead =

    open System
    open System.IO
    open System.IO.MemoryMappedFiles

    let readValue (reader:MemoryMappedViewAccessor) offset cellIndex =
        let position = (cellIndex*4L) - offset
        match reader.ReadInt32(position) with
        | Int32.MinValue -> None
        | v -> Some(v)

    let readValues fileName indices =
        use mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
        let offset = (Seq.min indices ) * 4L
        let last = (Seq.max indices) * 4L
        let length = 4L+last-offset
        use reader = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read)
        let values = (List.ofSeq indices) |> List.map (readValue reader offset)
        values
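
The offset arithmetic in the memory-mapped version (a view that starts at the smallest index and ends at the largest, with each read positioned at cellIndex*4 minus the view offset) can be sketched in Python, the other language used in this question. This is only an illustration under assumptions: it uses a single ordinary read of the byte window rather than a real memory map, it assumes little-endian 32-bit values as written by .NET's BinaryWriter, and the names read_window and SENTINEL are hypothetical.

```python
import struct

SENTINEL = -2147483648  # Int32.MinValue, the "no value" marker used in the question


def read_window(filename, indices):
    """Read int32 cells at the given indices using one contiguous read."""
    indices = list(indices)
    if not indices:
        return []
    offset = min(indices) * 4                # byte offset of the first needed cell
    length = max(indices) * 4 + 4 - offset   # bytes up to and including the last cell
    with open(filename, "rb") as f:
        f.seek(offset)
        window = f.read(length)
    values = []
    for i in indices:
        # Position inside the window: same formula as cellIndex*4 - offset
        (v,) = struct.unpack_from("<i", window, i * 4 - offset)
        values.append(None if v == SENTINEL else v)
    return values
```

Whether a single window like this helps depends on how densely the indices cover the range between the smallest and largest index; for sparse indices over a large file the window may contain mostly unneeded bytes.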

For comparison, here is my latest NumPy version:

import numpy as np

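def convert(v):
    # -2147483648 (Int32.MinValue) marks a missing value
    if v != -2147483648:
        return v
    else:
        return None

def read_values(filename, indices):
    values_arr = np.memmap(filename, dtype='int32', mode='r')
    # Build a list so the values are materialized (map is lazy in Python 3)
    return list(map(convert, values_arr[indices]))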

Update: Contrary to what I said here earlier, my Python version is still a lot slower than the F# version; an error in my Python tests made it appear otherwise. I'm leaving this question here in case someone with in-depth knowledge of BinaryReader or MemoryMappedFile knows of some improvements.

Recommended answer

I managed to make the SimpleRead version 30% faster by calling reader.BaseStream.Seek instead of setting reader.BaseStream.Position. I also replaced lists with arrays, but this didn't change much.

The full code of my simple reader is now:

open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    // set stream to correct location
    // widen to int64 before multiplying so large indices cannot overflow a 32-bit int
    reader.BaseStream.Seek(int64 cellIndex * 4L, SeekOrigin.Begin) |> ignore
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues indices fileName = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = Array.map (readValue reader) indices
    values
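
For readers comparing against the Python side of the question, the same seek-then-read pattern can be sketched with the standard library alone. This is a minimal sketch, not a benchmarked alternative: it assumes little-endian 32-bit values (as .NET's BinaryReader reads them), and the names read_value, read_values and SENTINEL are hypothetical.

```python
import struct

SENTINEL = -2147483648  # Int32.MinValue, the "no value" marker used in the question


def read_value(f, cell_index):
    """Seek to one 4-byte cell and decode it, mapping the sentinel to None."""
    f.seek(cell_index * 4)
    (v,) = struct.unpack("<i", f.read(4))
    return None if v == SENTINEL else v


def read_values(indices, filename):
    with open(filename, "rb") as f:
        # Build the list inside the with-block so every read happens
        # before the file is closed (same concern as the F# comment above)
        return [read_value(f, i) for i in indices]
```

As in the F# code, the result is materialized eagerly, so no read is attempted after the file handle has been released.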

The full code, along with versions in other languages, is on GitHub.