是否有可能来连接只使用一个分配的字符串列表?有可能、字符串、分配、列表

2023-09-03 17:35:21 作者:你住在我心里

做一些分析后,我们发现,在我们的应用程序连接字符串目前的方式会导致记忆流失和CPU时间的巨大金额。

After doing some profiling, we've discovered that the current way in which our app concatenates strings causes an enormous amount of memory churn and CPU time.

我们正在建立一个名单,其中,串> 字符串来连接的是50万个元素,引用几百兆字节值得字符串的顺序。我们正在努力,因为它似乎占了CPU和内存使用情况不成比例,以优化这一块我们的小应用程序的一部分。

We're building a List<string> of strings to concatenate that is on the order of 500 thousand elements long, referencing several hundred megabytes worth of strings. We're trying to optimize this one small part of our app since it seems to account for a disproportionate amount of CPU and memory usage.

我们做了很多文本处理的:)

We do a lot of text processing :)

从理论上讲,我们应该能够在一个单一的分配和N份进行级联 - 我们可以知道一共有多少个字符是在我们的字符串可用,所以它应该只是简单的总结部分字符串的长度并分配足够的底层内存来保存结果。

Theoretically, we should be able to perform the concatenation in a single allocation and N copies - we can know how many total characters are available in our string, so it should just be as simple as summing up the lengths of the component strings and allocating enough underlying memory to hold the result.

假设我们开始了pre-充满名单,其中,串&GT; ,是有可能采用单个分配来连接在该列表中的所有字符串

Assuming we're starting with a pre-filled List<string>, is it possible to concatenate all strings in that list using a single allocation?

目前,我们使用了的StringBuilder 类,但这种存储所有的人物自身的中间缓冲区 - 所以我们有一个不断增长的块阵列,每​​个块存储我们给它的字符的副本。很不理想。该拨款块的排列并不可怕,但最糟糕的是,它分配中间的字符数组,这是指N分配和拷贝。

Currently, we're using the StringBuilder class, but this stores its own intermediate buffer of all of the characters - so we have an ever growing chunk array, with each chunk storing a copy of the characters we're giving it. Far from ideal. The allocations for the array of chunks aren't horrible, but the worst part is that it allocates intermediate character arrays, which means N allocations and copies.

我们现在可以做的最好的是调用名单,其中,串&GT; .ToArray() - 它执行500K元素的数组的一个副本 - 并通过所产生的的String [] string.Concat(PARAMS字符串[]) string.Concat()然后执行两笔拨款,一个输入数组复制到内部数组,一个分配目标字符串的内存。

The best we can do right now is to call List<string>.ToArray() - which performs one copy of a 500k element array - and pass the resulting string[] to string.Concat(params string[]). string.Concat() then performs two allocations, one to copy the input array into an internal array, and the one to allocate the destination string's memory.

从referencesource.microsoft.com:

From referencesource.microsoft.com:

    public static String Concat(params String[] values) {
        if (values == null)
            throw new ArgumentNullException("values");
        Contract.Ensures(Contract.Result<String>() != null);
        // Spec#: Consider a postcondition saying the length of this string == the sum of each string in array
        Contract.EndContractBlock();
        int totalLength=0;

        // -----------> Allocation #1 <---------
        String[] internalValues = new String[values.Length];

        for (int i=0; i<values.Length; i++) {
            string value = values[i];
            internalValues[i] = ((value==null)?(String.Empty):(value));
            totalLength += internalValues[i].Length;
            // check for overflow
            if (totalLength < 0) {
                throw new OutOfMemoryException();
            }
        }

        return ConcatArray(internalValues, totalLength);
    }

    private static String ConcatArray(String[] values, int totalLength) {

        // -----------------> Allocation #2 <---------------------
        String result =  FastAllocateString(totalLength);
        int currPos=0;

        for (int i=0; i<values.Length; i++) {
            Contract.Assert((currPos <= totalLength - values[i].Length), 
                            "[String.ConcatArray](currPos <= totalLength - values[i].Length)");

            FillStringChecked(result, currPos, values[i]);
            currPos+=values[i].Length;
        }

        return result;
    }

因此​​,在最好的情况下,我们有三个分配,两个用于数组引用组件串,和一个用于目标串联的字符串

Thus, in the best case, we have three allocations, two for arrays referencing the component strings, and one for the destination concatenated string.

我们可以改善呢?是否有可能来连接一个名单,其中,串&GT; 采用单个分配和字符拷贝一个循环

Can we improve on this? Is it possible to concatenate a List<string> using a single allocation and a single loop of character copies?

我想总结一下迄今为止所讨论的各种方法,以及为什么他们仍然是次优的。我还想设置在混凝土多一点的情况的参数,因为我已经收到了很多问题,这尝试边踩的核心问题。

I'd like to summarize the various approaches discussed so far, and why they are still sub-optimal. I'd also like to set the parameters of the situation in concrete a little more, since I've received a lot of questions that try to side step the central question.

...

第一,code,我在工作的结构。有三个层次:

First, the structure of the code that I am working within. There are three layers:

在一层是一组生产我的内容的方法。这些方法返回小十岁上下的字符串对象,我会打电话给我的'分量'弦'。这些字符串对象最终会被连接成一个字符串。我不必修改这些方法的能力;我不得不面对的现实是,他们返回的字符串对象,向前迈进。 在二层是我的code调用这些内容制作商和组装输出,并且是这个问题的主题。我必须调用内容制作方法,收集它们返回的字符串,最终并置返回的字符串成一个字符串(实际上是一个稍微复杂;返回的字符串被划分取决于他们是如何路由输出,所以我有几套串的大集合)。 在三层是一组接受一家独大的字符串进行进一步的处理方法。改变了code接口超出了我的控制。

讲一些数字:一个典型的批处理将收集从内容制作,再presenting约200-500 MB内存〜50万串。我需要最有效的方式来连接这些50万串入一个字符串。

Talking about some numbers: a typical batch run will collect ~500000 strings from the content producers, representing about 200-500 MB of memory. I need the most efficient way to concatenate these 500k strings into a single string.

...

现在我想考察迄今讨论的方法。为了数字,假设我们正在运行的64位,假设我们正在收集50万的字符串对象,并承担该字符串对象的总面积共计200兆字节的价值的字符数据。另外,假设原始字符串对象的内存的不计入的对任何方法总在下面的分析。我做这样的假设,因为它必然是共同的任何和所有的方法,因为它是我们无法改变的内容制作商的接口的假设 - 他们返回500K相对较小完全形成的字符串对象,我必须再接受并以某种方式连接起来。如上所述,我不能改变这个接口。

Now I'd like to examine the approaches discussed so far. For the sake of numbers, assume we're running 64-bit, assume that we are collecting 500000 string objects, and assume that the aggregate size of the string objects totals 200 megabytes worth of character data. Also, assume that the original string object's memory is not counted toward any approach's total in the below analysis. I make this assumption because it is necessarily common to any and all approaches, because it is an assumption that we cannot change the interface of the content producers - they return 500k relatively small fully formed strings objects that I must then accept and somehow concatenate. As stated above, I cannot change this interface.

内容生产商----> 的StringBuilder ----> 字符串

Content producers ----> StringBuilder ----> string

在概念上,这将是调用内容制作,并直接写入它们返回到的StringBuilder ,然后再调用 StringBuilder.ToString琴弦()以获得连接字符串。

Conceptually, this would be invoking the content producers, and directly writing the strings they return to a StringBuilder, and then later calling StringBuilder.ToString() to obtain the concatenated string.

通过分析的StringBuilder 的实施中,我们可以看到,这个成本归结为400 MB分配和副本:

By analyzing StringBuilder's implementation, we can see that the cost of this boils down to 400 MB of allocations and copies:

在这里我们收集的内容生产者输出的阶段,我们正在写200 MB的数据到的StringBuilder 。我们会进行一个200 MB拨款pre-分配的StringBuilder ,值得副本则200 MB,因为我们复制,丢弃从内容生产商返回的字符串 在我们收集的内容制作所有的输出,并具有完全形成的StringBuilder ,然后我们需要调用 StringBuilder.ToString() 。这正是执行一个分配( string.FastAllocateString()),然后将文件从内部缓冲区字符串对象的内部存储器中的字符串数据。 During the stage where we collect the output from the content producers, we're writing 200 MB of data to the StringBuilder. We would be performing one 200 MB allocation to pre-allocate the StringBuilder, and then 200 MB worth of copies as we copy and discard the strings returned from the content producers After we've collected all output from the content producers and have a fully formed StringBuilder, we then need to call StringBuilder.ToString(). This performs exactly one allocation (string.FastAllocateString()), and then copies the string data from its internal buffers to the string object's internal memory.

总成本:约400 MB的分配和副本

内容生产商---> pre分配的char [] ---> 字符串

Content producers ---> pre-allocated char[] ---> string

这种策略是相当简单的。假设我们知道我们大概有多少字符数据将被从生产者收集,我们可以pre-分配的char [] 是200 MB大。然后,我们所说的内容制作商,我们它们返回的字符串复制到我们的的char [] 。这占200 MB分配和副本。最后一步把它变成一个字符串对象是将其传递给新的字符串(的char [])的构造。然而,由于字符串是不可变和数组都没有,构造将使该整个阵列的一个副本,使其分配和复制另一个200 MB的字符数据。

This strategy is fairly simple. Assuming we know roughly how much character data we're going to be collecting from the producers, we can pre-allocate a char[] that is 200 MB large. Then, as we call the content producers, we copy the strings they return into our char[]. This accounts for 200 MB of allocations and copies. The final step to turn this into a string object is to pass it to the new string(char[]) constructor. However, since strings are immutable and arrays are not, the constructor will make a copy of that entire array, causing it to allocate and copy another 200 MB of character data.

总成本:约400 MB的分配和副本

内容生产商---> 名单,其中,串&GT; ----> 的String [] ----> string.Concat(字符串[])

Content producers ---> List<string> ----> string[] ----> string.Concat(string[])

pre-分配名单,其中,串&GT; 为约50万的元素 - 大约4 MB分配的名单的基础数组(每个指针50万* 8字节= = 4 MB的内存)。 呼叫所有的内容制作商,以收集他们的字符串。大约4 MB份,为我们的指针复制到返回的字符串到列表的底层数组。 呼叫名单,其中,串&GT; .ToArray()来获得的String [] 。大约4 MB的分配和复印件(再次,我们真的只复制指针)的。 呼叫 string.Concat(字符串[]): Concat的将它做任何实际工作之前,使提供给它的数组的一个副本。大约4 MB的分配和拷贝,再次。 Concat的然后将使用内部 string.FastAllocateString()特殊的方法分配一个单独的目的地的字符串对象。大约200 MB分配。 Concat的然后将复制从提供的数组,其内部副本字符串直接到目的地。大约200 MB的副本。 Pre-allocate a List<string> to be about 500k elements - approximately 4 MB of allocations for List's underlying array (500k * 8 bytes per pointer == 4 MB of memory). Call all of the content producers to collect their strings. Approximately 4 MB of copies, as we copy the pointer to the returned string into List's underlying array. Call List<string>.ToArray() to obtain a string[]. Approximately 4 MB of allocations and copies (again, we're really just copying pointers). Call string.Concat(string[]): Concat will make a copy of the array provided to it before it does any real work. Approximately 4 MB of allocations and copies, again. Concat will then allocate a single 'destination' string object using the internal string.FastAllocateString() special method. Approximately 200 MB of allocations. Concat will then copy strings from its internal copy of the provided array directly into the destination. Approximately 200 MB of copies.

总成本:约212 MB的分配和副本

它们都不是理想的,但做法#3是非常接近的。我们假设的内存绝对最低需要分配并复制为200 MB(在目标字符串),在这里我们得到pretty的接近 - 212 MB。

None of these approaches are ideal, however approach #3 is very close. We're assuming that the absolute minimum of memory that needs to be allocated and copied is 200 MB (for the destination string), and here we get pretty close - 212 MB.

如果有一个 string.Concat 重载1)接受了一个的IList&LT;字符串&GT; 2)做在使用前没有作出这样的IList的副本,那么问题将得到解决。没有这样的方法是由净,这一问题因此受到提供的。

If there were a string.Concat overload that 1) Accepted an IList<string> and 2) did not make a copy of that IList before using it, then the problem would be solved. No such method is provided by .Net, hence the subject of this question.

这是一个解决方案的进度

我已经做了一些黑客攻击IL一些测试,发现直接调用 string.FastAllocateString(N)(这通常不是可调用...)约为尽可能快地调用新的字符串('\ 0',N),两者的似乎的分配完全一样多的内存的预期。

I've done some testing with some hacked IL, and found that directly invoking string.FastAllocateString(n) (which is not usually invokable...) is about as fast as invoking new string('\0', n), and both seem to allocate exactly as much memory as is expected.

从那里,似乎它可以使用来获得一个指向新分配的字符串不安全固定声明。

From there, it seems its possible to acquire a pointer to the freshly allocated string using the unsafe and fixed statements.

所以,粗略的解决方案开始出现:

And so, a rough solution begins to appear:

    private static string Concat( List<string> list )
    {
        int concatLength = 0;

        for( int i = 0; i < list.Count; i++ )
        {
            concatLength += list[i].Length;
        }

        string newString = new string( '\0', concatLength );

        unsafe
        {
            fixed( char* ptr = newString )
            {
                ...
            }
        }

        return newString;
    }

接下来的最大的障碍正在实施或发现除了一个高效的块复制方法,ALA Buffer.BlockCopy,将接受的char * 类型。

推荐答案

我实现了一个方法来连接列表到执行只有一个分配。

I have implemented a method to concatenate a List into a single string that performs exactly one allocation.

下面code的.Net 4.6下的编译 - Block.MemoryCopy 未添加到.NET,直到4.6

The following code compiles under .Net 4.6 - Block.MemoryCopy wasn't added to .Net until 4.6.

不安全的实现:

public static unsafe class FastConcat
{
    public static string Concat( IList<string> list )
    {
        string destinationString;
        int destLengthChars = 0;

        for( int i = 0; i < list.Count; i++ )
        {
            destLengthChars += list[i].Length;
        }

        destinationString = new string( '\0', destLengthChars );

        unsafe
        {
            fixed( char* origDestPtr = destinationString )
            {
                char* destPtr = origDestPtr; // a pointer we can modify.
                string source;

                for( int i = 0; i < list.Count; i++ )
                {
                    source = list[i];

                    fixed( char* sourcePtr = source )
                    {
                        Buffer.MemoryCopy(
                            sourcePtr,
                            destPtr,
                            long.MaxValue,
                            source.Length * sizeof( char )
                        );
                    }

                    destPtr += source.Length;
                }
            }
        }

        return destinationString;
    }

}

相互竞争的实现是下面的安全的实现:

The competing implementation is the following "safe" implementation:

public static string Concat( IList<string> list )
{
    return string.Concat( list.ToArray() )
}

内存消耗

在不安全的实施正好进行一个分配和零临时拨款。该名单,其中,串&GT; 直接连接到单一的,新分配字符串对象 在安全的实施需要在列表的两个副本 - 一个,当我打电话的ToArray()将它传递给string.Concat,另外当string.Concat执行所述阵列的其自己的内部副本。 The "unsafe" implementation performs exactly one allocation and zero temporary allocations. The List<string> is directly concatenated into a single, freshly allocated string object. The "safe" implementation requires two copies of the list - one, when I call ToArray() to pass it to string.Concat, and another when string.Concat performs its own internal copy of the array.

在串联一个500K元素的列表中,安全string.Concat方法分配额外的内存在64位进程正好是8 MB,我已经运行在内存监视器测试驱动证实。这是我们希望与安全执行进行阵列复制。

When concatenating a 500k element list, the "safe" string.Concat method allocates exactly 8 MB of extra memory in a 64-bit process, which I've confirmed by running the test driver in a memory monitor. This is what we would expect with the array copies performed by the safe implementation.

CPU性能

对于小型工作集,不安全的执行似乎赢得了约25%。

For small worksets, the unsafe implementation seems to win by about 25%.

测试驱动程序被编译为64位,通过NGEN把程序安装到本机映像缓存,并从调试器外部已卸载的工作站上运行测试。

The test driver was tested by compiling for 64-bit, installing the program into the native image cache via NGEN, and running from outside the debugger on an unloaded workstation.

这是我的一个小工作集试车手(每2-10长的碳化500K字符串):

From my test driver with a small workset (500k strings each 2-10 chars long):

Unsafe Time: 17.266 ms
Unsafe Time: 18.419 ms
Unsafe Time: 16.876 ms

Safe Time: 21.265 ms
Safe Time: 21.890 ms
Safe Time: 24.492 ms

不安全的平均值为:17.520毫秒。安全平均值为:22.549毫秒。安全花费比不安全长大约25%。这可能是由于额外的工作安全地进行所要做的,分配临时数组。

Unsafe average: 17.520 ms. Safe average: 22.549 ms. Safe takes about 25% longer than unsafe. This is likely due to the extra work the safe implementation has to do, allocating temporary arrays.

...

这是我的一个大的工作集试车手(50万串,每500-800碳化的长):

From my test driver with a large workset (500k strings, each 500-800 chars long):

Unsafe Time: 498.122 ms
Unsafe Time: 513.725 ms
Unsafe Time: 515.016 ms

Safe Time: 487.456 ms
Safe Time: 499.508 ms
Safe Time: 512.390 ms

正如你所看到的,用大串的性能差异大致为零,可能是因为时间由原始副本为主。

As you can see, the performance difference with large strings is roughly zero, likely because the time is dominated by the raw copy.

结论

如果你不关心阵副本,安全的实现是死的简单实现,并且是大致一样快,不安全的执行情况。如果你想和内存使用绝对完美的,使用不安全的实现。

If you don't care about the array copies, the safe implementation is dead simple to implement, and is roughly as fast as the unsafe implementation. If you want to be absolutely perfect with memory usage, use the unsafe implementation.

我已经把它贴在code我使用的测试工具:

I've attached the code I used for the test harness:

class PerfTestHarness
{
    private List<string> corpus;

    public PerfTestHarness( List<string> corpus )
    {
        this.corpus = corpus;

        // Warm up the JIT

        // Note that `result` is discarded. We reference it via 'result[0]' as an 
        // unused paramater to my prints to be absolutely sure it doesn't get 
        // optimized out. Cheap hack, but it works.
        string result;

        result = FastConcat.Concat( this.corpus );
        Console.WriteLine( "Fast warmup done", result[0] );

        result = string.Concat( this.corpus.ToArray() );
        Console.WriteLine( "Safe warmup done", result[0] );

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }

    public void PerfTestSafe()
    {
        Stopwatch watch = new Stopwatch();
        string result;

        GC.Collect();
        GC.WaitForPendingFinalizers();

        watch.Start();
        result = string.Concat( this.corpus.ToArray() );
        watch.Stop();

        Console.WriteLine( "Safe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
        Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
        Console.WriteLine();
    }

    public void PerfTestUnsafe()
    {
        Stopwatch watch = new Stopwatch();
        string result;

        GC.Collect();
        GC.WaitForPendingFinalizers();

        watch.Start();
        result = FastConcat.Concat( this.corpus );
        watch.Stop();

        Console.WriteLine( "Unsafe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
        Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
        Console.WriteLine();
    }
}