Unsafe pointer iteration and bitmaps - why is UInt64 faster? (c#, bitmap, pointers, unsafe)

2023-09-04 02:51:48 By: 愁恨年年长相似


I have been doing some unsafe bitmap operations and have found out that increasing the pointer less times can lead to some big performance improvements. I am not sure why is that so, even though you do lot more bitwise operations in the loop, it still is better to do less iterations on the pointer.

So for example instead of iterating over 32 bit pixels with a UInt32 iterate over two pixels with UInt64 and do twice the operations in one cycle.

The following does it by reading two pixels and modifying them (of course it will fail with images with odd width, but its just for testing).

    private void removeBlueWithTwoPixelIteration()
    {
        // think of a big image with data
        Bitmap bmp = new Bitmap(15000, 15000, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
        TimeSpan startTime, endTime;

        unsafe {

            UInt64 doublePixel;
            UInt32 pixel1;
            UInt32 pixel2;

            const int readSize = sizeof(UInt64);
            const UInt64 rightHalf = UInt32.MaxValue;

            PerformanceCounter pf = new PerformanceCounter("System", "System Up Time"); pf.NextValue();

            BitmapData bd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, bmp.PixelFormat);
            byte* image = (byte*)bd.Scan0.ToPointer();

            startTime = TimeSpan.FromSeconds(pf.NextValue());

            for (byte* line = image; line < image + bd.Stride * bd.Height; line += bd.Stride)
            {
                for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
                {
                    doublePixel = *((UInt64*)pointer);
                    pixel1 = (UInt32)(doublePixel >> (readSize * 8 / 2)) >> 8; // lose the last 8 bits (blue channel)
                    pixel2 = (UInt32)(doublePixel & rightHalf) >> 8; // lose the last 8 bits (blue channel)
                    *((UInt32*)pointer) = pixel1 << 8; // put back, shifted so A, R, G return to their original positions
                    *((UInt32*)pointer + 1) = pixel2 << 8; // put back, shifted so A, R, G return to their original positions
                }
            }

            endTime = TimeSpan.FromSeconds(pf.NextValue());

            bmp.UnlockBits(bd);
            bmp.Dispose();

        }

        MessageBox.Show((endTime - startTime).TotalMilliseconds.ToString());

    }

The following code does it pixel by pixel and is around 70% slower than the previous one:

    private void removeBlueWithSinglePixelIteration()
    {
        // think of a big image with data
        Bitmap bmp = new Bitmap(15000, 15000, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
        TimeSpan startTime, endTime;

        unsafe
        {

            UInt32 singlePixel;

            const int readSize = sizeof(UInt32);

            PerformanceCounter pf = new PerformanceCounter("System", "System Up Time"); pf.NextValue();

            BitmapData bd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, bmp.PixelFormat);
            byte* image = (byte*)bd.Scan0.ToPointer();

            startTime = TimeSpan.FromSeconds(pf.NextValue());

            for (byte* line = image; line < image + bd.Stride * bd.Height; line += bd.Stride)
            {
                for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
                {
                    singlePixel = *((UInt32*)pointer) >> 8; // lose B
                    *((UInt32*)pointer) = singlePixel << 8; // shift A, R, G back
                }
            }

            endTime = TimeSpan.FromSeconds(pf.NextValue());

            bmp.UnlockBits(bd);
            bmp.Dispose();

        }

        MessageBox.Show((endTime - startTime).TotalMilliseconds.ToString());
    }

Could someone clarify why incrementing the pointer is a more costly operation than doing a few extra bitwise operations?

I am using the .NET 4 framework.

Could something like this be true for C++?

NB: On 32-bit vs 64-bit the ratio between the two methods is the same, however both are around 20% slower on 64-bit than on 32-bit?

EDIT: As suggested by Porges and arul this could be because of decreased number of memory reads and branching overhead.

EDIT2:

After some testing it seems that fewer reads from memory is the answer:

With the following code, assuming the image width is divisible by 5, you get about 400% faster:

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PixelContainer {
    public UInt32 pixel1;
    public UInt32 pixel2;
    public UInt32 pixel3;
    public UInt32 pixel4;
    public UInt32 pixel5;
}

Then use this:

            int readSize = sizeof(PixelContainer);
            PixelContainer multiPixel;

            // .....

            for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
            {
                multiPixel = *((PixelContainer*)pointer);
                multiPixel.pixel1 &= 0xFFFFFF00u;
                multiPixel.pixel2 &= 0xFFFFFF00u;
                multiPixel.pixel3 &= 0xFFFFFF00u;
                multiPixel.pixel4 &= 0xFFFFFF00u;
                multiPixel.pixel5 &= 0xFFFFFF00u;
                *((PixelContainer*)pointer) = multiPixel;
            }

Solution

This is a technique known as loop unrolling. The main performance benefit should come from reducing the branching overhead.

As a side note, you could speed it up a bit by using a bitmask:

*((UInt64*)pointer) &= 0xFFFFFF00FFFFFF00ul;