Performance difference of memcpy between 32-bit and 64-bit processes

2023-09-07 22:28:15 Author: 拽爷就是我


We have Core2 machines (Dell T5400) with XP64.


We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or 2.4GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.


My question is, what's this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs, or prefetchers, or... what?

Thanks for any insight.

Also asked on the Intel forums.

Answer


Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.


My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.


If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.
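The contrast described above can be sketched with SSE2 intrinsics. This is a minimal, hypothetical illustration (not the CRT's actual implementation); it assumes x86/SSE2, 16-byte-aligned buffers, and a size that is a multiple of 16 bytes. `_mm_stream_si128` is the intrinsic form of the `movntdq` non-temporal store, and the trailing `sfence` orders the streaming stores:

```c
/* Sketch of a memcpy-like loop using non-temporal (cache-bypassing) stores.
 * Assumptions (simplified for illustration):
 *   - x86 with SSE2 available (<emmintrin.h>)
 *   - dst and src are 16-byte aligned
 *   - n is a multiple of 16
 */
#include <emmintrin.h>
#include <stddef.h>

void copy_nt(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i x = _mm_load_si128(s + i);  /* source is read through the cache */
        _mm_stream_si128(d + i, x);         /* destination written non-temporally:
                                               no read-for-ownership of dst lines */
    }
    _mm_sfence();  /* make the streaming stores globally visible/ordered */
}
```

A cached version would simply use `_mm_store_si128` (i.e. `movdqa`) in place of `_mm_stream_si128`; that is exactly the one-instruction difference the answer is pointing at, and it is why the destination would otherwise make a round trip through the cache.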


I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...


Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.
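To see this bound directly, one can time repeated copies and divide bytes moved by elapsed time. The sketch below is a rough, hypothetical harness (buffer size and repeat count are arbitrary, and `clock()` measures CPU time, so it is only indicative), along the lines of how the 1.2 vs 2.2 GB/s figures in the question would be measured:

```c
/* Rough memcpy bandwidth measurement. Illustrative only:
 * clock() resolution and CPU-time-vs-wall-time caveats apply. */
#include <stdlib.h>
#include <string.h>
#include <time.h>

double measure_memcpy_gbps(size_t bytes, int reps)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return 0.0; }

    memset(src, 1, bytes);   /* touch the source so pages are committed */

    clock_t t0 = clock();
    for (int i = 0; i < reps; ++i)
        memcpy(dst, src, bytes);
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return secs > 0.0 ? (double)bytes * reps / secs / 1e9 : 0.0;
}
```

Running something like `measure_memcpy_gbps(1 << 20, 1000)` under both a 32-bit and a 64-bit build of the same source is one way to reproduce the disparity described in the question.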
