We have Core2 machines (Dell T5400) with XP64.
We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2 GByte/s; however, memcpy in a 64-bit process achieves about 2.2 GByte/s (or 2.4 GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of the 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.
My question is: what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?
Thanks for any insight.

(Also asked on the Intel forums.)
Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.
My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.
If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written, making a one-way trip to memory instead of a round trip.
I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...
Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.