Why does adding a local variable make .NET code slower?

2023-09-02 12:00:54 Author: 帆布鞋走过的流年

Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
    var isMultipleOf16 = i % 16 == 0;
    count += isMultipleOf16 ? 1 : 0;
    //count += i % 16 == 0 ? 1 : 0;
}

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980s-era Dragon Book. Why was this optimization not applied when generating the CIL or when JITing the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.

Update: Speed differences also occur for x86, but depend on the order that methods are just-in-time compiled. See Why does JIT order affect performance?

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
00000037  mov         eax,edx 
00000039  and         eax,0Fh 
0000003c  xor         ecx,ecx 
0000003e  test        eax,eax 
00000040  sete        cl 
    count += isMultipleOf16 ? 1 : 0;
00000043  movzx       eax,cl 
00000046  test        eax,eax 
00000048  jne         0000000000000050 
0000004a  xor         eax,eax 
0000004c  jmp         0000000000000055 
0000004e  xchg        ax,ax 
00000050  mov         eax,1 
00000055  lea         r8d,[rbx+rax] 

    count += i % 16 == 0 ? 1 : 0;
00000037  mov         eax,ecx 
00000039  and         eax,0Fh 
0000003c  je          0000000000000042 
0000003e  xor         eax,eax 
00000040  jmp         0000000000000047 
00000042  mov         eax,1 
00000047  lea         edx,[rbx+rax] 
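
Both listings begin with `and eax,0Fh`: for an unsigned `i`, `i % 16` is the same as masking the low four bits, so the two source forms differ only in how the result of that test is consumed. A minimal sketch, assuming nothing beyond the loop in the question (the class and method names here are illustrative and not part of the original post), that checks that the two source forms and an explicit mask all produce the same count:

```csharp
using System;

static class Mod16Check
{
    // Form from the question that stores the test in a local first.
    public static int CountWithLocal(uint n)
    {
        int count = 0;
        for (uint i = 0; i < n; ++i)
        {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        return count;
    }

    // Form from the question with the test folded into the expression.
    public static int CountInline(uint n)
    {
        int count = 0;
        for (uint i = 0; i < n; ++i)
        {
            count += i % 16 == 0 ? 1 : 0;
        }
        return count;
    }

    // Explicit mask, mirroring the `and eax,0Fh` in the disassembly.
    public static int CountMasked(uint n)
    {
        int count = 0;
        for (uint i = 0; i < n; ++i)
        {
            count += (i & 0xFu) == 0 ? 1 : 0;
        }
        return count;
    }

    static void Main()
    {
        const uint n = 1000;
        Console.WriteLine("local:  {0}", CountWithLocal(n));
        Console.WriteLine("inline: {0}", CountInline(n));
        Console.WriteLine("masked: {0}", CountMasked(n));
    }
}
```

Since all three forms are equivalent in result, any timing difference comes from the generated machine code rather than from the source semantics.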

Solution

The question should be "Why do I see such a difference on my machine?". I cannot reproduce such a huge speed difference and suspect there is something specific to your environment. It is very difficult to tell what it could be, though. It could be some (compiler) options that you set some time ago and forgot about.

I created a console application, rebuilt it in Release mode (x86), and ran it outside VS. The results are virtually identical: 1.77 seconds for both methods. Here is the exact code:

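using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        // Time the whole loop with a high-resolution timer.
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int count = 0;

        for (uint i = 0; i < 1000000000; ++i)
        {
            // 1st method: store the test in a local variable first
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;

            // 2nd method: fold the test into the expression
            //count += i % 16 == 0 ? 1 : 0;
        }

        sw.Stop();
        Console.WriteLine(string.Format("Elapsed {0}, count {1}", sw.Elapsed, count));
        Console.ReadKey();
    }
}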

If anyone has 5 minutes, please copy the code, rebuild it, run it outside VS, and post the results in the comments to this answer. I'd like to avoid saying "it works on my machine".

EDIT

To be sure, I created a 64-bit WinForms application, and the results are similar to those in the question: the first method is slower (1.57 sec) than the second one (1.05 sec). The difference I observe is 33%, which is still a lot. It seems there is a bug in the .NET 4 64-bit JIT compiler.
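
Because the result depends on whether the process runs as x86 or x64, it can help to time both variants in the same run and print the bitness alongside the numbers. The following is only a sketch under stated assumptions: the class, method, and helper names are hypothetical, and moving each loop into its own method is a change from the code above that may itself alter what the JIT generates, so the figures should be treated as indicative only.

```csharp
using System;
using System.Diagnostics;

static class JitTiming
{
    // Variant 1: the test is stored in a local variable first.
    public static int WithLocal(uint n)
    {
        int count = 0;
        for (uint i = 0; i < n; ++i)
        {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        return count;
    }

    // Variant 2: the test is folded into the expression.
    public static int Inline(uint n)
    {
        int count = 0;
        for (uint i = 0; i < n; ++i)
        {
            count += i % 16 == 0 ? 1 : 0;
        }
        return count;
    }

    // Hypothetical helper: runs one variant and reports its elapsed time.
    static TimeSpan Time(Func<uint, int> variant, uint n, out int count)
    {
        Stopwatch sw = Stopwatch.StartNew();
        count = variant(n);
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        const uint n = 1000000000;

        // Call each variant once so both are JIT-compiled before timing.
        WithLocal(16);
        Inline(16);

        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);

        int c1, c2;
        TimeSpan t1 = Time(WithLocal, n, out c1);
        TimeSpan t2 = Time(Inline, n, out c2);

        Console.WriteLine("with local: {0}, count {1}", t1, c1);
        Console.WriteLine("inline:     {0}, count {1}", t2, c2);
    }
}
```

As with the code above, this should be built in Release mode and run outside Visual Studio; switching the project's platform target between x86 and x64 then shows whether the gap appears only with the 64-bit JIT.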

 