Why does the order of the loops in a matrix multiplication algorithm affect performance?

2023-09-12 21:16:54 Author: 假装、很幸福

I am given two functions for finding the product of two matrices:

    void MultiplyMatrices_1(int **a, int **b, int **c, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }

    void MultiplyMatrices_2(int **a, int **b, int **c, int n) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }

I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?

Recommended answer

I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.

Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.

Most programs are written in a way where if a single byte in memory is read into memory, the program later reads multiple different values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.

Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.

Given this, let's look at your code. The first version looks like this:

  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          for (int k = 0; k < n; k++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason for this is that because the matrix is stored in row-major order, each time you increment k, you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j don't change across these iterations), nor will you probably miss a[i][k], because the values are in row-major order and if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.

But consider this second version:

  for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
          for (int j = 0; j < n; j++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.

Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.

Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book (http://rads.stackoverflow.com/amzn/click/0321486811) …