Matrix
Notes on AES-256 Cipher Implementation on PowerPC Processors: Part IV
Grzegorz Kraszewski

IIIIII ● IV ● V

Working on AltiVec implementation of AES-256 I've found an optimization, which could be also applied to the scalar code. Then I did it, the result was unexpectedly good.

MixColumn revisited and made much faster

The MixColumn operation has been earlier optimized by precalculation of a cipher state column multiplied by 2 and by 3, in GF(28) field. Then the matrix multiplication in this field reduces to selecting bytes from either original column C, column multiplied by 2, denoted as M2 or column multiplied by 3, denoted as M3. The original operation may be written down as follows (these equations come right from FIPS-197 document):

C0' = ({02} × C0) + ({03} × C1) + ({01} × C2) + ({01} × C3)
C1' = ({01} × C0) + ({02} × C1) + ({03} × C2) + ({01} × C3)
C2' = ({01} × C0) + ({01} × C1) + ({02} × C2) + ({03} × C3)
C3' = ({03} × C0) + ({01} × C1) + ({01} × C2) + ({02} × C3)

When we consider, that we have precomputed multiplications and use C language table notation for C, M2 and M3, the formulas are a bit simplier:

C'[0] = M2[0] + M3[1] + C[2] + C[3]
C'[1] = C[0] + M2[1] + M3[2] + C[3]
C'[2] = C[0] + C[1] + M2[2] + M3[3]
C'[3] = M3[0] + C[1] + C[2] + M2[3]

Code v7 (and v8) gathers columns of the above equation system in 32-bit registers and then XOR-s them, so it performs three 32-bit XOR operations instead of 12 single byte operations. Unfortunately extracting bytes and merging them into columnar 32-bit words requires many operations. How it can be simplified? XOR operation is commutative. It means I may rearrange order of terms in each equation separately. Then let's rearrange them as follows:

C'[0] = C[2] + C[3] + M2[0] + M3[1]
C'[1] = C[3] + C[0] + M2[1] + M3[2]
C'[2] = C[0] + C[1] + M2[2] + M3[3]
C'[3] = C[1] + C[2] + M2[3] + M3[0]

Now our columns (32-bit words) may be very easily derived from C, M2 and M3:

Of course PowerPC processors have bitwise left rotation instruction. It cannot be explicitly denoted in C language, but fortunately GCC compiler detects simple cases of rotation and uses single rlwinm instruction. Then, the improved MixColumn code looks like this:

ULONG MixColumn(ULONG c)
{
  ULONG m2, m3, mask, p, q, r;

  /* preparation of premultiplied column */

  mask = c & 0x80808080;
  mul2 = (c << 1) & 0xFEFEFEFE;
  mask >>= 7;
  mask *= 0x1B;
  mul2 ^= mask;
  mul3 = c ^ mul2;

  /* matrix multiplication */

  p = (c << 16) | ((c >> 16) & 0xFFFF);
  q = (c << 24) | ((c >> 8) & 0xFFFFFF);
  r = (mul3 << 8) | ((mul3 >> 24) & 0xFF);
  c = p ^ q ^ r ^ m2;
  return c;
}

Note that the first three lines in the "matrix multiplication" section compile to just three CPU instructions. The fourth line yields another three. Then the code of MixColumn is much shorter (72 bytes instead of 184) and indeed it executes much faster. See the benchmarks of the v9 code:

Efika MPC5200B 400 MHz 138 cycles/byte 2.77 MB/s ×2.45
Pegasos I G3 750FX 600 MHz 71 cycles/byte 8.13 MB/s ×3.36
Pegasos II G4 7447A 1000 MHz 68 cycles/byte 14.22 MB/s ×4.67
PowerMac G4 G4 7455 1420 MHz 68 cycles/byte 19.96 MB/s ×4.60
PowerBook G4 G4 7447A 1670 MHz 68 cycles/byte 23.47 MB/s ×4.67
PowerMac G5 G5 970FX 2000 MHz 66 cycles/byte 29.30 MB/s ×3.89
PowerMac G5 G5 970FX 2300 MHz 66 cycles/byte 33.59 MB/s ×3.88

Part V »