#### I ● II ● III ● IV ● V

Working on AltiVec implementation of AES-256 I've found an optimization, which could be also applied to the scalar code. Then I did it, the result was unexpectedly good.

### MixColumn revisited and made much faster

The *MixColumn* operation has been earlier optimized by precalculation of a cipher state column multiplied by 2 and by 3, in GF(2^{8}) field. Then the matrix multiplication in this field reduces to selecting bytes from either original column *C*, column multiplied by 2, denoted as *M2* or column multiplied by 3, denoted as *M3*. The original operation may be written down as follows (these equations come right from FIPS-197 document):

C_{0}' = ({02} × C_{0}) + ({03} × C_{1}) + ({01} × C_{2}) + ({01} × C_{3})

C_{1}' = ({01} × C_{0}) + ({02} × C_{1}) + ({03} × C_{2}) + ({01} × C_{3})

C_{2}' = ({01} × C_{0}) + ({01} × C_{1}) + ({02} × C_{2}) + ({03} × C_{3})

C_{3}' = ({03} × C_{0}) + ({01} × C_{1}) + ({01} × C_{2}) + ({02} × C_{3})

When we consider, that we have precomputed multiplications and use C language table notation for *C*, *M2* and *M3*, the formulas are a bit simplier:

C'[0] = M2[0] + M3[1] + C[2] + C[3]

C'[1] = C[0] + M2[1] + M3[2] + C[3]

C'[2] = C[0] + C[1] + M2[2] + M3[3]

C'[3] = M3[0] + C[1] + C[2] + M2[3]

Code v7 (and v8) gathers columns of the above equation system in 32-bit registers and then XOR-s them, so it performs three 32-bit XOR operations instead of 12 single byte operations. Unfortunately extracting bytes and merging them into columnar 32-bit words requires many operations. How it can be simplified? XOR operation is commutative. It means I may rearrange order of terms in each equation separately. Then let's rearrange them as follows:

C'[0] = C[2] + C[3] + M2[0] + M3[1]

C'[1] = C[3] + C[0] + M2[1] + M3[2]

C'[2] = C[0] + C[1] + M2[2] + M3[3]

C'[3] = C[1] + C[2] + M2[3] + M3[0]

Now our columns (32-bit words) may be very easily derived from *C*, *M2* and *M3*:

- the first column is
*C*rotated left by 16 bits - the second column is
*C*rotated left by 24 bits - the third column is just
*M2* - the fourth column is
*M3*rotated left by 8 bits

Of course PowerPC processors have bitwise left rotation instruction. It cannot be explicitly denoted in C language, but fortunately GCC compiler detects simple cases of rotation and uses single *rlwinm* instruction. Then, the improved *MixColumn* code looks like this:

ULONG MixColumn(ULONG c)

{

ULONG m2, m3, mask, p, q, r;

/* preparation of premultiplied column */

mask = c & 0x80808080;

mul2 = (c << 1) & 0xFEFEFEFE;

mask >>= 7;

mask *= 0x1B;

mul2 ^= mask;

mul3 = c ^ mul2;

/* matrix multiplication */

p = (c << 16) | ((c >> 16) & 0xFFFF);

q = (c << 24) | ((c >> 8) & 0xFFFFFF);

r = (mul3 << 8) | ((mul3 >> 24) & 0xFF);

c = p ^ q ^ r ^ m2;

return c;

}

Note that the first three lines in the "matrix multiplication" section compile to just three CPU instructions. The fourth line yields another three. Then the code of *MixColumn* is much shorter (72 bytes instead of 184) and indeed it executes **much** faster. See the benchmarks of the v9 code:

Efika |
MPC5200B | 400 MHz | 138 cycles/byte |
2.77 MB/s |
×2.45 |

Pegasos I |
G3 750FX | 600 MHz | 71 cycles/byte |
8.13 MB/s |
×3.36 |

Pegasos II |
G4 7447A | 1000 MHz | 68 cycles/byte |
14.22 MB/s |
×4.67 |

PowerMac G4 |
G4 7455 | 1420 MHz | 68 cycles/byte |
19.96 MB/s |
×4.60 |

PowerBook G4 |
G4 7447A | 1670 MHz | 68 cycles/byte |
23.47 MB/s |
×4.67 |

PowerMac G5 |
G5 970FX | 2000 MHz | 66 cycles/byte |
29.30 MB/s |
×3.89 |

PowerMac G5 |
G5 970FX | 2300 MHz | 66 cycles/byte |
33.59 MB/s |
×3.88 |