Notes on AES-256 Cipher Implementation on PowerPC Processors: Part II
Grzegorz Kraszewski

I ● II ● III ● IV ● V

After building and testing the naive AES-256 implementation, it is time for some optimization. In this part I will apply the most obvious ones and see how they influence the code speed.

PowerPC is not an 8-bit processor

A byte-wise implementation has only one advantage: it follows the cipher specification directly, so it is easy to understand. However, processing data one byte at a time on a 32-bit processor is simply a waste of its cycles. The cipher state fits in four words, so it can easily be stored in processor registers. There is just one question: should these registers store rows of the state, or its columns? Let's see:

Most operations work on columns, so I've decided to store the state columns in four 32-bit words. The key history uses 32-bit words as well. Now AddRoundKey reduces to four XOR operations. Executing them in a loop makes no sense, so I've unrolled it and also inlined it in three places in the main routine; jumping to a subroutine of just four CPU instructions is simply not effective. I have also unrolled loading the plaintext, initializing the key history and unloading the encrypted text. The complete source code of this optimization stage is here. Note that AddRoundKey vanished as a separate function. Switching the state and key history to 32 bits required changes in the core Aes256 function as well as in the benchmark (mainly removing some typecasts and adding them in other places). Here are the results; the last column is the relative speedup compared to v1:
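As an illustration (a minimal sketch with assumed names, not the exact code from the linked source), the word-wise AddRoundKey boils down to this:

```c
#include <stdint.h>

typedef uint32_t ULONG;   /* assumed 32-bit unsigned, as in the article */

/* With the state held as four 32-bit column words, AddRoundKey is
   just four XORs, cheap enough to unroll and inline at the call site. */
static void AddRoundKeyWords(ULONG state[4], const ULONG subkey[4])
{
  state[0] ^= subkey[0];
  state[1] ^= subkey[1];
  state[2] ^= subkey[2];
  state[3] ^= subkey[3];
}
```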

Machine        CPU        Clock     Speed             Throughput   vs v1
Efika          MPC5200B    400 MHz  315 cycles/byte    1.21 MB/s   ×1.07
PowerMac G4    G4 7400     500 MHz  219 cycles/byte    2.19 MB/s   ×1.07
Pegasos I      G3 750FX    600 MHz  219 cycles/byte    2.61 MB/s   ×1.08
Pegasos II     G4 7447A   1000 MHz  282 cycles/byte    3.39 MB/s   ×1.11
PowerMac G4    G4 7455    1420 MHz  281 cycles/byte    4.82 MB/s   ×1.11
PowerBook G4   G4 7447A   1670 MHz  286 cycles/byte    5.58 MB/s   ×1.11
PowerMac G5    G5 970FX   2000 MHz  254 cycles/byte    7.52 MB/s   ×1.00
PowerMac G5    G5 970FX   2300 MHz  246 cycles/byte    8.94 MB/s   ×1.03
PowerMac G5    G5 970FX   2700 MHz  254 cycles/byte   10.15 MB/s   ×0.97

Surprisingly, the G5 got no speedup at all, maybe because more cracked instructions are involved, or because dispatch groups are filled less effectively. Still, it uses fewer cycles per byte than the G4.

Optimizing the key schedule

The key history has already been converted to ULONG-s, so making KeySchedule run on 32 bits is the next logical step. All XOR-s (including the XOR with the Rcon value, done in every even round) can be converted directly to 32-bit ones. RotWord is now coded as two shifts and one OR. Finally, SubBytes is reworked. Unfortunately it still has to work byte by byte, so bytes are extracted from a subkey word, used as indexes into SBox, and the values read are put back into the word.

void KeySchedule(ULONG *key_history, UBYTE round)
{
  ULONG eightkey, prevkey, sboxedkey;
  UBYTE k;
  ULONG rcon = 0x01000000 << (round >> 1);

  if ((round & 1) == 0)      /* even round */
  {
    eightkey = key_history[0];
    prevkey = key_history[7];

    /* RotWord */

    prevkey = (prevkey << 8) | ((prevkey >> 24) & 0xFF);

    /* SubBytes */

    k = (prevkey >> 24) & 0xFF;
    sboxedkey = (ULONG)SBox[k] << 24;
    k = (prevkey >> 16) & 0xFF;
    sboxedkey |= SBox[k] << 16;
    k = (prevkey >> 8) & 0xFF;
    sboxedkey |= SBox[k] << 8;
    k = prevkey & 0xFF;
    sboxedkey |= SBox[k];

    /* Rcon */

    sboxedkey ^= rcon;

    /* XOR-s */

    key_history[0] = sboxedkey ^ eightkey;
    key_history[1] = key_history[0] ^ key_history[1];
    key_history[2] = key_history[1] ^ key_history[2];
    key_history[3] = key_history[2] ^ key_history[3];
  }
  else      /* odd round */
  {
    eightkey = key_history[4];
    prevkey = key_history[3];

    /* SubBytes */

    k = (prevkey >> 24) & 0xFF;
    sboxedkey = (ULONG)SBox[k] << 24;
    k = (prevkey >> 16) & 0xFF;
    sboxedkey |= SBox[k] << 16;
    k = (prevkey >> 8) & 0xFF;
    sboxedkey |= SBox[k] << 8;
    k = prevkey & 0xFF;
    sboxedkey |= SBox[k];

    /* XOR-s */

    key_history[4] = sboxedkey ^ eightkey;
    key_history[5] = key_history[4] ^ key_history[5];
    key_history[6] = key_history[5] ^ key_history[6];
    key_history[7] = key_history[6] ^ key_history[7];
  }
}

Usually optimizing the key schedule is not that important, as a given implementation often uses a constant key and precalculates all the subkeys. As my approach assumes a variable key, it runs the schedule for each cipher block, so optimizing it should have a significant impact on speed. And indeed, measurements of the v3 code confirm it:

Machine        CPU        Clock     Speed             Throughput   vs v1
Efika          MPC5200B    400 MHz  283 cycles/byte    1.35 MB/s   ×1.19
PowerMac G4    G4 7400     500 MHz  194 cycles/byte    2.46 MB/s   ×1.20
Pegasos I      G3 750FX    600 MHz  198 cycles/byte    2.90 MB/s   ×1.20
Pegasos II     G4 7447A   1000 MHz  243 cycles/byte    3.93 MB/s   ×1.29
PowerMac G4    G4 7455    1420 MHz  242 cycles/byte    5.60 MB/s   ×1.29
PowerBook G4   G4 7447A   1670 MHz  246 cycles/byte    6.48 MB/s   ×1.29
PowerMac G5    G5 970FX   2000 MHz  182 cycles/byte   10.48 MB/s   ×1.39
PowerMac G5    G5 970FX   2300 MHz  185 cycles/byte   11.88 MB/s   ×1.37
PowerMac G5    G5 970FX   2700 MHz  181 cycles/byte   14.27 MB/s   ×1.36

The G5 made a huge step this time, and is still ahead of the G4.

Attempt to improve SubBytes

At first glance it seems impossible to gain anything here. The lookup table operates on bytes no matter what. However, as the state is now in 32-bit words, one may consider loading it as such, extracting bytes by shifts and masks, doing the lookup and reassembling the words. But that's not all. The next operation after SubBytes is always ShiftRows. As the state is stored as columns, shifting rows is another round of byte juggling, so both operations can be merged: after reading bytes from SBox we put them in already shifted positions. It can be coded as follows:

void SubBytesShifted(ULONG *state)
{
  ULONG p, q, r, s;

  p = (state[0] >> 24) & 0xFF;    p = SBox[p];
  q = (state[1] >> 24) & 0xFF;    q = SBox[q];
  r = (state[2] >> 24) & 0xFF;    r = SBox[r];
  s = (state[3] >> 24) & 0xFF;    s = SBox[s];
  state[0] = (state[0] & 0x00FFFFFF) | (p << 24);
  state[1] = (state[1] & 0x00FFFFFF) | (q << 24);
  state[2] = (state[2] & 0x00FFFFFF) | (r << 24);
  state[3] = (state[3] & 0x00FFFFFF) | (s << 24);

  p = (state[0] >> 16) & 0xFF;    p = SBox[p];
  q = (state[1] >> 16) & 0xFF;    q = SBox[q];
  r = (state[2] >> 16) & 0xFF;    r = SBox[r];
  s = (state[3] >> 16) & 0xFF;    s = SBox[s];
  state[0] = (state[0] & 0xFF00FFFF) | (q << 16);
  state[1] = (state[1] & 0xFF00FFFF) | (r << 16);
  state[2] = (state[2] & 0xFF00FFFF) | (s << 16);
  state[3] = (state[3] & 0xFF00FFFF) | (p << 16);

  p = (state[0] >> 8) & 0xFF;    p = SBox[p];
  q = (state[1] >> 8) & 0xFF;    q = SBox[q];
  r = (state[2] >> 8) & 0xFF;    r = SBox[r];
  s = (state[3] >> 8) & 0xFF;    s = SBox[s];
  state[0] = (state[0] & 0xFFFF00FF) | (r << 8);
  state[1] = (state[1] & 0xFFFF00FF) | (s << 8);
  state[2] = (state[2] & 0xFFFF00FF) | (p << 8);
  state[3] = (state[3] & 0xFFFF00FF) | (q << 8);

  p = state[0] & 0xFF;    p = SBox[p];
  q = state[1] & 0xFF;    q = SBox[q];
  r = state[2] & 0xFF;    r = SBox[r];
  s = state[3] & 0xFF;    s = SBox[s];
  state[0] = (state[0] & 0xFFFFFF00) | s;
  state[1] = (state[1] & 0xFFFFFF00) | p;
  state[2] = (state[2] & 0xFFFFFF00) | q;
  state[3] = (state[3] & 0xFFFFFF00) | r;
}

OK, I've added a serious amount of code here. Then it turned out that this "v4" code is slower than "v3". When I saw it for the first time, I stepped back to "v2" and applied the next optimization, to MixColumns, described in the next section. Then I re-added SubBytesShifted, and both optimizations together were faster than the MixColumns one alone. It is also worth noting that this "step back" effect of the above code happened only with the −O3 optimizer; with −O2 the speed increased. The v4 code.

Machine        CPU        Clock     Speed             Throughput   vs v1
Efika          MPC5200B    400 MHz  335 cycles/byte    1.14 MB/s   ×1.01
PowerMac G4    G4 7400     500 MHz  216 cycles/byte    2.22 MB/s   ×1.08
Pegasos I      G3 750FX    600 MHz  217 cycles/byte    2.64 MB/s   ×1.09
Pegasos II     G4 7447A   1000 MHz  255 cycles/byte    3.74 MB/s   ×1.23
PowerMac G4    G4 7455    1420 MHz  254 cycles/byte    5.33 MB/s   ×1.23
PowerBook G4   G4 7447A   1670 MHz  258 cycles/byte    6.17 MB/s   ×1.23
PowerMac G5    G5 970FX   2000 MHz  208 cycles/byte    9.17 MB/s   ×1.22
PowerMac G5    G5 970FX   2300 MHz  206 cycles/byte   10.67 MB/s   ×1.23
PowerMac G5    G5 970FX   2700 MHz  202 cycles/byte   12.75 MB/s   ×1.22

Optimizing MixColumns

As MixColumns is the most complex operation of AES-256, I've left it to the end of part II. What can be optimized here? My new code takes one cipher state column and multiplies the whole column in GF(2⁸) by 2 and by 3. Then I can complete the matrix multiplication by selecting bytes from either the original column, the column multiplied by 2 or the column multiplied by 3, according to the matrix, and XOR-ing them together. I use the same scheme for multiplication by 2 as in previous versions, just doing 4 bytes at a time. First a mask register is prepared. It should contain $1B bytes where the column bytes have the highest bit set, and $00 bytes otherwise. The first step is to extract the highest bits of all four bytes by masking the column with $80808080. The next step is to convert all the $80 bytes to $1B bytes, while leaving $00 bytes intact. There are two ways to do it:

  1. $1B has four "1" bits, so I can create $1B from $80 with 3 shifts and 2 OR-s.
  2. I can shift the whole register 7 bits right to turn $80 bytes into $01 bytes, then multiply the register by $1B.
A bit unexpectedly, the second method is faster on the G4; it speeds up the whole code by 3%. This is probably because the G4 7447(A) has a separate execution unit for integer multiplication and division. As multiplication is not used anywhere else in the code, it comes "for free", being executed in parallel with other instructions.
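Both methods can be sketched in C (the helper names are mine, for illustration only); each takes the column already masked with $80808080 and returns the $1B mask:

```c
#include <stdint.h>

typedef uint32_t ULONG;

/* method 1: build $1B = %00011011 from the $80 bits with 3 shifts and 2 ORs */
static ULONG ReduceMaskShifts(ULONG mask)
{
  mask >>= 3;          /* $80 -> $10 in each flagged byte */
  mask |= mask >> 1;   /* $10 -> $18 */
  mask |= mask >> 3;   /* $18 -> $1B */
  return mask;
}

/* method 2: shift $80 down to $01, then one multiplication spreads $1B */
static ULONG ReduceMaskMul(ULONG mask)
{
  return (mask >> 7) * 0x1B;
}
```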

Then the mask should be XOR-ed with the original column shifted left by one bit. But wait, not the whole column: each byte has to be shifted separately. To cancel any highest bits shifted in as lowest bits of the next byte, I mask the whole register with $FEFEFEFE before XOR-ing it with the mask. Finally I can get a column GF(2⁸)-multiplied by 3 as the XOR of the original column with the column multiplied by 2.

The last block performs the proper matrix multiplication by selecting bytes from the precalculated columns according to the matrix and XOR-ing them together. The transformed column is then stored back into the cipher state.

void MixColumns(ULONG *state)
{
  UBYTE c;

  for (c = 0; c < 4; c++)
  {
    ULONG mul2, mul3, news, mask, p, q, r, s;

    /* preparation of multiplied column */

    mask = state[c] & 0x80808080;
    mul2 = (state[c] << 1) & 0xFEFEFEFE;
    mask >>= 7;
    mask *= 0x1B;
    mul2 ^= mask;
    mul3 = state[c] ^ mul2;

    /* matrix multiplication */

    p = (mul2 & 0xFF000000) | ((state[c] & 0xFF000000) >> 8)
      | ((state[c] & 0xFF000000) >> 16) | ((mul3 & 0xFF000000) >> 24);
    q = ((mul3 & 0x00FF0000) << 8) | (mul2 & 0x00FF0000)
      | ((state[c] & 0x00FF0000) >> 8) | ((state[c] & 0x00FF0000) >> 16);
    r = ((state[c] & 0x0000FF00) << 16) | ((mul3 & 0x0000FF00) << 8)
      | ((mul2 & 0x0000FF00)) | ((state[c] & 0x0000FF00) >> 8);
    s = ((state[c] & 0x000000FF) << 24) | ((state[c] & 0x000000FF) << 16)
      | ((mul3 & 0x000000FF) << 8) | (mul2 & 0x000000FF);
    news = p ^ q ^ r ^ s;

    /* back to state */

    state[c] = news;
  }
}
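As a sanity check (my addition, not from the original code), the packed multiplication by 2 used above can be verified against the classic per-byte xtime() operation from the AES specification:

```c
#include <stdint.h>

typedef uint32_t ULONG;
typedef uint8_t UBYTE;

/* textbook GF(2^8) multiplication by 2 of a single byte */
static UBYTE xtime(UBYTE b)
{
  return (UBYTE)((b << 1) ^ ((b & 0x80) ? 0x1B : 0x00));
}

/* the same operation on four packed bytes, as done in MixColumns above */
static ULONG Mul2Packed(ULONG col)
{
  ULONG mask = (col & 0x80808080) >> 7;
  return ((col << 1) & 0xFEFEFEFE) ^ (mask * 0x1B);
}
```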

Now our code is almost two times faster, while still staying pure C, and in spite of using the strongest optimizer GCC can offer. This shows the power of hand-tuning code :-). The v5 code.

Machine        CPU        Clock     Speed             Throughput   vs v1
Efika          MPC5200B    400 MHz  321 cycles/byte    1.19 MB/s   ×1.05
PowerMac G4    G4 7400     500 MHz  169 cycles/byte    2.83 MB/s   ×1.38
Pegasos I      G3 750FX    600 MHz  169 cycles/byte    3.40 MB/s   ×1.40
Pegasos II     G4 7447A   1000 MHz  161 cycles/byte    5.95 MB/s   ×1.95
PowerMac G4    G4 7455    1420 MHz  160 cycles/byte    8.49 MB/s   ×1.96
PowerBook G4   G4 7447A   1670 MHz  161 cycles/byte    9.91 MB/s   ×1.97
PowerMac G5    G5 970FX   2000 MHz  170 cycles/byte   11.25 MB/s   ×1.49
PowerMac G5    G5 970FX   2300 MHz  183 cycles/byte   11.99 MB/s   ×1.38
PowerMac G5    G5 970FX   2700 MHz  172 cycles/byte   15.01 MB/s   ×1.43

What next? It seems we have reached a wall here, and breaking through it will require the assembler demolition hammer. There are two things to consider. Firstly, I pass the cipher state and the round keys table to subfunctions by pointer, so the state and keys must have addresses and therefore must be stored in memory. They could fit into four and eight processor registers respectively, and PowerPC has 32 integer registers, so why not do it? The problem is that the subfunctions take the state, modify it and return, and there is no easy way to return multiple variables from a C function. It may be solved to some degree by extensive use of complex, parametrized macros. The second thing is the rlwimi PowerPC instruction. It is perfect when one has to extract a byte from a register and insert it as another byte of another register. It seems that GCC never uses it in this code, doing shifting with masking (rlwinm) and an OR instead.
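For illustration (the function name and the exact rlwimi operands are my sketch, assuming GCC inline assembly), a single rlwimi can for example take the top byte of one register and insert it as the second byte of another, a job that otherwise needs a rotate, a mask and an OR:

```c
#include <stdint.h>

typedef uint32_t ULONG;

/* What a single rlwimi can do: insert the top byte of src as the
   second byte of dst. The PowerPC path is a sketch; other targets
   use the equivalent C sequence (shift, mask, OR) that GCC would
   otherwise generate with rlwinm and an OR. */
static inline ULONG InsertTopByteAsSecond(ULONG dst, ULONG src)
{
#if defined(__powerpc__) && defined(__GNUC__)
  /* rotate src left by 24, insert bits 8..15 (big-endian numbering) */
  __asm__("rlwimi %0,%1,24,8,15" : "+r"(dst) : "r"(src));
  return dst;
#else
  return (dst & 0xFF00FFFF) | ((src >> 8) & 0x00FF0000);
#endif
}
```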

Part III »