Matrix
Notes on AES-256 Cipher Implementation on PowerPC Processors: Part V
Grzegorz Kraszewski

IIIIIIIV ● V

Time for AltiVec. Most of operations are easy to vectorize, but some create problems. The first AltiVec routine is hybrid, it means the dynamic key schedule is done with scalar code.

Overview of the code vectorization

The cipher state of AES-256 fits in one AltiVec register, and it is stored column by column, similarly as for the scalar code. Then let's look into vectorization possibilities. The AddRoundKey() vectorization is trivial, assuming we have current subkey and the state in two AltiVec registers. It is just one AV instruction. ShiftRows(), thanks to the powerful permutation instruction, reduces to a single instruction for the whole state as well. MixColumns() is only a bit harder. Of course it is not a single instruction, but thanks to improvements described in the previous chapter, it can be implemented very efficiently. What about SubBytes()? Seems hard, but the large AltiVec register file can fit the whole 256-byte lookup table. It takes 16 AV registers, leaving the other 16 free for computations. Then, combining permutation and selection instructions, byte substitution can be implemented in less than 32 processor instructions for the whole state, without a single memory access.

A bit unexpectedly, the most problematic part is the key schedule. Each cipher round consumes one subkey consisting of 4 32-bit words. Unfortunately a subkey is not just derived from previous ones. Every word of the schedule depends on the previous word. It means four words of a subkey have to be calculated one by one, and cannot be calculated in parallel. There are four options to solve this problem.

  1. Give up with the dynamic key schedule. Schedule would be precalculated to memory and fetched into AltiVec registers from there. While it is the easiest way, it increases number of memory reads and degrades security, as discussed in the chapter III.
  2. Calculate the schedule dynamically using scalar calculations. It is still relatively easy to do, as existing code may be reused. Furthermore there is an opportunity for (partially) parallel execution of the key schedule and the main cipher code. Unfortunately PowerPC has no means for direct data exchange between general purpose registers and AltiVec registers. Then calculated subkeys would have to be moved through memory. While it does not impact security that much, it may degrade performance.
  3. Calculate the schedule dynamically using AltiVec instructions. It may be inefficient, as often only 1/4 of a register will be used. However, as there will be no memory accesses it may turn to be the fastest option.
  4. Analyze the schedule algorithm and try to decouple subkey words. This would be the best option, but I don't know yet if it is possible at all.

AddRoundKey

As stated above, this is just one AltiVec instruction. Assume state is an AltiVec register containing the cipher state, subkey contains a subkey for the current round. Then the code will be:

state = vec_xor(state, subkey);

ShiftRows

One instruction as well. The permutation vector may be easily derived, as shown below:

Derivation of the permutation vector for ShiftRows().

Then the following code performs the operation. Of course the permutation vector is loaded once for all the blocks to be processed.

VECTOR_UBYTE shift_rows = {
  0x00, 0x05, 0x0A, 0x0F,
  0x04, 0x09, 0x0E, 0x03,
  0x08, 0x0D, 0x02, 0x07,
  0x0C, 0x01, 0x06, 0x0B
};

state = vec_perm(state, state, shift_rows);

SubBytes

This time consuming operation may be performed entirely inside AltiVec registers. The key instruction is again permutation. One can view it as a lookup in 32-byte table (two AV registers). It can be extended however. Imagine we need 64-byte table. Then we can do two permutations. One for two registers containing table values for indexes from 0 to 31. Then another one for two registers containing values for indexes from 32 to 63. Finally we have to select values from the first register for indexes lower than 32 or from the second register otherwise. When we need to cover 256 bytes, 8 such steps have to be performed. To generate the selection mask we can use vector compare instructions. I've implemented this operation as a macro, because it uses 16 vector variables containing the SBox. If I want them to be kept in registers, they should be local. Let's look at the macro:

#define SBOX16(state)
{
  VECTOR_UBYTE u, v, w, x, y;

  u = vec_splat(sbox5, 4);
  v = u;
  y = vec_perm(sbox0, sbox1, state);
  x = vec_cmplt(state, v);
  w = vec_perm(sbox2, sbox3, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sbox4, sbox5, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sbox6, sbox7, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sbox8, sbox9, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sboxA, sboxB, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sboxC, sboxD, state);
  v += u;
  y = vec_sel(w, y, x);
  x = vec_cmplt(state, v);
  w = vec_perm(sboxE, sboxF, state);
  state = vec_sel(w, y, x);
}

The macro needs some explanation. Note first, that I've ommitted all the backslashes at line ends for clarity. They are required in the code however, as for every multiline macro. I've assumed that state variable is of vector type and contains the cipher state. Then, sbox0 to sboxF are 16 vector variables containing the whole SBox. I've enclosed the whole macro in a block and declared temporary vector variables inside the block. It tells the compiler they are not important outside the block, so it can reuse them later.

The temporary variable u contains bytes of value 32. It is used to generate selection mask. There are many ways to initialize a vector register to 16 bytes of 32, but they either involve a memory read or more than one instruction. Fortunately the SBox contains all the possible byte values so I can make use of it and vec_splat() instruction. I will repeat this trick in MixColumns to fill a register with bytes $1B.

In the first step I perform the lookup as if all the indexes were lower than 32. I use the first two SBox rows for this, sbox0 and sbox1 namely. Of course, for any index higher than 31, the value looked up is wrong now, but it will be overwritten later. Similarly I do another permutation, with sbox2 and sbox3. This gives me a set of values only valid for indexes between 32 and 63 including. Next I generate the first mask using value 32 and vec_cmplt() compare instruction. I get $FF for all the indexes lower than 32, and $00 otherwise. Finally I use this mask for selection. $FF bytes select lookup results valid for 0–31 range, $00 bytes select results valid for range 32–63. Then, after selection I have a vector containing valid lookup results for indexes from 0 to 63, and bogus values elsewhere. I continue then, doing permutation with sbox5 and sbox6. Variable u used as a selection mask threshold is increased to 64. A new mask is created, bytes $FF select already valid values for indexes up to 63, bytes $00 select new valid values for indexes from 64 up to 95 and still bad values for higher indexes.

The cycle is then repeated, mask threshold is increased by 32 in each step. After 8 lookup permutations and 7 selections, the complete lookup is finished. Every line of the macro compiles to a single AltiVec instruction. It makes 30 instructions for 16 lookups.

MixColumns

The MixColumns code is based on ideas described in the part IV. It is implemented as a macro:

#define MIXCOLUMNS(state)
{
  VECTOR_UBYTE u, v, w, x, y;

  u = vec_splat(sbox4, 4);                             // 'u' filled with $1B
  w = vec_splat_u8(0);                                 // 'w' zeroed
  v = vec_cmplt((VECTOR_BYTE)state, (VECTOR_BYTE)w);   // signed comparision of the state with 0
  x = vec_sel(w, u, v);                                // $1B for bytes with MSB set
  w = vec_splat_u8(1);                                 // prep. for one bit shift left
  u = vec_sl(state, w);                                // shifting
  u = vec_xor(u, x);                                   // XOR with $1B masked -> GF-multiplied by 2
  v = vec_xor(u, state);                               // 'v' is the state GF-multiplied by 3
  w = vec_splat_u32(8);                                // prep. for column rolling
  v = vec_rl((VECTOR_ULONG)v, (VECTOR_ULONG)w);        // v = M3 rolled 1 byte left
  x = vec_rl((VECTOR_ULONG)state, (VECTOR_ULONG)w);
  x = vec_rl((VECTOR_ULONG)x, (VECTOR_ULONG)w);        // x = state rolled 2 bytes left
  y = vec_rl((VECTOR_ULONG)x, (VECTOR_ULONG)w);        // y = state rolled 3 bytes left
  state = y ^ x ^ u ^ v;
}

Similarly as in SubBytes(), macro is defined as a block with local variables inside, to optimize AltiVec registers usage by the compiler. The first instruction fills vector u with values $1B needed for Galois field multiplication by 2 (math behind it has been explained in the part I). Then a mask is created, containing $FF bytes where the state has bytes with most significant bit set. It is done by performing signed comparision with 0. Next vec_sel() creates a vector containing $00 where the state byte has highest bit cleared, and $1B otherwise. This vector (x) is finally XOR-ed with the original state shifted left by one bit (each byte is shifted separately). The same as in scalar code, state multiplied by 3 is created by XOR-ing the original with the state multiplied by 2.

The second stage is bitwise rotation of columns (32-bit words) of the original state and state multiplied by 3. Columns of state multiplied by 2 do not need this rotation. Finally rotated columns are XOR-ed together, to yield the final new state after column mixing.

Benchmarks of the hybrid code

The current AltiVec routine for AES-256 is hybrid, as it uses AltiVec for cipher operations and scalar execution units for the dynamic key schedule. Subkeys have to be stored in memory and fetched into AltiVec registers. In benchmarks there is no measurable difference between using −O2 and −O3 optimizer. The complete v10 code performance is listed below. Of course machines with processors not having AltiVec (Efika, Pegasos I) are not listed anymore. Unfortunately I also do not have results from my PowerMac G4 500 MHz, as its graphic card stopped to work.

PowerMac G4 G4 7455 1420 MHz 46 cycles/byte 29.44 MB/s ×6.781.48)
PowerBook G4 G4 7447A 1670 MHz 50 cycles/byte 31.85 MB/s ×6.341.36)
PowerMac G5 G5 970FX 2000 MHz 42 cycles/byte 45.41 MB/s ×6.021.57)
PowerMac G5 G5 970FX 2300 MHz 41 cycles/byte 53.50 MB/s ×6.171.61)
PowerMac G5 G5 970FX 2700 MHz 41 cycles/byte 62.80 MB/s ×6.001.61)

Speedup in parentheses is calculated relative to the fastest scalar routine.