
After building and testing the naive AES-256 implementation, it is time for some optimization. In this part I will apply the most obvious ones and see how they influence the code speed.

### PowerPC is not an 8-bit processor

A byte-wise implementation has only one advantage: it follows the cipher specification directly, so it is easy to understand. However, processing data one byte at a time on a 32-bit processor is simply a waste of its cycles. The cipher state fits in 4 words, so it can be easily stored in processor registers. There is just one question: should these registers store rows of the state, or its columns? Let's see:

- Plaintext is loaded by columns.
- *AddRoundKey* operates on columns.
- For *SubBytes* it doesn't really matter.
- *ShiftRows* operates on rows.
- *MixColumns* operates on columns.
- Encrypted text is unloaded by columns.

Most operations work on columns, so I've decided to store the state columns in four 32-bit words. The key history uses 32-bit words as well. Now *AddRoundKey* reduces to four XOR operations. Executing them in a loop just makes no sense, so I've unrolled it and also inlined it in three places in the main routine; jumping for just 4 CPU instructions is simply not effective. I have also unrolled loading the plaintext, initializing the key history and unloading the encrypted text. The complete source code of this optimization stage is here. Note that *AddRoundKey* vanished as a separate function. Switching the state and key history to 32 bits required changes in the core *Aes256* function as well as in the benchmark (mainly removing some typecasts and adding them in other places). Here are the results; the last column shows the relative speedup compared to v1:

| Machine | CPU | Clock | Cycles/byte | Throughput | Speedup vs v1 |
|---|---|---|---|---|---|
| Efika | MPC5200B | 400 MHz | 315 | 1.21 MB/s | ×1.07 |
| PowerMac G4 | G4 7400 | 500 MHz | 219 | 2.19 MB/s | ×1.07 |
| Pegasos I | G3 750FX | 600 MHz | 219 | 2.61 MB/s | ×1.08 |
| Pegasos II | G4 7447A | 1000 MHz | 282 | 3.39 MB/s | ×1.11 |
| PowerMac G4 | G4 7455 | 1420 MHz | 281 | 4.82 MB/s | ×1.11 |
| PowerBook G4 | G4 7447A | 1670 MHz | 286 | 5.58 MB/s | ×1.11 |
| PowerMac G5 | G5 970FX | 2000 MHz | 254 | 7.52 MB/s | ×1.00 |
| PowerMac G5 | G5 970FX | 2300 MHz | 246 | 8.94 MB/s | ×1.03 |
| PowerMac G5 | G5 970FX | 2700 MHz | 254 | 10.15 MB/s | ×0.97 |

Surprisingly, the G5 got no speedup at all, maybe because of more cracked instructions involved, or less effective filling of dispatch groups. Still, it uses fewer cycles per byte than the G4.

### Optimizing the key schedule

The key history has already been converted to ULONGs, so making *KeySchedule* run in 32 bits is the next logical step. All XORs (including the XOR with the *Rcon* value, done in every even round) can be converted directly to 32-bit ones. *RotWord* is now coded as two shifts and one OR. Finally, *SubBytes* is reworked. Unfortunately it still has to work byte by byte, so bytes are extracted from a subkey word, used as indexes into *SBox*, and the values read are put back into the word.

```c
void KeySchedule(ULONG *key_history, UBYTE round)
{
    ULONG eightkey, prevkey, sboxedkey;
    UBYTE k;
    ULONG rcon = 0x01000000 << (round >> 1);

    if ((round & 1) == 0)    /* even round */
    {
        eightkey = key_history[0];
        prevkey = key_history[7];

        /* RotWord */
        prevkey = (prevkey << 8) | ((prevkey >> 24) & 0xFF);

        /* SubBytes */
        k = (prevkey >> 24) & 0xFF;
        sboxedkey = SBox[k] << 24;
        k = (prevkey >> 16) & 0xFF;
        sboxedkey |= SBox[k] << 16;
        k = (prevkey >> 8) & 0xFF;
        sboxedkey |= SBox[k] << 8;
        k = prevkey & 0xFF;
        sboxedkey |= SBox[k];

        /* Rcon */
        sboxedkey ^= rcon;

        /* XOR-s */
        key_history[0] = sboxedkey ^ eightkey;
        key_history[1] = key_history[0] ^ key_history[1];
        key_history[2] = key_history[1] ^ key_history[2];
        key_history[3] = key_history[2] ^ key_history[3];
    }
    else                     /* odd round */
    {
        eightkey = key_history[4];
        prevkey = key_history[3];

        /* SubBytes */
        k = (prevkey >> 24) & 0xFF;
        sboxedkey = SBox[k] << 24;
        k = (prevkey >> 16) & 0xFF;
        sboxedkey |= SBox[k] << 16;
        k = (prevkey >> 8) & 0xFF;
        sboxedkey |= SBox[k] << 8;
        k = prevkey & 0xFF;
        sboxedkey |= SBox[k];

        /* XOR-s */
        key_history[4] = sboxedkey ^ eightkey;
        key_history[5] = key_history[4] ^ key_history[5];
        key_history[6] = key_history[5] ^ key_history[6];
        key_history[7] = key_history[6] ^ key_history[7];
    }
}
```

Usually optimizing the key schedule is not that important, as a given implementation typically uses a constant key and precalculates all the subkeys. As my approach assumes a variable key, it runs the schedule for each cipher block, so optimizing it should have a significant impact on speed. And indeed the measurements of the code v3 confirm it:

| Machine | CPU | Clock | Cycles/byte | Throughput | Speedup vs v1 |
|---|---|---|---|---|---|
| Efika | MPC5200B | 400 MHz | 283 | 1.35 MB/s | ×1.19 |
| PowerMac G4 | G4 7400 | 500 MHz | 194 | 2.46 MB/s | ×1.20 |
| Pegasos I | G3 750FX | 600 MHz | 198 | 2.90 MB/s | ×1.20 |
| Pegasos II | G4 7447A | 1000 MHz | 243 | 3.93 MB/s | ×1.29 |
| PowerMac G4 | G4 7455 | 1420 MHz | 242 | 5.60 MB/s | ×1.29 |
| PowerBook G4 | G4 7447A | 1670 MHz | 246 | 6.48 MB/s | ×1.29 |
| PowerMac G5 | G5 970FX | 2000 MHz | 182 | 10.48 MB/s | ×1.39 |
| PowerMac G5 | G5 970FX | 2300 MHz | 185 | 11.88 MB/s | ×1.37 |
| PowerMac G5 | G5 970FX | 2700 MHz | 181 | 14.27 MB/s | ×1.36 |

The G5 made a huge step this time, and is still ahead of the G4.

### Attempt to improve *SubBytes*

At first glance it seems impossible to gain anything here. The lookup table operates on bytes no matter what. However, as the state is now in 32-bit words, one may consider loading it as such and then extracting the bytes with shifts and masks, doing the lookup and reassembling the word. But that's not all. The next operation after *SubBytes* is always *ShiftRows*. As the state is stored as columns, shifting rows is another round of byte juggling. So we can merge both operations: after reading bytes from *SBox*, we can put them in **already shifted** positions. Then it can be coded as follows:

```c
void SubBytesShifted(ULONG *state)
{
    UBYTE p, q, r, s;

    /* row 0: not shifted */
    p = (state[0] >> 24) & 0xFF; p = SBox[p];
    q = (state[1] >> 24) & 0xFF; q = SBox[q];
    r = (state[2] >> 24) & 0xFF; r = SBox[r];
    s = (state[3] >> 24) & 0xFF; s = SBox[s];
    state[0] = (state[0] & 0x00FFFFFF) | (p << 24);
    state[1] = (state[1] & 0x00FFFFFF) | (q << 24);
    state[2] = (state[2] & 0x00FFFFFF) | (r << 24);
    state[3] = (state[3] & 0x00FFFFFF) | (s << 24);

    /* row 1: shifted left by one column */
    p = (state[0] >> 16) & 0xFF; p = SBox[p];
    q = (state[1] >> 16) & 0xFF; q = SBox[q];
    r = (state[2] >> 16) & 0xFF; r = SBox[r];
    s = (state[3] >> 16) & 0xFF; s = SBox[s];
    state[0] = (state[0] & 0xFF00FFFF) | (q << 16);
    state[1] = (state[1] & 0xFF00FFFF) | (r << 16);
    state[2] = (state[2] & 0xFF00FFFF) | (s << 16);
    state[3] = (state[3] & 0xFF00FFFF) | (p << 16);

    /* row 2: shifted left by two columns */
    p = (state[0] >> 8) & 0xFF; p = SBox[p];
    q = (state[1] >> 8) & 0xFF; q = SBox[q];
    r = (state[2] >> 8) & 0xFF; r = SBox[r];
    s = (state[3] >> 8) & 0xFF; s = SBox[s];
    state[0] = (state[0] & 0xFFFF00FF) | (r << 8);
    state[1] = (state[1] & 0xFFFF00FF) | (s << 8);
    state[2] = (state[2] & 0xFFFF00FF) | (p << 8);
    state[3] = (state[3] & 0xFFFF00FF) | (q << 8);

    /* row 3: shifted left by three columns */
    p = state[0] & 0xFF; p = SBox[p];
    q = state[1] & 0xFF; q = SBox[q];
    r = state[2] & 0xFF; r = SBox[r];
    s = state[3] & 0xFF; s = SBox[s];
    state[0] = (state[0] & 0xFFFFFF00) | s;
    state[1] = (state[1] & 0xFFFFFF00) | p;
    state[2] = (state[2] & 0xFFFFFF00) | q;
    state[3] = (state[3] & 0xFFFFFF00) | r;
}
```

OK, I've added a serious amount of code here. Then it turned out that this "v4" code is **slower** than "v3". When I saw it for the first time, I stepped back to "v2" and applied the next optimization to *MixColumns*, described in the next section. Then I re-added *SubBytesShifted*, and both optimizations together were faster than *MixColumns* alone. It is also worth noting that this "step back" effect of the above code happened only with the −O3 optimizer; with −O2 the speed increased. The v4 code.

| Machine | CPU | Clock | Cycles/byte | Throughput | Speedup vs v1 |
|---|---|---|---|---|---|
| Efika | MPC5200B | 400 MHz | 335 | 1.14 MB/s | ×1.01 |
| PowerMac G4 | G4 7400 | 500 MHz | 216 | 2.22 MB/s | ×1.08 |
| Pegasos I | G3 750FX | 600 MHz | 217 | 2.64 MB/s | ×1.09 |
| Pegasos II | G4 7447A | 1000 MHz | 255 | 3.74 MB/s | ×1.23 |
| PowerMac G4 | G4 7455 | 1420 MHz | 254 | 5.33 MB/s | ×1.23 |
| PowerBook G4 | G4 7447A | 1670 MHz | 258 | 6.17 MB/s | ×1.23 |
| PowerMac G5 | G5 970FX | 2000 MHz | 208 | 9.17 MB/s | ×1.22 |
| PowerMac G5 | G5 970FX | 2300 MHz | 206 | 10.67 MB/s | ×1.23 |
| PowerMac G5 | G5 970FX | 2700 MHz | 202 | 12.75 MB/s | ×1.22 |

### Optimizing *MixColumns*

As *MixColumns* is the most complex operation of AES-256, I've left it to the end of part II. What can be optimized here? My new code takes one cipher state column and multiplies the whole column in GF(2^8) by 2 and by 3. Then I can complete the matrix multiplication by selecting bytes from either the original column, the column multiplied by 2, or the column multiplied by 3, according to the matrix. I use the same scheme for multiplication by 2 as in previous versions, just doing 4 bytes at a time. First a mask register is prepared. It should contain $1B bytes where the column bytes have the highest bit set, and $00 bytes otherwise. The first step is to extract the highest bits of all four bytes by masking the column with $80808080. The next step is to convert all the $80 bytes to $1B bytes, while leaving the $00 bytes intact. There are two ways to do it:

- $1B has four "1" bits, so I can create $1B from $80 with 3 shifts and 2 ORs.
- I can shift the whole register 7 bits left to turn $80 bytes into $01 bytes, then multiply the register by $1B.

Then the mask should be XORed with the original column shifted left by one bit. But wait: not the whole column, but each byte separately, has to be shifted. To cancel the possible highest bits, which would otherwise shift in as the lowest bits of the next byte, I mask the whole register with $FEFEFEFE before XORing with the mask. Finally I can get a column GF(2^8)-multiplied by 3 as the XOR of the original column with the column multiplied by 2.

The last block performs the proper matrix multiplication by selecting bytes from the precalculated columns according to the matrix and XORing them together. Finally the transformed column is stored back into the cipher state.

```c
void MixColumns(ULONG *state)
{
    UBYTE c;

    for (c = 0; c < 4; c++)
    {
        ULONG mul2, mul3, news, mask, p, q, r, s;

        /* preparation of multiplied columns */
        mask = state[c] & 0x80808080;
        mul2 = (state[c] << 1) & 0xFEFEFEFE;
        mask >>= 7;
        mask *= 0x1B;
        mul2 ^= mask;
        mul3 = state[c] ^ mul2;

        /* matrix multiplication */
        p = (mul2 & 0xFF000000) | ((state[c] & 0xFF000000) >> 8)
          | ((state[c] & 0xFF000000) >> 16) | ((mul3 & 0xFF000000) >> 24);
        q = ((mul3 & 0x00FF0000) << 8) | (mul2 & 0x00FF0000)
          | ((state[c] & 0x00FF0000) >> 8) | ((state[c] & 0x00FF0000) >> 16);
        r = ((state[c] & 0x0000FF00) << 16) | ((mul3 & 0x0000FF00) << 8)
          | (mul2 & 0x0000FF00) | ((state[c] & 0x0000FF00) >> 8);
        s = ((state[c] & 0x000000FF) << 24) | ((state[c] & 0x000000FF) << 16)
          | ((mul3 & 0x000000FF) << 8) | (mul2 & 0x000000FF);
        news = p ^ q ^ r ^ s;

        /* back to state */
        state[c] = news;
    }
}
```

Now our code is almost two times faster, while still staying pure C, and in spite of using the strongest optimizer GCC can offer. This shows the power of code hand-tuning :-). The code v5.

| Machine | CPU | Clock | Cycles/byte | Throughput | Speedup vs v1 |
|---|---|---|---|---|---|
| Efika | MPC5200B | 400 MHz | 321 | 1.19 MB/s | ×1.05 |
| PowerMac G4 | G4 7400 | 500 MHz | 169 | 2.83 MB/s | ×1.38 |
| Pegasos I | G3 750FX | 600 MHz | 169 | 3.40 MB/s | ×1.40 |
| Pegasos II | G4 7447A | 1000 MHz | 161 | 5.95 MB/s | ×1.95 |
| PowerMac G4 | G4 7455 | 1420 MHz | 160 | 8.49 MB/s | ×1.96 |
| PowerBook G4 | G4 7447A | 1670 MHz | 161 | 9.91 MB/s | ×1.97 |
| PowerMac G5 | G5 970FX | 2000 MHz | 170 | 11.25 MB/s | ×1.49 |
| PowerMac G5 | G5 970FX | 2300 MHz | 183 | 11.99 MB/s | ×1.38 |
| PowerMac G5 | G5 970FX | 2700 MHz | 172 | 15.01 MB/s | ×1.43 |

What next? It seems that we have reached some wall here, and breaking through it will require the assembler demolition hammer. There are two things to consider. Firstly, I pass the cipher state and round keys table to subfunctions by pointer, so the state and keys must have an address and hence must be stored in memory. They could fit into four and eight processor registers respectively; PowerPC has 32 integer registers, so why not do it? The problem is that the subfunctions take the state, **modify** it and **return** it. There is no easy way to return multiple variables from a C function. It may be solved to some degree by extensive use of complex, parametrized macros. The second thing is the *rlwimi* PowerPC instruction. It is perfect when one has to extract a byte from one register and insert it as a different byte of another register. It seems that GCC never uses it in this code, doing shifting with masking (*rlwinm*) and a logical OR instead.