Notes on AES-256 Cipher Implementation on PowerPC Processors: Part III
Grzegorz Kraszewski


While I talked about assembler at the end of the previous part, I've decided to continue with C for now. Macros will allow me to gain more speed by avoiding passing pointers to subfunctions. I've also discovered how nasty the GCC code optimizer can be.

Mix one column at a time

As mentioned in the previous part, further improvements may be achieved by keeping the cipher state, and possibly also the partial key schedule, in processor registers. To do this in C, all the functions that receive the state or key schedule via pointers, modify them and return, must be reimplemented as macros. There is one exception, however. The MixColumns operation works on a single column at a time, and a single column can be returned as a 32-bit value. So I've written a new function, MixColumn, which takes a column as a 32-bit number, mixes it and returns it. I then call MixColumn four times in the appropriate places.

ULONG MixColumn(ULONG c)
{
  ULONG mul2, mul3, mask, p, q, r, s;

  /* preparation of premultiplied column */

  mask = c & 0x80808080;
  mul2 = (c << 1) & 0xFEFEFEFE;
  mask >>= 7;
  mask *= 0x1B;
  mul2 ^= mask;
  mul3 = c ^ mul2;

  /* matrix multiplication */

  p = (mul2 & 0xFF000000) | ((c & 0xFF000000) >> 8)
    | ((c & 0xFF000000) >> 16) | ((mul3 & 0xFF000000) >> 24);
  q = ((mul3 & 0x00FF0000) << 8) | (mul2 & 0x00FF0000)
    | ((c & 0x00FF0000) >> 8) | ((c & 0x00FF0000) >> 16);
  r = ((c & 0x0000FF00) << 16) | ((mul3 & 0x0000FF00) << 8)
    | ((mul2 & 0x0000FF00)) | ((c & 0x0000FF00) >> 8);
  s = ((c & 0x000000FF) << 24) | ((c & 0x000000FF) << 16)
   | ((mul3 & 0x000000FF) << 8) | (mul2 & 0x000000FF);
  c = p ^ q ^ r ^ s;
  return c;
}

This change alone does not increase the speed; in fact the code is a bit slower. Not a big surprise, as both the cipher state and the key history are still kept in memory. This is the code v6.

Machine        CPU         Clock      Speed             Throughput   Speedup
Efika          MPC5200B     400 MHz   320 cycles/byte    1.19 MB/s   ×1.06
PowerMac G4    G4 7400      500 MHz   170 cycles/byte    2.82 MB/s   ×1.38
Pegasos I      G3 750FX     600 MHz   170 cycles/byte    3.37 MB/s   ×1.39
Pegasos II     G4 7447A    1000 MHz   160 cycles/byte    5.97 MB/s   ×1.96
PowerMac G4    G4 7455     1420 MHz   161 cycles/byte    8.46 MB/s   ×1.95
PowerBook G4   G4 7447A    1670 MHz   162 cycles/byte    9.85 MB/s   ×1.96
PowerMac G5    G5 970FX    2000 MHz   177 cycles/byte   10.83 MB/s   ×1.44
PowerMac G5    G5 970FX    2300 MHz   175 cycles/byte   12.58 MB/s   ×1.45
PowerMac G5    G5 970FX    2700 MHz   170 cycles/byte   15.18 MB/s   ×1.45

Powered by Macros

Now it is time to make bigger changes in the code and convert all the subfunctions to macros. Let's start with a macro for a single byte substitution, optionally merged with the shift. The macro's 'src' and 'dst' parameters are 32-bit variables containing a column of the state, 'src_pos' and 'dst_pos' are byte positions in a column, from 0 to 3. There is also a 'tmp' 32-bit variable. Note that this macro assumes the destination column has been cleared beforehand. This saves one masking operation; on the other hand it means the source must be different from the destination.

#define SUB_BYTE(src, src_pos, dst, dst_pos, tmp) \
  tmp = ((src) >> ((3 - (src_pos)) * 8)) & 0xFF; \
  tmp = SBox[tmp]; \
  dst |= tmp << ((3 - (dst_pos)) * 8)

Using this macro, one can define the whole SubBytesShifted operation for a cipher round as another macro, which invokes SUB_BYTE 16 times. The macro reads the state from four 32-bit variables 's0' to 's3' and stores it back into four 32-bit variables 'r0' to 'r3'; it also requires that a 'tmp' variable is declared somewhere. The destination cipher state is cleared at the start of the macro.

#define SUB_BYTES_SHIFTED \
  r0 = 0;   r1 = 0;   r2 = 0;   r3 = 0; \
  SUB_BYTE(s0, 0, r0, 0, tmp); \
  SUB_BYTE(s1, 0, r1, 0, tmp); \
  SUB_BYTE(s2, 0, r2, 0, tmp); \
  SUB_BYTE(s3, 0, r3, 0, tmp); \
  SUB_BYTE(s0, 1, r3, 1, tmp); \
  SUB_BYTE(s1, 1, r0, 1, tmp); \
  SUB_BYTE(s2, 1, r1, 1, tmp); \
  SUB_BYTE(s3, 1, r2, 1, tmp); \
  SUB_BYTE(s0, 2, r2, 2, tmp); \
  SUB_BYTE(s1, 2, r3, 2, tmp); \
  SUB_BYTE(s2, 2, r0, 2, tmp); \
  SUB_BYTE(s3, 2, r1, 2, tmp); \
  SUB_BYTE(s0, 3, r1, 3, tmp); \
  SUB_BYTE(s1, 3, r2, 3, tmp); \
  SUB_BYTE(s2, 3, r3, 3, tmp); \
  SUB_BYTE(s3, 3, r0, 3, tmp)

The SUB_BYTE macro is also used for the key schedule. To avoid conditional jumps I've split the schedule into two separate macros, one for even rounds and one for odd ones. Both macros assume the subkey history is stored in eight 32-bit variables, named 'kh0' to 'kh7', and also assume some other 32-bit variables: 'key1', 'key8', 'sbkey' and 'tmp'. An 'rcon' variable also has to be defined and initialized to 0x01000000. As there are now two key schedule macros, the main loop has to be partially unrolled, so one iteration now does an even round followed by an odd one. Round 13 has been moved out of the loop.

#define KEY_SCHEDULE_EVEN \
  key8 = kh0; \
  key1 = kh7; \
  sbkey = 0; \
  key1 = (key1 << 8) | ((key1 >> 24) & 0xFF); \
  SUB_BYTE(key1, 0, sbkey, 0, tmp); \
  SUB_BYTE(key1, 1, sbkey, 1, tmp); \
  SUB_BYTE(key1, 2, sbkey, 2, tmp); \
  SUB_BYTE(key1, 3, sbkey, 3, tmp); \
  sbkey ^= rcon; \
  rcon <<= 1; \
  kh0 = sbkey ^ key8; \
  kh1 = kh0 ^ kh1; \
  kh2 = kh1 ^ kh2; \
  kh3 = kh2 ^ kh3

#define KEY_SCHEDULE_ODD \
  key8 = kh4; \
  key1 = kh3; \
  sbkey = 0; \
  SUB_BYTE(key1, 0, sbkey, 0, tmp); \
  SUB_BYTE(key1, 1, sbkey, 1, tmp); \
  SUB_BYTE(key1, 2, sbkey, 2, tmp); \
  SUB_BYTE(key1, 3, sbkey, 3, tmp); \
  kh4 = sbkey ^ key8; \
  kh5 = kh4 ^ kh5; \
  kh6 = kh5 ^ kh6; \
  kh7 = kh6 ^ kh7

The v7 code uses all these macros and the MixColumn function. There is also another important change. The Aes256 function now uses all 32 processor registers; 20 of them have to be saved when entering the function and restored at exit. Doing this for every cipher block would be overkill, so the main loop running over the data has been moved inside Aes256. As a consequence, my routine now encrypts in ECB mode. Not very useful in itself, but I expect CBC and CTR modes will not be much slower.

GCC optimizer surprise

I was very surprised after compiling and testing the v7 code. All the effort put into macros seemed to be wasted, as v7 was a bit slower than v6. Before giving up, I decided to compile with -O2 instead of -O3. That was it! The speedup is huge, see the results:

Machine        CPU         Clock      Speed             Throughput   Speedup
Efika          MPC5200B     400 MHz   277 cycles/byte    1.38 MB/s   ×1.22
PowerMac G4    G4 7400      500 MHz   127 cycles/byte    3.77 MB/s   ×1.84
Pegasos I      G3 750FX     600 MHz   130 cycles/byte    4.42 MB/s   ×1.82
Pegasos II     G4 7447A    1000 MHz   113 cycles/byte    8.46 MB/s   ×2.78
PowerMac G4    G4 7455     1420 MHz   116 cycles/byte   11.76 MB/s   ×2.71
PowerBook G4   G4 7447A    1670 MHz   114 cycles/byte   14.00 MB/s   ×2.79
PowerMac G5    G5 970FX    2000 MHz   129 cycles/byte   14.83 MB/s   ×1.97
PowerMac G5    G5 970FX    2300 MHz   130 cycles/byte   16.90 MB/s   ×1.95
PowerMac G5    G5 970FX    2700 MHz   129 cycles/byte   20.07 MB/s   ×1.92

What's the conclusion? When optimizing code by hand, do not trust the GCC optimizer and always check with -O2. In this particular case even -O1 produces faster code than -O3 (but slower than -O2)... Unfortunately the G5 does not enjoy the same acceleration. This is probably caused by the high number of 'rlwinm' instructions in the code. The Aes256 function consists of 483 instructions, and 135 of them (28%) are 'rlwinm'. This instruction is unfortunately cracked (split into two internal operations) on the G5. The high number of these instructions is due to the byte-oriented nature of the cipher.

What about assembler then? Well, I had hoped that 'rlwimi' could give me some advantage. But it turns out that my macros insert bytes into pre-cleared registers, and in this case 'rlwimi' has no advantage over 'rlwinm'. Both instructions are cracked on the G5 and both use the same execution units. I'm also not sure I could compete with the compiler's instruction scheduler. So I've decided not to use assembler for the scalar code; I simply see no further improvements that would yield a substantial speed increase. Of course I've not yet played my best ace in the hole – the mighty AltiVec :-).

Precalculated key schedule and security

In most applications of a block cipher the key is constant, at least for some reasonable number of blocks. A precalculated key schedule, which I have avoided so far, therefore seems to be a good way to accelerate the cipher even more. I've decided to check how much speed can be gained from key schedule precalculation, so I've modified the v7 code a bit. Now the whole schedule is stored in a local array of 60 32-bit words and is precalculated completely with slightly modified KEY_SCHEDULE_EVEN and KEY_SCHEDULE_ODD macros. In fact, as it is done once for 5 million blocks in my benchmark, it need not be unrolled. But since the macros were already written, I've done an unrolled version; the speed difference is negligible anyway. The main cipher code then fetches subkeys from the schedule array. So I've got the code v8, and here are the benchmark results:

Machine        CPU         Clock      Speed             Throughput   Speedup
Efika          MPC5200B     400 MHz   262 cycles/byte    1.46 MB/s   ×1.29
PowerMac G4    G4 7400      500 MHz   116 cycles/byte    4.13 MB/s   ×2.02
Pegasos I      G3 750FX     600 MHz   121 cycles/byte    4.75 MB/s   ×1.96
Pegasos II     G4 7447A    1000 MHz   105 cycles/byte    9.12 MB/s   ×2.99
PowerMac G4    G4 7455     1420 MHz   104 cycles/byte   13.07 MB/s   ×3.01
PowerBook G4   G4 7447A    1670 MHz   106 cycles/byte   15.04 MB/s   ×3.00
PowerMac G5    G5 970FX    2000 MHz   119 cycles/byte   16.05 MB/s   ×2.13
PowerMac G5    G5 970FX    2300 MHz   120 cycles/byte   18.35 MB/s   ×2.12
PowerMac G5    G5 970FX    2700 MHz   119 cycles/byte   21.70 MB/s   ×2.07

The speed gain is 6 to 7% for both G4 and G5. Not as much as one might expect. Using a precalculated schedule also exposes it to malicious key extractors, as it is stored in RAM. MorphOS is a single address space system. The main line of security defence is the difficulty of running malicious code remotely. MorphOS usually runs very few (if any) network listening services, and most applications are non-standard. Of course they are compiled for PPC, making things even worse for an attacker. So MorphOS users are safe from typical massive vulnerability scans, as no one cares to target a system with two thousand users.

Things change, however, when we talk about a targeted attack. The attacking party can try to find a hole in the Odyssey browser's JS engine, and a social engineering attack may be easy, as MorphOS users tend to trust applications and other users. But all this is mostly security by obscurity. Once an attacker can run code on the attacked system, finding a key schedule in memory is a piece of cake. All processes run in one address space, so there is no need for clever side-channel techniques like cache timing. One just scans memory and checks for a schedule sequence. Because of the strictly defined dependencies between subkeys, and also because of the schedule length, identifying it in memory is an easy and precise process.

How does runtime key scheduling change the picture? The master key would still be in memory, but the master key alone is just 32 more or less random bytes. Identifying it as a key is much more difficult, as it requires analysis of the crypto library and the application. A partial schedule is kept in processor registers, so one may argue that it is unreachable for other applications. Unfortunately this is not true. CPU registers are stored on the application stack when the process scheduler gives the processor to another process – for example a malicious key extractor.

Part IV »