Matrix
Notes on AES-256 Cipher Implementation on PowerPC Processors: Part I
Grzegorz Kraszewski


AES-256 is one of the most popular symmetric ciphers today. This article describes my optimized implementation of the cipher on PowerPC (mainly G4) processors. I started with naive C code based directly on the cipher specification, went through C-level optimization, and finally made use of the powerful AltiVec unit. Results of code throughput measurements are given, as well as source code. The source code is available under the two-clause BSD licence.

Introduction

Note that this publication is not a regular scientific paper. While I have used scientific methodology, it has been neither formally peer-reviewed nor published in a journal or presented at a conference. Also, considering that AES-256 has been researched by cryptologists to death, you probably won't find any "big science" inside, rather software engineering. For that reason I will not use a formal, stiff style. I will not repeat the complete AES-256 cipher specification; the official document is FIPS-197, and Wikipedia also provides a good description along with many references for further reading.

When estimating code effectiveness, I use CPU clocks per encrypted byte, a measure commonly used for symmetric ciphers. Measurements were done by encrypting 20 MB of random data in RAM and timing the execution; the mean of several runs was taken. All the results were obtained on an Apple PowerBook G4 laptop with a G4 7447A processor running at 1.67 GHz, and verified on a Pegasos II computer, also with a 7447A, running at 1.0 GHz. Both systems were running the MorphOS operating system. Later I also obtained results from a range of other PowerPC hardware, listed at the end of this introduction.

When choosing implementation details I assumed two constraints. First, I wanted to avoid the large, 4 kB set of lookup tables. That implementation is known to be the fastest possible on general-purpose CPUs, but it has a security flaw: it allows side-channel attacks based on cache timing. Lookup tables are also a bit awkward with AltiVec, as they may require additional shifts and merging. On the other hand, a small 256-byte lookup table for the SubBytes() transformation seems unavoidable, at least for scalar (non-AltiVec) code. There are AES-256 implementations doing Galois field multiplication in code, but they are much slower; this possibility may be considered for AltiVec, however.

The second constraint is no precalculated key schedule. First, not all AES applications run with a constant key. Secondly, a precalculated schedule stored in memory is an easy target for malicious software scanning memory for keys. Therefore the schedule is computed on the fly for each data block and never stored as a whole. In the later, optimized attempt I will even manage to keep the partial key schedule in processor registers, so in theory it will not be visible to other processes. Note, however, that in a multitasking operating system CPU registers are stored on the process stack (so in RAM) when the process is scheduled out. More on this later, in the optimized version.

I would like to thank the people who ran my tests on their machines, allowing me to test the code on a wide range of hardware, from a tiny MPC5200B SoC running at 400 MHz up to a G5 970FX beast running at 2.7 GHz. They are: Robert "Phibrizzo" Krajcarz (Efika), Tomasz "virago" Okoniewski (Pegasos I), Michał "rzookol" Żukowski (PowerMac G4 1.42 GHz), Arkadiusz "Deez" Nowacki (PowerMac G5 2.0 GHz), Jerzy "Drako" Guc (PowerMac G5 2.3 GHz) and Wiktor "pampers" Głowacki (PowerMac G5 2.7 GHz).

Code execution time measurements

PPC processors provide a good way to measure code performance: the timebase register. This register is incremented periodically, on a clock synchronized with the CPU clock; basically, the timebase clock is just the CPU clock divided by some integer. On the MorphOS operating system, the timebase divider may be calculated from three factors:

  1. The CPU master clock frequency.
  2. The bus clock frequency.
  3. The bus clock to timebase clock ratio.

It should be noted that the CPU/bus divider need not be an integer; it may also fall halfway between two integers (for example 3.5). The bus/timebase divider is an integer, but as it may be odd, the final CPU/timebase divider may again be a half-integer. To avoid fractions one can prescale the CPU clock by 2 before all the calculations and remember to scale back at the end.

The timebase register is 64 bits wide; unfortunately, on 32-bit processors it is split into two halves, and the upper half is incremented when the lower one overflows. It may happen that the overflow occurs just between reading the two halves, in which case the result is invalid. To avoid this, the register is read as follows:

UQUAD get_time_base(void)
{
  ULONG hi0, hi1, lo;

  do
  {
    asm volatile("mftbu %0" : "=r" (hi0));
    asm volatile("mftb %0" : "=r" (lo));
    asm volatile("mftbu %0" : "=r" (hi1));
  }
  while (hi0 != hi1);

  return (((UQUAD)hi0 << 32) | lo);
}

This simple function reads the high timebase word first with the mftbu instruction, then reads the low word with mftb, and then reads the high word again. Finally it compares the two values of the high word: if they are the same, no low-word overflow happened in the meantime, and the result is valid. If an overflow is detected, the whole reading process is repeated. The overhead added by repeating the read is negligible, as the resolution of the timebase is usually no better than 10 CPU clocks. Also, the chances of encountering a low-word overflow are, well, low.

Having this function at hand, one can create a complete benchmarking framework. Note that unlike get_time_base() above, which is generic and applicable to any PowerPC platform, the code below is MorphOS-specific.

UQUAD start, end, busclock, cpuclock, cycles_used;
ULONG busticks, busmultiplier;

NewGetSystemAttrs(&busticks, sizeof(ULONG), SYSTEMINFOTYPE_PPC_BUSTICKS, NULL);
NewGetSystemAttrs(&busclock, sizeof(UQUAD), SYSTEMINFOTYPE_PPC_BUSCLOCK, NULL);
NewGetSystemAttrs(&cpuclock, sizeof(UQUAD), SYSTEMINFOTYPE_PPC_CPUCLOCK, NULL);

busmultiplier = (ULONG)((cpuclock << 1) / busclock) * busticks;
start = get_time_base();

/* Code to be benchmarked */

end = get_time_base();
cycles_used = ((end - start) * busmultiplier) >> 1;

The busmultiplier variable is, as said before, prescaled by 2 (by a left shift) to avoid fractions. Consequently, the final result is scaled back by shifting it right.

When benchmarking on laptops (like PowerBooks or iBooks) one has to be aware of the processor's DFS (Dynamic Frequency Scaling) feature: the CPU switches to half of its nominal clock frequency when not loaded with work, which of course skews benchmark results. To avoid DFS, one can run some CPU-intensive task at low priority in the background to "warm up" the processor and kick it out of the DFS state; I used the distributed.net client for this purpose. Also, to avoid interference from background tasks, the benchmark code raises its own priority from 0 to +1 for the duration of the benchmark.

The straightforward attempt

The complete AES-256 cipher consists of four basic operations, named in the specification SubBytes, ShiftRows, MixColumns and AddRoundKey, plus the key schedule, which also uses SubBytes and some primitive operations. All these operations are defined in terms of bytes, so my initial implementation also operates on bytes, which is, as one may guess, highly inefficient at times. Anyway, it works as a starting point. The cipher state is stored as an array of 16 unsigned bytes; (sub)keys are likewise treated as arrays of bytes.

AddRoundKey

The first operation is AddRoundKey. It simply XORs a 16-byte round key (derived from the master 32-byte key with the key schedule) with the cipher state:

void AddRoundKey(UBYTE *roundkey, UBYTE *state)
{
  UBYTE i;

  for (i = 0; i < 16; i++) state[i] ^= roundkey[i];
}

SubBytes

Code for SubBytes is similarly simple: it just applies the "SBox" lookup table 16 times. I will not reproduce the table here; it can be found in numerous places (including the original FIPS-197 document) as well as in my source code linked below.

void SubBytes(UBYTE *state)
{
  UBYTE i;

  for (i = 0; i < 16; i++) state[i] = SBox[state[i]];
}

ShiftRows

The next operation is ShiftRows. It just shuffles the state array. Note that the array is indexed in column order, not row order: byte r of column c sits at index 4c + r. The first state row is not shifted.

void ShiftRows(UBYTE *state)
{
  UBYTE a;

  a = state[1]; state[1] = state[5]; state[5] = state[9]; state[9] = state[13]; state[13] = a;
  a = state[2]; state[2] = state[10]; state[10] = a; a = state[6]; state[6] = state[14]; state[14] = a;
  a = state[3]; state[3] = state[15]; state[15] = state[11]; state[11] = state[7]; state[7] = a;
}

MixColumns

Now it is time for MixColumns, which is a bit more complicated. On the surface it is just matrix multiplication: each state column is multiplied by a fixed 4×4 matrix. One must consider, however, that this multiplication is not ordinary arithmetic but is done in the Galois field GF(2⁸), the same field used by SubBytes. Vector-by-matrix multiplication breaks down to scalar additions (which are XOR operations in GF(2⁸)) and scalar multiplications, which are complex operations from the processor's point of view. Fortunately the matrix coefficients are only 1, 2 and 3. Multiplication by 1 is identity. Multiplication by 2 is a plain multiplication by 2 (replaced with a bit shift), followed by XOR with $1B if the highest bit before the shift was 1. Finally, Galois field multiplication by 3 is done by XOR-ing the original value with the value multiplied by 2. The whole column mixing can then be coded as follows:

void MixColumns(UBYTE *state)
{
  UBYTE c, i;
  for (c = 0; c < 16; c += 4)
  {
    UBYTE mul2[4];
    UBYTE mul3[4];
    UBYTE news[4];

    /* preparation of multiplied column */

    for (i = 0; i < 4; i++)
    {
      UBYTE mask;

      mask = (state[c + i] & 0x80) ? 0x1B : 0x00;
      mul2[i] = state[c + i] << 1;
      mul2[i] ^= mask;
      mul3[i] = mul2[i] ^ state[c + i];
    }

    /* now matrix multiplication */

    news[0] = mul2[0] ^ mul3[1] ^ state[c + 2] ^ state[c + 3];
    news[1] = state[c] ^ mul2[1] ^ mul3[2] ^ state[c + 3];
    news[2] = state[c] ^ state[c + 1] ^ mul2[2] ^ mul3[3];
    news[3] = mul3[0] ^ state[c + 1] ^ state[c + 2] ^ mul2[3];

    /* back to state */

    for (i = 0; i < 4; i++) state[c + i] = news[i];
  }
}

KeySchedule

The key schedule produces 15 round keys (16 bytes each) out of the 32-byte master key. In fact only 13 new round keys have to be produced, as the first two are just the halves of the master key. The key schedule algorithm operates on 32-bit words, so the complete schedule consists of 60 words: 8 of them form the master key, 52 are generated. A new word is obtained by XOR-ing word [-8] in the history with word [-1]. Additionally, for the first word of each round key, the [-1] word goes through SubBytes before the XOR. For even rounds the [-1] word is also rotated before SubBytes and XOR-ed with the round-constant word ($01000000 initially) afterwards. The constant is shifted left one bit after each even round.

For on-the-fly key scheduling only last 8 words (two round keys) have to be stored. I use a circular buffer for them.

void KeySchedule(UBYTE *key_history, UBYTE round)
{
  UBYTE eightkey[4], prevkey[4];
  UBYTE i, k;
  UBYTE rcon = 0x01 << (round >> 1);

  if ((round & 1) == 0)      /* even round */
  {
    for (i = 0; i < 4; i++) eightkey[i] = key_history[i];
    for (i = 0; i < 4; i++) prevkey[i] = key_history[28 + i];

    /* RotWord */

    k = prevkey[0];
    prevkey[0] = prevkey[1];
    prevkey[1] = prevkey[2];
    prevkey[2] = prevkey[3];
    prevkey[3] = k;

    /* SubBytes */

    for (i = 0; i < 4; i++) prevkey[i] = SBox[prevkey[i]];

    /* Rcon */

    prevkey[0] ^= rcon;

    for (i = 0; i < 4; i++) key_history[i] = prevkey[i] ^ eightkey[i];
    for (i = 0; i < 4; i++) key_history[4 + i] = key_history[i] ^ key_history[4 + i];
    for (i = 0; i < 4; i++) key_history[8 + i] = key_history[4 + i] ^ key_history[8 + i];
    for (i = 0; i < 4; i++) key_history[12 + i] = key_history[8 + i] ^ key_history[12 + i];
  }
  else      /* odd round */
  {
    for (i = 0; i < 4; i++) eightkey[i] = key_history[16 + i];
    for (i = 0; i < 4; i++) prevkey[i] = key_history[12 + i];

    /* SubBytes */

    for (i = 0; i < 4; i++) prevkey[i] = SBox[prevkey[i]];

    for (i = 0; i < 4; i++) key_history[16 + i] = prevkey[i] ^ eightkey[i];
    for (i = 0; i < 4; i++) key_history[20 + i] = key_history[16 + i] ^ key_history[20 + i];
    for (i = 0; i < 4; i++) key_history[24 + i] = key_history[20 + i] ^ key_history[24 + i];
    for (i = 0; i < 4; i++) key_history[28 + i] = key_history[24 + i] ^ key_history[28 + i];
  }
}

Results

Having all the building blocks, one can write a complete AES-256 naive byte-wise implementation and benchmark it. I've compiled code for general PowerPC architecture (no -mcpu, no -mtune) with -O3 optimization. Results are as follows:

Machine        CPU        Clock      Speed             Throughput
Efika          MPC5200B    400 MHz   338 cycles/byte    1.13 MB/s
PowerMac G4    G4 7400     500 MHz   233 cycles/byte    2.05 MB/s
Pegasos I      G3 750FX    600 MHz   237 cycles/byte    2.42 MB/s
Pegasos II     G4 7447A   1000 MHz   313 cycles/byte    3.05 MB/s
PowerMac G4    G4 7455    1420 MHz   312 cycles/byte    4.34 MB/s
PowerBook G4   G4 7447A   1670 MHz   317 cycles/byte    5.02 MB/s
PowerMac G5    G5 970FX   2000 MHz   253 cycles/byte    7.53 MB/s
PowerMac G5    G5 970FX   2300 MHz   253 cycles/byte    8.66 MB/s
PowerMac G5    G5 970FX   2700 MHz   246 cycles/byte   10.47 MB/s

Considering that hand-tuned assembler routines for x86 processors, using SSE2 and a precalculated key schedule, achieve 15 to 30 cycles per byte, my result is not very interesting, to put it mildly. One may also note that the latest x86_64 units achieve 3.5 cycles per byte, but they have almost the whole cipher implemented in hardware; it would be hard to compete with them. Anyway, this first naive code is not expected to be fast. It just works, which is verified in the code by encrypting an example key and plaintext from the FIPS-197 paper and checking that the result matches the expected ciphertext.

Part II »