Duff's Device

Duff’s Device is probably one of the most puzzling pieces of C code to date, and Duff shows us the incredible properties of the switch statement. Understanding Duff’s Device helps us implement a plain coroutine.

Before we start introducing the Duff device, we need to understand the concept called loop unrolling.

loop unrolling is an optimization method to speed up the execution of a program at the expense of its size. It can be done by the programmer or automatically by the compiler. loop unrolling is most often used to reduce loop overhead and provide instruction-level parallelism for processors with multiple functional units. It also facilitates the scheduling of instruction pipelines.

An example code for a loop unfolding is as follows.

// 正常情况下的 for 循环，
// 需要连续迭代 100000000 次
int sum = 0;
for (int i = 0; i < 100000000; i++) {
    sum += i;
}

// 每次循环展开 5 次，
// 只用迭代 20000000 次即可
int sum = 0;
for (int i = 0; i < 100000000; i += 5) {
    sum += i;
    sum += i + 1;
    sum += i + 2;
    sum += i + 3;
    sum += i + 4;
}

The reader can click on loop-unroll.c for the complete code, compile it and run it, and get the following results.

compiled results

It can be seen that the unroll is about 47 ms faster than the normal case during the 100000000 summations. Although this is a negligible improvement in 2020, it should be significant in the ancient era when hardware performance was weak. Major compilers now also provide optimization options for loop expansion, for example, Clang provides #pragma unroll, eliminating the need for the programmer to manually unroll.

Now let’s get to know what a Duff device is. I believe nothing could be more accurate than Duff’s own description, and in fact we can trace it back to Duff’s email describing the idea in November 1983. The text is from Duff’s own email, italics are additional.

The following program (routine), extracted from Evans & Sutherland Picture System II, is used to copy a short array to the IO data register.

send(to, from, count)
register short *to, *from;
register count;
{
    do
        *to = *from++;
    while (--count > 0);
}

(obviously, it stops when count is 0)

The C compiler for VAX compiles the above loop into two instructions (I think they are the movw instruction and the sobleq instruction). As it turns out, this program is the bottleneck of the real-time animation program, slowing it down by about 50%. The standard way to deal with this is to reduce the number of calls to the sobleq instruction by expanding it in a loop, thus achieving a higher speed. When you do this, you also have to deal with the remainder of the data at the end of the loop (note that the number of loops is not necessarily divisible by the number of expansions, so you need to deal with the extra remainder that is not divisible). My habit is to make a copy of the Loop unrolling statement as a switch statement. Of course, if I were writing assembly, I would choose to jump directly to the Loop unrolling statement. With that in mind, I wrote the following implementation yesterday.

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

Feeling nasty? But it gets compiled and runs fine. I’m both proud and disgusted by this discovery. If no one had thought of this implementation before, I was going to name it after myself.

To my surprise, after ten years of writing C, there are still little corners of the language that I haven’t explored (in fact, I have another way to implement state machine interrupts using switch, but it’s also horrible to implement).

Many people (even bwk?) have said that the worst feature of C is that the switch statement doesn’t automatically break before it has traversed all the cases, and this code supports that view, but I’m not sure if I’m for or against it.

That’s all there is to Duff’s email. Some readers may not quite understand what the text means by “you still have to deal with the part of the data left after the loop ends”, so we can take a look at the equivalent code of Duff’s device under normal circumstances (code example from Wikipedia:)

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;

    // 循环次数不一定能被展开次数整除，
    // 所以需要额外处理整除不尽的余数；
    // 此处 switch 没有 break 语句，
    // 会自动从余数匹配的 case 向下遍历执行（falls through）
    switch (count % 8) {
        case 0: *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
    }

    // 每次循环展开 8 次
    while (--n > 0) {
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
    }
}

As you can see, Duff has combined the normally separate switch and while statements into a confusingly interleaved switch and while code, but the code size has been reduced by almost half. Normally we would treat the scope of the case statement as a single block, but the Duff device shows that this is not the case. So why is it possible to write it this way?

We can consult the latest C17 draft PDF version, jumping directly to in 6.8 Statements and blocks, we can see that the case statements belong to Labeled statements (statements with labels.

Syntax definition of Labeled statements

Any statement may be preceded by a prefix that declares an identifier as a label name. Labels in themselves do not alter the flow of control, which continues unimpeded across them.

The clever thing about the Duff device is that it cleverly takes advantage of the fact that Labeled statements do not change the syntactic definition of the control flow; whereas normally we would treat the scope of a case statement as a separate block of code, a mere best practice for code style, and not a strong compiler constraint.

When the Duff device starts to run, it will first match the corresponding case statement based on switch, and since break is not declared, it will execute all the way down (falls through) until it is caught by while and enters the loop logic. Here is a copy of the C code duff-like.c and the corresponding assembly file duff-like.s that simulates the Duff device, the reader can easily verify the running logic by focusing on the jump instructions like ja, jg, jmp, jmpq, etc., so we won’t expand on them here.

Duff devices are specific to a particular time period, and it is almost impossible to see such a loop unfolding implementation nowadays.