Duff’s Device is probably one of the most puzzling pieces of C code to date, and Duff shows us the incredible properties of the switch statement. Understanding Duff’s Device helps us implement a plain coroutine.
Before we start introducing the Duff device, we need to understand the concept called loop unrolling.
loop unrolling is an optimization method to speed up the execution of a program at the expense of its size. It can be done by the programmer or automatically by the compiler. loop unrolling is most often used to reduce loop overhead and provide instruction-level parallelism for processors with multiple functional units. It also facilitates the scheduling of instruction pipelines.
An example code for a loop unfolding is as follows.
The reader can click on loop-unroll.c for the complete code, compile it and run it, and get the following results.
It can be seen that the unroll is about 47 ms faster than the normal case during the 100000000 summations. Although this is a negligible improvement in 2020, it should be significant in the ancient era when hardware performance was weak. Major compilers now also provide optimization options for loop expansion, for example, Clang provides #pragma unroll, eliminating the need for the programmer to manually unroll.
Now let’s get to know what a Duff device is. I believe nothing could be more accurate than Duff’s own description, and in fact we can trace it back to Duff’s email describing the idea in November 1983. The text is from Duff’s own email, italics are additional.
The following program (routine), extracted from Evans & Sutherland Picture System II, is used to copy a short array to the IO data register.
(obviously, it stops when
count is 0)
The C compiler for VAX compiles the above loop into two instructions (I think they are the
movw instruction and the
sobleq instruction). As it turns out, this program is the bottleneck of the real-time animation program, slowing it down by about 50%. The standard way to deal with this is to reduce the number of calls to the
sobleq instruction by expanding it in a loop, thus achieving a higher speed. When you do this, you also have to deal with the remainder of the data at the end of the loop (note that the number of loops is not necessarily divisible by the number of expansions, so you need to deal with the extra remainder that is not divisible). My habit is to make a copy of the Loop unrolling statement as a switch statement. Of course, if I were writing assembly, I would choose to jump directly to the Loop unrolling statement. With that in mind, I wrote the following implementation yesterday.
Feeling nasty? But it gets compiled and runs fine. I’m both proud and disgusted by this discovery. If no one had thought of this implementation before, I was going to name it after myself.
To my surprise, after ten years of writing C, there are still little corners of the language that I haven’t explored (in fact, I have another way to implement state machine interrupts using switch, but it’s also horrible to implement).
Many people (even bwk?) have said that the worst feature of C is that the switch statement doesn’t automatically break before it has traversed all the cases, and this code supports that view, but I’m not sure if I’m for or against it.
That’s all there is to Duff’s email. Some readers may not quite understand what the text means by “you still have to deal with the part of the data left after the loop ends”, so we can take a look at the equivalent code of Duff’s device under normal circumstances (code example from Wikipedia:)
As you can see, Duff has combined the normally separate switch and while statements into a confusingly interleaved switch and while code, but the code size has been reduced by almost half. Normally we would treat the scope of the case statement as a single block, but the Duff device shows that this is not the case. So why is it possible to write it this way?
We can consult the latest C17 draft PDF version, jumping directly to in 6.8 Statements and blocks, we can see that the case statements belong to Labeled statements (statements with labels.
Any statement may be preceded by a prefix that declares an identifier as a label name. Labels in themselves do not alter the flow of control, which continues unimpeded across them.
The clever thing about the Duff device is that it cleverly takes advantage of the fact that Labeled statements do not change the syntactic definition of the control flow; whereas normally we would treat the scope of a case statement as a separate block of code, a mere best practice for code style, and not a strong compiler constraint.
When the Duff device starts to run, it will first match the corresponding case statement based on switch, and since break is not declared, it will execute all the way down (falls through) until it is caught by while and enters the loop logic. Here is a copy of the C code duff-like.c and the corresponding assembly file duff-like.s that simulates the Duff device, the reader can easily verify the running logic by focusing on the jump instructions like
jmpq, etc., so we won’t expand on them here.
Duff devices are specific to a particular time period, and it is almost impossible to see such a loop unfolding implementation nowadays.