# 10.6: Generating Small FFT Kernels


The base cases of FFTW's recursive plans are its codelets, and these form a critical component of FFTW's performance. They consist of long blocks of highly optimized, straight-line code, implementing many special cases of the DFT that give the planner a large space of plans in which to optimize. Not only was it impractical to write numerous codelets by hand, but we also needed to rewrite them many times in order to explore different algorithms and optimizations. Thus, we designed a special-purpose “FFT compiler” called genfft that produces the codelets automatically from an abstract description.

A typical codelet in FFTW computes a DFT of a small, fixed size $$n$$ (usually, $$n\leq 64$$), possibly with the input or output multiplied by twiddle factors (see Cooley-Tukey plans). Several other kinds of codelets can be produced by genfft, but we will focus here on this common case.

In principle, all codelets implement some combination of the Cooley-Tukey algorithm from the equation and/or some other DFT algorithm expressed by a similarly compact formula. However, a high-performance implementation of the DFT must address many more concerns than the equation alone suggests. For example, the equation contains multiplications by 1 that are more efficient to omit. The equation entails a run-time factorization of $$n$$, which can be precomputed if $$n$$ is known in advance. The equation operates on complex numbers, but breaking the complex-number abstraction into real and imaginary components turns out to expose certain non-obvious optimizations. Additionally, to exploit the long pipelines in current processors, the recursion implicit in the equation should be unrolled and re-ordered to a significant degree. Many further optimizations are possible if the complex input is known in advance to be purely real (or imaginary). Our design goal for genfft was to keep the expression of the DFT algorithm independent of such concerns. This separation allowed us to experiment with various DFT algorithms and implementation strategies independently and without (much) tedious rewriting.
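To make these concerns concrete, here is a minimal hand-written sketch (not genfft output) of what an unrolled size-4 DFT looks like once complex numbers are split into real and imaginary parts. The function names `dft_naive` and `dft4_straight_line` are hypothetical, and the forward-transform sign convention $$e^{-2\pi i jk/n}$$ is an assumption. For $$n = 4$$ the twiddle factors are $$1, -i, -1, i$$, so every multiplication collapses into a sign change or a swap of real and imaginary parts.

```python
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT from the definition, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft4_straight_line(xr, xi):
    """Size-4 DFT on separate real/imaginary arrays, using only additions."""
    # Size-2 butterflies on inputs (0, 2) and on inputs (1, 3).
    ar, ai = xr[0] + xr[2], xi[0] + xi[2]
    br, bi = xr[0] - xr[2], xi[0] - xi[2]
    cr, ci = xr[1] + xr[3], xi[1] + xi[3]
    dr, di = xr[1] - xr[3], xi[1] - xi[3]
    # Multiplying d by -i yields (di, -dr); multiplying by +i yields (-di, dr).
    # No floating-point multiplication is needed anywhere.
    yr = [ar + cr, br + di, ar - cr, br - di]
    yi = [ai + ci, bi - dr, ai - ci, bi + dr]
    return yr, yi

if __name__ == "__main__":
    x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
    yr, yi = dft4_straight_line([v.real for v in x], [v.imag for v in x])
    for k, ref in enumerate(dft_naive(x)):
        print(k, complex(yr[k], yi[k]), ref)
```

The straight-line version uses 16 real additions and no multiplications, whereas a literal evaluation of the definition performs complex multiplications by factors that happen to be $$\pm 1$$ or $$\pm i$$.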

genfft is structured as a compiler whose input consists of the kind and size of the desired codelet, and whose output is C code. It operates in four phases: creation, simplification, scheduling, and unparsing.

In the creation phase, genfft produces a representation of the codelet in the form of a directed acyclic graph (dag). The dag is produced according to well-known DFT algorithms: Cooley-Tukey (the equation), prime-factor, split-radix, and Rader. Each algorithm is expressed in a straightforward math-like notation, using complex numbers, with no attempt at optimization. Unlike a normal FFT implementation, however, the algorithms here are evaluated symbolically, and the resulting symbolic expression is represented as a dag. In particular, it can be viewed as a linear network (in which the edges represent multiplication by constants and the vertices represent additions of the incoming edges).
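The idea of evaluating an FFT symbolically can be sketched in a few lines. The following is an illustrative toy, not genfft's actual representation: the node types `Load`, `Times`, and `Plus` and the functions `fft_symbolic` and `evaluate` are hypothetical names, and only a radix-2 Cooley-Tukey recursion for power-of-two sizes is shown. The recursion is written as an ordinary FFT, but it is applied to symbolic inputs, so what comes back is an expression graph rather than numbers.

```python
import cmath
from dataclasses import dataclass

@dataclass(frozen=True)
class Load:            # symbolic input x[index]
    index: int

@dataclass(frozen=True)
class Times:           # an edge of the linear network: constant * arg
    const: complex
    arg: object

@dataclass(frozen=True)
class Plus:            # a vertex of the linear network: sum of incoming terms
    terms: tuple

def fft_symbolic(inputs):
    """Radix-2 Cooley-Tukey recursion applied to symbolic expressions."""
    n = len(inputs)
    if n == 1:
        return list(inputs)
    even = fft_symbolic(inputs[0::2])
    odd = fft_symbolic(inputs[1::2])
    out = [None] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor, no special cases
        t = Times(w, odd[k])
        # even[k] and t are each referenced by two outputs, which is what
        # makes the result a dag rather than a tree.
        out[k] = Plus((even[k], t))
        out[k + n // 2] = Plus((even[k], Times(-1, t)))
    return out

def evaluate(node, x, memo):
    """Numerically evaluate a symbolic node, visiting shared nodes once."""
    if node not in memo:
        if isinstance(node, Load):
            memo[node] = x[node.index]
        elif isinstance(node, Times):
            memo[node] = node.const * evaluate(node.arg, x, memo)
        else:
            memo[node] = sum(evaluate(t, x, memo) for t in node.terms)
    return memo[node]

if __name__ == "__main__":
    outputs = fft_symbolic([Load(j) for j in range(4)])
    print(outputs[1])   # an unsimplified expression for output 1
```

Note that the generated graph still contains multiplications by 1 (the twiddle factor for `k = 0`), consistent with the creation phase making no attempt at optimization.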

In the simplification phase, genfft applies local rewriting rules to each node of the dag in order to simplify it. This phase performs algebraic transformations (such as eliminating multiplications by 1) and common-subexpression elimination. Although such transformations can be performed by a conventional compiler to some degree, they can be carried out here to a greater extent because genfft can exploit the specific problem domain. For example, two equivalent subexpressions can always be detected, even if the subexpressions are written in algebraically different forms, because all subexpressions compute linear functions. Also, genfft can exploit the property that network transposition (reversing the direction of every edge) computes the transposed linear operation, in order to transpose the network, simplify, and then transpose back; this turns out to expose additional common subexpressions. In total, these simplifications are sufficiently powerful to derive DFT algorithms specialized for real and/or symmetric data automatically from the complex algorithms. For example, it is known that when the input of a DFT is real (and the output is hence conjugate-symmetric), one can save a little over a factor of two in arithmetic cost by specializing FFT algorithms for this case. With genfft, this specialization can be done entirely automatically, pruning the redundant operations from the dag, to match the lowest known operation count for a real-input FFT starting only from the complex-data algorithm. We take advantage of this property to help us implement real-data DFTs, to exploit machine-specific “SIMD” instructions, and to generate codelets for the discrete cosine (DCT) and sine (DST) transforms. Furthermore, by experimentation we have discovered additional simplifications that improve the speed of the generated code.
One interesting example is the elimination of negative constants: multiplicative constants in FFT algorithms often come in positive/negative pairs, but every C compiler we are aware of will generate separate load instructions for positive and negative versions of the same constants (see footnote 11). We thus obtained a 10–15% speedup by making all constants positive, which involves propagating minus signs to change additions into subtractions or vice versa elsewhere in the dag (a daunting task if it had to be done manually for tens of thousands of lines of code).
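A few of these local rules can be sketched with "smart constructors" that simplify each node as it is built. This is a deliberately simplified illustration and not genfft's rule set: the tuple-based node encoding and the names `intern`, `var`, `neg`, `times`, `add`, `sub`, `constants`, and `evaluate` are all hypothetical. It shows three ideas from the text: dropping multiplications by 1, sharing structurally identical subexpressions (a basic form of common-subexpression elimination, much weaker than the linear-function equivalence test described above), and pushing minus signs outward so that only positive constants remain.

```python
_table = {}

def intern(node):
    """Hash-consing: structurally equal nodes share one object."""
    return _table.setdefault(node, node)

def var(name):
    return intern(("var", name))

def neg(x):
    if x[0] == "neg":                 # -(-a) = a
        return x[1]
    return intern(("neg", x))

def times(c, x):
    if c == 0:
        return intern(("num", 0.0))
    if c == 1:                        # multiplication by 1 is dropped
        return x
    if c < 0:                         # keep only positive constants
        return neg(times(-c, x))
    return intern(("mul", c, x))

def add(a, b):
    zero = ("num", 0.0)
    if a == zero:
        return b
    if b == zero:
        return a
    if b[0] == "neg":                 # a + (-b) = a - b
        return sub(a, b[1])
    if a[0] == "neg":                 # (-a) + b = b - a
        return sub(b, a[1])
    a, b = sorted((a, b), key=repr)   # canonical order: a + b matches b + a
    return intern(("add", a, b))

def sub(a, b):
    zero = ("num", 0.0)
    if b == zero:
        return a
    if a == zero:
        return neg(b)
    if b[0] == "neg":                 # a - (-b) = a + b
        return add(a, b[1])
    if a[0] == "neg":                 # (-a) - b = -(a + b)
        return neg(add(a[1], b))
    return intern(("sub", a, b))

def constants(node):
    """Set of multiplicative constants that the expression would load."""
    if node[0] == "mul":
        return {node[1]} | constants(node[2])
    if node[0] in ("var", "num"):
        return set()
    return set().union(*(constants(child) for child in node[1:]))

def evaluate(node, env):
    kind = node[0]
    if kind == "var":
        return env[node[1]]
    if kind == "num":
        return node[1]
    if kind == "neg":
        return -evaluate(node[1], env)
    if kind == "mul":
        return node[1] * evaluate(node[2], env)
    a, b = evaluate(node[1], env), evaluate(node[2], env)
    return a + b if kind == "add" else a - b

if __name__ == "__main__":
    x, y = var("x"), var("y")
    K = 0.7071067811865476
    e1 = add(x, times(K, y))          # x + K*y
    e2 = add(x, times(-K, y))         # rewritten as x - K*y
    print(e2)
    print(constants(e1) | constants(e2))
```

In this sketch the pair of expressions needs only the single positive constant `K`, and both share one `K*y` node; the minus sign has been absorbed by turning an addition into a subtraction.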

In the scheduling phase, genfft produces a topological sort of the dag (a schedule). The goal of this phase is to find a schedule such that a C compiler can subsequently perform a good register allocation. The scheduling algorithm used by genfft offers certain theoretical guarantees because it has its foundations in the theory of cache-oblivious algorithms (here, the registers are viewed as a form of cache), as described in Memory strategies in FFTW. As a practical matter, one consequence of this scheduler is that FFTW's machine-independent codelets are no slower than machine-specific codelets generated by SPIRAL.
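To illustrate what a schedule is and why its order matters for registers, the sketch below produces a topological sort by plain depth-first search and then measures how many intermediate values are live at once. This is not genfft's cache-oblivious scheduler, only a generic stand-in; the dag encoding (a dictionary from each node to its operands) and the names `schedule`, `is_valid_schedule`, and `peak_live` are hypothetical.

```python
def schedule(dag, outputs):
    """Topological order by depth-first search: operands precede their users."""
    order, done = [], set()

    def visit(node):
        if node in done:
            return
        done.add(node)
        for operand in dag[node]:
            visit(operand)
        order.append(node)

    for out in outputs:
        visit(out)
    return order

def is_valid_schedule(order, dag):
    """Check that every operand is computed before the node that uses it."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[op] < position[node]
               for node in order for op in dag[node])

def peak_live(order, dag):
    """Largest number of computed values still awaiting a later use."""
    position = {node: i for i, node in enumerate(order)}
    last_use = dict(position)             # a value with no users dies at once
    for node in order:
        for op in dag[node]:
            last_use[op] = max(last_use[op], position[node])
    return max(sum(1 for node in order[:i + 1] if last_use[node] > i)
               for i in range(len(order)))

if __name__ == "__main__":
    # Dependency structure of one component of a size-4 butterfly network.
    dag = {
        "x0": [], "x1": [], "x2": [], "x3": [],
        "a": ["x0", "x2"], "b": ["x0", "x2"],
        "c": ["x1", "x3"], "d": ["x1", "x3"],
        "y0": ["a", "c"], "y2": ["a", "c"],
        "y1": ["b", "d"], "y3": ["b", "d"],
    }
    order = schedule(dag, ["y0", "y1", "y2", "y3"])
    print(order)
    print(is_valid_schedule(order, dag), peak_live(order, dag))
```

Different valid orders of the same dag can have different peak counts of live values, and a lower peak leaves the C compiler's register allocator with fewer spills to worry about.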

In the stock genfft implementation, the schedule is finally unparsed to C. A variant instead implements the rest of a compiler back end and outputs assembly code.
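Unparsing is the most mechanical phase: each scheduled node becomes one line of straight-line C. The sketch below shows the flavor under simplifying assumptions. The schedule format (a list of `(target, op, args)` triples), the function name `unparse_c`, and the emitted C signature are hypothetical and do not reflect FFTW's real codelet interface.

```python
def unparse_c(name, schedule, outputs):
    """Emit a straight-line C function from an already-scheduled list of ops."""
    lines = [f"static void {name}(const double *in, double *out)", "{"]
    for target, op, args in schedule:
        if op == "load":
            rhs = f"in[{args[0]}]"
        elif op == "add":
            rhs = f"{args[0]} + {args[1]}"
        elif op == "sub":
            rhs = f"{args[0]} - {args[1]}"
        elif op == "mul":                    # args = (constant, operand)
            rhs = f"{args[0]!r} * {args[1]}"
        else:
            raise ValueError(f"unknown operation: {op}")
        # One temporary per node; register allocation is left to the C compiler.
        lines.append(f"    const double {target} = {rhs};")
    for i, value in enumerate(outputs):
        lines.append(f"    out[{i}] = {value};")
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # A real-valued size-2 butterfly: out[0] = in[0] + in[1], out[1] = in[0] - in[1].
    sched = [
        ("t0", "load", (0,)),
        ("t1", "load", (1,)),
        ("t2", "add", ("t0", "t1")),
        ("t3", "sub", ("t0", "t1")),
    ]
    print(unparse_c("dft2_sketch", sched, ["t2", "t3"]))
```

Because the schedule has already fixed the order of operations, the emitter needs no analysis of its own, which is consistent with the separation of phases described above.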

Footnote

11. Floating-point constants must be stored explicitly in memory; they cannot be embedded directly into the CPU instructions like integer “immediate” constants.

### Contributor

• ContribEEBurrus