There are many complexities of computer architectures that impact the optimization of FFT implementations, but one of the most pervasive is the memory hierarchy. On any modern general-purpose computer, memory is arranged into a hierarchy of storage devices with increasing size and decreasing speed: the fastest and smallest memory being the CPU registers, then two or three levels of cache, then the main-memory RAM, then external storage such as hard disks.3 Most of these levels are managed automatically by the hardware to hold the most-recently-used data from the next level in the hierarchy.4 There are many complications, however, such as limited cache associativity (which means that certain locations in memory cannot be cached simultaneously) and cache lines (which optimize the cache for contiguous memory access), which are reviewed in numerous textbooks on computer architectures. In this section, we focus on the simplest abstract principles of memory hierarchies in order to grasp their fundamental impact on FFTs.
Because access to memory is in many cases the slowest part of the computer, especially compared to arithmetic, one wishes to load as much data as possible in to the faster levels of the hierarchy, and then perform as much computation as possible before going back to the slower memory devices. This is called temporal locality: if a given datum is used more than once, we arrange the computation so that these usages occur as close together as possible in time.
Understanding FFTs with an ideal cache
To understand temporal-locality strategies at a basic level, in this section we will employ an idealized model of a cache in a two-level memory hierarchy. This ideal cache stores \(\mathbf{Z}\) data items from main memory (e.g. complex numbers for our purposes): when the processor loads a datum from memory, the access is quick if the datum is already in the cache (a cache hit) and slow otherwise (a cache miss, which requires the datum to be fetched into the cache). When a datum is loaded into the cache,5 it must replace some other datum, and the ideal-cache model assumes that the optimal replacement strategy is used: the new datum replaces the datum that will not be needed for the longest time in the future; in practice, this can be simulated to within a factor of two by replacing the least-recently used datum, but ideal replacement is much simpler to analyze. Armed with this ideal-cache model, we can now understand some basic features of FFT implementations that remain essentially true even on real cache architectures. In particular, we want to know the cache complexity, the number Q(n;Z)Q(n;Z)" role="presentation" style="position:relative;" tabindex="0">\(Q(n;\mathbf{Z})\) of cache misses for an FFT of size nn" role="presentation" style="position:relative;" tabindex="0">\(n\) with an ideal cache of size \(\mathbf{Z}\), and what algorithm choices reduce this complexity.
ZZ" role="presentation" style="position:relative;" tabindex="0">First, consider a textbook radix-2 algorithm, which divides nn" role="presentation" style="position:relative;" tabindex="0">\(n\) by 2 at each stage and operates breadth-first as in Fig. 10.2.1 (left), performing all butterflies of a given size at a time. If n>Zn>Z" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">ZZ" role="presentation" style="position:relative;" tabindex="0">\(n>\mathbf{Z}\), then each pass over the array incurs \(\Theta (n)\) cache misses to reload the data, and there are \(\log _2n\) passes, for Θ(nlog2n)Θ(nlog2n)" role="presentation" style="position:relative;" tabindex="0">\(\Theta (n\log _2n)\) cache misses in total—no temporal locality at all is exploited!
One traditional solution to this problem is blocking: the computation is divided into maximal blocks that fit into the cache, and the computations for each block are completed before moving on to the next block. Here, a block of \(\mathbf{Z}\) numbers can fit into the cache6 (not including storage for twiddle factors and so on), and thus the natural unit of computation is a sub-FFT of size \(\mathbf{Z}\). Since each of these blocks involves Θ(ZlogZ)Θ(ZlogZ)" role="presentation" style="position:relative;" tabindex="0">\(\Theta (\mathbf{Z}\log \mathbf{Z})\) arithmetic operations, and there are Θ(nlog2n)Θ(nlog2n)" role="presentation" style="position:relative;" tabindex="0">\(\Theta (n\log n)\)Θ(nlogn)Θ(nlogn)" role="presentation" style="position:relative;" tabindex="0"> operations overall, there must be \(\Theta \left ( \frac{n}{\mathbf{Z}}\log _\mathbf{Z}n \right )\) such blocks. More explicitly, one could use a radix-\(\mathbf{Z}\) Cooley-Tukey algorithm, breaking nn" role="presentation" style="position:relative;" tabindex="0">\(n\) down by factors of \(\mathbf{Z}\) or \(\Theta (\mathbf{Z})\) until a size ZZ" role="presentation" style="position:relative;" tabindex="0">\(\mathbf{Z}\) is reached: each stage requires \(n/\mathbf{Z}\) blocks, and there are logZnlogZn" role="presentation" style="position:relative;" tabindex="0">\(\log _\mathbf{Z}n\) stages, again giving \(\Theta \left ( \frac{n}{\mathbf{Z}}\log _\mathbf{Z}n \right )\) blocks overall. Since each block requires \(\mathbf{Z}\) cache misses to load it into cache, the cache complexity QbQb" role="presentation" style="position:relative;" tabindex="0">\(Q_b\) of such a blocked algorithm is
\[Q_b(n;\mathbf{Z})=\Theta (\log _\mathbf{Z}n) \nonumber \]
In fact, this complexity is rigorously optimal for Cooley-Tukey FFT algorithms, and immediately points us towards large radices (not radix 2!) to exploit caches effectively in FFTs.
However, there is one shortcoming of any blocked FFT algorithm: it is cache aware, meaning that the implementation depends explicitly on the cache size \(\mathbf{Z}\). The implementation must be modified (e.g. changing the radix) to adapt to different machines as the cache size changes. Worse, as mentioned above, actual machines have multiple levels of cache, and to exploit these one must perform multiple levels of blocking, each parameterized by the corresponding cache size. In the above example, if there were a smaller and faster cache of size z<Zz<Z" role="presentation" style="position:relative;" tabindex="0">\(z<\mathbf{Z}\), the size-\(\mathbf{Z}\) sub-FFTs should themselves be performed via radix-\(z\) Cooley-Tukey using blocks of size zz" role="presentation" style="position:relative;" tabindex="0">\(z\) and so on. There are two paths out of these difficulties: one is self-optimization, where the implementation automatically adapts itself to the hardware (implicitly including any cache sizes), as described in Adaptive Composition of FFT Algorithms; the other is to exploit cache-oblivious algorithms. FFTW employs both of these techniques.
The goal of cache-obliviousness is to structure the algorithm so that it exploits the cache without having the cache size as a parameter: the same code achieves the same asymptotic cache complexity regardless of the cache size \(\mathbf{Z}\). An optimal cache-oblivious algorithm achieves the optimal cache complexity (that is, in an asymptotic sense, ignoring constant factors). Remarkably, optimal cache-oblivious algorithms exist for many problems, such as matrix multiplication, sorting, transposition, and FFTs. Not all cache-oblivious algorithms are optimal, of course—for example, the textbook radix-2 algorithm discussed above is “pessimal” cache-oblivious (its cache complexity is independent of ZZ" role="presentation" style="position:relative;" tabindex="0">\(\mathbf{Z}\) because it always achieves the worst case!).
For instance, Fig 10.2.1 (right) and the algorithm of Pre shows a way to obliviously exploit the cache with a radix-2 Cooley-Tukey algorithm, by ordering the computation depth-first rather than breadth-first. That is, the DFT of size nn" role="presentation" style="position:relative;" tabindex="0">\(n\) is divided into two DFTs of size n/2n/2" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">\(n/2\), and one DFT of size n/2n/2" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">\(n/2\)n/2n/2" role="presentation" style="position:relative;" tabindex="0"> is completely finished before doing any computations for the second DFT of size n/2n/2" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">\(n/2\). The two subtransforms are then combined using n/2n/2" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">\(n/2\) radix-2 butterflies, which requires a pass over the array and (hence nn" role="presentation" style="position:relative;" tabindex="0">\(n\) cache misses if n>Zn>Z" role="presentation" style="position:relative;" tabindex="0">\(n>\mathbf{Z}\)). This process is repeated recursively until a base-case (e.g. size 2) is reached. The cache complexity Q2(n;Z)Q2(n;Z)" role="presentation" style="position:relative;" tabindex="0">\(Q_2(n;\mathbf{Z})\) of this algorithm satisfies the recurrence
\[Q_2(n;\mathbf{Z})=\begin{cases} n & n\leq \mathbf{Z} \\ 2Q_2(n/2;\mathbf{Z})+\Theta (n) & \text{ otherwise } \end{cases} \nonumber \]
The key property is this: once the recursion reaches a size n≤Zn≤Z" role="presentation" style="position:relative;" tabindex="0">\(n\leq \mathbf{Z}\), the subtransform fits into the cache and no further misses are incurred. The algorithm does not “know” this and continues subdividing the problem, of course, but all of those further subdivisions are in-cache because they are performed in the same depth-first branch of the tree. The solution o the equation is
\[Q_2(n;\mathbf{Z})=\Theta (n\log [n/\mathbf{Z}]) \nonumber \]
This is worse than the theoretical optimum \(Q_b(n;\mathbf{Z})\) from the equation, but it is cache-oblivious (\(\mathbf{Z}\) never entered the algorithm) and exploits at least some temporal locality.7 On the other hand, when it is combined with FFTW's self-optimization and larger radices in Adaptive Composition of FFT Algorithms, this algorithm actually performs very well until nn" role="presentation" style="position:relative;" tabindex="0">\(n\) becomes extremely large. By itself, however, the algorithm of Pre must be modified to attain adequate performance for reasons that have nothing to do with the cache. These practical issues are discussed further in Cache-obliviosness in practice below.
There exists a different recursive FFT that is optimal cache-oblivious, however, and that is the radix-\(\sqrt{n}\) “four-step” Cooley-Tukey algorithm (again executed recursively, depth-first). The cache complexity \(Q_o\) of this algorithm satisfies the recurrence:
\[Q_o(n;\mathbf{Z})=\begin{cases} n & n\leq \mathbf{Z} \\ 2\sqrt{n}Q_o(\sqrt{n};\mathbf{Z})+\Theta (n) & \text{ otherwise } \end{cases} \nonumber \]
That is, at each stage one performs \(\sqrt{n}\) DFTs of size \(\sqrt{n}\) (recursively), then multiplies by the Θ(n)Θ(n)" role="presentation" style="position:relative;" tabindex="0">\(\Theta (n)\) twiddle factors (and does a matrix transposition to obtain in-order output), then finally performs another \(\sqrt{n}\) DFTs of size \(\sqrt{n}\). The solution of the equation is
nn" role="presentation" style="position:relative;" tabindex="0">\[Q_o(n;\mathbf{Z})=\Theta (n\log _\mathbf{Z}n) \nonumber \]
the same as the optimal cache complexity equation!
These algorithms illustrate the basic features of most optimal cache-oblivious algorithms: they employ a recursive divide-and-conquer strategy to subdivide the problem until it fits into cache, at which point the subdivision continues but no further cache misses are required. Moreover, a Cache-oblivious algorithm exploits all levels of the cache in the same way, so an optimal cache-oblivious algorithm exploits a multi-level cache optimally as well as a two-level cache: the multi-level “blocking” is implicit in the recursion.
Cache-obliviousness in practice
Even though the radix-\(\sqrt{n}\) algorithm is optimal cache-oblivious, it does not follow that FFT implementation is a solved problem. The optimality is only in an asymptotic sense, ignoring constant factors, O(n)O(n)" role="presentation" style="position:relative;" tabindex="0">\(O(n)\) terms, etcetera, all of which can matter a great deal in practice. For small or moderate \(n\), quite different algorithms may be superior, as discussed in Memory strategies in FFTW below. Moreover, real caches are inferior to an ideal cache in several ways. The unsurprising consequence of all this is that cache-obliviousness, like any complexity-based algorithm property, does not absolve one from the ordinary process of software optimization. At best, it reduces the amount of memory/cache tuning that one needs to perform, structuring the implementation to make further optimization easier and more portable.
Perhaps most importantly, one needs to perform an optimization that has almost nothing to do with the caches: the recursion must be “coarsened” to amortize the function-call overhead and to enable compiler optimization. For example, the simple pedagogical code of the algorithm in Pre recurses all the way down to n=1n=1" role="presentation" style="position:relative;" tabindex="0">\(n=1\) and hence there are ≈2n≈2n" role="presentation" style="position:relative;" tabindex="0">\(\approx 2n\) function calls in total, so that every data point incurs a two-function-call overhead on average. Moreover, the compiler cannot fully exploit the large register sets and instruction-level parallelism of modern processors with an n=1n=1" role="presentation" style="position:relative;" tabindex="0">\(n=1\) function body.8 These problems can be effectively erased, however, simply by making the base cases larger, e.g. the recursion could stop when n=1n=1" role="presentation" style="position:relative;" tabindex="0">\(n=32\) is reached, at which point a highly optimized hard-coded FFT of that size would be executed. In FFTW, we produced this sort of large base-case using a specialized code-generation program described in Generating Small FFT Kernels.
One might get the impression that there is a strict dichotomy that divides cache-aware and cache-oblivious algorithms, but the two are not mutually exclusive in practice. Given an implementation of a cache-oblivious strategy, one can further optimize it for the cache characteristics of a particular machine in order to improve the constant factors. For example, one can tune the radices used, the transition point between the radix-\(\sqrt{n}\) algorithm and the bounded-radix algorithm, or other algorithmic choices as described in Memory strategies in FFTW below. The advantage of starting cache-aware tuning with a cache-oblivious approach is that the starting point already exploits all levels of the cache to some extent, and one has reason to hope that good performance on one machine will be more portable to other architectures than for a purely cache-aware “blocking” approach. In practice, we have found this combination to be very successful with FFTW.
Memory strategies in FFTW
The recursive cache-oblivious strategies described above form a useful starting point, but FFTW supplements them with a number of additional tricks, and also exploits cache-obliviousness in less-obvious forms.
We currently find that the general radix-\(\sqrt{n}\) algorithm is beneficial only when \(n\) becomes very large, on the order of 220≈106220≈106" role="presentation" style="position:relative;" tabindex="0">\(2^{20}\approx 10^6\). In practice, this means that we use at most a single step of radix-\(\sqrt{n}\) (two steps would only be used for n≳240n≳240" role="presentation" style="position:relative;" tabindex="0">\(n\underset{\sim }{> }2^{40}\)). The reason for this is that the implementation of radix-\(\sqrt{n}\) is less efficient than for a bounded radix: the latter has the advantage that an entire radix butterfly can be performed in hard-coded loop-free code within local variables/registers, including the necessary permutations and twiddle factors.
Thus, for more moderate \(n\), FFTW uses depth-first recursion with a bounded radix, similar in spirit to the algorithm of Pre but with much larger radices (radix 32 is common) and base cases (size 32 or 64 is common) as produced by the code generator of Generating Small FFT Kernels. The self-optimization described in Adaptive Composition of FFT Algorithms allows the choice of radix and the transition to the radix-nn" role="presentation" style="position:relative;" tabindex="0">\(\sqrt{n}\) algorithm to be tuned in a cache-aware (but entirely automatic) fashion.
For small \(n\) (including the radix butterflies and the base cases of the recursion), hard-coded FFTs (FFTW's codelets) are employed. However, this gives rise to an interesting problem: a codelet for (e.g.) n=64n=64" role="presentation" style="position:relative;" tabindex="0">\(n=64\; \text{is}\sim 2000\) lines long, with hundreds of variables and over 1000 arithmetic operations that can be executed in many orders, so what order should be chosen? The key problem here is the efficient use of the CPU registers, which essentially form a nearly ideal, fully associative cache. Normally, one relies on the compiler for all code scheduling and register allocation, but but the compiler needs help with such long blocks of code (indeed, the general register-allocation problem is NP-complete). In particular, FFTW's generator knows more about the code than the compiler—the generator knows it is an FFT, and therefore it can use an optimal cache-oblivious schedule (analogous to the radix-nn" role="presentation" style="position:relative;" tabindex="0">\(\sqrt{n}\) algorithm) to order the code independent of the number of registers. The compiler is then used only for local “cache-aware” tuning (both for register allocation and the CPU pipeline).9 As a practical matter, one consequence of this scheduler is that FFTW's machine-independent codelets are no slower than machine-specific codelets generated by an automated search and optimization over many possible codelet implementations, as performed by the SPIRAL project.
(When implementing hard-coded base cases, there is another choice because a loop of small transforms is always required. Is it better to implement a hard-coded FFT of size 64, for example, or an unrolled loop of four size-16 FFTs, both of which operate on the same amount of data? The former should be more efficient because it performs more computations with the same amount of data, thanks to the \(\log n\) factor in the FFT's nlognnlogn" role="presentation" style="position:relative;" tabindex="0">\(n\log n\) complexity.)
In addition, there are many other techniques that FFTW employs to supplement the basic recursive strategy, mainly to address the fact that cache implementations strongly favor accessing consecutive data—thanks to cache lines, limited associativity, and direct mapping using low-order address bits (accessing data at power-of-two intervals in memory, which is distressingly common in FFTs, is thus especially prone to cache-line conflicts). Unfortunately, the known FFT algorithms inherently involve some non-consecutive access (whether mixed with the computation or in separate bit-reversal/transposition stages). There are many optimizations in FFTW to address this. For example, the data for several butterflies at a time can be copied to a small buffer before computing and then copied back, where the copies and computations involve more consecutive access than doing the computation directly in-place. Or, the input data for the subtransform can be copied from (discontiguous) input to (contiguous) output before performing the subtransform in-place (see Adaptive Composition of FFT Algorithms ), rather than performing the subtransform directly out-of-place (as in the algorithm in Review of the Cooley-Tukey FFT ). Or, the order of loops can be interchanged in order to push the outermost loop from the first radix step [the \(l_2\) loop in the equation] down to the leaves, in order to make the input access more consecutive (see Adaptive Composition of FFT Algorithms ). Or, the twiddle factors can be computed using a smaller look-up table (fewer memory loads) at the cost of more arithmetic (see Numerical Accuracy in FFTs ). The choice of whether to use any of these techniques, which come into play mainly for moderate \(n(2^{13}< n< 2^{20})\), is made by the self-optimizing planner as described in the next section.
Footnotes
3 A hard disk is utilized by “out-of-core” FFT algorithms for very large nn" role="presentation" style="position:relative;" tabindex="0">\(n\), but these algorithms appear to have been largely superseded in practice by both the gigabytes of memory now common on personal computers and, for extremely large nn" role="presentation" style="position:relative;" tabindex="0">nn" role="presentation" style="position:relative;" tabindex="0">\(n\), by algorithms for distributed-memory parallel computers.
4 This includes the registers: on current “x86” processors, the user-visible instruction set (with a small number of floating-point registers) is internally translated at runtime to RISC-like "μ-ops" with a much larger number of physical rename registers that are allocated automatically.
5 More generally, one can assume that a cache line of LL" role="presentation" style="position:relative;" tabindex="0">L consecutive data items are loaded into the cache at once, in order to exploit spatial locality. The ideal-cache model in this case requires that the cache be tall: \(\mathbf{Z}=\Omega (L)^2\)
6 Of course, \(O(n)\) additional storage may be required for twiddle factors, the output data (if the FFT is not in-place), and so on, but these only affect the nn" role="presentation" style="position:relative;" tabindex="0">\(n\ that fits int cache by a constant factor and hence do not impact cache-complexity analysis. We won't worry about such constant factors in this section.
7 This advantage of depth-first recursive implementation of the radix-2 FFT was pointed out many years ago by Singleton (where the “cache” was core memory)
8 In principle, it might be possible for a compiler to automatically coarsen the recursion, similar to how compilers can partially unroll loops. We are currently unaware of any general-purpose compiler that performs this optimization, however.
9 One practical difficulty is that some “optimizing” compilers will tend to greatly re-order the code, destroying FFTW's optimal schedule. With GNU gcc, we circumvent this problem by using compiler flags that explicitly disable certain stages of the optimizer.