10.4: FFTs and the Memory Hierarchy

There are many complexities of computer architectures that impact the optimization of FFT implementations, but one of the most pervasive is the memory hierarchy. On any modern general-purpose computer, memory is arranged into a hierarchy of storage devices with increasing size and decreasing speed: the fastest and smallest memory being the CPU registers, then two or three levels of cache, then the main-memory RAM, then external storage such as hard disks.³ Most of these levels are managed automatically by the hardware to hold the most-recently-used data from the next level in the hierarchy.⁴ There are many complications, however, such as limited cache associativity (which means that certain locations in memory cannot be cached simultaneously) and cache lines (which optimize the cache for contiguous memory access), which are reviewed in numerous textbooks on computer architectures. In this section, we focus on the simplest abstract principles of memory hierarchies in order to grasp their fundamental impact on FFTs.

Because access to memory is in many cases the slowest part of the computer, especially compared to arithmetic, one wishes to load as much data as possible into the faster levels of the hierarchy, and then perform as much computation as possible before going back to the slower memory devices. This is called temporal locality: if a given datum is used more than once, we arrange the computation so that these usages occur as close together as possible in time.

Understanding FFTs with an ideal cache

To understand temporal-locality strategies at a basic level, in this section we will employ an idealized model of a cache in a two-level memory hierarchy. This ideal cache stores $\mathbf{Z}$ data items from main memory (e.g. complex numbers for our purposes): when the processor loads a datum from memory, the access is quick if the datum is already in the cache (a cache hit) and slow otherwise (a cache miss, which requires the datum to be fetched into the cache). When a datum is loaded into the cache,⁵ it must replace some other datum, and the ideal-cache model assumes that the optimal replacement strategy is used: the new datum replaces the datum that will not be needed for the longest time in the future. In practice, this can be simulated to within a factor of two by replacing the least-recently used datum, but ideal replacement is much simpler to analyze. Armed with this ideal-cache model, we can now understand some basic features of FFT implementations that remain essentially true even on real cache architectures. In particular, we want to know the cache complexity, the number $Q(n;\mathbf{Z})$ of cache misses for an FFT of size $n$ with an ideal cache of size $\mathbf{Z}$, and what algorithm choices reduce this complexity. Consider first a textbook radix-2 algorithm that operates breadth-first, performing all butterflies of a given size in one pass over the array: if $n > \mathbf{Z}$, each of the $\log_2 n$ passes incurs $\Theta(n)$ cache misses to reload the data, for $\Theta(n \log_2 n)$ misses in total, so no temporal locality is exploited at all.
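The cache model can be explored numerically with a small simulation. The sketch below counts misses for an address trace under least-recently-used (LRU) replacement, used here only as a practical stand-in for ideal replacement, so its counts can exceed those of the ideal cache. The names `count_misses_lru` and `sweep_trace` are hypothetical helpers introduced for illustration.

```python
from collections import OrderedDict


def count_misses_lru(trace, cache_size):
    """Count cache misses for a sequence of addresses under LRU replacement."""
    cache = OrderedDict()  # keys are cached addresses, ordered oldest to newest
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)  # cache hit: mark as most recently used
        else:
            misses += 1  # cache miss: the datum must be fetched
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used datum
            cache[addr] = True
    return misses


def sweep_trace(n, passes):
    """Addresses touched by `passes` sequential sweeps over an array of n items."""
    return [i for _ in range(passes) for i in range(n)]


if __name__ == "__main__":
    Z = 8
    print(count_misses_lru(sweep_trace(8, 3), Z))   # array fits in the cache
    print(count_misses_lru(sweep_trace(16, 3), Z))  # array is twice the cache size
```

With a cache of 8 items, three sweeps over 8 items miss only on the first pass (8 misses), whereas three sweeps over 16 items miss on every access (48 misses). Repeated passes over an array larger than the cache behave in this way, which is why pass-by-pass algorithms exploit no temporal locality.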

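One traditional solution to this problem is blocking: the computation is divided into maximal blocks that fit into the cache, and the computations for each block are completed before moving on to the next block. Here, a block of $\mathbf{Z}$ numbers can fit into the cache⁶ (not including storage for twiddle factors and so on), and thus the natural unit of computation is a sub-FFT of size $\mathbf{Z}$. Since each of these blocks involves $\Theta(\mathbf{Z} \log \mathbf{Z})$ arithmetic operations, and there are $\Theta(n \log n)$ operations overall, there must be $\Theta(\frac{n}{\mathbf{Z}} \log_{\mathbf{Z}} n)$ such blocks. Since each block requires $\mathbf{Z}$ cache misses to load it into the cache, the cache complexity $Q_b$ of such a blocked algorithm is

\[Q_b(n;\mathbf{Z}) = \Theta(n \log_{\mathbf{Z}} n) \nonumber \]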
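As a rough worked example with all constant factors suppressed, take the illustrative values $n = 2^{20}$ and $\mathbf{Z} = 2^{10}$ (chosen here for convenience, not taken from any particular machine). Breaking $n$ down by factors of $\mathbf{Z}$ gives $\log_{\mathbf{Z}} n = 2$ stages of $n/\mathbf{Z} = 2^{10}$ blocks each, and each block costs $\mathbf{Z}$ misses to load:

\[\underbrace{\frac{n}{\mathbf{Z}} \log_{\mathbf{Z}} n}_{\text{blocks}} = 2^{10} \cdot 2 = 2^{11}, \qquad \underbrace{\mathbf{Z} \cdot \frac{n}{\mathbf{Z}} \log_{\mathbf{Z}} n}_{\text{misses}} = n \log_{\mathbf{Z}} n = 2 \cdot 2^{20}, \qquad n \log_2 n = 20 \cdot 2^{20} \nonumber \]

Under these assumptions, blocking reduces the miss count by a factor of $\log_2 \mathbf{Z} = 10$ compared with an algorithm that incurs $n \log_2 n$ misses.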

However, there is one shortcoming of any blocked FFT algorithm: it is cache aware, meaning that the implementation depends explicitly on the cache size $\mathbf{Z}$. The implementation must be modified (e.g. changing the radix) to adapt to different machines as the cache size changes. Worse, as mentioned above, actual machines have multiple levels of cache, and to exploit these one must perform multiple levels of blocking, each parameterized by the corresponding cache size. In the above example, if there were a smaller and faster cache of size $z < \mathbf{Z}$, the size-$\mathbf{Z}$ sub-FFTs should themselves be performed via radix-$z$ Cooley-Tukey using blocks of size $z$, and so on.

The goal of cache-obliviousness is to structure the algorithm so that it exploits the cache without having the cache size as a parameter: the same code achieves the same asymptotic cache complexity regardless of the cache size $\mathbf{Z}$. An optimal cache-oblivious algorithm achieves the optimal cache complexity (that is, in an asymptotic sense, ignoring constant factors). Remarkably, optimal cache-oblivious algorithms exist for many problems, such as matrix multiplication, sorting, transposition, and FFTs. Not all cache-oblivious algorithms are optimal, of course: for example, the textbook radix-2 algorithm discussed above is “pessimal” cache-oblivious (its cache complexity is independent of $\mathbf{Z}$ because it always achieves the worst case, $\Theta(n \log n)$).

The depth-first recursive radix-2 algorithm has a cache complexity that is worse than the theoretical optimum $Q_b(n;\mathbf{Z})$ of the blocked algorithm, but it is cache-oblivious ($\mathbf{Z}$ never entered the algorithm) and exploits at least some temporal locality.⁷ On the other hand, when it is combined with FFTW's self-optimization and larger radices in Adaptive Composition of FFT Algorithms, this algorithm actually performs very well until $n$ becomes extremely large.
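A depth-first recursive radix-2 FFT of the kind referred to here can be sketched in a few lines. This is a minimal illustration and not FFTW's implementation: the function name `fft_recursive` is hypothetical, the length is assumed to be a power of two, and the slicing makes copies that a tuned code would avoid. The point is the order of work: each half-size subtransform is finished completely before the other begins, so once a subproblem is small enough to fit in the cache it stays there until it is done.

```python
import cmath


def fft_recursive(x):
    """Depth-first radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Finish the even-indexed subtransform entirely, then the odd-indexed one.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor exp(-2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    print(fft_recursive([1, 0, 0, 0]))  # the DFT of a unit impulse is all ones
```

Note that the cache size never appears in the code, which is what makes the approach cache-oblivious.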

There exists a different recursive FFT that is optimal cache-oblivious, however, and that is the radix-$\sqrt{n}$ “four-step” Cooley-Tukey algorithm (again executed recursively, depth-first). The cache complexity $Q_o$ of this algorithm satisfies the recurrence:

\[Q_o(n;\mathbf{Z})=\begin{cases} n & n\leq \mathbf{Z} \\ 2\sqrt{n}Q_o(\sqrt{n};\mathbf{Z})+\Theta (n) & \text{ otherwise } \end{cases} \nonumber \]

and has the solution $Q_o(n;\mathbf{Z}) = \Theta(n \log_{\mathbf{Z}} n)$, the same as the theoretical optimum $Q_b(n;\mathbf{Z})$!
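The growth of this recurrence can be checked numerically. The sketch below evaluates it under two simplifying assumptions that are not part of the analysis itself: the $\Theta(n)$ term is taken to be exactly $n$, and $n$ is restricted to sizes of the form $\mathbf{Z}^{2^k}$ so that every square root is an integer. The function name `q_oblivious` is hypothetical.

```python
import math


def q_oblivious(n, z):
    """Evaluate Q_o(n; Z) with the Theta(n) term taken to be exactly n."""
    if n <= z:
        return n  # the whole problem fits in cache: one miss per datum
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("n must be a perfect square at every level")
    # Two sets of sqrt(n) sub-FFTs of size sqrt(n), plus linear extra work.
    return 2 * r * q_oblivious(r, z) + n


if __name__ == "__main__":
    z = 4
    for n in (16, 256, 65536):
        ratio = q_oblivious(n, z) / (n * math.log(n, z))
        print(n, q_oblivious(n, z), round(ratio, 3))
```

Under these assumptions the recurrence solves exactly to $n (2 \log_{\mathbf{Z}} n - 1)$, so the printed ratio to $n \log_{\mathbf{Z}} n$ stays below 2 as $n$ grows, consistent with $\Theta(n \log_{\mathbf{Z}} n)$.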

These algorithms illustrate the basic features of most optimal cache-oblivious algorithms: they employ a recursive divide-and-conquer strategy to subdivide the problem until it fits into cache, at which point the subdivision continues but no further cache misses are required. Moreover, a cache-oblivious algorithm exploits all levels of the cache in the same way, so an optimal cache-oblivious algorithm exploits a multi-level cache optimally as well as a two-level cache: the multi-level “blocking” is implicit in the recursion.

Cache-obliviousness in practice

The optimality of a cache-oblivious algorithm holds only in an asymptotic sense, ignoring constant factors, $O(n)$ terms, etcetera, all of which can matter a great deal in practice. For small or moderate $n$, quite different algorithms may be superior, as discussed in Memory strategies in FFTW below. Moreover, real caches are inferior to an ideal cache in several ways. The unsurprising consequence of all this is that cache-obliviousness, like any complexity-based algorithm property, does not absolve one from the ordinary process of software optimization. At best, it reduces the amount of memory/cache tuning that one needs to perform, structuring the implementation to make further optimization easier and more portable.

One might get the impression that there is a strict dichotomy that divides cache-aware and cache-oblivious algorithms, but the two are not mutually exclusive in practice. Given an implementation of a cache-oblivious strategy, one can further optimize it for the cache characteristics of a particular machine in order to improve the constant factors. For example, one can tune the radices used, the transition point between the radix-$\sqrt{n}$ algorithm and the bounded-radix algorithm, or other algorithmic choices as described in Memory strategies in FFTW below. The advantage of starting cache-aware tuning with a cache-oblivious approach is that the starting point already exploits all levels of the cache to some extent, and one has reason to hope that good performance on one machine will be more portable to other architectures than for a purely cache-aware “blocking” approach. In practice, we have found this combination to be very successful with FFTW.

Memory strategies in FFTW

The recursive cache-oblivious strategies described above form a useful starting point, but FFTW supplements them with a number of additional tricks, and also exploits cache-obliviousness in less-obvious forms.

For more moderate $n$, FFTW uses depth-first recursion with a bounded radix, similar in spirit to the algorithm of Pre but with much larger radices (radix 32 is common) and base cases (size 32 or 64 is common) as produced by the code generator of Generating Small FFT Kernels. The self-optimization described in Adaptive Composition of FFT Algorithms allows the choice of radix and the transition to the radix-$\sqrt{n}$ algorithm to be tuned in a cache-aware (but entirely automatic) fashion.

A hard-coded FFT of size $n = 64$ is $\sim 2000$ lines long, with hundreds of variables and over 1000 arithmetic operations that can be executed in many orders, so what order should be chosen? The key problem here is the efficient use of the CPU registers, which essentially form a nearly ideal, fully associative cache. Normally, one relies on the compiler for all code scheduling and register allocation, but the compiler needs help with such long blocks of code (indeed, the general register-allocation problem is NP-complete). In particular, FFTW's generator knows more about the code than the compiler: the generator knows it is an FFT, and therefore it can use an optimal cache-oblivious schedule (analogous to the radix-$\sqrt{n}$ algorithm) to order the code independent of the number of registers. The compiler is then used only for local “cache-aware” tuning (both for register allocation and the CPU pipeline).⁹ As a practical matter, one consequence of this scheduler is that FFTW's machine-independent codelets are no slower than machine-specific codelets generated by an automated search and optimization over many possible codelet implementations, as performed by the SPIRAL project.

(When implementing hard-coded base cases, there is another choice because a loop of small transforms is always required. Is it better to implement a hard-coded FFT of size 64, for example, or an unrolled loop of four size-16 FFTs, both of which operate on the same amount of data? The former should be more efficient because it performs more computations with the same amount of data, thanks to the $\log n$ factor in the FFT's $\Theta(n \log n)$ complexity.)
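To make the comparison concrete, take $n \log_2 n$ as a rough proxy for the amount of computation, ignoring constant factors (an assumption made here purely for illustration). Both options touch 64 data items, but:

\[\underbrace{64 \log_2 64}_{\text{one size-64 FFT}} = 384, \qquad \underbrace{4 \times 16 \log_2 16}_{\text{four size-16 FFTs}} = 256 \nonumber \]

By this measure the single size-64 transform performs about 1.5 times as much work on the same data.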

In addition, there are many other techniques that FFTW employs to supplement the basic recursive strategy, mainly to address the fact that cache implementations strongly favor accessing consecutive data, thanks to cache lines, limited associativity, and direct mapping using low-order address bits (accessing data at power-of-two intervals in memory, which is distressingly common in FFTs, is thus especially prone to cache-line conflicts). Unfortunately, the known FFT algorithms inherently involve some non-consecutive access (whether mixed with the computation or in separate bit-reversal/transposition stages). There are many optimizations in FFTW to address this. For example, the data for several butterflies at a time can be copied to a small buffer before computing and then copied back, where the copies and computations involve more consecutive access than doing the computation directly in-place. Or, the input data for the subtransform can be copied from (discontiguous) input to (contiguous) output before performing the subtransform in-place (see Adaptive Composition of FFT Algorithms), rather than performing the subtransform directly out-of-place (as in the algorithm in Review of the Cooley-Tukey FFT). Or, the order of loops can be interchanged in order to push the outermost loop from the first radix step [the $l_2$ loop in the equation] down to the leaves, in order to make the input access more consecutive (see Adaptive Composition of FFT Algorithms). Or, the twiddle factors can be computed using a smaller look-up table (fewer memory loads) at the cost of more arithmetic (see Numerical Accuracy in FFTs). The choice of whether to use any of these techniques, which come into play mainly for moderate $n$ ($2^{13} < n < 2^{20}$), is made by the self-optimizing planner as described in the next section.
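The last of these trade-offs, a smaller twiddle table in exchange for more arithmetic, can be illustrated with a two-table scheme. The sketch below is one common way to do this and is not claimed to be FFTW's actual method: it stores roughly $2\sqrt{n}$ values instead of $n$ and reconstructs each twiddle factor with one extra complex multiplication. The names `make_twiddle_tables` and `twiddle` are hypothetical.

```python
import cmath
import math


def make_twiddle_tables(n):
    """Build two small tables from which any exp(-2*pi*i*k/n), 0 <= k < n, follows."""
    s = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1
    # lo[j] = w^j for 0 <= j < s, where w = exp(-2*pi*i/n)
    lo = [cmath.exp(-2j * cmath.pi * j / n) for j in range(s)]
    # hi[m] = w^(m*s) for every m that k // s can take
    hi = [cmath.exp(-2j * cmath.pi * (m * s) / n) for m in range(-(-n // s))]
    return s, lo, hi


def twiddle(k, s, lo, hi):
    """Return w^k as a product of one entry from each table (k = (k//s)*s + k%s)."""
    return hi[k // s] * lo[k % s]


if __name__ == "__main__":
    n = 1024
    s, lo, hi = make_twiddle_tables(n)
    worst = max(abs(twiddle(k, s, lo, hi) - cmath.exp(-2j * cmath.pi * k / n))
                for k in range(n))
    print(len(lo) + len(hi), worst)  # table entries stored, largest error observed
```

For $n = 1024$ the two tables hold 64 entries in total rather than 1024, so far fewer distinct memory locations are touched; the cost is one complex multiplication per twiddle factor and a small additional rounding error.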

Footnotes

⁷ This advantage of depth-first recursive implementation of the radix-2 FFT was pointed out many years ago by Singleton (where the “cache” was core memory).

⁸ In principle, it might be possible for a compiler to automatically coarsen the recursion, similar to how compilers can partially unroll loops. We are currently unaware of any general-purpose compiler that performs this optimization, however.

⁹ One practical difficulty is that some “optimizing” compilers will tend to greatly re-order the code, destroying FFTW's optimal schedule. With GNU gcc, we circumvent this problem by using compiler flags that explicitly disable certain stages of the optimizer.

Contributor

ContribEEBurrus