# 10.5: Adaptive Composition of FFT Algorithms

As alluded to several times already, FFTW implements a wide variety of FFT algorithms (mostly rearrangements of Cooley-Tukey) and selects the “best” algorithm for a given \(n\) automatically. In this section, we describe how such self-optimization is implemented, and especially how FFTW's algorithms are structured as a composition of algorithmic fragments. These techniques in FFTW are described in greater detail elsewhere, so here we will focus only on the essential ideas and the motivations behind them.

An FFT algorithm in FFTW is a composition of algorithmic steps called a **plan**. The algorithmic steps each solve a certain class of **problems** (either solving the problem directly or recursively breaking it into sub-problems of the same type). The choice of plan for a given problem is determined by a **planner** that selects a composition of steps, either by runtime measurements to pick the fastest algorithm, or by heuristics, or by loading a pre-computed plan. These three pieces: problems, algorithmic steps, and the planner, are discussed in the following subsections.

### The problem to be solved

In early versions of FFTW, the only choice made by the planner was the sequence of radices, and so each step of the plan took a DFT of a given size \(n\), possibly with discontiguous input/output, and reduced it (via a radix \(r\)) to DFTs of size \(n/r\), which were solved recursively. That is, each step solved the following problem: given a size \(n\), an **input pointer**\(\mathbf{I}\) an **input stride** \(\imath\), an **output pointer** \(\mathbf{O}\), and an **output stride** \(\mathit{o}\), it computed the DFT of \(\mathbf{I}[\mathit{l}_\imath ]\; \text{for}\; 0\leq \mathit{l}< n\) and stored the result in \(\mathbf{O}[\mathbf{ko}]\; \text{for}\; 0\leq k< n\). However, we soon found that we could not easily express many interesting algorithms within this framework; for example, **in-place** \((\mathbf{I}=\mathbf{O})\) FFTs that do not require a separate bit-reversal stage. It became clear that the key issue was not the choice of algorithms, as we had first supposed, but the definition of the problem to be solved. Because only problems that can be expressed can be solved, the representation of a problem determines an outer bound to the space of plans that the planner can explore, and therefore it ultimately constrains FFTW's performance.

The difficulty with our initial \((n,\mathbf{I},\imath ,\mathbf{O},o)\) problem definition was that it forced each algorithmic step to address only a single DFT. In fact, FFTs break down DFTs into **multiple** smaller DFTs, and it is the **combination** of these smaller transforms that is best addressed by many algorithmic choices, especially to rearrange the order of memory accesses between the subtransforms. Therefore, we redefined our notion of a problem in FFTW to be not a single DFT, but rather a **loop** of DFTs, and in fact **multiple nested loops** of DFTs. The following sections describe some of the new algorithmic steps that such a problem definition enables, but first we will define the problem more precisely.

DFT problems in FFTW are expressed in terms of structures called I/O tensors,** ^{10}** which in turn are described in terms of ancillary structures called I/O dimensions. An

**I/O dimension**\(d\) is a triple \(d=(n,\imath ,o)\) where \(n\) is a non-negative integer called the

**length**, \(\imath\) is an integer called the

**input stride**, and \(o\) is an integer called the

**output stride**. An

**I/O tensor**\(t=\left \{ d_1,d_2,...,d_p\right \}\) is a set of I/O dimensions. The non-negative integer \(\rho =\mid t\mid\) is called the

**rank**of the I/O tensor. A

**DFT problem**, denoted by \(\text{dft}(\mathbf{N},\mathbf{V},\mathbf{I},\mathbf{O})\) consists of two I/O tensors \(\mathbf{N}\) and \(\mathbf{V}\), and of two

**pointers**\(\mathbf{I}\) and \(\mathbf{O}\). Informally, this describes \(\left | \mathbf{V}\right |\) nested loops of \(\left | \mathbf{N}\right |\)-dimensional DFTs with input data starting at memory location \(\mathbf{I}\) and output data starting at \(\mathbf{O}\).

For simplicity, let us consider only one-dimensional DFTs, so that \(\mathbf{N}=\left \{ (n,\imath ,o) \right \}\) implies a DFT of length \(n\) on input data with stride \(\imath\), and output data with stride \(\mathit{o}\), much like in the original FFTW as described above. The main new feature is then the addition of zero or more “loops” \(\mathbf{V}\). More formally,

\[\text{dft}\left ( \mathbf{N},\left \{ (n,\imath ,o) \right \}\cup \mathbf{V},\mathbf{I},\mathbf{O}\right )\]

is recursively defined as a “loop” of \(n\) problems: for all \(0\leq k< n\), do all computations in

\[\text{dft}(\mathbf{N},\mathbf{V},\mathbf{I}+k.\imath ,\mathbf{O}+k.o)\]

The case of multi-dimensional DFTs is defined more precisely elsewhere, but essentially each I/O dimension in \(mathbf{N}\) gives one dimension of the transform.

We call \(mathbf{N}\) the **size** of the problem. The **rank** of a problem is defined to be the rank of its size (i.e., the dimensionality of the DFT). Similarly, we call \(mathbf{V}\) the **vector size** of the problem, and the **vector rank** of a problem is correspondingly defined to be the rank of its vector size. Intuitively, the vector size can be interpreted as a set of “loops” wrapped around a single DFT, and we therefore refer to a single I/O dimension of \(mathbf{V}\) as a **vector loop**. (Alternatively, one can view the problem as describing a DFT over a \(\left | \mathbf{V} \right |\)-dimensional vector space.) The problem does not specify the order of execution of these loops, however, and therefore FFTW is free to choose the fastest or most convenient order.

### DFT problem examples

A more detailed discussion of the space of problems in FFTW can be found, but a simple understanding can be gained by examining a few examples demonstrating that the I/O tensor representation is sufficiently general to cover many situations that arise in practice, including some that are not usually considered to be instances of the DFT. A single one-dimensional DFT of length \(n\), with stride-1 input \(\mathbf{X}\) and output \(\mathbf{Y}\), as in the equation, is denoted by the problem \(\text{dft}(\left \{ (n,1,1) \right \},\left \{ \right \},\mathbf{X},\mathbf{Y})\) (no loops: vector-rank zero).

As a more complicated example, suppose we have an \(n_1\times n_2\) matrix \(\mathbf{X}\) stored as \(n_1\) consecutive blocks of contiguous length-\(n_2\) rows (this is called **row-major** format). The in-place DFT of all the **rows** of this matrix would be denoted by the problem \(\text{dft}(\left \{ (n_2,1,1) \right \},\left \{ n_1,n_2,n_2\right \},\mathbf{X},\mathbf{X})\): a length-\(n_1\) loop of size \(n_2\) contiguous DFTs, where each iteration of the loop offsets its input/output data by a stride \(n_2\). Conversely, the in-place DFT of all the **columns** of this matrix would be denoted by \(\text{dft}(\left \{ (n_1,n_2,n_2) \right \},\left \{ n_2,1,1\right \},\mathbf{X},\mathbf{X})\): compared to the previous example, \(\mathbf{N}\) and \(\mathbf{V}\) are swapped. In the latter case, each DFT operates on discontiguous data, and FFTW might well choose to interchange the loops: instead of performing a loop of DFTs computed individually, the subtransforms themselves could act on \(n_2\) -component vectors, as described in **The space of plans in FFTW **below.

A size-1 DFT is simply a copy \(\mathbf{Y}[0]=\mathbf{X}[0]\), and here this can also be denoted by \(\mathbf{N}=\left \{ \right \}\) (rank zero, a “zero-dimensional” DFT). This allows FFTW's problems to represent many kinds of copies and permutations of the data within the same problem framework, which is convenient because these sorts of operations arise frequently in FFT algorithms. For example, to copy \(n\) consecutive numbers from \(\mathbf{I}\) to \(\mathbf{O}\), one would use the rank-zero problem \(\text{dft}(\left \{ \right \},\left \{ (n,1,1) \right \},\mathbf{I},\mathbf{O})\). More interestingly, the in-place **transpose** of an \(n_1\times n_2\) matrix \(\mathbf{X}\) stored in row-major format, as described above, is denoted by \(\text{dft}(\left \{ \right \},\left \{ (n_1,n_2,1),(n_2,1,n_1), \right \},\mathbf{X},\mathbf{X})\) (rank zero, vector-rank two).

### The space of plans in FFTW

Here, we describe a subset of the possible plans considered by FFTW; while not exhaustive, this subset is enough to illustrate the basic structure of FFTW and the necessity of including the vector loop(s) in the problem definition to enable several interesting algorithms. The plans that we now describe usually perform some simple “atomic” operation, and it may not be apparent how these operations fit together to actually compute DFTs, or why certain operations are useful at all. We shall discuss those matters in **Discussion** below.

Roughly speaking, to solve a general DFT problem, one must perform three tasks. First, one must reduce a problem of arbitrary vector rank to a set of loops nested around a problem of vector rank 0, i.e., a single (possibly multi-dimensional) DFT. Second, one must reduce the multi-dimensional DFT to a sequence of of rank-1 problems, i.e., one-dimensional DFTs; for simplicity, however, we do not consider multi-dimensional DFTs below. Third, one must solve the rank-1, vector rank-0 problem by means of some DFT algorithm such as Cooley-Tukey. These three steps need not be executed in the stated order, however, and in fact, almost every permutation and interleaving of these three steps leads to a correct DFT plan. The choice of the set of plans explored by the planner is critical for the usability of the FFTW system: the set must be large enough to contain the fastest possible plans, but it must be small enough to keep the planning time acceptable.

#### Rank-0 plans

The rank-0 problem \(\text{dft}(\left \{ \right \},\mathbf{V},\mathbf{I},\mathbf{O})\) denotes a permutation of the input array into the output array. FFTW does not solve arbitrary rank-0 problems, only the following two special cases that arise in practice.

- When \(\left | \mathbf{V} \right |=1\) and \(\mathbf{I}\neq 0\), FFTW produces a plan that copies the input array into the output array. Depending on the strides, the plan consists of a loop or, possibly, of a call to the ANSI C function
`memcpy`

, which is specialized to copy contiguous regions of memory. - When \(\left | \mathbf{V} \right |=2\) and \(\mathbf{I}=0\), and the strides denote a matrix-transposition problem, FFTW creates a plan that transposes the array in-place. FFTW implements the square transposition \(\text{dft}(\left \{ \right \},\left \{ (n,\imath ,o),(n,o,\imath ), \right \},\mathbf{I},\mathbf{O})\) by means of the cache-oblivious algorithm, which is fast and, in theory, uses the cache optimally regardless of the cache size (using principles similar to those described in the section FFTs and the Memory Hierarchy ). A generalization of this idea is employed for non-square transpositions with a large common factor or a small difference between the dimensions, adapting algorithms from

#### Rank-1 plans

Rank-1 DFT problems denote ordinary one-dimensional Fourier transforms. FFTW deals with most rank-1 problems as follows.

##### Direct plans

When the DFT rank-1 problem is “small enough” (usually, \(n\leq 64\), FFTW produces a **direct plan** that solves the problem directly. These plans operate by calling a fragment of C code (a **codelet**) specialized to solve problems of one particular size, whose generation is described in Generating Small FFT Kernels . More precisely, the codelets compute a loop \((\left | \mathbf{V} \right |\leq 1)\) of small DFTs.

##### Cooley-Tukey plans

For problems of the form \(\text{dft}(\left \{ (n,\imath ,o)\right \},\mathbf{V},\mathbf{I},\mathbf{O})\) where \(n=rm\), FFTW generates a plan that implements a radix-\(r\) Cooley-Tukey algorithm Review of the Cooley-Tukey FFT . Both decimation-in-time and decimation-in-frequency plans are supported, with both small fixed radices (usually, \(r\leq 64\)) produced by the codelet generator Generating Small FFT Kernels and also arbitrary radices (e.g. radix-\(\sqrt{n}\)).

The most common case is a **decimation in time** (**DIT**) plan, corresponding to a **radix** \(r=n_2\) (and thus \(m=n_1\)) in the notation of Review of the Cooley-Tukey FFT : it first solves \(\text{dft}\left ( \left \{ (m,r.\imath ,o) \right \},\mathbf{V}\cup \left \{ (r,\imath ,m.o) \right \},\mathbf{I},\mathbf{O} \right )\), then multiplies the output array \(\mathbf{O}\) by the twiddle factors, and finally solves \(\text{dft}\left ( \left \{ (r,m.o,m.o) \right \},\mathbf{V}\cup \left \{ (m,o,o) \right \},\mathbf{O},\mathbf{O} \right )\). For performance, the last two steps are not planned independently, but are fused together in a single “twiddle” codelet—a fragment of C code that multiplies its input by the twiddle factors and performs a DFT of size \(r\), operating in-place on \(\mathbf{O}\).

#### Plans for higher vector ranks

These plans extract a vector loop to reduce a DFT problem to a problem of lower vector rank, which is then solved recursively. Any of the vector loops of \(\mathbf{V}\) could be extracted in this way, leading to a number of possible plans corresponding to different loop orderings.

Formally, to solve \(\text{dft}(\mathbf{N},\mathbf{V},\mathbf{I},\mathbf{O})\), where \(\mathbf{V}=\left \{ (n,\imath ,o) \right \}\cup \mathbf{V_1}\), FFTW generates a loop that, for all \(k\) such that \(0\leq k< n\), invokes a plan for \(\text{dft}(\mathbf{N},\mathbf{V_1},\mathbf{I}+k.\imath ,\mathbf{O}+k.o)\).

##### Indirect plans

Indirect plans transform a DFT problem that requires some data shuffling (or discontiguous operation) into a problem that requires no shuffling plus a rank-0 problem that performs the shuffling.

Formally, to solve \(\text{dft}(\mathbf{N},\mathbf{V},\mathbf{I},\mathbf{O})\) where \(\left | \mathbf{N} \right |> 0\), FFTW generates a plan that first solves \[\text{dft}(\left \{ \right \}\mathbf{N}\cup \mathbf{V},\mathbf{I},\mathbf{O})\), and then solves \(\text{dft}(\text{copy-o}(\mathbf{N}),\text{copy-o}(\mathbf{V}),\mathbf{O},\mathbf{O})\). Here we define \(\text{copy-o}(t)\) to be the I/O tensor \(\left \{ (n,o,o)\mid (n,\imath ,o) \in t\right \}\): that is, it replaces the input strides with the output strides. Thus, an indirect plan first rearranges/copies the data to the output, then solves the problem in place.

##### Plans for prime sizes

As discussed in Goals and Background of the FFTW Project it turns out to be surprisingly useful to be able to handle large prime \(n\) (or large prime factors). **Rader plans** implement the algorithm from above to compute one-dimensional DFTs of prime size in \(\Theta (n\log n)\) time. **Bluestein plans** implement Bluestein's “chirp-z” algorithm, which can also handle prime (n\) in \(\Theta (n\log n)\) time. **Generic plans** implement a naive \(\Theta (n^2)\) algorithm (useful for \(n\underset{\sim }{< }100\)).

### Discussion

Although it may not be immediately apparent, the combination of the recursive rules in **The space of plans in FFTW **above, can produce a number of useful algorithms. To illustrate these compositions, we discuss three particular issues: depth- vs. breadth-first, loop reordering, and in-place transforms.

*Fig. 10.5.1 Two possible decompositions for a size-30 DFT, both for the arbitrary choice of DIT radices 3 then 2 then 5, and prime-size codelets. Items grouped by a "{*$\{$* " result from the plan for a single sub-problem. In the depth-first case, the vector rank was reduced to zero as per Plans for higher vector ranks before decomposing sub-problems, and vice-versa in the breadth-first case*

As discussed previously in sections Review of the Cooley-Tukey FFT and Understanding FFTs with an ideal cache, the same Cooley-Tukey decomposition can be executed in either traditional breadth-first order or in recursive depth-first order, where the latter has some theoretical cache advantages. FFTW is explicitly recursive, and thus it can naturally employ a depth-first order. Because its sub-problems contain a vector loop that can be executed in a variety of orders, however, FFTW can also employ breadth-first traversal. In particular, a 1d algorithm resembling the traditional breadth-first Cooley-Tukey would result from applying **Cooley-Tukey plans **to completely factorize the problem size before applying the loop rule

**Plans for higher vector ranks**to reduce the vector ranks, whereas depth-first traversal would result from applying the loop rule before factorizing each subtransform. These two possibilities are illustrated by an example in Fig. 10.5.1 above.

Another example of the effect of loop reordering is a style of plan that we sometimes call **vector recursion** (unrelated to “vector-radix” FFTs). The basic idea is that, if one has a loop (vector-rank 1) of transforms, where the vector stride is smaller than the transform size, it is advantageous to push the loop towards the leaves of the transform decomposition, while otherwise maintaining recursive depth-first ordering, rather than looping “outside” the transform; i.e., apply the usual FFT to “vectors” rather than numbers. Limited forms of this idea have appeared for computing multiple FFTs on vector processors (where the loop in question maps directly to a hardware vector). For example, Cooley-Tukey produces a unit **input**-stride vector loop at the top-level DIT decomposition, but with a large **output** stride; this difference in strides makes it non-obvious whether vector recursion is advantageous for the sub-problem, but for large transforms we often observe the planner to choose this possibility.

In-place 1d transforms (with no separate bit reversal pass) can be obtained as follows by a combination DIT and DIF plans **Cooley-Tukey**** ****plans **with transposes "Rank-0 plans". First, the transform is decomposed via a radix-\(p\) DIT plan into a vector of \(p\) transforms of size \(qm\), then these are decomposed in turn by a radix-\(q\), DIF plan into a vector (rank 2) of \(p\times q\) transforms of size \(m\). These transforms of size \(m\) have input and output at different places/strides in the original array, and so cannot be solved independently. Instead, an indirect plan "Indirect plans" is used to express the sub-problem as size \(pq\) in-place transforms of size \(m\), followed or preceded by an \(m\times p\times q\) rank-0 transform. The latter sub-problem is easily seen to be \(m\) in-place \(p\times q\) transposes (ideally square, i.e. \(p=q\).

### The FFTW planner

Given a problem and a set of possible plans, the basic principle behind the FFTW planner is straightforward: construct a plan for each applicable algorithmic step, time the execution of these plans, and select the fastest one. Each algorithmic step may break the problem into subproblems, and the fastest plan for each subproblem is constructed in the same way. These timing measurements can either be performed at runtime, or alternatively the plans for a given set of sizes can be precomputed and loaded at a later time.

A direct implementation of this approach, however, faces an exponential explosion of the number of possible plans, and hence of the planning time, as \(n\) increases. In order to reduce the planning time to a manageable level, we employ several heuristics to reduce the space of possible plans that must be compared. The most important of these heuristics is **dynamic programmin**: it optimizes each sub-problem locally, independently of the larger context (so that the “best” plan for a given sub-problem is re-used whenever that sub-problem is encountered). Dynamic programming is not guaranteed to find the fastest plan, because the performance of plans is context-dependent on real machines (e.g., the contents of the cache depend on the preceding computations); however, this approximation works reasonably well in practice and greatly reduces the planning time. Other approximations, such as restrictions on the types of loop-reorderings that are considered **Plans for higher vector ranks **are also described.

Alternatively, there is an **estimate mode** that performs no timing measurements whatsoever, but instead minimizes a heuristic cost function. This can reduce the planner time by several orders of magnitude, but with a significant penalty observed in plan efficiency; e.g., a penalty of 20% is typical for moderate \(n\underset{\sim }{< }2^{13}\), whereas a factor of 2–3 can be suffered for large \(n\underset{\sim }{> }2^{16}\). Coming up with a better heuristic plan is an interesting open research question; one difficulty is that, because FFT algorithms depend on factorization, knowing a good plan for \(n\) does not immediately help one find a good plan for nearby \(n\).

**Footnote**

^{10}** **I/O tensors are unrelated to the tensor-product notation used by some other authors to describe FFT algorithms.

- ContribEEBurrus