# 6.5: The Automatic Generation of Winograd's Short DFTs

### Introduction

Efficient prime length DFTs are important for two reasons. A particular application may require a prime length DFT and secondly, the maximum length and the variety of lengths of a PFA or WFTA algorithm depend upon the availability of prime length modules.

This discusses automation of the process Winograd used for constructing prime length FFTs for \(N<7\) and that Johnson and Burrus extended to \(N<19\). It also describes a program that will design any prime length FFT in principle, and will also automatically generate the algorithm as a C program and draw the corresponding flow graph.

Winograd's approach uses Rader's method to convert a prime length DFT into a \(P-1\) length cyclic convolution, polynomial residue reduction to decompose the problem into smaller convolutions, and the Toom-Cook algorithm. The Chinese Remainder Theorem (CRT) for polynomials is then used to recombine the shorter convolutions. Unfortunately, the design procedure derived directly from Winograd's theory becomes cumbersome for longer length DFTs, and this has often prevented the design of DFT programs for lengths greater than \(19\).

Here we use three methods to facilitate the construction of prime length FFT modules. First, the matrix exchange property is used so that the transpose of the reduction operator can be used rather than the more complicated CRT reconstruction operator. This is then combined with the numerical method for obtaining the multiplication coefficients rather than the direct use of the CRT. We also deviate from the Toom-Cook algorithm, because it requires too many additions for the lengths in which we are interested. Instead we use an iterated polynomial multiplication algorithm. We have incorporated these three ideas into a single structural procedure that automates the design of prime length FFTs.

### Matrix Description

It is important that each step in the Winograd FFT can be described using matrices. By expressing cyclic convolution as a bilinear form, a compact form of prime length DFTs can be obtained.

If \(y\) is the cyclic convolution of \(h\) and \(x\), then \(y\) can be expressed as

\[y=C\left [ Ax.\ast Bh \right ]\]

where, using the Matlab convention, .* represents point by point multiplication. When \(A\), \(B\) and \(C\) are allowed to be complex, \(A\) and \(B\) are seen to be the DFT operator and \(C\), the inverse DFT. When only real numbers are allowed, \(A\), \(B\) and \(C\) will be rectangular. Using the matrix exchange property this form can be written as

\[y=RB^{T}\left [ C^{T}Rh.\ast Ax \right ]\]

where \(R\) is the permutation matrix that reverses order.

When \(h\) is fixed, as it is when considering prime length DFTs, the term \(C^TRh\) can be precomputed and a diagonal matrix \(D\) formed by

\[D=diag{C^{T}Rh}\]

This is advantageous because in general, \(C\) is more complicated than \(B\), so the ability to “hide" \(C\) saves computation. Now

\[y=RB^{T}DAx\; \; or\; \; y=RA^{T}DAx\]

since \(A\) and \(B\) can be the same; they implement a polynomial reduction. The form \(y=R^{T}DAxT\) can also be used for the prime length DFTs, it is only necessary to permute the entries of \(x\) and to ensure that the \(DC\) term is computed correctly. The computation of the \(DC\) term is simple, for the residue of a polynomial modulo \(a-1\) is always the sum of the coefficients. After adding the \(x_0\) term of the original input sequence, to the \(s-1\) residue, the \(DC\) term is obtained. Now

\[DFT\left \{ x \right \}=RA^{T}DAx\]

Johnson observes that by permuting the elements on the diagonal of \(D\), the output can be permuted, so that the \(R\) matrix can be hidden in \(D\), and

\[DFT\left \{ x \right \}=A^{T}DAx\]

From the knowledge of this form, once \(A\) is found, \(D\) can be found numerically.

### Programming the Design Procedure

Because each of the above steps can be described by matrices, the development of a prime length FFTs is made convenient with the use of a matrix oriented programming language such as Matlab. After specifying the appropriate matrices that describe the desired FFT algorithm, generating code involves compiling the matrices into the desired code for execution.

Each matrix is a section of one stage of the flow graph that corresponds to the DFT program. The four stages are:

- Permutation Stage: Permutes input and output sequence.
- Reduction Stage: Reduces the cyclic convolution to smaller polynomial products.
- Polynomial Product Stage: Performs the polynomial multiplications.
- Multiplication Stage: Implements the point-by-point multiplication in the bilinear form.

Each of the stages can be clearly seen in the flow graphs for the DFTs. Fig. 6.5.1 shows the flow graph for a length \(17\) DFT algorithm that was automatically drawn by the program.

*Fig. 6.5.1 Flowgraph of length-17 DFT*

The programs that accomplish this process are written in Matlab and C. Those that compute the appropriate matrices are written in Matlab. These matrices are then stored as two ASCII files, with the dimensions in one and the matrix elements in the second. A C program then reads the flies and compiles them to produce the final FFT program in C.

### The Reduction Stage

The reduction of an \(N^{th}\) degree polynomial, \(X(s)\), modulo the cyclotomic polynomial factors of (\(s^N-1\)) requires only additions for many \(N\), however, the actual number of additions depends upon the way in which the reduction proceeds. The reduction is most efficiently performed in steps. For example,

\[If\; \; N=4\; \; and\; \; ((X(s)))_{s-1},\; \; ((X(s)))_{s+1}\; \; and\; \; ((X(s)))_{s^{2}+1},\]

where the double parenthesis denote polynomial reduction modulo

\[(s-1),\; \; (s+1),\; \; and\; \; (s^{2}+1)\]

then in the first step

\[((X(s)))_{s^{2}-1}\; \; and\; \; ((X(s)))_{s^{2}+1}\]

should be computed.

In the second step,

\[((X(s)))_{s-1}\; \; and\; \; ((X(s)))_{s+1}\]

can be found by reducing

\[((X(s)))_{s^{2}-1}\]

This process is described by the diagram in Fig. 6.5.2 below.

*Fig. 6.5.2 Factorization of s ^{4}-1 in steps*

When \(N\) is even, the appropriate first factorization is

\[(S^{N/2}-1)(S^{N/2}+1)\]

However, the next appropriate factorization is frequently less obvious. The following procedure has been found to generate a factorization in steps that coincides with the factorization that minimizes the cumulative number of additions incurred by the steps. The prime factors of \(N\) are the basis of this procedure and their importance is clear from the useful well-known equation:

\[s^{N}-1=\prod _{n\mid N}C_{n}(s)\]

where \(C_n(s)\) is the \(n^{th}\) cyclotomic polynomial.

We first introduce the following two functions defined on the positive integers,

\[\psi (N)=the\; smallest\; prime\; factor\; of\; N\; for\; N>1\; and\; \psi (1)=1\]

Suppose \(P(s)\) is equal to either (\(s^N-1\)) or an intermediate noncyclotomic polynomial appearing in the factorization process, for example, (\(a^2-1\)), above. Write \(P(s)\) in terms of its cyclotomic factors,

\[P(s)=C_{k_{1}}(s)C_{k_{2}}(s)...C_{k_{L}}(s)\]

define the two sets, \(G\) and \(G\), by

\[G=\left \{ k_{1},...,k_{L} \right \}\; and\; G=\left \{ k/gcd(G):k\epsilon G \right \}\]

and define the two integers, \(t\) and \(T\), by

\[t=min\left \{ \psi (k):k\: \epsilon\: G,\: k>1 \right \}\; \; and\; \; T=max\: nu(k,t):k\: \epsilon \: G\]

Then form two new sets,

\[A=\left \{ k\: \epsilon \: G:T\mid k \right \}\; \; and\; \; B=\left \{ k\: \epsilon \: G:T\mid k \right \}\]

The factorization of \(P(s)\),

\[P(s)=\left ( \prod _{k\: \epsilon \: A} C_{k}(s)\right )\left ( \prod _{k\: \epsilon \: B} C_{k}(s)\right )\]

has been found useful in the procedure for factoring (\(s^N-1\)). This is best illustrated with an example.

#### Example

\(N=36\)

**Step 1:**

Let

\[P(s)=s^{36}-1\]

Since

\[P=C_{1}C_{2}C_{3}C_{4}C_{6}C_{9}C_{12}C_{18}C_{36}\]

\[G=G=\left \{ 1,2,3,4,6,9,12,18,36 \right \}\]

\[t=min\; \left \{ 2,3 \right \}=2\]

\[A=\left \{ k\: \epsilon \: G:4\mid k \right \}=\left \{ 1,2,3,6,9,18 \right \}\]

\[B=\left \{ k\: \epsilon \: G:4\mid k \right \}=\left \{ 4,12,36 \right \}\]

Hence the factorization of \(s^{36}-1\) into two intermediate polynomials is as expected,

\[\prod _{k\: \epsilon \: A} C_{k}(s)=s^{18}-1,\; \; \prod _{k\: \epsilon \: B} C_{k}(s)=s^{18}+1\]

If a \(36^{th}\) degree polynomial, \(X(s)\), is represented by a vector of coefficients,

\[X=(x_{35},...x_{0})'\; then\; ((X(s)))_{s^{18}-1}\; represented\; by\; X'\; and\; ((X(s)))_{s^{18}+1}\; represented\; by\; X''\]

is given by **test **which entails \(36\) additions.

**Step 2:**

This procedure is repeated with

\[P(s)=s^{18}-1 \; and\; P(s)=s^{18}+1\]

We will just show it for the later. Let

\[P(s)=s^{18}+1\]

Since

\[P=C_{4}C_{12}C_{36}\]

\[G=\left \{ 4,12,36 \right \}\; and\; G'=\left \{ 1,3,9 \right \}\]

\[t=min\: 3=3\]

\[T=max\: \nu (k,3):k\: \epsilon \: G=max\: 1,3,9=9\]

\[A=\left \{ k\: \epsilon \: G:9\mid k \right \}=\left \{ 4,12\right \}\]

\[B=\left \{ k\: \epsilon \: G:9\mid k \right \}=\left \{ 36\right \}\]

This yields the two intermediate polynomials

\[s^{6}+1\; \; and\; \; s^{12}-s^{6}+1\]

In the notation used above,

\[\begin{bmatrix} X' & X'' \end{bmatrix}=\begin{bmatrix} I_{6} & -I_{6} & I_{6}\\ I_{6} & I_{6} & \\ -I_{6} & & I_{6} \end{bmatrix}X\]

entailing \(24\) additions. Continuing this process results in a factorization in steps.

In order to see the number of additions this scheme uses for numbers of the form \(N=P-1\) (which is relevant to prime length FFT algorithms).

### The Polynomial Product Stage

The iterated convolution algorithm can be used to construct an \(N\) point linear convolution algorithm from shorter linear convolution algorithms. Suppose the linear convolution \(y\), of the n point vectors \(x\) and \(h\) (\(h\) known) is described by

\[y=E_{n}^{T}DE_{n}x\]

where \(E_n\) is an “expansion" matrix the elements of which are \(\pm 1\)'s and \(0\)'s and \(D\) is an appropriate diagonal matrix. Because the only multiplications in this expression are by the elements of \(D\), the number of multiplications required, \(M(n)\), is equal to the number of rows of \(E_n\). The number of additions is denoted by \(M(n)\).

Given a matrix \(E_{n1}\)_{ }and a matrix \(E_{n2}\), the iterated algorithm gives a method for combining \(E_{n1}\) and \(E_{n2}\) to construct a valid expansion matrix, \(E_{n}\), for

\[N\leq n_{1}n_{2}\]

Specifically,

\[E_{n_{1},n_{2}}=(I_{m(n_{2})}\otimes E_{n_{1}})(E_{n_{2}}\times I_{n_{1}})\]

The product \(n_1n_2\) may be greater than \(N\), for zeros can be (conceptually) appended to \(x\). The operation count associated with \(E_{n1}E_{n2}\) is

\[A(n_{1},n_{2})=n!A(n_{2})+A(n_{1})M(n_{2})\]

\[M(n_{1},n_{2})=M(n_{1})M(n_{2})\]

Although they are both valid expansion matrices,

\[E_{n_{1},n_{2}}\neq E_{n_{2},n_{1}}\; \; and\; \; A_{n_{1},n_{2}}\neq A_{n_{2},n_{1}}\]

Because

\[M_{n_{1},n_{2}}\neq M_{n_{2},n_{1}}\]

it is desirable to chose an ordering of factors to minimize the additions incurred by the expansion matrix.

#### Multiple Factors

Note that a valid expansion matrix, \(E_{n}\), can be constructed from \(E_{n_1,n_2}\) and \(E_{n3}\), for

\[N\leq n_{1}n_{2}n_{3}\]

In general, any number of factors can be used to create larger expansion matrices. The operation count associated with is \(E_{n_1,n_2,n_3}\) is

\[A(n_{1},n_{2},n_{3})=n_{1}n_{2}A(n_{3})+n_{1}A(n_{2})M(n_{3})+A(n_{1})M(n_{2})M(n_{3})\]

\[M(n_{1},n_{2},n_{3})=M(n_{1})M(n_{2})M(n_{3})\]

These equations generalize in the predicted way when more factors are considered. Because the ordering of the factors is relevant in the equation for \(A(\cdot )\) but not for \(M(\cdot )\), it is again desirable to order the factors to minimize the number of additions. By exploiting the following property of the expressions for \(A(\cdot )\) and \(M(\cdot )\), the optimal ordering can be found.

#### Reservation of Optimal Ordering

Suppose

\( A(n_1,n_2,n_3) \le \text{min} \{A(n_{k_1},n_{k_2},n_{k_3}) \: k_1,k_2,k_3 \in \{1,2,3\} \text{ and distinct} \} \), then

1. | \(A(n_1,n_2) \le A(n_2,n_1)\) |

2. | \(A(n_2,n_3) \le A(n_3,n_2)\) |

3. | \(A(n_1,n_3) \le A(n_3,n_1)\) |

The generalization of this property to more than two factos revelas that an optimal ordering of \( \{n_1, \ldots , n_{L-i} \} \) is preserved in an optimal ordering of \( \{n_1, \ldots , n_{L} \} \). Therefore, if \( (n_1, \ldots , n_{L-i}) \) is an optimal odering of \( \{n_1, \ldots , n_{L} \} \), then \( (n_k,n_{k+1}) \) is an optimal ordering of \( \{ n_k,n_{k+1} \} \) and consequently

\[ \dfrac{A(n_k)}{M(n_k)-n_k} \le \dfrac{A(n_{k+1})}{M(n_{k+1})-n_{k+1}} \]

for all \(k=1,2,\ldots,L-1\).

This immediately suggests that an optimal ordering of \( \{n_1, \ldots , n_L \} \), to minimize the number of additions incurred by \( E_{n_1, \ldots , n_L}\) simply involves computing the appropriate ratios.

### Discussion and Conclusion

We have designed prime length FFTs up to length \(53\) that are as good as the previous designs that only went up to \(19\). Table 1 gives the operation counts for the new and previously designed modules, assuming complex inputs.

It is interesting to note that the operation counts depend on the factorability of \(P−1\). The primes \(11\), \(23\), and \(47\) are all of the form \(1+2P_1\) making the design of efficient FFTs for these lengths more difficult.

Further deviations from the original Winograd approach than we have made could prove useful for longer lengths. We investigated, for example, the use of twiddle factors at appropriate points in the decomposition stage; these can sometimes be used to divide the cyclic convolution into smaller convolutions. Their use means, however, that the 'center* multiplications would no longer be by purely real or imaginary numbers.

N |
Mult |
Adds |

7 | 16 | 72 |

11 | 40 | 168 |

13 | 40 | 188 |

17 | 82 | 274 |

19 | 88 | 360 |

23 | 174 | 672 |

29 | 190 | 766 |

31 | 160 | 984 |

37 | 220 | 920 |

41 | 282 | 1140 |

43 | 304 | 1416 |

47 | 640 | 2088 |

53 | 556 | 2038 |

Operation counts for prime length DFTs

The approach in writing a program that writes another program is a valuable one for several reasons. Programming the design process for the design of prime length FFTs has the advantages of being practical, error-free, and flexible. The flexibility is important because it allows for modification and experimentation with different algorithmic ideas. Above all, it has allowed longer DFTs to be reliably designed.

More details on the generation of programs for prime length FFTs can be found in the 1993 Technical Report.

### Contributor

- ContribEEBurrus