# 6.4: Winograd's Complexity Theorems

Because Winograd's work has been the foundation of the modern results in efficient convolution and DFT algorithms, it is worthwhile to look at his theoretical conclusions on optimal algorithms. Most of his results are stated in terms of polynomial multiplication as Polynomial Description of Signals. The measure of computational complexity is usually the number of multiplications, and only certain multiplications are counted. This must be understood in order not to misinterpret the results.

This section will simply give a statement of the pertinent results and will not attempt to derive or prove anything. A short interpretation of each theorem will be given to relate the result to the algorithms developed in this chapter. The indicated references should be consulted for background and detail.

### Theorem 1

Given two polynomials, \(x(s)\) and \(h(s)\), of degree \(N\) and \(M\) respectively, each with indeterminate coefficients that are elements of a field \(H\), \(N+M+1\) multiplications are necessary to compute the coefficients of the product polynomial \(x(s)h(s)\). Multiplication by elements of the field \(G\)(the field of constants), which is contained in \(H\), are not counted and \(G\) contains at least \(N+M\) distinct elements.

The upper bound in this theorem can be realized by choosing an arbitrary modulus polynomial \(P(s)\) of degree \(N+M+1\) composed of \(N+M+1\) distinct linear polynomial factors with coefficients in \(G\) which, since its degree is greater than the product \(x(s)h(s)\), has no effect on the product, and by reducing \(x(s)\) and \(h(s)\) to \(N+M+1\) residues modulo the \(N+M+1\) factors of \(P(s)\). These residues are multiplied by each other, requiring \(N+M+1\) multiplications, and the results recombined using the Chinese remainder theorem (CRT). The operations required in the reduction and recombination are not counted, while the residue multiplications are. Since the modulus \(P(s)\) is arbitrary, its factors are chosen to be simple so as to make the reduction and CRT simple. Factors of zero, plus and minus unity, and infinity are the simplest. Plus and minus two and other factors complicate the actual calculations considerably, but the theorem does not take that into account. This algorithm is a form of the Toom-Cook algorithm and of Lagrange interpolation. For our applications, \(H\) is the field of reals and \(G\) the field of rationals.

### Theorem 2

If an algorithm exists which computes \(x(s)h(s)\) in \(N+M+1\) multiplications, all but one of its multiplication steps must necessarily be of the form

\[mk=(gk'+xg(k))(gk''+hg(k))\; \; for\; \; k=0,1,...,N+M\]

where \(g_k\) are distinct elements of \(G\); and \(g_k'\) and \(g_k''\) are arbitrary elements of \(G\).

This theorem states that the structure of an optimal algorithm is essentially unique although the factors of \(P(s)\) may be chosen arbitrarily.

### Theorem 3

Let \(P(s)\) be a polynomial of degree \(N\) and be of the form \(P(s)=Q(s)k\), where \(Q(s)\) is an irreducible polynomial with coefficients in \(G\) and \(k\) is a positive integer. Let \(x(s)\) and \(h(s)\) be two polynomials of degree at least \(N-1\) with coefficients from \(H\), then \(2N-1\) multiplications are required to compute the product \(x(s)h(s)\) modulo \(P(s)\).

This theorem is similar to Theorem 1 with the operations of the reduction of the product modulo \(P(s)\) not being counted.

### Theorem 4

Any algorithm that computes the product \(x(s)h(s)\) modulo \(P(s)\) according to the conditions stated in Theorem 3 and requires \(2N-1\) multiplications will necessarily be of one of three structures, each of which has the form of Theorem 2 internally.

As in Theorem 2, this theorem states that only a limited number of possible structures exist for optimal algorithms.

### Theorem 5

If the modulus polynomial \(P(s)\) has degree \(N\) and is not irreducible, it can be written in a unique factored form

\[P(s)=P_{1}^{m_{1}}(s)P_{2}^{m_{2}}(s)...P_{k}^{m_{k}}(s)\]

where each of the \(P_i(s)\) are irreducible over the allowed coefficient field \(G\). \(2N-k\) multiplications are necessary to compute the product \(x(s)h(s)\) modulo \(P(s)\) where \(x(s)\) and \(h(s)\) have coefficients in \(H\) and are of degree at least \(N-1\). All algorithms that calculate this product in \(2N-k\) multiplications must be of a form where each of the k residue polynomials of \(x(s)\) and \(h(s)\) are separately multiplied modulo the factors of \(P(s)\) via the CRT.

#### Corollary

If the modulus polynomial is \(P(s)=s^{N}-1\) then \(2N-t(N)\) multiplications are necessary to compute \(x(s)h(s)\) modulo \(P(s)\), where \(t(N)\) is the number of positive divisors of \(N\).

Theorem 5 is very general since it allows a general modulus polynomial. The proof of the upper bound involves reducing \(x(s)\) and \(h(s)\) modulo the \(k\) factors of \(P(s)\). Each of the \(k\) irreducible residue polynomials is then multiplied using the method of Theorem 4 requiring \(2Ni-1\) multiplies and the products are combined using the CRT. The total number of multiplies from the \(k\) parts is \(2n-k\). The theorem also states the structure of these optimal algorithms is essentially unique. The special case of \(P(s)=s^{N}-1\) is interesting since it corresponds to cyclic convolution and, as stated in the corollary, \(k\) is easily determined. The factors of \(s^N-1\) are called cyclotomic polynomials and have interesting properties.

### Theorem 6

Consider calculating the DFT of a prime length real-valued number sequence. If \(G\) is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-\(P\) DFT is

\[u(DFT(N))=2P-3-t(P-1)\; \; where\; \; t(P-1)\; is\; the\; number\; of\; divisors\; of\; P-1\]

This theorem not only gives a lower limit on any practical prime length DFT algorithm, it also gives practical algorithms for \(N=3,5,\: \text{and}\: 7\). Consider the operation counts inthe Table to understand this theorem. In addition to the real multiplications counted by complexity theory, each optimal prime-length algorithm will have one multiplication by a rational constant. That constant corresponds to the residue modulo (\(s-1\)) which always exists for the modulus \(P(s)=s^{N}-1\). In a practical algorithm, this multiplication must be carried out, and that accounts for the difference in the prediction of Theorem 6 and count in the Table. In addition, there is another operation that for certain applications must be counted as a multiplication. That is the calculation of the zero frequency term \(X(0)\) in the first row of the example in The DFT as Convolution or Filtering. For applications to the WFTA discussed in The Prime Factor and Winograd Fourier Transform Algorithms, that operation must be counted as a multiply. For lengths longer than \(7\), optimal algorithms require too many additions, so compromise structures are used.

### Theorem 7

If \(G\) is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-\(N\) DFT where \(N\) is a prime number raised to an integer power: \(N=Pm\), is given by

\[u(DFT(N))=2N-((m2+m)/2)t(P-1)-m-1\]

where \(t(P-1)\) is the number of divisors of (\(P-1\)) .

This result seems to be practically achievable only for \(N=9\) , or perhaps \(25\). In the case of \(N=9\), there are two rational multiplies that must be carried out and are counted in the Table but are not predicted by Theorem 7. Experience indicates that even for \(N=25\), an algorithm based on a Cooley-Tukey FFT using a type 2 index map gives an over-all more balanced result.

### Theorem 8

If \(G\) is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-\(N\) DFT where \(N=2m\) is given by

\[u(DFT(N))=2N-m2-m-2\]

This result is not practically useful because the number of additions necessary to realize this minimum of multiplications becomes very large for lengths greater than \(16\). Nevertheless, it proves the minimum number of multiplications required of an optimal algorithm is a linear function of \(N\) rather than of \(N\log N\) which is that required of practical algorithms. The best practical power-of-two algorithm seems to the Split-Radix FFT discussed in The Cooley-Tukey Fast Fourier Transform Algorithm.

All of these theorems use ideas based on residue reduction, multiplication of the residues, and then combination by the CRT. It is remarkable that this approach finds the minimum number of required multiplications by a constructive proof which generates an algorithm that achieves this minimum; and the structure of the optimal algorithm is, within certain variations, unique. For shorter lengths, the optimal algorithms give practical programs. For longer lengths the uncounted operations involved with the multiplication of the higher degree residue polynomials become very large and impractical. In those cases, efficient suboptimal algorithms can be generated by using the same residue reduction as for the optimal case, but by using methods other than the Toom-Cook algorithm of Theorem 1 to multiply the residue polynomials.

Practical long DFT algorithms are produced by combining short prime length optimal DFT's with the Type 1 index map from Multidimensional Index Mapping to give the Prime Factor Algorithm (PFA) and the Winograd Fourier Transform Algorithm (WFTA) discussed in The Prime Factor and Winograd Fourier Transform Algorithms. It is interesting to note that the index mapping technique is useful inside the short DFT algorithms to replace the Toom-Cook algorithm and outside to combine the short DFT's to calculate long DFT's.

### Contributor

- ContribEEBurrus