9.4: Modifications of the PFA and WFTA Type Algorithms
In the previous section it was seen how using the permutation property of the elementary operators in the PFA allowed the nesting of the multiplications to reduce their number. It was also seen that a proper ordering of the operators could minimize the number of additions. These ideas have been extended in formulating a more general algorithm optimizing problem. If the DFT operator \(F\) in the equation is expressed in a still more factored form obtained from Winograd's Short DFT Algorithms, a greater variety of ordering can be optimized. For example if the \(A\) operators have two factors
\[F_1=A_1^TA_1^{'T}D_1A_1^{'}A_1\]
The DFT in the equation becomes
\[X=A_2^TA_2^{'T}D_2A_2^{'}A_2A_1^TA_1^{'T}D_1A_1^{'}A_1x\]
The operator notation is very helpful in understanding the central ideas, but may hide some important facts. It has been shown that operators in different \(F_i\) commute with each other, but the order of the operators within an \(F_i\)cannot be changed. They represent the matrix multiplications in Winograd's Short DFT Algorithms which do not commute.
This formulation allows a very large set of possible orderings, in fact, the number is so large that some automatic technique must be used to find the “best". It is possible to set up a criterion of optimality that not only includes the number of multiplications but the number of additions as well. The effects of relative multiplyadd times, data transfer times, CPU register and memory sizes, and other hardware characteristics can be included in the criterion. Dynamic programming can then be applied to derive an optimal algorithm for a particular application. This is a very interesting idea as there is no longer a single algorithm, but a class and an optimizing procedure. The challenge is to generate a broad enough class to result in a solution that is close to a global optimum and to have a practical scheme for finding the solution.
Results obtained applying the dynamic programming method to the design of fairly long DFT algorithms gave algorithms that had fewer multiplications and additions than either a pure PFA or WFTA. It seems that some nesting is desirable but not total nesting for four or more factors. There are also some interesting possibilities in mixing the CooleyTukey with this formulation. Unfortunately, the twiddle factors are not the same for all rows and columns, therefore, operations cannot commute past a twiddle factor operator. There are ways of breaking the total algorithm into horizontal paths and using different orderings along the different paths. In a sense, this is what the splitradix FFT does with its twiddle factors when compared to a conventional CooleyTukey FFT.
There are other modifications of the basic structure of the Type1 index map DFT algorithm. One is to use the same index structure and conversion of the short DFT's to convolution as the PFA but to use some other method for the highspeed convolution. Table lookup of partial products based on distributed arithmetic to eliminate all multiplications looks promising for certain very specific applications, perhaps for specialized VLSI implementation. Another possibility is to calculate the short convolutions using numbertheoretic transforms. This would also require special hardware. Direct calculation of short convolutions is faster on certain pipelined processor such as the TMS320 microprocessor.

ContribEEBurrus