# 8.5: Evaluation of the Cooley-Tukey FFT Algorithms


The evaluation of any FFT algorithm starts with a count of the real (or floating-point) arithmetic. Table 8.5.1 below gives the number of real multiplications and additions required to calculate a length-N FFT of complex data. Results for programs with one, two, three, and five butterflies are given to show the improvement that can be expected from removing unnecessary multiplications and additions. Results for radix-2, radix-4, radix-8, and radix-16 Cooley-Tukey FFTs, as well as for the split-radix FFT, are given to show the relative merits of the various structures. These counts should be compared with the table of counts for the PFA and WFTA programs in The Prime Factor and Winograd Fourier Transform Algorithms. All programs use the four-multiply-two-add complex multiply algorithm. A similar table can be developed for the three-multiply-three-add algorithm, but the relative results are the same.
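The two complex-multiply algorithms mentioned above can be written out directly. The following Python sketch is illustrative (the function names are not from the text); note that the three-multiply form needs only three additions at run time because the twiddle-factor combinations c+d and d−c can be precomputed and stored.

```python
def cmul_4m2a(a, b, c, d):
    """(a + jb)(c + jd) with 4 real multiplies and 2 real adds."""
    return a * c - b * d, a * d + b * c

def cmul_3m3a(a, b, c_plus_d, d_minus_c, c):
    """Same product with 3 real multiplies and 3 real adds, given the
    precomputed twiddle-factor combinations c+d, d-c, and c."""
    k1 = c * (a + b)
    return k1 - b * c_plus_d, k1 + a * d_minus_c
```

Both return the same product; the trade of one multiplication for one addition explains why the two schemes give the same relative rankings of the algorithms in the table.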

From the table it is seen that a greater improvement is obtained going from radix-2 to radix-4 than from radix-4 to radix-8 or radix-16. This is partly because the length-2 and length-4 butterflies have no multiplications while the length-8, length-16, and higher butterflies do. It is also seen that going from one to two butterflies gives more improvement than going from two to higher counts. From an operation-count point of view and from practical experience, a three-butterfly radix-4 or a two-butterfly radix-8 FFT is a good compromise. The radix-8 and radix-16 programs become long, especially with multiple butterflies, and they give a limited choice of transform lengths unless combined with some length-2 and length-4 butterflies.

| Radix | N | M1 | M2 | M3 | M5 | A1 | A2 | A3 | A5 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 4 | 0 | 0 | 0 | 6 | 4 | 4 | 4 |
| 2 | 4 | 16 | 4 | 0 | 0 | 24 | 18 | 16 | 16 |
| 2 | 8 | 48 | 20 | 8 | 4 | 72 | 58 | 52 | 52 |
| 2 | 16 | 128 | 68 | 40 | 28 | 192 | 162 | 148 | 148 |
| 2 | 32 | 320 | 196 | 136 | 108 | 480 | 418 | 388 | 388 |
| 2 | 64 | 768 | 516 | 392 | 332 | 1152 | 1026 | 964 | 964 |
| 2 | 128 | 1792 | 1284 | 1032 | 908 | 2688 | 2434 | 2308 | 2308 |
| 2 | 256 | 4096 | 3076 | 2568 | 2316 | 6144 | 5634 | 5380 | 5380 |
| 2 | 512 | 9216 | 7172 | 6152 | 5644 | 13824 | 12802 | 12292 | 12292 |
| 2 | 1024 | 20480 | 16388 | 14344 | 13324 | 30720 | 28674 | 27652 | 27652 |
| 2 | 2048 | 45056 | 36868 | 32776 | 30732 | 67584 | 63490 | 61444 | 61444 |
| 2 | 4096 | 98304 | 81924 | 73736 | 69644 | 147456 | 139266 | 135172 | 135172 |
| 4 | 4 | 12 | 0 | 0 | 0 | 22 | 16 | 16 | 16 |
| 4 | 16 | 96 | 36 | 28 | 24 | 176 | 146 | 144 | 144 |
| 4 | 64 | 576 | 324 | 284 | 264 | 1056 | 930 | 920 | 920 |
| 4 | 256 | 3072 | 2052 | 1884 | 1800 | 5632 | 5122 | 5080 | 5080 |
| 4 | 1024 | 15360 | 11268 | 10588 | 10248 | 28160 | 26114 | 25944 | 25944 |
| 4 | 4096 | 73728 | 57348 | 54620 | 53256 | 135168 | 126978 | 126296 | 126296 |
| 8 | 8 | 32 | 4 | 4 | 4 | 66 | 52 | 52 | 52 |
| 8 | 64 | 512 | 260 | 252 | 248 | 1056 | 930 | 928 | 928 |
| 8 | 512 | 6144 | 4100 | 4028 | 3992 | 12672 | 11650 | 11632 | 11632 |
| 8 | 4096 | 65536 | 49156 | 48572 | 48280 | 135168 | 126978 | 126832 | 126832 |
| 16 | 16 | 80 | 20 | 20 | 20 | 178 | 148 | 148 | 148 |
| 16 | 256 | 2560 | 1540 | 1532 | 1528 | 5696 | 5186 | 5184 | 5184 |
| 16 | 4096 | 61440 | 45060 | 44924 | 44856 | 136704 | 128514 | 128480 | 128480 |
| Split-radix | 2 | 0 | 0 | 0 | 0 | 4 | 4 | 4 | 4 |
| Split-radix | 4 | 8 | 0 | 0 | 0 | 20 | 16 | 16 | 16 |
| Split-radix | 8 | 24 | 8 | 4 | 4 | 60 | 52 | 52 | 52 |
| Split-radix | 16 | 72 | 32 | 28 | 24 | 164 | 144 | 144 | 144 |
| Split-radix | 32 | 184 | 104 | 92 | 84 | 412 | 372 | 372 | 372 |
| Split-radix | 64 | 456 | 288 | 268 | 248 | 996 | 912 | 912 | 912 |
| Split-radix | 128 | 1080 | 744 | 700 | 660 | 2332 | 2164 | 2164 | 2164 |
| Split-radix | 256 | 2504 | 1824 | 1740 | 1656 | 5348 | 5008 | 5008 | 5008 |
| Split-radix | 512 | 5688 | 4328 | 4156 | 3988 | 12060 | 11380 | 11380 | 11380 |
| Split-radix | 1024 | 12744 | 10016 | 9676 | 9336 | 26852 | 25488 | 25488 | 25488 |
| Split-radix | 2048 | 28216 | 22760 | 22076 | 21396 | 59164 | 56436 | 56436 | 56436 |
| Split-radix | 4096 | 61896 | 50976 | 49612 | 48248 | 129252 | 123792 | 123792 | 123792 |

Table 8.5.1: Number of Real Multiplications and Additions for Complex Single Radix FFTs

In Table 8.5.1, Mi and Ai refer to the number of real multiplications and real additions used by an FFT with i separately written butterflies. The first block has the counts for radix-2, the second for radix-4, the third for radix-8, the fourth for radix-16, and the last for the split-radix FFT. For the split-radix FFT, M3 and A3 refer to the two-butterfly-plus program and M5 and A5 refer to the three-butterfly program.

The first evaluations of FFT algorithms were in terms of the number of real multiplications required, as that was the slowest operation on the computer and therefore controlled the execution speed. Later, with hardware arithmetic, both the number of multiplications and the number of additions became important. Modern systems have arithmetic speeds such that indexing and data-transfer times become important factors. Morris has looked at some of these problems and has developed a procedure called autogen to write partially straight-line program code to significantly reduce overhead and speed up FFT run times. Some hardware, such as the TMS320 signal processing chip, has the multiply and add operations combined. Some machines have vector instructions or parallel processors. Because the execution speed of an FFT depends not only on the algorithm but also on the hardware architecture and compiler, experiments must be run on the system to be used.
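Since the observed speed depends on the whole system, a direct timing experiment is the most reliable evaluation. The following Python sketch is illustrative only (a textbook recursive radix-2 decimation-in-time FFT, not an optimized program): it times a few transform lengths with a wall-clock timer and reports the median over several repetitions to reduce scheduling noise.

```python
import cmath
import time

def fft(x):
    """Minimal recursive radix-2 DIT FFT (illustrative, not optimized).
    Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Apply the twiddle factors W_N^k = exp(-j 2 pi k / N) to the odd half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def time_fft(n, reps=20):
    """Median wall-clock time in seconds of one length-n transform."""
    x = [complex(k % 7, -(k % 5)) for k in range(n)]
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fft(x)
        times.append(time.perf_counter() - t0)
    return sorted(times)[reps // 2]

if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(f"N = {n:5d}: {time_fft(n) * 1e3:.3f} ms")
```

The same harness can be pointed at any FFT routine on the target machine; only the relative timings on that system are meaningful.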

In many cases the unscrambler or bit-reverse counter requires around 10% of the execution time; therefore, it should be eliminated if possible. In high-speed convolution, where the convolution is done by multiplication of DFTs, a decimation-in-frequency FFT can be combined with a decimation-in-time inverse FFT so that no unscrambler is needed. It is also possible for a radix-2 FFT to do the unscrambling inside the FFT, but the resulting structure is not very regular.
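The unscrambler referred to above is a bit-reversal permutation of the output (or input) indices. A minimal in-place sketch in Python (the function name is illustrative); the inner loop performs an increment on a bit-reversed counter by propagating the carry from the most significant bit downward:

```python
def bit_reverse_permute(x):
    """In-place bit-reversed reordering (the 'unscrambler') of a
    length-2^m list, as needed after a radix-2 DIT FFT."""
    n = len(x)
    j = 0                       # j tracks the bit-reversal of i
    for i in range(1, n):
        bit = n >> 1
        while j & bit:          # propagate the carry of the reversed increment
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:               # swap each pair exactly once
            x[i], x[j] = x[j], x[i]
    return x
```

For example, a length-8 sequence indexed 0–7 is reordered to indices 0, 4, 2, 6, 1, 5, 3, 7. The loop body is cheap, but it touches every element with a data-dependent access pattern, which is why this stage can account for a noticeable fraction of the run time.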

Although there can be significant differences in the efficiencies of the various Cooley-Tukey and split-radix FFTs, the number of multiplications and additions for all of them is on the order of $$N\log N$$. That is fundamental to this class of algorithms.
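For the one-butterfly radix-2 program this order can be written in closed form: each of the $(N/2)\log_2 N$ butterflies uses one four-multiply-two-add complex multiply plus two complex additions, costing 4 real multiplications and 6 real additions, so $M_1 = 2N\log_2 N$ and $A_1 = 3N\log_2 N$. A short Python check of these formulas against the M1 and A1 columns of Table 8.5.1:

```python
from math import log2

def radix2_counts(n):
    """Real-arithmetic counts for a one-butterfly radix-2 FFT with the
    four-multiply-two-add complex multiply: (N/2) log2 N butterflies,
    each costing 4 real multiplies and 2 + 4 real adds."""
    butterflies = (n // 2) * int(log2(n))
    return 4 * butterflies, 6 * butterflies  # (multiplications, additions)

# Spot-check against the M1 and A1 columns of Table 8.5.1.
for n, m1, a1 in [(8, 48, 72), (256, 4096, 6144), (4096, 98304, 147456)]:
    assert radix2_counts(n) == (m1, a1)
```

The multiple-butterfly and split-radix counts in the table fall below these values, but only by terms linear in $N$; the $N\log N$ growth remains.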

### Contributor

• C. Sidney Burrus (Rice University)