
9.5 Large-integer multiplication

formulae to recover the correct DFT of the original signal x. An example of this pure-real signal approach for number-theoretical transforms as applied to cyclic convolution is embodied in Algorithm 9.5.22, with split-radix symbolic pseudocode given in [Crandall 1997b] (see Exercise 9.51 for discussion of the negacyclic scenario).
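
For readers who want to experiment, here is a minimal sketch (not the book's Algorithm 9.5.22) of why the pure-real-signal approach saves work: the DFT of a real signal is Hermitian-symmetric, so only about half of the spectrum need be computed and stored, and a cyclic convolution of two real signals can be carried out entirely in that half-spectrum. The example leans on NumPy's rfft/irfft, which exploit exactly this symmetry.

```python
import numpy as np

def real_cyclic_convolution(x, y):
    """Cyclic convolution of two equal-length real signals via the
    half-spectrum (Hermitian-symmetric) real-to-complex FFT."""
    assert len(x) == len(y)
    X = np.fft.rfft(x)              # only D//2 + 1 complex values are formed
    Y = np.fft.rfft(y)
    return np.fft.irfft(X * Y, n=len(x))

# Check against the O(D^2) definition of cyclic convolution.
rng = np.random.default_rng(1)
D = 16
x, y = rng.standard_normal(D), rng.standard_normal(D)
direct = np.array([sum(x[j] * y[(k - j) % D] for j in range(D)) for k in range(D)])
assert np.allclose(real_cyclic_convolution(x, y), direct)
```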

Incidentally, there are yet lower-complexity FFTs, called split-radix FFTs, which employ an identity more complicated than the Danielson–Lanczos formula. And there is even a split-radix, pure-real-signal FFT due to Sorenson that is quite efficient and in wide use [Crandall 1994b]. The vast “FFT forest” is replete with specially optimized FFTs, and whole texts have been written in regard to the structure of FFT algorithms; see, for example, [Van Loan 1992].
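
For concreteness, the Danielson–Lanczos identity referred to above splits a length-D DFT (D a power of two) into DFTs of the even- and odd-indexed subsequences, X_k = E_k + g^{-k} O_k; split-radix methods refine this splitting further. The following is a minimal sketch of the plain radix-2 recursion only, not of the split-radix scheme itself.

```python
import numpy as np

def dft_radix2(x):
    """Recursive radix-2 FFT via the Danielson-Lanczos splitting:
    X_k = E_k + g^{-k} O_k, where E and O are the DFTs of the even- and
    odd-indexed subsequences and g = exp(2*pi*i/D). Assumes D is a power of two."""
    D = len(x)
    if D == 1:
        return np.asarray(x, dtype=complex)
    E = dft_radix2(x[0::2])                 # length-D/2 DFT of even-indexed terms
    O = dft_radix2(x[1::2])                 # length-D/2 DFT of odd-indexed terms
    k = np.arange(D // 2)
    twiddle = np.exp(-2j * np.pi * k / D)   # g^{-k}
    return np.concatenate([E + twiddle * O, E - twiddle * O])

x = np.random.default_rng(0).standard_normal(32)
assert np.allclose(dft_radix2(x), np.fft.fft(x))
```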

Even at the close of the 20th century there continue to be, every year, a great many published papers on new FFT optimizations. Because our present theme is the implementation of FFTs for large-integer arithmetic, we close this section with one more algorithm: a “parallel,” or “piecemeal,” FFT algorithm that is quite useful in at least two practical settings. First, when signal data are particularly numerous, the FFT must be performed in limited memory. In practical terms, a signal might reside on disk memory, and exceed a machine’s physical random-access memory. The idea is to “spool” pieces of the signal off the larger memory, process them, and combine in just the right way to deposit a final FFT in storage. Because computations occur in large part on separate pieces of the transform, the algorithm can also be used in a parallel setting, with each separate processor handling a respective piece of the FFT. The algorithm following has been studied by various investigators [Agarwal and Cooley 1986], [Swarztrauber 1987], [Ashworth and Lyne 1988], [Bailey 1990], especially with respect to practical memory usage. It is curious that the essential ideas seem to have originated with [Gentleman and Sande 1966]. Perhaps, in view of the extreme density and proliferation of FFT research, one might forgive investigators for overlooking these origins for two decades.
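
As a small illustration of the spooling idea (a sketch only, with a hypothetical file name; it is not the algorithm as presented in this section): with the length-D signal laid out on disk and D = WH, the first stage touches one H-element “column” x_J, x_{J+W}, ..., x_{J+(H-1)W} at a time, so only H complex values need be resident in memory. The remaining stages (twiddle multiplication and the length-W row transforms) can be spooled in the same fashion.

```python
import numpy as np

# A minimal out-of-core sketch; "signal.dat" is a hypothetical file created
# here only so that the example is self-contained.
W, H = 8, 4
D = W * H
rng = np.random.default_rng(2)
signal = np.memmap("signal.dat", dtype=np.complex128, mode="w+", shape=(D,))
signal[:] = rng.standard_normal(D) + 1j * rng.standard_normal(D)
signal.flush()

# Stage 1 of the row-column FFT: spool one H-element column at a time,
# transform it in RAM, and write it back to disk.
x = np.memmap("signal.dat", dtype=np.complex128, mode="r+", shape=(D,))
for J in range(W):                  # column J is x_J, x_{J+W}, ..., x_{J+(H-1)W}
    col = np.array(x[J::W])         # only H complex values held in memory
    x[J::W] = np.fft.fft(col)       # length-H DFT, written back in place
x.flush()
```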

The parallel-FFT algorithm stems from the observation that a length-(D = WH) DFT can be performed by tracking over rows and columns of an H × W (height times width) matrix. Everything follows from the following algebraic reduction of the DFT X of x:

\[
\begin{aligned}
X = \mathrm{DFT}(x) &= \left( \sum_{j=0}^{D-1} x_j\, g^{-jk} \right)_{k=0}^{D-1} \\
&= \left( \sum_{J=0}^{W-1} \sum_{M=0}^{H-1} x_{J+MW}\, g^{-(J+MW)(K+NH)} \right)_{K+NH=0}^{D-1} \\
&= \left( \sum_{J=0}^{W-1} \left( \sum_{M=0}^{H-1} x_{J+MW}\, g_H^{-MK} \right) g^{-JK}\, g_W^{-JN} \right)_{K+NH=0}^{D-1},
\end{aligned}
\]

where the original indices are split as j = J + MW and k = K + NH (with J, N running over [0, W−1] and M, K over [0, H−1]), and where g_H = g^W and g_W = g^H are, respectively, H-th and W-th roots of unity.
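
A short numerical check of this reduction (a sketch assuming D = WH and using NumPy's FFT as the reference): place x_{J+MW} at row M, column J of the H × W matrix, take length-H DFTs down the columns (the inner sums over M), multiply by the twiddle factors g^{-JK}, take length-W DFTs along the rows (the outer sums over J), and read entry (K, N) off as X_{K+NH}.

```python
import numpy as np

def dft_by_rows_and_columns(x, W, H):
    """Length-(D = W*H) DFT computed via the H-by-W matrix decomposition:
    column DFTs, twiddle multiplication, then row DFTs."""
    D = W * H
    # Element (M, J) of the matrix is x_{J + M*W}.
    A = np.asarray(x, dtype=complex).reshape(H, W)
    # Inner sums: length-H DFT of each column, output row index K.
    A = np.fft.fft(A, axis=0)
    # Twiddle factors g^{-J*K} with g = exp(2*pi*i/D).
    K = np.arange(H).reshape(H, 1)
    J = np.arange(W).reshape(1, W)
    A = A * np.exp(-2j * np.pi * J * K / D)
    # Outer sums: length-W DFT of each row, output column index N.
    A = np.fft.fft(A, axis=1)
    # Entry (K, N) now holds X_{K + N*H}.
    X = np.empty(D, dtype=complex)
    for k in range(H):
        X[k::H] = A[k, :]
    return X

x = np.random.default_rng(3).standard_normal(24)
assert np.allclose(dft_by_rows_and_columns(x, W=6, H=4), np.fft.fft(x))
```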
