Prime Numbers


9.5 Large-integer multiplication

1. [Initialize]
   B = ⌈2b/lg b⌉;
   for(j ∈ [0, D−1]) {
      µ_j = ⌊ω_j + 1/2⌋;        // Nearest integer to ω_j.
      θ_j = ω_j − µ_j;          // And so |θ_j| ≤ 1/2.
   }
2. [Perform a total of 8B standard FFTs]
   for(K ∈ [0, 7]) {
      for(β ∈ [0, B−1]) {
         s = (0)_{0}^{D−1};     // Zero signal of length D.
         for(j ∈ [0, D−1]) {
            µ = µ_j mod D;
            s_µ = s_µ + x_j e^{−2πiKω_j/8} θ_j^β;
         }
         F_{K,β} = FFT(s);      // So (F_{K,β,m} : m ∈ [0, D−1]) is a transform.
      }
   }
3. [Create the transform approximation]
   X′ = ∪_{K=0}^{7} ( Σ_{β=0}^{B−1} (1/β!) F_{K,β,m} (−2πim/D)^β )_{m=0}^{D/8−1};
   return X′;                   // Approximation to the nonuniform DFT (9.23).

Algorithm 9.5.8 is written above in rather compact form, but a typical implementation on a symbolic processor looks much like this pseudocode. Note that the signal union at the end, being the left-right concatenation of length-(D/8) signals, can be effected in some programming languages about as compactly as we have. Incidentally, though the symbolics mask somewhat the reason why the algorithm works, it is not hard to see that the Taylor expansion of e^{−2πi(µ_j+θ_j)k/D} in powers of θ_j, together with adroit manipulation of indices, brings success. It is a curious and happy fact that decimating the transform-signal length by the fixed factor of 8 suffices for all possible input signals x to the algorithm. Such is the utility of the inequality (9.25). In summary, the number of standard FFTs we require, to yield b-bit accuracy, is about 16b/lg b. This is a "worst-case" FFT count, in that practical applications often enjoy far better than b-bit accuracy when a particular b parameter is passed to Algorithm 9.5.8.

Since the pioneering work of Dutt–Rokhlin, various works such as [Ware 1998] and [Nguyen and Liu 1999] have appeared, revealing somewhat better accuracy, slightly improved execution speed, or other enhancements. There has even been an approach that minimizes the worst-case error for input signals of unit norm [Fessler and Sutton 2003].
But all the way from the Dutt–Rokhlin origins to the modern fringe, the basic idea remains the same: transform the calculation of X′ to that of obtaining a set of uniform and standard FFTs of only somewhat wasteful overall length.
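The Taylor-expansion mechanism can be sketched numerically. The following Python sketch (the function names `nudft_exact` and `nudft_taylor` are ours, and for clarity it omits the 8-fold decimation of Step [2], so the Taylor order B must simply be taken large enough) shows how rounding each ω_j to its nearest integer µ_j and expanding e^{−2πiθ_j k/D} in powers of θ_j reduces a nonuniform DFT to B standard, uniform transforms:

```python
import cmath
import math

def nudft_exact(x, omega, D):
    # Direct nonuniform DFT: X_k = sum_j x_j * exp(-2*pi*i*k*omega_j/D).
    return [sum(xj * cmath.exp(-2j * cmath.pi * k * wj / D)
                for xj, wj in zip(x, omega))
            for k in range(D)]

def nudft_taylor(x, omega, D, B):
    # Round omega_j to the nearest integer mu_j, so theta_j = omega_j - mu_j
    # satisfies |theta_j| <= 1/2.  Taylor-expanding exp(-2*pi*i*theta_j*k/D)
    # in powers of theta_j turns the nonuniform sum into B uniform signals
    # s^(0), ..., s^(B-1), each requiring only a standard DFT/FFT.
    s = [[0j] * D for _ in range(B)]
    for xj, wj in zip(x, omega):
        mu = round(wj)            # nearest integer
        theta = wj - mu           # |theta| <= 1/2
        for beta in range(B):
            s[beta][mu % D] += xj * theta ** beta
    X = [0j] * D
    for k in range(D):
        for beta in range(B):
            # Standard uniform DFT of s^(beta) at index k (an FFT in practice).
            F = sum(s[beta][m] * cmath.exp(-2j * cmath.pi * k * m / D)
                    for m in range(D))
            X[k] += F * (-2j * cmath.pi * k / D) ** beta / math.factorial(beta)
    return X
```

Without the decimation, the Taylor argument −2πkθ_j/D can reach magnitude π at the top index k = D−1, so B must be fairly large before the series tail dies; restricting attention to k < D/8, as Algorithm 9.5.8 does, caps the magnitude at π/8, which is why B ≈ 2b/lg b terms suffice for b-bit accuracy.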

We close this FFT section by mentioning some new developments in regard to "gigaelement" FFTs that have now become possible at reasonable speed. For example, [Crandall et al. 2004] discusses theory and implementation for each of these gigaelement cases: the length-2^30, one-dimensional FFT (effected via Algorithm 9.5.7); the 2^15 × 2^15, two-dimensional FFT; and the 2^10 × 2^10 × 2^10, three-dimensional FFT. With such massive signal sizes come the difficult yet fascinating issues of fast matrix transposition, cache-friendly memory action, and vectorization of floating-point arithmetic. The bottom line as regards performance is that the one-dimensional, length-2^30 case takes less than one minute on a modern hardware cluster, if double-precision floating-point is used. (The two- and three-dimensional cases are about as fast; in fact the two-dimensional case is usually fastest, for technical reasons.) For computational number theory, these new results mean this: on a hardware cluster that fits into a closet, say, numbers of a billion decimal digits can be multiplied together in roughly a minute. Such observations depend on proper resolution of the following problem: the errors in such "monster" FFTs can be nontrivial. There are just so many terms being added/multiplied that one deviates from the truth, so to speak, in a kind of random walk. Interestingly, a length-D FFT can be modeled as a random walk in D dimensions, having O(ln D) steps. The paper [Crandall et al. 2004] thus reports quantitative bounds on FFT errors, such bounds having been pioneered by E. Mayer and C. Percival.

9.5.3 Convolution theory

Let x denote a signal (x_0, x_1, ..., x_{D−1}), where, for example, the elements of x could be the digits of Definitions (9.1.1) or (9.1.2) (although we do not a priori insist that the elements be digits; the theory to follow is fairly general). We start by defining fundamental convolution operations on signals.
In what follows, we assume that signals x, y have been assigned the same length D. In all the summations of this next definition, indices i, j each run over the set {0, ..., D−1}:

Definition 9.5.9. The cyclic convolution of two length-D signals x, y is a signal denoted z = x × y having D elements given by

   z_n = Σ_{i+j ≡ n (mod D)} x_i y_j,

while the negacyclic convolution of x, y is a signal v = x ×− y having D elements given by

   v_n = Σ_{i+j=n} x_i y_j − Σ_{i+j=D+n} x_i y_j.
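As a concrete illustration of Definition 9.5.9, here is a minimal Python sketch; the O(D^2) double loops simply transcribe the defining sums and stand in for the fast transform-based methods of this chapter:

```python
def cyclic_convolution(x, y):
    # z_n = sum of x_i * y_j over all pairs with i + j congruent to n (mod D).
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            z[(i + j) % D] += x[i] * y[j]
    return z

def negacyclic_convolution(x, y):
    # v_n = sum_{i+j=n} x_i y_j  -  sum_{i+j=D+n} x_i y_j;
    # equivalently, products that wrap around index D pick up a minus sign.
    D = len(x)
    v = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                v[i + j] += x[i] * y[j]
            else:
                v[i + j - D] -= x[i] * y[j]
    return v
```

For instance, with x = [1, 2] and y = [3, 4] the cyclic convolution is [11, 10] and the negacyclic convolution is [−5, 10]. When x, y hold the base-W digits of two integers, the cyclic convolution yields the (pre-carry) digits of their product reduced modulo W^D − 1, and the negacyclic convolution those of the product modulo W^D + 1, a connection exploited in the multiplication algorithms that follow.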

