Prime Numbers
9.5 Large-integer multiplication

1. [Initialize]
   B = ⌈2b/lg b⌉;
   for(j ∈ [0, D−1]) {
      μj = ⌊ωj + 1/2⌋;        // Nearest integer to ωj.
      θj = ωj − μj;           // And so |θj| ≤ 1/2.
   }
2. [Perform a total of 8B standard FFTs]
   for(K ∈ [0, 7]) {
      for(β ∈ [0, B−1]) {
         s = (0)_0^(D−1);     // Zero signal of length D.
         for(j ∈ [0, D−1]) {
            μ = μj mod D;
            sμ = sμ + xj e^(−2πiKωj/8) θj^β;
         }
         FK,β = FFT(s);       // So (FK,β,m : m ∈ [0, D−1]) is a transform.
      }
   }
3. [Create the transform approximation]
   X′ = ∪_{K=0}^{7} ( Σ_{β=0}^{B−1} (1/β!) (−2πim/D)^β FK,β,m )_{m=0}^{D/8−1};
   return X′;                 // Approximation to the nonuniform DFT (9.23).

Algorithm 9.5.8 is written above in rather compact form, but a typical implementation on a symbolic processor looks much like this pseudocode. Note that the signal union at the end, being the left-right concatenation of length-(D/8) signals, can be effected in some programming languages about as compactly as we have. Incidentally, though the symbolics mask somewhat the reason why the algorithm works, it is not hard to see that the Taylor expansion of e^(−2πi(μj+θj)k/D) in powers of θj, together with adroit manipulation of indices, brings success. It is a curious and happy fact that decimating the transform-signal length by the fixed factor of 8 suffices for all possible input signals x to the algorithm. Such is the utility of the inequality (9.25). In summary, the number of standard FFTs we require, to yield b-bit accuracy, is about 16b/lg b. This is a "worst-case" FFT count, in that practical applications often enjoy far better than b-bit accuracy when a particular b parameter is passed to Algorithm 9.5.8.

Since the pioneering work of Dutt–Rokhlin, various works such as [Ware 1998] and [Nguyen and Liu 1999] have appeared, revealing somewhat better accuracy, slightly improved execution speed, or other enhancements. There has even been an approach that minimizes the worst-case error for input signals of unit norm [Fessler and Sutton 2003]. But all the way from the Dutt–Rokhlin origins to the modern fringe, the basic idea remains the same: transform the calculation of X′ into obtaining a set of uniform, standard FFTs of only somewhat wasteful overall length.
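To make the decimation-by-8 scheme concrete, here is a minimal NumPy sketch in the spirit of Algorithm 9.5.8 (the function names nudft_direct and nufft, and the default truncation order B, are our own choices, not the book's): each xj is scattered into the grid bin μj with weight e^(−2πiKωj/8) θj^β, 8B standard FFTs are taken, and the Taylor weights (−2πim/D)^β/β! recombine the pieces.

```python
import numpy as np

def nudft_direct(x, omega):
    """Direct O(D^2) nonuniform DFT: X_k = sum_j x_j exp(-2*pi*i*k*omega_j/D)."""
    D = len(x)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * omega / D))
                     for k in range(D)])

def nufft(x, omega, B=12):
    """Approximate the nonuniform DFT with 8*B standard FFTs of length D.

    Requires D divisible by 8; B is the Taylor truncation order.
    """
    D = len(x)
    assert D % 8 == 0
    mu = np.floor(omega + 0.5).astype(int)      # nearest integers to omega_j
    theta = omega - mu                          # so |theta_j| <= 1/2
    m = np.arange(D // 8)
    X = np.empty(D, dtype=complex)
    for K in range(8):
        phase = x * np.exp(-2j * np.pi * K * omega / 8)
        acc = np.zeros(D // 8, dtype=complex)
        fact = 1.0                              # running value of beta!
        for beta in range(B):
            if beta > 0:
                fact *= beta
            s = np.zeros(D, dtype=complex)
            # Scatter-add into grid bins (repeated indices accumulate).
            np.add.at(s, mu % D, phase * theta**beta)
            F = np.fft.fft(s)
            acc += (-2j * np.pi * m / D)**beta / fact * F[:D // 8]
        X[K * (D // 8):(K + 1) * (D // 8)] = acc
    return X
```

When every ωj is already an integer, all θj vanish and only the β = 0 term survives, so the result collapses to an ordinary FFT; for random real frequencies the truncation error is bounded by roughly (π/8)^B/B!, which is why the fixed factor of 8 suffices.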
Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC

We close this FFT section by mentioning some new developments in regard to "gigaelement" FFTs that have now become possible at reasonable speed. For example, [Crandall et al. 2004] discusses theory and implementation for each of these gigaelement cases: a length-2^30, one-dimensional FFT (effected via Algorithm 9.5.7); a 2^15 × 2^15, two-dimensional FFT; and a 2^10 × 2^10 × 2^10, three-dimensional FFT. With such massive signal sizes come the difficult yet fascinating issues of fast matrix transposition, cache-friendly memory action, and vectorization of floating-point arithmetic. The bottom line as regards performance is that the one-dimensional, length-2^30 case takes less than one minute on a modern hardware cluster, if double-precision floating-point is used. (The two- and three-dimensional cases are about as fast; in fact the two-dimensional case is usually fastest, for technical reasons.) For computational number theory, these new results mean this: on a hardware cluster that fits into a closet, say, numbers of a billion decimal digits can be multiplied together in roughly a minute. Such observations depend on proper resolution of the following problem: the errors in such "monster" FFTs can be nontrivial. There are just so many terms being added and multiplied that one deviates from the truth, so to speak, in a kind of random walk. Interestingly, a length-D FFT can be modeled as a random walk in D dimensions, having O(ln D) steps. The paper [Crandall et al. 2004] thus reports quantitative bounds on FFT errors, such bounds having been pioneered by E. Mayer and C. Percival.

9.5.3 Convolution theory

Let x denote a signal (x0, x1, ..., xD−1), where, for example, the elements of x could be the digits of Definitions (9.1.1) or (9.1.2) (although we do not a priori insist that the elements be digits; the theory to follow is fairly general). We start by defining fundamental convolution operations on signals.
In what follows, we assume that signals x, y have been assigned the same length D of elements. In all the summations of this next definition, the indices i, j each run over the set {0, ..., D−1}:

Definition 9.5.9. The cyclic convolution of two length-D signals x, y is a signal denoted z = x × y having D elements given by

   zn = Σ_{i+j ≡ n (mod D)} xi yj,

while the negacyclic convolution of x, y is a signal v = x ×− y having D elements given by

   vn = Σ_{i+j=n} xi yj − Σ_{i+j=D+n} xi yj.
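The two sums in Definition 9.5.9 translate directly into code. A minimal Python sketch (the function names cyclic and negacyclic are ours), using the direct O(D^2) definitions rather than any FFT acceleration:

```python
def cyclic(x, y):
    """z_n = sum over i+j ≡ n (mod D) of x_i*y_j."""
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            z[(i + j) % D] += x[i] * y[j]
    return z

def negacyclic(x, y):
    """v_n = sum over i+j = n of x_i*y_j, minus sum over i+j = D+n of x_i*y_j."""
    D = len(x)
    v = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                v[i + j] += x[i] * y[j]     # low half enters positively
            else:
                v[i + j - D] -= x[i] * y[j] # wrapped terms enter negatively
    return v
```

For instance, with zero-padding so that no wraparound occurs, cyclic convolution of the base-10 digit signals of 21 and 43 gives cyclic([1,2,0,0], [3,4,0,0]) = [3, 10, 8, 0], and releasing the carries yields 3 + 10·10 + 8·100 = 903 = 21·43, which is exactly the role convolution plays in large-integer multiplication.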