13.07.2015 Views

FPGA Implementation of an Extended Binary GCD Algorithm for ...

FPGA Implementation of an Extended Binary GCD Algorithm for ...

FPGA Implementation of an Extended Binary GCD Algorithm for ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>FPGA</strong> <strong>Implementation</strong> <strong>of</strong> <strong>an</strong> <strong>Extended</strong> <strong>Binary</strong><strong>GCD</strong> <strong>Algorithm</strong> <strong>for</strong> Systolic Reduction <strong>of</strong>Rational NumbersBogd<strong>an</strong> Mătăsaru <strong>an</strong>d Tudor Jebele<strong>an</strong>RISC-Linz, A–4040 Linz, Austriaemail: bmatasar@risc.uni-linz.ac.atAbstract. We present the <strong>FPGA</strong> implementation <strong>of</strong> <strong>an</strong> extension <strong>of</strong> thebinary plus–minus systolic algorithm which computes the <strong>GCD</strong> (greatestcommon divisor) <strong>an</strong>d also the normal <strong>for</strong>m <strong>of</strong> a rational number, withoutusing division. A sample array <strong>for</strong> 8 bit oper<strong>an</strong>ds consumes 83.4% <strong>of</strong> <strong>an</strong>Atmel 40K10 chip <strong>an</strong>d operates at 25 MHz.1 IntroductionArbitrary precision (or exact) arithmetic is necessary in various computer algebraapplications (e. g. solving <strong>of</strong> systems <strong>of</strong> polynomial equations) <strong>an</strong>d it mayconsume most <strong>of</strong> the computing time [2].Reduction <strong>of</strong> rational numbers occurs very <strong>of</strong>ten in exact arithmetic: sooneror later, the reduction <strong>of</strong> the result is required in order to avoid <strong>an</strong> unacceptableincrease in size <strong>of</strong> its numerator <strong>an</strong>d denominator. The usual approach is to usethe <strong>GCD</strong> (greatest common divisor) once (or twice on shorter oper<strong>an</strong>ds [3]) <strong>an</strong>dsome divisions (or better exact divisions [4,8]). Alternatively, one c<strong>an</strong> use theextended <strong>GCD</strong> algorithm [7], but this will probably be less efficient <strong>for</strong> sequentialimplementations.In this paper we present the systolic implementation <strong>of</strong> <strong>an</strong> extension <strong>of</strong> thebinary <strong>GCD</strong> algorithm [10], more precisely <strong>of</strong> the plus-minus algorithm [1] asimproved in [5]. Extensions <strong>of</strong> the original binary algorithm have been consideredby Gosper [7], <strong>an</strong>d also in [11]. Our algorithm has a different flavor [9] becauseit is designed <strong>for</strong> systolic implementation.Since it avoids the division, <strong>an</strong>d it is also based on simple operations like shifts<strong>an</strong>d additions/subtractions, our device may be interesting even <strong>for</strong> classical implementationson some sequential architectures. In this paper we demonstratethe usefulness <strong>of</strong> this approach on a systolic architecture, by designing a systolicversion <strong>of</strong> the algorithm <strong>an</strong>d by implementing it on <strong>an</strong> Atmel <strong>FPGA</strong> (fieldprogrammable gate array) circuit. The systolic algorithm is <strong>an</strong> extension <strong>of</strong> thesystolic <strong>GCD</strong> presented in [5] which avoids global broadcasting by pipeliningthe “comm<strong>an</strong>d” through the array. This in turn has the disadv<strong>an</strong>tage that waitstates have to be introduced, because <strong>of</strong> the two-way flow <strong>of</strong> the in<strong>for</strong>mation. Ouralgorithm takes adv<strong>an</strong>tage <strong>of</strong> these wait states: the computations required by the


extended algorithm are done instead <strong>of</strong> wait. Hence, the speed is the same as theone <strong>of</strong> the previous algorithm, but additionally we obtain the reduced fraction.The circuit is presented as a systolic array, enjoying the properties <strong>of</strong> uni<strong>for</strong>mdesign (m<strong>an</strong>y identical cells) <strong>an</strong>d local communications (only between adjacentcells). The input <strong>of</strong> the oper<strong>an</strong>ds is done in parallel m<strong>an</strong>ner, while the outputc<strong>an</strong> be org<strong>an</strong>ized serially or in parallel, depending on the application.The sample implementation <strong>of</strong> <strong>an</strong> array <strong>for</strong> 8 bit oper<strong>an</strong>ds on the Atmel<strong>FPGA</strong> part 40K10 consists <strong>of</strong> 968 elementary macros <strong>an</strong>d consumes after layout83.4% <strong>of</strong> the area. The longest delay is 40 ns, thus operation at 25 MHz is possible,which rivals the speed <strong>of</strong> the best s<strong>of</strong>tware implementations on sequentialmachines <strong>for</strong> words <strong>of</strong> 32 bits, but it will probably gain considerably in speed ina special device <strong>for</strong> more words.2 The algorithmThe input <strong>of</strong> the binary (plus-minus) <strong>GCD</strong> algorithm consists <strong>of</strong> 2 n-bit oper<strong>an</strong>dsa <strong>an</strong>d b. The operations executed on the oper<strong>an</strong>ds are additions, subtractions<strong>an</strong>d shifts <strong>an</strong>d they are decided by inspecting the least signific<strong>an</strong>t two bits <strong>of</strong>each oper<strong>an</strong>d.Let us denote by a k , respectively b k , the values <strong>of</strong> the oper<strong>an</strong>ds at step k.The extended <strong>GCD</strong> algorithm computes also the sequences <strong>of</strong> c<strong>of</strong>actors <strong>of</strong> u k ,v k , t k , <strong>an</strong>d w k such that at each step k:u k · a + v k · b = a k · 2 k , (1)t k · a + w k · b = b k · 2 k . (2)The algorithm ends when b k = 0. Then, a k equals the <strong>GCD</strong> <strong>an</strong>d t k · a + w k · b = 0,hence a/b = −w k /t k .The sequence <strong>of</strong> c<strong>of</strong>actors is defined recursively starting from the initial values:u 0 = 1,v 0 = 0,t 0 = 0,w 0 = 1. (We denote by x[1],x[0] the least signific<strong>an</strong>tbits <strong>of</strong> <strong>an</strong> oper<strong>an</strong>d x.)Shift both: (a k [0] = 0, b k [0] = 0): the c<strong>of</strong>actors remain unch<strong>an</strong>ged.Interch<strong>an</strong>ge <strong>an</strong>d shift b: (a k [0] = 0, b k [0] = 1):Shift b: (a k [0] = 1, b k [0] = 0):u k+1 := 2 · t k , v k+1 := 2 · w k ,t k+1 := u k , w k+1: = v k .u k+1 := 2 · u k , v k+1 := 2 · v k ,t k+1 := t k , w k+1 := w k .Plus-minus: (a k [0] = 1, b k [0] = 1), plus if a k [1] ≠ b k [1], otherwise minus:u k+1 := 2 · t k , v k+1 := 2 · w kt k+1 := u ± t k , w k+1 := v k ± w k


One c<strong>an</strong> easily verify that the relations (1) <strong>an</strong>d (2) are preserved at eachstep, <strong>an</strong>d one c<strong>an</strong> prove that t k <strong>an</strong>d w k remain relatively prime <strong>an</strong>d at least one<strong>of</strong> them is not null. Thus, when b k becomes null, −w k /t k is the reduced <strong>for</strong>m <strong>of</strong>the initial fraction a/b. (For the pro<strong>of</strong>s <strong>an</strong>d a more detailed description <strong>of</strong> thealgorithm see [9].)3 The systolic arrayThe systolic array is org<strong>an</strong>ized as in [5]: the oper<strong>an</strong>ds are fed in parallel, onedigit <strong>of</strong> each in each processor, the rightmost processor (P 0 ) corresponds to theleast-signific<strong>an</strong>t bit – see Fig. 1. An array <strong>of</strong> N +1 processors will accommodateoper<strong>an</strong>ds up to N bits long. (The leftmost processor <strong>an</strong>d the ones above thesignific<strong>an</strong>t bits <strong>of</strong> the oper<strong>an</strong>ds will contain the sign bit). All the intermediatevalues are kept in complement representation: there<strong>for</strong>e additions/subtractionsc<strong>an</strong> be per<strong>for</strong>med without knowing the actual sign <strong>of</strong> the oper<strong>an</strong>ds.Each processor communicates only with its neighbors: the comm<strong>an</strong>ds (states)<strong>an</strong>d the carries propagate right-to-left, while the intermediate values <strong>of</strong> theoper<strong>an</strong>ds propagate left-to-right. All the processors except P 0 are identical. P 0computes at each step a comm<strong>an</strong>d code (depending on a[1],a[0],b[1],b[0]) whichis then propagated to the other processors <strong>an</strong>d controls their operations.✛comm<strong>an</strong>d✛comm<strong>an</strong>dPna,b,tags✲P1a,b,tags✲P0✛c<strong>of</strong>actors✛c<strong>of</strong>actorsFig.1. The systolic array <strong>for</strong> the extended binary algorithm.Each processor has a memory <strong>of</strong> 16 one-bit registers. 8 registers are necessary<strong>for</strong> the plus-minus <strong>GCD</strong> array <strong>an</strong>d they have the same me<strong>an</strong>ing as in [5]: s 1 ,s 2 ,s 3contain the comm<strong>an</strong>d code (or state), a,b are bits <strong>of</strong> the oper<strong>an</strong>ds, ta,tb (tags)indicate their signific<strong>an</strong>t bits, <strong>an</strong>d sa stores the sign <strong>of</strong> a.Additionally, 8 registers are used by the extended algorithm: u, v, t,w keepthe values <strong>of</strong> the c<strong>of</strong>actors, ct <strong>an</strong>d cw are carries needed <strong>for</strong> the plus-minusoperations, <strong>an</strong>d u ′ ,v ′ keep intermediate values needed <strong>for</strong> left-shifts.In the next tables we describe the operation <strong>of</strong> the processors. For a valuex on a processor, x [+] <strong>an</strong>d x [−] will denote the value x on the left, respectivelyright neighboring processor. (The rightmost processor receives copies its ownvalues as [+] values.) By < c, r >:=expr, we specify that, after the evaluation <strong>of</strong>the expression expr, r will contain the result restricted to the register’s size <strong>an</strong>dc will contain the carry.


The operation <strong>of</strong> the processor P 0 :<strong>Algorithm</strong> P 0beginif ta [+] = 1 then sa := a [+] ; [find sign <strong>of</strong> A]else sa := sa [+] ;if s ≠w then[update the c<strong>of</strong>actors on the wait step]switch scase C:(u, t,u ′ ) := (0,u, t);(v,w,v ′ ) := (0,v,w);case S:(u, u ′ ) := (0,u);(v,v ′ ) := (0,v);case P,p:(u, u ′ ,< ct,t >) := (0,t, u + t);(v,v ′ ,< cw, w >) := (0,w, v + w);case M,m:(u, u ′ ,< ct,t >) := (0,t, u − t);(v,v ′ ,< cw, w >) := (0,w, v − w);s :=w;else[find the appropriate active state]if a = 0 ∧ b = 0 then [shift both A <strong>an</strong>d B]s := B;(a,b, ta,tb) := (a [+] ,b [+] ,ta [+] ,tb [+] );if a = 0 ∧ b = 1 then [interch<strong>an</strong>ge A <strong>an</strong>d B <strong>an</strong>d shift B]s :=C;(a,b, ta,tb) := (b, a [+] ,tb, ta [+] );if a = 1 ∧ b = 0 then [shift B]s :=S;(b, tb) := (b [+] ,tb [+] );if a = 1 ∧ b = 1 then[plus or minus]if a [+] = b [+] then s :=m;[minus]else s :=P;[plus]b := 0;end


The rest <strong>of</strong> the processors (P 1 , . .., P N ) operate like this:<strong>Algorithm</strong> P ibeginif ta [+] = 1 then sa := a [+] ; [find sign <strong>of</strong> A]else sa := sa [+] ;switch s [−][right state considered]endcase w:switch scase C:(u, t,u ′ ) := (u ′ [−],u, t);(v,w,v ′ ) := (v[−] ′ ,v,w);case S:(u, u ′ ) := (u ′ [−] ,u);[update the c<strong>of</strong>actors on the wait step](v,v ′ ) := (v ′ [−] ,v);case P,p:(u, u ′ ,< ct,t >) := (u ′ [−] ,t, u + t + ct [−]);(v,v ′ ,< cw,w >) := (v ′ [−] ,w, v + w + cw [−]);case M,m:(u, u ′ ,< ct,t >) := (u ′ [−] ,t, u − t − ct [−]);(v,v ′ ,< cw, w >) := (v ′ [−] ,w, v − w − cw [−]);case B: [shift both A <strong>an</strong>d B]s := s [−] ;[state is propagated leftwards](a,b, ta,tb) := (a [+] ,b [+] ,ta [+] ,tb [+] );case C: [interch<strong>an</strong>ge A <strong>an</strong>d B <strong>an</strong>d shift B]s := s [−] ;[state is propagated leftwards](a,b, ta,tb) := (b, a [+] ,tb, ta [+] );case S: [shift B]s := s [−] ;[state is propagated leftwards](b, tb) := (b [+] ,tb [+] );case P,p:[plus]s := s [−] ;[state is propagated leftwards](a,ta) := (b, tb); [set a to old b]< s 2 ,b >:= a [+] + b [+] + s 2 ; [set b to shifted sum]tb := (ta ∧ tb)[tag the minimal correct position]case M,m:[minus](a,ta) := (b, tb); [set a to old b]< s 2 ,b >:= a [+] − b [+] − s 2 ; [set b to shifted difference]tb := (ta ∧ tb)[tag the minimal correct position]The wait state was introduced in the systolic <strong>GCD</strong> algorithm in order toeliminate the global broadcasting <strong>of</strong> the operation code. In this state the pro-


cessor was not used. In the extended algorithm we replace this wait time by thecomputation <strong>of</strong> the c<strong>of</strong>actors, there<strong>for</strong>e increasing the efficiency <strong>of</strong> the circuit.Note that the data used to build the c<strong>of</strong>actors flows from right to left, eitheras a carry in the plus/minus steps, or by shifting in the shift steps. That is, thepartial value <strong>of</strong> a c<strong>of</strong>actor kept in the registers <strong>of</strong> the processor P k depends onlyon the values kept on the processors P 0 , ..., P k . Moreover, the c<strong>of</strong>actors areless or equal to the correspondent input oper<strong>an</strong>ds, so their length is less or equalto n. There<strong>for</strong>e, the number <strong>of</strong> processors needed <strong>for</strong> the <strong>GCD</strong> computation issufficient also <strong>for</strong> obtaining the reduced <strong>for</strong>m <strong>of</strong> the rational number.The termination <strong>of</strong> the systolic <strong>GCD</strong> algorithm is detected by P 0 when thevalue <strong>of</strong> b is 0 <strong>an</strong>d the tag <strong>of</strong> b is 1. Sometimes, the extended algorithm requiresseveral additional steps to finish the propagation <strong>of</strong> the “comm<strong>an</strong>d” to the mostsignific<strong>an</strong>t bits <strong>of</strong> the c<strong>of</strong>actors. However, this is not a problem if the result isretrieved in a LSF way because after b k becomes 0 the circuit pipelines only Boperations which do not ch<strong>an</strong>ge the values <strong>of</strong> the c<strong>of</strong>actors t k <strong>an</strong>d w k .4 <strong>FPGA</strong> ExperimentsWe implemented the array on <strong>an</strong> ATMEL <strong>FPGA</strong> using the Atmel IDS environmentfrom Cadence Systems Inc. <strong>an</strong>d Workview Office from Viewlogic SystemsInc. This represents <strong>an</strong> extension <strong>of</strong> the <strong>GCD</strong> implementation reported in in [6].For a circuit containing 9 processors (accomodates oper<strong>an</strong>ds up to 8 bits), thenetlist phase reports a number <strong>of</strong> 968 elementary blocks, which is smaller th<strong>an</strong>the area used by the <strong>GCD</strong> together with division presented in [6]. Some <strong>of</strong> theintermediate values between the <strong>GCD</strong> algorithms <strong>an</strong>d the two exact divisions arenot needed <strong>an</strong>ymore; this simplifies the function that updates the oper<strong>an</strong>ds. Thefunction that computes the c<strong>of</strong>actors requires 31 elementary gates per processor.Thus, the circuit <strong>for</strong> computing the extended gcd is also simpler th<strong>an</strong> a circuitwhich computes the gcd followed by two exact divisions.The longest delay path passes through 13 ports <strong>an</strong>d it is determined by theoriginal <strong>GCD</strong> array - the computation <strong>of</strong> the c<strong>of</strong>actors does not increase thecomputing time.The automatic layout was successful on a Atmel AT40K10 chip, <strong>an</strong>d consumes384 logical cells (83.4% <strong>of</strong> the total). The longest delay through our thecircuit is 40 ns – corresponding to a speed <strong>of</strong> 25 MHz. This gives <strong>an</strong> estimatedtime <strong>of</strong> 0.005 ms <strong>for</strong> computing the reduced fraction <strong>of</strong> 32 bit oper<strong>an</strong>ds, whichrivals the best current sequential processor (on <strong>an</strong> Ultra SPARC 60 machine withthe GMP library we have got <strong>an</strong> average time <strong>of</strong> 0.006 ms). However, a specialdevice constructed on the bases <strong>of</strong> this implementation <strong>for</strong> longer oper<strong>an</strong>ds (e. g.4 to 10 words) will gain in speed by a considerable factor, because the computingtime <strong>of</strong> the systolic array increases linearly with the length <strong>of</strong> the oper<strong>an</strong>ds,while the sequential algorithms have quadratic complexity.


5 Conclusions <strong>an</strong>d further workWe demonstrated the usefulness <strong>of</strong> the extended plus-minus <strong>GCD</strong> algorithm<strong>for</strong> the reduction <strong>of</strong> rational fractions by <strong>an</strong> <strong>FPGA</strong> implementation. Because itavoids the divisions altogether, this approach leads to a smaller implementation(as number <strong>of</strong> gates/cells), while keeping the speed <strong>of</strong> the original algorithm.Based on this sample implementation one c<strong>an</strong> easily construct a device <strong>for</strong> rationalarithmetic <strong>for</strong> long oper<strong>an</strong>ds, because the circuit is uni<strong>for</strong>m <strong>an</strong>d c<strong>an</strong> beextended by simple tiling.Further possible improvements <strong>of</strong> this implementation include: addition <strong>of</strong> abus <strong>for</strong> pipelining the input oper<strong>an</strong>ds (this would allow pipelined preprocessing<strong>for</strong> steps shift both <strong>an</strong>d interch<strong>an</strong>ge <strong>an</strong>d will lead to signific<strong>an</strong>t simplification<strong>of</strong> the processors in the array); as well as addition <strong>of</strong> a bus <strong>for</strong> pipelining <strong>of</strong> theoutput oper<strong>an</strong>ds.References1. R. P. Brent <strong>an</strong>d H. T. Kung, A systolic algorithm <strong>for</strong> integer <strong>GCD</strong> computation,Procs. <strong>of</strong> the 7th Symp. on Computer Arithmetic (K. Hw<strong>an</strong>g, ed.), IEEE ComputerSociety, June 1985, pp. 118–125.2. B. Buchberger, Gröbner Bases: An <strong>Algorithm</strong>ic Method in Polynomial Ideal Theory,Recent trends in Multidimensional Systems (Dordrecht-Boston-L<strong>an</strong>caster)(Bose <strong>an</strong>d Reidel, eds.), D. Reidel Publishing Comp<strong>an</strong>y, 1985, pp. 184–232.3. P. Henrici, A subroutine <strong>for</strong> computations with rational numbers, Journal <strong>of</strong> theACM 3 (1956), 6–9.4. T. Jebele<strong>an</strong>, An algorithm <strong>for</strong> exact division, Journal <strong>of</strong> Symbolic Computation 15(1993), no. 2, 169–180.5. , Systolic normalization <strong>of</strong> rational numbers, ASAP’93: International Conferenceon Application–Specific Array Processors (Venice, Italy) (L. Dadda <strong>an</strong>dB. Wah, eds.), IEEE Computer Society Press, October 1993, pp. 502–513.6. , Rational arithmetic using <strong>FPGA</strong>, More <strong>FPGA</strong>s (W. Luk <strong>an</strong>d W. Moore,eds.), Abingdon EE&CS Books, Ox<strong>for</strong>d, 1994, Proceedings <strong>of</strong> FPLA’93: InternationalWorkshop on Field Programmable Logic <strong>an</strong>d Applications, Ox<strong>for</strong>d, UK,September 1993, pp. 262–273.7. D. E. Knuth, The art <strong>of</strong> computer programming, vol. 2, Addison-Wesley, 1981.8. W. Kr<strong>an</strong>dick <strong>an</strong>d T. Jebele<strong>an</strong>, Bidirectional exact integer division, Journal <strong>of</strong> SymbolicComputation 21 (1996), 441–455.9. B. Matasar, An extension <strong>of</strong> the binary <strong>GCD</strong> algorithm <strong>for</strong>systolic parallelization, Tech. Report 99–47, RISC–Linz, 1999,http://www.risc.uni-linz.ac.at/library.10. J. Stein, Computational problems associated with Racah algebra, J. Comp. Phys. 1(1967), 397–405.11. K. Weber, The accelerated integer <strong>GCD</strong> algorithm, ACM Tr<strong>an</strong>s. on Math. S<strong>of</strong>tware21 (1995), no. 1, 111–122.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!