A clocking technique for FPGA pipelined designs


694 O. Cadenas, G. Megson / Journal of Systems Architecture 50 (2004) 687–696

Sine and Cosine functions for input angles with 16 bits of precision.

Obtaining a PP-pipelined Cordic Core: The methodology presented in Section 3.3 was applied to translate the 15-stage global clocked pipelined Cordic Core. First, VHDL code was written for the PP-modules, and then captured as a parameterized structural object to act as a PP-controller. An equivalent PP-pipelined VHDL specification is straightforward after interconnecting the PP-controller to the original global clocked Cordic Core specification.

Power estimation process: The Xilinx XPower tool [19] was used to measure power for the global clocked and the PP-pipelined Cordic Core. XPower computes power based on information about the node switching rate activity of circuits. Switching rate activity was obtained from post place-and-route timing simulation data produced by the simulator ModelSim XE 5.5b. Previously, both circuits were placed and routed with Xilinx ISE 5.1. Synthesis for the designs was obtained using the product Synplify Pro 7.2 targeting Virtex II devices. All the design entry specification was written in VHDL code.

Area and time results: The global clocked pipelined Cordic Core runs at 205 MHz taking 475 slices in a Virtex2 XC2V250. An equivalent PP-pipelined Cordic Core runs at 155 MHz and takes 488 slices in the same device. The area overhead of the PP-technique (mainly the PP-controller) for this simple design is less than 5%. For reference, post place-and-route timing simulations showed that a PP-controller of 16 outputs can run up to 324 MHz in a Virtex2 XC2V250 device taking 20 slices.
This means the PP-technique allows a micropipelined design to be mapped into an FPGA synchronous design that would run at over 160 MHz.

Clock tree power: A Cordic Core global clocked circuit running at 10 MHz reported a clock tree power consumption of 2.24 mW. An equivalent PP-pipelined Cordic Core circuit showed a global clock tree power consumption of 0.66 mW. This represents an overall reduction in dynamic power consumption of around 30% for the whole design. However, due to the limited number of dedicated clock lines of FPGAs, it is expected that the power reduction capabilities of the PP-technique would be immediately better suited for VLSI designs.

Discussion: For the case when all delays are known, as in the synthesis of global clocked pipelined designs, it seems that the PP-pipeline technique has a potential for reducing clock tree power which would be beneficial in VLSI designs. For PP-pipelined designs converted from global clocked designs, the pulse signals can be used as register enable signals, simplifying the exploitation of clock gating at the level of language description of existing designs.

4.2. Data-driven frequency modulation

A PP-controller generates a pulse signal at a stage $p_{i+1}$ only after processing at stage $i$ has completed (condition I) and no further $p_i$ is generated until a pulse signal $p_{i+1}$ has been generated (condition II). This ensures the correct forward operation of a pipeline flow. A PP-module circuit can accept as an input a signal indicating the completion of processing at a particular stage. This is modelled by means of a delay line. This delay line is placed between the connection from $p_{i-1}$ to $done_i$.
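Conditions I and II can be illustrated with a small behavioural model. The following Python sketch is not the authors' PP-module circuit: the function name `simulate_pulses`, the per-stage `busy` flags, the `wait` counters standing in for the delay lines, and the example delays are all assumptions made for illustration. It only shows the ordering of pulses that the two conditions enforce; the tick counts it produces are a property of this sketch, not of the real controller.

```python
from typing import List, Tuple


def simulate_pulses(delays: List[int], n_items: int,
                    max_ticks: int = 1000) -> List[Tuple[int, int]]:
    """Return a log of (tick, k) entries, one per generated pulse p_k.

    Hypothetical model: p_0 loads stage 0, p_{i+1} moves data from
    stage i to stage i+1, and p_n (n = number of stages) releases the
    result. delays[i] is the assumed completion time of stage i in
    clock ticks and must be at least 1.
    """
    n = len(delays)
    busy = [False] * n    # stage i holds data that has not moved on yet
    wait = [0] * n        # ticks left until done_i is asserted
    loaded = 0            # items injected so far
    released = 0          # items that have left the last stage
    log: List[Tuple[int, int]] = []

    for tick in range(max_ticks):
        if released == n_items:
            break
        # All decisions use the state at the start of the tick.
        done = [busy[i] and wait[i] == 0 for i in range(n)]
        fire = [False] * (n + 1)
        # Condition II for stage 0: no new p_0 while stage 0 is occupied.
        fire[0] = loaded < n_items and not busy[0]
        for i in range(n):
            next_free = i == n - 1 or not busy[i + 1]
            # Condition I (done_i) and condition II (stage i+1 is free).
            fire[i + 1] = done[i] and next_free

        # Delay lines count down while a stage is processing.
        for i in range(n):
            if busy[i] and wait[i] > 0:
                wait[i] -= 1

        # Apply the pulses: p_k frees stage k-1 and occupies stage k.
        for k in range(n + 1):
            if not fire[k]:
                continue
            log.append((tick, k))
            if k > 0:
                busy[k - 1] = False
            if k < n:
                busy[k] = True
                wait[k] = delays[k] - 1
        loaded += fire[0]
        released += fire[n]
    return log


# Two items through three stages; the middle stage is assumed slow.
log = simulate_pulses([1, 3, 1], n_items=2)
print(log)
```

With these assumed delays, the second $p_1$ is held back until tick 5, one tick after $p_2$ has emptied the slow middle stage at tick 4, which is the behaviour condition II describes.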
For this case, the pipeline runs at a period of $T_{pipe} = \max_i(\lceil delay_i / T_{clk} \rceil) \cdot T_{clk}$, where $delay_i$ is the completion time associated with any stage $i$ and $T_{clk}$ is the period of the global clock to the PP-controller. In a circuit realization, a synchronous reset counter can be incorporated into each pipeline stage and triggered by each pulse $p$ to count processing time in integer intervals of $T_{clk}$. More elaborate data completion circuitry is available, such as variable processing time computation driven by data values as in [4]. In either case, a PP-controller can manage instantaneous periods for moving data across the pipeline stages in discrete steps from $2T_{clk}$ to $\lceil delay_i / T_{clk} \rceil \cdot T_{clk}$. Consequently, computation will be performed at a variable instantaneous frequency modulated by the variable processing time of data at the time of execution. A simulation to illustrate this behavior was carried out for a three-stage pipeline. Stages two and three were stripped of processing logic while stage one was simulated to have a variable processing completion time using the Verilog construct:
