A clocking technique for FPGA pipelined designs


694 O. Cadenas, G. Megson / Journal of Systems Architecture 50 (2004) 687–696

Sine and Cosine functions for input angles with 16 bits of precision.

Obtaining a PP-pipelined Cordic Core: The methodology presented in Section 3.3 was applied to translate the 15-stage global clocked pipelined Cordic Core. First, VHDL code was written for the PP-modules, and then captured as a parameterized structural object to act as a PP-controller. An equivalent PP-pipelined VHDL specification is straightforward after interconnecting the PP-controller to the original global clocked Cordic Core specification.

Power estimation process: The Xilinx XPower tool [19] was used to estimate power for the global clocked and the PP-pipelined Cordic Core. XPower computes power from the node switching rate activity of the circuits. Switching rate activity was obtained from post place-and-route timing simulation data produced by the simulator ModelSim XE 5.5b. Beforehand, both circuits were placed and routed with Xilinx ISE 5.1. Synthesis of the designs was performed with Synplify Pro 7.2 targeting Virtex II devices. All design entry was written in VHDL.

Area and time results: The global clocked pipelined Cordic Core runs at 205 MHz, taking 475 slices in a Virtex2 XC2V250. An equivalent PP-pipelined Cordic Core runs at 155 MHz and takes 488 slices in the same device. The area overhead of the PP-technique (mainly the PP-controller) for this simple design is less than 5%. For reference, post place-and-route timing simulations showed that a PP-controller of 16 outputs can run at up to 324 MHz in a Virtex2 XC2V250 device, taking 20 slices. This means the PP-technique allows a micropipelined design to be mapped into an FPGA as a synchronous design that would run at over 160 MHz.

Clock tree power: A Cordic Core global clocked circuit running at 10 MHz reported a clock tree power consumption of 2.24 mW. An equivalent PP-pipelined Cordic Core circuit showed a global clock tree power consumption of 0.66 mW.
This represents an overall reduction in dynamic power consumption of around 30% for the whole design. However, due to the limited number of dedicated clock lines in FPGAs, it is expected that the power reduction capabilities of the PP-technique would be better suited to VLSI designs.

Discussion: For the case when all delays are known, as in the synthesis of global clocked pipeline designs, it seems that the PP-pipeline technique has potential for reducing clock tree power, which would be beneficial in VLSI designs. For PP-pipelined designs converted from global clocked designs, the pulse signals can be used as register enable signals, simplifying the exploitation of clock gating at the level of the language description of existing designs.

4.2. Data-driven frequency modulation

A PP-controller generates a pulse signal $p_{i+1}$ at a stage only after processing at stage $i$ has completed (condition I), and no further $p_i$ is generated until a pulse signal $p_{i+1}$ has been generated (condition II). This ensures the correct forward operation of the pipeline flow. A PP-module circuit can accept as an input a signal indicating the completion of processing at a particular stage. This is modelled by means of a delay line placed on the connection from $p_{i-1}$ to $done_i$. For this case, the pipeline runs at a period of

$$T_{pipe} = \max_i\left(\lceil delay_i / T_{clk} \rceil\right) T_{clk},$$

where $delay_i$ is the completion time associated with any stage $i$ and $T_{clk}$ is the period of the global clock to the PP-controller. In a circuit realization, a synchronous reset counter can be incorporated into each pipeline stage and triggered by each pulse $p$ to count processing time in integer intervals of $T_{clk}$. More elaborate data completion circuitry is available, such as variable processing time computation driven by data values as in [4]. In either case, a PP-controller can manage instantaneous periods for moving data across the pipeline stages in discrete steps from $2T_{clk}$ to $\lceil delay_i / T_{clk} \rceil T_{clk}$.
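The period expression for T_pipe given above can be illustrated with a small behavioural model. The Python sketch below is an illustration only, not the authors' implementation and not HDL: the names `stage_cycles`, `pipe_period_ns` and `MIN_CYCLES` are hypothetical, and the floor of two clock periods is an assumption taken from the stated minimum step of 2T_clk.

```python
import math

# Assumption: the controller never issues pulses closer than 2*T_clk,
# following the minimum step of 2*T_clk stated in the text.
MIN_CYCLES = 2


def stage_cycles(delay_ns, t_clk_ns):
    """Clock periods one stage needs: ceil(delay / T_clk), floored at MIN_CYCLES."""
    if t_clk_ns <= 0:
        raise ValueError("t_clk_ns must be positive")
    return max(MIN_CYCLES, math.ceil(delay_ns / t_clk_ns))


def pipe_period_ns(delays_ns, t_clk_ns):
    """T_pipe = max_i(ceil(delay_i / T_clk)) * T_clk for the given stage delays."""
    return max(stage_cycles(d, t_clk_ns) for d in delays_ns) * t_clk_ns


if __name__ == "__main__":
    t_clk = 4  # ns, the clock period used for the simulated waveforms
    # Three-stage pipeline: only stage one has a completion delay;
    # stages two and three carry no processing logic (delay 0).
    for delay in (1, 8, 9, 25):
        cycles = stage_cycles(delay, t_clk)
        period = pipe_period_ns([delay, 0, 0], t_clk)
        print(f"delay={delay} ns -> {cycles} clock periods, T_pipe={period} ns")
```

With T_clk = 4 ns, a stage-one delay of 25 ns gives ceil(25/4) = 7 clock periods (28 ns), which agrees with the 7T_clk period reported for Fig. 9 later in this section. Shorter delays fall back to the assumed two-period floor.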
Consequently, computation will be performed at a variable instantaneous frequency, modulated by the variable processing time of the data at the time of execution. A simulation to illustrate this behavior was carried out for a three-stage pipeline. Stages two and three were stripped of processing logic, while stage one was simulated to have a variable processing completion time using the Verilog construct:

    always @(posedge p[0])
    begin
        // draw this cycle's completion time; the second argument of
        // $dist_normal is the mean, the third the standard deviation
        delay = $dist_normal(seed, normal, sd);
        t = 1'b0;           // completion flag cleared at the pulse
        #(delay) t = 1'b1;  // completion flag raised after the drawn delay
    end

Completion detection for stage one, t, is activated after a time delay drawn from a normal distribution with standard deviation sd. The activation occurs after the rising edge of the clock pulse to stage one. Simulated waveforms are shown in Fig. 9 for $T_{clk} = 4$ ns. Note that when delay = 25 ns ($> 2T_{clk}$), p[0] is generated with a period of $7T_{clk}$, complying with the given formula. It is also seen that the minimum period for p[0] is $2T_{clk}$. With data completion circuitry having average completion times, implemented in FPGA hardware running at around 50 MHz, frequency modulation can be obtained in discrete steps of less than 10%. The realization of these implementations is currently being investigated.

[Fig. 9 waveform image not reproduced; signals shown: p[2], p[1], p[0], clk, delay]
Fig. 9. Simulated waveforms of the PP-technique operation showing data completion at stage one and no processing at stages two and three. The local clocks are p[0], p[1] and p[2], respectively, while the processing time delay of stage one is related to p[0].

5. Conclusion

The PP-pipeline is introduced as a versatile pipeline clocking mechanism suitable for FPGA implementation. The PP-pipeline can be used as a technique to migrate asynchronous pipelined designs, such as micropipelines, into FPGAs, and also as an alternative clocking mechanism to existing global synchronous pipelines. No redesign of existing pipelined datapaths is needed. PP-pipelined designs result in equivalent micropipelined designs with a fixed bundle constraint but using a synchronous methodology. A PP-pipeline synthesizes with circuit resources commonly available in commercial FPGAs. As alternatives to global clocked pipelined designs, equivalent PP-pipelined designs show lower power dissipation in the main clock tree and hence are also suitable for VLSI implementation.
The technique can be extended to incorporate data-completion circuitry into a pipelined design using a synchronous approach. Simulations show that it is possible to handle variable data-completion times to modulate the instantaneous frequency in discrete time steps across the pipeline stages. The duration of the discrete steps is in practice small compared to the delay of the coarse-grain combinational logic of typical pipeline stages. These advantages are due to a regular PP-controller based on simple cooperating state machines.

References

[1] L. Benini, P. Siegel, G. DeMicheli, Designing for low power circuits: practical recipes, IEEE Circuits and Systems Magazine 1 (1) (2001) 6–25.
[2] F.C. Cheng, Practical design and performance evaluation of completion detection circuits, in: Int. Conf. on Computer Design, ICCD, October 1998.
[3] A. Davis, S. Nowick, An introduction to asynchronous system design, University of Utah, Report N. UUCS-97-013, Salt Lake City, 1997.
[4] A.D. Gloria, M. Olivieri, Completion-detection carry select addition, IEE Proceedings - Computer Digital Techniques 147 (2) (2000) 93–100.
[5] S. Hauck, S. Burns, G. Borriello, C. Ebeling, Montage: an FPGA for synchronous and asynchronous circuits, in: 2nd Int. Workshop on Field-Programmable Gate Arrays, August 1992.
[6] S. Hauck, S. Burns, G. Borriello, C. Ebeling, An FPGA for implementing asynchronous systems, IEEE Design & Test of Computers 11 (3) (1994) 60–69.
[7] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, 1995.