10.01.2015 Views

Low-Power Logic Synthesis


Low-Power Logic Synthesis

Sying-Jyan Wang (王行健)
Dept. of Computer Science (資訊科學系)
National Chung-Hsing Univ. (國立中興大學)


Outline

• Overview
– Power model
– Low-power design techniques
• Low-Power Logic Synthesis
• Switching Activity Reduction
– Sequential
– Combinational
• Retiming-Based Approach
• Preliminary Results
• Conclusion and Future Work

NCHUCS 2


Power Model

• Total power

$P_{total} = P_s + P_d + P_{sc}$

– $P_s$: static power due to leakage current
– $P_{sc}$: power due to short-circuit current during switching transients
– $P_d$: dynamic power due to charging and discharging of capacitance
• Dynamic power ($P_d$) is usually the dominant factor in CMOS technology


Dynamic Power

• Dynamic power of a CMOS gate

$P_{avg} = \frac{1}{2} \times C_L \times V_{DD}^2 \times f_p \times N$

– $C_L$: load capacitance
– $V_{DD}$: power supply voltage
– $f_p$: clock frequency
– $N$: average number of gate output transitions (switching activity) per clock cycle
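As a rough illustration of the dynamic power formula above, the following Python sketch evaluates it at a made-up operating point and compares the effect of lowering V_DD with the effect of lowering N. The function name and every numeric value are assumptions chosen for illustration; none of them come from the slides.

```python
def dynamic_power(c_load, v_dd, f_clk, n_switch):
    """Average dynamic power of a gate: 0.5 * C_L * V_DD^2 * f_p * N (watts)."""
    return 0.5 * c_load * v_dd ** 2 * f_clk * n_switch


# Hypothetical operating point: 50 fF load, 3.3 V supply, 100 MHz clock,
# 0.5 output transitions per cycle.
base = dynamic_power(c_load=50e-15, v_dd=3.3, f_clk=100e6, n_switch=0.5)

# Lowering the supply voltage gives a quadratic saving.
low_vdd = dynamic_power(c_load=50e-15, v_dd=2.5, f_clk=100e6, n_switch=0.5)

# Halving the switching activity gives a linear saving.
low_n = dynamic_power(c_load=50e-15, v_dd=3.3, f_clk=100e6, n_switch=0.25)

print(f"base power:        {base:.3e} W")
print(f"V_DD 3.3 -> 2.5 V: {low_vdd / base:.2%} of base")
print(f"N 0.5 -> 0.25:     {low_n / base:.2%} of base")
```

The supply voltage enters quadratically while the switching activity enters linearly, which is why both appear as optimization goals.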


Low-Power Design Techniques

• Can be applied at any level of the design process
• Various techniques can be applied simultaneously
• Goal of optimization
– Reduce $C_L$
– Reduce $V_{DD}$
– Reduce $N$
– Clock frequency is NOT a target

[Figure: design levels: Architectural, RTL, Logic, Circuit, Physical]


Architecture Level Techniques<br />

• Sleep Mode<br />

– Deactivate non-operating functional units<br />

– Require little hardware and design complexity<br />

• Asynchronous data processing<br />

– Conduct computation only when necessary<br />

– Reduce switching activities<br />

• Multiple power supply voltages<br />

– Reduce power supply voltages without sacrificing performance<br />

• Architecture transformation<br />



Multiple Supply Voltages<br />

• Basic idea<br />

– Slowing down non-critical paths does not increase the cycle time
– So the supply voltages of non-critical components can be reduced
• Implementation
– Architecture level: reduce the power supply voltage for selected modules
– Logic level: use dual power supply voltages
• An Integer Linear Programming (ILP) technique is used to find the best solution
• Problem: unmatched voltage swings may actually increase power


RTL Techniques

• State encoding for FSM
– Assign adjacent codes to neighboring states
– To reduce switching activities in the combinational logic
• Retiming
– Moving registers to reduce glitches

[Figure: state transition graph with states S1, S2, S3 and the state codes 00, 11, 01]
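The state-encoding idea can be sketched by counting how many state bits flip on each transition, weighted by how often the transition occurs. In the Python sketch below, the transition probabilities, the two candidate encodings, and the function names are all hypothetical; the cost measures only state-register bit flips, which is a simple proxy for the switching the state bits cause in the combinational logic.

```python
def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def expected_bit_flips(encoding, transitions):
    """Expected number of state bits that toggle per transition."""
    return sum(
        prob * hamming(encoding[src], encoding[dst])
        for (src, dst), prob in transitions.items()
    )


# Hypothetical 3-state FSM: (source, destination) -> transition probability.
transitions = {("S1", "S2"): 0.5, ("S2", "S3"): 0.3, ("S3", "S1"): 0.2}

# Two candidate encodings (assumed for illustration).
enc_a = {"S1": 0b00, "S2": 0b11, "S3": 0b01}
enc_b = {"S1": 0b00, "S2": 0b01, "S3": 0b11}

cost_a = expected_bit_flips(enc_a, transitions)
cost_b = expected_bit_flips(enc_b, transitions)
print(cost_a, cost_b)
```

Here the second encoding places the most frequent transition (S1 to S2) between adjacent codes, so its expected number of bit flips is lower.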


Circuit Level Techniques<br />

• Transistor sizing<br />

– Trade silicon area for speed and/or power<br />

• Special flip-flop and latch<br />

– Reduce power in storage elements<br />

• Special logic family<br />

– PPRPL<br />

– PCFL3<br />

– DyCML<br />

– SCSL<br />

– Etc.<br />



Low-Power Logic Synthesis

• Gate reorganization
– Try to reduce capacitive load
• Pass transistor logic
• Reducing switching activities
– Precomputation
– Partition
– Retiming


Gate Reorganization<br />

• Transform one logic circuit into another that
– Is functionally equivalent
– Consumes less power
• Apply at
– Logic synthesis
– Technology mapping


Pass Transistor Logic

• Most logic synthesis systems are based on logic gates
– Processed through Boolean equations
– Implemented with NAND/NOR gates
• Pass transistor logic can implement certain complex Boolean functions efficiently
• In some cases, the power consumption may be reduced


Switching Activity Reduction<br />

• Basic idea<br />

– The circuit output may be decided solely by part of the circuit function
– Thus the remaining part of the circuit can be “turned off” dynamically
• Implementation
– Extract part of the circuit so it can be computed first
• This part of the circuit is usually small
– “Freeze” registers that are not needed once the output is decided


Example<br />

• Consider the following n-bit comparator

[Figure: comparator block with inputs $a_{n-1} \dots a_0$ and $b_{n-1} \dots b_0$ and output $a > b$]

– The output is known if $a_{n-1} b_{n-1} = 10$ or $01$
– So at least 50% of the switching activity can be reduced if the FFs corresponding to $a_{n-2} \dots a_0$ and $b_{n-2} \dots b_0$ can be disabled
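The comparator observation can be checked by brute force. The Python sketch below enumerates all pairs of n-bit unsigned values, confirms that the result of a > b is settled by the most significant bits whenever they differ, and reports how often that happens. It assumes uniformly distributed unsigned inputs; the function names are illustrative.

```python
from itertools import product


def msb(value, n):
    """Most significant bit of an n-bit unsigned value."""
    return (value >> (n - 1)) & 1


def precomputable_fraction(n):
    """Fraction of input pairs whose a > b result is fixed by the MSBs alone."""
    total = 0
    decided = 0
    for a, b in product(range(1 << n), repeat=2):
        total += 1
        if msb(a, n) != msb(b, n):
            # When the top bits differ, a > b holds exactly when a's top bit is 1.
            assert (a > b) == (msb(a, n) == 1)
            decided += 1
    return decided / total


fraction = precomputable_fraction(4)
print(fraction)
```

Under these assumptions the lower-order registers would need to load new values in only half of the cycles.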


Precomputation<br />

• Proposed by M. Alidina et al. (IEEE Trans. CAD, 1994)
– Synthesize a small “precomputation logic” block in addition to the normal circuit
– This block operates one cycle ahead of the normal functional circuit (“precomputation”)
– Part of the normal circuit is suspended when the precomputation condition is true
– The precomputation block is determined by the ODC (Observability Don’t-Care) set of the output function


Basic Architecture<br />

[Figure: block diagram with Inputs, Input Registers, Controllable Input Registers, Precomputation Logic, Circuit, and Outputs]


Precomputation Logic: Version 1

[Figure: circuit diagram with inputs $x_1, x_2, \dots, x_n$, register R1 with load enable (LE), register R2, combinational block A with output f, and precomputation functions g1 and g2, each with a flip-flop (FF)]

g1 = 1 ⇒ f = 1
g2 = 1 ⇒ f = 0


Precomputation Logic: Version 1 (Cont'd)

• Idea
– g1 implements the ON-set of f
– g2 implements the OFF-set of f
– R1 is disabled if the precomputation condition is TRUE.
• Problem:
– Delay is increased
– Performance is degraded.
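One way to obtain functions with the properties g1 = 1 ⇒ f = 1 and g2 = 1 ⇒ f = 0 is to look for assignments of a chosen input subset that fix f no matter what the remaining inputs are. The Python sketch below does this by brute force on a truth table. It is only an illustration under that assumption, not the synthesis procedure from the cited paper; the function names and the 2-bit comparator used as f are hypothetical.

```python
from itertools import product


def precompute_functions(f, n_inputs, subset):
    """Return (g1, g2) as sets of assignments to the inputs listed in `subset`.

    g1 holds the assignments that force f = 1 for every value of the other
    inputs; g2 holds the assignments that force f = 0.
    """
    others = [i for i in range(n_inputs) if i not in subset]
    g1, g2 = set(), set()
    for sub_vals in product((0, 1), repeat=len(subset)):
        outputs = set()
        for other_vals in product((0, 1), repeat=len(others)):
            x = [0] * n_inputs
            for i, v in zip(subset, sub_vals):
                x[i] = v
            for i, v in zip(others, other_vals):
                x[i] = v
            outputs.add(f(x))
        if outputs == {1}:
            g1.add(sub_vals)
        elif outputs == {0}:
            g2.add(sub_vals)
    return g1, g2


def greater_than(x):
    """2-bit unsigned comparison a > b with x = [a1, a0, b1, b0]."""
    a = 2 * x[0] + x[1]
    b = 2 * x[2] + x[3]
    return int(a > b)


# Precompute on the two most significant bits a1 (index 0) and b1 (index 2).
g1, g2 = precompute_functions(greater_than, 4, subset=[0, 2])
print(g1, g2)
```

For this comparator, g1 is true only for (a1, b1) = (1, 0) and g2 only for (0, 1), which matches the comparator example earlier in the slides.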


Precomputation Logic: Version 2

[Figure: circuit diagram with the labels $x_1, x_2, \dots, x_n$, R1, R2, LE, A, R3, f, g1, g2]


Self-Timed Precomputation

[Figure: circuit diagram with inputs $x_1 \dots x_5$, registers R1 and R2 with enable inputs (en), blocks A and B, signal g, and output f]


Partition<br />

• Based on Shannon’s Expansion

$f(x_1, \dots, x_n) = x_i \cdot f_{x_i} + \bar{x}_i \cdot f_{\bar{x}_i}$

[Figure: inputs $x_1 \dots x_n$ feed registers R1 and R2 for the cofactor blocks $f_{x_i}$ and $f_{\bar{x}_i}$; a MUX controlled by $x_i$ selects the output f]
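Shannon's expansion can be verified on a small function. The Python sketch below builds the two cofactors of a function with respect to one input and recombines them with a multiplexer controlled by that input, then checks the result against the original function on every assignment. The helper names and the 3-input majority function are assumptions used only for illustration.

```python
from itertools import product


def cofactor(f, i, value):
    """Return f with input i fixed to `value`."""
    def restricted(x):
        y = list(x)
        y[i] = value
        return f(y)
    return restricted


def shannon_partition(f, i):
    """Recombine the cofactors of f with a MUX controlled by input i."""
    f_pos = cofactor(f, i, 1)  # f with x_i = 1
    f_neg = cofactor(f, i, 0)  # f with x_i = 0

    def mux(x):
        # Only the selected cofactor needs to be evaluated in a given cycle.
        return f_pos(x) if x[i] else f_neg(x)
    return mux


def majority(x):
    """3-input majority function."""
    return int(sum(x) >= 2)


partitioned = shannon_partition(majority, 0)
matches = all(
    majority(x) == partitioned(x) for x in product((0, 1), repeat=3)
)
print(matches)
```

Because only one cofactor block is selected per cycle, the registers feeding the other block are candidates for being held at their previous values.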


A Retiming-Based Approach<br />

• Move a part of the functional block, rather than reproduce the block.
– Assume the controlling value of gate G is c

[Figure: the circuit before and after retiming. Labels: inputs $x_1 \dots x_k$ with register R1 and block C; inputs $x_{k+1} \dots x_n$ with register R2 and block D; signals $y_1, y_2, \dots, y_m$ into gate G with output f; after retiming, a flip-flop (FF) holds $y_1$ and R2 has a load enable (LE) labeled $y_1 \neq c$. Superscripts $t-1$ and $t$ mark the clock cycle of each signal.]


An Example<br />

[Figure: example circuit with inputs a … j, flip-flops FFa … FFj producing a' … j', internal nets k … r, and output z]


An Example (Cont'd)

[Figure: the example circuit redrawn with flip-flops FFc … FFj, each with an enable input (E), and flip-flop FFk; signal labels a', b', c … j, c' … j', k, k', l … r, and output z]


Data Synchronization<br />

• A retimed block may create a synchronization problem if an internal node has fanout branches to other gates
• Need extra registers to solve this problem

[Figure: before retiming, register R holds $x_1 \dots x_k$ from cycle $t-1$ and block C produces $y_i$ and $y_j$ at cycle $t$; after retiming, block C produces $y_i$ and $y_j$ at cycle $t-1$ and two flip-flops (FF) deliver them at cycle $t$]


Selecting Load-Enabled Registers<br />

• Not all registers can be disabled<br />

– Inputs with fanout to other outputs cannot be disabled
– Ex: R2 cannot be disabled

[Figure: circuit diagram with blocks C and D, a flip-flop (FF), registers R1 and R2, gate G, and signals $y_i$ and $y_j$]


Selecting Retimed Block<br />

• Goal<br />

– Find a logic block which, when retimed, maximizes the reduction in switching activity
– The retimed block must be small in order to limit the number of extra registers
• In our experiment, at most 2 registers are needed
• Need to estimate the switching activities


Estimating Switching Activities<br />

• For a net y, let PS(y) be the probability that the signal switches in a clock cycle

$PS(y) = \Pr\{\text{a transition on } y \text{ at time } t\}$
$\quad = \Pr\{y^{t-1} \oplus y^{t} = 1\}$
$\quad = \Pr\{(y^{t-1} = 0 \wedge y^{t} = 1) \vee (y^{t-1} = 1 \wedge y^{t} = 0)\}$

• ESC(y, c): the expected amount of saved switching activity if net y is set to value c

$ESC(p) = \Pr\{y = c\} \times \sum_{z \in DCNS(p)} PS(z)$
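The two quantities above can be evaluated numerically once signal probabilities are available. The Python sketch below makes two assumptions that are not stated in the slides: the value of a net in one cycle is independent of its value in the previous cycle, so PS(y) reduces to 2p(1 - p) where p is the probability that y is 1; and DCNS(p) is supplied by the caller as a list of the PS values of its nets. All names and numbers are illustrative.

```python
def switching_probability(p_one):
    """PS(y) for a net that is 1 with probability p_one in each cycle.

    Assumes consecutive cycles are independent:
    Pr{0 -> 1} + Pr{1 -> 0} = (1 - p) * p + p * (1 - p).
    """
    return 2.0 * p_one * (1.0 - p_one)


def expected_saving(p_y_equals_c, dcns_switching):
    """ESC = Pr{y = c} times the sum of PS(z) over the nets z in DCNS."""
    return p_y_equals_c * sum(dcns_switching)


# Hypothetical nets assumed to be in DCNS, given by their probability of being 1.
dcns = [switching_probability(p) for p in (0.5, 0.5, 0.2)]

# Hypothetical probability that net y takes the value c.
saving = expected_saving(0.25, dcns)
print(dcns, saving)
```

A net with probability 0.5 of being 1 has the highest switching probability under this model, so blocks driven by such nets contribute the most to the estimated saving.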


Algorithm<br />

• Algorithm: Finding the optimal C-cones to be retimed.<br />

• Input: A netlist N and a given k.<br />

• Output: A subset of N to be retimed.<br />

• 1. MaxSaving←0, BestSolution←∅;<br />

• 2. Construct the set of all controlling points CP;<br />

• 3. for (i=1; i MaxSaving) {<br />

• 9. MaxSaving← ES(P);<br />

• 10. BestSolution←P;<br />

}<br />

}<br />

}<br />

11. Report CNS(P);<br />
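Several steps of the listing above did not survive, so the loop structure cannot be recovered from the slide. The Python sketch below shows only one plausible reading, stated as an assumption: enumerate every set of at most k controlling points, score each with a caller-supplied saving function standing in for ES(P), and keep the best. The function name, the interpretation of k, and the toy scoring function are all hypothetical.

```python
from itertools import combinations


def best_controlling_points(controlling_points, k, expected_saving):
    """Exhaustively search subsets of at most k controlling points.

    `expected_saving` stands in for ES(P); how the original algorithm
    computes it is not shown in the surviving text.
    """
    max_saving = 0.0
    best_solution = frozenset()
    for size in range(1, k + 1):
        for subset in combinations(controlling_points, size):
            candidate = frozenset(subset)
            saving = expected_saving(candidate)
            if saving > max_saving:
                max_saving = saving
                best_solution = candidate
    return best_solution, max_saving


# Toy scoring function: per-point savings minus a penalty for each extra
# point (values invented for illustration).
toy_saving = {"p1": 0.2, "p2": 0.5, "p3": 0.1}


def toy_es(points):
    return sum(toy_saving[p] for p in points) - 0.15 * (len(points) - 1)


best, saving = best_controlling_points(["p1", "p2", "p3"], 2, toy_es)
print(sorted(best), saving)
```

Exhaustive enumeration is practical only for small k, which is consistent with the requirement that the retimed block stay small.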



Experimental Results<br />

Circuit     New Results   Old Paper   Reference [1]
cmb         51.76%        22.60%      43.00%
cm138a      52.73%        72.40%      47.00%
majority    55.97%        41.00%      19.00%
cht         43.73%        30.30%      16.00%
cm150       51.27%        22.60%      43.00%
c…o         43.11%        29.30%      39.00%
mux         50.20%        52.20%      22.00%
pcle        52.17%        60.40%      30.00%
pcler8      56.09%        41.20%      38.00%
unreg       48.87%        28.40%      18.00%
dalu        43.94%        -           34.00%
i2          59.40%        -           65.00%
sao2        41.51%        -           65.00%
spla        52.31%        -           41.00%
seq         54.02%        -           65.00%
apex2       68.40%        -           42.00%
cps         39.84%        -           41.00%
duke2       42.57%        -           23.00%
misex2      48.79%        -           15.00%
misex3      29.63%        -           19.00%
e64         66.03%        -           75.00%
Average     50.11%        40.04%      38.10%
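The Average row can be re-derived from the per-circuit values in the results table. The Python sketch below copies the three columns as listed, assuming that rows with only two values belong to the New Results and Reference [1] columns, and recomputes each column average over the entries that are present. The variable and function names are illustrative.

```python
# (new results, old paper, reference [1]) in percent; None where no value is listed.
rows = [
    (51.76, 22.60, 43.00), (52.73, 72.40, 47.00), (55.97, 41.00, 19.00),
    (43.73, 30.30, 16.00), (51.27, 22.60, 43.00), (43.11, 29.30, 39.00),
    (50.20, 52.20, 22.00), (52.17, 60.40, 30.00), (56.09, 41.20, 38.00),
    (48.87, 28.40, 18.00), (43.94, None, 34.00), (59.40, None, 65.00),
    (41.51, None, 65.00), (52.31, None, 41.00), (54.02, None, 65.00),
    (68.40, None, 42.00), (39.84, None, 41.00), (42.57, None, 23.00),
    (48.79, None, 15.00), (29.63, None, 19.00), (66.03, None, 75.00),
]


def column_average(table, col):
    """Mean of one column over the rows that have a value, rounded to 2 places."""
    values = [row[col] for row in table if row[col] is not None]
    return round(sum(values) / len(values), 2)


averages = [column_average(rows, col) for col in range(3)]
print(averages)
```

Under that column assignment, the recomputed averages agree with the reported 50.11%, 40.04%, and 38.10%; the Old Paper average covers only the ten circuits for which a value is listed.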


A Comparison<br />

                                        Precomputation      Partition   Retiming
Area overhead                           Medium              Large       Small or none
Critical path delay                     Unchanged           More        Unchanged or less
Critical path delay in previous stage   Unchanged or more   Unchanged   Unchanged or more
Testability problem                     Yes                 No          No


Conclusion<br />

• Many low-power design techniques have been developed<br />

– Most of them can be applied simultaneously<br />

– Ad hoc solutions usually work well in practice<br />

• A new logic-level low-power synthesis method is presented
– The target is to reduce switching activities
– The performance is better than previous methods in terms of
• Reduction in switching activities
• Area
• Critical path delay


Future Work

• Accurate power estimation
• Extension of methods to datapath components
– Should be useful for pipelined ALU
• Retiming circuit to optimize
– Power
– Speed
• Testability issues
