20.01.2013 Views

Power Analysis for High-Speed I/O Transmitters


Power Analysis for High-Speed I/O Transmitters<br />

Hamid Hatamkhani, Chih-Kong Ken Yang<br />

University of California, Los Angeles<br />

9 Marisol, Newport Coast, CA, 92657, USA<br />

Tel: 1-949-246-6129, E-mail: hhatam@icsl.ucla.edu<br />

Abstract<br />

This paper studies the design tradeoffs involved in minimizing the power dissipation of multi-Gbps parallel I/O transmitters. A macromodel of a transmitter that can be optimized for power is presented. Also discussed is a means of accounting for the impact on power dissipation of deterministic jitter due to on-chip buffering. The model allows analysis under varying design constraints and circuit architectures. The optimization results provide guidance on the choice of architecture and data rate for achieving large aggregate I/O bandwidths.<br />

Introduction<br />

As the off-chip bandwidth of ICs is anticipated to<br />

exceed 1Tb/s within the next 5 years [7], power dissipation by<br />

the I/O is an increasing concern. In order to maintain<br />

reasonable power consumption, both the signaling and the<br />

transmitter architecture must be carefully considered and<br />

designed for low power.<br />

This paper considers the three sources of power<br />

consumption: the signaling power (power driven off-chip),<br />

driver and pre-driver power (the buffering required to drive<br />

the signal), and the signal-conditioning power (the power<br />

required to maintain good signal integrity). By using a<br />

detailed circuit model that relates output data eye (transmitter<br />

performance) with transistor size, we can optimize the design for power given various link specifications such as data rates, signaling levels, signal integrity requirements, and process technology. This paper studies the trade-off between the link specifications and power dissipation for a multi-Gbps transmitter. The optimization indicates the data rates that minimize power dissipation and the minimum achievable power dissipation for large-aggregate-bandwidth links.<br />

Transmitter Architecture<br />

The basic transmitter architecture in this study, shown in Fig. 1, assumes several design conditions. The input data arrive at half the rate of the transmitted output. The capacitance driven by each input data bit is less than the equivalent capacitance of a gate of width 8*Lmin (for NMOS). The clock loading is also a gate width of 8*Lmin. Additional retiming latches and switches are used to precondition the data for pre-emphasis. A pre-driver and driver produce the output signal with the desired output swing, common-mode voltage and output impedance.<br />

Fig.1: Transmitter block diagram (labeled blocks: 2x1 Mux, Buf, Sw, PDrv, Drv; signals: X[n], X[n − 1])<br />

Two common signaling levels are investigated: low common-mode (LCM) and high common-mode (HCM). With LCM, the output driver devices are in triode and provide impedance matching (Fig. 2c and Fig. 2d). With HCM, the output driver devices work in saturation and perform current steering to the output. A 50-ohm resistor provides the impedance matching (Fig. 2a and Fig. 2b).<br />

The amount of swing at the output of the driver is<br />

determined either by the specification or by calculating the<br />

amount of attenuation of the channel and sensitivity of the<br />

receiver. In most of the paper, the nominal output swing is<br />

250mV. Other swings are investigated later in the paper.<br />

Fig.2: Output drivers: (a) single-ended HCM (b) differential HCM (c) single-ended LCM (d) differential LCM.<br />

The study considers not only the basic transmitters but also the impact of signal integrity. To reduce ISI, a first-order (2-tap) pre-emphasis filter is used in the study:<br />
$$Y[n] = X[n] - \eta X[n-1] \qquad (1)$$<br />
Our analysis studies a digitally implemented pre-emphasis. Switches (4 bits) choose between X[n] and X[n − 1] to program η. The output driver is split into binary-weighted branches similar to a current-mode (HCM) or resistive (LCM) D/A converter [1,2]. Additionally, to account for output-swing control (HCM) and output-impedance control (LCM), the pre-driver is also designed and properly sized to use a feedback control voltage.<br />
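As a rough illustration of the 2-tap filter Y[n] = X[n] − ηX[n − 1], the following Python sketch applies it to a bit stream. The function name `pre_emphasize`, the mapping of bits to ±1 symbols, the initial condition X[−1] = 0, and the mapping η = code/16 for a 4-bit setting are all assumptions made for this example and are not taken from the paper.<br />

```python
def pre_emphasize(bits, code, levels=16):
    """Apply Y[n] = X[n] - eta * X[n-1] to a sequence of bits.

    bits:   iterable of 0/1 values, mapped to -1/+1 symbols (assumed mapping).
    code:   integer in 0..levels-1; eta = code / levels (hypothetical mapping
            of a 4-bit setting to the tap weight).
    """
    if not 0 <= code < levels:
        raise ValueError("code out of range")
    eta = code / levels
    symbols = [2 * b - 1 for b in bits]
    out = []
    prev = 0  # assumed initial condition: X[-1] = 0
    for x in symbols:
        out.append(x - eta * prev)
        prev = x
    return out
```

With this sketch, a transition (previous symbol of opposite sign) produces a larger output magnitude than a repeated symbol, which is the intended high-frequency boost.<br />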

Transmitter Model<br />

The transmitter power optimization is constrained by<br />

the desired output data-eye opening. The output voltage swing<br />

and common-mode level determine the driver architecture and<br />

transistor sizing. The signal-integrity considerations (specifically, swing control, impedance control, and pre-emphasis) determine the pre-driver and switching design.<br />

The output timing margin defines the maximum tolerable jitter, both deterministic jitter (DJ) and random jitter (RJ). Since RJ is due to the clock source (i.e. a PLL) and supply/substrate noise, it is not currently considered in the model. DJ at the transmitter output (before the filtering of the channel) is primarily due to on-chip ISI that results from on-chip low-pass filtering.¹ The filtering is due to transistor sizing and fan-out within the transmitter. One of the key optimizations included in this model is therefore efficient device sizing that minimizes power while constraining DJ. The DJ model for a first-order system is:<br />

$$DJ = -\frac{\tau}{T_{bit}} \ln\left(1 - e^{-T_{bit}/\tau}\right) \qquad (2)$$<br />

where τ is the time constant of the system.<br />
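The first-order DJ expression can be evaluated numerically, and inverted to find the largest time constant that still meets a DJ budget. The sketch below is only an illustration under the first-order assumption; the function names `dj_first_order` and `max_tau_for_dj` and the bisection approach are choices made for this example, not part of the paper's optimization.<br />

```python
import math


def dj_first_order(tau, t_bit):
    """DJ as a fraction of the bit time for a first-order system:
    DJ = -(tau / t_bit) * ln(1 - exp(-t_bit / tau))."""
    if tau <= 0 or t_bit <= 0:
        raise ValueError("tau and t_bit must be positive")
    return -(tau / t_bit) * math.log1p(-math.exp(-t_bit / tau))


def max_tau_for_dj(dj_target, t_bit, iterations=60):
    """Largest tau whose DJ stays below dj_target, found by bisection.

    Relies on DJ growing monotonically with tau for a fixed bit time.
    """
    if dj_target <= 0:
        raise ValueError("dj_target must be positive")
    lo, hi = 0.0, t_bit
    while dj_first_order(hi, t_bit) < dj_target:
        hi *= 2.0  # widen the bracket until it contains the target
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if dj_first_order(mid, t_bit) < dj_target:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, `max_tau_for_dj(0.11, t_bit)` gives the slowest first-order node that keeps DJ at or below 11% of the bit time; slower nodes imply larger fanout and therefore less capacitance to drive.<br />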

1 Some amount (commonly


For several stages of logic, we approximate the deterministic jitter of stage i with the following model:<br />
$$\begin{cases} DJ_i = DJ_{i-1} - \left( A\, t_{d_i} \ln(1 - e^{-X}) + B\, t_{d_{i-1}} \ln(1 - e^{-Y}) \right) / T_{bit} \\ X = \dfrac{C\,(1 - DJ_{i-1})\, T_{bit}}{t_{d_i}}, \quad Y = \dfrac{D\,(1 - DJ_{i-1})\, T_{bit}}{t_{d_{i-1}}} \end{cases} \qquad (3)$$<br />

A, B, C and D can be estimated by a mean-square fit to simulated data from various logic blocks in the transmitter (i.e. inverter chains and transmission gates). The delays ($t_{d_i}$ and $t_{d_{i-1}}$) are modeled to within 15% error using an α-power model and the Elmore delay formula [3,4]. Fig. 3 reports the DJ relative error, normalized to $t_{FO4}/T_b$, for two inverter chains. Chain 1 is composed of six inverters sized with constant fanout (FO-4) at 3Gbps. Chain 2 is six inverters with variable fanout (4, 9/8, 3, 1.5, 2 and 4) at 3.5Gbps. The simulated data is from a 0.18-µm technology. The variable-fanout case clearly stresses our model, resulting in an error of 20%. However, such large variations in fanout are not expected, so the 20% can be considered an error bound.<br />
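To make the stage-by-stage accumulation concrete, here is a minimal Python sketch of a recursion of the form used in the multi-stage DJ model, with fitting coefficients A, B, C and D. The coefficient values are not given in the text, so any values passed in are placeholders. The function name `dj_chain`, and the choice to treat the first listed delay only as the predecessor of the second stage, are assumptions made for this example.<br />

```python
import math


def dj_chain(stage_delays, t_bit, a, b, c, d, dj_in=0.0):
    """Accumulate DJ (as a fraction of t_bit) along a chain of stages.

    stage_delays: per-stage delays t_d; stage_delays[0] is used only as the
                  predecessor delay of the next stage (assumption).
    a, b, c, d:   fitting coefficients (placeholder values, not the paper's).
    Returns the DJ after each stage from the second one onward.
    """
    dj = dj_in
    history = []
    for i in range(1, len(stage_delays)):
        if dj >= 1.0:
            raise ValueError("eye is closed: DJ reached the full bit time")
        td_i, td_prev = stage_delays[i], stage_delays[i - 1]
        open_time = (1.0 - dj) * t_bit  # remaining eye width seen by this stage
        x = c * open_time / td_i
        y = d * open_time / td_prev
        # Both log terms are negative, so DJ grows from stage to stage.
        dj = dj - (a * td_i * math.log1p(-math.exp(-x))
                   + b * td_prev * math.log1p(-math.exp(-y))) / t_bit
        history.append(dj)
    return history
```

In an optimization loop, the stage delays would come from the delay model for a candidate sizing, and the final entry of the returned list would be compared against the DJ constraint.<br />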

Fig.3: DJ normalized relative error (%) versus inverter number in the chain. Chain1: 6 inverters sized in FO4. Chain2: 6 inverters with fanout of: 4, 9/8, 3, 1.5, 2, 4<br />

The DJ model is used in an optimization to find device sizes that satisfy the final output eye constraints. After determining the optimal device sizes based on the voltage and timing constraints, the total power can be calculated as the sum of the dynamic (on-chip CV²f and crowbar) power and the off-chip signaling power. Since active power is dominant, we do not consider leakage.<br />
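The power bookkeeping described above can be sketched in a few lines of Python. The function names, the explicit activity factor, and the idea of passing crowbar and signaling power in as precomputed numbers are assumptions made for illustration; the paper does not state how each term is evaluated inside the model.<br />

```python
def dynamic_power(c_switched, vdd, f_clk, activity=0.5):
    """Switching power activity * C * V^2 * f, in watts.

    activity is an assumed switching-activity factor (hypothetical default).
    """
    return activity * c_switched * vdd ** 2 * f_clk


def total_power(c_switched, vdd, f_clk, p_crowbar, p_signaling, activity=0.5):
    """Total transmitter power: on-chip dynamic (switching plus crowbar)
    power plus off-chip signaling power. Leakage is ignored."""
    return dynamic_power(c_switched, vdd, f_clk, activity) + p_crowbar + p_signaling
```

Because the switching term scales directly with the activity factor, a model and a simulation that assume different activity will disagree on power even when the device sizes match.<br />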

We compare the optimization results from our model with Spice simulations for the power dissipation of basic transmitters. Fig. 4 shows good agreement between the results from Spice and the model. In this figure, simulation reports smaller values for power than the model. This is because the switching activity assumed in the model differs from that in the simulation. Figure 5 reports the % error when comparing our model with Spice as we increase the data rate. The bit time of the plot is normalized to $t_{FO4}/T_b$. The error is less than ±10%.<br />

Since the signaling power remains essentially constant when driving a constant load impedance, one of the primary power tradeoffs is on-chip, through optimizing the sizing for a given DJ.<br />

Fig.4: Power (mW) versus (bit time)/(FO4 delay) from the model and Spice simulation for single-ended HCM and LCM architecture with pre-emphasis. DJ=11% of bit time<br />

Fig.5: Normalized DJ relative error (%) versus (bit time)/(FO4 delay) for single-ended HCM and LCM architecture with pre-emphasis. DJ=11% of bit time.<br />

Achieving smaller DJ necessitates using smaller fanouts in gate sizing, which in turn leads to an increase in capacitance and power dissipation. Figure 6 shows the power of a single-ended LCM transmitter with pre-emphasis for varying DJ specifications. Note that reducing the DJ to less than 11% causes a considerable increase in power.<br />

From our initial analysis, power can be optimized by<br />

maximizing the DJ within the desired constraint. Furthermore,<br />

setting a proper DJ constraint (to no less than 9%) can impact<br />

power significantly.<br />

Analysis Using the Optimization Model<br />

Our analytical model can provide further insight into<br />

the design tradeoffs. Already shown is the power penalty of<br />

HCM compared to LCM. The model can also be used to<br />

illustrate the impact of pre-emphasis. There is clearly a<br />

tradeoff between power consumption and data rate. For a<br />

given desired aggregate data rate, there is an optimal data rate<br />

per pin.²<br />

2 This is equivalent to energy-per-bit


Fig.6: DJ (% of bit time) vs. power (mW) for a single-ended LCM architecture with pre-emphasis at 4Gbps. Process is 0.18µ.<br />

The impact of varying the output signal swing or DJ constraints can also be studied. Furthermore, by adjusting the model fitting parameters for a different technology, we can compare the optimal data rate for different technologies.<br />

Fig. 7 shows the results of the model for LCM and HCM architectures. Since current standards limit the DJ to less than 13% of bit time [5,6], we use 11% of bit time for the maximum tolerable DJ from on-chip buffering. For the LCM architecture, applying pre-emphasis is more power hungry at higher data rates. This can be explained by the larger logic-related capacitance, which makes dynamic power dominant. Differential signaling reduces L di/dt noise considerably at the expense of a larger dynamic power budget. However, notice that the penalty is not significant, especially in the HCM architecture, where signaling power is dominant. From the figure it is evident that HCM is not an interesting choice for power reduction at data rates of less than 4.5Gbps (bit time of 222ps in 0.18-µm technology).<br />

Fig. 7: Power dissipation (mW) versus (bit time)/(FO4 delay) in 0.18µ CMOS process for HCM and LCM: simple, pre-emphasis, and pre-emphasis with differential signaling. DJ=11% of bit time<br />

We next use the model to analyze the impact of the position of the multiplexing in the transmitter architecture. The multiplexing function can be placed at various points along the transmitter (i.e. buffering, pre-emphasis switches, pre-driver, and driver). Figure 9 compares two possible placements: within the buffering and at the output. The output multiplexing implementations for both HCM and LCM are shown in Fig. 8. X1 and X2 are the half-rate inputs driven to the output on the appropriate level of the clock signal. Multiplexing at the output achieves higher-speed operation by reducing the on-chip data rate. However, this comes at the expense of more power consumption because of larger device sizes and clock power. Figure 9 illustrates the increase in power. The on-chip logic power essentially doubles, as shown in the LCM case.<br />

Fig.8: Output multiplexing (a) HCM (b) LCM<br />
Fig.9: Power (mW) versus (bit time)/(FO4 delay) for two different multiplexing placements, Mux(middle) and Mux(output), in single-ended LCM and HCM architecture with pre-emphasis. DJ=11% of bit time<br />

To achieve a data channel of a given aggregate data<br />

rate, transceivers are parallelized in a wide bus. The optimum<br />

data rate depends on power budget and number of I/O pins.<br />

Fig.10: Power dissipation (W) versus (bit time)/(FO4 delay) for sending 1Tbps data in LCM architecture (simple, pre-emphasis, and pre-emphasis with differential signaling; marked optima at 2.5FO4, 3.3FO4 and 3.6FO4). DJ=11% of bit time. Process is 0.18µ<br />

If the number of pins is not constrained, we can use the model to find the data rate per pin that minimizes power. In the LCM architecture, signaling power is dominant only at low data rates. As the data rate increases, dynamic power becomes dominant. Fig. 10 illustrates the tradeoff and shows that the optimum bit time for each I/O when transmitting a constant aggregate data rate of 1Tbps is 3.6 FO4 for a differential transmitter using pre-emphasis. In the HCM architecture, since power consumption is mainly due to signaling, it is more desirable to increase the data rate in order to reduce the number of signaling I/O outputs. The optimum bit time for each I/O when transmitting a constant aggregate data rate of 1Tbps is 2.4FO4 for a single-ended HCM architecture with pre-emphasis. However, it is 4.7 times more power hungry than the single-ended LCM architecture with pre-emphasis. Therefore LCM is more interesting for power reduction if the number of pins is not constrained. We will concentrate on this architecture for the remaining analyses.<br />
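The search for an optimum per-pin bit time at a fixed aggregate bandwidth can be sketched as a one-dimensional minimization. The Python example below uses a deliberately simple, hypothetical per-pin power model: a constant signaling power plus a per-bit logic energy that rises steeply as the bit time approaches a minimum achievable value. The functional form, the parameter names (`t_fo4`, `p_sig`, `e0`, `k`, `t_min`) and any numbers supplied to it are assumptions for illustration and do not reproduce the paper's macromodel or its results.<br />

```python
def pin_power(t_fo4_units, t_fo4, p_sig, e0, k, t_min):
    """Hypothetical per-pin power (W) at a bit time given in FO4 units.

    p_sig: constant signaling power per pin.
    e0, k, t_min: shape the per-bit logic energy, which grows as the bit
                  time approaches t_min (assumed form, valid for t > t_min).
    """
    t_bit = t_fo4_units * t_fo4
    e_dyn = e0 * (1.0 + k / (t_fo4_units - t_min))
    return p_sig + e_dyn / t_bit


def aggregate_power(t_fo4_units, bandwidth, **model):
    """Total power for a fixed aggregate bandwidth (bits/s)."""
    t_bit = t_fo4_units * model["t_fo4"]
    pins = bandwidth * t_bit  # pin count, treated as a continuous quantity
    return pins * pin_power(t_fo4_units, **model)


def best_bit_time(bandwidth, lo, hi, steps=2000, **model):
    """Grid search for the bit time (FO4 units) that minimizes total power.

    lo must be larger than t_min so the model stays finite.
    """
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: aggregate_power(t, bandwidth, **model))
```

In this toy model the total power equals the bandwidth times the energy per bit, so the signaling term pushes toward short bit times (fewer pins) while the logic term pushes toward long ones, and the minimum sits where the two balance.<br />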

Raising the output swing increases signaling power dissipation. However, the larger swing also leads to bigger devices at the output driver. With both signaling and dynamic power increasing, the shift in the optimal bit time is not immediately obvious. Our optimization shows that the increase in dynamic power is less than the increase in signaling power, and therefore the optimum data rate shifts to a higher value (1/3FO4). By the same reasoning, decreasing the output swing shifts the optimum data rate to smaller values. Figure 11 shows the corresponding change in power and optimal bit time with varying signal swing. It is worthwhile to note that power increases super-linearly with signal swing but not quite quadratically, since signaling power is not the only source of power dissipation.<br />

Fig.11: Power dissipation for sending 1Tbps data in a single-ended LCM architecture with pre-emphasis for various signal swings. DJ=11% of bit time. Process is 0.18µ<br />

Finally, the impact of technology scaling is analyzed. As we would expect, a technology shrink improves both speed and power by reducing both capacitance and supply voltage. Therefore, from equations (2) and (3), the optimum data rate for signaling at a given aggregate bandwidth in the LCM architecture is expected to scale roughly with gate delay. Figure 12 validates the scaling in power and bit time. However, it shows that the optimal bit time in an LCM architecture scales slightly faster than gate delay (2.6FO4 versus 3.3FO4 in 0.13µm and 0.18µm respectively). The reason is that the output swing is not changed in the analysis, causing a higher ratio of signaling power to dynamic power. This effect would not be nearly as noticeable in HCM transmitters.<br />

Fig.12: Power dissipation (W) versus (bit time)/(FO4 delay) for sending 1Tbps data in single-ended LCM architecture with pre-emphasis in 0.18µm and 0.13µm processes (marked optima at 3.3FO4 and 2.6FO4, respectively). DJ=11% of bit time<br />

Conclusion<br />

This paper presents a macromodel of a transmitter that can optimize power dissipation under various data rates and signal data-eye constraints. This model is used to provide information on the choice of architecture, signal swing, and optimum data rate. Using the model, we show that the LCM architecture is more efficient for low-power design than the HCM architecture, especially when the output swing is not large (less than 350mV). We also show that output multiplexing leads to higher bandwidth, but at the expense of 2x the on-chip power dissipation because of larger device sizes and clock power. The impact is less for HCM because of the smaller device sizes and the passive pull-up. When transmitting a given aggregate data rate across multiple pins, the optimal bit time is larger (roughly 3.5FO4 delays) for LCM than it is for HCM (2.4FO4). With technology scaling, the optimal data rate scales faster than the FO4 gate delay due to the higher ratio of signaling power to dynamic power. The analysis reassures us<br />

that transmitting >1Tb/s of aggregate data rate off-chip is<br />

clearly possible without excessive power consumption (
