Buffer Insertion


Interconnect Optimizations

A scaling primer
Ideal process scaling:

Device geometries shrink by s (= 0.7x)

Device delay shrinks by s

Wire geometries shrink by s

R/l : $\rho / (ws \cdot hs) = r/s^2$
Cc/l : $(hs)\,\epsilon / (Ss) = C_c$
C/l : similar
R/l doubles, C/l and Cc/l unchanged

(w: wire width, h: wire height, l: wire length, S: wire spacing)

Interconnect role
Short (local) interconnect
Used to connect nearby cells
Minimize wire C, i.e., use short min-width wires

Medium to long-distance (global) interconnect


Size wires to trade off area vs. delay
Increasing width: capacitance increases, resistance decreases
Need to find an acceptable tradeoff (the wire sizing problem)

Fat wires
Thicker cross-sections in higher metal layers
Useful for reducing delays for global wires
Inductance issues, sharing of limited resource

Cross-Section of A Chip

Block scaling
Block area often stays same
# cells, # nets double
Wiring histogram shape invariant

Global interconnect lengths don't shrink


Local interconnect lengths shrink by s

Interconnect delay scaling

Delay of a wire of length l:

$\tau_{int} = (rl)(cl) = rcl^2$ (first order)

Local interconnects:
$\tau_{int}: (r/s^2)(c)(ls)^2 = rcl^2$

Local interconnect delay unchanged (compare to faster devices)

Global interconnects:

$\tau_{int}: (r/s^2)(c)(l)^2 = (rcl^2)/s^2$

Global interconnect delay doubles: unsustainable!

Interconnect delay increasingly more dominant
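
The scaling argument above can be checked numerically. This is a minimal sketch, assuming illustrative unit values for r, c, and l; the function name `scaled_wire_delay` is hypothetical and not from the slides.

```python
def scaled_wire_delay(r, c, length, s, is_local):
    """First-order wire delay r*c*l^2 after one ideal scaling step by factor s."""
    r_scaled = r / s**2                              # R per unit length grows by 1/s^2
    c_scaled = c                                     # C per unit length is unchanged
    l_scaled = length * s if is_local else length    # only local wires shrink
    return r_scaled * c_scaled * l_scaled**2

# Illustrative values (assumptions, not from the slides)
r, c, length, s = 1.0, 1.0, 10.0, 0.7
base = r * c * length**2

local_ratio = scaled_wire_delay(r, c, length, s, True) / base
global_ratio = scaled_wire_delay(r, c, length, s, False) / base
print(f"local: x{local_ratio:.2f}, global: x{global_ratio:.2f}")
```

With s = 0.7 the local ratio stays at about 1 while the global ratio is 1/s^2, roughly 2, which is the doubling per generation stated above.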

Buffer Insertion For Delay Reduction

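Analysis of Simple RC Circuit

[Figure: input source v_T(t) drives a series resistor R carrying current i(t) into a capacitor C; v(t) is the capacitor voltage]

$R\,i(t) + v(t) = v_T(t)$

$i(t) = \frac{d(Cv(t))}{dt} = C\frac{dv(t)}{dt}$

$\Rightarrow RC\frac{dv(t)}{dt} + v(t) = v_T(t)$

v(t): state variable
v_T(t): input waveform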

Analysis of Simple RC Circuit


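Step-input response:

[Figure: input step v0u(t) and output v0(1 - e^{-t/RC})u(t) rising toward v0]

$RC\frac{dv(t)}{dt} + v(t) = v_0 u(t)$

$v(t) = Ke^{-t/RC} + v_0 u(t)$

match initial state:

$v(0) = 0 \Rightarrow K + v_0 u(t) = 0 \Rightarrow K = -v_0$

output response for step-input:

$v(t) = v_0(1 - e^{-t/RC})u(t)$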

Delays of Simple RC Circuit


$v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$

$v(t) = 0.5v_0 \Rightarrow t = 0.69RC$
i.e., delay = 0.69RC (50% delay)

$v(t) = 0.1v_0 \Rightarrow t = 0.1RC$
$v(t) = 0.9v_0 \Rightarrow t = 2.3RC$
i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)

Commonly used metric: $T_D = RC$ (= Elmore delay)
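
The threshold-crossing times above follow from inverting the step response: t = -RC ln(1 - fraction). A short sketch, with RC normalized to 1 and a hypothetical helper name:

```python
import math

def time_to_fraction(rc, fraction):
    """Time at which v(t) = v0 * (1 - exp(-t/RC)) reaches fraction * v0."""
    return -rc * math.log(1.0 - fraction)

rc = 1.0                          # normalized; results are multiples of RC
t50 = time_to_fraction(rc, 0.5)   # 50% delay, about 0.69 RC
t10 = time_to_fraction(rc, 0.1)   # about 0.1 RC
t90 = time_to_fraction(rc, 0.9)   # about 2.3 RC
rise = t90 - t10                  # 10%-90% rise time, about 2.2 RC
print(round(t50, 2), round(t10, 2), round(t90, 2), round(rise, 2))
```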

Elmore Delay

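Driver is modeled as R
Driver intrinsic gate delay t(B)
Delay = $\sum_{\text{all } R_i} \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
Elmore delay at n2 = R(B)(C1+C2) + R(w)C2
Elmore delay at n1 = R(B)(C1+C2)

[Figure: driver B with resistance R(B) drives node n1 (capacitance C1); a wire of resistance R(w) connects n1 to node n2 (capacitance C2)]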
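
The double sum can be evaluated on an RC tree by accumulating, along the path from the driver, each resistance times the capacitance downstream of it. The sketch below is one possible formulation; the tree encoding and the numeric values are assumptions for illustration, and the intrinsic delay t(B) is left out.

```python
def elmore_delays(tree, root):
    """tree maps node -> (parent, resistance of edge from parent, node capacitance)."""
    children = {}
    for node, (parent, _, _) in tree.items():
        children.setdefault(parent, []).append(node)

    def downstream_cap(node):
        # Node capacitance plus everything hanging below it
        return tree[node][2] + sum(downstream_cap(ch) for ch in children.get(node, []))

    delays = {}

    def visit(node, delay_so_far):
        delay = delay_so_far + tree[node][1] * downstream_cap(node)
        delays[node] = delay
        for ch in children.get(node, []):
            visit(ch, delay)

    for top in children.get(root, []):
        visit(top, 0.0)
    return delays

# Two-node circuit from the slide with illustrative values:
# R(B) = 2, C1 = 3, R(w) = 1, C2 = 4
tree = {"n1": ("B", 2.0, 3.0), "n2": ("n1", 1.0, 4.0)}
delays = elmore_delays(tree, "B")
print(delays)  # n1: R(B)*(C1+C2) = 14, n2: 14 + R(w)*C2 = 18
```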

Elmore Delay
For uniform wire of length x driving load capacitance C:
unit wire capacitance c
unit wire resistance r

delay = $\frac{(xr)(xc)}{2} + (xr)C$

No matter how the wire is lumped, the Elmore delay is the same
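
The lumping claim can be checked by splitting the wire into n equal pi-segments (half of each segment's capacitance at either end) and summing the Elmore terms. This is a sketch under that pi-model assumption; the parameter values are illustrative.

```python
import math

def lumped_wire_delay(r, c, x, load, n_segments):
    """Elmore delay of a uniform wire modeled as n equal pi-segments plus a load."""
    seg_r = r * x / n_segments
    seg_c = c * x / n_segments
    delay = 0.0
    for i in range(1, n_segments + 1):
        # Downstream of segment i: its own far-end half, all later
        # segments in full, and the load capacitance.
        downstream = seg_c / 2 + (n_segments - i) * seg_c + load
        delay += seg_r * downstream
    return delay

r, c, x, load = 0.5, 0.2, 10.0, 3.0            # illustrative values
closed_form = (x * r) * (x * c) / 2 + (x * r) * load
for n in (1, 2, 5, 50):
    print(n, lumped_wire_delay(r, c, x, load, n), closed_form)
```

Every segment count gives the same value as the closed form, up to floating-point rounding.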

Delay for Buffer


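[Figure: buffer b between nodes u and v; the buffer output at v drives load capacitance C, and node u sees the buffer input capacitance C(b)]

$delay(u, v) = t(b) + R(b)\,C$

$C(u) = C(b)$

C(b): input capacitance
R(b): driver resistance
t(b): intrinsic buffer delay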

Buffers Reduce Wire Delay


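[Figure: a wire of length x, driven through resistance R into load C, is split by a buffer into two halves of length x/2; each half is modeled with resistance rx/2 and capacitance cx/4 at each end]

$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$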
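
Setting the difference to zero gives the wire length beyond which a mid-point buffer helps: x = 2 sqrt((RC + tb)/(rc)). A small sketch with unit parameter values (assumptions for illustration), where the buffer is taken to have the same R and C as the driver and load, as in the expressions above:

```python
import math

def t_unbuffered(R, C, r, c, x):
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buffered(R, C, r, c, x, tb):
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb

def break_even_length(R, C, r, c, tb):
    # Solve R*C + tb - r*c*x^2/4 = 0 for x
    return 2 * math.sqrt((R * C + tb) / (r * c))

R = C = r = c = tb = 1.0
x_star = break_even_length(R, C, r, c, tb)     # 2*sqrt(2), about 2.83
short_gain = t_buffered(R, C, r, c, 2.0, tb) - t_unbuffered(R, C, r, c, 2.0)
long_gain = t_buffered(R, C, r, c, 4.0, tb) - t_unbuffered(R, C, r, c, 4.0)
print(x_star, short_gain, long_gain)  # buffer hurts at x = 2 (+1), helps at x = 4 (-2)
```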

Combinational Logic Delay


[Figure: primary input, register, combinational logic, register, primary output; both registers are driven by the clock]

Combinational logic delay <= clock period

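Buffered global interconnects: Intuition

Unbuffered wire of length l:

Interconnect delay = $rcl^2$

With buffers splitting the wire into segments $l_1, l_2, l_3, \ldots, l_n$:

Now, interconnect delay = $\sum_i rc\,l_i^2 < rcl^2$ (where $l = \sum_j l_j$)

since $\sum_j l_j^2 < (\sum_j l_j)^2$
(Of course, account for buffer delay also)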

Optimal inter-buffer length

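First order (lumped parasitic, Elmore delay) analysis

[Figure: a wire of total length L divided by buffers into segments of length l]

Rd: on resistance of inverter
Cg: gate input capacitance
r, c: resistance, capacitance per micron

Assume N identical buffers with equal inter-buffer length l (L = Nl)

$T = N\left[R_d(C_g + cl) + rl(C_g + cl/2)\right]$

$= L\left[\frac{rcl}{2} + rC_g + R_d c + \frac{R_d C_g}{l}\right]$

For minimum delay,

$\frac{dT}{dl} = 0 \Rightarrow \frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2} = 0$

$l_{opt} = \sqrt{\frac{2R_d C_g}{rc}}$

Optimal interconnect delay

Substituting $l_{opt}$ back into the interconnect delay expression:

$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + rC_g + R_d c + \frac{R_d C_g}{l_{opt}}\right]$

$= L\left[\frac{rc}{2}\sqrt{\frac{2R_d C_g}{rc}} + rC_g + R_d c + R_d C_g\sqrt{\frac{rc}{2R_d C_g}}\right]$

$T_{opt} = L\left[\sqrt{2R_d C_g rc} + rC_g + R_d c\right]$

Delay grows linearly with L (instead of quadratically)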
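
The closed forms for the optimal inter-buffer length and delay can be cross-checked against the delay expression directly. A minimal sketch; the parameter values are illustrative assumptions, and N = L/l is treated as a real number, as in the first-order analysis.

```python
import math

def total_delay(L, l, Rd, Cg, r, c):
    """T = N * [Rd*(Cg + c*l) + r*l*(Cg + c*l/2)] with N = L / l."""
    n = L / l
    return n * (Rd * (Cg + c * l) + r * l * (Cg + c * l / 2))

def optimal_length(Rd, Cg, r, c):
    return math.sqrt(2 * Rd * Cg / (r * c))

def optimal_delay(L, Rd, Cg, r, c):
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)

L, Rd, Cg, r, c = 100.0, 2.0, 1.0, 1.0, 1.0    # illustrative values
l_opt = optimal_length(Rd, Cg, r, c)           # 2.0
t_opt = optimal_delay(L, Rd, Cg, r, c)         # 500.0
print(l_opt, t_opt, total_delay(L, l_opt, Rd, Cg, r, c))
print(total_delay(L, 1.0, Rd, Cg, r, c), total_delay(L, 4.0, Rd, Cg, r, c))  # both 550.0
```

Spacing the buffers closer or farther apart than l_opt gives a larger delay, and doubling L doubles t_opt.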

Total buffer count


[Chart: % of cells used to buffer nets (series clk-buf, buf, tot-buf; y-axis 0 to 80) at the 90nm, 65nm, 45nm, and 32nm nodes]

Ever-increasing fractions of total cell count will be buffers
70% in 32nm

ITRS projections
[Chart: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm) for four curves: gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters]

Source: ITRS, 2003

Buffers Improve Slack


RAT = Required Arrival Time
Slack = RAT - Delay

Without buffer:
Sink 1: RAT = 300, Delay = 350, Slack = -50
Sink 2: RAT = 700, Delay = 600, Slack = 100
slackmin = -50

With buffer (decouple capacitive load from critical path):
Sink 1: RAT = 300, Delay = 250, Slack = 50
Sink 2: RAT = 700, Delay = 400, Slack = 300
slackmin = 50

Timing Driven Buffering


Problem Formulation
Given
A Steiner tree
RAT at each sink
A buffer type
RC parameters
Candidate buffer locations

Find buffer insertion solution such that the slack at the driver is maximized

Candidate Buffering Solutions

Candidate Solution Characteristics


Each candidate solution is associated with
vi: a node
ci: downstream capacitance
qi: RAT

[Figure: if vi is a sink, ci is the sink capacitance; otherwise v is an internal node]

Van Ginneken's Algorithm

Candidate solutions are propagated toward the source
Dynamic Programming

Solution Propagation: Add Wire

(v1, c1, q1) -> (v2, c2, q2), across a wire of length x

$c_2 = c_1 + cx$
$q_2 = q_1 - rcx^2/2 - rxc_1$
r: wire resistance per unit length
c: wire capacitance per unit length

Solution Propagation: Insert Buffer

(v1, c1, q1) -> (v1, c1b, q1b)

$c_{1b} = C_b$
$q_{1b} = q_1 - R_b c_1 - t_b$
Cb: buffer input capacitance
Rb: buffer output resistance
tb: buffer intrinsic delay

Solution Propagation: Merge


(v, cl , ql)

(v, cr , qr)

cmerge = cl + cr
qmerge = min(ql , qr)

Solution Propagation: Add Driver

(v0, c0, q0) -> (v0, c0d, q0d)

$q_{0d} = q_0 - R_d c_0 = slack_{min}$

Rd: driver resistance
Pick solution with max slackmin

Example of Solution Propagation


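r = 1, c = 1, Rb = 1, Cb = 1, tb = 1, Rd = 1; each wire segment has length 2

Sink: (v1, 1, 20)
Add wire: (v1, 1, 20) -> (v2, 3, 16)
Insert buffer: (v2, 3, 16) -> (v2, 1, 12)
Add wire: (v2, 3, 16) -> (v3, 5, 8) and (v2, 1, 12) -> (v3, 3, 8)
Add driver: (v3, 5, 8) -> slack = 3 (no buffer); (v3, 3, 8) -> slack = 5 (buffer at v2)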
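
The four propagation operations can be written as small functions on (c, q) pairs and used to replay this example. This is a sketch only: the function names are hypothetical, candidates are reduced to (c, q) tuples without the node label, and the wire length of 2 per segment is taken from the example figure.

```python
def add_wire(cand, r, c, x):
    cap, q = cand
    return (cap + c * x, q - r * c * x**2 / 2 - r * x * cap)

def insert_buffer(cand, Rb, Cb, tb):
    cap, q = cand
    return (Cb, q - Rb * cap - tb)

def merge(left, right):
    return (left[0] + right[0], min(left[1], right[1]))

def add_driver(cand, Rd):
    cap, q = cand
    return q - Rd * cap          # slack at the driver

r = c = 1
Rb = Cb = tb = 1
Rd = 1

sink = (1, 20)                                    # (v1, 1, 20)
at_v2 = add_wire(sink, r, c, 2)                   # (3, 16)
at_v2_buf = insert_buffer(at_v2, Rb, Cb, tb)      # (1, 12)
at_v3 = add_wire(at_v2, r, c, 2)                  # (5, 8)
at_v3_buf = add_wire(at_v2_buf, r, c, 2)          # (3, 8)
slack_unbuf = add_driver(at_v3, Rd)               # 3
slack_buf = add_driver(at_v3_buf, Rd)             # 5
print(slack_unbuf, slack_buf)
```

The example net has no branch point, so `merge` is not exercised here; it is included only to complete the set of operations.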

Example of Merging

[Figure: left candidates and right candidates are combined into merged candidates]

Solution Pruning
Two candidate solutions
(v, c1, q1)
(v, c2, q2)

Solution 1 is inferior if
c1 > c2 : larger load
and q1 < q2 : tighter timing

Pruning When Insert Buffer

They have the same load cap Cb, only the one with max q is kept

Generating Candidates
[Figure: candidate generation, steps (1) to (3)]

From Dr. Charles Alpert

Pruning Candidates
[Figure: candidates (a) and (b) from step (3), followed by step (4)]

Both (a) and (b) look the same to the source.
Throw out the one with the worst slack

Candidate Example Continued


[Figure: candidate generation, steps (4) and (5)]

Candidate Example Continued


[Figure: step (5) after pruning]

At driver, compute which candidate maximizes slack. Result is optimal.

Merging Branches

[Figure: left candidates and right candidates at a branch point]

Pruning Merged Branches


[Figure: merged branches with the critical branch marked, shown with pruning]

Van Ginneken Example


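Candidates are written (C, q): load capacitance and required arrival time. A wire adds its C to the load; a buffer replaces the load with its input capacitance C; both subtract their delay d from q. Steps run from the sink toward the source.

Sink: (20, 400)
Wire (C=10, d=150): (20, 400) -> (30, 250)
Buffer (C=5, d=30): (30, 250) -> (5, 220)
Wire (C=15): (30, 250) -> (45, 50) with d=200; (5, 220) -> (20, 100) with d=120
Buffer (C=5): (45, 50) -> (5, 0) with d=50; (20, 100) -> (5, 70) with d=30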

Van Ginneken Example Cont'd

Candidates at the current node: (45, 50), (5, 0), (20, 100), (5, 70)

(5,0) is inferior to (5,70). (45,50) is inferior to (20,100)

Wire C=10: (20, 100) -> (30, 10); (5, 70) -> (15, -10)

Pick solution with largest slack, follow arrows to get solution

Basic Data Structure


(c1, q1) -> (c2, q2) -> (c3, q3)
Moving along the list: worse load cap, better timing

Sorted list such that c1 < c2 < c3
If there are no inferior candidates, q1 < q2 < q3

Prune Solution List


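[Figure: decision tree over candidates (c1, q1), (c2, q2), (c3, q3), (c4, q4) listed in order of increasing c. Each node asks whether the q of the last kept candidate is less than the q of the next one (q1 < q2?, then q2 < q3? or q1 < q3?, and so on). On yes (Y) the next candidate is kept; otherwise it is pruned (Prune 2, Prune 3, Prune 4).]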
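
A single pass over the list sorted by increasing c is enough to drop inferior candidates, because each new candidate only has to beat the q of the last one kept. The sketch below is one way to write it; the function name is hypothetical, it sorts first so it accepts unsorted input, and it also drops ties (equal c with lower q, or equal q with higher c), which is slightly stricter than the inferiority rule stated earlier.

```python
def prune(candidates):
    """Remove inferior (c, q) candidates; result is sorted by increasing c and q."""
    # For equal c, put the better (larger) q first so the worse one is dropped.
    ordered = sorted(candidates, key=lambda cq: (cq[0], -cq[1]))
    kept = []
    for cap, q in ordered:
        # All kept candidates have cap <= this one, so this candidate
        # survives only if its q is strictly better than the last kept q.
        if not kept or q > kept[-1][1]:
            kept.append((cap, q))
    return kept

# Candidates from the Van Ginneken example
pruned = prune([(45, 50), (5, 0), (20, 100), (5, 70)])
print(pruned)  # [(5, 70), (20, 100)]
```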

Pruning In Merging
Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)
Right candidates: (cr1, qr1), (cr2, qr2)

Suppose ql1 < ql2 < qr1 < ql3 < qr2

Merged candidates, generated in order:
(cl1+cr1, ql1)
(cl2+cr1, ql2)
(cl3+cr1, qr1)
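
When both lists are already pruned and sorted, the merge can walk them with two pointers, always advancing the side whose q limits the merged q. A hedged sketch with made-up numbers that respect the ordering ql1 < ql2 < qr1 < ql3 < qr2; the function name is hypothetical.

```python
def merge_lists(left, right):
    """Merge two candidate lists, each sorted by increasing c and increasing q."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q; pairing it with a larger-c
        # candidate from the other side could only add load.
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged

left = [(1, 10), (2, 20), (4, 40)]    # (cl1, ql1), (cl2, ql2), (cl3, ql3)
right = [(3, 30), (6, 50)]            # (cr1, qr1), (cr2, qr2)
merged = merge_lists(left, right)
print(merged)  # [(4, 10), (5, 20), (7, 30), (10, 40)]
```

The first three results correspond to (cl1+cr1, ql1), (cl2+cr1, ql2), and (cl3+cr1, qr1). The output has at most len(left) + len(right) - 1 entries, which is why merging is additive rather than multiplicative.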

Van Ginneken Complexity


Generate candidates from sinks to source
Quadratic runtime
Adding a wire does not change #candidates
Adding a buffer adds only one new candidate
Merging branches additive, not multiplicative
Linear time solution list pruning

Optimal for Elmore delay model

Multiple Buffer Types


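r = 1, c = 1, Rd = 1, wire length 2
Rb1 = 1, Cb1 = 1, tb1 = 1
Rb2 = 0.5, Cb2 = 2, tb2 = 0.5

Sink: (v1, 1, 20)
Add wire: (v1, 1, 20) -> (v2, 3, 16)
Insert buffer type 1: (v2, 3, 16) -> (v2, 1, 12)
Insert buffer type 2: (v2, 3, 16) -> (v2, 2, 14)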
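
With a buffer library, the insert-buffer step is applied once per buffer type, so each type contributes one new candidate at the node. A minimal sketch replaying the numbers above; the dictionary layout and names are assumptions for illustration.

```python
def insert_buffer(cand, Rb, Cb, tb):
    cap, q = cand
    return (Cb, q - Rb * cap - tb)

# Buffer library: name -> (Rb, Cb, tb)
buffer_types = {"b1": (1.0, 1.0, 1.0), "b2": (0.5, 2.0, 0.5)}

unbuffered = (3, 16)                               # candidate at v2 after adding the wire
candidates = [unbuffered] + [
    insert_buffer(unbuffered, *params) for params in buffer_types.values()
]
print(candidates)  # [(3, 16), (1.0, 12.0), (2.0, 14.0)]
```

None of the three candidates is inferior to another here (c and q increase together), so all of them are kept for further propagation.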
