IR-drop Reduction Through Combinational Circuit Partitioning
Hai Lin, Yu Wang, Rong Luo, Huazhong Yang, and Hui Wang
EE Department, Tsinghua University, Haidian District
Beijing, 100084, P.R. China
{linhai99, wangyuu99}@mails.tsinghua.edu.cn,
{luorong, yanghz, wangh}@tsinghua.edu.cn
Introduction
As technology moves into the submicron region, single-chip integration of
more complex, higher-speed, and lower-supply-voltage systems has made on-chip
signal integrity (SI) a tough problem. Among all the sources of SI problems,
the dynamic voltage drop, caused mainly by Ldi/dt noise and IR-drop, has
drawn much attention in recent years.
As the supply voltage keeps scaling down, ignoring the dynamic voltage
drop through supply networks will cause run-time errors on real chips. For
example, transistors may fail to turn on under an unexpected voltage drop, and
timing constraints may be violated because standard gates become slower at a
lower supply voltage. Several publications have addressed the reduction of
voltage variation on the power/ground (P/G) network. Early publications focus
directly on optimizing the P/G network of the circuit, using strategies such as
supply wire sizing [1] and P/G network decoupling capacitance (DC) insertion
[2], [3]. However, as the technology feature size
This work was sponsored in part by NSFC under grants #90207001 and #90307016.
J. Vounckx, N. Azemard, and P. Maurine (Eds.): PATMOS 2006, LNCS 4148, pp. 370–381, 2006.
© Springer-Verlag Berlin Heidelberg 2006
scales down, such efforts become insufficient and suffer from the drawback of
large on-chip resource occupation.
In recent years, a few researchers have focused on optimizing the logic
blocks of the circuit [4], [5]. In [5], a synchronous digital circuit is first
divided into clock regions, and these regions are then assigned clocks with
different phases. The authors thereby spread the originally simultaneous
switching activities along the time axis to reshape the switching current
waveform and reduce the current peak.
However, algorithms that use the clock as the controlling signal to distribute
the switching activity have an essential defect. As mentioned in [4], they lack
the ability to control combinational circuits. Even in sequential circuits, each
combinational part triggered by flip-flops works on its own within one clock
cycle and draws the corresponding current from the power network. When these
combinational parts are large enough, the current peak created by a single
combinational part is considerable. This problem cannot be solved by
algorithms based on clock skew assignment.
In this paper, we present an IR-drop reduction method for combinational
circuits. The paper makes three main contributions:
1. We derive a formal problem definition of IR-drop reduction in combinational
circuits and propose a novel IR-drop reduction methodology for combinational
circuits, Switching Current Redistribution (SCR), based on circuit partitioning.
2. We give a combinational circuit decomposition algorithm with better
utilization of circuit slack to support the SCR method. The combinational block
is partitioned into sub-graphs according to a new criterion, called slack
sub-graph partitioning, so that the switching times of different parts can be
rearranged. An STA tool is used to ensure that the original timing constraints
and critical paths are kept, so that both the exact logic function and the
highest working frequency are preserved.
3. We propose a simple and effective additional delay assignment strategy, and
then compare methods that modify the decomposed circuits to redistribute the
switching current while maintaining the logic function and the performance
constraints of the circuit.
The paper is organized as follows. The combinational circuit IR-drop reduction
problem is defined in Section 2. Our novel circuit decomposition method is
presented in Section 3. Section 4 presents the additional delay assignment and
the circuit modification strategies that realize the additional delay. The
implementation and experimental results are shown and analyzed in Section 5.
Section 6 concludes the paper.
2
2.1
Our research focuses on gate level combinational circuits. At the gate level, a
combinational circuit can be represented by a directed acyclic graph (DAG),
G = (V, E). A vertex v ∈ V represents a CMOS transistor network that realizes
a single-output logic function (a logic gate), while an edge (i, j) ∈ E,
i, j ∈ V, represents a connection from vertex i to vertex j.
We define three attributes for every vertex v ∈ V: the arrival time t_a(v),
the required time t_req(v), and the slack time t_slk(v). The arrival time
t_a(v) is the worst-case signal propagation time from the primary inputs to
vertex v, and t_req(v) is the latest time at which the signal needs to arrive
at vertex v. We define them as:
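$$t_a(v) = \max_{u \in fanin(v)} \{ t_a(u) + d(u) \}, \quad \text{if } fanin(v) \neq \emptyset \tag{1}$$
$$t_{req}(v) = \min_{w \in fanout(v)} \{ t_{req}(w) \} - d(v), \quad \text{if } fanout(v) \neq \emptyset \tag{2}$$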
The signal propagation delay d(v) of a vertex can be represented as:
$$d(v) = \frac{K \, C_L \, V_{DD}}{(V_{DD} - V_{TH})^{\alpha}} \tag{3}$$
where C_L and V_TH are the output load capacitance and the transistor threshold
voltage of the gate, respectively, and K and α are technology-dependent constants.
The slack time of a gate v is defined as the difference between its required
time and its arrival time:
$$t_{slk}(v) = t_{req}(v) - t_a(v) \tag{4}$$
The slack time of a gate v represents the timing laxity of the graph at this point.
The performance will not be harmed as long as a circuit modification maintains
t_slk(v) ≥ 0. We call this the slack time limitation.
For a given working frequency, the critical path of the circuit consists of
the set of gates that have the minimum slack time. At the highest working
frequency, this minimum slack is zero. Our analysis focuses on the highest
working frequency so that the original best performance of the circuit is
preserved.
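The timing attributes defined in this section can be computed with one forward and one backward pass over the DAG. The following Python sketch is illustrative only. The function name `static_timing`, the dictionary-based graph encoding, and the boundary conditions (arrival time 0 at primary inputs, required time derived from the period at primary outputs) are assumptions, and the recurrences are the usual STA ones with the arrival time measured at the gate input.

```python
from collections import defaultdict


def static_timing(delay, edges, period=None):
    """Return (arrival, required, slack) dicts for a gate-level DAG.

    delay  : dict mapping each gate v to its propagation delay d(v)
    edges  : iterable of (i, j) connections from gate i to gate j
    period : timing budget; None selects the tightest one (longest path),
             which corresponds to the highest working frequency
    """
    fanin, fanout = defaultdict(list), defaultdict(list)
    for i, j in edges:
        fanout[i].append(j)
        fanin[j].append(i)

    # Topological order (Kahn's algorithm); the list grows while we scan it.
    indeg = {v: len(fanin[v]) for v in delay}
    order = [v for v in delay if indeg[v] == 0]
    for v in order:
        for w in fanout[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)

    # Forward pass: worst-case arrival time at the input of each gate.
    arrival = {}
    for v in order:
        arrival[v] = max((arrival[u] + delay[u] for u in fanin[v]), default=0.0)

    if period is None:
        period = max(arrival[v] + delay[v] for v in delay)

    # Backward pass: latest allowed arrival time at the input of each gate.
    required = {}
    for v in reversed(order):
        latest_output = min((required[w] for w in fanout[v]), default=period)
        required[v] = latest_output - delay[v]

    slack = {v: required[v] - arrival[v] for v in delay}
    return arrival, required, slack


# Hypothetical five-gate example: path a-c-e is critical, path b-d-e has slack.
delays = {"a": 1, "b": 1, "c": 2, "d": 1, "e": 1}
wires = [("a", "c"), ("c", "e"), ("b", "d"), ("d", "e")]
arrival, required, slack = static_timing(delays, wires)
```

With `period=None` the gates on the longest path get zero slack, which matches the highest-working-frequency setting used in the analysis.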
2.2
Problem Definition
$$V(t) = R_{P/G} \cdot I(V, t) = R_{P/G} \cdot \sum_{v \in V} I_v(t) \tag{5}$$
where R_P/G is the P/G network resistance; I(V, t) is the current of the
combinational circuit; and I_v is the switching current of an individual gate
v ∈ V, which is determined by its input state input_v, its input signal arrival
time t_a(v), and its propagation delay d(v). According to equation (5), we can
modify I_v through t_a(v) and d(v) in order to minimize the current peak of the
combinational circuit. However,
if we adjust every gate to get the optimal result, the IR-drop reduction problem
will be unacceptably difficult.
As a result, in our method the combinational circuit G = (V, E) is partitioned
into independent blocks G_sub = {G_1, G_2, ..., G_n} in order to simplify the
IR-drop problem. The IR-drop problem can then be given an alternative definition:
$$\min_{G_k, D_k} \; \max \{ V(t, G_k, D_k) \}, \quad 1 \le k \le n \tag{7}$$
As the problem definition shows, the IR-drop reduction problem cannot be solved
easily. Based on circuit partitioning, we present a method that addresses it by
redistributing the switching current of the combinational blocks.
[Fig. 1: ∂I/∂t versus time for the original circuit and the modified circuit]
As shown in Fig. 1, if the combinational circuit is partitioned into two
independent blocks without signal dependence, their switching currents can be
adjusted independently, and separating the switching times of the two blocks can
considerably reduce the current peak. Moreover, as mentioned above, Ldi/dt noise
is becoming significant in the P/G network. Smoothing the current waveforms in
this way may also help reduce such noise when the inductance of the P/G network
is considered (see Fig. 1). We call this Switching Current Redistribution. To
achieve this specific partitioning goal, we present a new algorithm that
incorporates static timing analysis (STA) information into the partitioning and
maintains the critical paths after partitioning, which preserves the circuit
performance. A simple and effective additional delay time assignment method is
then proposed to redistribute the switching currents of the different blocks.
5 V_SLK = V − V_CRI;
6 While (V_SLK not empty)
  Begin while:
    v_i ∈ V_SLK; // randomly choose a vertex v_i
    Get all the vertices connected with v_i in V_SLK, and put them in set V_SLK(i);
    Get all the edges generated by vertices in V_SLK(i), and put them in set E_SLK(i);
    G_SLK(i) = (V_SLK(i), E_SLK(i)); // construct a slack sub-graph
    V_SLK = V_SLK − V_SLK(i);
  End while;
7 Return G_sub = {G_CRI, {G_SLK(i)}}; // return the independent blocks
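The loop in steps 5 to 7 amounts to extracting the connected components of the non-critical vertices. The Python sketch below illustrates that reading under stated assumptions: the earlier steps of the procedure are not reproduced here, so the critical block is taken to be the set of zero-slack gates (as in Section 2.1), and the name `slack_subgraph_partition` and the `eps` tolerance are hypothetical.

```python
from collections import defaultdict


def slack_subgraph_partition(vertices, edges, slack, eps=1e-9):
    """Split a DAG into a critical block and independent slack sub-graphs.

    Assumption: a gate is critical when its slack is (numerically) zero.
    Returns ((V_CRI, E_CRI), [(V_SLK(i), E_SLK(i)), ...]).
    """
    critical = {v for v in vertices if slack[v] <= eps}
    remaining = set(vertices) - critical          # step 5: V_SLK = V - V_CRI

    # Undirected adjacency restricted to the slack vertices.
    neighbors = defaultdict(set)
    for i, j in edges:
        if i in remaining and j in remaining:
            neighbors[i].add(j)
            neighbors[j].add(i)

    blocks = []
    while remaining:                              # step 6
        seed = remaining.pop()                    # arbitrary vertex v_i
        block, stack = {seed}, [seed]
        while stack:                              # collect everything connected to v_i
            v = stack.pop()
            for w in neighbors[v]:
                if w in remaining:
                    remaining.discard(w)
                    block.add(w)
                    stack.append(w)
        block_edges = [(i, j) for i, j in edges if i in block and j in block]
        blocks.append((block, block_edges))

    critical_edges = [(i, j) for i, j in edges if i in critical and j in critical]
    return (critical, critical_edges), blocks     # step 7


# Hypothetical example: gates a, c, e are critical; b and d form one slack block.
wires = [("a", "c"), ("c", "e"), ("b", "d"), ("d", "e")]
slacks = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}
(crit, crit_edges), slack_blocks = slack_subgraph_partition("abcde", wires, slacks)
```

Edges that cross from a slack block into the critical block, such as (d, e) above, belong to neither sub-graph; they are the points where the slack time limitation has to be respected when delay is added.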
4.1
(8)
The total current of one slack block is the sum of the currents of all the
gates within it. Since the current waveform is stored as discrete values at
each time interval, its peak can be calculated easily.
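As a small illustration of this bookkeeping, the sketch below sums equally sampled gate waveforms into a block waveform and reads off its peak. The function names and the sample values are hypothetical, and a common time step for all waveforms is assumed.

```python
def block_current(gate_waveforms):
    """Sum equally sampled gate current waveforms into one block waveform."""
    length = max(len(w) for w in gate_waveforms)
    total = [0.0] * length
    for waveform in gate_waveforms:
        for t, current in enumerate(waveform):
            total[t] += current
    return total


def peak_current(waveform):
    """Peak of a sampled current waveform."""
    return max(waveform)


# Two hypothetical gates whose triangular current pulses partly overlap.
gates = [[0, 1, 2, 1, 0], [0, 0, 1, 2, 1]]
block = block_current(gates)
block_peak = peak_current(block)
```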
We use a simple but practical additional delay time assignment strategy
to achieve a considerable reduction in the switching current peak of the
combinational circuit.
We propose an experiment-based assignment of the artificial additional delay
of the input gates of every targeted sub-graph. The additional delay of each
input vertex is first set to its slack time, which forms the initial solution
SOL_DLY. This spreads the switching current over the entire circuit switching
period and reduces the overlap between the switching currents of the critical
block and the targeted slack blocks. A search in a small neighborhood of this
assignment is then made for a better solution, based on the evaluation of
I_peak. Experimental results show that little change to the initial solution is
needed, and that this strategy redistributes the switching current of the blocks
appropriately, uses the total circuit switching period more evenly, and reduces
the peak current of the whole circuit to a considerably lower value.
The circuit performance is maintained because the critical block is not changed
and the slack blocks are adjusted within the timing constraints. In the slack
time assignment, the slack information is extracted by our STA tool [7]. The
final delay time of each input gate is saved in a data file for the circuit
modification procedure.
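The assignment strategy of this section can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: block waveforms are assumed to be sampled on a common time grid, slack is expressed in whole time steps, and the neighborhood (up to `radius` steps below the slack, searched one block at a time) is an assumption, since the text only states that a small nearby region is searched. All function names are hypothetical.

```python
def shift(waveform, delay, length):
    """Delay a sampled waveform by `delay` time steps on a grid of `length` samples."""
    out = [0.0] * length
    for t, current in enumerate(waveform):
        if 0 <= t + delay < length:
            out[t + delay] = current
    return out


def total_peak(critical, blocks, delays, length):
    """Peak of the critical-block current plus all delayed slack-block currents."""
    total = list(critical) + [0.0] * (length - len(critical))
    for waveform, delay in zip(blocks, delays):
        for t, current in enumerate(shift(waveform, delay, length)):
            total[t] += current
    return max(total)


def assign_delays(critical, blocks, slacks, radius=2):
    """Start from delay = slack, then try slightly smaller delays per block."""
    length = max([len(critical)] + [len(w) + s for w, s in zip(blocks, slacks)])
    delays = list(slacks)                       # initial solution SOL_DLY
    best = total_peak(critical, blocks, delays, length)
    for k, s in enumerate(slacks):
        # Never exceed the slack, so the slack time limitation still holds.
        for d in range(max(0, s - radius), s + 1):
            trial = delays[:k] + [d] + delays[k + 1:]
            p = total_peak(critical, blocks, trial, length)
            if p < best:
                best, delays = p, trial
    return delays, best


# Hypothetical waveforms: one critical block and one slack block with 5 steps of slack.
critical_block = [0, 1, 2, 3, 4, 3, 2, 1, 0, 0, 0, 0]
slack_block = [0, 1, 2, 3, 2, 1, 0]
peak_before = total_peak(critical_block, [slack_block], [0], 12)
delays, peak_after = assign_delays(critical_block, [slack_block], [5])
```

In this toy case the initial solution is already the best one in the searched neighborhood, which is in line with the observation that little change to the initial solution is needed.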
4.2
[Figure: gate switching current versus time. Igi(t), with peak Ipeak, is the waveform for one possible switch time t3 in application; I'gi(t), with peak I'peak, spans the earliest possible switch time t1 to the latest possible switch time t2. Labels: delay, t2 − t1. Sampled example:]
I'gi(t) = 0 0 0 0 0 0 1 2 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2 1 0 0 0 0 0 0 0
Igi(t)  = 0 0 0 0 0 0 0 0 0 1 2 3 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As our experiments use the TSMC 0.18 µm standard cell library for
simulation, the three discrete V_TH values are set to V_THO, V_THO + V_THL, and
V_THO + V_THH, where V_THO is the original transistor threshold voltage
determined by the library. In our experiments we set V_THL = 0.2 V and
V_THH = 0.4 V. In reality, the actual discrete values are determined by process
limitations. First, the input gate is assigned the discrete V_TH value that is
just below the calculated value. Then, if the required additional delay exceeds
the maximum value that can be achieved by a single input gate, the overflow
delay is assigned to gates in the following stage of the slack block. In our
experiment, we
only allow two stages of the slack block to be modified (the input stage and the
stage following it) to reduce the modification complexity. The simulation results
of this circuit modification strategy are presented in Table 1 and compared with
the buffer insertion strategy.
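The selection rule described above can be sketched with the delay model of equation (3). Everything numeric in this example is an assumption made for illustration: the original threshold 0.45 V, the constants K = 1 and α = 1.3, and the normalized load capacitance are placeholders rather than library data, and the function names are hypothetical. Only the increments of 0.2 V and 0.4 V and the 1.67 V supply come from the text.

```python
def gate_delay(vth, c_load, vdd=1.67, k=1.0, alpha=1.3):
    """Alpha-power-law gate delay d = K * C_L * V_DD / (V_DD - V_TH)^alpha."""
    return k * c_load * vdd / (vdd - vth) ** alpha


def pick_vth(extra_delay, vth_options, c_load, **model):
    """Choose the discrete V_TH whose delay increase is just below `extra_delay`.

    vth_options[0] is the original threshold. Returns (vth, leftover_delay).
    """
    base = gate_delay(vth_options[0], c_load, **model)
    chosen, gained = vth_options[0], 0.0
    for vth in vth_options[1:]:
        increase = gate_delay(vth, c_load, **model) - base
        if gained < increase <= extra_delay:
            chosen, gained = vth, increase
    return chosen, extra_delay - gained


def assign_two_stages(extra_delay, vth_options, c_input, c_next, **model):
    """Give the input stage as much delay as it can take, overflow to the next stage."""
    vth_input, overflow = pick_vth(extra_delay, vth_options, c_input, **model)
    vth_next, leftover = pick_vth(overflow, vth_options, c_next, **model)
    return vth_input, vth_next, leftover


# Assumed thresholds: V_THO = 0.45 V (placeholder), plus the 0.2 V and 0.4 V increments.
options = [0.45, 0.65, 0.85]
vth_in, vth_next, leftover = assign_two_stages(1.3, options, c_input=1.0, c_next=1.0)
```

Because the chosen threshold is the one just below the calculated value, the achieved delay never exceeds the requested one, so the slack time limitation is not violated; any remaining `leftover` is simply unused slack.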
The buffer insertion strategy is a backup for the multi-threshold strategy and
reduces the fabrication process cost. Instead of changing d(v) of the input
gates, specialized buffers are inserted right after the original input gates,
which changes the arrival time of the other gates in the slack block. However,
this strategy has two major drawbacks: additional area occupation and more power
dissipation. We therefore consider it only when the multi-threshold strategy
cannot be used.
The implementation of our algorithm is illustrated in Fig. 4. Our gate-level
netlists are synthesized using Synopsys Design Compiler and a TSMC 0.18 µm
standard cell library. The DAG extraction and the customized circuit partitioning
procedure are implemented in C++ under a customized STA environment according to
the TSMC standard cell delay library. We implemented a small tool that
automatically generates the modified gate-level netlist, including the delay time
assignment and the two circuit modification strategies. Both the original and the
modified circuits are simulated using HSPICE with the TSMC 0.18 µm CMOS process
under a 1.67 V supply. The P/G network is modeled as an RC network.
As our algorithm focuses on redistributing the switching current of the logic
blocks, the architecture of the P/G network model does not have much influence on
the peak current reduction ability. We compared the simulation results obtained
with a simple model (a single R and C) and a complex model (multiple R and C
connected as a mesh) of the P/G network for several circuits. The detailed
waveform of the current changes slightly with the P/G network, but the reduction
rate remains approximately the same (see Fig. 5). As a result, to reduce the
simulation complexity, we simply model the P/G network as a 100 Ω resistance
connected between VDD and the logic block and a capacitance of 0.3 pF connected
in parallel with the logic block.
[Fig. 4: implementation flow. Blocks: gate-level netlist; customized STA tool; DAG extraction; TSMC delay library; customized circuit partition program; circuit modification strategy (multilevel threshold strategy, delay buffer insertion strategy); modified gate-level netlist; SPICE input converting program (Tsmc18.cdl, Tsmc18core.lib, tsmc18.lb); SPICE input file; HSPICE simulation; output result]
We apply the proposed method to the ISCAS85 benchmark circuits, and all the
circuits are simulated with a large number of random input vectors. The program
runs on a PC with a 2.6 GHz Pentium 4 processor and 512 MB of memory.
Fig. 5 shows the transient on-chip current waveform in one processing cycle of
the modified circuit compared with the original circuit for C1355, simulated with
both the simple and the complex P/G network. As expected, the current waveforms
of both the unmodified circuit (upper figures) and the modified circuit (lower
figures) with the complex P/G network (dotted line) differ from those with the
simple P/G network (solid line). However, the peak current reduction rate remains
approximately the same. Comparing the waveforms obtained with the same P/G model,
we find that current curves with a single peak in one processing cycle change
into curves with two or more lower peaks after circuit modification. Thus the
switching currents of the two major kinds of blocks (the slack blocks and the
critical block) are actually separated, and the peak current of the circuit is
significantly reduced.
Table 1 shows the current peak reduction results of both the multi-threshold and
the buffer insertion strategies. The reduction of the current peak varies with the
circuit structure, from 15% to 33% (23% on average) with the multi-threshold
strategy and from 12% to 32% (21% on average) with buffer insertion. Circuits
with more slack to be utilized obtain a better optimization result.
Table 1. Comparison of the multi-threshold and buffer insertion strategies

ISCAS85   Original      Multi-VTH                  Buffer insertion
Circuit   Average       Average      Ipeak         Average      Ipeak       Area Overload
          Ipeak (mA)    Ipeak (mA)   Reduction     Ipeak (mA)   Reduction   (Nbuf/Ntotalgates)
C432      2.45          2.08         15%           2.15         12%         6/160
C499      4.27          2.98         31%           3.01         29%         14/202
C880      3.05          2.28         25%           2.31         24%         38/383
C1355     4.21          3.01         29%           2.85         32%         22/545
C1908     3.96          3.07         22%           3.04         23%         31/880
C2670     3.68          2.95         20%           2.98         19%         49/1269
C3540     2.78          2.27         18%           2.33         16%         84/1669
C5315     4.94          3.29         33%           3.75         24%         170/2307
C6288     4.78          3.87         19%           3.98         17%         119/2416
C7552     5.26          4.43         16%           4.46         15%         201/3513
average                              23%                        21%
We note that the applicability of the algorithm is limited when the slack blocks
have too little slack to be utilized, which is rare for functional combinational
circuits. Even in that case, a circuit slowdown can be introduced to gain more
usable slack if reducing the switching current peak is the most urgent problem
for an application.
Although the peak current reduction is nearly the same, the average current
of the circuit shows that the buffer insertion strategy induces more on-chip
current, in addition to the drawback of on-chip area overhead due to the inserted
buffers. Other strategies, such as gate sizing or transistor stacking, can also
be considered in order to achieve the required delay while avoiding a large
additional current.
Conclusions
physical design is becoming non-negligible, either inducing a slight circuit
slowdown or maintaining a certain amount of the original slack, according to the
specific manufacturing technique, would be applicable to ensure tolerance to
process variation. Since our method incurs no performance loss and does not
require modifications to the P/G network or the clock tree, it can be combined
with other commonly used methods, such as P/G network DC insertion and clock
skew assignment in synchronous circuits, to achieve a further reduction of
on-chip IR-drop.
References
1. Sheldon X.-D. Tan, C.-J. Richard Shi, Jyh-Chwen Lee: Reliability-Constrained Area Optimization of VLSI Power/Ground Networks Via Sequence of Linear Programmings. IEEE Trans. on CAD, Vol. 22, No. 12, pp. 1678-1684, 2003.
2. Mondira Deb Pant, Pankaj Pant, Donald Scott Wills: On-Chip Decoupling Capacitor Optimization Using Architectural Level Prediction. IEEE Trans. on VLSI, Vol. 10, No. 3, pp. 772-775, 2002.
3. Howard H. Chen, J. Scott Neely, Michael F. Wang, Grieel Co: On-Chip Decoupling Capacitor Optimization for Noise and Leakage Reduction. Procs. of IEEE ISCAS, 2003.
4. P. Vuillod, L. Benini, A. Bogliolo, G. De Micheli: Clock-skew optimization for peak current reduction. Procs. of ISLPED 1996, Monterey, CA, USA, 1996.
5. Mustafa Badaroglu, Kris Tiri, Stéphane Donnay, Piet Wambacq, Ingrid Verbauwhede, Georges Gielen, Hugo De Man: Clock Tree Optimization in Synchronous CMOS Digital Circuits for Substrate Noise Reduction Using Folding of Supply Current Transients. Procs. of DAC 2002, June 10-14, New Orleans, Louisiana, 2002.
6. Amir H. Ajami, Kaustav Banerjee, Amit Mehrotra, Massoud Pedram: Analysis of IR-Drop Scaling with Implications for Deep Submicron P/G Network Designs. Procs. of ISQED 2003.
7. Yu Wang, Huazhong Yang, Hui Wang: Signal-path level Assignment for Dual-Vt Technique. Procs. of IEEE PRIME 2005, pp. 52-55.
8. George Karypis, Rajat Aggarwal, Vipin Kumar, Shashi Shekhar: Multilevel Hypergraph Partitioning: Applications in VLSI Domain. IEEE Trans. on VLSI, Vol. 7, No. 1, pp. 69-79, 1999.
9. Navaratnasothie Selvakkumaran, George Karypis: Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization. Procs. of ICCAD03, November 11-13, San Jose, California, USA, 2003.
10. Mohab Anis, Shawki Areibi, and Mohamed Elmasry: Design and Optimization of Multithreshold CMOS (MTCMOS) Circuits. IEEE Trans. on CAD, Vol. 22, No. 10, pp. 1324-1342, Oct 2003.