International Journal of Advanced Technology and Engineering Exploration, Vol 8(80)

ISSN (Print): 2394-5443 ISSN (Online): 2394-7454


Research Article
http://dx.doi.org/10.19101/IJATEE.2021.874162

PNR flow methodology for congestion optimization using different macro placement strategies of DDR memories
J. Fadnavis1* and Kariyappa B.S2
Student, Electronics and Communication Engineering Department, RV College of Engineering, Bangalore, India 1
Professor, Electronics and Communication Engineering Department, RV College of Engineering, Bangalore, India 2

Received: 29-May-2021; Revised: 15-July-2021; Accepted: 18-July-2021


©2021 J. Fadnavis and Kariyappa B.S. This is an open access article distributed under the Creative Commons Attribution (CC BY)
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
The demand for high-performance electronic gadgets has increased twofold in the last decade, pushing technology
manufacturers to shrink fabrication node sizes. The decreasing channel sizes along with an increase in gate count and cell
density pose numerous congestion issues during physical implementation of the chips, making design closure ever more
difficult. Double Data Rate (DDR) memories that access data on both edges of the clock cycle require extreme timing control
and must meet the strict timing requirements during Physical Design (PD). Floor-plan, being the first stage of back-end
PD implementation, is an important step to mitigate congestion and timing issues during the subsequent stages of the
implementation. On-chip macros, with connections to the standard cells and the Input/Output (IO) ports of the chip, need
to be strategically placed during the floor-plan of the design to enable congestion-free placement of standard cells and
signal routes. Previously, designers opted for the island macro placement strategy, wherein macros were grouped close together,
thereby leaving a uniform square region for standard cell placement. However, this method alone cannot be considered for
today's chip designs, which have denser macro pin connections to the chip IO ports, as in the Last Level Cache (LLC) block of a
DDR subsystem. In this paper, two new placement strategies, peripheral and donut, have been considered for the LLC
module. A congestion-optimized, floor-plan to Place and Route (PNR) flow methodology has been presented for each of
these placement strategies using Cadence Innovus Implementation System and Synopsys IC Compiler II. The Quality of
Results (QOR) for each strategy was then compared. The peripheral macro placement strategy is found to be the best among
the three, while the donut macro placement is the worst. A 16% improvement in the overall on-chip delay is seen in the
peripheral macro placement when compared to island macro placement. Furthermore, a 19.6% power reduction is observed
in the peripheral macro placement strategy as compared to island macro placement. The overall congestion for peripheral
macro placement is 0.32%, which is the least among the three strategies. Hence, the peripheral macro placement strategy
proves to be the best choice for macro placement, when considering floor-plan for the LLC module in a DDR subsystem.

Keywords
Double data rate, Physical design, Floor-plan, Macro placement, Island, Peripheral, Donut, Congestion.

*Author for correspondence

1.Introduction
The large-scale growth of the electronics industry in the past decade has led to small-sized gadgets flooding
the market. Initially, only Personal Computers (PC) employed multi-core processing; however, even small handheld
gadgets have now come to possess computational power similar to that of the PC and laptop. With mobile
manufacturers opting for multi-core processing and graphics-intensive application usage, memory storage and
access have become crucial to achieving high performance. In recent years, the memory subsystem has come to be
viewed as a bottleneck that drastically reduces performance. Hence, there has been a widespread shift towards
Double Data Rate (DDR) memory subsystems. The DDR acts as an interface between the controller and memory by
providing data access on both edges of the clock cycle, thereby doubling the memory access speed to accommodate
the growing needs of computationally intensive tasks.

The DDR subsystem has been increasingly employed even in the Internet of Things (IoT) sector. Reliability and
efficient resource management are major problems for the sensors and actuators that are used for real-time data
gathering and local preprocessing [1]. The size of wearable devices has shrunk drastically due to the
advancement of semiconductor technology, which has fuelled the
usage of wearable sensors and smart devices [2]. The memory requirements for such devices are stringent due to
real-time data access requirements, imposing strict timing constraints. The usage of large-scale access control
lists in IoT applications [3] furthers the need for high-speed data access solutions.

The DDR memory subsystem consists of the controller, physical interface, and Input/Output (I/O) drivers. One of
the important blocks within the DDR is the Last Level Cache (LLC). The LLC is a standalone memory inserted
between the external memory and functional blocks to provide another level of cache. The LLC is the last memory
level to be checked on-chip before moving to fetch data from external memory. The time to access data from an
off-chip memory is very high; hence the LLC, acting as a buffer cache, helps to reduce off-chip data fetches.
Since the DDR is at the interface of the chip and off-chip memory, extreme timing control is required to ensure
the correct functioning of DDR, requiring several hardware components and algorithms to facilitate this complex
design. Numerous architectural optimizations such as deep pipelines, branch prediction, and aggressive
reordering aim to provide high performance [4]. The substantial research carried out to improve the efficiency
of this subsystem has increased the gate level complexity and power consumption of this subsystem. A 33% power
consumption by the LLC and Dynamic Random-Access Memory (DRAM) alone is observed in the DDR subsystem [5]. The
high gate density and critical timing due to the physical interface with the off-chip memory need to be taken
care of at the PD implementation level.

Physical Design (PD) implementation is a back-end flow from the net-list to Graphic Data Stream (GDS) and is the
correlation step between design and chip manufacture. The PD flow ensures that the design created works on the
silicon chip. Numerous problems arise when the design is converted to one that can function properly on
silicon. The PD flow first involves proper planning and placement of pins and custom macros on the chip during
the floor-plan stage. Next, the placement of the logic on the chip is performed along with the introduction of
the clock and power distribution network. The design is then routed and checked for various parameters such as
power, area, and performance. The congestion and timing requirements need to be met during PD to facilitate the
correct functioning of the design. All blocks of the chip need to be implemented and tested according to this
PD implementation flow. As the complexity of Very Large-Scale Integration (VLSI) systems grows rapidly, physical
planning is becoming increasingly difficult [6].

Various floor-plan techniques have been explored in the past, such as a partition level floor-plan method to
understand the in-depth structure of the block to decide the floor-plan and obtain better timing [7].

Challenges such as large design sizes, increasing macro count, timing/power estimations, region shaping and pin
assignment, predefined placement locations, macro-orientations, and pin positions, simultaneous standard cell
and macro placement, congestion, and timing-driven placement are increasing for a floor-plan designer [8].

Macro placement is a crucial step to obtain congestion-free designs at the later stages of PD flow. The
placement of standard cells, which is done by the placement tool, ideally requires a uniform square region
on-chip to perform an optimum placement. To satisfy this requirement, designers initially employed the island
macro placement configuration for floor-plan designs, which groups all macros in one corner of the chip to
provide such a uniform region for standard cells. However, as the macro pin connections to IO ports of the chip
grow denser, this method is inefficient and often leads to more congested designs. Therefore, there is a need
to explore different macro placement strategies to avoid such congested designs. In this paper, two new macro
placement strategies, peripheral macro placement and donut macro placement, have been explored for the LLC
block, and a complete congestion-optimized Place and Route (PNR) flow for each of these has been implemented
using Cadence Innovus Implementation Systems and Synopsys IC Compiler II. The various inbuilt settings of these
powerful tools have been leveraged to optimize congestion and improve timing integrity throughout the PNR flow.
The Quality of Results (QOR) of the three macro placement strategies was then compared to arrive at the best
choice for macro placement for LLC modules in DDR subsystems.

2.Literature review
A detailed study of state-of-the-art architectures of DDR and LLC was carried out. Various developments in the
architecture of both blocks have been explored in the past decade to reduce the latency and power of each.
These developments help to appreciate the complexity involved during the PD process to ensure the proper
functioning of the blocks. Numerous floor-plan techniques have also been developed in the
past to improve congestion and timing of the blocks during back-end flow.

A Power Delivery Network (PDN) in a Re-Distribution Layer (RDL) for a DDR memory subsystem is presented in [9].
It is observed that decreasing the PDN loop inductance is critical for a robust high-speed PDN design. The loop
inductance depends on the wire width and length. Hence, the wire parameters used for the power distribution
network must be altered. A voltage ripple reduction between the Power/Ground (PG) rails is done by a simple PDN
model. The voltage ripple reduction is achieved by opting for a symmetric PG PDN structure, and a unity PG ratio
is a must for maintaining the power integrity of the design along with keeping signal integrity at an optimum
level. The DDR memory controller is the brain of the entire DDR subsystem; hence, an optimized controller
design as discussed in [10] can improve the performance of the overall subsystem. All commands like read/write
access and pre-charge commands were tested and verified. The verification was done on System Verilog to provide
high coverage for the code to ensure the correct functioning of the block. The controller was designed to
generate timing and control signals to synchronize the command operations. The drawback of the above design is
an increased number of buffers that are inserted. The inserted buffers result in an extra delay in the data
paths, which severely affects the timing closure of the designs.

The power rail noise limit is determined by the DDR to interface current profile and PDN impedance [11]. The
dynamic behavior of the memory subsystem greatly increases the power rail noise due to the sudden charge and
discharge of current through the Static Random-Access Memory (SRAM) cells. On-die PDN is studied using the
solution space analysis, wherein the power rails are decomposed into lumped on-die capacitors and effective
series resistance. Different currents and voltages are applied to emulate the various operating conditions to
estimate the overall voltage drop. The analysis shows that higher capacitance and low series resistance lower
the voltage drop. A design of freeway Network On-Chip (NoC) is proposed in [12] which routes flits on DDR and
allows bypass pipelining. Pipeline bypassing reduces the packet latency at a low traffic load. The routing is
done in such a way that only flits moving straight can pass through the bypass pipeline. In smaller networks,
the freeway latency is found to be 49% higher than short-path, but in large networks, the freeway-NoC latency is
5% lower with a 23% increase in throughput. A sensitivity analysis to estimate the signal and power integrity of
a PDN for a DDR is presented in [13]. A synthesized Resistance-Inductance-Capacitance (RLC) model is proposed to
perform model extraction instead of the Computer-Aided Design (CAD) layout model extraction. The model is
created using self and transfer impedance equations that can be incorporated into an algorithm. The models are
created quickly and efficiently and match very closely to the CAD layout extraction models. The models are
passive and causal, and correlation is good for both frequency and time domains. The above method produces
faster analysis results while maintaining the accuracy of the CAD layouts.

With advancements in DDR, rapid advancement in LLC technology has also taken place to keep up with fast-growing
memory needs. Several developments in LLC have taken place in the past decade. One such development is in
stacking technology. The increased parallelism in LLC has resulted in opting for 3D stacking as compared to the
traditional 2D stacking. However, the leakage power is seen to increase greatly due to dense 3D integration
[14]. A novel hybrid reconfigurable architecture for LLC was proposed. The new design combines SRAM along with
Spin Transfer Torque (STT) SRAM technology to dynamically reduce power at runtime by restoration and
duplication. The power is seen to reduce by 98.4% as compared to the traditional design. A cache-partitioning
algorithm is used to efficiently divide the LLC block among the different processors. A novel method to
partition cache using Non-Volatile Memories (NVM) instead of SRAM is presented in [15]. The cache is
periodically partitioned in such a way as to assign heavily accessed ways to low accessed partitions, thereby
distributing the access to the entire LLC block.

Sakhare et al. [16] presented the replacement of SRAM-based LLC with STT Magnetic Random Access Memory (MRAM)
based LLC, due to the limited scaling capability of SRAM. The STT-MRAM based design proves to provide larger
energy gains and low access latency. Two more designs, Compressed Tag (CT) cache [17] and data shepherding [18],
are presented to manage larger LLC blocks. The developments on DDR and LLC have increased the cell level
complexity and timing criticality during PD flow. The larger number of logic cells that were inserted to
optimize the design in terms of power results in a highly congested placement if care is not taken to prepare
the floor-plan. Several floor-plan and macro
placement techniques have been explored to enable congestion-free standard cell placements and overcome the
higher logic density challenges.

A macro placement algorithm for regular placement of macros is presented in [19]. Macros and standard cells are
clustered together in advance according to the connections between them, creating different hierarchies of
macros. The macros are then legalized to obtain an efficient floor-plan. The simulated annealing algorithm
combined with the corner stitching algorithm is explored for macro placement in [20]. This method is effective
to refine the placement of standard cells along with macros according to the placement regions defined by the
algorithm. A clustering algorithm for standard cells and macros built as a tree from the design hierarchy
during synthesis is presented in [21], allowing the algorithm to consider the indirect connectivity of macros
to the standard cells. This method is best used when the placement of macros and standard cells is done
simultaneously.

A novel multi-level algorithm that considers the Register Transfer Level (RTL) connections between macros and
standard cells is discussed in [22]. The synthesis net-list is divided based on dataflow hierarchy and a cost
function is evaluated to optimize the wire length and timing of the connections. The proposed algorithm enables
easy timing and Design Rule Check (DRC) closure. The amount of impact that the macro placement has on the
congestion of the design is assessed in [23]. Two different macro placement strategies are taken into
consideration and the impact at each PD stage is evaluated. The congestion and QOR are observed at every step to
assess the effect of macro placement. To ensure the timing closure of the design, several manual optimizations
are required to meet the setup and hold times. Different methods to fix the setup and hold time are given in
[24]. These methods provide a robust timing closure method to obtain minimal DRCs during the sign-off phase of
PD implementation. Various algorithms exist to group and slice up the cell into the gate level net-list
according to parameters such as maximum interconnect length and logical depth. One such algorithm is the Genetic
and Simulated Annealing (GSA) algorithm [25], which is used to define weight values for different cells while
clustering them to perform an efficient placement. The macro placement has further been explored as a fully
automated solution using machine learning models in [26]. Several floor-plans with different macro placements
have been provided to build a robust machine learning model to decide the optimum macro placement for given
floor-plan specifications.

The above macro placement techniques, while accounting for the connections between macros and standard cells, do
not account for the connections between macros and I/O ports of the chip. With modules such as the LLC, the
macros mostly have connections to the I/O ports of the chip. These connections are of utmost importance in an
LLC module, as these ultimately interface with the off-chip memory. Moreover, the above algorithms are provided
for a full-custom PNR flow, where macros and standard cells are simultaneously placed, which requires such
automated algorithms. However, since the LLC module is developed as a semi-custom design, the macros are placed
first, followed by standard cells. In this paper, two macro placement strategies are presented for semi-custom
flow that take into consideration the connections of macros to the standard cells as well as the I/O ports of
the chip.

3.Methods
3.1DDR subsystem
The System-on-Chip (SoC) design needs to interface with the off-chip memory as shown in Figure 1. The memory
subsystem is shared and must respond to numerous requests from multiple cores, each having its own latency and
bandwidth requirements. The processor, along with the Graphics Processing Unit (GPU) and Digital Signal
Processor (DSP), interacts with the memory. To decrease the memory access times, the DDR subsystem acts as an
interface between the processors and the memory. The DDR enables memory access on both edges of the clock cycle
as compared to the traditional memory systems accessing data on only one clock edge. One of the blocks in the
DDR subsystem is the LLC. The LLC acts as an additional cache memory apart from the L1 and L2 caches. The LLC was
added as an attempt to further reduce the memory access times by reducing the frequency of off-chip data access.

The DDR subsystem has been increasingly employed in applications such as satellite navigation [27]. However, the
physical interface and high-speed data access impose tight PD constraints on the module. Such new architecture
furthers the need to meet timing requirements in all extreme corner cases to ensure the proper functioning of
the memory interface across several environmental conditions. Hence, a detailed PD implementation is required
to ensure the working of this subsystem.
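
The two effects described in Section 3.1 (data transfer on both clock edges, and fewer off-chip accesses because of the LLC) can be illustrated with a small calculation. The following is only a sketch: the clock frequency, bus width, hit time, miss rate, and miss penalty are hypothetical values chosen for illustration and are not taken from this work.

```python
def peak_transfer_rate(clock_hz: float, bus_width_bits: int, edges_per_cycle: int) -> float:
    """Peak bytes per second when one transfer happens on each active clock edge."""
    return clock_hz * edges_per_cycle * bus_width_bits / 8


def average_access_time(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average access time of a cache level placed in front of a slower memory."""
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical interface parameters (assumptions, not values from the paper).
CLOCK_HZ = 800e6
BUS_WIDTH_BITS = 64

sdr_rate = peak_transfer_rate(CLOCK_HZ, BUS_WIDTH_BITS, edges_per_cycle=1)
ddr_rate = peak_transfer_rate(CLOCK_HZ, BUS_WIDTH_BITS, edges_per_cycle=2)

# Hypothetical LLC: 10 ns hit time, 20% miss rate, 100 ns off-chip penalty.
with_llc_ns = average_access_time(10.0, 0.2, 100.0)

print(f"single-edge peak rate: {sdr_rate / 1e9:.1f} GB/s")
print(f"double-edge peak rate: {ddr_rate / 1e9:.1f} GB/s")
print(f"average access time with LLC: {with_llc_ns:.1f} ns")
```

At the same clock, the double-edge interface has twice the peak rate of the single-edge one, and the average access time falls below the off-chip penalty whenever the LLC hit time and miss rate are small enough.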

Figure 1 SoC Architecture

3.2Overall methodology
The PD implementation starts with importing the gate level net-list file, Synopsys Design Constraints (SDC)
file, physical and technology files, timing liberty modules, and power intent file. The back-end flow then
begins from the floor-plan stage. The first step is to define the core utilization and the aspect ratio as seen
in Figure 2. The core to IO boundary is then decided to snap the corners of the instance grid. Next, the pin
placement on the boundary is done based on the inputs from the top-level hierarchy module, and appropriate
layers are assigned to the pins. Macro placement is carried out using the three different strategies: island,
peripheral, and donut. The various placement blockages are then defined to ensure clean and congestion-free
placement of standard cells. Placement regions are further defined, grouping similar logic hierarchy cells
together. The physical-only cells are then placed all over the core area. Finally, power rings and straps are
generated based on the power intent of the design. A sanity check on the floor-plan is performed to ensure a
clean design before placement.

The standard cell placement is performed by the tool using several inbuilt algorithms. The placement begins with
the initial coarse placement that places the standard cells randomly according to the space available. This
information is then used to perform optimization, to adjust cells to reduce congestion and meet timing. The next
step is the refine incremental placement, wherein small perturbations are carried out iteration-by-iteration to
optimize the design further. The final placement and optimization then take place, followed by the legalization
to snap to the manufacturing grid. Next, the clock specifications need to be defined to lay out the clock
network on the chip. To form the multi-clock tree, clock drivers are created, followed by clock straps
generation. Once the clock mesh is ready, the global clock tree is built and checked. Next, the clock mesh is
routed and the entire clock tree is synthesized and legalized.

The routing begins with routing clock and certain critical nets. Next, the secondary power grid mesh is
connected. The global route is then performed, where approximate routes are assigned and coarse congestion is
calculated. The track assignment is the step where the tracks of different routing layers are assigned to the
global routes. After the track assignment, violations may exist, which are resolved during the detailed routing
stage. Post-route optimization is performed to fix congestion and legalize the routing. Once routing is
complete, the sign-off checks consisting of timing, congestion, area, and power analysis are performed. The
setup and hold timings are fixed based on the timing reports generated, by sizing or replacing buffers and
inverters. The DRC checks are performed to make sure the design is ready for manufacture and involve metal
filling and Engineering Change Order (ECO) fixes.

The above methodology is followed for each of the three different macro placement strategies, along with
leveraging the various tool options provided by Synopsys IC Compiler II, which are employed to implement a power
and performance-optimized design for each. The timing, power, and congestion values for each of the macro
placement strategies were observed and analyzed at each stage.

Figure 2 Overall methodology
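
The coarse congestion calculated during global routing can be pictured with a toy model. The sketch below is an illustration only and is not how Innovus or IC Compiler II compute congestion: it assumes a uniform grid of routing cells, a single routing capacity per cell, L-shaped routes (horizontal first, then vertical), and defines the overflow percentage as the share of grid cells whose demand exceeds capacity. The function names and the net list are hypothetical.

```python
from collections import Counter


def l_route(src, dst):
    """Grid cells crossed by an L-shaped route: horizontal leg first, then vertical."""
    (x0, y0), (x1, y1) = src, dst
    step_x = 1 if x1 >= x0 else -1
    cells = [(x, y0) for x in range(x0, x1 + step_x, step_x)]
    step_y = 1 if y1 >= y0 else -1
    cells += [(x1, y) for y in range(y0 + step_y, y1 + step_y, step_y)]
    return cells


def overflow_percentage(nets, grid_w, grid_h, capacity):
    """Percentage of grid cells whose routing demand exceeds the per-cell capacity."""
    demand = Counter()
    for src, dst in nets:
        demand.update(l_route(src, dst))
    overflowed = sum(1 for used in demand.values() if used > capacity)
    return 100.0 * overflowed / (grid_w * grid_h)


# Hypothetical two-pin nets on a 4 x 4 grid with capacity 2 per cell.
nets = [((0, 0), (3, 0)), ((0, 0), (3, 1)), ((1, 0), (2, 3))]
print(overflow_percentage(nets, grid_w=4, grid_h=4, capacity=2))
```

In this toy case, three nets share the bottom row, so two of the sixteen cells carry more routes than they have capacity for. Moving a pin or a macro changes which cells the routes cross, which is why floor-plan decisions show up in the congestion figure.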

3.3Floor-plan methodology
In this design, the core utilization is set at 78.54% and the aspect ratio to 0.98. The core to IO boundary is
defined for the left, right, bottom, and top edges. The core boundary needs to be decided to snap the boundary
points on the instance grid.
3.3.1 Voltage areas
The voltage areas are specified to aid the multi-voltage design as decided in the Unified Power Format (UPF)
file. The voltage areas can be nested, disjoint, or overlapping. Guard bands are added to voltage areas to
prevent cells from other voltage areas overlapping the present voltage area.
3.3.2 Pin placement
The pins are assigned to different edges of the core based on the top-level module connections as shown in
Figure 3. The width and pitch for each layer of the pins must be assigned. Even metal layers (M2 and M4) are
assigned to vertical pin tracks and odd metal layers (M1 and M3) are assigned to horizontal pin tracks.
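
As a rough illustration of how the core utilization and aspect ratio quoted in Section 3.3 fix the core dimensions, the sketch below derives a core width and height from a total placed area. It assumes that utilization is the placed cell and macro area divided by the core area and that the aspect ratio is height divided by width; tools may define these differently. The placed area used here is a hypothetical value, not one reported for this design.

```python
import math


def core_dimensions(placed_area_um2: float, utilization: float, aspect_ratio: float):
    """Return (width, height) in micrometres of a rectangular core.

    Assumes utilization = placed area / core area and
    aspect_ratio = height / width.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if aspect_ratio <= 0.0:
        raise ValueError("aspect_ratio must be positive")
    core_area = placed_area_um2 / utilization
    width = math.sqrt(core_area / aspect_ratio)
    height = aspect_ratio * width
    return width, height


# Utilization and aspect ratio as stated in the text; the area is hypothetical.
PLACED_AREA_UM2 = 1_000_000.0
width_um, height_um = core_dimensions(PLACED_AREA_UM2, utilization=0.7854, aspect_ratio=0.98)
print(f"core width  = {width_um:.1f} um")
print(f"core height = {height_um:.1f} um")
```

In practice, the resulting boundary would still be snapped to the instance grid, as noted in the text.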

Figure 3 Pin assignment

3.3.3 Macro placement


Macros are generally memories and certain hard
Intellectual Property (IP) blocks. The macros have a large
number of connections to standard cells as well as the
I/O pins. Hence, the placement of these macros is
crucial to reduce congestion and wire length. Longer
wire lengths lead to an increased transition time, which
makes timing closure difficult. Hence, the placement of
macros is decided based on fly-lines. Fly-lines
provide an idea of the macro connections to the pins and
standard cells. Three macro placement strategies have
been explored, as follows.
1) Island macro placement - All the macros are
placed together on one side of the core area, forming
an island. The island is formed such that a regular
rectangular area is made available for standard cell
placement, as shown in Figure 4.
2) Peripheral macro placement - The macros are
placed on the periphery of the core area boundary, as
close to the pins as possible, as shown in Figure 5. The
core area in the middle is available for the standard cell
placement.
3) Donut macro placement - The macros are placed
on the periphery as well as in the middle of the core
area, leaving a donut-shaped region for the standard
cells, as shown in Figure 6.

Figure 4 Island macro placement

Figure 5 Peripheral macro placement

Figure 6 Donut macro placement

3.3.4 Placement blockages
Placement blockages are added to control the
placement of standard cells in some regions. The types
of placement blockages are:
1) Hard blockages - Regions where no standard cells
or hard macros can be placed. The regions marked
red in Figure 4 are hard blockages.
2) Soft blockages - Regions where no standard cells
can be placed during initial placement; however,
buffers and inverters can be placed during
optimization. The regions marked orange in Figure
4 are soft blockages.
3) Partial blockages - Regions where standard cells
can be placed, but only up to a certain cell density
as specified. The regions marked pink in Figure 4
are partial blockages with a cell density of 50%.

3.3.5 Power planning
The power mesh is built in this step to form a power
distribution network across the core area. A set of
guard rails of Voltage Drain Drain (VDD) and Ground
(GND) is built around the core area which connects
the primary supply input ports. A set of horizontal and
vertical rails of both voltage levels are formed to
provide power to all parts of the core as shown in
Figure 7.

Figure 7 Power stripes

3.3.6 Power switches
Power switches are used in multi-voltage designs.
According to the UPF specifications, the power
switches are placed in a daisy chain fashion as seen in
Figure 8.

Figure 8 Power switch insertion
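
The three blockage types of Section 3.3.4 amount to a small set of placement rules, which the sketch below encodes. It is a simplified model for illustration and not the behaviour of any particular tool: the class names, the `may_place` helper, and the way density is tracked are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class BlockageKind(Enum):
    HARD = "hard"
    SOFT = "soft"
    PARTIAL = "partial"


@dataclass(frozen=True)
class Blockage:
    kind: BlockageKind
    max_density: float = 0.0  # only meaningful for PARTIAL blockages


def may_place(blockage: Blockage, is_buffer_or_inverter: bool,
              during_optimization: bool, current_density: float) -> bool:
    """Decide whether a standard cell may be placed inside the blockage region."""
    if blockage.kind is BlockageKind.HARD:
        # Hard: nothing may be placed, at any stage.
        return False
    if blockage.kind is BlockageKind.SOFT:
        # Soft: only buffers and inverters, and only during optimization.
        return during_optimization and is_buffer_or_inverter
    # Partial: any standard cell, as long as the density limit is not reached.
    return current_density < blockage.max_density


hard = Blockage(BlockageKind.HARD)
soft = Blockage(BlockageKind.SOFT)
partial = Blockage(BlockageKind.PARTIAL, max_density=0.5)  # 50% as in the text

print(may_place(soft, is_buffer_or_inverter=True, during_optimization=True, current_density=0.0))
print(may_place(partial, is_buffer_or_inverter=False, during_optimization=False, current_density=0.6))
```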

3.3.7 Addition of physical-only cells
The final stage of the floor-plan is the addition of the physical-only cells. These cells have no logical
functionality. End-Cap cells are added at the edges of the core and around macros to protect the diffusion and
poly layers during lithography. Well-Tap cells are added to provide the substrate of the transistors with the
appropriate well voltage for proper functionality. Another type of cell added is the tie-cell, which is used to
tie a wire to either logic 1 or 0. These are used to interface between powered down and always-on power domains.

3.4Place and route methodology
3.4.1 Placement preparation
After the floor-plan is completed, the floor-plan specifications are written onto a Design Exchange Format
(DEF) file. The inputs to the placement tool are the gate-level net-list, floor-plan DEF file, power intent
file, timing module files, and reference library files. Sanity checks are performed on the floor-plan file and
gate-level net-list. The power intent file is checked and any violations between the floor-plan data,
gate-level net-list, and power intent are corrected. The floor-plan information is then loaded onto the tool.
Next, the power intent is committed, which adds the isolation cells, retention cells, enable level shifters,
and power Mux-s.
3.4.2 Optimization preparation
Placement optimization is an important step during PD flow. Several parameters and tool options of the Synopsys
IC Compiler II tool need to be set before optimization.
1. Setting target library files - The library files that should be used by the tool for optimization and
clock tree synthesis should be defined by using the set_target_library_subset command.
2. Restricting library cells - The command set_lib_cell_purpose restricts the library cells used during
optimization, clock tree synthesis, and setup/hold fixing. This reduces the tool runtime as only specific cells
will be tried and tested for optimization.
3. Preventing optimization on cells - By setting the size_only option on some cells, optimization can be
prevented on certain cells. This option is set on cells present in the clock paths.
4. Setting percentage low Voltage Threshold (VT) optimization - Low VT cells switch faster but have high
leakage current. The set_max_lvth_percentage command restricts the use of low VT cells to a defined value and
the tool considers leakage and power trade-off during optimization. In this design, the percentage low VT is
set to 20.
5. Specifying routing resources - The minimum and maximum routing layers can be set globally and for specific
nets. Layers to be ignored for Resistance-Capacitance (RC) estimation during optimization are also set using
the set_ignored_layers command.
6. Defining placement bounds - Placement bounds are of move and group type. Move bounds have a fixed location
and boundary, whereas group bounds have a fixed boundary. The bounds are set to group similar logic level cells
to reduce wire length.
7. Enable power optimization - Dynamic power optimization is enabled for the design by using the command
set_scenario_status -dynamic_power true.
8. Enabling congestion driven placement - The congestion effort can be controlled by
set_app_options -name place_opt.congestion.effort -value high.
9. Enable global route estimation - The optimization engine makes use of a virtual route to estimate wire
length for timing fixing. Global routing gives a more accurate estimate of the wire length, but increases the
run time. In the design, global routing for placement and high fan-out net synthesis is enabled.
10. Performing magnet placement - Magnet placement is used to place certain logic cells close to objects to
reduce the wire length. Certain macros are set to act as magnet objects for some logic cells to which they
connect extensively.
3.4.3 Performing placement
The place_opt command is run to invoke the tool to run placement. Several optimization iterations are performed
to get an optimized placement for congestion and timing. The placement is then legalized to snap the standard
cells to the manufacturing grid. The placement is checked to resolve any violations before moving to the Clock
Tree Synthesis (CTS) stage.
3.4.4 CTS
The CTS starts with deriving the clock trees and checking for all clock constraints. The clock constraints must
be specified for all clocks and a clock reference must be derived for all clock cells. The transition and
capacitance for each input port must also be specified. The parameters that are set to prepare for CTS are as
follows:
1. Enable skew and target latencies - The tool tries to achieve the skew and target latency values as required
by specific designs during the optimization.
2. Enable local skew optimization and skew groups - The skew groups are a set of clock cells among
911
J. Fadnavis and Kariyappa B.S.

which the skew must be balanced. Local balancing direction can be fixed. Routing blockages are areas
results in a much-optimized timing for clock paths. where routing of certain layers is not allowed.
3. Specifying the primary corner - The optimization Routing blockages are placed close to the pins to
tool uses the set primary corner to resolve setup and reduce routing congestion. Routing corridors are
hold violations. The primary corner defined is regions where the routing of some nets can be
generally the extreme corner for which timing must restricted.
meet the requirements. 2) Defining Non-Default Routing (NDR) for clock
4. Enabling dirty design mode - The constraints in and signal nets - Certain nets require special route
the SDC file can get extremely tight, which layer characteristics. The trunks of the clock tree are
increases the optimization run time. This setting is usually routed with a double width layer which is
specified to get optimum results in lesser time as the specified as an NDR rule. NDR rules are specified
tool ignores a few constraints to meet timing. for certain nets after looking at logical connectivity.
5. Enabling global route - Global routing for clock 3) Routing clock nets - The global routing, track
nets is enabled to get accurate wire length timing assignment, and detail routing is performed for all
values during optimization. clock nets.
6. Enable Concurrent Clock and Data (CCD) 4) Routing critical nets - Certain nets as studied from
optimization - The option is enabled to perform the data flow logic are considered critical nets.
optimization on both clock and data paths. Buffers These nets must be routed first to fix them and
and inverters will be added to the data path to meet prevent optimization of these nets further.
and balance timing.
Once the clock and critical nets are routed, the routing
Once the specifications are enabled, a clock tree can of the entire design can be carried out. The routing
be built. The clock tree is first built by inserting the engine first assigns global routes to all nets and
mesh and tap drivers across the core area. The clock overflow in each global route cell is reported. The
mesh is then built using the create_clock_straps track assignment is then performed which contains
command. The global clock tree is then built which is certain violations. The detailed routing routes the nets
generally an H-tree structure. The mesh and tap drivers completely and resolves violations. Post route
are routed to the global clock tree and mesh, followed optimization is performed which includes legalization
by synthesis and optimization of the entire clock tree. of cells, incremental detail routing, and ECO routing.
A tap driver connected to the various sinks is shown in
Figure 9. 3.5 Implementation specifications
3.4.5 Routing The floor-plan of the LLC module was performed on
The routing parameters that are set before performing Cadence Innovus Implementation System and the
routing area: PNR flow was carried out using Synopsis IC Compiler
1) Defining routing guides, blockages, and II. The LLC module specifications are given below in
corridors – Routing guides are regions where Table 1.
specific routing characteristics such as horizontal
and vertical track utilization, and preferred routing
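The routing step in Section 3.4.5 reports overflow for each global route cell. The following is a minimal Python sketch of that bookkeeping, assuming that the overflow of a GCell is the routing demand in excess of its track capacity. The grid values and the function name gcell_overflow are hypothetical; they are not taken from the tool or from the LLC design.

```python
from typing import List, Tuple


def gcell_overflow(demand: List[List[int]],
                   capacity: List[List[int]]) -> Tuple[int, int, float]:
    """Return (overflowed GCell count, total overflow, overflow % of capacity).

    Assumption: overflow of one GCell = max(0, demand - capacity).
    """
    overflowed = 0
    total_overflow = 0
    total_capacity = 0
    for d_row, c_row in zip(demand, capacity):
        for d, c in zip(d_row, c_row):
            total_capacity += c
            excess = max(0, d - c)
            if excess:
                overflowed += 1
                total_overflow += excess
    pct = 100.0 * total_overflow / total_capacity if total_capacity else 0.0
    return overflowed, total_overflow, pct


# Hypothetical 3x3 GCell grid: tracks demanded vs. tracks available.
demand = [[8, 10, 12], [9, 11, 7], [6, 10, 13]]
capacity = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

count, total, pct = gcell_overflow(demand, capacity)
print(f"overflowed GCells={count}, total overflow={total}, overflow={pct:.2f}%")
```

Counting overflowed GCells and expressing overflow as a percentage mirror the two congestion figures used later in the results, but the exact definitions used by the router may differ from this sketch.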

Figure 9 H-clock tree
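The reason an H-tree suits a global clock network is that every sink sits at the same wire distance from the root. The Python sketch below builds an idealized H-tree and checks that property. It is a purely geometric model with assumed dimensions: it ignores buffers, the clock mesh, and RC effects, and it is not the construction used by the CTS tool. The function name h_tree_sinks is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def h_tree_sinks(center: Point, half_span: float,
                 levels: int) -> List[Tuple[Point, float]]:
    """Return (sink location, wire length from root) for an idealized H-tree.

    Each level draws one 'H' around every current node: a horizontal run of
    length `span` followed by a vertical run of length `span` reaches each of
    the four corners. The span halves at every level.
    """
    nodes = [(center, 0.0)]
    span = half_span
    for _ in range(levels):
        next_nodes = []
        for (x, y), length in nodes:
            for dx in (-span, span):
                for dy in (-span, span):
                    next_nodes.append(((x + dx, y + dy), length + 2 * span))
        nodes = next_nodes
        span /= 2
    return nodes


sinks = h_tree_sinks(center=(0.0, 0.0), half_span=4.0, levels=2)
lengths = {length for _, length in sinks}
print(f"{len(sinks)} sinks, distinct root-to-sink wire lengths: {lengths}")
```

Equal wire length is only a first-order proxy for equal latency; the skew targets in Section 3.4.4 are still enforced by the tool during clock tree optimization.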



Table 1 LLC module specifications


Parameter Number
Inputs 891
Outputs 756
Macros 57
Leaf Instances 125014
Clocks 23
Clock Gating Cells 8675
Registers 97845
Clock Gating Ratio 100
Retention Flops 45689
Buffers 145980

4. Results
The entire PD flow for the three macro placement techniques was carried out using Cadence Innovus Implementation System and Synopsys IC Compiler II. The power, timing, and congestion were monitored at every step. The post-route setup timing is shown in Figure 10. The Worst Negative Slack (WNS) is seen to be negative.

Figure 10 Post-route setup timing report

The setup timing needs to be optimized such that the WNS is positive. The optimization can be done by upsizing the cells present in the data path to reduce the data propagation delay. The optimized setup timing report is shown in Figure 11, where the WNS is made positive.

Figure 11 Optimized post-route setup timing report

The placed design of the peripheral macro strategy is shown in Figure 12. Before exporting the file to the GDS II format, all the DRC violations were cleared as shown in Figure 13.

Figure 12 Peripheral macro placed design

Figure 13 DRC checks
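The setup-timing goal described in this section, a WNS that moves from negative to positive, can be illustrated with a small Python sketch. It assumes an ideal clock with no skew, and it assumes that upsizing shortens the data-path arrival times on the failing paths. The clock period, setup time, endpoint names, and arrival times are hypothetical and do not come from the timing reports in Figures 10 and 11.

```python
from typing import Dict


def setup_slack(clock_period: float, arrival: float, setup_time: float) -> float:
    """Setup slack = required time - arrival time (ideal clock, no skew)."""
    return (clock_period - setup_time) - arrival


def worst_negative_slack(slacks: Dict[str, float]) -> float:
    """WNS is taken here as the smallest slack over all timing endpoints."""
    return min(slacks.values())


# Hypothetical clock period and flop setup time, in ns.
period, t_setup = 2.0, 0.1

# Hypothetical data-path arrival times (ns) before and after upsizing.
arrival_before = {"ep_a": 1.6, "ep_b": 2.3, "ep_c": 1.95}
arrival_after = {"ep_a": 1.6, "ep_b": 1.8, "ep_c": 1.7}

wns_before = worst_negative_slack(
    {ep: setup_slack(period, t, t_setup) for ep, t in arrival_before.items()})
wns_after = worst_negative_slack(
    {ep: setup_slack(period, t, t_setup) for ep, t in arrival_after.items()})
print(f"WNS before: {wns_before:.2f} ns, WNS after: {wns_after:.2f} ns")
```

A single slow endpoint sets the WNS, which is why the optimization concentrates on the cells of the worst paths rather than on the whole design.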

The floor-plan sanity check results are tabulated in Table 2. The observations from the sanity check are as follows:
1. The standard cell area for island macro placement is highest, followed by the donut and then peripheral configurations.
2. The blockage area for donut macro placement is the lowest, followed by peripheral macro placement and island macro placement. This shows that the non-uniformity of macro placement is highest for island placement, making it an inefficient macro placement strategy.
3. While the number of cell rows is highest for island macro placement, the number of unique cell rows is the least, which indicates less uniform standard cell placement. In this regard, the peripheral macro placement proves to be the best.
4. The core density and gate density are lowest for peripheral macro placement, moderate for donut placement, and highest for island placement. This indicates that more area is available for clock and signal routing in peripheral placement than in the other strategies, thereby reducing congestion of the core area.
5. The number of power switch cells and PG pins placed in the core area is highest for island macro placement, followed by peripheral and then donut. The power switch cells consume extra power, and a higher number of these leads to more power consumption of the chip. The peripheral macro placement is again observed to be the best, with a moderate value for power consumption.
6. The number of Global Cells (GCells) with route congestion is a rough indication of the per-GCell congestion after routing takes place. The congestion is seen to be minimum for peripheral macro placement, indicating that it is the better macro placement choice among the three.

Table 2 Floor-plan parameter comparison


No Property Island placement Peripheral placement Donut placement
1 Standard cell area (nm2) 0.1566 0.1549 0.1550
2 Macro area (nm2) 0.4047 0.4047 0.4047
3 Blockage area (nm2) 0.01589 0.00877 0.006320
4 Number of cell rows 27535 22980 22830
5 Gate density 58.83% 56.53% 56.63%
6 Core density 80.30% 80.15% 80.28%
7 Power switch cells 2098 2025 2003
8 Number of core sites 23423786 23493460 23398572
9 Number of unique length rows 14 28 22
10 Number of PG pins 15161 15088 15066
11 Row area (nm2) 0.2732 0.2740 0.2729
12 Number of GCells with routing track overflow 321 145 382
13 Number of vias 17923 16538 17178
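
The orderings stated in the sanity-check observations can be reproduced from Table 2 with a short Python sketch. The numbers below are copied from Table 2; the dictionary keys and the helper name ranked are hypothetical labels introduced only for this illustration.

```python
# Values copied from Table 2; tuple order is (island, peripheral, donut).
STRATEGIES = ("island", "peripheral", "donut")
TABLE_2 = {
    "standard_cell_area": (0.1566, 0.1549, 0.1550),
    "blockage_area": (0.01589, 0.00877, 0.006320),
    "cell_rows": (27535, 22980, 22830),
    "gate_density_pct": (58.83, 56.53, 56.63),
    "core_density_pct": (80.30, 80.15, 80.28),
    "power_switch_cells": (2098, 2025, 2003),
    "unique_length_rows": (14, 28, 22),
    "overflow_gcells": (321, 145, 382),
}


def ranked(prop: str) -> list:
    """Strategies sorted from lowest to highest value of a Table 2 property."""
    pairs = zip(STRATEGIES, TABLE_2[prop])
    return [name for name, _ in sorted(pairs, key=lambda pair: pair[1])]


for prop in TABLE_2:
    print(f"{prop:20s} low -> high: {ranked(prop)}")
```

Sorting each property separately makes it easy to see that no single strategy is lowest on every row, so the comparison still depends on which properties matter most for congestion.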


The QOR results at each step of the PD implementation flow for each macro placement strategy are given as follows:
1) After standard cell placement - Table 3 tabulates the QOR after standard cell placement as follows:
i) The overall WNS value is seen to be negative as the design has not been optimized for timing. However, the WNS value is least negative for peripheral macro placement, indicating better timing QOR.
ii) The power is seen to be minimum for peripheral macro placement and highest for donut macro placement.
iii) The congestion value is moderate for all the strategies as the clock and signal routes have not been placed yet.
2) After CTS - The QOR comparison after CTS is tabulated in Table 4. The WNS value has reduced across all the macro placement strategies due to the post-placement optimization that occurs before CTS. The power has increased due to extra power consumption by the clock controllers and cells. The routing congestion is also seen to have increased for each of these, as the available density for routing has reduced after the introduction of clock cells.
3) After routing - The WNS value for each macro placement strategy has become less negative, indicating an improvement in timing QOR, as shown in Table 5. The power is seen to increase further, due to the power consumption of the signal and clock routes. The congestion value is reported to increase slightly relative to the value after CTS. The slight increase is caused by the routing optimization carried out by the tool.
4) After chip sign-off - The WNS has been optimized to obtain a positive value, which indicates better timing closure, as seen in Table 6. The power is seen to have increased from the routing stage due to the buffers inserted to close timing. The overall routing congestion is seen to be lowest for peripheral placement, proving it to be the better macro placement option, along with the least WNS and power consumption.

The complete list of abbreviations is given in Appendix I.

Table 3 QOR comparison after standard cell placement


Property Island placement Peripheral placement Donut placement
WNS (ns) -45.36 -34.8 -49.67
Power (µW) 0.467 0.389 0.521
Horizontal congestion overflow 0.19% 0.12% 0.24%
Vertical congestion overflow 0.23% 0.21% 0.28%
Total congestion overflow (horizontal +vertical) 0.21% 0.165% 0.26%
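
The total congestion overflow row in Table 3 appears to be the arithmetic mean of the horizontal and vertical overflow rather than their sum. This reading is an inference from the tabulated numbers and is not stated in the text. The short Python check below, with values copied from Table 3, shows the comparison; the helper name mean_overflow is hypothetical.

```python
# (horizontal %, vertical %, reported total %) copied from Table 3.
TABLE_3 = {
    "island": (0.19, 0.23, 0.21),
    "peripheral": (0.12, 0.21, 0.165),
    "donut": (0.24, 0.28, 0.26),
}


def mean_overflow(horizontal: float, vertical: float) -> float:
    """Assumed definition: total overflow = mean of the two directions."""
    return (horizontal + vertical) / 2


for name, (h, v, reported) in TABLE_3.items():
    computed = mean_overflow(h, v)
    print(f"{name:10s} computed={computed:.3f}% reported={reported}%")
```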

Table 4 QOR comparison after CTS


Property Island placement Peripheral placement Donut placement
WNS (ns) -26.89 -23.07 -34.7
Power (µW) 0.551 0.456 0.592
Horizontal congestion overflow 0.38% 0.34% 0.45%
Vertical congestion overflow 0.36% 0.29% 0.38%
Total congestion overflow (horizontal +vertical) 0.37% 0.31% 0.415%

Table 5 QOR comparison after routing


Property Island placement Peripheral placement Donut placement
WNS (ns) -24.3 -19.6 -31.7
Power (µW) 0.779 0.654 0.967
Horizontal congestion overflow 0.47% 0.32% 0.41%
Vertical congestion overflow 0.35% 0.31% %
Total congestion overflow (horizontal +vertical) 0.315% 0.315% 0.395%


Table 6 QOR comparison after chip sign-off


Property Island placement Peripheral placement Donut placement
WNS (ns) 0.678 0.57 0.985
Power (µW) 0.9525 0.765 1.043
Horizontal congestion overflow 0.47% 0.24% 0.44%
Vertical congestion overflow 0.26% 0.39% 1.37%
Total congestion overflow (horizontal +vertical) 0.37% 0.32% 0.90%
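
The sign-off values in Table 6 can be compared against the island baseline as relative differences. The Python sketch below does this with values copied from Table 6. It assumes that the percentages quoted for the peripheral and donut strategies are relative differences of this kind, which the text does not state explicitly; the helper name change_vs_island is hypothetical.

```python
# WNS (ns) and power (µW) after chip sign-off, copied from Table 6.
TABLE_6 = {
    "island": {"wns": 0.678, "power": 0.9525},
    "peripheral": {"wns": 0.57, "power": 0.765},
    "donut": {"wns": 0.985, "power": 1.043},
}


def change_vs_island(strategy: str, metric: str) -> float:
    """Percentage change of a metric relative to the island baseline."""
    base = TABLE_6["island"][metric]
    return 100.0 * (TABLE_6[strategy][metric] - base) / base


for strategy in ("peripheral", "donut"):
    for metric in ("wns", "power"):
        delta = change_vs_island(strategy, metric)
        print(f"{strategy:10s} {metric:5s} {delta:+.1f}%")
```

Under this assumption the peripheral values come out roughly 16% and 19.7% below the island values, and the donut values roughly 45% and 9.5% above them, which is close to the figures quoted in the discussion.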

5. Discussion
The floor-plan stage during the PNR flow is an important step towards designing congestion-free chip layouts. A clean floor-plan considerably improves the QOR parameters for the design, and macro placement plays a big role in clean floor-plan design. The macros have a high number of connections to both the standard cells and IO ports, and an inefficient placement can lead to increased wire lengths. The macro placement must be done keeping in mind the fly-line connections. The peripheral macro placement is seen to be the best, providing a 16% improvement in WNS and a 19.6% improvement in power as compared to the island macro placement. The total congestion is also the least for peripheral macro placement, at 0.32%, which makes the chip finish and metal fill steps after routing easy. On the other hand, the donut macro placement is the worst macro placement strategy, with a 45% degradation in WNS and a 9.5% increase in power consumption as compared to the island macro placement strategy.

5.1 Limitations
The limitation of this work is that the timing analysis is done only on the LLC module of the DDR subsystem. Once the LLC block is integrated as a black box with other sub-blocks of the DDR subsystem, new timing paths might be created which result in negative slack. Hence, the optimization will have to be performed for the LLC block again, keeping in mind the overall timing paths. Another limitation of the work is the limited number of corners used to optimize the setup and hold timing paths. Only 10 extreme corners were used to perform design closure. The addition of more corners to constrain the design will ensure a more robust design.

6. Conclusion and future work
The increased demand for high-performance electronic gadgets has led to the exploration of the DDR memory subsystem. The timing-critical physical interface and complex logical architecture of the DDR need to be handled during PD implementation. An optimized floor-plan leads to reduced congestion of the module; hence, different macro placement techniques were discussed. The complete PNR flow was presented in the paper to optimize timing, power, and congestion for each of the placement strategies. A 16% improvement in timing and a 19.6% improvement in power are obtained for the peripheral macro placement strategy as compared to the island macro placement. The overall congestion for island macro placement and peripheral macro placement was seen to be 0.37% and 0.32%, respectively, while for the donut macro placement it was 0.9%. Both the island and peripheral macro placements can be used to optimize congestion: the island macro placement can be used for high cell density modules, whereas the peripheral macro placement can be used for lower cell density modules. The peripheral macro placement proves to be the best placement strategy as compared to the island and donut macro placement strategies.

This work can be extended in the future by including more timing corners to check the design during timing analysis. The inclusion of more corners with different process, voltage, and temperature variations can lead to a more robust design immune to environmental fluctuations. Moreover, the physical implementation was performed only for the LLC module of the DDR subsystem. The same design methodology can be extended to implement the other blocks present in the design and optimize them for power and congestion.

Acknowledgment
None

Conflicts of interest
The authors have no conflicts of interest to declare.

Juhie Fadnavis is currently pursuing her Bachelor of Engineering from RV College of Engineering, Bangalore. She completed her schooling at National Public School, Bangalore in 2017. Her interests lie in Static Timing Analysis, VLSI Chip Engineering, and Embedded Systems. She wishes to pursue her Master's in Engineering from a reputed University soon.
Email: [email protected]

Kariyappa B. S obtained his B.E. degree in Electronics and Communication from Bangalore University in 1997, his M.E. degree in Electronics and Communication from the same university in 2000, and his Ph.D. degree in Electronics and Communication from Avinashlingam University, Coimbatore in 2012. He is currently working as Professor in the Electronics and Communication Engineering department of R V College of Engineering, Bengaluru. With over 20 years of teaching experience, he is guiding 3 Ph.D. students and has guided many undergraduate and postgraduate student projects. He has authored/co-authored more than 60 articles in refereed international journals/conferences and has a good number of Scopus and Google Scholar citations.
Email: [email protected]

Appendix I

S.No. Abbreviation Description
1 CAD Computer-Aided Design
2 CCD Concurrent Clock and Data
3 CTS Clock Tree Synthesis
4 DDR Double Data Rate
5 DRAM Dynamic Random-Access Memory
6 DRC Design Rule Check
7 DSP Digital Signal Processor
8 DEF Design Exchange Format
9 ECO Engineering Change Order
10 GCell Global Cell
11 GDS Graphic Data Stream
12 GSA Genetic and Simulated Annealing
13 GPU Graphics Processing Unit
14 GND Ground
15 I/O Input/Output
16 IoT Internet of Things
17 IP Intellectual Properties
18 LLC Last Level Cache
19 MRAM Magnetic Random Access Memory
20 NDR Non-Default Routing
21 NoC Network On-Chip
22 PC Personal Computers
23 PD Physical Design
24 PDN Power Delivery Network
25 PNR Place and Route
26 PG Power/Ground
27 QOR Quality of Results
28 RC Resistance-Capacitance
29 RDL Re-Distribution Layer
30 RLC Resistance-Inductance-Capacitance
31 RTL Register Transfer Level
32 SDC Synopsys Design Constraints
33 SRAM Static Random-Access Memory
34 SoC System-on-Chip
35 STT Spin Transfer Torque
36 UPF Unified Power Format
37 VDD Voltage Drain Drain
38 VLSI Very Large-Scale Integration
39 VT Voltage Threshold
40 WNS Worst Negative Slack
