Containing Power Dissipation in the Latest Generation of Integrated Circuits Is Testing the Ingenuity of Designers

by Peter Klapproth and Francesco Pessolano

The gap between what we can theoretically manufacture in silicon and what we can realistically design has finally stabilised. This is the direct result of re-use strategies involving successively higher levels of abstraction: cell re-use, succeeded by IP re-use, succeeded, in turn, by sub-system and architecture re-use. Today, the system on chip (SoC) design community is already contemplating chip re-use: tiling several fully verified ‘chiplets’ onto a single piece of silicon. Sub-100nm CMOS process technologies will continue to make cost-effective implementation of these designs possible. The latest large-scale SoCs are already heterogeneous multi-processor systems, containing an array of CPUs, DSPs, vector processors and hardware

Reducing the amount of power consumed by these processing resources is already a critical design requirement, especially in SoCs for battery-powered portable devices such as mobile phones. The introduction of multimedia functions on mobile phones, including Internet browsing, TV viewing and 3D gaming, has increased demands on the battery from two directions. First, there has been a significant increase in computing power per se, and, secondly, this enhanced performance is now required for much longer periods. We have gone from usage patterns characterised by long periods in standby, punctuated by short periods of use, to the situation where a mobile phone is expected to support several hours of compute-intensive activity, at the same time retaining sufficient battery power to support mobile communications. Unfortunately, these escalating energy demands have not been matched by a corresponding increase in battery capacity.

Things will only get worse. SoC implementation in 65nm CMOS and beyond will see an end to the power consumption scaling that the industry has been used to, where dynamic switching losses per transistor fell in direct proportion to the scaling factor. As a result, power dissipation per unit area of silicon will begin to approach critical levels. No longer able to rely on the inherent low-power performance of underlying semiconductor processes, designers must therefore find new ways of reducing power consumption. It is not only dynamic power consumption that they need to address. For 65nm CMOS the contribution of leakage current is already significant, making it important to find solutions that cut static power consumption as well.

The dynamic power consumption of CMOS digital chips is approximated by the following relationship:

$$P_{\mathrm{dynamic}} \propto C\,V_{dd}^{2}\,f$$

Where C = switched capacitance
Vdd = supply voltage
f = switching frequency

The capacitance for a given number of gates is fixed by the process/library technology. This means that, for a given process technology, dynamic power dissipation can only be reduced by lowering the clock frequency, the supply voltage, or a combination of both. Clock frequencies can theoretically be adjusted between the maximum clock frequency that the process technology can sustain at its maximum Vdd, down to zero frequency (a stopped clock). The supply voltage can be adjusted over the specified operating voltage range of the process technology. In addition, supply voltage and clock frequency are interdependent, with higher Vdd values being required to sustain higher clock frequencies.

POWER REDUCTION STRATEGIES FOR SOC
[Figure, top panel (voltage islands): supply rails Vddcore, Vddram, Vddsoc and Vddalways; blocks shown are CPU core logic, L1 cache, peripherals, always-on functions, networks and bridges, SDRAM, static memory, embedded SRAM and tunnels.]

Dynamic frequency switching, in the form of clock gating (reducing the clock frequency to zero), has been used for some time to reduce dynamic power dissipation in areas of a chip that are temporarily idle. However, although clock gating is easy to implement, it does nothing to reduce power consumption when these areas are once again active.
[Figure, middle panel (dynamic power reduction techniques): the same block diagram, with several blocks annotated ‘Frequency switching’.]

Dynamic frequency scaling, on the other hand, addresses active as well as idle-state power consumption, by progressively reducing the chip’s clock frequency as the computational load on its processing resources falls, at the same time ensuring that computational tasks still complete within the real-time constraints of the application. This technique is particularly suitable for circuits that operate continuously at known performance levels, for example, peripheral devices in which the device driver software has a good knowledge of performance requirements in different operating modes. However, for circuits that operate intermittently, dynamic frequency switching may achieve equivalent power savings and be simpler to implement.
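Why stopping the clock can match frequency scaling for intermittent work follows from the dynamic power relationship given earlier, provided the supply voltage stays fixed. The sketch below is only an illustration under stated assumptions: the function names, the proportionality constant of 1, and the block parameters are all hypothetical, and leakage is ignored.

```python
import math


def avg_power_clock_gated(c_eff, vdd, f_max, cycles, period_s):
    """Average dynamic power when the block runs at f_max, then stops its clock.

    Assumes P = c_eff * vdd**2 * f while clocked and zero dynamic power
    while gated (proportionality constant taken as 1, leakage ignored).
    """
    busy_s = cycles / f_max
    if busy_s > period_s:
        raise ValueError("work does not fit in the period even at f_max")
    return c_eff * vdd ** 2 * f_max * (busy_s / period_s)


def avg_power_freq_scaled(c_eff, vdd, f_max, cycles, period_s):
    """Average dynamic power when the clock is slowed to just meet the deadline."""
    f_needed = cycles / period_s
    if f_needed > f_max:
        raise ValueError("work does not fit in the period even at f_max")
    return c_eff * vdd ** 2 * f_needed


# Hypothetical block: 1 nF switched capacitance, 1.2 V supply, 200 MHz maximum
# clock, 2 million cycles of work due every 40 ms.
args = (1e-9, 1.2, 200e6, 2e6, 0.04)
gated = avg_power_clock_gated(*args)
scaled = avg_power_freq_scaled(*args)
print(f"clock gated:      {gated * 1e3:.1f} mW")
print(f"frequency scaled: {scaled * 1e3:.1f} mW")
assert math.isclose(gated, scaled)
```

With the voltage held constant, both strategies spend the same energy per cycle of work, so the averages agree. Any further saving has to come from running the slower clock at a lower supply voltage.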
[Figure, bottom panel (static power reduction techniques): the same block diagram, with the CPU block labelled ‘Core logic & TCM’ and blocks annotated for voltage switching and reduction.]

Lowering the switching frequency has the added bonus of opening up the possibility of additional power savings through a reduction in supply voltage. Given that dynamic power consumption is proportional to the square of the supply voltage, even a small reduction in supply voltage can have a significant benefit.
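To put rough numbers on the quadratic dependence, the ratio below is derived from the dynamic power relationship given earlier, with the switched capacitance C assumed unchanged. The scaling factors 0.9 and 0.8 are illustrative assumptions, not figures from the Philips SoC.

```latex
\begin{align*}
\frac{P'_{\mathrm{dynamic}}}{P_{\mathrm{dynamic}}}
  &= \left(\frac{V'_{dd}}{V_{dd}}\right)^{2} \frac{f'}{f} \\
V'_{dd} = 0.9\,V_{dd},\; f' = f:
  \quad \frac{P'_{\mathrm{dynamic}}}{P_{\mathrm{dynamic}}} &= 0.9^{2} = 0.81 \\
V'_{dd} = 0.9\,V_{dd},\; f' = 0.8\,f:
  \quad \frac{P'_{\mathrm{dynamic}}}{P_{\mathrm{dynamic}}} &= 0.81 \times 0.8 = 0.648
\end{align*}
```

On these assumed figures, a 10% cut in supply voltage alone trims dynamic power by about 19%, and combining it with a 20% slower clock brings the saving to about 35%.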
[Figure caption: Top: voltage islands; middle: dynamic power reduction techniques; bottom: static power reduction techniques]

This ability to reduce the clock frequency to parts of a chip, and run those parts at reduced supply voltage, is already being exploited in SoCs. Implementation takes the form of ‘voltage islands’, in which IP blocks with a common maximum clock frequency are grouped together and powered from a separate supply voltage.
For example, one recent 65nm SoC introduced by Philips has four distinct voltage islands: one for the Philips ARM1176 CPU, one for the high-speed level 1 cache, one for a set of ‘always on’ functions (clock and reset generation, power supply IC control, RTC and power mode controller), and a fourth for the chip’s peripheral functions and main memory (see ‘Power reduction strategies for SoC’, top).

This SoC, which Philips has designed as a platform to demonstrate the multimedia capabilities of next-generation consumer products, is already at the heart of several new 65nm CMOS products currently in an advanced stage of development within the company.

In current SoCs that employ voltage islands, each island typically operates at a fixed Vdd voltage. However, the existence of these voltage islands also opens up the possibility of applying another technique, called dynamic voltage and frequency scaling (DVFS), to further reduce power consumption. DVFS adds voltage scaling on top of frequency scaling to dynamically adjust each island’s Vdd to the lowest possible voltage needed to sustain its selected clock speed.

LOOPING AND BEYOND
DVFS can be implemented either as an open-loop or closed-loop process. In open-loop DVFS, several discrete frequency and voltage operating points are defined for the target system, and the system is set to the nearest operating point that guarantees the required processing performance. In practice, the number of different operating points is typically limited to between 2 and 4, each of which must guarantee performance under specific processor loading, also taking into account worst-case process variations (variation in system performance due to process technology variations) and operating temperatures. However, because open-loop DVFS has to make decisions based on worst-case process variations and operating temperatures, in many cases the supply voltage may still be set higher than strictly necessary.

Closed-loop DVFS overcomes the problem by providing direct feedback on actual silicon performance in the system, thereby taking into account process and temperature variations. To do so, it incorporates a performance monitor within each DVFS domain that measures the domain’s actual clock-speed capability at any point in time. The output of this performance monitor passes information to a voltage regulator that adjusts the domain’s Vdd voltage to the minimum level needed to meet the performance requirements of the silicon under actual conditions. Performance monitors are typically based upon circuits that exploit gate delays, such as ring oscillators. The dynamic power reduction techniques employed in the Philips 65nm SoC are shown in the middle figure of the ‘Power reduction strategies for SoC’ panel.

Another technique that has been proposed for use in combination with DVFS to optimise performance and power consumption is body biasing: controlling a CMOS transistor’s body potential to alter its threshold voltage. Body biasing allows an improvement in switching speed if a forward bias is applied or a reduction in power consumption if a back bias is applied. It can also be used to improve production yields by compensating for process technology variations that would render a conventional SoC out-of-specification. However, because of the need to isolate areas of silicon so that the body potential of the transistors can be controlled, body biasing significantly increases design complexity. In addition, it is unlikely that its benefits can be extended beyond 65nm CMOS, making the associated research and development effort needed to perfect the technique questionable.

An alternative way to reduce dynamic power consumption within logic blocks is to eliminate the clock completely by implementing them in clockless self-timed asynchronous logic. It’s an approach that has been pioneered by the engineers at Eindhoven-based Handshake Solutions, which has replaced the clock by a request/acknowledge handshaking scheme that initiates tasks (request) and signals when the results of these tasks become available (acknowledge). This means that only those parts of a
system actively involved in task execution consume power, and the moment all tasks are completed the system automatically goes into a near-zero power consumption standby mode. Handshake Solutions has already worked with processor company ARM to produce the ARM996HS processor, the world’s first synthesisable clockless ARM9E family processor for real-time embedded low-power applications.

CUTTING STATIC POWER
Static (standby) power consumption in deep sub-micron CMOS is dominated by sub-threshold channel leakage currents. It is approximated by the following relationship:

$$P_{\mathrm{static}} \propto V_{dd}\,k\,e^{(V_{gs}-V_{t})/s}\,\frac{W}{L}$$

Where Vdd = the supply voltage
k = dielectric constant of the gate dielectric
Vgs = transistor gate-source voltage
Vt = transistor threshold voltage
s = a process-specific parameter
W = transistor channel width
L = transistor channel length

Because W and L are fixed by the transistor design, this leaves adjustment of Vdd and Vt (via body-biasing) as the two principal ways of fine-tuning static power consumption in the finished SoC.

As with frequency switching, switching off the Vdd supply to areas of the chip that are not in use is the simplest form of control and has the advantage of reducing static power consumption in these areas to zero. However, this inevitably leads to a loss of state in the de-powered logic blocks, which means that the power consumption saved by switching these components off has to be weighed against the additional power needed to save/restore their state on entry/exit from the standby condition. In addition, on-chip voltage switching adds to design complexity, and the IR drops and timings associated with embedded Vdd switching transistors are not well handled by current EDA tools.

In situations where the internal state of IP blocks must be retained, for example in SRAM blocks, dropping Vdd to the minimum level required for state retention can also reduce power consumption. However, the savings achieved by this technique are not normally large. Forcing the CMOS transistors harder off by applying back-bias to alter their Vt holds out the promise of much more substantial static power savings for 90nm CMOS but, like the body-biasing mentioned earlier, it is not expected to scale well to 65nm CMOS and beyond.

The static power reduction techniques employed in the Philips 65nm SoC are shown in the bottom section of the ‘Power reduction strategies for SoC’ panel.

CROSSING THE BOUNDARIES
A consequence of both frequency and voltage scaling in conventional synchronous SoC architectures is that signals travelling from one clock or voltage domain to another must cross frequency and/or voltage boundaries. Boundaries between clock domains must therefore incorporate clock synchronisation mechanisms and voltage boundaries must include level-shifters, all of which adds to silicon cost and complexity. For frequency boundaries, communication latencies are also increased. This is because synchronisation mechanisms typically add several clock cycles of latency that on a CPU bus interface can reduce CPU performance. Future SoCs may therefore adopt new system architectures that help to overcome these effects.

One approach currently finding favour is the Globally Asynchronous, Locally Synchronous (GALS) architecture. This approach, which fits well with dynamic frequency scaling, breaks the chip up into a number of independent clock domains, each of which is internally synchronous, but which communicate with one another via asynchronous communications channels. The addition of routers in these communication channels offers the potential to turn these architectures into networks-on-chip. Within such an architecture, frequency or voltage switching can be used to turn entire sub-systems off for periods of time. Although originally designed to overcome the fact that timing closure was becoming increasingly difficult to achieve due to the long clock distribution and signal interconnect lengths in large-scale SoCs, GALS architectures will also contribute significantly to power savings.

Francesco Pessolano is CTO programme manager, and Peter Klapproth is a technology architect with the System Technology & Architecture Group at Philips Semiconductors.

