Architecture of SoC
Embedded System Design
An SoC includes both hardware and software. It uses less power, has better performance, requires less space and is more reliable than multi-chip systems. Most systems-on-chip today come inside mobile devices like smartphones and tablets.
Types of SoCs
An SoC usually contains various components such as:
i. Operating system and utility software applications.
ii. Voltage regulators and power management circuits.
iii. Timing sources such as phase-locked loop control systems or oscillators.
iv. A microprocessor, microcontroller or digital signal processor.
v. Peripherals such as real-time clocks, counter timers and power-on-reset generators.
vi. External interfaces such as USB, FireWire, Ethernet, universal asynchronous receiver-transmitter or serial peripheral interface bus.
vii. Analog interfaces such as digital-to-analog converters and analog-to-digital converters.
viii. RAM and ROM memory.
Advantages
Lower cost per gate, lower power consumption, faster circuit operation and higher reliability.
2. Architecture of SoC
An SoC consists of hardware functional units, including microprocessors that run software, as well as a communications subsystem to connect, control, direct and interface between the functional modules.
[Figure: block diagram of an SoC chip showing the CPU, Digital Signal Processor (DSP) and storage]
An SoC must have at least one processor core, but typically an SoC has more than one core. Processor cores can be a microcontroller, microprocessor, Digital Signal Processor (DSP) or Application-Specific Instruction Set Processor (ASIP) core. ASIPs have instruction sets that are customized for an application domain and designed to be more efficient than general-purpose instructions for a specific type of workload. Multiprocessor SoCs have more than one processor core by definition.
Whether single-core, multi-core or many-core, SoC processor cores typically use RISC instruction set architectures. RISC architectures are advantageous over CISC processors for SoCs because they require less digital logic, and therefore less power and area on board, and in the embedded and mobile computing markets, area and power are often highly constrained. In particular, SoC processor cores often use the ARM architecture because it is a soft processor specified as an IP core and is more power efficient than x86.
Digital Signal Processors
Digital Signal Processor (DSP) cores are often included on SoCs. They perform signal processing operations in SoCs for sensors, actuators, data collection, data analysis and multimedia processing. DSP cores typically feature Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) instruction set architectures, and are therefore highly amenable to exploiting instruction-level parallelism through parallel processing and superscalar execution. DSP cores most often feature application-specific instructions, and as such are typically Application-Specific Instruction-set Processors (ASIP). Such application-specific instructions correspond to dedicated hardware functional units that compute those instructions.
Typical DSP instructions include multiply-accumulate, fast Fourier transform, fused multiply-add, and convolutions.
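To make the multiply-accumulate and convolution instructions more concrete, the following is a minimal Python sketch of a convolution built purely from multiply-accumulate steps. The function names mac and fir_filter are hypothetical and chosen only for illustration. On a real DSP core each call to mac would typically be a single hardware instruction.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def fir_filter(samples, coeffs):
    """Convolve samples with coeffs using only multiply-accumulate steps.

    Returns the full convolution, which has
    len(samples) + len(coeffs) - 1 output values.
    """
    out = [0] * (len(samples) + len(coeffs) - 1)
    for i, x in enumerate(samples):
        for j, c in enumerate(coeffs):
            # Each output sample is a running sum of products.
            out[i + j] = mac(out[i + j], x, c)
    return out


if __name__ == "__main__":
    # A 2-tap moving sum applied to a short ramp signal.
    print(fir_filter([1, 2, 3, 4], [1, 1]))  # [1, 3, 5, 7, 4]
```

The inner loop performs one multiply-accumulate per coefficient, which is why DSP cores dedicate a hardware functional unit to this operation.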
Other Components
As with other computer systems, SoCs require timing sources to generate clock signals, control execution of SoC functions and provide time context to signal processing applications of the SoC, if needed. Popular time sources are crystal oscillators and phase-locked loops.
SoC peripherals include counter-timers, real-time timers and power-on reset generators. SoCs also include voltage regulators and power management circuits.
Memory
SoCs include semiconductor memory blocks to perform their computation, as do microcontrollers and other embedded systems. Depending on the application, SoC memory may form a memory hierarchy and cache hierarchy. In the mobile computing market, this is common, but in many low-power embedded microcontrollers, this is not necessary. Memory technologies for SoCs include read-only memory (ROM), random-access memory (RAM), Electrically Erasable Programmable ROM (EEPROM) and flash memory. As in other computer systems, RAM can be subdivided into relatively faster but more expensive static RAM (SRAM) and the slower but cheaper dynamic RAM (DRAM). When an SoC has a cache hierarchy, SRAM will usually be used to implement processor registers and cores' L1 caches whereas DRAM will be used for lower levels of the cache hierarchy including main memory. "Main memory" may be specific to a single processor (which can be multi-core) when the SoC has multiple processors, in which case it is distributed memory and must be sent via intermodule communication on-chip to be accessed by a different processor.
Interfaces
SoCs include external interfaces, typically for communication protocols. These are often based upon industry standards such as USB, FireWire, Ethernet, USART, SPI, HDMI, I2C, etc. These interfaces will differ according to the intended application. Wireless networking protocols such as Wi-Fi, Bluetooth, 6LoWPAN and near-field communication may also be supported.
When needed, SoCs include analog interfaces including analog-to-digital and digital-to-analog converters, often for signal processing. These may be able to interface with different types of sensors or actuators, including smart transducers. They may interface with application-specific modules or shields. Or they may be internal to the SoC, such as if an analog sensor is built into the SoC and its readings must be converted to digital signals for mathematical processing.
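As an illustration of the conversion between analog readings and digital values, here is a small Python sketch of an ideal linear converter. The 10-bit resolution, the 3.3 V reference and both function names are assumptions made for this example only. Real converters differ in resolution, reference voltage and scaling convention.

```python
def adc_code_to_volts(code, bits=10, v_ref=3.3):
    """Map a raw ADC code to a voltage, assuming an ideal linear converter.

    bits and v_ref are illustrative defaults, not values from a datasheet.
    """
    max_code = (1 << bits) - 1
    if not 0 <= code <= max_code:
        raise ValueError("code out of range for this resolution")
    return code * v_ref / max_code


def volts_to_dac_code(volts, bits=10, v_ref=3.3):
    """Inverse mapping for an ideal DAC: nearest code, clamped to range."""
    max_code = (1 << bits) - 1
    code = round(volts * max_code / v_ref)
    return max(0, min(max_code, code))


if __name__ == "__main__":
    print(adc_code_to_volts(512))   # roughly half of the reference voltage
    print(volts_to_dac_code(1.65))  # code near mid-scale
```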
Inter-module communication
SoCs consist of many execution units. These units must often send data and instructions back and forth. Because of this, all but the most trivial SoCs require communication subsystems. Originally, as with other microcomputer technologies, data bus architectures were used, but recently designs based on sparse intercommunication networks known as Networks-on-Chip (NoC) have risen to prominence and are forecast to overtake bus architectures for SoC design in the near future.
Bus-based communication
Historically, a shared global computer bus typically connected the different components, also called "blocks" of the SoC. A very common bus for SoC communications is ARM's royalty-free Advanced Microcontroller Bus Architecture (AMBA) standard.
Direct memory access controllers route data directly between external interfaces and memory, bypassing the CPU or control unit, thereby increasing the data throughput of the SoC. This is similar to some device drivers of peripherals on component-based multi-chip module PC architectures.
Computer buses are limited in scalability, supporting only up to tens of cores (multicore) on a single chip. Wire delay is not scalable due to continued miniaturization, system performance does not scale with the number of cores attached, the SoC's operating frequency must decrease with each additional core attached for power to be sustainable, and long wires consume large amounts of electrical power. These challenges are prohibitive to supporting many-core systems on chip.
Network on a chip
By the late 2010s, a trend of SoCs implementing communication subsystems in terms of a network-like topology instead of bus-based protocols has emerged. A trend towards more processor cores on SoCs has caused on-chip communication efficiency to become one of the key factors in determining the overall system performance and cost. This has led to the emergence of interconnection networks with router-based packet switching known as "Networks on Chip" (NoCs) to overcome the bottlenecks of bus-based networks.
Networks-on-chip have advantages including destination- and application-specific routing, greater power efficiency and reduced possibility of bus contention. Network-on-chip architectures take inspiration from networking protocols like TCP and the Internet protocol suite for on-chip communication, although they typically have fewer network layers. Optimal network-on-chip network architectures are an ongoing area of much research interest. NoC architectures range from traditional distributed computing network topologies such as torus, hypercube, meshes and tree networks to genetic algorithm scheduling to randomized algorithms such as random walks with branching and randomized time to live (TTL). Many SoC researchers consider NoC architectures to be the future of SoC design because they have been shown to efficiently meet power and throughput needs of SoC designs.
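To give a feel for router-based packet switching on a mesh topology, the sketch below computes the hop-by-hop path of a packet using dimension-order (XY) routing. XY routing is chosen here only because it is simple to show. The text above does not say which routing scheme any particular NoC uses, and the function name xy_route is hypothetical.

```python
def xy_route(src, dst):
    """Return the router coordinates a packet visits from src to dst.

    Dimension-order (XY) routing on a 2D mesh: travel along X until the
    column matches, then along Y. Coordinates are (x, y) tuples.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route)           # [(0, 0), (1, 0), (2, 0), (2, 1)]
    print(len(route) - 1)  # 3 hops
```

Unlike a shared bus, only the routers along this path are occupied by the packet, so transfers between other pairs of routers can proceed at the same time.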
Applications
The most common application of SoCs today is in mobile applications, including smartphones, smart watches and tablets. Other applications include signal and speech processing, PC interfaces and data communication. SoCs are being applied to personal computers as well due to the integration of communication modules like LTE and wireless networks onto the chip.
The Raspberry Pi Zero family, for example, is a tiny version of the full-size Raspberry Pi which drops a few features - in particular the multiple USB ports and wired network port - in favour of a significantly smaller layout and lowered power needs. All Raspberry Pi models have one thing in common, though: they're compatible, meaning that software written for one model will run on any other model. It's even possible to take the very latest version of the Raspberry Pi's operating system and run it on an original pre-launch Model B prototype.
[Figure: Raspberry Pi Model B and Model B+ boards with GPIO header pin labels. Labelled parts: Broadcom BCM2835, Micro SD card, USB ports, audio jack, Micro USB, HDMI port, camera module port]
Improvements
i. The model B+ stays ahead in terms of processing speed and comes with an improved wireless capability.
ii. The dual-band Wi-Fi 802.11ac runs at 2.4 GHz and 5 GHz and provides a better range in challenging wireless environments, and Bluetooth 4.2 is available with BLE support.
iii. The top side is covered with metal shielding, instead of plastic in the earlier models, that acts as a heat sink and drains the excessive amount of heat if the board is subjected to high temperature or pressure.
iv. This B+ model is three times faster than Pi 2 and 3, which is a major development in terms of speed, capable of executing different functions at a decent pace.
v. The Ethernet port comes with 300 Mbit/s, which is much faster than earlier versions with 100 Mbit/s speed. It is known as gigabit Ethernet based on the USB 2.0 interface.
vi. A four-pin header is added on the board that resides near the 40-pin header. This allows Power over Ethernet (PoE), i.e., it provides the necessary electrical current to the device using data cables instead of power cords. It is very useful and reduces the number of cables required for the installation of a device in the relevant project.
The following figure shows the pinout of the Raspberry Pi 3B+.

Function   Name       Pin | Pin   Name       Function
           3.3 V        1 |  2    5 V
SDA        GPIO2        3 |  4    5 V
SCL        GPIO3        5 |  6    GND
           GPIO4        7 |  8    GPIO14     UART0_TXD
           GND          9 | 10    GPIO15     UART0_RXD
           GPIO17      11 | 12    GPIO18     CLK
           GPIO27      13 | 14    GND
           GPIO22      15 | 16    GPIO23
           3.3 V       17 | 18    GPIO24
MOSI       GPIO10      19 | 20    GND
MISO       GPIO9       21 | 22    GPIO25
CLK        GPIO11      23 | 24    GPIO8      CE0_N
           GND         25 | 26    GPIO7      CE1_N
I2C        ID EEPROM   27 | 28    ID EEPROM  I2C
           GPIO5       29 | 30    GND
           GPIO6       31 | 32    GPIO12
           GPIO13      33 | 34    GND
           GPIO19      35 | 36    GPIO16
           GPIO26      37 | 38    GPIO20
           GND         39 | 40    GPIO21
i. The 40-pin header is used to develop an external connection with electronic devices. This is the same as the previous versions, making it compatible with all the devices where older versions can be used.
ii. Out of 40 pins, 26 are used as digital I/O pins and 9 of the remaining 14 pins are termed dedicated I/O pins, which indicates they don't come with an alternative function.
iii. Pins 3 and 5 come with an onboard pull-up resistor of 1.8 kΩ, and pins 27 and 28 are dedicated to the ID EEPROM. In the B+ model, the GPIO header is slightly repositioned to allow more space for the additional mounting hole. The devices that are compatible with the B model may work with the B+ version; however, they may not sit identically to the previous version.
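The relationship between physical header pins and GPIO numbers can be captured in a simple lookup table. The Python sketch below is only an illustration: the names PIN_TO_GPIO, POWER_PINS, GROUND_PINS, ID_EEPROM_PINS and describe_pin are hypothetical, and the mapping assumes the standard 40-pin Raspberry Pi header layout.

```python
# Physical header pin -> GPIO number, for the 26 GPIO pins of the header.
PIN_TO_GPIO = {
    3: 2, 5: 3, 7: 4, 8: 14, 10: 15, 11: 17, 12: 18, 13: 27,
    15: 22, 16: 23, 18: 24, 19: 10, 21: 9, 22: 25, 23: 11, 24: 8,
    26: 7, 29: 5, 31: 6, 32: 12, 33: 13, 35: 19, 36: 16, 37: 26,
    38: 20, 40: 21,
}
# Assumed standard power, ground and ID EEPROM pin positions.
POWER_PINS = {1: "3.3V", 2: "5V", 4: "5V", 17: "3.3V"}
GROUND_PINS = {6, 9, 14, 20, 25, 30, 34, 39}
ID_EEPROM_PINS = {27, 28}


def describe_pin(pin):
    """Return a short label for a physical pin number from 1 to 40."""
    if pin in PIN_TO_GPIO:
        return f"GPIO{PIN_TO_GPIO[pin]}"
    if pin in POWER_PINS:
        return POWER_PINS[pin]
    if pin in GROUND_PINS:
        return "GND"
    if pin in ID_EEPROM_PINS:
        return "ID EEPROM"
    raise ValueError("the header only has pins 1 to 40")


if __name__ == "__main__":
    for pin in (3, 6, 12, 27):
        print(pin, describe_pin(pin))
```

A table like this is handy because software libraries usually address pins by GPIO number, while wiring diagrams use the physical pin position.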
RISC Architecture
Memory management
ARM1176JZF-S
The ARM1176JZF-S CPU is a member of the ARM11 Thumb family. The ARM1176JZF-S macrocell is a 32-bit cached processor with ARM architecture v6 that supports the ARM and Thumb instruction sets and includes features for direct execution of Java byte codes. Executing Java byte codes requires the Java Technology Enabling Kit (JTEK). The development chip contains:
i. Cache
Cache memory for instruction (32 KB) and data (32 KB).
Level 2 Cache Controller (L2CC) with 128 KB unified cache.
ii. DSP
A range of Single Instruction Multiple Data (SIMD) DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
iii. MMU
iv. TCM
8 KB of data and instruction Tightly Coupled Memory (TCM). The TCM operates with a single wait-state and provides higher data rates than external memory.
v. VFP
Vector Floating Point coprocessor (VFP), supporting the ARM VFPv2 floating point coprocessor instruction set.
vi. TrustZone: TrustZone security extensions
TrustZone Interrupt Controller (TZIC)
TrustZone Protection Controller (TZPC)
vii. IEM: Provision for Intelligent Energy Management (IEM).
ARM Intelligent Energy Controller (IEC).
National Semiconductor Advanced Power Controller (APC1).
National Semiconductor Hardware Performance Monitor (HPM).
viii. AXI RAM
AXI RAM (512 kB) and boot ROM emulation (16 kB).
ix. AXI buses: The ARM1176JZF-S processor uses the Configurable AXI Interconnect to connect the processor core to the on-chip AXI controllers and peripherals. An AXI to APB bridge provides the interface to the APB-based peripherals in the development chip. One external AXI master bus and one external AXI slave bus provide the interface to the FPGA peripherals and the optional Logic Tile.
X. CAI
xi. Synchronous serial port: The SSP provides a master or slave interface for synchronous serial communication using Motorola SPI, TI or National Semiconductor Microwire devices.
xii. Smart card interface: The Smart Card Interface signals are programmable to enable support for a Smart Card, Security Identity Module (SIM) card, or similar module.
xiii. GPIO: Eight bits of GPIO are provided by the on-chip interface. (An additional GPIO is provided by the FPGA.)
xiv. Watchdog: A Watchdog module can be used to trigger an interrupt or system reset in the event of software failure.
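The SIMD DSP instructions listed above treat one 32-bit register as several independent 8-bit or 16-bit lanes. The following Python sketch mimics a lane-wise addition of four unsigned 8-bit values. It is an illustration of the idea only, not a model of any specific ARM instruction, and the function name packed_add_u8 is hypothetical.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed into 32-bit words.

    Each lane wraps modulo 256 and never carries into its neighbour,
    which is what distinguishes it from an ordinary 32-bit addition.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result


if __name__ == "__main__":
    print(hex(packed_add_u8(0x01020304, 0x10203040)))  # 0x11223344
    print(hex(packed_add_u8(0x000000FF, 0x00000001)))  # 0x0
```

The second call shows the difference from a plain addition: the lowest lane wraps to zero instead of carrying into the next lane.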
The Broadcom BCM2835 SoC used in the first generation Raspberry Pi includes a 700 MHz ARM1176JZF-S processor, VideoCore IV Graphics Processing Unit (GPU), and RAM. It has a level 1 (L1) cache of 16 KiB and a level 2 (L2) cache of 128 KiB. The level 2 cache is used primarily by the GPU. The SoC is stacked underneath the RAM chip, so only its edge is visible. The ARM1176JZ(F)-S is the same CPU used in the original iPhone, although at a higher clock rate, and mated with a much faster GPU.
The earlier V1.1 model of the Raspberry Pi 2 used a Broadcom BCM2836 SoC with a 900 MHz 32-bit, quad-core ARM Cortex-A7 processor, with 256 KiB shared L2 cache. The Raspberry Pi 2 V1.2 was upgraded to a Broadcom BCM2837 SoC with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, the same SoC which is used on the Raspberry Pi 3, but underclocked (by default) to the same 900 MHz CPU clock speed as the V1.1. The BCM2836 SoC is no longer in production as of late 2016.
The Raspberry Pi 3 Model B uses a Broadcom BCM2837 SoC with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, with 512 KiB shared L2 cache. The Model A+ and B+ are 1.4 GHz.
The Raspberry Pi 4 uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor, with 1 MiB shared L2 cache. Unlike previous models, which all used a custom interrupt controller poorly suited for virtualisation, the interrupt controller on this SoC is compatible with the ARM Generic Interrupt Controller (GIC) architecture 2.0, providing hardware support for interrupt distribution when using ARM virtualisation capabilities.
The Raspberry Pi Zero and Zero W use the same Broadcom BCM2835 SoC as the first generation Raspberry Pi, although now running at 1 GHz CPU clock speed.
The CPU of the first and second generation Raspberry Pi board did not require cooling with a heat sink or fan, even when overclocked, but the Raspberry Pi 3 may generate more heat when overclocked.
Figure 2.8
i. Fetch stages can hold up to four instructions. Branch prediction is performed on instructions ahead of execution of earlier instructions.
ii. Issue and Decode stages can contain any instruction in parallel with a predicted branch.
iii. Execute, Memory, and Write stages can contain a predicted branch, an ALU or multiply instruction, a load/store multiple instruction, and a coprocessor instruction in parallel execution.
Order of execution
1. 1st fetch stage
2. 2nd fetch stage
3. Instruction decode
4. Register read and instruction issue
5. Shifter stage
6. Main ALU stage
7. Saturation stage
8. Write back
Figure 2.9
The stages are executed in order from one to eight in one single clock cycle, so one clock cycle can provide one ALU operation. Now think about if the pipeline was four stages long and not eight: it would then take two full clock cycles to complete the same instruction. This makes the process half as efficient. However, the ARM11 is a superscalar architecture, so it can do more than one operation per clock cycle, as can most modern processors. Superscalar means that functions inside the CPU core can operate in a parallel fashion. You can think of a superscalar architecture like a retail mall with multiple checkout lines. You have many operators serving many customers. The opposite to this is scalar: scalar would be a small greengrocer with one checkout that can serve only one person at a time.
The more stages you add, the higher the clock frequency you need to drive the stages. This has the very unfortunate side effect of increased heat and power usage. Given that the ARM11 is targeted at low-power and low-heat embedded devices, more stages would be very bad.
The ARMv6 is special in another way too; it is the first ARM core to contain a vector floating point coprocessor. This coprocessor meets the IEEE standards for floating point arithmetic by giving the ARM11 a low-cost, high-performance, single-precision and double-precision computation ability in hardware. A lot of the performance improvements will come from this coprocessor that is potentially more than 10 times faster for certain operations.
[Figure: cache lookup block diagram showing the Micro TLB, TAG RAM, DATA RAM, TCM, way select, comparator, cache data out, hit/miss and data abort signals, RAMSet base address and size, and write buffer data (1-2 words)]
Figure 2.10
Four-way set associative with size configurable from 4 to 64 kB.
Cache is Harvard implementation.
Cache replacement policies are Pseudo-Random or Round Robin, which is controlled by the RR bit in CP15 register c1.
MicroTLB determines if cache lines are write-back or write-through.
Contains both secure and non-secure data in cache lines.
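The four-way set-associative organisation and the Round Robin replacement policy described above can be sketched as a small hit/miss simulator. This is a simplified model under stated assumptions: the 16 KiB size and 32-byte line length are illustrative defaults, the class name SetAssociativeCache is hypothetical, and details such as write policy, the Micro TLB and security state are left out.

```python
class SetAssociativeCache:
    """Toy set-associative cache with round-robin replacement."""

    def __init__(self, size_bytes=16 * 1024, ways=4, line_bytes=32):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One tag slot per way in every set; None marks an empty way.
        self.tags = [[None] * ways for _ in range(self.num_sets)]
        # Round-robin pointer: the way to replace next, kept per set.
        self.next_victim = [0] * self.num_sets

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if tag in self.tags[index]:
            return True
        way = self.next_victim[index]
        self.tags[index][way] = tag
        self.next_victim[index] = (way + 1) % self.ways
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache()
    print(cache.num_sets)   # 128
    print(cache.access(0))  # False: first touch is a miss
    print(cache.access(4))  # True: same 32-byte line
```

Addresses that are a multiple of num_sets * line_bytes apart compete for the same set, so a fifth such address evicts the line that was filled first.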
Branch Prediction and Folding (Concept)
i. The processor handles branches on first-time execution, when no history is available for dynamic prediction by the prefetch unit.
ii. Integer Core (IC): Uses static branch prediction and return stack. Prefetch Unit (PU): Uses dynamic branch prediction.
iii. When a branch is resolved, the PU receives information from the IC and either allocates space in the Branch Target Address Cache (BTAC) or updates an entry.
iv. Branches are resolved at or before the third execution stage.
Dynamic Branch Prediction
i. Uses a Branch Target Address Cache (BTAC) as the first line of branch prediction to hold virtual target addresses.
ii. Prediction history of a branch is stored as a two-bit value in the BTAC.
iii. BTAC is a 128-entry direct-mapped cache structure.
iv. Two-bit values represent the following four states.
a. Strong predict branch taken.
b. Weak predict branch taken.
c. Strong predict branch not taken.
d. Weak predict branch not taken.
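The four two-bit states listed above are commonly implemented as a saturating counter. The Python sketch below assumes that textbook scheme: the state moves one step towards "strong taken" when a branch is taken and one step towards "strong not taken" otherwise. The text above only names the states, so the transition rule, the state encoding and all function names here are assumptions.

```python
# Assumed encoding of the four prediction states as a 2-bit counter.
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = range(4)


def predict(state):
    """Predict taken when the counter is in one of the two taken states."""
    return state >= WEAK_TAKEN


def update(state, taken):
    """Move one step towards the actual outcome, saturating at the ends."""
    if taken:
        return min(state + 1, STRONG_TAKEN)
    return max(state - 1, STRONG_NOT_TAKEN)


def run(outcomes, state=WEAK_NOT_TAKEN):
    """Return (number of correct predictions, final state)."""
    correct = 0
    for taken in outcomes:
        if predict(state) == taken:
            correct += 1
        state = update(state, taken)
    return correct, state


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through once.
    print(run([True] * 9 + [False]))  # (8, 2)
```

With two bits of history, the single not-taken outcome at the loop exit only weakens the prediction, so the branch is still predicted taken the next time the loop is entered.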
Static Branch Prediction
i. Second level of branch prediction in the processor is static branch prediction, which is based on the characteristics of the branch instruction.
ii. Uses no history information.
iii. ARM1176JZF-S predicts all forward conditional branches not taken and all backward branches taken.
iv. Added to mitigate the trouble experienced by the miss when first encountering the branch by the predictor.
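The static rule above needs only the direction of the branch, which a few lines of Python can express. The function name static_predict_taken and its parameters are hypothetical. Treating a forward unconditional branch as taken is an assumption added for completeness, since an unconditional branch always transfers control.

```python
def static_predict_taken(branch_address, target_address, conditional=True):
    """Static prediction: backward branches taken, forward conditional not.

    A branch is treated as backward when its target is at or below its
    own address, which is the usual shape of a loop.
    """
    if target_address <= branch_address:
        return True
    # Forward branch: only an unconditional one is assumed taken.
    return not conditional


if __name__ == "__main__":
    print(static_predict_taken(0x1000, 0x0F00))  # True: loop back-edge
    print(static_predict_taken(0x1000, 0x1040))  # False: forward skip
```

The rule works well in practice because backward branches usually close loops, which are taken on every iteration except the last.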
Branch Folding
i. Technique where the branch instruction is removed from the pipeline and is stored in a buffer, which is executed on all dynamic predicted branches.
ii. Can improve the branch CPI to under 1.
iii. Technique not done on the following:
a. BL and BLX instructions, to avoid losing the link. (Branch with link/exchange instruction set).
b. Predicted branches that lead directly to another branch.
c. Branches that have been cancelled when fetched.
5. GPU Overview
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose Central Processing Units (CPUs) for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, they are embedded on the CPU die.
Modern GPUs use most of their transistors to do calculations related to 3D computer graphics. The GPU is the Broadcom VideoCore IV, which runs at 250 MHz and supports OpenGL ES 2.0 (24 GFLOPS), MPEG-2 and VC-1. It also includes a 1080p30 H.264/MPEG-4 AVC decoder/encoder.
VideoCore is a low-power mobile multimedia processor originally developed by Alphamosaic Ltd and now owned by Broadcom. Its two-dimensional DSP architecture makes it flexible and efficient enough to decode (as well as encode) a number of multimedia codecs in software while maintaining low power usage.
Exercises
A. Choose correct option from the following:
1. The ______ hardware-software approach makes the SoC compact in size, allows for less power consumption, and makes it more reliable than a standard multi-chip system.
a. Integration b. Design
c. Isolation d. None
C. Microcontroller d. None
C. Assembly d. None