Embedded System Presentation
NoC introduction & properties
NoC buffered flow control
Routing algorithms
Application specialization
Using Virtex 4 configuration network as a high-speed MetaWire data network
  What is MetaWire and why use it?
  Architecture of MetaWire
  MetaWire performance
Implementation and Application Exploration for Network on Chip
  DES algorithm
  NoC implementation
  DES key search
  Architectural details
  Results
The System-on-Chip (SoC) today
  Heterogeneous: ~10 IPs
  Homogeneous (MP-SoC): ~10 uP (with exceptions)
  On-chip bus (AMBA, CoreConnect, Wishbone, ...)
  IP and uP are sold with proprietary bus interfaces
Near- and long-term forecast
  100 IP/uP: buses are not scalable!
  Physical design issues: signal integrity, power consumption, timing closure
  Clock issues: is it time for the Globally Asynchronous, Locally Synchronous (GALS) paradigm? (Still locally synchronous)
  Need for more regular design
[Figure: bus-based SoC with CPU, DMA, DSP, and MEM blocks]
Poor wire scaling
  Interconnect power and delay become more dominant as the technology improves
High performance
Energy efficiency
  The communication architecture takes a large proportion of the energy budget
[Figure labels: System Bus, Mem Ctrl., Bridge, MPEG]
Pentium 4 had two dedicated drive stages to transport signals across chip
Architectural paradigm shift: replace wire spaghetti by an intelligent network infrastructure
Design paradigm shift: busses and signals replaced by packets
Organizational paradigm shift: create a new discipline, a new infrastructure responsibility
Bus-based architectures
Irregular architectures
Regular Architectures
Networks on Chip
[Figure: interconnected modules]
2) Increase system integration productivity
3) Enable Multi Processors for SoCs
[Figure: asymptotic cost scaling, in the number of modules n, for four interconnect architectures: NoC, Simple Bus, Point-to-Point, and Segmented Bus]
E. Bolotin et al., "Cost Considerations in Network on Chip," Integration, special issue on Network on Chip, October 2004
Layered approach
Layers: Software, Transport, Network, Wiring
Networking: separation of concerns, traffic modeling, architectures, queuing theory
[Figure: processing elements (PE) connected through routers; each router contains buffers, routing logic, and arbitration logic]
Routing Algorithms
Complex routing schemes consume more device area (complex routing/arbitration logic)
Additional latency for channel setup/release
Deadlocks must be avoided
Deadlock can occur if it is impossible for any messages to move (without discarding one).
Buffer deadlock occurs when all buffers are full in a store and forward network. This leads to a circular wait condition, each node waiting for space to receive the next message. Channel deadlock is similar, but will result if all channels around a circular path in a wormhole-based network are busy (recall that each node has a single buffer used for both input and output).
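The circular wait condition can be checked mechanically. The following is a minimal sketch, assuming the network state is summarized as a hypothetical wait-for graph (`waits_for`, mapping each node to the nodes whose buffer or channel it is waiting on). The function name and the graph encoding are illustrative and are not taken from the slides.

```python
def has_circular_wait(waits_for):
    """Return True if the wait-for graph contains a cycle (a potential deadlock)."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: we returned to a node on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Four nodes in a ring, each waiting for space in the next one: circular wait.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
# Same nodes, but node 3 waits on nobody, so the chain can drain.
chain = {0: [1], 1: [2], 2: [3], 3: []}
```

In this sketch `has_circular_wait(ring)` reports a cycle and `has_circular_wait(chain)` does not, which mirrors the difference between a deadlocked ring of full buffers and a path that can still make progress.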
QoS, fault-tolerance
In X-Y routing, the route is determined completely by the source and destination addresses. The message first travels horizontally (in the X-dimension) from the source node to the column containing the destination, and then travels vertically to the destination.
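The rule can be written in a few lines. This is a minimal sketch, assuming a 2-D mesh whose nodes are addressed by hypothetical `(x, y)` coordinate pairs; the function name `xy_route` is illustrative.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst, X dimension first."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:  # travel horizontally to the destination column
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:  # then travel vertically to the destination row
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
```

Because the path depends only on the two addresses, every router can make its decision locally without any global state.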
There are four possible direction pairs: east-north, east-south, west-north, and west-south. Advantages of X-Y routing:
Buffers
[Figure: time-space diagram of a packet (H = head, B = body, T = tail) crossing nodes 0 to 3]
T0 = H(Tr + L/b)
Cut-through
2. Cut-through flow control: each node starts to forward the packet without waiting for the whole packet to arrive. Cut-through is the more efficient approach.
1) Good performance
2) Large buffer sizes, which consume more power
Suppose the packet gets stuck in the middle:
[Figure: two time-space diagrams of a packet (H = head, B = body, T = tail) crossing nodes 0 to 3 when it is blocked at an intermediate node]
T0 = H × Tr + L/b
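The two latency formulas can be compared numerically. This is a sketch under stated assumptions: the slides do not define the symbols, so H is read as the hop count, Tr as the per-hop router delay, L as the packet length, and b as the channel bandwidth, and the earlier formula T0 = H(Tr + L/b) is assumed to describe the store-and-forward case. The numbers are hypothetical.

```python
def store_and_forward_latency(hops, t_router, length_bits, bandwidth):
    """T0 = H * (Tr + L/b): the whole packet is received at each hop before moving on."""
    return hops * (t_router + length_bits / bandwidth)


def cut_through_latency(hops, t_router, length_bits, bandwidth):
    """T0 = H * Tr + L/b: the serialization delay L/b is paid only once."""
    return hops * t_router + length_bits / bandwidth


# Hypothetical numbers: 3 hops, 2-cycle routers, 128-bit packet, 32 bits per cycle.
saf = store_and_forward_latency(3, 2, 128, 32)  # 3 * (2 + 4) = 18 cycles
vct = cut_through_latency(3, 2, 128, 32)        # 3 * 2 + 4 = 10 cycles
```

The gap between the two grows with both the hop count and the packet length, since store-and-forward pays L/b at every hop.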
Wormhole routing divides a packet into smaller fixed-size pieces called flits (flow control digits). The first flit in the packet must contain (at least) the destination address, so the size of a flit must be at least log2 N bits in an N-core SoC. Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through intermediate routers.
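The flit-size bound and the split into flits can be sketched as follows. The representation is an assumption made for illustration: the payload is a string of '0'/'1' characters, flits are tagged "H" (header), "B" (body), or "T" (tail), and the last payload flit is zero-padded. None of these names come from the slides.

```python
def min_flit_bits(num_cores):
    """Smallest flit width (in bits) that can hold one of num_cores addresses."""
    return max(1, (num_cores - 1).bit_length())  # ceil(log2 N) for N >= 2


def to_flits(dest, payload, flit_bits, num_cores):
    """Split a bit-string payload into fixed-size flits led by a header flit."""
    if flit_bits < min_flit_bits(num_cores):
        raise ValueError("flit too narrow to hold a destination address")
    if not 0 <= dest < num_cores:
        raise ValueError("destination out of range")
    flits = [("H", format(dest, "b").zfill(flit_bits))]  # header carries the address
    for i in range(0, len(payload), flit_bits):
        flits.append(("B", payload[i:i + flit_bits].ljust(flit_bits, "0")))
    if len(flits) > 1:
        flits[-1] = ("T", flits[-1][1])  # mark the last payload flit as the tail
    return flits


flits = to_flits(dest=5, payload="1011001110", flit_bits=4, num_cores=16)
```

Only the header flit carries routing information; the body and tail flits simply follow it through the routers in order.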
No fairness is guaranteed, since router arbitration is based on local state.
The farther the source is from the destination, the more arbitrations its worm has to win.
The hot module (HM) bandwidth isn't fairly shared.
Parameters allow customization. Parameters: buffer depth, number of virtual channels, NoC size, etc.
Buffers, routing, topology, mapping to topology, implementation and reuse, QoS support, topology, gossiping architectures
Architecture Optimization
Fault tolerance
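[Figure: application task graph with nodes SRC, FFT, FIR, and matrix; edges labeled with communication volumes 15000, 4000, 15000, 82500, and 4000]
Each node has computation properties.
Directed edges describe task dependences.
Edge properties include the communication volume.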
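A task graph of this kind can be used to score a candidate placement. The sketch below is illustrative only: the node names and volumes echo the figure, but which volume belongs to which edge is an assumption, and the cost metric (communication volume times Manhattan hop distance on a mesh) is a commonly used stand-in rather than the method of the slides.

```python
# Hypothetical edges: the pairing of volumes with edges is assumed for illustration.
TASK_GRAPH = {
    ("SRC", "FFT"): 15000,
    ("SRC", "FIR"): 4000,
    ("FFT", "matrix"): 82500,
    ("FIR", "matrix"): 4000,
}


def mapping_cost(graph, placement):
    """Sum of communication volume times Manhattan hop distance over all edges."""
    cost = 0
    for (src, dst), volume in graph.items():
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        cost += volume * (abs(x1 - x2) + abs(y1 - y2))
    return cost


# Two candidate placements on a 2x2 mesh of routers.
placement_a = {"SRC": (0, 0), "FFT": (1, 0), "FIR": (0, 1), "matrix": (1, 1)}
placement_b = {"SRC": (0, 0), "FFT": (1, 1), "FIR": (0, 1), "matrix": (1, 0)}
cost_a = mapping_cost(TASK_GRAPH, placement_a)
cost_b = mapping_cost(TASK_GRAPH, placement_b)
```

Under these assumptions placement A scores lower than placement B because every communicating pair sits one hop apart, which is the kind of comparison a placement loop would repeat.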
[Figure: iterative design flow with steps "Place modules", a "Good?" check (looping back on "No"), "Synthesis", and "Optimized NoC"; three "Place modules" panels show arrangements of modules and routers (R)]
Optimize capacity for performance/power tradeoff
Capacity allocation is a traditional WAN optimization problem, however:
A SoC-like system with realistic traffic demands and delay requirements
  Classic design: 41.8 Gbit/sec
  Using the developed NoC algorithm: 28.7 Gbit/sec
  Total capacity reduced by 30%
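As a quick check of the quoted saving, the relative reduction follows directly from the two capacities; the variable names below are illustrative.

```python
classic_gbps = 41.8    # total capacity of the classic design
optimized_gbps = 28.7  # total capacity after capacity allocation
reduction = (classic_gbps - optimized_gbps) / classic_gbps
```

The result is about 0.31, which the slide rounds to 30%.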
Before optimization
After optimization
Some components
Static energy, i.e. leakage power (a problem of increasing importance)
Clock energy: flip-flops and latches need to be clocked
Can consume 50-80% of the total communication architecture energy, depending on the size and depth of the FIFOs. A major problem in NoCs.
Routers
[Figure: NoC in which routers (R) connect blocks labeled FR and CR, including a DSP, a CPU, and DRAM, through CNI blocks]
There is an advanced wiring network present whose only purpose is to download configuration information.
This presentation topic is centered on the Xilinx Virtex-4 FPGA which is a reconfigurable device. Theoretically, any reconfigurable device can use these concepts as long as there is a link between the configuration circuitry and the logic level.
Caveat: gaining access to low-level FPGA functions may not be supported by development software.
Architecture Basics
FPGAs are volatile devices which are composed of many RAM elements known as Look Up Tables (LUT).
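A LUT can be modeled as a small table indexed by its inputs. This is a minimal software sketch, not a description of the Virtex-4 hardware; `make_lut` and the bit ordering (first input is the most significant index bit) are assumptions for illustration.

```python
def make_lut(truth_table):
    """Return a function that looks up its output in a stored table of bits.

    truth_table[i] is the output for the inputs whose bits, read as a binary
    number with the first input most significant, equal i.
    """
    n_inputs = (len(truth_table) - 1).bit_length()
    if len(truth_table) != 1 << n_inputs:
        raise ValueError("truth table length must be a power of two")

    def lut(*inputs):
        if len(inputs) != n_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return truth_table[index]

    return lut


xor2 = make_lut([0, 1, 1, 0])  # same structure, different stored contents
and2 = make_lut([0, 0, 0, 1])
```

Loading different table contents changes the logic function without changing the structure, which is what downloading a configuration does to the device.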
Many FPGAs also have built in specialized blocks such as multipliers and floating point units.
VHDL Verilog
Nearly any digital circuit can be synthesized by specifying the architecture. The required logic gates (logic blocks in the FPGA) are connected with on-chip interconnects via the configuration network.
Synthesis time on the development system can be greatly reduced for large designs. This may help alleviate bottlenecks in the interconnecting grid. It also reduces extra buffers, latches, etc., since these are already built into the configuration network, saving area for additional logic.
The configuration network is already fully addressable and synchronous across the chip.
Addressing scheme already has NoC written all over it. Synchronous feature allows data to be sent in single cycles with guaranteed minimal race condition effects.
MetaWire Controller
Single purpose controller for arbitrating data transfers. Somewhat similar to a DMA controller.
Performance
Both throughput and latency equations are derived from timing diagrams.
Final Verification
Outline
Application
Virtual Channel NoC
Simple NoC
NoC Layout
DES key Search Engine
Results.
Designed by IBM; standardized in 1977.
Uses a 56-bit key (stored as 64 bits, 8 of which are parity bits for error checking) and a 64-bit block.
Encrypts plaintext in blocks of 64 bits.
Replaced by Triple DES.
Given a known plaintext-ciphertext pair (P, C), find the DES key or keys that encrypt P to produce C. For DES there are 2^56 keys in the search space.
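The structure of an exhaustive search can be sketched with a toy cipher. Everything here is a labeled stand-in: `toy_encrypt` is a hypothetical 16-bit function that is NOT DES, and the `workers` parameter only illustrates how the key space could be sliced among independent search units; the slides do not specify either.

```python
def toy_encrypt(plaintext, key):
    """Toy 16-bit stand-in for a block cipher; NOT DES and not secure."""
    x = (plaintext ^ key) & 0xFFFF
    x = ((x << 3) | (x >> 13)) & 0xFFFF  # rotate left by 3 bits
    return (x + key) & 0xFFFF


def search_range(plaintext, ciphertext, start, stop):
    """Return every key in [start, stop) that maps plaintext to ciphertext."""
    return [k for k in range(start, stop) if toy_encrypt(plaintext, k) == ciphertext]


def key_search(plaintext, ciphertext, key_bits=16, workers=4):
    """Split the key space into slices, as independent search units would."""
    space = 1 << key_bits
    step = space // workers
    found = []
    for w in range(workers):
        stop = space if w == workers - 1 else (w + 1) * step
        found.extend(search_range(plaintext, ciphertext, w * step, stop))
    return found


secret = 0xBEEF
p = 0x1234
c = toy_encrypt(p, secret)
keys = key_search(p, c)  # every key consistent with the pair (p, c)
```

With a 56-bit key the same loop would cover 2^56 candidates, which is why the search is spread over many parallel units and measured in keys per second.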
DES Algorithm
Sixteen 48-bit subkeys are derived from the original 56-bit key.
The 56-bit key is permuted (PC1), then divided into two 28-bit halves that are treated separately thereafter.
Each 28-bit half is rotated left by 1 or 2 bits (specified for each round).
The two 28-bit halves are combined and permuted, and a 48-bit subkey is selected.
The plaintext is passed through 16 rounds, each using one subkey, resulting in the ciphertext.
An initial permutation is applied at the beginning, and a 32-bit swap and the inverse initial permutation are applied at the end.
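The subkey derivation described above can be sketched in software. The PC1 table, the second table (commonly called PC2), and the per-round shift amounts are the standard DES constants; the function names and the integer bit representation (bit 1 = most significant) are choices made for this illustration, and the sketch covers only the key schedule, not the 16 encryption rounds.

```python
PC1 = [57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18,
       10, 2, 59, 51, 43, 35, 27, 19, 11, 3, 60, 52, 44, 36,
       63, 55, 47, 39, 31, 23, 15, 7, 62, 54, 46, 38, 30, 22,
       14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]
PC2 = [14, 17, 11, 24, 1, 5, 3, 28, 15, 6, 21, 10,
       23, 19, 12, 4, 26, 8, 16, 7, 27, 20, 13, 2,
       41, 52, 31, 37, 47, 55, 30, 40, 51, 45, 33, 48,
       44, 49, 39, 56, 34, 53, 46, 42, 50, 36, 29, 32]
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]  # left rotation per round


def permute(value, width, table):
    """Select bits of a width-bit value in table order (bit 1 = most significant)."""
    out = 0
    for pos in table:
        out = (out << 1) | ((value >> (width - pos)) & 1)
    return out


def rotl28(x, n):
    """Rotate a 28-bit value left by n bits."""
    return ((x << n) | (x >> (28 - n))) & 0xFFFFFFF


def des_subkeys(key64):
    """Derive the sixteen 48-bit subkeys from a 64-bit key (parity bits included)."""
    cd = permute(key64, 64, PC1)       # PC1 drops the 8 parity bits: 56 bits remain
    c, d = cd >> 28, cd & 0xFFFFFFF    # two 28-bit halves, handled separately
    subkeys = []
    for shift in SHIFTS:
        c, d = rotl28(c, shift), rotl28(d, shift)
        subkeys.append(permute((c << 28) | d, 56, PC2))  # select 48 of the 56 bits
    return subkeys


subkeys = des_subkeys(0x133457799BBCDFF1)
```

Because each subkey depends only on the key, a key search engine can compute the schedule once per candidate key and reuse it across all 16 rounds.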
Source: Graham Schelle and Dirk Grunwald, "Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads," Department of Computer Science, University of Colorado at Boulder, Boulder, CO
NoC Implementation.
Physical channel: multiple lanes so that packets can bypass one another
Node arbitration: arbitration for outgoing virtual channel allocation and switch allocation
Node switch: multiple simultaneous paths of communication
Simple NoC
Shrinking the physical channel: simple one-word FIFO
Shrinking the node arbitration: no virtual channel allocation; less side-band state and signaling
Shrinking the node switch: 1 switching decision
Source: Graham Schelle and Dirk Grunwald, "Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads," Department of Computer Science, University of Colorado at Boulder, Boulder, CO
Results
Application performance metric: keys generated per second.
Implementation performance:
  The simple NoC performs better when the network load is less than 15%.
  The virtual channel NoC degrades more gracefully as load increases, while the simple NoC's performance falls off rapidly.
Source: Graham Schelle and Dirk Grunwald, "Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads," Department of Computer Science, University of Colorado at Boulder, Boulder, CO