Embedded System Presentation

Networks on Chip: a quick introduction
Abelardo Jara, Jared Bevis, Abraham Sanchez. March 23rd, 2009
Outline - NoC Introduction
NoC introduction and properties
NoC buffered flow control
Routing algorithms
Application specialization
Using the Virtex-4 configuration network as a high-speed MetaWire data network
What is MetaWire and why use it?
Architecture of MetaWire
MetaWire performance
Implementation and Application Exploration for Network on Chip
DES algorithm
NoC implementation
DES key search architectural details
Results
Today's heterogeneous SoCs
The System-on-Chip (SoC) today:
Heterogeneous: ~10 IPs
Homogeneous (MP-SoC): ~10 uP (with exceptions)
On-chip bus (AMBA, CoreConnect, Wishbone, ...)
IPs and uPs are sold with proprietary bus interfaces
Near- and long-term forecast: 100 IPs/uPs. Buses do not scale!
Physical design issues: signal integrity, power consumption, timing closure
Clock issues: is it time for the Globally Asynchronous, Locally Synchronous (GALS) paradigm? (Each domain is still locally synchronous.)
Need for more regular design
[Figure: CPU, DMA, DSPs, memory, dedicated IP (MPEG) and I/O attached to an interconnection network (bus), grouped into locally synchronous clock domains]
Computation vs Communication: A growing gap

Source: Kanishka Lahiri 2004
Focus on communication-centric design

Poor wire scaling: interconnect power and delay become more dominant as the technology improves
High performance
Energy efficiency: the communication architecture takes a large proportion of the energy budget
The SoC nightmare

[Figure: CPU, DMA, DSP, MPEG and memory controller blocks on a system bus, connected through a bridge to a peripheral bus, with dedicated control wires between blocks]
The "Board-on-a-Chip" approach
The architecture is tightly coupled
Source: Prof. Jan Rabaey, CS-252-2000, UC Berkeley
SoC Design Trends
MPSoC: STI Cell
Eight Synergistic Processing Elements Ring-based Element Interconnect Bus
128-bit, 4 concentric rings
Interconnect delays have become important
Pentium 4 had two dedicated drive stages to transport signals across chip
Source: Pham et al ISSCC 2005
Evolution or Paradigm Shift?

[Figure: a bus next to a network of computing modules, network routers and network links]
Architectural paradigm shift: replace the wire spaghetti with an intelligent network infrastructure
Design paradigm shift: buses and signals are replaced by packets
Organizational paradigm shift: create a new discipline, a new infrastructure responsibility
Bus vs Networks-on-Chip (NoCs)
Bus-based architectures, irregular architectures, regular architectures
Bus-based interconnect

Low cost
Easier to implement
Flexible
Networks on Chip

Layered approach
Buses replaced with networked architectures
Better electrical properties
Higher bandwidth
Energy efficiency
Scalable
Better electrical properties and System Integration

1) Efficient interconnect: delay, power, noise, scalability, reliability
2) Increase system integration productivity
3) Enable multiprocessors for SoCs
[Figure: groups of interconnected modules]
Scalability: Area and Power in NoCs

For the same performance, compare the wire area and power of each interconnect for n modules:

Architecture      Wire area            Power
NoC               O(n)                 O(n)
Simple bus        O(n^3 \sqrt{n})      O(n \sqrt{n})
Segmented bus     O(n^2 \sqrt{n})      O(n \sqrt{n})
Point-to-point    O(n^2 \sqrt{n})      O(n \sqrt{n})

E. Bolotin et al., "Cost Considerations in Network on Chip", Integration, special issue on Network on Chip, October 2004
Layered approach
Software
Transport
Network
Wiring
Separation of concerns
Networking, traffic modeling, architectures, queuing theory
Regular Network on Chip
[Figure: regular grid of processing elements (PE), each attached to a router]
Typical NoC Router

[Figure: router with a buffer on each port, a crossbar switch, and routing and arbitration logic]
This example uses a centralized arbiter for all I/O ports
Distributed arbitration can also be used
Routing Algorithms
NoC routing algorithms should be simple
Complex routing schemes consume more device area (complex routing/arbitration logic)
They add latency for channel setup/release
Deadlocks must be avoided
Deadlock occurs when no message can move without one being discarded.
Buffer deadlock occurs when all buffers are full in a store-and-forward network. This leads to a circular wait condition, with each node waiting for space to receive the next message. Channel deadlock is similar, but results when all channels around a circular path in a wormhole-based network are busy (recall that each node has a single buffer used for both input and output).
Some additional features are highly desirable
QoS, fault tolerance
Routing in a 2D-mesh NoC: XY routing
In X-Y routing, the route is determined completely by the source and destination addresses. The message travels horizontally (in the X dimension) from the source node to the column containing the destination, and then travels vertically (in the Y dimension) to the destination.
The X direction is determined first, then the Y direction
There are four possible direction pairs: east-north, east-south, west-north, and west-south.
Advantages of X-Y routing:

Very simple to implement
Deterministic
Deadlock-free
X-Y Routing Example
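As an illustration of the rule above, here is a minimal Python sketch of X-Y routing on a 2D mesh. The coordinate convention (x grows to the east, y grows to the north), the port names and the function names are assumptions made for this example, not taken from the slides.

```python
# Hypothetical port names and coordinate convention for a 2D mesh.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_hop(cur, dst):
    """Pick the output port at router `cur` for a packet going to `dst`.

    The X offset is resolved first; Y is considered only once the packet
    is already in the destination column.
    """
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "LOCAL"  # arrived: deliver to the attached module


def xy_route(src, dst):
    """Return the list of routers visited from `src` to `dst`."""
    path = [src]
    cur = src
    while cur != dst:
        dx, dy = STEP[xy_next_hop(cur, dst)]
        cur = (cur[0] + dx, cur[1] + dy)
        path.append(cur)
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because every router makes the same decision from the addresses alone, no routing table or per-packet state is needed, which is what keeps the router logic small.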
NoC Buffered Flow Control

1. Store & Forward
2. Cut-through
3. Wormhole
4. Virtual Channel
Store & Forward

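1. Store & Forward Flow Control:
Each node receives the whole packet before sending it on to the next node.
[Figure: timing diagram of a packet (head H, body B, tail T) crossing buffers 0 to 3; each hop starts only after the previous hop has received the tail]
T0 = H(Tr + L/b)
where H is the number of hops, Tr the router delay per hop, L the packet length and b the channel bandwidth.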
Cut-through
2. Cut-through Flow Control:
Each node starts to send the packet without waiting for the whole packet to arrive.
Cut-through is the more efficient approach.
1) Good performance
2) Large buffer sizes, so it consumes more power
Suppose the packet gets stuck in the middle of the path
[Figure: timing diagrams of a packet (H, B, T) crossing buffers 0 to 3, first without contention and then with a stall while the next channel is not ready]
T0 = H x Tr + L/b
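The two zero-load latency formulas can be compared numerically. The sketch below assumes the usual reading of the symbols (H hops, Tr router delay per hop, L packet length, b channel bandwidth); the sample values are made up for illustration.

```python
def store_and_forward_latency(hops, t_router, length_bits, bandwidth):
    """T0 = H * (Tr + L/b): the whole packet is serialized at every hop."""
    return hops * (t_router + length_bits / bandwidth)


def cut_through_latency(hops, t_router, length_bits, bandwidth):
    """T0 = H * Tr + L/b: the packet is serialized only once, end to end."""
    return hops * t_router + length_bits / bandwidth


if __name__ == "__main__":
    # Illustrative numbers: 4 hops, 2 cycles per router,
    # a 128-bit packet on 32-bit-per-cycle channels.
    H, Tr, L, b = 4, 2, 128, 32
    print("store & forward:", store_and_forward_latency(H, Tr, L, b), "cycles")
    print("cut-through:    ", cut_through_latency(H, Tr, L, b), "cycles")
```

With these numbers the serialization time L/b is paid four times under store and forward but only once under cut-through, so the gap widens with longer packets and longer paths.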
Flits and Wormhole Routing
Wormhole routing divides a packet into smaller fixed-size pieces called flits (flow control digits).
The first flit in the packet must contain (at least) the destination address. Thus the size of a flit must be at least log2 N bits in an N-core SoC.
Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through the intermediate routers.
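A small sketch of the flit idea, under assumptions: one payload word per flit, and the flit kinds named H (head), B (body) and T (tail) as in the timing diagrams of the earlier slides. The function names are hypothetical.

```python
def min_address_bits(num_cores):
    """Smallest number of bits that can address `num_cores` distinct cores,
    i.e. ceil(log2(N)), computed with integers only."""
    return max(1, (num_cores - 1).bit_length())


def packetize(dest, words, num_cores, flit_bits):
    """Split a packet into a head flit carrying the destination address,
    followed by body flits and a final tail flit.

    `words` is the payload, one word per flit (an assumption of this sketch).
    """
    if flit_bits < min_address_bits(num_cores):
        raise ValueError("flit too narrow to hold the destination address")
    if not words:
        raise ValueError("payload must contain at least one word")
    flits = [("H", dest)]
    flits += [("B", w) for w in words[:-1]]
    flits.append(("T", words[-1]))
    return flits


if __name__ == "__main__":
    print(min_address_bits(64))
    print(packetize(dest=5, words=[10, 11, 12], num_cores=16, flit_bits=8))
```

Only the head flit carries routing information; the body and tail flits simply follow the path the head has opened, which is why they must stay in sequence.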
Store and Forward vs. Wormhole
Blocking condition in a wormhole router

[Figure: IP blocks and the hot module (HM) interface]
No fairness is guaranteed, since router arbitration is based on local state
The further a source is from the destination, the more arbitrations its worm has to win
The hot module (HM) bandwidth isn't shared fairly
A simple solution: Virtual Channels

Solution 1: Time multiplexing

Input a: a1 a2 a3 a4 ... an
Input b: b1 b2 b3 b4 ... bn
Interleaved: a1 b1 a2 b2 a3 b3 a4 b4 ... an bn
Winner takes all: a1 a2 a3 a4 ... an b1 b2 b3 b4 ... bn
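The two ways of sharing one physical link between two inputs can be sketched in a few lines of Python. This is only an illustration of the ordering on the link; the function names are made up, and real routers decide flit by flit based on buffer state.

```python
from itertools import zip_longest


def interleaved(a, b):
    """Time-multiplex the link: alternate flits of the two inputs."""
    out = []
    for x, y in zip_longest(a, b):
        if x is not None:
            out.append(x)
        if y is not None:
            out.append(y)
    return out


def winner_takes_all(a, b):
    """The first packet keeps the link until its last flit has passed."""
    return list(a) + list(b)


if __name__ == "__main__":
    a = ["a1", "a2", "a3"]
    b = ["b1", "b2", "b3"]
    print(interleaved(a, b))
    print(winner_takes_all(a, b))
```

Interleaving lets both packets make progress, so a short packet is not held up behind a long one, at the cost of keeping per-packet state (one virtual channel each) in the router.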
Solution 2: Additional I/O ports
Optimizing a NoC for a particular application
Given a particular application, can we optimize a NoC for it?
The NoC architecture has to be flexible and parametric

Parameters allow customization
Parameters: buffer depth, number of virtual channels, NoC size, etc.
Application Specific Optimization
Buffers
Routing
Topology
Mapping to topology
Implementation and reuse
Architecture Optimization
QoS support
Topology
Gossiping architectures
Fault tolerance
How is an application described?
Few multiprocessor embedded benchmarks exist
Task graphs
Extensively used in scheduling research
Each node has computation properties (e.g. execution time per processor: ARM 2.5 ms, PPC 2.2 ms)
A directed edge describes a task dependence
Edge properties include the communication volume
[Figure: task graph from SRC to SINK through FFT, FIR, matrix, angle and IFFT tasks, with edges labelled by communication volume (4000, 15000, 40000, 82500)]
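A task graph of this kind is easy to hold in plain dictionaries. The sketch below uses hypothetical task names, execution times and volumes (the connectivity of the slide's figure cannot be recovered from the text), and shows two things a NoC designer would derive from it: a valid execution order and the traffic between cores for a given mapping. It relies on `graphlib`, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter

# Node properties: execution time in ms per processor type (hypothetical).
tasks = {
    "src":   {"ARM": 0.5, "PPC": 0.4},
    "filt":  {"ARM": 2.5, "PPC": 2.2},
    "xform": {"ARM": 3.0, "PPC": 2.6},
    "sink":  {"ARM": 0.5, "PPC": 0.4},
}

# Edge properties: (producer, consumer) -> communication volume (hypothetical).
edges = {
    ("src", "filt"): 4000,
    ("src", "xform"): 15000,
    ("filt", "sink"): 4000,
    ("xform", "sink"): 15000,
}


def schedule_order(tasks, edges):
    """Return one execution order that respects every task dependence."""
    preds = {t: set() for t in tasks}
    for producer, consumer in edges:
        preds[consumer].add(producer)
    return list(TopologicalSorter(preds).static_order())


def traffic_between_cores(mapping, edges):
    """Sum the volume exchanged between each ordered pair of cores.

    Edges whose two tasks share a core generate no network traffic.
    """
    traffic = {}
    for (producer, consumer), volume in edges.items():
        a, b = mapping[producer], mapping[consumer]
        if a != b:
            traffic[(a, b)] = traffic.get((a, b), 0) + volume
    return traffic


if __name__ == "__main__":
    print(schedule_order(tasks, edges))
    print(traffic_between_cores({"src": 0, "filt": 0, "xform": 1, "sink": 1}, edges))
```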
Communication Centric Design

[Figure: design loop. The application and an architecture library feed an architecture/application model; NoC optimisation repeats configure, evaluate, analyse/profile and refine while the result is not good; once it is, synthesis produces the optimized NoC]
NoC Design Flow

Extract intermodule traffic
Place modules
Allocate link capacities
Verify QoS and cost
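The first three steps of the flow can be sketched for a mesh with X-Y routing. This is a deliberately naive capacity allocation (each link gets the sum of the bandwidths routed over it, times an optional headroom factor), not the algorithm behind the results quoted in these slides; module names, placement and bandwidths are hypothetical.

```python
from collections import defaultdict


def xy_links(src, dst):
    """Directed links crossed by a packet routed X first, then Y."""
    x, y = src
    links = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def allocate_capacities(flows, placement, headroom=1.0):
    """Give every link the total bandwidth of the flows that cross it.

    flows:     (source module, destination module) -> required bandwidth
    placement: module -> (x, y) router coordinates
    """
    load = defaultdict(float)
    for (src, dst), bandwidth in flows.items():
        for link in xy_links(placement[src], placement[dst]):
            load[link] += bandwidth
    return {link: bw * headroom for link, bw in load.items()}


if __name__ == "__main__":
    flows = {("cpu", "mem"): 2.0, ("dsp", "mem"): 1.0}          # step 1: traffic
    placement = {"cpu": (0, 0), "dsp": (1, 0), "mem": (1, 1)}   # step 2: placement
    caps = allocate_capacities(flows, placement)                # step 3: capacities
    print(caps, "total:", sum(caps.values()))
```

The last step, verifying QoS and cost, would then check the delay of each flow against its requirement and compare the total capacity with the budget.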
[Figure: grids of routers (R) with attached modules, shown for the "Place modules" and "Verify QoS and cost" steps]

Optimize capacity for the performance/power trade-off
Capacity allocation is a traditional WAN optimization problem, however:
Capacity Allocation Realistic Example

A SoC-like system with realistic traffic demands and delay requirements
Classic design: 41.8 Gbit/s
Using the developed NoC algorithm: 28.7 Gbit/s
Total capacity reduced by about 30%
[Figures: link capacities before and after optimization]
Energy Model Limitations
Some components:
Buffering energy: can consume 50-80% of the total communication architecture energy, depending on the size and depth of the FIFOs. A major problem in NoCs.
Static energy, i.e. leakage power (a problem of increasing importance)
Clock energy: flip-flops and latches need to be clocked
Buffering energy is not free
NoC Based FPGA Architecture

[Figure: FPGA built around a NoC. Routers (R) form the inter-routing network; each connects through a configurable network interface (CNI) either to a configurable region (CR) of user logic or to a functional unit (FR) such as a CPU, DSP, DRAM, PCI, SERDES, Ethernet interface, or D/A and A/D converters]
MetaWire: Using FPGA Configuration Circuitry to Emulate a Network-on-Chip

Jared Bevis
When Should I Consider This?
Many FPGAs have reconfigurable architectures.
There is an advanced wiring network present whose only purpose is to download configuration information.
For static designs, this network is unused after initial configuration.
What Resources are Required?
This presentation topic is centered on the Xilinx Virtex-4 FPGA which is a reconfigurable device. Theoretically, any reconfigurable device can use these concepts as long as there is a link between the configuration circuitry and the logic level.
Caveat: gaining access to low-level FPGA functions may not be supported by development software.
Architecture Basics
SRAM-based FPGAs are volatile devices built from many small RAM elements known as look-up tables (LUTs).
Various combinations form what are known as logic blocks.
Many FPGAs also have built-in specialized blocks such as multipliers and floating point units.
These components are connected as specified in a hardware description language such as VHDL or Verilog.
Nearly any digital circuit can be synthesized by specifying the architecture. The required logic gates (logic blocks in the FPGA) are connected by on-chip interconnects, which are set up through the configuration network.
Why use the configuration network if there is already an interconnect network?
Synthesis time on the development system can be greatly reduced for large designs.
It may help alleviate bottlenecks in the interconnecting grid.
It removes the need for extra buffers, latches, etc., since these are already built into the configuration network, thus saving area for additional logic.
Additional Features of MetaWire Network
The configuration network is already fully addressable and synchronous across the chip.
The addressing scheme already has NoC written all over it.
The synchronous feature allows data to be sent in single cycles with guaranteed minimal race condition effects.
Structure of the MetaWire Network
MWI TX and RX Details
MetaWire Controller
Single purpose controller for arbitrating data transfers. Somewhat similar to a DMA controller.
Executes a round-robin scheme of servicing data transfer requests.
Consists of address tables, logic control, and ICAP core.
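The round-robin servicing mentioned above can be modeled in a few lines. This is a generic round-robin arbiter written for illustration; the class and method names are hypothetical and the slides do not describe the MetaWire controller's logic at this level of detail.

```python
class RoundRobinArbiter:
    """Grant one requesting port per call, rotating the starting point so
    that the port served last gets the lowest priority next time."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last = num_ports - 1  # so that port 0 is examined first

    def grant(self, requests):
        """`requests` holds one boolean per port.

        Returns the index of the granted port, or None if nobody requests.
        """
        for offset in range(1, self.num_ports + 1):
            port = (self.last + offset) % self.num_ports
            if requests[port]:
                self.last = port
                return port
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # Ports 0 and 2 keep requesting; port 1 stays idle.
    print([arbiter.grant([True, False, True]) for _ in range(4)])
```

Because the search always starts just after the last granted port, no requester can be starved by a busier neighbour.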
Performance
Both throughput and latency equations are derived from timing diagrams.
Actual Testing Data
Final Verification
Implementation And Application Exploration For Network on Chip

Abraham Sanchez
Paper: Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads. Graham Schelle and Dirk Grunwald. University of Colorado
Outline

Application
Brute force DES key search
DES algorithm
NoC implementation
Virtual channel NoC
Simple NoC
NoC layout
DES key search engine
DES key search architectural details
Results
DES and Brute Force Key Search
Data Encryption Standard (DES)

Designed by IBM and adopted as a federal standard in 1977.
Uses a 56-bit key (stored in 64 bits, 8 of which are parity bits) and a 64-bit block.
Encrypts plaintext in blocks of 64 bits.
Replaced by Triple DES.
Brute Force Key Search

Given a known plaintext-ciphertext pair (P, C), find the DES key or keys that encrypt P to produce C.
For DES there are 2^56 keys in the search space.
DES Algorithm
Sixteen 48-bit subkeys are derived from the original 56-bit key.
The 56-bit key is permuted (PC1), then divided into two 28-bit halves that are treated separately thereafter.
In each round, both 28-bit halves are rotated left by 1 or 2 bits (specified for each round).
The two 28-bit halves are then combined and permuted, and a 48-bit subkey is selected.
The plaintext is passed through 16 rounds, each using one subkey, resulting in the ciphertext.
An initial permutation is applied at the beginning.
A 32-bit swap and an inverse initial permutation are applied at the end.
Source: Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads Graham Schelle and Dirk Grunwald. Department of Computer Science University of Colorado at Boulder Boulder, CO
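The rotation part of the DES key schedule is small enough to sketch. The per-round shift amounts below are the ones defined by the DES standard; the permutations PC1 and PC2 are left out, so this is only a partial illustration and the function names are made up for the example.

```python
# Left-rotation amount for each of the 16 rounds, as defined by DES.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(value, n):
    """Rotate a 28-bit value left by n bits."""
    return ((value << n) | (value >> (28 - n))) & MASK28


def half_key_states(c0, d0):
    """Return the 16 pairs (C_i, D_i) of rotated 28-bit key halves.

    In full DES each pair would then go through PC2 to select the
    48-bit subkey of round i; that step is omitted here.
    """
    states = []
    c, d = c0, d0
    for n in SHIFTS:
        c, d = rotl28(c, n), rotl28(d, n)
        states.append((c, d))
    return states


if __name__ == "__main__":
    states = half_key_states(0x0F0F0F0, 0x1234567)
    # The shifts add up to 28, so the halves return to their initial value.
    print(sum(SHIFTS), states[-1] == (0x0F0F0F0, 0x1234567))
```

Since the rotations add up to a full turn of 28 bits, a hardware engine can step from one candidate key to the next without reloading the key register from scratch.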
NoC Implementation.
Virtual Channel NoC

Used by most NoCs today
Basic Network Components
Physical channel: multiple lanes, so that packets can bypass one another
Node arbitration: arbitration for outgoing virtual channel allocation and switch allocation
Node switch: multiple simultaneous paths of communication
Simple NoC
Basic Network Components

Shrinking the physical channel: simple one-word FIFO
Shrinking the node arbitration: no virtual channel allocation, less sideband state and signaling
Shrinking the node switch: 1 switching decision
Deadlocks: avoided using deterministic XY routing

DES key Search Architectural Details

[Figure: NoC layout with a master uP, slave uPs and DES engines]
Hierarchy of controllers
Master microprocessor: assigns a plaintext-ciphertext pair and a range of keys to each slave microcontroller.
Slave microprocessor: subdivides its range of keys, assigns tasks to the DES engines, and polls for found keys.
DES search engine: takes a plaintext-ciphertext pair (P, C) and a starting key K, and searches through keys until one is found that encrypts P to produce C.
Controllers are implemented as MicroBlaze processors that communicate with the DES engines located in the NoC.
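The controller hierarchy amounts to splitting the key space twice and then searching each piece. The sketch below shows one way to do that in Python; the function names are hypothetical, the even split is an assumption, and a trivial XOR "cipher" stands in for DES so that the example stays short and runnable.

```python
KEY_SPACE = 2 ** 56


def split_range(start, end, parts):
    """Split [start, end) into `parts` contiguous, near-equal sub-ranges."""
    size, extra = divmod(end - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def assign_keys(num_slaves, engines_per_slave):
    """Master level: one range per slave. Slave level: one sub-range per engine."""
    return [split_range(lo, hi, engines_per_slave)
            for lo, hi in split_range(0, KEY_SPACE, num_slaves)]


def search(encrypt, plaintext, ciphertext, lo, hi):
    """Engine level: try every key in [lo, hi) until one maps P to C."""
    for key in range(lo, hi):
        if encrypt(plaintext, key) == ciphertext:
            return key
    return None


if __name__ == "__main__":
    plan = assign_keys(num_slaves=3, engines_per_slave=4)
    print(plan[0][0], plan[-1][-1])

    def toy_encrypt(p, k):
        return p ^ k  # stand-in only, NOT DES

    print(search(toy_encrypt, 0x1234, 0x1234 ^ 77, 0, 100))
```

Each engine only needs its starting key and range, so the NoC carries little traffic besides task assignments and the polling for found keys.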
DES search engine
Results
Application performance metric: keys generated per second.
Implementation performance:
The simple NoC performs better when the network load is below 15%.
The virtual channel NoC degrades more gracefully, while the simple NoC's performance drops off steeply.
