Embedded System Presentation


Networks on Chip: a quick introduction

Abelardo Jara, Jared Bevis, Abraham Sanchez
March 23rd, 2009

Outline - NoC Introduction

NoC Introduction & properties
NoC buffered flow control
Routing algorithms
Application specialization

Using Virtex 4 configuration network as a high-speed MetaWire data network
What is MetaWire and why use it?
Architecture of MetaWire
MetaWire performance

Implementation And Application Exploration For Network on Chip
DES Algorithm
NoC Implementation
DES key Search Architectural Details
Results

Today's heterogeneous SoCs

The System-on-Chip (SoC) today
Heterogeneous: ~10 IPs
Homogeneous (MP-SoC): ~10 uP (with exceptions)
On-chip bus (AMBA, CoreConnect, Wishbone, ...)
IP and uP are sold with proprietary bus interfaces
Near- and long-term forecast
100 IP/uP: buses are not scalable!
Physical design issues: signal integrity, power consumption, timing closure
Clock issues: is it time for the Globally Asynchronous, Locally Synchronous (GALS) paradigm? (Still locally synchronous)
Need for more regular design

[Figure: bus-based SoC with CPU, DMA, DSP, MEM, dedicated IP (MPEG), and I/O blocks on an interconnection network (bus), grouped into locally synchronous clock domains]

Computation vs Communication: A growing gap


Source: Kanishka Lahiri 2004

Focus on communication-centric design


Poor wire scaling: interconnect power and delay become more dominant as the technology improves
High performance
Energy efficiency: the communication architecture takes a large proportion of the energy budget

The SoC nightmare


The "Board-on-a-Chip" Approach

[Figure: DMA, CPU, DSP, memory controller, and MPEG blocks on a system bus, connected through a bridge to a peripheral bus, with separate control wires]

The architecture is tightly coupled

Source: Prof Jan Rabaey CS-252-2000 UC Berkeley

SoC Design Trends

MPSoC: STI Cell

Eight Synergistic Processing Elements
Ring-based Element Interconnect Bus

128-bit, 4 concentric rings

Interconnect delays have become important

Pentium 4 had two dedicated drive stages to transport signals across chip

Source: Pham et al ISSCC 2005

Evolution or Paradigm Shift?


[Figure legend: network link, network router, computing module, bus]

Architectural paradigm shift: replace wire spaghetti by an intelligent network infrastructure
Design paradigm shift: buses and signals replaced by packets
Organizational paradigm shift: create a new discipline, a new infrastructure responsibility

Bus vs Networks-on-Chip (NoCs)

[Figure: bus-based architectures, irregular architectures, and regular architectures]

Bus-based interconnect
Low cost
Easier to implement
Flexible

Networks on Chip
Layered approach
Buses replaced with networked architectures
Better electrical properties
Higher bandwidth
Energy efficiency
Scalable

Better electrical properties and System Integration


1) Efficient interconnect: delay, power, noise, scalability, reliability
2) Increase system integration productivity
3) Enable multiprocessors for SoCs

[Figure: block diagrams of interconnected modules]

Scalability Area and Power in NoCs


For the same performance, compare the wire area and power:

Architecture: wire area, power
NoC: $O(n)$, $O(n)$
Simple bus: $O(n^3\sqrt{n})$, $O(n\sqrt{n})$
Point-to-point: $O(n^2\sqrt{n})$, $O(n\sqrt{n})$
Segmented bus: $O(n^2\sqrt{n})$, $O(n\sqrt{n})$

E. Bolotin et al., "Cost Considerations in Network on Chip", Integration, special issue on Network on Chip, October 2004

Layered approach
Layers: Software, Transport, Network, Wiring
Networking
Separation of concerns
Traffic modeling
Architectures
Queuing theory

Regular Network on Chip

[Figure: regular NoC built from processing elements (PE) and routers]

Typical NoC Router


[Figure: router with buffers on its ports around a crossbar switch, plus routing and arbitration logic]

This example uses a centralized arbiter for all I/O ports

Distributed arbitration can also be used

Routing Algorithms

NoC routing algorithms should be simple

Complex routing schemes consume more device area (complex routing/arbitration logic)
Additional latency for channel setup/release
Deadlocks must be avoided

Deadlock occurs when no message can move (without discarding one).

Buffer deadlock occurs when all buffers are full in a store-and-forward network. This leads to a circular wait condition, with each node waiting for space to receive the next message.
Channel deadlock is similar, but results when all channels around a circular path in a wormhole-based network are busy (recall that each node has a single buffer used for both input and output).

Some additional features are highly desirable

QoS, fault-tolerance

Routing in a 2D-mesh NoC: X-Y routing

In X-Y routing, the path is determined completely by the source and destination addresses. The message first travels horizontally (in the X dimension) from the source node to the column containing the destination, and then travels vertically (in the Y dimension) to the destination.

The X direction is determined first, then the Y direction

There are four possible direction pairs: east-north, east-south, west-north, and west-south.

Advantages of X-Y routing:
Very simple to implement
Deterministic
Deadlock-free
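
The X-then-Y rule can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function name `xy_route`, the (x, y) coordinate convention, and the choice that east is +x and north is +y are all assumptions.

```python
def xy_route(src, dst):
    """Return the hop directions from src to dst under X-Y routing.

    Assumed convention (not from the slides): nodes are (x, y) tuples,
    'E' increases x, 'W' decreases x, 'N' increases y, 'S' decreases y.
    """
    (sx, sy), (dx, dy) = src, dst
    hops = []
    # Resolve the X dimension completely first.
    hops += ['E' if dx > sx else 'W'] * abs(dx - sx)
    # Only then travel along the Y dimension.
    hops += ['N' if dy > sy else 'S'] * abs(dy - sy)
    return hops


# Example: two hops east, then one hop north.
print(xy_route((0, 0), (2, 1)))
```

Because the route depends only on the two addresses, every packet between the same pair of nodes takes the same path, which is what makes the scheme deterministic.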

X-Y Routing Example

NoC Buffered Flow Control


1. Store & Forward
2. Cut-through
3. Wormhole
4. Virtual Channel

Store & Forward


1. Store & Forward Flow Control:
Each node receives a packet and then sends it out.

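[Figure: timing diagram of a packet (head H, body B, and tail T) crossing buffers 0 to 3; each node receives the whole packet before forwarding it]

$T_0 = H\,(T_r + L/b)$

In the formula, H is the number of hops, $T_r$ the per-hop router delay, L the packet length, and b the channel bandwidth.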

Cut-through
2. Cut-through Flow Control:
Each node starts to send the packet without waiting for the whole packet to arrive.
Cut-through is a more efficient approach:
1) Good performance
2) Large buffer sizes, which consume more power
Suppose the packet gets stuck in the middle:

[Figure: timing diagrams of a packet crossing buffers 0 to 3 under cut-through, once without contention and once with a downstream channel marked "Not Ready"]

$T_0 = H \cdot T_r + L/b$
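
To make the difference between the two latency formulas concrete, here is a small numeric sketch. It reads H as the hop count, Tr as the per-hop router delay, L as the packet length, and b as the channel bandwidth; the function names and the sample numbers are illustrative assumptions, not measurements from the slides.

```python
def store_and_forward_latency(hops, t_router, length, bandwidth):
    """T0 = H * (Tr + L/b): the full packet is serialized at every hop."""
    return hops * (t_router + length / bandwidth)


def cut_through_latency(hops, t_router, length, bandwidth):
    """T0 = H * Tr + L/b: the packet is serialized only once."""
    return hops * t_router + length / bandwidth


# Assumed example: 4 hops, 2 cycles per router,
# a 128-bit packet, and 32 bits per cycle of channel bandwidth.
sf = store_and_forward_latency(4, 2, 128, 32)
ct = cut_through_latency(4, 2, 128, 32)
print(sf, ct)
```

With these numbers the serialization term L/b (4 cycles) is paid four times under store-and-forward but only once under cut-through, so the gap grows with both hop count and packet length.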

Flits and Wormhole Routing

Wormhole routing divides a packet into smaller fixed-size pieces called flits (flow control digits). The first flit in the packet must contain (at least) the destination address, so the size of a flit must be at least log2 N bits in an N-core SoC. Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through intermediate routers.
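
A minimal sketch of the idea, under assumptions: the names `min_address_bits` and `packetize`, the tuple layout of a flit, and the HEAD/BODY/TAIL labels are hypothetical choices for illustration, not a format defined in the presentation.

```python
def min_address_bits(num_cores):
    """Smallest number of bits that can address num_cores nodes (ceil(log2 N))."""
    return max(1, (num_cores - 1).bit_length())


def packetize(dest, payload, flit_size):
    """Split a payload into flits of at most flit_size bytes.

    The head flit carries only the destination address; the payload
    follows in order, and the last payload flit is marked as the tail.
    """
    chunks = [payload[i:i + flit_size] for i in range(0, len(payload), flit_size)]
    flits = [("HEAD", dest)]
    for i, chunk in enumerate(chunks):
        kind = "TAIL" if i == len(chunks) - 1 else "BODY"
        flits.append((kind, chunk))
    return flits


print(min_address_bits(64))           # address bits needed for 64 cores
print(packetize(5, b"abcdefgh", 3))   # head flit, two body flits, one tail flit
```

Only the head flit carries routing information, which is why the body and tail flits must follow it in order through the same routers.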

Store and Forward vs. Wormhole

Blocking condition: wormhole router

[Figure: IP cores sending through their interfaces to a hot module (HM)]

No fairness is guaranteed, since router arbitration is based on local state
The farther the source is from the destination, the more arbitrations its worm has to win
The hot module (HM) bandwidth isn't fairly shared

A simple solution: Virtual Channels


[Figure: packets A and B and nodes 1, 2, and 3]

Solution 1: Time multiplexing


Input a: an a1 a2 a3 a4
Input b: bn b1 b2 b3 b4
Interleaved: an bn a1 b1 a2 b2 a3 b3 a4 b4
Winner takes all: an a1 a2 a3 a4 bn b1 b2 b3 b4
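
The two output orderings can be reproduced with a short sketch. The function names and the flit labels are assumptions for illustration; the code only models the order in which flits leave a shared link, not any timing.

```python
from itertools import chain, zip_longest

_EMPTY = object()  # sentinel for an input that has run out of flits


def interleaved(a, b):
    """Alternate between the two inputs, one flit per turn."""
    pairs = zip_longest(a, b, fillvalue=_EMPTY)
    return [flit for flit in chain.from_iterable(pairs) if flit is not _EMPTY]


def winner_takes_all(a, b):
    """Input a holds the link until its whole packet is sent, then b follows."""
    return list(a) + list(b)


a = ["a1", "a2", "a3", "a4"]
b = ["b1", "b2", "b3", "b4"]
print(interleaved(a, b))
print(winner_takes_all(a, b))
```

Interleaving lets both packets make progress at half rate, while winner-takes-all finishes one packet early and delays the other by a full packet time.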

Solution 2: Additional I/O ports

Optimizing a NoC for a particular application

Given a particular application, can we optimize a NoC for it?

The NoC architecture has to be flexible and parametric

Parameters allow customization
Parameters: buffer depth, number of virtual channels, NoC size, etc.

Application Specific Optimization

Buffers
Routing
Topology
Mapping to topology
Implementation and reuse
QoS support
Topology
Gossiping architectures

Architecture Optimization

Fault tolerance

But how is an application described?

Few multiprocessor embedded benchmarks
Task graphs
Extensively used in scheduling research
Each node has computation properties
Directed edges describe task dependences
Edge properties give the communication volume

[Figure: example task graph with nodes SRC, FFT, FIR, matrix, IFFT, angle, and SINK; edges are labeled with communication volumes (4000, 15000, 40000, 82500), and one node is annotated with per-processor execution times (ARM: 2.5 ms, PPC: 2.2 ms)]

Communication Centric Design


[Figure: design loop. Inputs: application, architecture library, and architecture/application model. NoC optimisation: configure, evaluate, analyse/profile, and refine, repeated while the answer to "Good?" is no; then synthesis produces the optimized NoC]

NoC Design Flow


Extract intermodule traffic

Place modules

Allocate link capacities

Verify QoS and cost



Optimize capacity for performance/power tradeoff
Capacity allocation is a traditional WAN optimization problem, however:

Capacity Allocation Realistic Example


A SoC-like system with realistic traffic demands and delay requirements
Classic design: 41.8 Gbit/sec
Using the developed NoC algorithm: 28.7 Gbit/sec
Total capacity reduced by about 30%

Before optimization

After optimization

Energy Model Limitations

Some components:
Static energy, i.e. leakage power (a problem of increasing importance)
Clock energy: flip-flops and latches need to be clocked
Buffering energy
Can consume 50-80% of total communication architecture energy, depending on the size and depth of the FIFOs
A great problem in NoCs

Buffering energy is not free

NoC Based FPGA Architecture


[Figure: NoC-based FPGA. A grid of routers (R) forms the NoC used for inter-routing. Attached through configurable network interfaces (CNI) are functional units (FR: CPU, DSP, DRAM, PCI, SERDES, ETH I/F, D/A-A/D) and configurable regions (CR) holding user logic]

MetaWire: Using FPGA Configuration Circuitry to Emulate a Network-On-Chip


Jared Bevis

When Should I Consider This?

Many FPGAs have reconfigurable architectures.

There is an advanced wiring network present whose only purpose is to download configuration information.

For static designs, this network is unused after initial configuration.

What Resources are Required?

This presentation topic is centered on the Xilinx Virtex-4 FPGA which is a reconfigurable device. Theoretically, any reconfigurable device can use these concepts as long as there is a link between the configuration circuitry and the logic level.

Caveat: gaining access to low-level FPGA functions may not be supported by development software.

Architecture Basics

FPGAs are volatile devices which are composed of many RAM elements known as Look Up Tables (LUT).

Various combinations form what are known as logic blocks.

Many FPGAs also have built in specialized blocks such as multipliers and floating point units.

These components are connected as specified in a hardware description language (VHDL or Verilog).

Nearly any digital circuit can be synthesized by specifying the architecture. The required logic gates (logic blocks in the FPGA) are connected with on-chip interconnects via the configuration network.

Why use the configuration network if there is already an interconnect network?

Synthesis time on the development system can be greatly reduced for large designs.
This may help alleviate bottlenecks in the interconnecting grid.
Reduces extra buffers, latches, etc., as these are already built into the configuration network, thus saving area for additional logic.

Additional Features of MetaWire Network

The configuration network is already fully addressable and synchronous across the chip.

Addressing scheme already has NoC written all over it.
Synchronous feature allows data to be sent in single cycles with guaranteed minimal race condition effects.

Structure of the MetaWire Network

MWI TX and RX Details

MetaWire Controller

Single purpose controller for arbitrating data transfers. Somewhat similar to a DMA controller.

Executes a round-robin scheme of servicing data transfer requests.

Consists of address tables, logic control, and ICAP core.
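
Round-robin servicing of transfer requests can be sketched as follows. This is a generic software model of a round-robin arbiter, not the MetaWire controller's actual logic; the class name `RoundRobinArbiter` and its interface are hypothetical.

```python
class RoundRobinArbiter:
    """Grant one requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        # Start so that requester 0 has the highest priority on the first call.
        self.last = num_requesters - 1

    def grant(self, requests):
        """requests[i] is True if requester i wants a transfer.

        Returns the index of the granted requester, or None if nobody asks.
        The search starts just after the most recently granted requester.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None


arb = RoundRobinArbiter(3)
print([arb.grant([True, True, False]) for _ in range(3)])
```

Because the starting point moves past each winner, a requester that keeps asking cannot starve the others.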

Performance

Both throughput and latency equations are derived from timing diagrams.

Actual Testing Data

Final Verification

Implementation And Application Exploration For Network on Chip


Abraham Sanchez
Paper: "Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads", Graham Schelle and Dirk Grunwald, University of Colorado

Outline

Application
Brute Force DES key Search
DES Algorithm
NoC Implementation
Virtual Channel NoC
Simple NoC
NoC Layout
DES key Search Engine
DES key Search Architectural Details
Results

DES and Brute Force Key search

Data Encryption Standard (DES)

Designed by IBM; standardized in 1977.
Uses a 56-bit key (stored as 64 bits, 8 of which are parity bits for error checking) and a 64-bit block.
Encrypts plaintext in blocks of 64 bits
Replaced by Triple DES

Brute Force Key Search

Given a known plaintext-ciphertext pair (P, C), find the DES key or keys that encrypt P to produce C
For DES there are 2^56 keys in the search space

DES Algorithm

Sixteen 48-bit subkeys are derived from the original 56-bit key
The 56-bit key is permuted (PC1)
It is then divided into two 28-bit halves, which are treated separately thereafter
In each round, both 28-bit halves are rotated left by 1 or 2 bits (specified for each round)
The two 28-bit halves are combined and permuted, and a 48-bit subkey is selected
The plaintext is passed through 16 rounds, each using one subkey, resulting in the ciphertext
An initial permutation is applied at the beginning
An inverse initial permutation and a 32-bit swap are applied at the end
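
The rotation step of the key schedule described above can be sketched on its own. The per-round shift amounts are the standard DES ones; the PC1 and PC2 permutation tables are deliberately omitted, so this sketch yields the rotated 28-bit halves rather than real subkeys. The function names are illustrative.

```python
# Left-rotation amount for each of the 16 rounds (standard DES schedule).
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(value, count):
    """Rotate a 28-bit value left by count bits."""
    return ((value << count) | (value >> (28 - count))) & MASK28


def rotated_halves(c0, d0):
    """Return the 16 (C_i, D_i) pairs produced from the two 28-bit halves.

    In full DES each pair would be combined and passed through PC2
    to select a 48-bit subkey; that step is not shown here.
    """
    c, d = c0, d0
    halves = []
    for shift in SHIFTS:
        c, d = rotl28(c, shift), rotl28(d, shift)
        halves.append((c, d))
    return halves


halves = rotated_halves(0x0F0F0F0, 0x1234567)
print(len(halves), sum(SHIFTS))
```

The shifts add up to 28, so after the sixteenth round both halves are back at their starting values.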

Source: "Exploring FPGA Network on Chip Implementations Across Various Application and Network Loads", Graham Schelle and Dirk Grunwald, Department of Computer Science, University of Colorado at Boulder, Boulder, CO

NoC Implementation.

Virtual Channel NoC


Used by most NoCs today
Basic Network Components

Physical channel: multiple lanes so that packets can bypass one another
Node arbitration: arbitration for outgoing virtual channel allocation and switch allocation
Node switch: multiple simultaneous paths of communication

Simple NoC

Basic Network Components


Shrinking the physical channel: simple one-word FIFO
Shrinking the node arbitration: no virtual channel allocation; less sideband state and signaling
Shrinking the node switch: 1 switching decision

Deadlocks: avoided using deterministic XY Routing



DES key Search Architectural Details


[Figure: NoC layout with a master uP, slave uPs, and DES engines]

Hierarchy of controllers
Master microprocessor: assigns a plaintext-ciphertext pair and a range of keys to each slave microcontroller
Slave microprocessor: subdivides the range of keys, assigns tasks to the DES engines, and polls for found keys
DES search engine: takes a plaintext-ciphertext pair (P, C) and a starting key K, and searches through keys until one is found that encrypts P to produce C
Controllers are implemented as MicroBlaze processors that communicate with the DES engines located in the NoC.
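
The master/slave/engine hierarchy can be modeled as a two-level split of the key range followed by a linear search in each piece. Everything below is a toy sketch under loud assumptions: `toy_encrypt` is an XOR stand-in and NOT DES, the key space is shrunk to 16 bits so the example runs instantly, and all function names are hypothetical.

```python
def split_key_range(start, end, parts):
    """Split [start, end) into `parts` contiguous, near-equal sub-ranges."""
    base, extra = divmod(end - start, parts)
    ranges, lo = [], start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def toy_encrypt(key, plaintext):
    """Stand-in cipher for illustration only (NOT DES)."""
    return (plaintext ^ key) & 0xFFFF


def search_range(key_range, plaintext, ciphertext, encrypt):
    """What one search engine does: try every key in its range."""
    lo, hi = key_range
    for key in range(lo, hi):
        if encrypt(key, plaintext) == ciphertext:
            return key
    return None


# The master splits the key space among 4 slaves; each slave splits
# its share among 4 engines and collects any key that was found.
P, secret = 0x1234, 0xBEEF
C = toy_encrypt(secret, P)
found = []
for slave_range in split_key_range(0, 1 << 16, 4):
    for engine_range in split_key_range(*slave_range, 4):
        key = search_range(engine_range, P, C, toy_encrypt)
        if key is not None:
            found.append(key)
print([hex(k) for k in found])
```

In hardware the engines would search their ranges in parallel; the nested loops here only show how the ranges are carved up and how results flow back.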

DES search engine


Results

The application performance metric: keys generated per second.
Implementation performance
The simple NoC has better performance when the network load is less than 15%
Performance degradation of the virtual channel NoC is more graceful, while the simple NoC degrades steeply

