Tutorial: Networks on Chip
The 1st ACM/IEEE International Symposium on Networks-on-Chip, Princeton, New Jersey, May 6, 2007
Network Layer Communication Performance in Network-on-Chips, A. Jantsch (Royal Institute of Technology, Sweden), 10:00 - 11:45 am
Power, Energy and Reliability Issues in NoC, R. Marculescu (Carnegie Mellon University, USA), 1:15 - 3:00 pm
Tooling, OS Services and Middleware, L. Benini (University of Bologna, Italy), 3:15 - 5:00 pm
Network on Chip Tutorial, Princeton, May 6, 2007
Network Layer Communication Performance in Network-on-Chips
Introduction
Communication Performance
Organizational Structure
Interconnection Topologies
Trade-offs in Network Topology
Routing
Quality of Service
A. Jantsch, KTH
Introduction
Interconnection Network
Topology: How switches and nodes are connected
Routing algorithm: Determines the route from source to destination
Switching strategy: How a message traverses the route
Flow control: Schedules the traversal of the message over time
[Figure: two nodes, each with a processor (P), memory (Mem) and communication assist, attached to the network through network interfaces]
Basic Definitions
Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of one or many flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.
Hop is the basic communication action from node to switch or from switch to switch.
Diameter is the length of the maximum shortest path between any two nodes measured in hops.
Routing distance between two nodes is the number of hops on a route.
Average distance is the average of the routing distance over all pairs of nodes.
Basic Switching Techniques
Circuit Switching: A real or virtual circuit establishes a direct connection between source and destination.
Packet Switching: Each packet of a message is routed independently. The destination address has to be provided with each packet.
Store and Forward Packet Switching: The entire packet is stored and then forwarded at each switch.
Cut Through Packet Switching: The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.
Virtual Cut Through Packet Switching: The entire packet is stored in a switch only when the header flit is blocked due to congestion.
Wormhole Switching: Cut through switching in which all flits are blocked on the spot when the header flit is blocked.
Latency
Time(n) = Admission + RoutingDelay + ContentionDelay
Admission is the time it takes to emit the message into the network.
RoutingDelay is the delay for the route.
ContentionDelay is the delay of a message due to contention.
Routing Delay
Store and Forward: $T_{sf}(n, h) = h\left(\frac{n}{b} + \Delta\right)$
Circuit Switching: $T_{cs}(n, h) = \frac{n}{b} + h\Delta$
Cut Through: $T_{ct}(n, h) = \frac{n}{b} + h\Delta$
Store and Forward with fragmented packets: $T_{sf}(n, h, n_p) = \frac{n - n_p}{b} + h\left(\frac{n_p}{b} + \Delta\right)$
$n$ ... message size in bits
$n_p$ ... size of message fragments in bits
$h$ ... number of hops
$b$ ... raw bandwidth of the channel
$\Delta$ ... switching delay per hop
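The routing delay formulas can be compared numerically. The following is a minimal sketch, not part of the original slides; the function names and the example values (a 1024-bit message, 6 hops, 32 bits per cycle, one cycle of switching delay) are assumptions chosen only for illustration.

```python
def t_store_and_forward(n, h, b, delta):
    """Store and forward: each hop receives the whole message, then forwards it."""
    return h * (n / b + delta)


def t_cut_through(n, h, b, delta):
    """Cut through (same expression as circuit switching on the slide)."""
    return n / b + h * delta


def t_sf_fragmented(n, h, b, delta, n_p):
    """Store and forward with the message split into fragments of n_p bits."""
    return (n - n_p) / b + h * (n_p / b + delta)


if __name__ == "__main__":
    n, h, b, delta = 1024, 6, 32, 1  # assumed example values
    print("store and forward:", t_store_and_forward(n, h, b, delta))
    print("cut through:      ", t_cut_through(n, h, b, delta))
    print("fragmented SF:    ", t_sf_fragmented(n, h, b, delta, n_p=128))
```

With these values the store-and-forward delay grows with the product of hop count and message size, while the cut-through delay adds them, which is the effect the comparison plots illustrate.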
Routing Delay: Store and Forward vs Cut Through
[Figure: average latency of cut-through and store-and-forward switching. Upper plots: latency vs. packet size in bytes for d=2, k=10, with b=1 and with b=32. Lower plots: latency vs. number of nodes for m=8, with k=2 and with d=2.]
Local and Global Bandwidth
Local bandwidth = $b \, \frac{n}{n + n_E + w\Delta}$
Total bandwidth = $Cb$ [bits/second] = $Cw$ [bits/cycle] = $C$ [phits/cycle]
Bisection bandwidth ... minimum bandwidth to cut the net into two equal parts.
$b$ ... raw bandwidth of a link; $n$ ... message size; $n_E$ ... size of message envelope; $w$ ... link bandwidth per cycle; $\Delta$ ... switching time for each switch in cycles; $w\Delta$ ... bandwidth lost during switching; $C$ ... total number of channels
For a $k \times k$ mesh with bidirectional channels:
Total bandwidth = $(4k^2 - 4k)b$
Bisection bandwidth = $2kb$
Link and Network Utilization
Total load on the network: $L = \frac{N h l}{M}$ [phits/cycle]
Load per channel: $\rho = \frac{N h l}{M C}$ [phits/cycle] $\le 1$
$M$ ... each host issues a packet every $M$ cycles
$C$ ... number of channels
$N$ ... number of nodes
$h$ ... average routing distance
$l = n/w$ ... number of cycles a message occupies a channel
$n$ ... average message size
$w$ ... bitwidth per channel
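As a rough illustration of the utilization formulas, the sketch below computes the load per channel for a k x k mesh, taking the channel count from the mesh total-bandwidth expression (4k^2 - 4k channels). The function names and all numeric values, including the average distance h = 2.5, are assumptions made for the example.

```python
def mesh_channels(k):
    """Number of unidirectional channels in a k x k mesh with bidirectional links."""
    return 4 * k * k - 4 * k


def total_load(N, h, l, M):
    """Total network load L = N*h*l / M in phits per cycle."""
    return N * h * l / M


def channel_load(N, h, l, M, C):
    """Load per channel rho = N*h*l / (M*C); must stay at or below 1."""
    return total_load(N, h, l, M) / C


if __name__ == "__main__":
    k = 4                      # assumed 4 x 4 mesh
    N = k * k
    C = mesh_channels(k)
    l = 128 / 32               # assumed: 128-bit messages on 32-bit channels
    rho = channel_load(N, h=2.5, l=l, M=20, C=C)
    print(f"C = {C} channels, load per channel = {rho:.3f}")
```

A result well below the typical saturation range suggests the assumed injection rate is sustainable; raising the injection rate (smaller M) increases the load proportionally.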
Network Saturation
[Figure: network saturation shown in two plots: latency vs. delivered bandwidth, and delivered bandwidth vs. offered bandwidth]
Typical saturation points are between 40% and 70%.
The saturation point depends on
Traffic pattern
Stochastic variations in traffic
Routing algorithm
Organizational Structure
Link
Switch
Network Interface

Link
Short link: At any time there is only one data word on the link.
Long link: Several data words can travel on the link simultaneously.
Narrow link: Data and control information is multiplexed on the same wires.
Wide link: Data and control information is transmitted in parallel and simultaneously.
Synchronous clocking: Both source and destination operate on the same clock.
Asynchronous clocking: The clock is encoded in the transmitted data to allow the receiver to sample at the right time instance.
Switch
[Figure: switch structure with input ports, receivers and input buffers, a crossbar, output buffers, transmitters and output ports, and a control unit (routing, scheduling)]

Switch Design Issues
Degree: number of inputs and outputs
Buffering
  Input buffers
  Output buffers
  Shared buffers
Routing
  Source routing
  Deterministic routing
  Adaptive routing
Output scheduling
Deadlock handling
Control flow

Network Interface
Admission protocol
Reception obligations
Buffering
Assembling and disassembling of messages
Routing
Higher level services and protocols
Interconnection Topologies
Fully connected networks
Linear arrays and rings
Multidimensional meshes and tori
Trees
Butterflies
Fully Connected Networks
[Figure: nodes attached to a bus, and nodes attached to a crossbar]
Bus:
switch degree = $N$
diameter = 1
distance = 1
network cost = $O(N)$
total bandwidth = $b$
bisection bandwidth = $b$
Crossbar:
switch degree = $N$
diameter = 1
distance = 1
network cost = $O(N^2)$
total bandwidth = $Nb$
bisection bandwidth = $Nb$
Linear Arrays and Rings
[Figure: linear array, torus, folded torus]
Linear array:
switch degree = 2
diameter = $N - 1$
distance $\sim \frac{2}{3}N$
network cost = $O(N)$
total bandwidth = $2(N - 1)b$
bisection bandwidth = $2b$
Torus:
switch degree = 2
diameter = $N/2$
distance $\sim \frac{1}{3}N$
network cost = $O(N)$
total bandwidth = $2Nb$
bisection bandwidth = $4b$
Multidimensional Meshes and Tori
[Figure: 2d mesh, 2d torus, 3d cube]
k-ary d-cubes are d-dimensional tori with unidirectional links and k nodes in each dimension:
number of nodes $N = k^d$
switch degree = $d$
diameter = $d(k - 1)$
distance $\sim d \cdot \frac{1}{2}(k - 1)$
network cost = $O(N)$
total bandwidth = $2Nb$
bisection bandwidth = $2k^{d-1}b$
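To see how these characteristics trade off for a fixed network size, the sketch below evaluates the k-ary d-cube expressions listed on the slide for several (k, d) pairs with the same node count. The function name, the dictionary keys and the chosen network size of 4096 nodes are assumptions for illustration only.

```python
def cube_properties(k, d):
    """Evaluate the k-ary d-cube expressions from the slide for given k and d."""
    return {
        "nodes": k ** d,
        "diameter": d * (k - 1),
        "avg_distance": d * (k - 1) / 2,
        "bisection_channels": 2 * k ** (d - 1),  # bisection bandwidth in units of b
    }


if __name__ == "__main__":
    # Assumed example: four ways to build a 4096-node network.
    for k, d in [(64, 2), (16, 3), (8, 4), (2, 12)]:
        p = cube_properties(k, d)
        print(f"k={k:2d} d={d:2d} nodes={p['nodes']} diameter={p['diameter']:3d} "
              f"distance~{p['avg_distance']:5.1f} bisection={p['bisection_channels']}b")
```

Higher dimensions shorten the routing distance and widen the bisection, which is the starting point for the cost-model discussion later in the tutorial.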
Routing Distance in k-ary n-Cubes
[Figure: network scalability with respect to distance: average distance vs. number of nodes for k=2, d=4, d=3 and d=2]

Projecting High Dimensional Cubes
[Figure: projections of the 2-ary 2-cube, 2-ary 3-cube, 2-ary 4-cube and 2-ary 5-cube]
Binary Trees
number of nodes $N = 2^d$
number of switches = $2^d - 1$
switch degree = 3
diameter = $2d$
distance $\sim d + 2$
network cost = $O(N)$
total bandwidth = $2 \cdot 2(N - 1)b$
bisection bandwidth = $2b$

k-ary Trees
number of nodes $N = k^d$
number of switches $\sim k^d$
switch degree = $k + 1$
diameter = $2d$
distance $\sim d + 2$
network cost = $O(N)$
total bandwidth = $2 \cdot 2(N - 1)b$
bisection bandwidth = $kb$
Binary Tree Projection
Efficient and regular 2-layout;
Longest wires in resource width: $l_W = 2^{\lfloor (d-1)/2 \rfloor - 1}$

d     2   3   4    5    6    7     8     9     10
N     4   8   16   32   64   128   256   512   1024
l_W   0   1   1    2    2    4     4     8     8
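The longest-wire table can be regenerated from the formula. This is a small sketch under the assumption that the exponent uses a floor, l_W = 2^(floor((d-1)/2) - 1), and that the fractional value for d = 2 is truncated to 0 as in the table; the function name is hypothetical.

```python
def longest_wire(d):
    """Longest wire of a projected binary tree of depth d, in resource widths.

    Assumes l_W = 2 ** (floor((d - 1) / 2) - 1), truncated to an integer so that
    d = 2 yields 0 as in the slide's table.
    """
    return int(2 ** ((d - 1) // 2 - 1))


if __name__ == "__main__":
    print(" d      N  l_W")
    for d in range(2, 11):
        print(f"{d:2d} {2 ** d:6d} {longest_wire(d):4d}")
```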
k-ary n-Cubes versus k-ary Trees
k-ary n-cubes:
number of nodes $N = k^d$
switch degree = $d + 2$
diameter = $d(k - 1)$
distance $\sim d \cdot \frac{1}{2}(k - 1)$
network cost = $O(N)$
total bandwidth = $2Nb$
k-ary trees:
number of nodes $N = k^d$
number of switches $\sim k^d$
switch degree = $k + 1$
diameter = $2d$
distance $\sim d + 2$
network cost = $O(N)$
total bandwidth = $2 \cdot 2(N - 1)b$
bisection bandwidth = $kb$
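Butterflies
[Figure: butterfly building block with inputs and outputs labelled 0 and 1, and a 16 node butterfly with stages 0 to 4]

Butterfly Characteristics
number of nodes $N = 2^d$
number of switches = $2^{d-1} d$
switch degree = 2
diameter = $d + 1$
distance = $d + 1$
network cost = $O(N d)$
total bandwidth = $2^d d b$
bisection bandwidth = $\frac{N}{2} b$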
k-ary n-Cubes versus k-ary Trees vs Butterflies

                                    k-ary n-cubes                    binary tree    butterfly
cost                                $O(N)$                           $O(N)$         $O(N \log N)$
distance                            $\frac{1}{2}\sqrt[d]{N}\log N$   $2 \log N$     $\log N$
links per node                      2                                2              $\log N$
bisection                           $2N^{\frac{d-1}{d}}$             1              $\frac{1}{2}N$
frequency limit of random traffic   $1/(\sqrt[d]{N}/2)$              $1/N$          $1/2$
Problems with Butterflies
Cost of the network: $O(N \log N)$
2-d layout is more difficult than for binary trees.
Number of long wires grows faster than for trees.
For each source-destination pair there is only one route.
Each route blocks many other routes.

Benes Networks
Many routes;
Costly to compute non-blocking routes;
High probability for non-blocking route by randomly selecting an intermediate node [Leighton, 1992];
Fat Trees
[Figure: 16-node 2-ary fat tree with fat nodes]

k-ary n-dimensional Fat Tree Characteristics
number of nodes $N = k^d$
number of switches = $k^{d-1} d$
switch degree = $2k$
diameter = $2d$
distance $\sim d$
network cost = $O(N d)$
total bandwidth = $2k^d d b$
bisection bandwidth = $2k^{d-1} b$
k-ary n-Cubes versus k-ary d-dimensional Fat Trees
k-ary n-cubes:
number of nodes $N = k^d$
switch degree = $d$
diameter = $d(k - 1)$
distance $\sim d \cdot \frac{1}{2}(k - 1)$
network cost = $O(N)$
total bandwidth = $2Nb$
k-ary n-dimensional fat trees:
number of nodes $N = k^d$
number of switches = $k^{d-1} d$
switch degree = $2k$
diameter = $2d$
distance $\sim d$
network cost = $O(N d)$
total bandwidth = $2k^d d b$
bisection bandwidth = $2k^{d-1} b$
Relation between Fat Tree and Hypercube
[Figure: a binary 2-dimensional fat tree related step by step to the binary 1-cube, the binary 2-cube and the binary 3-cube]
Trade-offs in Topology Design for the k-ary n-Cube
Unloaded Latency
Latency under Load
Network Scaling for Unloaded Latency
Latency(n) = Admission + RoutingDelay + ContentionDelay
RoutingDelay: $T_{ct}(n, h) = \frac{n}{b} + h\Delta$
RoutingDistance: $h = \frac{1}{2} d(k - 1) = \frac{1}{2}(k - 1)\log_k N = \frac{1}{2} d(\sqrt[d]{N} - 1)$
[Figure: network scalability with respect to latency: average latency vs. number of nodes (up to 10000) for k=2, d=5, d=4, d=3 and d=2, with m=32 and with m=128]
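The unloaded cut-through latency can be evaluated directly from the routing distance. The sketch below is illustrative only: the function names are hypothetical, and the defaults b = 1 and Delta = 1 (one unit of message per cycle, one cycle per hop) are assumptions rather than values stated in the slides.

```python
def avg_distance(N, d):
    """Average routing distance h = d/2 * (N**(1/d) - 1) of a k-ary d-cube with N nodes."""
    k = N ** (1.0 / d)
    return 0.5 * d * (k - 1.0)


def unloaded_latency(m, N, d, b=1.0, delta=1.0):
    """Cut-through routing delay m/b + h*delta without contention (b, delta assumed)."""
    return m / b + avg_distance(N, d) * delta


if __name__ == "__main__":
    for d in (2, 3, 5, 10):
        print(f"N=1024 d={d:2d} h={avg_distance(1024, d):6.2f} "
              f"latency(m=32)={unloaded_latency(32, 1024, d):7.2f}")
```

For large messages the m/b term dominates, which is why the curves for different dimensions lie close together for small networks.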
Unloaded Latency for Small Networks and Local Traffic
[Figure: network scalability with respect to latency for m=128: average latency vs. number of nodes for k=2, d=5, d=4, d=3 and d=2. Panels: up to 100 nodes; up to 100 nodes with local traffic (h=dk/5); up to 10000 nodes with local traffic (h=dk/5)]
Unloaded Latency under a Free-Wire Cost Model
Free-wire cost model: Wires are free and can be added without penalty.
[Figure: average latency vs. dimension under the free-wire cost model for N=16K, N=1K, N=256, N=128 and N=64, with (m=32; b=32) and with (m=128; b=32)]
Unloaded Latency under a Fixed-Wire Cost Model
Fixed-wire cost model: The number of wires is constant per node.
128 wires per node: $w(d) = \lfloor 64/d \rfloor$

d      2    3    4    5    6    7   8   9   10
w(d)   32   21   16   12   10   9   8   7   6

[Figure: average latency vs. dimension under the fixed-wire cost model for N=16K, N=1K, N=256, N=128 and N=64, with (m=32; b=64/d) and with (m=128; b=64/d)]
Unloaded Latency under a Fixed-Bisection Cost Model
Fixed-bisection cost model: The number of wires across the bisection is constant.
Bisection = 1024 wires: $w(d) = \frac{k}{2} = \frac{\sqrt[d]{N}}{2}$
Example: N=1024:

d      1     2    3   4   5   6   7   8   9
w(d)   512   16   5   3   2   2   1   1   1

[Figure: average latency vs. dimension (2 to 10) under the fixed-bisection cost model for N=16K, N=1K, N=256, N=128 and N=64, with (m=32B; b=k/2) and with (m=128B; b=k/2)]
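The channel widths under the two wire-limited cost models follow from short formulas. The sketch below is a hedged illustration: the function names are hypothetical, the fixed-wire width uses integer division of 64 by d, and rounding the fixed-bisection width k/2 to the nearest integer is an assumption that matches the listed values when d starts at 1.

```python
def width_fixed_wire(d, wires_per_node=128):
    """Fixed-wire model: wires_per_node shared by 2*d channel ends, i.e. floor(64/d) for 128 wires."""
    return wires_per_node // (2 * d)


def width_fixed_bisection(d, N):
    """Fixed-bisection model: w(d) = k/2 with k = N**(1/d); rounding is an assumption."""
    return round(N ** (1.0 / d) / 2.0)


if __name__ == "__main__":
    print("fixed wire     :", [width_fixed_wire(d) for d in range(2, 11)])
    print("fixed bisection:", [width_fixed_bisection(d, 1024) for d in range(1, 10)])
```

Under the fixed-bisection model the channel width collapses quickly with growing dimension, so the serialization term m/w comes to dominate the latency of high-dimensional networks.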
Unloaded Latency under a Logarithmic Wire Delay Cost Model
Fixed-bisection logarithmic wire delay cost model: The number of wires across the bisection is constant and the delay on wires increases logarithmically with the length [Dally, 1990]:
Length of long wires: $l = k^{\frac{d}{2} - 1}$
$T_c \propto 1 + \log l = 1 + \left(\frac{d}{2} - 1\right)\log k$
[Figure: average latency vs. dimension under the fixed-bisection logarithmic wire delay cost model for N=16K, N=1K, N=256, N=128 and N=64, with (m=32B; b=k/2) and with (m=128B; b=k/2)]
Unloaded Latency under a Linear Wire Delay Cost Model
Fixed-bisection linear wire delay cost model: The number of wires across the bisection is constant and the delay on wires increases linearly with the length [Dally, 1990]:
Length of long wires: $l = k^{\frac{d}{2} - 1}$
$T_c \propto l = k^{\frac{d}{2} - 1}$
[Figure: average latency vs. dimension under this cost model for N=16K, N=1K, N=256, N=128 and N=64, with (m=32B; b=k/2) and with (m=128B; b=k/2)]
Latency under Load
Assumptions [Agarwal, 1991]:
k-ary n-cubes
random traffic
dimension-order cut-through routing
unbounded internal buffers (to ignore flow control and deadlock issues)

Latency under Load - contd
Latency(n) = Admission + RoutingDelay + ContentionDelay
$T(m, k, d, w, \rho)$ = RoutingDelay + ContentionDelay
$T(m, k, d, w, \rho) = \frac{m}{w} + d h_k (\Delta + W(m, k, d, w, \rho))$
$W(m, k, d, w, \rho) = \frac{m}{w} \cdot \frac{\rho}{1 - \rho} \cdot \frac{h_k - 1}{h_k^2} \cdot \left(1 + \frac{1}{d}\right)$
$h = d h_k = \frac{1}{2} d(k - 1)$
$m$ ... message size
$w$ ... bitwidth of link
$\rho$ ... aggregate channel utilization
$h_k$ ... average distance in each dimension
$\Delta$ ... switching time in cycles
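The loaded-latency model can be evaluated for a few utilization values to see where contention starts to dominate. This is a minimal sketch: the function names are hypothetical, h_k = (k - 1)/2 is taken as the average distance per dimension, and the parameters (m=128, k=10, d=3, w=8, Delta=1) are assumed example values. The expression is only meaningful for k >= 3, where h_k >= 1.

```python
def contention_delay(m, k, d, w, rho):
    """Per-hop waiting time W(m, k, d, w, rho) of the model, for 0 <= rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("channel utilization rho must be in [0, 1)")
    h_k = (k - 1) / 2.0  # average distance in each dimension
    return (m / w) * (rho / (1.0 - rho)) * ((h_k - 1.0) / h_k ** 2) * (1.0 + 1.0 / d)


def loaded_latency(m, k, d, w, rho, delta=1.0):
    """T = m/w + d*h_k*(delta + W): routing delay plus contention delay."""
    h_k = (k - 1) / 2.0
    return m / w + d * h_k * (delta + contention_delay(m, k, d, w, rho))


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6, 0.9):
        print(f"rho={rho:.1f} latency={loaded_latency(128, 10, 3, 8, rho):8.2f}")
```

At rho = 0 the result reduces to the unloaded cut-through latency; the rho/(1 - rho) factor makes the latency grow without bound as the utilization approaches 1.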
Latency vs Channel Load
[Figure: average latency vs. channel utilization (w=8; delta=1) for the configurations m8,d3,k10; m8,d2,k32; m32,d3,k10; m32,d2,k32; m128,d3,k10; m128,d2,k32]
Routing
Deterministic routing: The route is determined solely by source and destination locations.
Arithmetic routing: The destination address of the incoming packet is compared with the address of the switch and the packet is routed accordingly (relative or absolute addresses).
Source based routing: The source determines the route and builds a header with one directive for each switch. The switches strip off the top directive.
Table-driven routing: Switches have routing tables, which can be configured.
Adaptive routing: The route can be adapted by the switches to balance the load.
Minimal routing allows only shortest paths while non-minimal routing allows even longer paths.
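As one concrete instance of deterministic, arithmetic, minimal routing, the sketch below implements dimension-order (XY) routing on a 2d mesh: each switch compares its own address with the destination address and first corrects the x coordinate, then the y coordinate. The coordinate representation and function names are assumptions made for the example; the slides do not prescribe a particular algorithm.

```python
def xy_next_hop(cur, dst):
    """Return the next switch on the XY route from cur to dst, or None on arrival."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # first travel along the x dimension
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                      # then along the y dimension
        return (cx, cy + (1 if dy > cy else -1))
    return None                       # packet has reached its destination


def xy_route(src, dst):
    """List all switches visited from src to dst, both included."""
    path = [src]
    nxt = xy_next_hop(src, dst)
    while nxt is not None:
        path.append(nxt)
        nxt = xy_next_hop(nxt, dst)
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the decision depends only on the current and destination addresses, every packet between the same pair of nodes takes the same shortest path.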
Quality of Service
Best Effort (BE)
  Optimization of the average case
  Loose or non-existent worst case bounds
  Cost effective use of resources
Guaranteed Service (GS)
  Maximum delay
  Minimum bandwidth
  Maximum jitter
  Requires additional resources
Regulated Flows
A flow $F$ is $(\sigma, \rho)$ regulated if
$F(b) - F(a) \le \sigma + \rho(b - a)$
for all time intervals $[a, b]$, $0 \le a \le b$, where
$F(t)$ is the cumulative amount of traffic between 0 and $t \ge 0$;
$\sigma \ge 0$ is the burstiness constraint;
$\rho \ge 0$ is the maximum average rate.
[Figure: cumulative traffic F(t) over time with instants t1 to t5]
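The regulation condition can be checked directly on a sampled trace. The sketch below assumes the cumulative traffic is given as a list F with F[t] the amount sent up to integer time t; the function name, the discrete time base and the small floating-point tolerance are assumptions of this example.

```python
def is_regulated(F, sigma, rho):
    """Check F[b] - F[a] <= sigma + rho * (b - a) for all sample pairs a <= b."""
    n = len(F)
    for a in range(n):
        for b in range(a, n):
            if F[b] - F[a] > sigma + rho * (b - a) + 1e-12:
                return False
    return True


if __name__ == "__main__":
    trace = [0, 3, 4, 5, 6]  # assumed trace: a burst of 3 units, then 1 unit per step
    print(is_regulated(trace, sigma=2, rho=1))
    print(is_regulated(trace, sigma=1, rho=1))
```

The quadratic scan over all interval pairs mirrors the definition; it is meant for small traces, not as an efficient checker.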
Regulated Flows - Delay Element
[Figure: flow F1 enters a delay element D and leaves as flow F2]
$F_1 \sim (\sigma, \rho)$
$F_2 \sim (\sigma + \rho D, \rho)$
Regulated Flows - Work Conserving Multiplexer
[Figure: flows F1 and F2 enter a multiplexer with delay D and backlog B; flow F3 leaves on a link with bandwidth b]
$F_1 \sim (\sigma_1, \rho_1)$
$F_2 \sim (\sigma_2, \rho_2)$
link bandwidth $b$ with $\rho_1 + \rho_2 < b$
$F_3 \sim$ ?
maximum delay $D$ = ?
maximum backlog $B$ = ?
Work Conserving Multiplexer - 1
[Figure: backlog B over time, starting at t=0, rising during t_accu (phases t1 and t2) up to B_max and falling to zero during t_drain]
Phase 1 ($t_1$): F1 and F2 transmit at full speed;
Assume: At $t = 0$ the queue is empty; $\sigma_1 \le \sigma_2$
Injection rate: $2b$; Drain rate: $b$
$b t_1 = \sigma_1 + \rho_1 t_1$
$t_1 = \frac{\sigma_1}{b - \rho_1}$
Phase 2 ($t_2$): F1 transmits at rate $\rho_1$, F2 transmits at full speed;
Injection rate: $b + \rho_1$; Drain rate: $b$
$b t_{accu} = \sigma_2 + \rho_2 t_{accu}$
$t_{accu} = \frac{\sigma_2}{b - \rho_2}$
Phase 3 ($t_{drain}$): F1 transmits at rate $\rho_1$, F2 transmits at rate $\rho_2$;
Injection rate: $\rho_1 + \rho_2$; Drain rate: $b$
$t_{drain} = \frac{B_{max}}{b - \rho_1 - \rho_2}$
$B_{max} = b t_1 + \rho_1 t_2 = \sigma_1 + \frac{\rho_1 \sigma_2}{b - \rho_2}$
Work Conserving Multiplexer - Summary
$B_{max} = \sigma_1 + \frac{\rho_1 \sigma_2}{b - \rho_2}$
$D_{max} = t_{accu} + t_{drain} = \frac{\sigma_1 + \sigma_2}{b - \rho_1 - \rho_2}$
$F_3 \sim (\sigma_1 + \sigma_2, \rho_1 + \rho_2)$
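The multiplexer bounds can be reproduced phase by phase. The sketch below follows the three-phase derivation under its stated assumptions (queue empty at t = 0, sigma1 <= sigma2) and additionally requires rho1 + rho2 < b so that the queue drains; the function name and the numeric example are hypothetical.

```python
def mux_bounds(sigma1, rho1, sigma2, rho2, b):
    """Return (B_max, D_max) of a work conserving multiplexer with link bandwidth b."""
    if rho1 + rho2 >= b:
        raise ValueError("need rho1 + rho2 < b, otherwise the backlog never drains")
    t_accu = sigma2 / (b - rho2)                 # end of phase 2: F2's burst is used up
    b_max = sigma1 + rho1 * sigma2 / (b - rho2)  # backlog at the end of phase 2
    t_drain = b_max / (b - rho1 - rho2)          # phase 3: queue empties
    return b_max, t_accu + t_drain


if __name__ == "__main__":
    b_max, d_max = mux_bounds(sigma1=2, rho1=1, sigma2=4, rho2=2, b=4)
    print("B_max =", b_max, "D_max =", d_max)
    # The summed phases agree with the closed form (sigma1 + sigma2) / (b - rho1 - rho2).
    print("closed form D_max =", (2 + 4) / (4 - 1 - 2))
```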
MPEG Encoding Case Study
[Figure: a processor, two custom hardware blocks and a memory attached to an interconnect]

MPEG Encoding Case Study - contd
[Figure: flow graph with nodes T, S, V and memory M, connected through channels C1 to C4 and carrying flows F1 to F9]
$F_1 \sim (0, \rho_t)$
$C_1 : (\rho_t, D_1)$
$C_2 : (\rho_t, D_2)$
$C_3 : (\rho_t, D_3)$
$C_4 : (\rho_t, D_4)$
$F_2 \sim (\rho_t D_1, \rho_t)$
MPEG Encoding Case Study - Memory
[Figure: memory M with input flows F2 and F7, internal flows FM1 and FM2, and output flows FM3 and FM4; the input multiplexer has delay Dmux and output flow Fmuxout]
$M : (2\rho_t, D_M)$
For a general multiplexer we have:
$D_{mux} = \frac{\sigma_1 + \sigma_2}{C_{out} - \rho_1 - \rho_2}$
$F_{muxout} \sim (\sigma_1 + \sigma_2, \rho_1 + \rho_2)$
$F_{M3} \sim (\rho_t(D_1 + D_{mux} + D_M), \rho_t)$
$F_{M4} \sim$ ?
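[Figure: flow graph of T, S, V and M extended with the regulators RM1, RM2 and RS and the flows F1 to F9, FM3 and FM4]
$R_{M1} \sim (S_{buffer}, \rho_t)$; $R_{M2} \sim (S_{buffer}, \rho_t)$; $R_S \sim (S_{buffer}, \rho_t)$;
$S_{buffer}$ is the size of the input buffer in S.
For a $(\sigma, \rho)$-regulator, with $\sigma_{in}$ denoting the burstiness of the flow entering it:
$D_{(\sigma,\rho)\text{-regulator}} = \frac{\max(0, \sigma_{in} - \sigma)}{\rho}$
$B_{(\sigma,\rho)\text{-regulator}} = \max(0, \sigma_{in} - \sigma)$
$F_6 \sim (S_{buffer}, \rho_t)$
$C_3 : (\rho_t, D_3)$
$F_7 \sim (S_{buffer} + \rho_t D_3, \rho_t)$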
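[Figure: memory M with input flows F2 and F7, internal flows FM1 and FM2, and output flows FM3 and FM4]
$M : (2\rho_t, D_M)$
$D_{mux} = \frac{\sigma_1 + \sigma_2}{C_{out} - \rho_1 - \rho_2} = \frac{S_{buffer} + \rho_t(D_1 + D_3)}{C_{out} - 2\rho_t}$
$F_{M1} \sim (S_{buffer} + \rho_t(D_1 + D_3), 2\rho_t)$
$F_{M2} \sim (S_{buffer} + \rho_t(D_1 + D_3 + 2D_M), 2\rho_t)$
$F_{M3} \sim (\rho_t(D_1 + D_{mux} + D_M), \rho_t)$
$F_{M4} \sim (S_{buffer} + \rho_t(D_3 + D_{mux} + D_M), \rho_t)$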
[Figure: flow graph of T, S, V and M with the regulators RM1, RM2 and RS]
Backlog of the regulators:
$B_{RM1} = \max(0, \rho_t(D_1 + D_{mux} + D_M) - S_{buffer})$
$B_{RM2} = \max(0, 128\,\mathrm{B} + \rho_t(D_3 + D_{mux} + D_M) - S_{buffer})$
Delay of the regulators:
$D_{RM1} = \frac{B_{RM1}}{\rho_t}$
$D_{RM2} = \frac{B_{RM2}}{\rho_t}$
The flow from the memory to S:
$F_3 \sim (S_{buffer}, \rho_t)$
$C_2 : (\rho_t, D_2)$
$F_4 \sim (S_{buffer} + \rho_t D_2, \rho_t)$
A characterization of S and its output:
$S : (\rho_t, \frac{S_{buffer}}{\rho_t})$
$F_5 \sim (2S_{buffer} + \rho_t D_2, \rho_t)$
The flows between memory and V:
$F_8 \sim (S_{buffer}, \rho_t)$
$C_4 : (\rho_t, D_4)$
$F_9 \sim (S_{buffer} + \rho_t D_4, \rho_t)$
End to end delay:
$D_{total} = D_1 + D_{mux} + D_M + D_{RM1} + D_2 + D_S + D_{RS} + D_3 + D_{mux} + D_M + D_{RM2} + D_4$
The flow at V:
$F_{TV} \sim (0 + \rho_t D_{total}, \rho_t)$
Modeling with Regulated Flows
Interconnect:
  Model each channel by available bandwidth and maximum delay variation;
  Model each node in the interconnect as an arbiter;
Model read request, write acknowledge as separate flows;
Model synchronization as separate flows;
A simple generalization of $(\sigma, \rho)$ flows is $F \sim \min(\sigma_i, \rho_i)$, $i > 0$:
$F(b) - F(a) \le \min_i(\sigma_i + \rho_i(b - a))$
Good analysis depends on good element models;
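Network Calculus - Arrival Curves
[Figure: arrival curve α(t) in bits, bounding the traffic in an interval of length b - a]
Given a monotonically increasing function $\alpha$, defined for $t \ge 0$, $\alpha$ is an arrival curve for flow $F$ if for all $0 \le a \le b$:
$F(b) - F(a) \le \alpha(b - a)$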
Network Calculus - Min-Plus Convolution
Given two monotonically increasing functions $f$ and $g$, the min-plus convolution of $f$ and $g$ is the function
$(f \otimes g)(t) = \inf_{0 \le s \le t}(f(t - s) + g(s))$
If $\alpha$ is an arrival curve for $F$ we have:
$F \le F \otimes \alpha$
and
$F \le F \otimes \bar{\alpha}$
with $\bar{\alpha}$ being the best bound that we can find based on information of $\alpha$.
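On sampled curves the min-plus convolution reduces to a minimum over a finite set. The sketch below assumes both functions are sampled at integer times 0, 1, 2, ... and stored as lists, so the infimum over s becomes a minimum over the sample points; the function name and the two example curves of the form R * max(0, t - T) are assumptions for illustration.

```python
def min_plus_conv(f, g):
    """Discrete min-plus convolution: (f (x) g)[t] = min over 0 <= s <= t of f[t - s] + g[s]."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


if __name__ == "__main__":
    horizon = 10
    f = [3 * max(0, t - 2) for t in range(horizon)]  # assumed curve: rate 3, latency 2
    g = [2 * max(0, t - 1) for t in range(horizon)]  # assumed curve: rate 2, latency 1
    print(min_plus_conv(f, g))
    # For comparison: the slower rate combined with the summed latencies.
    print([2 * max(0, t - 3) for t in range(horizon)])
```

For these two curves the result has the smaller of the two rates and the sum of the two latencies, and the operation is symmetric in its arguments.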
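Network Calculus - Service Curves
[Figure: system S with input flow F(t) and output flow F*(t); plot of F(t), F*(t), the service curve β and (F ⊗ β)(t) in bits over time]
Given a system $S$ with an input flow $F$ and an output flow $F^*$, $S$ offers the flow a service curve $\beta$ if and only if $\beta$ is a monotonically increasing function and $F^* \ge F \otimes \beta$, which means that
$F^*(t) \ge \inf_{s \le t}(F(s) + \beta(t - s))$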
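Network Calculus - Backlog Bound
[Figure: input flow F(t) and output flow F*(t) of system S in bits over time; the backlog is the vertical distance between the two curves]
Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the backlog $F(t) - F^*(t)$ for all $t$ satisfies
$F(t) - F^*(t) \le \sup_{s \ge 0}(\alpha(s) - \beta(s))$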
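The backlog bound is the largest vertical distance between the arrival and the service curve, which can be estimated by sampling. The sketch below is illustrative: it assumes an arrival curve of the form sigma + rho * t (zero at t = 0) and a service curve of the form R * max(0, t - T), and evaluates the supremum on a finite grid, so the result is exact only if the maximizing point lies on the grid. All names and numbers are hypothetical.

```python
def affine(sigma, rho):
    """Arrival curve: 0 at t = 0 and sigma + rho * t for t > 0."""
    return lambda t: 0.0 if t <= 0 else sigma + rho * t


def rate_latency(R, T):
    """Service curve: R * max(0, t - T)."""
    return lambda t: R * max(0.0, t - T)


def backlog_bound(alpha, beta, horizon, step):
    """Grid estimate of sup over 0 <= s <= horizon of alpha(s) - beta(s)."""
    samples = int(round(horizon / step))
    return max(alpha(i * step) - beta(i * step) for i in range(samples + 1))


if __name__ == "__main__":
    alpha = affine(sigma=4, rho=1)
    beta = rate_latency(R=2, T=3)
    print("backlog bound ~", backlog_bound(alpha, beta, horizon=20, step=0.5))
```

For this pair of curves the largest distance occurs at s = T, where the service curve starts to rise, which gives sigma + rho * T.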
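Network Calculus - Delay Bound
[Figure: input flow F(t) and output flow F*(t) of system S in bits over time; the delay is the horizontal distance between the two curves]
Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the delay $d(t)$ at time $t$ is
$d(t) = \inf\{\tau \ge 0 : F(t) \le F^*(t + \tau)\}$.
It satisfies
$d(t) \le h(\alpha, \beta) = \sup_{t \ge 0}(\inf\{\tau \ge 0 : \alpha(t) \le \beta(t + \tau)\})$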
Network Calculus - Output Arrival Curve
[Figure: input flow F(t) and output flow F*(t) of system S in bits over time]
Given a flow $F$ constrained by arrival curve $\alpha$ and a system offering a service curve $\beta$, the output flow $F^*$ is constrained by the arrival curve
$\alpha^* = (\alpha \oslash \beta)(t) = \sup_{s \ge 0}(\alpha(t + s) - \beta(s))$
Network Calculus - Useful Functions
Peak rate function: $\lambda_R(t) = Rt$
Rate latency function: $\beta_{R,T}(t) = R[t - T]^+$
Affine function: $\gamma_{\sigma,\rho}(t) = 0$ for $t = 0$; $\sigma + \rho t$ for $t > 0$
Burst-delay function: $\delta_T(t) = 0$ for $t \le T$; $\infty$ for $t > T$
Staircase function: $v_{T,\tau}(t) = \lceil (t + \tau)/T \rceil$
Step function: $u_T(t) = 0$ for $t \le T$; $1$ for $t > T$
[Figure: plots of the six functions in bits over time]
Network Calculus - Concatenation of Nodes
[Figure: flow F passes through node S1 with service curve β1 and node S2 with service curve β2 and leaves as F*; the combined system S offers β1 ⊗ β2]
Example:
$\beta_1 = \beta_{R_1,T_1}$
$\beta_2 = \beta_{R_2,T_2}$
$\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2}$
Useful properties:
$f \otimes g = g \otimes f$
$(f \otimes g) \otimes h = f \otimes (g \otimes h)$
$(f + c) \otimes g = (f \otimes g) + c$ for any constant $c \in \mathbb{R}$
Network Calculus - Pay Bursts Only Once
[Figure: flow F passes through S1 (β1) and S2 (β2); plot in bits over time of the rate-latency curves with rates R1 and R2 and latencies T1 and T2, and of their convolution with rate min(R1, R2) and latency T1 + T2]
$\alpha = \gamma_{\sigma,\rho}$
$\beta_1 = \beta_{R_1,T_1} = R_1 \max(0, t - T_1)$
$\beta_2 = \beta_{R_2,T_2} = R_2 \max(0, t - T_2)$
$\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2} = \min(R_1, R_2) \max(0, t - (T_1 + T_2))$
$D_1 + D_2 = \frac{\sigma}{R_1} + \frac{\sigma}{R_2} + \frac{\rho T_1}{R_2} + T_1 + T_2$
$D_S = \frac{\sigma}{\min(R_1, R_2)} + T_1 + T_2$
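The benefit of analysing the concatenated system can be checked with numbers. The sketch below compares the sum of the per-node delay bounds D1 + D2 with the bound D_S of the concatenated system; the function names and the parameter values are assumptions chosen only to illustrate the two expressions.

```python
def delay_per_node_sum(sigma, rho, R1, T1, R2, T2):
    """D1 + D2: the burst is paid at node 1 and again, enlarged by rho*T1, at node 2."""
    d1 = sigma / R1 + T1
    d2 = (sigma + rho * T1) / R2 + T2
    return d1 + d2


def delay_concatenated(sigma, R1, T1, R2, T2):
    """D_S: delay bound of the concatenated system, where the burst is paid only once."""
    return sigma / min(R1, R2) + T1 + T2


if __name__ == "__main__":
    params = dict(sigma=4.0, R1=2.0, T1=3.0, R2=4.0, T2=1.0)  # assumed values
    print("D1 + D2 =", delay_per_node_sum(rho=1.0, **params))
    print("D_S     =", delay_concatenated(**params))
```

With these values the concatenated bound is noticeably tighter, since the burst term sigma appears once instead of at every node.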
Summary
Communication Performance: bandwidth, unloaded latency, loaded latency
Organizational Structure: NI, switch, link
Topologies: wire space and delay domination favors low dimension topologies;
Routing: deterministic vs source based vs adaptive routing; deadlock;
Quality of Service and flow regulation
To Probe Further
Classic papers:
[Agarwal, 1991] Agarwal, A. (1991). Limit on interconnection performance. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624.
[Dally, 1990] Dally, W. J. (1990). Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775-785.
Text books:
[Duato et al., 1998] Duato, J., Yalamanchili, S., and Ni, L. (1998). Interconnection Networks - An Engineering Approach. Computer Society Press, Los Alamitos, California.
[Culler et al., 1999] Culler, D. E., Singh, J. P., and Gupta, A. (1999). Parallel Computer Architecture - A Hardware/Software Approach. Morgan Kaufmann Publishers.
[Dally and Towles, 2004] Dally, W. J. and Towles, B. (2004). Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers.
[DeMicheli and Benini, 2006] DeMicheli, G. and Benini, L. (2006). Networks on Chip. Morgan Kaufmann.
[Leighton, 1992] Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Francisco.
[LeBoudec, 2001] LeBoudec, J.-Y. (2001). Network Calculus. Springer Verlag, LNCS 2050.
