The 25<sup>th</sup> International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015



#### Ultra-Fast NoC Emulation on a Single FPGA

Thiem Van Chu, Shimpei Sato, and Kenji Kise

Tokyo Institute of Technology



### Contributions

Methodologies for emulating Network-on-Chip (NoC) architectures with up to 1000s of nodes on a single FPGA



Cycle-accurate & 5000x simulation speedup over BookSim<sup>1)</sup>, a widely used software simulator



# Multi/Many-Core Architectures Have Become Mainstream

#### 1000s of cores in near-future architectures



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

## Network-on-Chip (NoC)

- The interconnection network of many-core architectures
- NoC simulation plays a vital role in designing many-core architectures



A many-core architecture with 2D Mesh NoC

## Software Simulators

Flexible and easy to debug

**Too slow** to simulate large architectures

#### Parallelization is non-trivial

Without sacrificing accuracy, only a limited degree of parallelization can be achieved

|                                                     | Simple Models       | Moderate Models       | Detailed Models      |
|-----------------------------------------------------|---------------------|-----------------------|----------------------|
| Typical Simulation Time of 1000s-Core Architectures | Several <b>days</b> | Several <i>months</i> | Several <b>years</b> |

#### **FPGA-Accelerated Simulation**

Avoids impractical designs

Possible to reuse RTL code

Can achieve an ultra-fast simulation speed

- Many operations can be simulated simultaneously in a tick of FPGA's clock
- Adding detail to a model requires more hardware, but does not necessarily degrade performance

#### **FPGA-Accelerated Simulation**





#### The primary router model

# **Emulation Model**



Three basic components

- Router: a state-of-the-art pipelined router architecture
- Traffic generator: generates and injects synthetic workloads
- Traffic sink: collects performance characteristics



# Emulation Model: 2D Mesh



- Packet source: models injection processes (e.g. Bernoulli process)
- Source queue: every packet generated by the packet source is stored in the source queue until it can enter the network
- *Flit generator*: models traffic patterns (e.g. uniform random)
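The three parts of the traffic generator can be sketched in software as follows. This is a minimal illustration, not the emulator's RTL: the class and function names, the flit record layout, and the choice to exclude the source node from the uniform random destinations are all assumptions made for the example.

```python
import random
from collections import deque

def make_flits(src, dst, length=8):
    """Split one packet into flits: a head, body flits, and a tail."""
    kinds = ["head"] + ["body"] * (length - 2) + ["tail"]
    return [{"kind": k, "src": src, "dst": dst} for k in kinds]

class TrafficGenerator:
    """Hypothetical software model of one node's traffic generator."""

    def __init__(self, node_id, num_nodes, rate, seed=0):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.rate = rate                 # packets per cycle, 0.0 to 1.0
        self.rng = random.Random(seed)
        self.source_queue = deque()      # holds packet generation times

    def tick(self, cycle):
        """Packet source: one Bernoulli trial per cycle."""
        if self.rng.random() < self.rate:
            self.source_queue.append(cycle)

    def pop_packet(self):
        """Flit generator: pick a uniform random destination, emit flits."""
        self.source_queue.popleft()
        dst = self.rng.randrange(self.num_nodes - 1)
        if dst >= self.node_id:
            dst += 1                     # assumption: never send to self
        return make_flits(self.node_id, dst)
```

Calling `tick` once per cycle and `pop_packet` whenever the network can accept a packet reproduces the source-queue behavior described above.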

### **Decoupling Time Counters**

#### Conventional approach

Every packet source is synchronized with the network

*The source queues must be very large* to cope with the case when the packet sources generate so many packets that the network becomes very congested



### **Decoupling Time Counters**

#### Proposal

Each packet source, as well as the network, has its own time counter and operates based on a separate state machine
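A small software sketch of the idea, under stated assumptions. The slides do not spell out the stall rules, so the ones used here are a guess for illustration: a packet source pauses when it would enqueue into a full queue, packets carry their generation time, and the network stalls when the queue is empty and the source's time counter has not yet passed the network's. The function names and the `accept` callback (a stand-in for "the network can take a packet this cycle") are hypothetical.

```python
from collections import deque

def conventional(gen, accept):
    """Source and network share one clock; the queue is unbounded."""
    queue, injected, peak = deque(), [], 0
    for t in range(len(gen)):
        if gen[t]:
            queue.append(t)
        peak = max(peak, len(queue))
        if queue and accept(t):
            injected.append((queue.popleft(), t))
    return injected, peak

def decoupled(gen, accept, capacity):
    """Source and network keep separate time counters; bounded queue."""
    horizon = len(gen)
    queue, injected = deque(), []
    src_time = net_time = peak = stalls = 0
    while net_time < horizon:
        # Packet source: advance unless it must enqueue into a full queue.
        blocked = gen[src_time] and len(queue) == capacity if src_time < horizon else True
        if not blocked:
            if gen[src_time]:
                queue.append(src_time)   # timestamp = generation time
            src_time += 1
        peak = max(peak, len(queue))
        # Network (assumed rule): cycle net_time may run only once every
        # packet generated up to that cycle is known, i.e. the queue is
        # non-empty or the source has already moved past net_time.
        if queue or src_time > net_time:
            if queue and queue[0] <= net_time and accept(net_time):
                injected.append((queue.popleft(), net_time))
            net_time += 1
        else:
            stalls += 1                  # network waits for the source
    return injected, peak, stalls
```

Both functions return the list of `(generation_time, injection_time)` pairs. Under the assumed rules the two lists are identical, while the decoupled version never holds more than `capacity` packets and pays for that with occasional network stalls.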







### **Time-Multiplexing**

To complete one cycle of the network, the *physical cluster* sequentially emulates a number of *logical clusters* 




#### **Time-Multiplexing**

- Combinational logic and block RAMs (BRAMs) are utilized much more efficiently because they can be shared between many NoC nodes
- Example: 128x128 mesh NoC



# Datapath

- Physical cluster emulates different logical clusters using different states loaded from state memory
- In buffer and out buffer store data passed between logical clusters



- (1) Load a state & Emulate the corresponding logical cluster
- (2) Store the updated state
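The load, emulate, store loop can be illustrated with a toy model. This sketch assumes a ring of single-node logical clusters with made-up node logic (`node_logic` is hypothetical, not the router model), and it treats the in buffer as holding the previous cycle's outputs while the out buffer collects the current cycle's outputs. It shows only that sequential emulation with a state memory and two buffers gives the same result as updating all nodes at once.

```python
def node_logic(state, left_in):
    """Toy combinational logic (hypothetical): next state and output."""
    new_state = (3 * state + left_in + 1) % 251
    return new_state, new_state

def parallel_cycle(states, outputs):
    """Reference: every node updates simultaneously from last cycle's outputs."""
    n = len(states)
    results = [node_logic(states[i], outputs[(i - 1) % n]) for i in range(n)]
    return [s for s, _ in results], [o for _, o in results]

def time_multiplexed_cycle(state_mem, in_buf):
    """One network cycle on a single physical cluster.

    state_mem is updated in place; the returned out buffer becomes the
    in buffer of the next network cycle.
    """
    n = len(state_mem)
    out_buf = [None] * n
    for i in range(n):                        # one logical cluster per step
        state = state_mem[i]                  # (1) load a state
        new_state, out = node_logic(state, in_buf[(i - 1) % n])  # emulate
        state_mem[i] = new_state              # (2) store the updated state
        out_buf[i] = out
    return out_buf
```

Reading neighbors' data from `in_buf` rather than `out_buf` is what keeps the result independent of the order in which logical clusters are visited.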








#### Xilinx VC707 board

#### **Evaluation and Analysis**



- 128 × 128 mesh NoC (16,384 nodes) on a Xilinx VC707 board
- Four NoC designs
  - **5-stage 2-VC**: canonical 5-stage pipelined VC router architecture with 2 VCs per port
  - **5-stage 1-VC**: canonical 5-stage pipelined VC router architecture with 1 VC per port
  - **4-stage 2-VC**: canonical 4-stage pipelined VC router architecture with 2 VCs per port
  - **4-stage 1-VC**: canonical 4-stage pipelined VC router architecture with 1 VC per port

#### Three configurations

- **4-phy** $(2 \times 2)$: use 4 physical nodes to emulate the entire $128 \times 128$ mesh network
- **16-phy** $(4 \times 4)$: use 16 physical nodes to emulate the entire $128 \times 128$ mesh network
- **32-phy** $(8 \times 4)$: use 32 physical nodes to emulate the entire $128 \times 128$ mesh network
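As a quick check of what each configuration implies, the number of logical clusters that the physical cluster must step through per emulated network cycle can be computed directly. The sketch assumes the physical cluster tiles the mesh evenly with no overlap; the function and constant names are illustrative only.

```python
MESH_DIM = 128
CONFIGS = {"4-phy": (2, 2), "16-phy": (4, 4), "32-phy": (8, 4)}

def logical_clusters(mesh_dim, cluster_shape):
    """Number of logical clusters when tiling a square mesh (assumed even tiling)."""
    width, height = cluster_shape
    if mesh_dim % width or mesh_dim % height:
        raise ValueError("cluster shape must divide the mesh dimension")
    return (mesh_dim // width) * (mesh_dim // height)

for name, shape in CONFIGS.items():
    print(name, logical_clusters(MESH_DIM, shape))
```

Under this assumption the three configurations correspond to 4096, 1024, and 512 logical clusters respectively, so more physical nodes mean fewer sequential steps per network cycle.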

#### Metrics

- Hardware usage
- Verification against BookSim, a widely used cycle-accurate software simulator
- Simulation performance: speedup over BookSim

## **Configuration Parameters**

| Topology            | 128x128 mesh <b>(16,384 nodes)</b>                                                                     |
|---------------------|--------------------------------------------------------------------------------------------------------|
| Router architecture | 5-stage pipelined VC router <b>or</b><br>4-stage pipelined VC router<br>(employing look-ahead routing) |
| # of VCs per port   | 2 <b>or</b> 1                                                                                          |
| Routing algorithm   | Dimension-order (XY)                                                                                   |
| Flow control        | Credit-based                                                                                           |
| VC/Switch allocator | Separable output first                                                                                 |
| Arbiter type        | Fixed priority                                                                                         |
| Flit size           | 25-bit <b>or</b> 22-bit                                                                                |
| VC size             | 4-flit                                                                                                 |
| Packet length       | 8-flit                                                                                                 |
| Injection process   | Bernoulli                                                                                              |
| Traffic pattern     | Uniform random                                                                                         |
| Source queue length | 8-entry                                                                                                |
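For reference, dimension-order (XY) routing from the table can be sketched in a few lines: a packet first travels along X until its column matches the destination, then along Y. The port names and the convention that Y grows toward "SOUTH" are assumptions for the example, not taken from the emulator.

```python
def xy_route(cur, dst):
    """Return the output port for one hop of dimension-order (XY) routing."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "SOUTH"    # assumption: y increases toward SOUTH
    if dy < cy:
        return "NORTH"
    return "LOCAL"        # arrived: eject to the local node

STEP = {"EAST": (1, 0), "WEST": (-1, 0), "SOUTH": (0, 1), "NORTH": (0, -1)}

def trace(src, dst):
    """List the ports taken from src to dst, ending with LOCAL."""
    ports, cur = [], src
    while True:
        port = xy_route(cur, dst)
        ports.append(port)
        if port == "LOCAL":
            return ports
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
```

For example, `trace((0, 0), (2, 1))` takes two EAST hops, then one SOUTH hop, then ejects at LOCAL.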

#### Hardware Usage

Simulation of the same $128 \times 128$ NoC design using 4 physical nodes (4-phy), 16 physical nodes (16-phy), and 32 physical nodes (32-phy)



#### Accuracy

- The proposed methods do not affect the simulation accuracy
  - Synthetic workloads can be modeled accurately without using a large amount of memory
  - No compromise in simulation accuracy is made
- Verification
  - Compare the outputs of the FPGA-based emulator and BookSim when simulating four NoC designs
    - 5-stage 2-VC
    - 5-stage 1-VC
    - 4-stage 2-VC
    - 4-stage 1-VC

#### Verification: Proposal vs BookSim

Solid lines: Proposal (FPGA-based). Dotted lines: BookSim (software-based).



## Simulation Performance

| Topology            | 128x128 mesh |
|---------------------|--------------|
| Router architecture | 5-stage      |
| # of VCs per port   | 2            |

*The drop is caused by stalling the emulated network* in the first proposed method, which is what eliminates the memory constraint



#### Simulation Performance



## Conclusions & Future Work

#### Conclusions

- Two methods are proposed to enable ultra-fast and accurate emulation of large-scale NoC architectures on a single FPGA
- More than 5000x simulation speedup over BookSim is achieved when emulating a 128x128 NoC with state-of-the-art router architectures

#### Future work

- Support full-system simulations
- Support a wide range of benchmarks/workloads

# Q & A

Thank you!