
TB-TBP: A Task-Based Adaptive Routing Algorithm for Network-On-Chip in Heterogeneous CPU-GPU Architectures
Juan Fang ([email protected])
Beijing University of Technology
Zhichao Wei
Beijing University of Technology
Yaqi Liu
Beijing University of Technology
Yumin Hou
Beijing University of Technology

Research Article

Keywords: Heterogeneous Architectures, Network-on-Chip (NoC), Routing Algorithm, Task-Based

Posted Date: May 30th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2981298/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


TB-TBP: A Task-Based Adaptive Routing Algorithm for Network-On-Chip in Heterogeneous CPU-GPU Architectures

Juan Fang1*, Zhichao Wei1†, Yaqi Liu1†, Yumin Hou1†

1* Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.

*Corresponding author(s). E-mail(s): [email protected];


Contributing authors: [email protected];
[email protected]; [email protected];
† These authors contributed equally to this work.

Abstract
With the rapid development of heterogeneous Network-on-Chip (NoC), a vast
amount of shared resources is integrated into the NoC. Intense resource competition
between CPUs and GPUs leads to congestion and a decrease in overall
network performance. Reasonable node placement can minimize network conflicts
at the topology level. This paper first discusses the placement of the shared last-level
cache (LLC) and memory controller (MC), then selects a more rational placement
method and optimizes the paths. To solve the hotspot problem of the center
placement method, a task-based routing algorithm is designed to plan the paths.
Simulation results demonstrate that, compared to the traditional routing algorithm,
the overall network latency is reduced by 9% and CPU performance
is improved by 13.6%. Furthermore, a dynamic task-based routing algorithm is
proposed; compared to the static task-based routing algorithm, it reduces the overall
network latency by a further 2.74% and improves CPU performance by 4.08%.

Keywords: Heterogeneous Architectures, Network-on-Chip (NoC), Routing Algorithm, Task-Based

1 Introduction
With the increasing demand for computational power in everyday life, homogeneous
multi-core CPUs are no longer sufficient to meet the requirements [1]. In response
to this, heterogeneous multi-core chips have emerged that integrate both CPU and
GPU cores to meet the computational demand. Examples of such chips include Intel’s
Sandy Bridge [2] and AMD’s Fusion APU chips [3], which have captured a signifi-
cant share of the market. These products have reduced communication overhead by
sharing the same last-level-cache (LLC), memory controller (MC), and other on-chip
resources between the CPU and GPU, resulting in improved system performance.
However, resource contention is a major challenge posed by resource sharing. The
inherent characteristics of CPUs and GPUs result in differences in their shared resource
utilization [4]. CPUs reduce memory access latency by utilizing multi-level caches [5],
while GPUs better tolerate latency by running large numbers of parallel
threads [6, 7]. Because highly parallel processing demands extensive network
resources, the high throughput of GPUs can exacerbate the issues caused by resource
sharing.
Network-on-Chip (NoC) can be regarded as a programmable system for inter-node
communication [8–10], connecting processing units with shared resources such as the
LLC and MC. Proper management of the NoC is crucial to improving system perfor-
mance, especially as it is one of the largest shared resources in heterogeneous multi-core
systems. Although heterogeneous multi-core processors that integrate CPU and GPU
cores theoretically offer higher peak performance, many factors such as data transmis-
sion between cores, resource allocation, and GPU programming strategies continue to
affect the overall system performance. Due to the different computing characteristics
of the two cores, the traditional cyclic scheduling allocation mechanism of NoC can
lead to significant performance losses as the GPU occupies a large amount of network
resources. To achieve reasonable resource allocation, it is necessary to separate the
different traffic flows in the NoC of CPU-GPU heterogeneous systems.
Some researchers have focused on optimizing the performance of NoC through com-
munication routing mechanisms. Lee et al. [11] proposed a multicast routing scheme
that guarantees deadlock-free and improved throughput by using router marking rules,
destination router partitioning, and traffic-adaptive branching, ultimately reducing
the number of packet hops and dispersing channel traffic. Khodadadi et al. [12] developed
a fault-tolerant routing algorithm (FT-PDC) based on path separation and congestion
reduction for 3D NoC, which considers factors such as finding the shortest path, link
failure, path diversity, and congestion. Alaei et al. [13] proposed a multicast adaptive
routing algorithm for mesh networks based on fuzzy load control, which dynamically
prevents livelock and deadlock situations using a fuzzy control system, effectively
reducing latency and congestion. Salamat et al. [14, 15] proposed a high-performance
adaptive routing algorithm for 3D NoC and conducted experiments on different traffic
scenarios. Charles et al. [16] proposed a lightweight trust-aware routing mechanism
that bypasses malicious IP cores during data packet transmission, reducing the number
of retransmissions caused by data tampering and minimizing the risk of DoS attacks,
thereby improving NoC performance. Most of the aforementioned research is focused
on homogeneous architectures, with limited exploration of networks with multiple
types of computing cores. However, different cores have unique memory access char-
acteristics, making it difficult for routing algorithms designed for one type of core to
adapt to others.
Many researchers have turned to heterogeneous Network-on-Chip (NoC) to address
the issue of on-chip resource contention between the CPU and GPU. Virtual chan-
nel technology has been proposed as a deadlock avoidance solution in on-chip
networks, which also helps prevent head-of-line blocking. Lee et al. conducted
structural exploration of ring networks and proposed an optimized placement method
for ring heterogeneous networks [17]. They also proposed static and dynamic adaptive
allocation methods for virtual channels in mesh networks, targeting the virtual
channel contention problem in heterogeneous NoC [18]. Cui et al. [19] introduced an
interference-free NoC architecture by partitioning different routing nodes and placing
them appropriately, enabling routing in different dimensions for different tasks.
They reduced network energy consumption through dedicated routing algorithms and
bypass techniques. Li et al. [20] optimized the ALPHA router for heterogeneous multicore
systems by increasing the injection link width and crossbar size and modifying
the buffer organization of router injection ports to enhance injection bandwidth and
improve throughput. It effectively addressed local and global contention to reduce
network latency. Yin et al. [21] leveraged the diversity of on-chip heterogeneous computing
devices and partitioned the network using time-division multiplexing, allowing packet-
switched and circuit-switched messages to share the same communication structure,
with circuit-switched paths established along frequently communicating nodes. Zhan
et al. [22] designed the OSCAR architecture, which maximized the potential of STT-
RAM-based LLC. They proposed an integrated asynchronous batch scheduling and
priority-based on-chip network interconnect method. Fang et al. [23] first evaluated the
placement of different buffered and unbuffered routers and proposed a Unidirectional
Flow Control (UFC) method to avoid network congestion.
The objective of this paper is to alleviate the contention of on-chip resources
between CPU and GPU in a heterogeneous system. We first discuss the impact of dif-
ferent topologies on the performance of heterogeneous NoC, followed by proposing an
optimized routing algorithm. The contributions of this paper are summarized
as follows:
• We propose an LLC/MC CENTER architecture, which considers the impact of
different LLC/MC placement methods on the performance of heterogeneous network
systems in a mesh architecture.
• Based on the LLC/MC CENTER model, we analyze the easily congested paths in
the network, optimize them based on the different tasks in a high-traffic heterogeneous
NoC, and propose a Task-Based (TB) routing algorithm to enhance the network’s
performance.
• We propose a Task-Based-Partition (TBP) routing algorithm that allocates the
routing algorithms of different tasks into separate virtual channels. By utilizing dynamic
monitoring technology to detect the phase behavior of applications in the network,
we further propose an improved TB-TBP adaptive routing algorithm, which combines
the advantages of the TB and TBP routing algorithms and enhances system
performance.

The remainder of this paper is organized as follows. Section 2 analyzes the issue
of performance degradation caused by mutual interference between CPU and GPU in
heterogeneous multi-core systems, and compares different LLC/MC placement meth-
ods. Section 3 elaborates on the proposed Task-Based routing algorithm and TB-TBP
routing algorithm on the heterogeneous network architecture of LLC/MC CENTER.
In Sections 4 and 5, we present the simulator parameters used in the experiments, the
benchmark set, and the evaluation results and analysis. Section 6 concludes the article.

2 Background And Motivation


2.1 Topology

Router Router Router Router Router

IP IP IP IP IP

Router Router Router Router Router

IP IP IP IP IP

Router Router Router Router Router

IP IP IP IP IP

Router Router Router Router Router

IP IP IP IP IP

Router Router Router Router Router

IP IP IP IP IP

CPU GPU LLC MC


L1-Cache L1-Cache
L2-Cache

Fig. 1 Heterogeneous NoC architecture.

The scalability and reliability of the Network-on-Chip (NoC) make it a suitable
structure for heterogeneous systems. However, urgent attention is required to address
issues such as network delay and performance degradation caused by contention for
resources among different cores. In the general mesh heterogeneous network topology,
each router is connected to an IP node, which can be CPU, GPU, LLC, MC, or other
accelerators. This paper adopts the mesh architecture shown in Figure 1, where routers
are connected to CPU, GPU, LLC, and MC respectively. The CPU has a first-level
private cache and a second-level private cache, while the GPU has a first-level private
cache. As the load traffic on the heterogeneous NoC increases, conflicts will intensify,
and the network’s latency will increase correspondingly.
The modularity of the NoC allows for the flexible placement of IP nodes based
on different tasks. In a heterogeneous system, LLC and MC are shared resources that
perform numerous memory access tasks from CPU and GPU and form a critical part
of the NoC. Reasonable placement of LLC and MC can reduce memory access path
delay and prepare for path division. Based on the location of LLC/MC, we categorize
the structure of heterogeneous NoC into three models: center placement (LLC/MC
CENTER), side placement (LLC/MC SIDE), and four corner placement (LLC/MC
CORNER), as shown in Figure 2. The LLC/MC SIDE model places LLC/MC only
near either CPU or GPU, resulting in significant conflicts at the LLC convergence.
Although the LLC/MC CORNER model can disperse traffic conflicts well, it also
increases the number of routing hops, which is not conducive to optimization. The
LLC/MC CENTER model takes into account the memory access tasks of CPU and
GPU, with MC surrounded by LLC to reduce the number of hops. The number of hops
in the route for CPU and GPU to access LLC is also reduced to a minimum. However,
this model’s central location concentrates all traffic, which may lead to congestion.
Considering traffic characteristics, LLC should be placed as close as possible to CPU
and GPU. Therefore, LLC/MC CENTER is chosen as the baseline model in this paper.
In the experiments, we demonstrate the performance of LLC/MC CENTER.
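
To make the hop-count comparison concrete, the short Python sketch below scores a placement by the average Manhattan distance from each compute node to its nearest LLC tile. The CENTER layout string follows Figures 2 and 5; treating the nearest LLC as the access target is our simplifying assumption, not a measured traffic model.

# Score a 5x5 placement by the average Manhattan hops from each compute
# node (C = CPU, G = GPU) to its nearest LLC tile (L). The CENTER layout
# follows Figures 2 and 5; nearest-LLC cost is a simplifying assumption.
CENTER = ["CCCCC",
          "GLMLG",
          "GLMLG",
          "GLMLG",
          "GGGGG"]

def avg_hops_to_llc(grid):
    llcs = [(x, y) for y, row in enumerate(grid)
                   for x, ch in enumerate(row) if ch == "L"]
    dists = [min(abs(x - lx) + abs(y - ly) for lx, ly in llcs)
             for y, row in enumerate(grid)
             for x, ch in enumerate(row) if ch in "CG"]
    return sum(dists) / len(dists)

print(avg_hops_to_llc(CENTER))  # 1.375 hops on average for CENTER

The SIDE and CORNER layouts of Figure 2 can be scored the same way by substituting their grid strings.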

Fig. 2 Heterogeneous NoC with different topologies (C: CPU, G: GPU, L: last-level cache, M: memory controller); left to right: LLC/MC CENTER, LLC/MC SIDE, LLC/MC CORNER.

2.2 Routing Algorithm


During on-chip network communication, achieving low latency and high bandwidth is
crucial for the entire system. The routing algorithm determines the path that data
packets take from the source to the destination in different network topologies, aiming
to avoid hotspots and reduce the probability of packet collisions, thus improving overall
system bandwidth and reducing latency. By designing routing algorithms reasonably,
network latency can be reduced, and system throughput can be increased. However,
critical paths and congestion may still occur in routing, requiring the consideration of

5
Fig. 3 CPU performance decrease when running CPU and GPU at the same time.

minimizing path hops to enhance overall system communication capacity. For hetero-
geneous on-chip networks, unlike traditional homogeneous networks, the impact of core
heterogeneity on the network needs to be considered. Traditional routing algorithms
do not address individual routing algorithm designs for different cores. The number
of GPU communication tasks is significantly higher than that of CPU communication
tasks, necessitating routing algorithm designs that account for the placement of nodes
in different core locations and core characteristics. Moreover, when designing routing
algorithms, deadlock and load balancing issues also need to be taken into account.

2.3 CPU and GPU Resource Contention


To assess the degree of resource contention between the CPU and GPU in the LLC/MC
model, we followed the classification method proposed by Lee et al. [18] and Fang
et al. [23] to categorize the benchmark sets. Specifically, we classified the CPU and
GPU benchmarks into high-traffic and low-traffic groups, as illustrated in Tables 3
and 4, respectively. For the CPU benchmarks in Table 3, we defined benchmarks
with a Packet per Kilo Cycle (PKC) value greater than 20 as high-traffic group, and
those with PKC values less than 20 as low-traffic group. As for the GPU benchmarks
in Table 4, we classified benchmarks with PKC values exceeding 100 as high-traffic
group, and those with PKC values below 100 as low-traffic group [18]. To evaluate
the performance impact on the CPU when running CPU benchmarks alone versus
when running together with GPU benchmarks in the baseline model (LLC/MC CEN-
TER), we compared the results, as depicted in Figure 3. The figure clearly shows that
CPU performance suffers significant degradation under high-traffic conditions, with
an average decrease of up to 60% in the high-traffic group, as opposed to only 4% in
the low-traffic group.

Fig. 3 CPU performance decrease when running CPU and GPU at the same time.

Fig. 4 Proportions of network routing tasks.

We also measured the proportions of network routing tasks for 24 groups of mixed
loads, as shown in Figure 4. When the network is under low load, it is evident
that communication frequency between the CPU and LLC accounts for the major-
ity, and resource contention between the GPU and CPU occurs less frequently. For
example, in the three groups of mixed loads in the Povray test set, tasks between the
CPU and LLC can account for 70% of the total workload. When the network is under
high load, there is more communication frequency between the GPU and LLC, and
resource contention between the GPU and CPU becomes more severe. On the other
hand, overall, the communication frequency between LLC and MC is relatively fixed,
but it also accounts for nearly 30% of the workload. Therefore, our optimization of the
heterogeneous network communication routing algorithm has two goals: first, to
reduce GPU contention for shared resources and improve the reachability of CPU
tasks; second, to reduce hotspot path accesses for LLC and MC communication.

3 Heterogeneous NoC Routing Algorithm


In this section, we analyze the path flow characteristics of the LLC/MC CENTER
model for heterogeneous NoC and design static and dynamic routing algorithms on
the LLC/MC CENTER model.

3.1 Task-Based Routing Algorithm (TB)


Regardless of the routing algorithm used, the LLC/MC CENTER model will result
in a highly congested path due to the direct access of the CPU and GPU to the LLC.
Furthermore, frequent communication between the LLC and MC will create hot spots
in the center of the network. Figure 5 illustrates this issue, where the congested path is
represented by the red line and the hotspot areas are circled by dotted lines.

Fig. 5 Hotspot areas and high-traffic paths for the LLC/MC CENTER model.

In Figure 5, we can clearly observe the flow characteristics of the LLC/MC CENTER model.
Since there is no direct communication among the individual LLCs or among the
MCs, the traffic on the vertical paths within the hotspot area will be lower than that
on the horizontal paths. For the outer area, the traffic on the ring path will be lower than the traffic
in the middle hotspot area. Based on the structure’s characteristics, we can allocate
traffic in a reasonable manner to reduce network congestion. We propose a four-point
path selection rule with decreasing priority from top to bottom, described as follows:
1. Execute tasks with the least number of hops possible.
2. Assign as few high-traffic paths as possible.
3. Exit the hotspot area as quickly as possible.
4. Distribute workload evenly across the network.
Regarding routing algorithms, the XY routing algorithm is the most commonly used in
NoC. It is deadlock-free and highly reliable under low workloads. However,
under high workloads, the problem of path delay becomes more serious.
Based on the aforementioned principles, when performing a cache access request
task (CPU->LLC/GPU->LLC), the dimension-order XY routing algorithm is better
suited to allow the CPU and GPU to access the hotspot area. When performing CPU
reply (LLC->CPU), as depicted in Figure 6, the vertical path in the hotspot area
has fewer high-traffic paths, and there is also little traffic on the lateral path in the
CPU area. For CPU reply, the dimension-order YX routing algorithm is chosen to
avoid potentially congested paths. When performing GPU reply (LLC->GPU), the XY
routing algorithm is selected, since most GPUs can exit the hotspot area quickly, as
there is less traffic on the vertical and ring paths. To balance the network load
for routing tasks within the hotspot area, we adhere to rule 4 and utilize the YX routing
algorithm.

Fig. 6 Path selection for CPU reply tasks.

Fig. 7 Virtual channel assignment under the Task-Based routing algorithm.
To avoid the risk of deadlock, a common issue faced by routing algorithms, an
escape virtual channel is set up in the proposed Task-Based routing algorithm.
As depicted in Figure 7, each router input port has two virtual channels,
VC1 and VC2, with Routing Computation calculating the packet’s correct output
port. The VC Arbiter selects the virtual channel that may release a flit to pass through
the crossbar input port, while the Switch Arbiter determines which flit enters or exits the
crossbar. The crossbar physically moves a flit from an input port to an output
port. The Task-Based routing algorithm is used for traffic on VC1, while a dimension-
ordered XY algorithm is used for traffic on VC2 to ensure deadlock-free routing. By
implementing escape virtual channels, deadlocks can be effectively avoided.

Algorithm 1 TB Routing Algorithm
Require: src is the source node, dst is the destination node; src.x and src.y are the
horizontal and vertical coordinates of src, and likewise for dst
1: if CPU request tasks OR GPU request tasks OR GPU reply tasks then
2:   if src.x > dst.x then choose straight route to the LEFT
3:   else if src.x < dst.x then choose straight route to the RIGHT
4:   else if src.y > dst.y then choose straight route to the UP
5:   else if src.y < dst.y then choose straight route to the DOWN
6:   end if
7: end if
8: if CPU reply tasks OR tasks between LLC and MC then
9:   if src.y > dst.y then choose straight route to the UP
10:  else if src.y < dst.y then choose straight route to the DOWN
11:  else if src.x > dst.x then choose straight route to the LEFT
12:  else if src.x < dst.x then choose straight route to the RIGHT
13:  end if
14: end if

3.2 TB-TBP Routing Algorithm


To further enhance the performance of the heterogeneous architecture, we propose the
Task-Based-Partition (TBP) algorithm, which assigns each virtual channel
to execute a particular task. As illustrated in Figure 8, VC1 is dedicated to executing
all tasks of the XY routing algorithm, while VC2 is responsible for executing all tasks
of the YX routing algorithm. Both routing algorithms utilized by the virtual channels
are deadlock-free. While the Task-Based routing algorithm distributes XY and YX
routing algorithms in the same dimension, the advantage of the TBP routing algorithm
is that it reduces conflicts between CPU and GPU by separating different tasks in
different dimensions. However, this comes at the cost of sacrificing the utilization of
virtual channels, leading to a reduction in network throughput to some extent, but
ultimately resulting in improved performance.
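
Reusing the helpers from the previous sketch, the TBP channel assignment can be stated in a few lines; again, the task labels are our own illustrative names. Each virtual channel is statically bound to one dimension order, so XY and YX flows never share a channel:

def tbp_select_vc_and_port(task, src, dst):
    # TBP: every XY-ordered task rides VC1, every YX-ordered task rides
    # VC2, regardless of occupancy; both channels stay deadlock-free.
    if task in XY_TASKS:
        return 1, xy_route(src, dst)
    return 2, yx_route(src, dst)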
Although the TB routing algorithm can effectively avoid path conflicts and resource
contention, the escape virtual channel is not well utilized in high-load network situa-
tions and is only used as a backup channel to avoid deadlocks. If the escape virtual
channel is used reasonably, congestion problems can be further reduced. The TBP
routing algorithm makes good use of virtual channel resources but wastes on-chip
resources at lower network loads. Therefore, we propose the TB-TBP adaptive routing
algorithm for improvement. The TB-TBP routing algorithm combines the advantages
of the TB and TBP routing algorithms and dynamically monitors the phase behavior
information of application programs in the network through monitoring technology. In
high-load network situations, it adopts the TBP algorithm to improve network com-
munication efficiency, while in low-load networks, it adopts the TB algorithm to avoid
wasting channel resources and improve network throughput, as shown in Figure 8.

Fig. 8 Virtual channel assignment under the Task-Based-Partition routing algorithm.

In order to dynamically monitor core performance, we propose a dynamic monitoring
technique that uses the number of retired instructions of each CPU core as a measure
of core performance. The variation in the retired instruction count is used as the
parameter for dynamically switching between the TB and TBP routing algorithms.
This is consistent with the research methodology employed by previous
researchers [18, 24].
We determine the retired instruction count for different CPU benchmark test sets
and plot the trend of retired instructions per hundred cycles, as shown in Figure 9.

Fig. 9 Retired instructions of different CPU cores and the total.
Each CPU application exhibits its own unique phased behavior, and there is no
regularity to the number of retired instructions. From the blue trend chart, it can
be observed that each application may have a high number of retired instructions
at different time periods, or the overall number of retired instructions may be low.
However, when we consider all applications as a whole, as shown in the black trend
chart in the lower right corner, the phased behavior of each application becomes
blurred and indistinct. Under the overall context, the applications exhibit relatively
periodic behavior, which provides a basis for our subsequent optimization. When we
bind different applications to different CPU cores, the number of retired instructions
can be tracked through dynamic monitoring.
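
A minimal sketch of this per-core tracking is shown below; read_retired(core) is a hypothetical counter read standing in for the simulator’s bookkeeping, and the 100-cycle window matches the per-hundred-cycle plots.

WINDOW = 100  # cycles per sample

def monitor(cores, read_retired, num_windows):
    # Sample each core's cumulative retired-instruction counter once per
    # window and keep both per-core deltas and the whole-system trend.
    last = {c: 0 for c in cores}
    per_core = {c: [] for c in cores}
    total = []
    for _ in range(num_windows):
        # (the simulation advances WINDOW cycles between samples)
        snapshot = {c: read_retired(c) for c in cores}
        for c in cores:
            per_core[c].append(snapshot[c] - last[c])
        total.append(sum(per_core[c][-1] for c in cores))
        last = snapshot
    return per_core, total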
Accurate switching between states can better improve network performance, oth-
erwise, there may be a situation where system overhead increases but performance
improvement is not significant. By observing the changes in the number of retired
instructions, we found that if the routing algorithm is switched within a short period of
time, accurate information about the retired instruction count cannot be obtained,
affecting the judgment of algorithm switching. However, if the routing algorithm is
switched after a long time, it cannot efficiently improve the system’s performance.
Hence, it is crucial to identify the appropriate sampling and main periods that yield
maximum benefits, as depicted in Figure 10.

Fig. 10 State transition in heterogeneous NoC.

We set up three types of periods:
The start-up period: during the start-up period, the system state is unstable, and
we use the TB algorithm. No sampling is performed during this period.
The sampling period: the purpose of the sampling period is to obtain information
about the current performance of the routing algorithm through sampling, in
order to decide whether to switch to a different routing algorithm. When the TB
algorithm is in use and the CPU’s invalid instruction count is high, we switch
promptly to the TBP routing algorithm, allowing the YX routing algorithm’s traffic
to enter the escape virtual channel. When the TBP routing algorithm is in use and
the CPU’s invalid instruction count remains low, we switch back to the TB routing
algorithm to balance the network load. The decision to switch between routing
algorithms is made by applying formula (1), which calculates the speedup ratio of
the two algorithms before and after the switch. When the speedup ratio is greater
than or equal to a threshold of 1.5, the TBP routing algorithm is preferentially
selected; when it falls below this threshold, the TB routing algorithm is deemed
more suitable for optimizing network performance.
Speedup_{RoutingAlgorithm} = \frac{\sum inst\_retired_{TB}}{\sum inst\_retired_{TBP}} \qquad (1)

The main period: it uses the better-performing routing algorithm from the sam-
pling period to maintain stable system performance over an extended period. Once
the main period is over, we enter the sampling period to begin sampling again. The
duration of each period is specified in Table 1.

Table 1 Period Length Setting

Period Name        Period Length
Start-up Period    700K cycles
Sampling Period    200K cycles
Main Period        TBP: 3M cycles; TB: 2M cycles
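
The sketch below restates the switching policy in Python; the period lengths and the 1.5 threshold follow Table 1 and formula (1), while the read_retired hook and the exact way the two retired-instruction sums are collected are our own reading of the mechanism, not a verbatim implementation.

START_UP, SAMPLE = 700_000, 200_000         # cycles, from Table 1
MAIN = {"TB": 2_000_000, "TBP": 3_000_000}  # main-period lengths
THRESHOLD = 1.5

def choose_algorithm(inst_retired_tb, inst_retired_tbp):
    # Formula (1): speedup = sum(inst_retired_TB) / sum(inst_retired_TBP);
    # at or above the threshold TBP is selected, otherwise TB.
    speedup = sum(inst_retired_tb) / max(sum(inst_retired_tbp), 1)
    return "TBP" if speedup >= THRESHOLD else "TB"

def controller(read_retired):
    # read_retired(algorithm, cycles) -> per-core retired counts for one
    # sampling period; a hypothetical hook into the simulator.
    algorithm, advance = "TB", START_UP     # start-up period runs TB
    while True:
        yield algorithm, advance            # run `advance` cycles
        tb = read_retired("TB", SAMPLE)     # sampling period
        tbp = read_retired("TBP", SAMPLE)
        algorithm = choose_algorithm(tb, tbp)
        advance = MAIN[algorithm]           # main period, then resample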

4 Experiment Setup
In this section, we list the experimental configuration, benchmark set, and evaluation
metrics.

4.1 System Setup


We use MacSim version 2.3 [25] as the simulator for the experiments. MacSim is a
heterogeneous architecture simulator that is trace-driven and cycle-level. MacSim sup-
ports X86 and NVIDIA PTX instruction set architectures. The architectures simulated
in this experiment are Intel’s Sandy Bridge and NVIDIA’s Fermi architecture. Table 2
shows our emulator configuration and NoC configuration. For multi-application emu-
lation, an early terminated application sometimes must be re-executed to simulate
resource contention (cache, on-chip interconnect, and memory controller) until all
applications are complete. The GPU application will continue to run repeatedly until
the CPU application finishes running, in order to simulate the resource contention
behavior on the network. This method is consistent with the work of Lee et al. [17].

Table 2 Heterogeneous CPU-GPU Architecture Configuration

CPU Core: 5 cores, 3 GHz, 4-wide out-of-order (OoO); gshare branch predictor; 8-way 64KB L1 D/I cache, 3-cycle; 8-way 256KB L2 cache, 8-cycle
GPU Core: 11 cores, 1.5 GHz, in-order (IO), 2-way, 16 SIMD width; 8-way 32KB L1 D cache (2-cycle); 8-way 4KB L1 I cache (1-cycle); 16KB software-managed cache
LLC: 6 tiles (each tile: 32-way, 1MB), 64B line, LRU
Memory: DDR3-1333, 2 MCs (each 8 banks, 2 channels); 41.6 GB/s bandwidth, 2KB row buffer, FR-FCFS scheduler
NoC: 1 GHz, 5×5 2D mesh topology; 4-stage pipeline (IB, RC, VCA, SA/ST), credit-based; 2 VCs per port, each VC holds 5 flits; 5 ports per router, 128-bit (16B) links with 1-cycle latency

4.2 Benchmark
We utilized the SPEC CPU2006 benchmarks [26] and several suites of CUDA GPGPU
benchmarks, including the NVIDIA CUDA SDK, Rodinia [27], and Parboil [28], in our
experiments. Each CPU core runs one CPU application, while all GPU cores run one
GPGPU application. To generate SPEC CPU2006 traces, we used PinPoints [29], and
GPU Ocelot [30] was used to generate GPGPU benchmark traces. PKC was used as a
statistical indicator to evaluate the communication status of the network and determine
the high- and low-traffic groups of applications. The grouping results are presented in
Tables 3 and 4.

Table 3 CPU benchmark

PKC < 20 (Low)                    PKC > 20 (High)
Benchmark   Suite   PKC           Benchmark   Suite   PKC
povray      fp      2             mcf         int     22
namd        fp      3             bwaves      fp      40
sjeng       int     3             milc        fp      40
gamess      fp      3             leslie3d    fp      45
gobmk       int     6             lbm         fp      68
bzip2       int     11
perlbench   int     15

Table 4 GPU benchmark

PKC < 100 (Low)                          PKC > 100 (High)
Benchmark         Suite     PKC          Benchmark   Suite     PKC
mri               parboil   3            backprop    rodinia   111
luDecomposition   rodinia   4            histo       parboil   199
needlemanWunsch   rodinia   23           spmv        parboil   467
MonteCarlo        SDK       38
mm                parboil   44
srad              rodinia   68
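
As a restatement of the grouping rule in Section 2.3, a short Python sketch (the thresholds and example PKC values come from Tables 3 and 4; the function itself is just illustrative):

CPU_PKC_THRESHOLD = 20    # packets per kilo cycle, CPU benchmarks
GPU_PKC_THRESHOLD = 100   # packets per kilo cycle, GPU benchmarks

def traffic_group(pkc, is_gpu):
    # Classify a benchmark as high- or low-traffic by its PKC value.
    threshold = GPU_PKC_THRESHOLD if is_gpu else CPU_PKC_THRESHOLD
    return "high" if pkc > threshold else "low"

assert traffic_group(68, is_gpu=False) == "high"   # lbm, PKC 68
assert traffic_group(3, is_gpu=False) == "low"     # namd, PKC 3
assert traffic_group(467, is_gpu=True) == "high"   # spmv, PKC 467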

4.3 Metrics
We have selected IPC (Instructions Per Cycle) as the performance indicator for the
CPU. Additionally, we calculate the total network delay while running the application
to evaluate the overall performance of the network. The formula for IPC is given by
equation (2), where cycles represents the number of cycles used by the CPU to execute
the application program, and instruction_i represents the number of instructions
executed by CPU core i. Equation (3) gives the formula for calculating the overall IPC
of the network, where n is the total number of CPU cores.
IPC_i = \frac{instruction_i}{cycles} \qquad (2)
The average IPC of the system is then:
IPC = \frac{\sum_{i=0}^{n-1} IPC_i}{n} \qquad (3)
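
A direct worked restatement of equations (2) and (3) in Python, with hypothetical instruction and cycle counts:

def ipc(instructions, cycles):
    return instructions / cycles                     # equation (2)

def average_ipc(per_core_ipcs):
    return sum(per_core_ipcs) / len(per_core_ipcs)   # equation (3)

# Hypothetical example with the 5 CPU cores of Table 2:
cores = [ipc(3_000_000, 10_000_000) for _ in range(5)]  # each core: 0.3
print(average_ipc(cores))                               # prints 0.3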

5 Results And Analysis


In this section, we assess the path latency across different topologies and investigate
the properties of networks with high and low traffic. Additionally, we evaluate the
proposed Task-Based routing algorithm and TB-TBP routing algorithm, and analyze
the experimental results.

5.1 LLC/MC CENTER Model


Figures 11 and 12 provide latency comparisons among the LLC/MC SIDE, LLC/MC
CORNER, and LLC/MC CENTER models when running benchmarks for both high-
traffic and low-traffic groups. The results indicate that, for both groups, the LLC/MC
CENTER model has the lowest latency among the three placements and is significantly
better than the LLC/MC SIDE and LLC/MC CORNER models. This finding aligns
with our prior analysis. Thus, we selected the LLC/MC CENTER model as our target
model for subsequent experiments.

Fig. 11 Latency of high-traffic networks under different topologies.

Fig. 12 Latency of low-traffic networks under different topologies.

5.2 Routing Algorithm Analysis


The comparison algorithms we used include the classic odd-even routing algorithm [31],
the X-Y dimension-order routing algorithm [32], and the SD (Static and Dynamic) routing
algorithm. The SD routing algorithm is a hybrid algorithm that adopts shortest-path
adaptive routing in the congested central node region and the XY routing algorithm at
non-central nodes.
non-central nodes. Figures 13 and 14 demonstrate the latency and IPC comparisons
between these three typical routing algorithms and the proposed Task-Based routing
algorithm in a low-traffic network. When running 10 groups of mixed applications, the
Task-Based routing algorithm exhibits an average latency reduction of 1.53% com-
pared to the traditional XY routing algorithm. Additionally, the Task-Based algorithm
shows an average IPC increase of 1.21% compared to the traditional routing algo-
rithm. In a low-traffic network, where there are fewer resource contention conflicts
between cores, the algorithm’s improvement is not significant, which aligns with our
expectations.

Fig. 13 Latency of different algorithms under low-traffic networks.

Fig. 14 IPC of different algorithms under low-traffic networks.

Under high-traffic conditions, the Task-Based routing algorithm is evaluated.
Among the compared algorithms, the odd-even routing algorithm does not consider
the routing characteristics of the network topology, resulting in lower performance
than the other routing algorithms. The SD routing algorithm takes into account
the network’s tendency to form hotspots in the central region but does not effectively
integrate routing selection for different CPU and GPU tasks. In comparison, the
Task-Based routing algorithm exhibits an average latency reduction of 9% com-
pared to the classic XY routing algorithm. For memory-intensive benchmarks such
as mcf, the proposed Task-Based algorithm achieves better latency reduction. On
average, the Task-Based routing algorithm reduces the latency of the benchmark by
12%, indicating that our proposed Task-Based routing algorithm is more suitable for
CPU-memory-intensive application scenarios.

Fig. 15 Latency of different algorithms under high-traffic networks.

Fig. 16 IPC of different algorithms under high-traffic networks.

Fig. 17 Latency of the TB and TB-TBP algorithms on the network.

In terms of CPU performance, under ten groups of mixed loads, the Task-Based
routing algorithm exhibits an average IPC increase of 13.6% compared to the tra-
ditional routing algorithm, as depicted in Figure 16. Similarly, the performance
improvement is particularly notable for memory-intensive benchmarks. These findings
demonstrate that adopting a reasonable path planning method in high-traffic net-
works can significantly reduce resource contention and waiting time between the CPU
and the GPU. The Task-Based routing algorithm leverages the network’s characteris-
tic of having less traffic on the vertical path to enable faster communication between
the CPU and LLC. This mitigates the impact of a large number of GPU tasks and
enhances CPU performance. Simultaneously, while our focus was on improving CPU
performance, the overall network latency was also reduced accordingly. This indicates
that reducing congestion events improves the routing efficiency of both the CPU and
GPU on the network simultaneously. Furthermore, we introduce the TB-TBP dynamic
routing algorithm. Since TB-TBP focuses more on the overall characteristics of the
network when running different benchmarks, it better approximates the real system
state. Therefore, the testing approach for benchmarks slightly differs from the Task-
Based algorithm. We bind different benchmarks to various CPU cores and mix them
with different GPU test programs. These CPU benchmarks share the common charac-
teristics described in the previous context. “Mixed-high” refers to different benchmarks

in the high-traffic group, while “Mixed-low” refers to different benchmarks in the low-
traffic group. We compare the TB-TBP dynamic routing algorithm with the proposed
TB algorithm. In the tests with eight different programs under varying traffic condi-
tions, the TB-TBP algorithm shows a 4.08% increase in IPC and a 2.74% decrease in
latency. The TB-TBP dynamic routing algorithm effectively utilizes virtual channel
resources during network peak periods, resulting in reduced network congestion.

Fig. 18 IPC of the TB and TB-TBP algorithms on the network.

6 Conclusion
In this study, we addressed the placement problem of core components, namely shared
last-level cache (LLC) and memory controller (MC), in heterogeneous on-chip net-
works, as their proper placement can significantly improve network performance.
Placing LLC/MC in the middle of the network allows for the separation of CPU and
GPU traffic accessing LLC. However, the middle placement of LLC/MC can result in
the formation of hotspots, which leads to network congestion. To address this conges-
tion issue, we designed the Task-Based routing algorithm based on the tasks in the
heterogeneous on-chip network and developed the TB-TBP dynamic adaptive rout-
ing algorithm by analyzing network communication characteristics. Finally, extensive
simulations were conducted to verify the performance of our proposed algorithms. By
running mixed benchmarks, the Task-Based routing algorithm demonstrated a 9%
reduction in overall network latency compared to traditional routing algorithms, while
improving CPU performance by 13.6%. The TB-TBP routing algorithm achieved a
2.74% reduction in overall network latency and a 4.08% improvement in CPU per-
formance compared to the Task-Based routing algorithm. In future work, we aim
to further enhance the accuracy of congestion monitoring in the TB-TBP algorithm
and determine optimal algorithm cycle lengths. Additionally, we plan to scale up the
number of CPU and GPU cores in the network to evaluate the performance of the
algorithms in larger network architectures.

Declarations
• Funding
This work is supported by the Beijing Natural Science Foundation (4192007) and the
National Natural Science Foundation of China (61202076), along with other
government sponsors.
• Conflict of interest/Competing interests
The authors declare that they have no competing interests.
• Ethics approval
Not applicable.
• Consent to participate
Not applicable.
• Consent for publication
The authors readily consent to have this paper published.
• Availability of data and materials
All data generated or analyzed during this study are included in this article. There
are no further materials to provide.
• Authors’ contributions
Zhichao Wei and Juan Fang designed the TB and TB-TBP routing algorithms
for the heterogeneous Network-on-Chip, and Yaqi Liu participated in the experimental
testing and analysis of the MacSim benchmarks. Yumin Hou participated in the
optimization of the figure design and text in the paper. All authors composed the rest of the
manuscript, reviewed the whole manuscript, and approved the final manuscript.

References
[1] Fang, J., Yu, L., Liu, S., Lu, J., Chen, T.: KL GA: An application mapping algorithm
for mesh-of-tree (MoT) architecture in network-on-chip design. The Journal of
Supercomputing 71, 4056–4071 (2015)

[2] Gwennap, L.: Sandy Bridge spans generations: Intel focuses on graphics, multimedia
in new processor design (2010)

[3] Lee, K.S., Lin, H., Feng, W.-c.: Performance characterization of data-intensive
kernels on AMD Fusion architectures. Computer Science - Research and
Development 28, 175–184 (2013)

[4] Matoussi, O.: NoC performance model for efficient network latency estimation.
2021 Design, Automation & Test in Europe Conference & Exhibition (DATE),
994–999 (2021)

[5] Ma, S., Wang, Z., Liu, Z., Jerger, N.D.E.: Leaving one slot empty: Flit bubble
flow control for torus cache-coherent NoCs. IEEE Transactions on Computers 64,
763–777 (2015)

[6] Marculescu, R., Ogras, Ü.Y., Peh, L.-S., Jerger, N.D.E., Hoskote, Y.V.: Outstanding
research problems in NoC design: System, microarchitecture, and circuit
perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 28, 3–21 (2009)

[7] Cong, J., Gill, M., Hao, Y., Reinman, G.D., Yuan, B.: On-chip interconnection
network for accelerator-rich architectures. 2015 52nd ACM/EDAC/IEEE Design
Automation Conference (DAC), 1–6 (2015)

[8] Zheng, H., Wang, K., Louri, A.: A versatile and flexible chiplet-based system
design for heterogeneous manycore architectures. 2020 57th ACM/IEEE Design
Automation Conference (DAC), 1–6 (2020)

[9] Cheng, X., Zhao, Y., Zhao, H., Xie, Y.: Packet pump: Overcoming network bottleneck
in on-chip interconnects for GPGPUs. 2018 55th ACM/ESDA/IEEE Design
Automation Conference (DAC), 1–6 (2018)

[10] Kim, H.K., Kim, J., Seo, W., Cho, Y.-G., Ryu, S.: Providing cost-effective on-chip
network bandwidth in GPGPUs. 2012 IEEE 30th International Conference on
Computer Design (ICCD), 407–412 (2012)

[11] Lee, Y.S., Kim, Y.W., Han, T.H.: MRCN: Throughput-oriented multicast routing
for customized network-on-chips. IEEE Transactions on Parallel and Distributed
Systems 34, 163–179 (2023)

[12] Khodadadi, E., Barekatain, B., Yaghoubi, E., Mogharrabi-Rad, Z.: FT-PDC: An
enhanced hybrid congestion-aware fault-tolerant routing technique based on path
diversity for 3D NoC. The Journal of Supercomputing 78, 523–558 (2021)

[13] Alaei, M., Yazdanpanah, F.: A high reliable multicast routing algorithm for 2D
and 3D mesh-based NoCs with fuzzy-based load control. Journal of Control (2021)

[14] Salamat, R., Khayambashi, M., Ebrahimi, M., Bagherzadeh, N.: LEAD: An adaptive
3D-NoC routing algorithm with queuing-theory based analytical verification.
IEEE Transactions on Computers 67, 1153–1166 (2018)

[15] Salamat, R.: Design and evaluation of high-performance and fault-tolerant routing
algorithms for 3D-NoCs (2018)

[16] Charles, S., Mishra, P.: Lightweight and trust-aware routing in NoC-based SoCs.
2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 160–167
(2020)

[17] Lee, J., Li, S., Kim, H., Yalamanchili, S.: Design space exploration of on-chip ring
interconnection for a CPU-GPU heterogeneous architecture. J. Parallel Distributed
Comput. 73, 1525–1538 (2013)
[18] Lee, J., Li, S., Kim, H., Yalamanchili, S.: Adaptive virtual channel partitioning
for network-on-chip in heterogeneous architectures. ACM Transactions on Design
Automation of Electronic Systems (TODAES) 18, 1–28 (2013)

[19] Cui, Y.-W., Prabhakar, S.M., Zhao, H., Mohanty, S.P., Fang, J.: A low-cost
conflict-free NoC architecture for heterogeneous multicore systems. 2020 IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), 300–305 (2020)

[20] Li, Y., Louri, A.: ALPHA: A learning-enabled high-performance network-on-chip
router design for heterogeneous manycore architectures. IEEE Transactions on
Sustainable Computing 6, 274–288 (2020)

[21] Yin, J., Zhou, P., Sapatnekar, S.S., Zhai, A.: Energy-efficient time-division multiplexed
hybrid-switched NoC for heterogeneous multicore systems. 2014 IEEE 28th
International Parallel and Distributed Processing Symposium, 293–303 (2014)

[22] Zhan, J., Kayiran, O., Loh, G.H., Das, C.R., Xie, Y.: OSCAR: Orchestrating STT-RAM
cache traffic for heterogeneous CPU-GPU architectures. 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–13
(2016)

[23] Fang, J., Leng, Z., Liu, S., Yao, Z., Sui, X.: Exploring heterogeneous NoC design
space in heterogeneous GPU-CPU architectures. Journal of Computer Science and
Technology 30, 74–83 (2015)

[24] Ma, S., Lu, H., Huang, L., Shen, L., Guo, Y., Wang, Z., Xue, W.: Adaptive VC
partitioning for NoCs in GPGPUs. 2018 IEEE International Symposium on Circuits
and Systems (ISCAS), 1–5 (2018)

[25] Kim, H., Lee, J., Lakshminarayana, N.B., Sim, J., Pho, T.: MacSim: A CPU-GPU
heterogeneous simulation framework user guide (2012)

[26] Henning, J.L.: SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit.
News 34, 1–17 (2006)

[27] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.-H., Skadron,
K.: Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE
International Symposium on Workload Characterization (IISWC), 44–54 (2009)

[28] Stratton, J.A., Rodrigues, C.I., Sung, I.-J., Obeid, N., Chang, L.-W., Anssari,
N., Liu, G., Hwu, W.-m.W.: Parboil: A revised benchmark suite for scientific and
commercial throughput computing. (2012)

[29] Patil, H., Cohn, R.S., Charney, M.J., Kapoor, R., Sun, A., Karunanidhi,
A.: Pinpointing representative portions of large Intel Itanium programs with
dynamic instrumentation. 37th International Symposium on Microarchitecture
(MICRO-37'04), 81–92 (2004)

[30] Farooqui, N., Kerr, A., Diamos, G.F., Yalamanchili, S., Schwan, K.: A framework
for dynamically instrumenting GPU compute applications within GPU Ocelot. In:
GPGPU-4 (2011)

[31] Chiu, G.-M.: The odd-even turn model for adaptive routing. IEEE Trans. Parallel
Distributed Syst. 11, 729–738 (2000)

[32] An, J., You, H., Sun, J., Cao, J.: Fault-tolerant XY-YX routing algorithm
supporting backtracking strategy for NoC. 2021 IEEE Intl Conf on Parallel
& Distributed Processing with Applications, Big Data & Cloud Computing,
Sustainable Computing & Communications, Social Computing & Networking
(ISPA/BDCloud/SocialCom/SustainCom), 632–635 (2021)
