Research Article
DOI: https://doi.org/10.21203/rs.3.rs-2981298/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
With the rapid development of heterogeneous Network-on-Chip (NoC), a vast number of shared resources are integrated into the NoC. Intense resource competition between CPUs and GPUs leads to congestion and a decrease in overall network performance. Reasonable node placement can minimize network conflicts at the topology level. This paper first discusses the placement of the shared last-level cache (LLC) and memory controller (MC), then selects the more rational placement method and optimizes its paths. To solve the hotspot problem of the center placement method, a task-based routing algorithm is designed to plan the paths. Simulation results demonstrate that, compared to the traditional routing algorithm, overall network latency is reduced by 9% and CPU performance is improved by 13.6%. Furthermore, a dynamic task-based routing algorithm is proposed; compared to the static task-based routing algorithm, it reduces overall network latency by a further 2.74% and improves CPU performance by 4.08%.
1 Introduction
With the increasing demand for computational power in everyday life, homogeneous
multi-core CPUs are no longer sufficient to meet the requirements [1]. In response
to this, heterogeneous multi-core chips have emerged that integrate both CPU and
GPU cores to meet the computational demand. Examples of such chips include Intel's
Sandy Bridge [2] and AMD's Fusion APU chips [3], which have captured a significant
share of the market. These products reduce communication overhead by sharing the
last-level cache (LLC), memory controller (MC), and other on-chip resources between
the CPU and GPU, resulting in improved system performance.
However, resource contention is a major challenge posed by resource sharing. The
inherent characteristics of CPUs and GPUs result in differences in their shared-resource
utilization [4]. CPUs reduce memory access latency through multi-level caches [5],
while GPUs tolerate latency by running a large number of parallel threads [6, 7].
Because high parallelism requires extensive network resources, the high throughput
of GPUs can exacerbate the problems caused by resource sharing.
Network-on-Chip (NoC) can be regarded as a programmable system for inter-node
communication [8–10], connecting processing units with shared resources such as the
LLC and MC. Proper management of the NoC is crucial to improving system performance,
especially as it is one of the largest shared resources in heterogeneous multi-core
systems. Although heterogeneous multi-core processors that integrate CPU and GPU
cores theoretically offer higher peak performance, many factors, such as data transmission
between cores, resource allocation, and GPU programming strategies, continue to
affect overall system performance. Because the two kinds of cores have different
computing characteristics, the traditional cyclic scheduling mechanism of the NoC can
cause significant performance losses when the GPU occupies a large share of network
resources. To achieve reasonable resource allocation, the different traffic flows in the
NoC of CPU-GPU heterogeneous systems must be separated.
Some researchers have focused on optimizing NoC performance through communication
routing mechanisms. Lee et al. [11] proposed a multicast routing scheme that
guarantees deadlock freedom and improves throughput by using router marking rules,
destination router partitioning, and traffic-adaptive branching, ultimately reducing
the number of packet hops and dispersing channel traffic. Khodadadi et al. [12] developed
a fault-tolerant routing algorithm (FT-PDC) based on path separation and congestion
reduction for 3D NoC, which considers factors such as finding the shortest path, link
failure, path diversity, and congestion. Alaei et al. [13] proposed a multicast adaptive
routing algorithm for mesh networks based on fuzzy load control, which dynamically
prevents livelock and deadlock using a fuzzy control system, effectively reducing
latency and congestion. Salamat et al. [14, 15] proposed a high-performance adaptive
routing algorithm for 3D NoC and conducted experiments on different traffic scenarios.
Charles et al. [16] proposed a lightweight trust-aware routing mechanism that bypasses
malicious IP cores during packet transmission, reducing the number of retransmissions
caused by data tampering and minimizing the risk of DoS attacks, thereby improving
NoC performance. Most of the aforementioned research focuses on homogeneous
architectures, with limited exploration of networks with multiple
types of computing cores. However, different cores have unique memory access char-
acteristics, making it difficult for routing algorithms designed for one type of core to
adapt to others.
Many researchers have turned to heterogeneous Network-on-Chip (NoC) designs to
address on-chip resource contention between the CPU and GPU. Virtual channel
technology has been proposed as a deadlock avoidance solution in on-chip networks,
and it also helps prevent head-of-line blocking. Lee et al. [17] conducted a structural
exploration of ring networks and proposed an optimized placement method for
heterogeneous ring networks. They also proposed static and dynamic adaptive
allocation methods for virtual channels in mesh networks, targeting the virtual
channel contention problem in heterogeneous NoC [18]. Cui et al. [19] introduced an
interference-free NoC architecture by partitioning different routing nodes and placing
them appropriately, enabling routing in different dimensions for different tasks; they
reduced network energy consumption through dedicated routing algorithms and
bypass techniques. Li et al. [20] optimized the ALPHA router for heterogeneous
multicore systems by increasing the injection link width and crossbar size and modifying
the buffer organization of router injection ports to enhance injection bandwidth and
improve throughput, effectively addressing local and global contention to reduce
network latency. Yin et al. [21] leveraged the diversity of on-chip heterogeneous
computing devices and partitioned the network using time-division multiplexing,
allowing packet-switched and circuit-switched messages to share the same communication
structure, with circuit-switched paths established along frequently communicating nodes.
Zhan et al. [22] designed the OSCAR architecture, which maximizes the potential of
STT-RAM-based LLC, and proposed an integrated asynchronous batch scheduling and
priority-based on-chip network interconnect method. Fang et al. [23] first evaluated the
placement of different buffered and unbuffered routers and proposed a Unidirectional
Flow Control (UFC) method to avoid network congestion.
The objective of this paper is to alleviate contention for on-chip resources between
the CPU and GPU in a heterogeneous system. We first discuss the impact of different
topologies on the performance of heterogeneous NoC, and then propose an optimized
routing algorithm. The contributions of this paper are summarized as follows:
• We propose an LLC/MC CENTER architecture, which considers the impact of
different LLC/MC placement methods on the performance of heterogeneous network
systems in a mesh architecture.
• Based on the LLC/MC CENTER model, we analyze the easily congested paths in
the network, optimize them based on different tasks in a high-traffic heterogeneous
NoC, and propose a Task-Based (TB) routing algorithm to enhance network
performance.
• We propose a Task-Based-Partition (TBP) routing algorithm that allocates the
routing algorithms of different tasks to separate virtual channels. By utilizing dynamic
monitoring technology to detect the phase behavior of applications in the network,
we further propose an improved TB-TBP adaptive routing algorithm, which combines
the advantages of the TB and TBP routing algorithms and enhances system
performance.
The remainder of this paper is organized as follows. Section 2 analyzes the issue
of performance degradation caused by mutual interference between CPU and GPU in
heterogeneous multi-core systems, and compares different LLC/MC placement meth-
ods. Section 3 elaborates on the proposed Task-Based routing algorithm and TB-TBP
routing algorithm on the heterogeneous network architecture of LLC/MC CENTER.
In Sections 4 and 5, we present the simulator parameters used in the experiments, the
benchmark set, and the evaluation results and analysis. Section 6 concludes the article.
Fig. 1 Mesh architecture: a 5×5 grid of IP nodes, each connected to a router.
Heterogeneous multi-core systems integrate CPU cores alongside GPU cores acting as
accelerators. This paper adopts the mesh architecture shown in Figure 1, where routers
are connected to CPU, GPU, LLC, and MC respectively. The CPU has a first-level
private cache and a second-level private cache, while the GPU has a first-level private
cache. As the load traffic on the heterogeneous NoC increases, conflicts will intensify,
and the network’s latency will increase correspondingly.
The modularity of the NoC allows for the flexible placement of IP nodes based
on different tasks. In a heterogeneous system, LLC and MC are shared resources that
perform numerous memory access tasks from CPU and GPU and form a critical part
of the NoC. Reasonable placement of LLC and MC can reduce memory access path
delay and prepare for path division. Based on the location of LLC/MC, we categorize
the structure of heterogeneous NoC into three models: center placement (LLC/MC
CENTER), side placement (LLC/MC SIDE), and four corner placement (LLC/MC
CORNER), as shown in Figure 2. The LLC/MC SIDE model places the LLC/MC only
near either the CPU or the GPU, resulting in significant conflicts where LLC traffic converges.
Although the LLC/MC CORNER model can disperse traffic conflicts well, it also
increases the number of routing hops, which is not conducive to optimization. The
LLC/MC CENTER model takes into account the memory access tasks of CPU and
GPU, with MC surrounded by LLC to reduce the number of hops. The number of hops
in the route for CPU and GPU to access LLC is also reduced to a minimum. However,
this model’s central location concentrates all traffic, which may lead to congestion.
Considering traffic characteristics, LLC should be placed as close as possible to CPU
and GPU. Therefore, LLC/MC CENTER is chosen as the baseline model in this paper.
In the experiments, we demonstrate the performance of LLC/MC CENTER.
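To make the hop-count comparison concrete, the sketch below computes the average Manhattan distance from each CPU/GPU node to its nearest LLC on a 5×5 mesh. It is a minimal illustration: the CENTER layout follows the grid shown later in Figure 5, while the CORNER layout is our own assumption, since the exact corner node map is not given in the text.

```python
# Illustrative sketch: average core-to-LLC hop counts for two LLC/MC
# placements on a 5x5 mesh (C: CPU, G: GPU, L: LLC, M: MC). The CENTER
# layout follows Figure 5; the CORNER layout is an assumed example.

def avg_hops(layout):
    """Average Manhattan distance from every C/G node to its nearest L node."""
    nodes = {(x, y): t for y, row in enumerate(layout) for x, t in enumerate(row)}
    llcs = [p for p, t in nodes.items() if t == "L"]
    cores = [p for p, t in nodes.items() if t in ("C", "G")]
    dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    return sum(min(dist(c, l) for l in llcs) for c in cores) / len(cores)

CENTER = ["CCCCC", "GLMLG", "GLMLG", "GLMLG", "GGGGG"]
CORNER = ["LCCCL", "GMGMG", "GGGGG", "GMGMG", "LGGGL"]  # assumed layout

for name, layout in [("CENTER", CENTER), ("CORNER", CORNER)]:
    print(f"LLC/MC {name}: average hops to nearest LLC = {avg_hops(layout):.2f}")
```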
Fig. 2 Three LLC/MC placement models (C: CPU, G: GPU, L: Last Level Cache, M: Memory Controller).
Fig. 3 CPU performance decrease when running CPU and GPU at the same time.
Routing algorithm design aims at minimizing path hops to enhance overall system
communication capacity. For heterogeneous on-chip networks, unlike traditional
homogeneous networks, the impact of core
heterogeneity on the network needs to be considered. Traditional routing algorithms
do not address individual routing algorithm designs for different cores. The number
of GPU communication tasks is significantly higher than that of CPU communication
tasks, necessitating routing algorithm designs that account for the placement of nodes
in different core locations and core characteristics. Moreover, when designing routing
algorithms, deadlock and load balancing issues also need to be taken into account.
To quantify this interference, we measured the performance of CPU benchmarks running
together with GPU benchmarks in the baseline model (LLC/MC CENTER) and compared
the results, as depicted in Figure 3. The figure clearly shows that
CPU performance suffers significant degradation under high-traffic conditions, with
an average decrease of up to 60% in the high-traffic group, as opposed to only 4% in
the low-traffic group.
Fig. 4 Proportions of network routing tasks for 24 groups of mixed loads.
We also measured the proportions of network routing tasks for 24 groups of mixed
loads, as shown in Figure 4. When the network is under low load, it is evident
that communication frequency between the CPU and LLC accounts for the major-
ity, and resource contention between the GPU and CPU occurs less frequently. For
example, in the three groups of mixed loads in the Povray test set, tasks between the
CPU and LLC can account for 70% of the total workload. When the network is under
high load, there is more communication frequency between the GPU and LLC, and
resource contention between the GPU and CPU becomes more severe. On the other
hand, overall, the communication frequency between LLC and MC is relatively fixed,
but it also accounts for nearly 30% of the workload. Therefore, our optimization of the
heterogeneous network routing algorithm pursues two goals: first, reducing GPU
contention for shared resources to improve the accessibility rate of CPU tasks;
second, reducing hotspot path usage by LLC-MC communication.
Fig. 5 Hotspot areas and high-traffic paths for the LLC/MC CENTER model.
The high-traffic paths are represented by the red line, and the hotspot areas are
circled by dotted lines. In Figure 5, we can clearly observe the flow characteristics
of the LLC/MC CENTER model.
Since individual LLCs do not communicate directly with one another, nor do individual
MCs, the traffic on the vertical paths within the hotspot area will be lower than that
on the horizontal paths. For the outer area, the traffic on the ring path will be lower than the traffic
in the middle hotspot area. Based on the structure’s characteristics, we can allocate
traffic in a reasonable manner to reduce network congestion. We propose a four-point
path selection rule with decreasing priority from top to bottom, described as follows:
1. Execute tasks with the least number of hops possible.
2. Assign as few high-traffic paths as possible.
3. Exit the hotspot area as quickly as possible.
4. Distribute workload evenly across the network.
Regarding routing algorithms, the XY routing algorithm is the most commonly used in
NoC. It is deadlock-free and highly reliable under low workloads; under high workloads,
however, path delay becomes a serious problem.
Based on the aforementioned principles, when performing a cache access request
task (CPU->LLC/GPU->LLC), the dimension-order XY routing algorithm is better
suited to allow the CPU and GPU to access the hotspot area. When performing CPU
reply (LLC->CPU), as depicted in Figure 6, the vertical path in the hotspot area
has fewer high-traffic paths, and there is also little traffic on the lateral path in the
CPU area. For CPU reply, the dimension-order YX routing algorithm is chosen to
avoid potential congested paths.
Fig. 6 Path selection for a CPU reply in the LLC/MC CENTER model (Src: source node, Dst: destination node).
Fig. 7 Router input port with two virtual channels: VC1 carries Task-Based routing traffic and VC2 serves as the escape channel.
When performing GPU reply (LLC->GPU), the XY routing algorithm is selected, since
most GPUs can exit the hotspot area quickly, as
there is less traffic in the vertical and circular paths. To balance the network load
for routing tasks in hotspot areas, we adhere to rule 4) and utilize the YX routing
algorithm.
To avoid the risk of deadlock, a common issue faced by routing algorithms, an escape
virtual channel is set up in the proposed Task-Based routing algorithm. As depicted
in Figure 7, each router input port has two virtual channels, VC1 and VC2, with
Routing Computation calculating the packet's correct output port.
Algorithm 1 TB Routing Algorithm
Require: src is the source node, dst is the destination node; src.x and src.y are the
source node's horizontal and vertical coordinates
1: if CPU request tasks OR GPU request tasks OR GPU reply tasks then
2: if src.x > dst.x then choose straight route to the LEFT
3: else if src.x < dst.x then choose straight route to the RIGHT
4: else if src.y > dst.y then choose straight route UP
5: else if src.y < dst.y then choose straight route DOWN
6: end if
7: end if
8: if CPU reply tasks OR tasks between LLC and MC then
9: if src.y > dst.y then choose straight route UP
10: else if src.y < dst.y then choose straight route DOWN
11: else if src.x > dst.x then choose straight route to the LEFT
12: else if src.x < dst.x then choose straight route to the RIGHT
13: end if
14: end if
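As a companion to Algorithm 1, here is a minimal executable sketch in Python. The task categories and the XY/YX split follow the algorithm text; the coordinate convention (y grows downward, so a smaller destination y means routing UP) is an assumption on our part.

```python
# Minimal sketch of the TB routing decision in Algorithm 1. Task names
# follow the paper; the y-grows-downward coordinate convention is assumed.

XY_FIRST = {"cpu_request", "gpu_request", "gpu_reply"}  # dimension-order XY
YX_FIRST = {"cpu_reply", "llc_mc"}                      # dimension-order YX

def tb_route(task, src, dst):
    """Return the direction of the next hop, or None if already at dst."""
    (sx, sy), (dx, dy) = src, dst

    def x_step():
        if sx > dx: return "LEFT"
        if sx < dx: return "RIGHT"
        return None

    def y_step():
        if sy > dy: return "UP"
        if sy < dy: return "DOWN"
        return None

    if task in XY_FIRST:          # resolve the X offset first, then Y
        return x_step() or y_step()
    if task in YX_FIRST:          # resolve the Y offset first, then X
        return y_step() or x_step()
    raise ValueError(f"unknown task type: {task}")

# A CPU reply (LLC -> CPU) leaves the hotspot area vertically first.
print(tb_route("cpu_reply", src=(2, 2), dst=(0, 0)))  # -> UP
```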
The VC Arbiter selects the virtual channel whose flit may pass through the crossbar
input port, while the Switch Arbiter determines which flit enters or exits the crossbar;
the crossbar physically moves flits from input ports to output ports. The Task-Based
routing algorithm is used for traffic in VC1, while the dimension-ordered XY algorithm
is used for traffic in VC2 to ensure deadlock-free routing. By implementing escape
virtual channels, deadlocks are effectively avoided.
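The escape mechanism could be sketched as follows; the routing split (Task-Based traffic in VC1, plain dimension-order XY in VC2) follows the text, while the credit-based fallback condition is an illustrative assumption rather than the paper's exact arbiter policy.

```python
# Sketch of escape-VC selection at a router input port. VC2 always uses
# dimension-order XY routing, which is deadlock-free on a mesh, so flits
# in the escape channel can always drain. The vc1_has_credit condition
# standing in for the arbiter's real policy is an assumption.

def xy_route(src, dst):
    """Dimension-order XY: resolve the X offset first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    if sx != dx:
        return "LEFT" if sx > dx else "RIGHT"
    if sy != dy:
        return "UP" if sy > dy else "DOWN"
    return None  # already at the destination

def select_vc(task, src, dst, vc1_has_credit, tb_route):
    """Prefer VC1 with Task-Based routing; fall back to the escape VC."""
    if vc1_has_credit:
        return "VC1", tb_route(task, src, dst)
    return "VC2", xy_route(src, dst)
```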
The TB-TBP adaptive routing algorithm combines the advantages of the TB and TBP
routing algorithms and dynamically monitors the phase behavior of application programs
in the network. In high-load situations it adopts the TBP algorithm to improve network
communication efficiency, while in low-load situations it adopts the TB algorithm to
avoid wasting channel resources and to improve network throughput, as shown in Figure 8.
Fig. 8 Virtual channel assignment under the Task-Based-Partition routing algorithm: at each router input port, VC1 carries Task-Based-XY traffic and VC2 carries Task-Based-YX traffic, with a Routing Algorithm Converter selecting between them ahead of the crossbar.
Fig. 9 Retired instructions of different CPU cores and the total.
If the routing algorithm is switched after too short a time, accurate information
about the number of retired instructions cannot be obtained,
affecting the judgment of algorithm switching. However, if the routing algorithm is
switched after a long time, it cannot efficiently improve the system’s performance.
Hence, it is crucial to identify the appropriate sampling and main periods that yield
maximum benefits, as depicted in Figure 10.
Fig. 10 Period transitions of the TB-TBP routing algorithm among the start-up (Initial), sampling (Sample), and main (Main) periods.
The start-up period: During the start-up period, the system state is unstable, and
we use the TB algorithm. No sampling work is performed during this period.
The sampling period: The purpose of the sampling period is to obtain information
about the current performance of the routing algorithm, in order to decide whether
to switch algorithms. When the TB algorithm is in use, we switch to the TBP routing
algorithm promptly if the CPU's invalid instruction count is high, allowing YX-routed
traffic to enter the escape virtual channel. When the TBP routing algorithm is in use,
we switch back to the TB routing algorithm to balance the network load if the CPU's
invalid instruction count remains low. The decision to switch is made with formula (1),
which calculates the speedup ratio of the two algorithms before and after the switch:
when the speedup ratio is at least 1.5, the TBP routing algorithm is preferentially
selected; below this threshold, the TB routing algorithm is chosen.
$$Speedup_{RoutingAlgorithm} = \frac{\sum inst\_retired_{TB}}{\sum inst\_retired_{TBP}} \quad (1)$$
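A sketch of this switching rule, assuming per-core retired-instruction counters sampled under each algorithm (the counter collection is hypothetical; the ratio and the 1.5 threshold follow formula (1) and the text):

```python
# Sketch of the sampling-period switch rule. Per formula (1), the speedup
# ratio compares retired instructions sampled under TB and under TBP;
# the 1.5 threshold follows the text. Counter collection is hypothetical.

SPEEDUP_THRESHOLD = 1.5

def choose_algorithm(inst_retired_tb, inst_retired_tbp):
    """inst_retired_*: per-CPU-core retired-instruction samples."""
    speedup = sum(inst_retired_tb) / sum(inst_retired_tbp)  # formula (1)
    return "TBP" if speedup >= SPEEDUP_THRESHOLD else "TB"

# Example: a ratio of 2.0 (>= 1.5) selects TBP for the main period.
print(choose_algorithm([1.6e6, 1.4e6], [0.8e6, 0.7e6]))  # -> TBP
```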
The main period: the better-performing routing algorithm from the sampling period
is used to maintain stable system performance over an extended period. Once the
main period is over, we re-enter the sampling period and begin sampling again. The
duration of each period is specified in Table 1.
4 Experiment Setup
In this section, we list the experimental configuration, benchmark set, and evaluation
metrics.
All CPU and GPU applications run simultaneously and contend for shared resources
(cache, on-chip interconnect, and memory controller) until all applications are complete.
Each GPU application runs repeatedly until the CPU application finishes, in order to
simulate the resource contention behavior on the network. This method is consistent
with the work of Lee et al. [17].
4.2 Benchmark
We utilized the SPEC CPU2006 benchmarks [26] and several CUDA GPGPU benchmark
suites, including the Nvidia CUDA SDK, Rodinia [27], and Parboil [28], in our
experiments. Each CPU core runs one CPU application, while all GPU cores run one
GPGPU application. SPEC CPU2006 traces were generated with PinPoints [29], and
GPU Ocelot [30] was used to generate GPGPU benchmark traces. PKC was used as a
statistical indicator to evaluate the communication status of the network and to divide
applications into high- and low-traffic groups. The grouping results are presented in
Tables 3 and 4.
Table 4 GPU benchmark grouping by PKC

PKC < 100 (Low)                          PKC > 100 (High)
Benchmark         Suite     PKC          Benchmark   Suite     PKC
mri               parboil   3            backprop    rodinia   111
luDecomposition   rodinia   4            histo       parboil   199
needlemanWunsch   rodinia   23           spmv        parboil   467
MonteCarlo        SDK       38
mm                parboil   44
srad              rodinia   68
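The grouping rule behind Tables 3 and 4 reduces to a threshold test on PKC; the following sketch reproduces the Table 4 split from its listed values.

```python
# Sketch of PKC-based traffic grouping: PKC < 100 is the low-traffic
# group, PKC >= 100 the high-traffic group. Values are from Table 4.

GPU_PKC = {"mri": 3, "luDecomposition": 4, "needlemanWunsch": 23,
           "MonteCarlo": 38, "mm": 44, "srad": 68,
           "backprop": 111, "histo": 199, "spmv": 467}

low = [b for b, pkc in GPU_PKC.items() if pkc < 100]
high = [b for b, pkc in GPU_PKC.items() if pkc >= 100]
print("low-traffic:", low)    # mri ... srad
print("high-traffic:", high)  # backprop, histo, spmv
```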
4.3 Metrics
We have selected IPC (Instructions Per Cycle) as the performance indicator for the
CPU. Additionally, we calculate the total network delay while running the applications
to evaluate the overall performance of the network. The formula for IPC is given by
equation (2), where cycles is the number of cycles used by the CPU to execute the
application program and instruction_i is the number of instructions executed by CPU
core i. Equation (3) gives the overall IPC of the system, where n is the total number
of CPU cores.

$$IPC_i = \frac{instruction_i}{cycles} \quad (2)$$
The average IPC of the system is then:
$$IPC = \frac{\sum_{i=0}^{n-1} IPC_i}{n} \quad (3)$$
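As a quick sanity check of equations (2) and (3), a minimal sketch with illustrative instruction and cycle counts:

```python
# Sketch of the metrics: per-core IPC (equation (2)) and the system
# average over n CPU cores (equation (3)). The input counts are made up.

def ipc_per_core(instructions, cycles):
    return instructions / cycles                    # equation (2)

def system_ipc(per_core_ipcs):
    return sum(per_core_ipcs) / len(per_core_ipcs)  # equation (3)

ipcs = [ipc_per_core(i, c) for i, c in [(2.0e9, 4.0e9), (1.5e9, 4.0e9)]]
print(f"system IPC = {system_ipc(ipcs):.3f}")       # -> 0.438
```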
[Figure: overall network latency (ns) for the high-traffic mixed loads lbm+stencil, milc+stencil, milc+spmv, lbm+spmv, and mcf+stencil.]
[Figure: overall network latency (ns) for the low-traffic mixed loads namd+hotspot, namd+mri, sjeng+lu, povray+hotspot, and povray+needleman.]
For the low-traffic mixed loads, where there is little contention between cores, the
algorithm's improvement is not significant, which aligns with our expectations.
[Figure: network latency (ns) comparison under mixed loads.]
[Figure: CPU IPC comparison under mixed loads.]
The traditional XY routing algorithm does not integrate routing selection for different
CPU and GPU tasks. In comparison, the Task-Based routing algorithm exhibits an
average latency reduction of 9% relative to the classic XY routing algorithm. For
memory-intensive benchmarks such as mcf, the proposed Task-Based algorithm achieves
a larger latency reduction, 12% on average for these benchmarks, indicating that our
Task-Based routing algorithm is better suited to CPU-memory-intensive application
scenarios.
[Figure: overall network latency (ns) comparison.]
[Figure: CPU IPC comparison.]
[Figure: network latency (ns) of TB vs. TB-TBP under Mixed_high+backprop, Mixed_high+histo, Mixed_high+spmv, and Mixed_high+stencil.]
[Figure: network latency (ns) of TB vs. TB-TBP under Mixed_low+mri, Mixed_low+hotspot, Mixed_low+lu, and Mixed_low+needleman.]
In terms of CPU performance, under ten groups of mixed loads, the Task-Based
routing algorithm exhibits an average IPC increase of 13.6% compared to the tra-
ditional routing algorithm, as depicted in Figure 15. Similarly, the performance
improvement is particularly notable for memory-intensive benchmarks. These findings
demonstrate that adopting a reasonable path planning method in high-traffic net-
works can significantly reduce resource contention and waiting time between the CPU
and the GPU. The Task-Based routing algorithm leverages the network’s characteris-
tic of having less traffic on the vertical path to enable faster communication between
the CPU and LLC. This mitigates the impact of a large number of GPU tasks and
enhances CPU performance. Simultaneously, while our focus was on improving CPU
performance, the overall network latency was also reduced accordingly. This indicates
that reducing congestion events improves the routing efficiency of both the CPU and
GPU on the network simultaneously. Furthermore, we introduce the TB-TBP dynamic
routing algorithm. Since TB-TBP focuses more on the overall characteristics of the
network when running different benchmarks, it better approximates the real system
state. Therefore, the testing approach for benchmarks slightly differs from the Task-
Based algorithm. We bind different benchmarks to various CPU cores and mix them
with different GPU test programs. These CPU benchmarks share the common charac-
teristics described in the previous context. "Mixed-high" refers to a mix of benchmarks
from the high-traffic group, while "Mixed-low" refers to a mix from the low-traffic group.
[Figure: CPU IPC of the TB and TB-TBP routing algorithms under mixed loads.]
We compare the TB-TBP dynamic routing algorithm with the proposed
TB algorithm. In the tests with eight different programs under varying traffic condi-
tions, the TB-TBP algorithm shows a 4.08% increase in IPC and a 2.74% decrease in
latency. The TB-TBP dynamic routing algorithm effectively utilizes virtual channel
resources during network peak periods, resulting in reduced network congestion.
6 Conclusion
In this study, we addressed the placement problem of core components, namely shared
last-level cache (LLC) and memory controller (MC), in heterogeneous on-chip net-
works, as their proper placement can significantly improve network performance.
Placing LLC/MC in the middle of the network allows for the separation of CPU and
GPU traffic accessing LLC. However, the middle placement of LLC/MC can result in
the formation of hotspots, which leads to network congestion. To address this conges-
tion issue, we designed the Task-Based routing algorithm based on the tasks in the
heterogeneous on-chip network and developed the TB-TBP dynamic adaptive rout-
ing algorithm by analyzing network communication characteristics. Finally, extensive
simulations were conducted to verify the performance of the proposed algorithms. By
running mixed benchmarks, the Task-Based routing algorithm demonstrated a 9%
reduction in overall network latency compared to traditional routing algorithms, while
improving CPU performance by 13.6%. The TB-TBP routing algorithm achieved a
2.74% reduction in overall network latency and a 4.08% improvement in CPU per-
formance compared to the Task-Based routing algorithm. In future work, we aim
to further enhance the accuracy of congestion monitoring in the TB-TBP algorithm
and determine optimal algorithm cycle lengths. Additionally, we plan to scale up the
number of CPU and GPU cores in the network to evaluate the performance of the
algorithms in larger network architectures.
Declarations
• Funding
This work is supported by the Beijing Natural Science Foundation (4192007) and the
National Natural Science Foundation of China (61202076), along with other government
sponsors.
• Conflict of interest/Competing interests
The authors declare that they have no competing interests.
• Ethics approval
Not applicable.
• Consent to participate
Not applicable.
• Consent for publication
The authors readily consent to have this paper published.
• Availability of data and materials
All data generated or analyzed during this study are included in this article. There
are no further materials to provide.
• Authors’ contributions
ZhiChao Wei and Juan Fang designed the TB and TB-TBP routing algorithms
for heterogeneous Network-On-Chip, and Yaqi Liu participated in the experimental
testing and analysis of the MacSim benchmarks. Yumin Hou participated in optimizing
the figures and text of the paper. All authors composed the rest of the manuscript,
reviewed the whole manuscript, and approved the final manuscript.
References
[1] Fang, J., Yu, L., Liu, S., Lu, J., Chen, T.: KL GA: an application mapping algorithm
for mesh-of-tree (MoT) architecture in network-on-chip design. The Journal of
Supercomputing 71, 4056–4071 (2015)
[2] Gwennap, L.: Sandy Bridge spans generations: Intel focuses on graphics, multimedia
in new processor design (2010)
[3] Lee, K.S., Lin, H., Feng, W.-c.: Performance characterization of data-intensive
kernels on amd fusion architectures. Computer Science - Research and Develop-
ment 28, 175–184 (2013)
[4] Matoussi, O.: Noc performance model for efficient network latency estimation.
2021 Design, Automation & Test in Europe Conference & Exhibition (DATE),
994–999 (2021)
[5] Ma, S., Wang, Z., Liu, Z., Jerger, N.D.E.: Leaving one slot empty: Flit bubble
flow control for torus cache-coherent nocs. IEEE Transactions on Computers 64,
763–777 (2015)
[6] Marculescu, R., Ogras, Ü.Y., Peh, L.-S., Jerger, N.D.E., Hoskote, Y.V.: Outstand-
ing research problems in noc design: System, microarchitecture, and circuit per-
spectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 28, 3–21 (2009)
[7] Cong, J., Gill, M., Hao, Y., Reinman, G.D., Yuan, B.: On-chip interconnection
network for accelerator-rich architectures. 2015 52nd ACM/EDAC/IEEE Design
Automation Conference (DAC), 1–6 (2015)
[8] Zheng, H., Wang, K., Louri, A.: A versatile and flexible chiplet-based system
design for heterogeneous manycore architectures. 2020 57th ACM/IEEE Design
Automation Conference (DAC), 1–6 (2020)
[9] Cheng, X., Zhao, Y., Zhao, H., Xie, Y.: Packet pump: Overcoming network bottle-
neck in on-chip interconnects for gpgpus*. 2018 55th ACM/ESDA/IEEE Design
Automation Conference (DAC), 1–6 (2018)
[10] Kim, H.K., Kim, J., Seo, W., Cho, Y.-G., Ryu, S.: Providing cost-effective on-
chip network bandwidth in gpgpus. 2012 IEEE 30th International Conference on
Computer Design (ICCD), 407–412 (2012)
[11] Lee, Y.S., Kim, Y.W., Han, T.H.: Mrcn: Throughput-oriented multicast routing
for customized network-on-chips. IEEE Transactions on Parallel and Distributed
Systems 34, 163–179 (2023)
[12] Khodadadi, E., Barekatain, B., Yaghoubi, E., Mogharrabi-Rad, Z.: Ft-pdc: an
enhanced hybrid congestion-aware fault-tolerant routing technique based on path
diversity for 3d noc. The Journal of Supercomputing 78, 523–558 (2021)
[13] Alaei, M., Yazdanpanah, F.: A high reliable multicast routing algorithm for 2d
and 3d mesh-based nocs with fuzzy-based load control. Journal of Control (2021)
[14] Salamat, R., Khayambashi, M., Ebrahimi, M., Bagherzadeh, N.: Lead: An adap-
tive 3d-noc routing algorithm with queuing-theory based analytical verification.
IEEE Transactions on Computers 67, 1153–1166 (2018)
[15] Salamat, R.: Design and evaluation of high-performance and fault-tolerant routing
algorithms for 3d-nocs. (2018)
[16] Charles, S., Mishra, P.: Lightweight and trust-aware routing in noc-based socs.
2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 160–167
(2020)
[17] Lee, J., Li, S., Kim, H., Yalamanchili, S.: Design space exploration of on-chip ring
interconnection for a cpu-gpu heterogeneous architecture. J. Parallel Distributed
Comput. 73, 1525–1538 (2013)
[18] Lee, J., Li, S., Kim, H., Yalamanchili, S.: Adaptive virtual channel partitioning
for network-on-chip in heterogeneous architectures. ACM Transactions on Design
Automation of Electronic Systems (TODAES) 18, 1–28 (2013)
[19] Cui, Y.-W., Prabhakar, S.M., Zhao, H., Mohanty, S.P., Fang, J.: A low-cost
conflict-free noc architecture for heterogeneous multicore systems. 2020 IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), 300–305 (2020)
[21] Yin, J., Zhou, P., Sapatnekar, S.S., Zhai, A.: Energy-efficient time-division multi-
plexed hybrid-switched noc for heterogeneous multicore systems. 2014 IEEE 28th
International Parallel and Distributed Processing Symposium, 293–303 (2014)
[22] Zhan, J., Kayiran, O., Loh, G.H., Das, C.R., Xie, Y.: Oscar: Orchestrating stt-
ram cache traffic for heterogeneous cpu-gpu architectures. 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–13
(2016)
[23] Fang, J., Leng, Z., Liu, S., Yao, Z., Sui, X.: Exploring heterogeneous noc design
space in heterogeneous gpu-cpu architectures. Journal of Computer Science and
Technology 30, 74–83 (2015)
[24] Ma, S., Lu, H., Huang, L., Shen, L., Guo, Y., Wang, Z., Xue, W.: Adaptive vc
partitioning for nocs in gpgpus. 2018 IEEE International Symposium on Circuits
and Systems (ISCAS), 1–5 (2018)
[25] Kim, H., Lee, J., Lakshminarayana, N.B., Sim, J., Pho, T.: Macsim: A cpu-gpu
heterogeneous simulation framework user guide (2012)
[26] Henning, J.L.: Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit.
News 34, 1–17 (2006)
[27] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.-H., Skadron,
K.: Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE
International Symposium on Workload Characterization (IISWC), 44–54 (2009)
[28] Stratton, J.A., Rodrigues, C.I., Sung, I.-J., Obeid, N., Chang, L.-W., Anssari,
N., Liu, G., Hwu, W.-m.W.: Parboil: A revised benchmark suite for scientific and
commercial throughput computing. (2012)
[29] Patil, H., Cohn, R.S., Charney, M.J., Kapoor, R., Sun, A., Karunanidhi, A.:
Pinpointing representative portions of large Intel Itanium programs with dynamic
instrumentation. 37th International Symposium on Microarchitecture
(MICRO-37'04), 81–92 (2004)
[30] Farooqui, N., Kerr, A., Diamos, G.F., Yalamanchili, S., Schwan, K.: A framework
for dynamically instrumenting gpu compute applications within gpu ocelot. In:
GPGPU-4 (2011)
[31] Chiu, G.-M.: The odd-even turn model for adaptive routing. IEEE Trans. Parallel
Distributed Syst. 11, 729–738 (2000)
[32] An, J., You, H., Sun, J., Cao, J.: Fault tolerant xy-yx routing algorithm
supporting backtracking strategy for noc. 2021 IEEE Intl Conf on Parallel
& Distributed Processing with Applications, Big Data & Cloud Computing,
Sustainable Computing & Communications, Social Computing & Networking
(ISPA/BDCloud/SocialCom/SustainCom), 632–635 (2021)