AMD’s CDNA 3 Compute Architecture

Chips and Cheese
December 17, 2023, by clamchowder and Cheese
AMD has a long history of vying for GPU compute market share. Ever since Nvidia got first dibs with their Tesla architecture, AMD has been playing catch up. Terascale 3 moved from VLIW5 to VLIW4 to improve execution unit utilization in compute workloads. GCN replaced Terascale and emphasized consistent performance for both GPGPU and graphics applications. Then, AMD diverged their GPU architecture development into separate CDNA and RDNA lines, specialized for compute and graphics respectively.
CDNA 2 finally brought AMD notable success. MI250X and MI210 GPUs won several supercomputer contracts, including ORNL’s Frontier, which holds first place on November 2023’s TOP500 list. But while CDNA 2 delivered solid and cost efficient FP64 compute, H100 had better AI performance and offered a larger unified GPU.

CDNA 3 looks to close those gaps by bringing forward everything AMD has to offer. The company’s experience in advanced packaging technology is on full show, with MI300X getting a sophisticated chiplet setup. Together with Infinity Fabric components, advanced packaging lets MI300X scale to compete with Nvidia’s largest GPUs. On the memory side, Infinity Cache from the RDNA line gets pulled into the CDNA world to mitigate bandwidth issues. But that doesn’t mean MI300X is light on memory bandwidth. It still gets a massive HBM setup, giving it the best of both worlds. Finally, CDNA 3’s compute architecture gets significant generational improvements to boost throughput and utilization.

GPU Layout

AMD has a tradition of using chiplets to cheaply scale core counts in their Ryzen and Epyc CPUs. MI300X uses a similar strategy at a high level, with compute split off onto Accelerator Complex Dies, or XCDs. XCDs are analogous to CDNA 2 or RDNA 3’s Graphics Compute Dies (GCDs) or Ryzen’s Core Complex Dies (CCDs). AMD likely changed the naming because CDNA products lack the dedicated graphics hardware present in the RDNA line.
Each XCD contains a set of cores and a shared cache. Specifically, every XCD physically has 40 CDNA 3 Compute Units, with 38 of those enabled per XCD on MI300X. A 4 MB L2 cache sits on the XCD as well, and serves all of the die’s CUs. MI300X has eight XCDs, giving it 304 total Compute Units.

That’s a large increase over the MI250X’s 220 CUs. Even better, MI300X can expose all of those CUs as a single GPU. On MI250X, a programmer would have to manually split up work across the two GPUs because each has a separate pool of memory.

Nvidia’s H100 consists of 132 Streaming Multiprocessors (SMs) and also presents them to programmers as a big unified GPU. H100 takes a conventional approach by implementing all of that compute on a large monolithic die. Even with everything on the same die, H100 is too large to give all of its SMs equal access to cache. So, H100 splits the L2 into two instances. A single SM can use all 50 MB of L2, but accessing more than 25 MB will incur a performance penalty.

Still, Nvidia’s strategy makes more efficient use of cache capacity than MI300X’s. An MI300X XCD doesn’t use L2 capacity on other XCDs for caching, just as CCDs on Epyc/Ryzen don’t allocate into each other’s L3 caches.

Intel’s Ponte Vecchio (PVC) compute GPUs make for a very interesting comparison. PVC places its basic compute building blocks in dies called Compute Tiles, which are roughly analogous to CDNA 3’s XCDs. PVC’s Base Tile likewise serves a similar function to CDNA 3’s IO dies. Both contain a large last level cache and HBM memory controllers. Like MI300X, a Ponte Vecchio card can be exposed as a single GPU with a unified memory pool.

However, there are important differences. Ponte Vecchio’s Compute Tiles are smaller, with only eight Xe Cores compared to 38 Compute Units on a CDNA 3 XCD. Instead of using a Compute Tile wide cache, Intel uses larger L1 caches to reduce cross-die traffic demands. Using a two-stack Ponte Vecchio part as a unified GPU presents challenges too. The EMIB bridge between the two stacks only offers 230 GB/s of bandwidth, which isn’t enough to fully utilize HBM bandwidth if accesses are striped across all memory controllers. To address this, Intel has APIs that let programs work with the GPU in a NUMA configuration.

In terms of physical construction, PVC and CDNA 3’s designs face different challenges. CDNA 3’s ability to present a unified memory pool with HBM requires high bandwidth between the IO dies. PVC gets by with a relatively low bandwidth EMIB link. But PVC’s design gets complicated because it uses four die types with different process nodes and foundries. AMD only uses two die types in MI300X, and both nodes (6 nm and 5 nm) are from TSMC.

Tackling the Bandwidth Problem

Compute has been outpacing memory for decades. Like CPUs, GPUs have countered this with increasingly sophisticated caching strategies. CDNA 2 used a conventional two-level cache hierarchy with an 8 MB L2, relying on HBM2e to keep the execution units fed. But even with HBM2e, MI250X was more bandwidth starved than Nvidia’s H100. If AMD simply added more compute, bandwidth starvation could become a serious issue. So, AMD took a leaf out of RDNA 2’s book and added an “Infinity Cache”.

Much like on the consumer RDNA GPUs, MI300’s Infinity Cache is what the technical documentation calls a Memory Attached Last Level (MALL) cache, which is a fancy way of saying the last level cache is a memory side cache. Unlike the L1 and L2 caches that sit closer to the Compute Units, the Infinity Cache is attached to the memory controllers. All memory traffic passes through the Infinity Cache regardless of which block it’s coming from. That includes IO traffic, so communication between peer GPUs can benefit from Infinity Cache bandwidth. Because the Infinity Cache always has the most up to date view of DRAM contents, it doesn’t have to handle snoops or other cache maintenance operations.

From AMD’s presentation on their RDNA architecture. L2 slices may be associated with memory controllers, but the L2 is not a memory side cache because many agents can write to DRAM without going through L2.

But because a memory side cache is farther away from compute, it generally suffers from higher latency. Therefore, AMD has multi-megabyte L2 caches on both CDNA 3 and RDNA 2 to insulate compute from the lower performance of a memory side cache.

Like RDNA 2, CDNA 3’s Infinity Cache is 16-way set associative. However, CDNA 3’s implementation is more optimized for bandwidth than capacity. It’s composed of 128 slices, each with 2 MB of capacity and 64 bytes per cycle of read bandwidth. All of the slices together can deliver 8192 bytes per cycle, which is good for 17.2 TB/s at 2.1 GHz.
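That figure is easy to sanity check: aggregate bandwidth is just slice count times per-slice width times clock speed. Here’s a minimal C++ sketch using the numbers above (the helper function is our own, and the same arithmetic covers the per-XCD L2 discussed later):

#include <cstdio>

// Aggregate bandwidth = slices * bytes per cycle per slice * clock.
// Bytes/cycle * GHz gives GB/s; divide by 1000 for TB/s.
static double aggregate_bw_tbps(int slices, int bytes_per_cycle, double ghz) {
    return slices * bytes_per_cycle * ghz / 1000.0;
}

int main() {
    // MI300X Infinity Cache: 128 slices * 64B/cycle at 2.1 GHz
    printf("Infinity Cache: %.1f TB/s\n", aggregate_bw_tbps(128, 64, 2.1)); // ~17.2
    // Same arithmetic for one XCD's L2 (covered later): 16 slices * 128B/cycle
    printf("XCD L2: %.1f TB/s\n", aggregate_bw_tbps(16, 128, 2.1)); // ~4.3
    return 0;
}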


For comparison, RDNA 2’s 128 MB Infinity Cache can provide 1024 bytes per cycle across all slices, giving it 2.5 TB/s of theoretical bandwidth at 2.5 GHz. Die shots suggest each RDNA 2 Infinity Cache slice has 4 MB of capacity and provides 32 bytes per cycle. RDNA 2 therefore uses bigger slices, uses fewer of them, and gets less bandwidth from each.

MI300X’s focus on bandwidth means workloads with lower compute density can still enjoy decent performance if they can get enough Infinity Cache hits. That should make CDNA 3’s execution units easier to feed, even though the main memory bandwidth to compute ratio hasn’t changed much and remains behind Nvidia’s.

MI250X figures are for a single GCD

If we construct a roofline model for MI300X using the Infinity Cache’s theoretical bandwidth, we can achieve full FP64 throughput with 4.75 FLOPs per byte loaded. It’s a massive improvement over DRAM, which would require 14.6 to 15 FLOPs per byte loaded.
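Those ridge points are simple to reproduce: divide peak throughput by bandwidth. In the sketch below, the ~81.7 TFLOPS FP64 vector peak is our assumption, implied by 4.75 FLOPs per byte times 17.2 TB/s:

#include <cstdio>

// Roofline ridge point: the arithmetic intensity (FLOPs per byte) needed
// to become compute bound rather than bandwidth bound.
static double ridge_point(double peak_tflops, double bw_tbps) {
    return peak_tflops / bw_tbps;
}

int main() {
    const double fp64_peak = 81.7; // TFLOPS, assumption implied by the article's figures
    printf("vs Infinity Cache: %.2f FLOPs/byte\n", ridge_point(fp64_peak, 17.2)); // ~4.75
    // HBM at 5.3 TB/s demands roughly 3.2x higher arithmetic intensity
    printf("cache vs DRAM intensity ratio: %.2fx\n", 17.2 / 5.3);
    return 0;
}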

Possible Challenges with Cross-Die Bandwidth

MI300X’s Infinity Fabric spans four IO dies, each of which connects to two HBM stacks and their associated cache partitions. However, when MI300X operates as a single logical GPU with a unified memory pool, the die-to-die connections may not have enough bandwidth to make full Infinity Cache bandwidth achievable. If memory accesses are striped evenly across the memory controllers (and thus cache partitions), as is typical for most GPU designs, die-to-die bandwidth becomes the limiting factor.

First, let’s focus on a single IO die partition. It has 2.7 TB/s of ingress bandwidth along two edges adjacent to other IO dies. Its two XCDs can consume 4.2 TB/s of Infinity Cache bandwidth. If L2 miss requests are evenly striped across the dies, 3/4 of that bandwidth, or 3.15 TB/s, must come from peer dies. Since 3.15 TB/s is greater than 2.7 TB/s, cross-die bandwidth will limit achievable cache bandwidth.
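A back-of-the-envelope model makes the limit concrete. With accesses striped evenly across N IO dies, only 1/N of a die’s demand is served locally, and the remainder is capped by link ingress (our own simplification, ignoring multi-hop effects):

#include <algorithm>
#include <cstdio>

// If L2 misses stripe evenly across n_dies, a die demanding `demand` TB/s of
// cache bandwidth serves 1/n_dies locally; the rest is capped by link ingress.
static double achievable_bw(double demand, double ingress, int n_dies) {
    double local = demand / n_dies;
    double from_peers = std::min(demand - local, ingress);
    return local + from_peers;
}

int main() {
    // One MI300X IO die: 4.2 TB/s of XCD demand, 2.7 TB/s ingress, 4 IO dies
    printf("achievable: %.2f TB/s of 4.2\n", achievable_bw(4.2, 2.7, 4)); // ~3.75
    return 0;
}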

We can add the die in the opposite corner without changing anything, because all of its required die-to-die bandwidth goes in the opposite direction. MI300X has bidirectional die-to-die links.


If all dies demand maximum Infinity Cache bandwidth in a unified configuration, things get more complex. Extra cross-die bandwidth is consumed because transfers between dies in opposite corners require two hops, and that’ll cut into the ingress bandwidth available for each die.

While MI300X was engineered to act like one big GPU, splitting it into multiple NUMA domains could give higher combined Infinity Cache bandwidth. It’s possible that AMD will offer an API that transparently splits work across the different IO dies. In practice, high L2 hitrates would minimize the likelihood of these bandwidth bottlenecks. And in cases where Infinity Cache hitrates are low, MI300X’s die-to-die links offer ample bandwidth to handle HBM traffic.

Cross-XCD Coherency

Even though the Infinity Cache doesn’t have to worry about coherency, the L2 caches do. Ordinary GPU memory accesses follow a relaxed coherency model, but programmers can use atomics to enforce ordering between threads. Memory accesses on AMD GPUs can also be marked with a GLC (Global Level Coherent) bit. Those mechanisms still have to work if AMD wants to expose MI300X as a single big GPU, rather than as a multi-GPU configuration like MI250X.


Snippet of RDNA 2 code from Folding at Home, showing use of global memory atomics.
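For a flavor of what such code looks like, here is a minimal HIP sketch of our own, a hypothetical reduction rather than the Folding at Home kernel pictured above. The global memory atomic must stay correct even when contending wavefronts land on different XCDs:

#include <hip/hip_runtime.h>

// Hypothetical reduction kernel: every thread accumulates into one global
// counter. The atomicAdd must be ordered across all of the GPU's CUs, no
// matter which XCD each wavefront runs on.
__global__ void sumAll(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, in[i]); // device-scope atomic on global memory
    }
}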

On prior AMD GPUs, atomics and coherent accesses were handled at L2. Loads with the GLC bit set would bypass L1 caches, and thus get the most up-to-date copy of data from L2. That doesn’t work on MI300X, because the most up-to-date copy of a cacheline could be in another XCD’s L2 cache. AMD could make coherent accesses bypass L2, but that would lower performance. That may have been acceptable for a gaming GPU, where coherent accesses aren’t too important. But AMD wants MI300X to perform well in compute workloads, and needs MI300A (the APU variant) to efficiently share data between the CPU and GPU. That’s where Infinity Fabric comes in.

CM = Coherent Master. CS = Coherent Slave

Like Infinity Fabric on Ryzen, CDNA 3 has Coherent Masters (CMs) where the XCDs connect to the IO dies. Coherent Slaves (CS) sit at each memory controller alongside Infinity Cache (IC) slices. We can infer how these work from Ryzen documentation, which shows that Coherent Slaves have a probe filter and hardware for handling atomic transactions. MI300X likely has a similar CS implementation.

From AMD’s Zen PPR, showing error reporting available at the Coherent Slave (CS).

If a coherent write shows up at the CS, it has to ensure that any thread doing a coherent read will observe that write, regardless of where that thread is running on the GPU. That means any XCD with the line cached will have to reload it from Infinity Cache to get the most up to date data. Naively, the CS would have to probe L2 caches across all XCDs, because any of them could have the corresponding data cached. The probe filter avoids this by tracking which XCDs actually have the line cached, eliminating unnecessary probe traffic. CDNA 3’s whitepaper says the snoop filter (another name for a probe filter) is large enough to cover multiple XCD L2 caches. I certainly believe that, because MI300X only has 32 MB of L2 across all eight XCDs. Even consumer Ryzen parts can have more CCD-private cache for the probe filter to cover.
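To make the mechanism concrete, here’s a toy C++ sketch of a probe filter. This is our own simplification, not AMD’s design: a bitmask per tracked line records which XCDs might hold it, so a coherent write only probes actual sharers:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy probe filter: one sharer bitmask per tracked line address (8 XCDs -> 8 bits).
struct ProbeFilter {
    std::unordered_map<uint64_t, uint8_t> sharers;

    // Record that an XCD filled this line into its L2.
    void onFill(uint64_t line, int xcd) { sharers[line] |= uint8_t(1u << xcd); }

    // A coherent write probes only the XCDs marked as sharers, not all eight.
    std::vector<int> xcdsToProbe(uint64_t line, int writerXcd) {
        std::vector<int> out;
        uint8_t mask = sharers.count(line) ? sharers[line] : uint8_t(0);
        for (int x = 0; x < 8; x++)
            if ((mask & (1u << x)) && x != writerXcd) out.push_back(x);
        sharers[line] = uint8_t(1u << writerXcd); // writer becomes the sole sharer
        return out;
    }
};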

Thanks to CPU-like Infinity Fabric components like the CS and CM, an XCD can have a private write-back L2 cache capable of handling intra-die coherent accesses without going across the IO die fabric. AMD could have gone for a naive solution where coherent operations and atomics go straight to the Infinity Cache, bypassing L2. Such a solution would save engineering effort and create a simpler design, at the cost of lower performance for coherent operations. Evidently, AMD thought optimizing atomics and coherent accesses was important enough to go the extra mile.

“To ensure coherence of local memory writes of CUs in different agents a buffer_wbl2 sc1 is required. It will writeback dirty L2 cache lines.

To ensure coherence of local memory reads of CUs in different agents a buffer_inv sc0 sc1 is required. It will invalidate non-local L2 cache lines if configured to have multiple L2 caches.”

LLVM Documentation for the GFX942 Target

However, CDNA 3 within the XCD still works a lot like prior GPUs. Evidently, normal memory writes will not automatically invalidate written lines in peer caches as they would on CPUs. Instead, code must explicitly tell the L2 to write back dirty lines, and have peer L2 caches invalidate non-local lines.
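Programmers rarely write those instructions by hand. The compiler emits them when an access carries a wide enough memory scope. Here’s a minimal HIP sketch with a hypothetical producer/consumer flag; we’re assuming HIP’s __hip_atomic_* builtins, which take an ordering and a scope:

#include <hip/hip_runtime.h>

// Producer publishes data for a consumer on another agent (e.g. the CPU on MI300A).
__global__ void produce(int* data, int* flag) {
    data[0] = 42;
    // System-scope release: the compiler must make the write visible beyond
    // this XCD's L2, e.g. via buffer_wbl2 sc1 on gfx942.
    __hip_atomic_store(flag, 1, __ATOMIC_RELEASE, __HIP_MEMORY_SCOPE_SYSTEM);
}

__global__ void consume(const int* data, int* flag, int* out) {
    // System-scope acquire: the compiler must discard potentially stale lines,
    // e.g. via buffer_inv sc0 sc1, before reading data written elsewhere.
    while (__hip_atomic_load(flag, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM) == 0) { }
    out[0] = data[0];
}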

L2 Cache
Closer to the Compute Units, each MI300X XCD packs a 4 MB
L2 cache. The L2 is a more traditional GPU cache, and is built
from 16 slices. Each 256 KB slice can provide 128 bytes per
cycle of bandwidth. At 2.1 GHz, that’s good for 4.3 TB/s. As
the last level of cache on the same die as the Compute Units,
the L2 plays an important role in acting as a backstop for L1
misses.


Compared to H100 and MI250X, MI300X has a higher L2 bandwidth to compute ratio. Because each XCD comes with its own L2, L2 bandwidth naturally scales as a CDNA 3 product gets more XCDs. In other words, MI300X’s L2 arrangement avoids the problem of hooking a single cache up to a lot of Compute Units while maintaining a ton of bandwidth.

PVC’s L2 is a clear contrast. As Intel adds more Compute Tiles, the Base Tile’s shared L2 faces increasing bandwidth demands. From a cache design standpoint, PVC’s configuration is simpler because the L2 acts as a single point of coherency and a backstop for L1 misses. But it can’t offer as much bandwidth as MI300X’s L2. MI300X also likely enjoys better L2 latency, making it easier for applications to utilize cache bandwidth.

L1 Cache
CDNA 3’s focus on high cache bandwidth continues to the L1.
In a move that matches RDNA, CDNA 3 sees its L1 throughput
increased from 64 to 128 bytes per cycle. CDNA 2 increased
per-CU vector throughput to 4096 bits per cycle compared to
2048 in GCN, so CDNA 3’s doubled L1 throughput helps
maintain the same compute to L1 bandwidth ratio as GCN.


Besides higher bandwidth, CDNA 3 increases L1 capacity from 16 to 32 KB. It’s a move that again mirrors developments in the RDNA line, where RDNA 3 received a similar size boost for its first level cache. Higher hitrates from the larger cache would lower average memory access latency, improving execution unit utilization. Transferring data from L2 and beyond costs power, so higher hitrates can help power efficiency too.

While CDNA 3 improves first level caching, Ponte Vecchio is still the champion in that category. Each Xe Core in PVC can deliver 512 bytes per cycle, giving Intel a very high L1 bandwidth to compute ratio. The L1 is large as well at 512 KB. Memory bound kernels that fit in L1 will do very well on Intel’s architecture. However, Ponte Vecchio lacks a mid-level cache at the Compute Tile level, and could face a harsh performance cliff as data spills out of L1.

Scheduling and Execution Units

A complex chiplet setup and modified cache hierarchy let AMD present MI300X as a single GPU, fixing one of MI250X’s biggest weaknesses. But AMD didn’t stop there. They also made iterative improvements to the core Compute Unit architecture, addressing CDNA 2’s difficulty with utilizing its FP32 units.

From the CDNA 3 whitepaper

When CDNA 2 shifted to handling FP64 natively, AMD provided double rate FP32 via packed execution. The compiler would have to pack two FP32 values into adjacent registers and perform the same instruction on both. Often, the compiler struggled to pull this off unless programmers explicitly used vectors.
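To illustrate what “explicitly used vectors” means, here’s a hedged HIP sketch of our own. The float2 version hands the compiler two adjacent values it can keep in consecutive registers and map onto packed instructions like v_pk_mul_f32, while the scalar version leaves the packing problem entirely to the compiler:

#include <hip/hip_runtime.h>

// Scalar form: packing two FP32 ops per instruction is left to the compiler,
// which often failed to happen on CDNA 2.
__global__ void scaleScalar(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Explicit float2 form: adjacent values sit together, making it straightforward
// to emit packed FP32 math. n counts float2 elements here.
__global__ void scaleVec2(float2* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 v = x[i];
        v.x *= a;
        v.y *= a;
        x[i] = v;
    }
}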

CDNA 3 gets around this with a more flexible dual issue mechanism. Most likely, this is an extension of GCN’s multi-issue capability rather than RDNA 3’s VOPD/wave64 method. Each cycle, the CU scheduler selects one of the four SIMDs and checks whether any of its threads are ready to execute. If multiple threads are ready, GCN could select up to five of them to send to execution units. Of course, a GCN SIMD only has a single 16-wide vector ALU, so GCN would have to select threads with different instruction types ready in order to multi-issue. For example, a scalar ALU instruction can issue alongside a vector ALU one.

An alternative approach would be to take advantage of wave64’s wider width and let a thread complete two vector instructions over four cycles. However, doing so would break GCN’s model of handling VALU instructions in multiples of four clock cycles. CDNA 3 is still more closely related to GCN than RDNA is, and reusing GCN’s multi-issue strategy is a sensible move. AMD also could have used RDNA 3’s VOPD mechanism, where a special instruction format can contain two operations. While that method could increase per-thread performance, relying on the compiler to find dual issue pairs could be hit or miss.

From an old AMD presentation

Instead of relying on the compiler, CDNA 3’s dual issue approach likely pushes responsibility onto the programmer to expose more thread level parallelism via larger dispatch sizes. If a SIMD has more threads in flight, it’ll have a better chance of finding two threads with FP32 instructions ready to execute. At minimum, a SIMD will need two active threads to achieve full FP32 throughput. In practice, CDNA 3 will need much higher occupancy to achieve good FP32 utilization. GPUs use in-order execution, so individual threads will often be blocked by memory or execution latency. Keeping one set of execution units fed can be difficult even at full occupancy.


Therefore, AMD has dramatically increased the number of threads each CDNA 3 SIMD can track, from 8 to 24. If a programmer can take advantage of this, CDNA 3 will be better positioned to multi-issue. But this can be difficult. AMD did not mention an increase in vector register file capacity, which often limits how many threads a SIMD can have in flight. The vector register file can hold state for more threads if each thread uses fewer registers, so CDNA 3’s multi-issue capability may work best for simple kernels with few live variables.
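Programmers do have a lever here. Capping threads per block with __launch_bounds__ (standard in HIP and CUDA) lets the compiler budget registers for higher occupancy; whether it helps depends on the kernel. A hedged sketch:

#include <hip/hip_runtime.h>

// __launch_bounds__ caps threads per block, letting the compiler allocate
// registers with occupancy in mind. More resident wavefronts per SIMD means
// more candidates for CDNA 3's dual issue scheduler to pick from.
__global__ void __launch_bounds__(256) saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}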

Register file bandwidth presents another challenge for dual issue. CDNA 2’s packed FP32 execution didn’t require extra reads from the vector register file, because it took advantage of the wider register file ports needed to deliver 64-bit values. But separate instructions can reference different registers and require more reads from the register file. Adding more register file ports would be expensive, so CDNA 3 “generationally improves the source caching to provide better re-use and bandwidth amplification so that each vector register read can support more downstream vector or matrix operations” [1]. Most likely, AMD is using a larger register cache to mitigate port conflicts and keep the execution units fed.

Matrix Operations

Matrix multiplication has become increasingly important as machine learning picks up. Nvidia invested heavily in this area, adding matrix multiplication units (tensor cores) to their Volta and Turing architectures years ago. AMD’s CDNA architecture added matrix multiply support, but contemporary Nvidia architectures invested more heavily in matrix multiplication throughput. This especially applies to lower precision data types like FP16, which are often used in AI.

GPU                                     Matrix FP16 FMAs/Clk    Rate Relative to Packed FP16
AMD MI100 (CDNA) Compute Unit           512                     4x
AMD MI250X (CDNA 2) Compute Unit        512                     4x
AMD MI300X (CDNA 3) Compute Unit        1024                    8x
Nvidia V100 Streaming Multiprocessor    512                     4x
Nvidia A100 Streaming Multiprocessor    1024                    4x
Nvidia H100 Streaming Multiprocessor    2048                    8x

MI300X plays catch up by doubling per-CU matrix throughput compared to prior CDNA generations. On top of that, MI300X’s chiplet design allows a massive number of CUs. But Nvidia’s higher per-SM matrix performance still makes H100 a force to be reckoned with. Therefore, CDNA 3 continues AMD’s trend of hitting Nvidia hard on vector FP64 performance while still delivering strong AI performance in its own right.
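Scaling the table’s per-CU figure by CU count and clock shows where MI300X’s headline number comes from; a quick sketch, assuming the 2.1 GHz peak clock used earlier:

#include <cstdio>

// Peak matrix FP16 = FMAs/clk per CU * 2 FLOPs per FMA * CU count * clock (GHz).
int main() {
    double tflops = 1024 * 2.0 * 304 * 2.1 / 1000.0;
    printf("MI300X dense FP16: %.0f TFLOPS\n", tflops); // ~1307
    return 0;
}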


Instruction Cache
Besides handling memory accesses requested by
instructions, a Compute Unit has to fetch the instructions
themselves from memory. GPUs traditionally had an easier
time with instruction delivery because GPU code tends to be
simple and not occupy a lot of memory. In the DirectX 9 era,
Shader Model 3.0 even imposed limits on code size. As GPUs
evolved to take on compute, AMD rolled out their GCN
architecture with 32 KB instruction caches. Today, CDNA 2
and RDNA GPUs continue to use 32 KB instruction caches.

CDNA 3 increases instruction cache capacity to 64 KB. Associativity doubles too, from 4-way to 8-way. That means higher instruction cache hitrates for CDNA 3 with bigger, more complex kernels. I suspect AMD is targeting CPU code naively ported to GPUs. Complex CPU code can be punishing on GPUs, since they can’t hide instruction cache miss latency with long distance instruction prefetching and accurate branch prediction. Higher instruction cache capacity helps contain larger kernels, while increased associativity helps avoid conflict misses.

Like CDNA 2, each CDNA 3 instruction cache instance services two Compute Units. GPU kernels are usually launched with large enough work sizes to fill many Compute Units, so sharing the instruction cache is a good way to efficiently use SRAM storage. I suspect AMD didn’t share the cache across even more Compute Units because a single cache instance may struggle to satisfy instruction bandwidth demands.

Final Words
CDNA 3’s whitepaper says that “the greatest generational
changes in the AMD CDNA 3 architecture lie in the memory
hierarchy” and I would have to agree. While AMD improved
the Compute Unit’s low precision math capabilities
compared to CDNA 2, the real improvement was the addition
of the Infinity Cache.

MI250X’s primary issue was that it wasn’t really one GPU. It was two GPUs sharing the same package, with only 200 GB/s per direction between the GCDs. In AMD’s assessment, 200 GB/s per direction was not enough for MI250X to show up as one GPU, which is why AMD significantly increased die-to-die bandwidth in MI300.

For this image, I am considering North-South as the vertical axis and East-West as the horizontal axis.

AMD increased total East-West bandwidth to 2.4 TB/s per direction, a 12-fold increase over MI250X. Total North-South bandwidth is even higher at 3.0 TB/s per direction. With these massive bandwidth increases, AMD was able to make MI300 appear as one large, unified accelerator instead of two separate accelerators like MI250X.

4.0 TB/s of total ingress bandwidth for one die may not seem like enough if both XCDs need all available memory bandwidth. However, both XCDs combined can only consume up to 4.2 TB/s of bandwidth from the IO die, so realistically the 4.0 TB/s of ingress bandwidth is a non-issue. What the 4.0 TB/s ingress maximum does mean is that a single IO die can’t take advantage of all 5.3 TB/s of memory bandwidth.

This is similar to desktop Ryzen 7000 parts, where one CCD can’t take full advantage of DDR5 bandwidth due to Infinity Fabric limits. However, this is likely a non-issue on MI300X, because bandwidth demands will be highest with all dies in play. In that case, each die will consume about 1.3 TB/s of bandwidth, and getting 3/4 of that over cross-die links won’t be a problem.

But MI300 isn’t just a GPGPU part; there’s an APU variant as well, which in my opinion is the more interesting of the two MI300 products. AMD’s first ever APU, Llano, was released in 2011 and paired AMD’s K10.5 CPU cores with a Terascale 3 GPU. Fast forward to 2023, and for their first “big iron” APU, the MI300A, AMD pairs six CDNA 3 XCDs with 24 Zen 4 cores, all while reusing the same base die. This lets the CPU and GPU share the same memory address space, removing the need to copy data over an external bus to keep the CPU and GPU coherent with each other.

We look forward to what AMD can do with future “big iron” APUs as well as their future GPGPU lineup. Maybe they’ll have specialized CCDs with wider vector units, or maybe they’ll have networking on the base die that can directly connect to the xGMI switches Broadcom has said it is making. Regardless of what future Instinct products look like, we’re excited about those products and look forward to testing the MI300 series.

We would like to thank AMD for inviting Chips and Cheese to the MI300 launch event. We were able to ask a lot of questions and gain some extra information, without which this article would have been much shorter.

If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.

References
1. CDNA 3 Whitepaper
2. CDNA 2 Whitepaper
3. CDNA Whitepaper
4. Volta Whitepaper
5. Nvidia A100 Whitepaper
6. Nvidia H100 Whitepaper
7. Intel Data Center GPU Max Series Technical Overview

Authors

clamchowder
Cheese
