Si Programming Guide v2
Trademarks
AMD, the AMD Arrow logo, Athlon, and combinations thereof, ATI, ATI logo, Radeon, and Crossfire are trademarks of Advanced
Micro Devices, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective
companies.
Disclaimer
The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no
representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the
right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied,
arising by estoppel, or otherwise, to any intellectual property rights are granted by this publication. Except as set forth in AMD's
Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty,
relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or
infringement of any intellectual property right. AMD's products are not designed, intended, authorized or warranted for use as
components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or
in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe
property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time
without notice.
1. INTRODUCTION
2. PM4
1. Introduction
This guide is targeted at those who are familiar with GPU programming and the Radeon programming model. It is
recommended that you read the r6xx/r7xx and evergreen/NI programming guides and ISA documents first as this
guide builds on the information in those documents.
Shader Core Next. The hardware shader core has been completely redesigned. Highlights:
o New, non-VLIW, clause-less instruction set architecture.
o Distributed sequencer per compute unit.
o Scalar ALU per compute unit.
o Support for unlimited resources. Resource constants now read from memory.
o Fetch shader stage removed.
o Compute stage separated from LS.
JIT Constant Updates. All resource descriptors (sometimes called “fetch constants” on previous
hardware) are read from memory instead of registers. In order to improve performance, the CP block adds
a new constant engine. The constant engine (CE) runs in parallel to the 3D engine, and allows constants to
be written to memory ahead of the main command stream.
Unified Cache. Most shader memory – vertex buffers, textures, constant buffers, UAVs, etc. – is
read/written through a shared cache. Draw indices and the CB and DB blocks do not use the shared cache.
UAVs Written through Texture Pipe. UAV writes are now done through the texture unit rather than the
CB/DB RAT capabilities of prior hardware. Reads and writes both share the same cache, and the return
buffer is no longer necessary.
Stateless Compute. All render state for a given compute dispatch will be passed through the pipeline
instead of reading context registers based on a state set ID. Therefore, compute dispatches do not use a
hardware context as on previous hardware.
Color Export Packing Removed From SX. On prior hardware, the PS would always export 4 32-bit
values and the SX had fixed function hardware to pack the data into fewer bits depending on the RT
format. On SI, this packing must be done by the PS itself.
Tile Index. In order to reduce complexity, tiling parameters are now specified as an index into a 32-entry
global tiling table initialized at boot.
Depth Bounds Test. The DB now supports functionality equivalent to the OpenGL
EXT_depth_bounds_test extension.
[Figure: compute unit block diagram. Recoverable labels: PC and inst buffer for wave1, wave2, wave3; Arbiter;
Scalar Instr; Vector Instr; SGPRs; indexed regs; Vector GPRs; LDS; vec-mem TA/TD; L1 D$; Instruction r/o; SX.]
On SI, the vector ALU processes waves of 64 threads, with each thread executing only 1 scalar op. This
architecture simplifies scheduling for the compiler and is more efficient in the common case where full instruction
vectors cannot be scheduled.
The vector unit mostly works with a set of Vector GPRs. These are referred to as VGPRs, and show up as v0, v1,
etc. in text shader code. It is important to recognize that the vector is 64-wide, with each item in the vector
corresponding to one thread (one pixel in a pixel shader, one vertex in a vertex shader, etc). They are not 4-wide
vectors corresponding to the rgba/xyzw components of an IL register, which must be implemented with 4 VGPRs.
For example, the resource descriptors (fetch constants) stored in memory must be fetched before the texture fetch
itself can be executed. Typically, the resource descriptor fetch would be done using a scalar op into SGPRs, while
the texture fetch itself would vary per thread (due to texture coordinates varying per-pixel), and is therefore done
with a vector texture fetch into VGPRs.
The Scalar ALU also manages control flow. For intra-wave branching, this can be done by manipulating the
EXEC architectural SGPR, a 64-bit register that qualifies which threads are active for each vector instruction.
The scalar unit also supports a variety of full-wave branch operations, typically based on the VCC architectural
SGPR, a 64-bit register containing the 1-bit result per-thread of a previous vector instruction.
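The interaction between EXEC and VCC can be illustrated with a small host-side model of one wave. This is only a sketch: the function names (v_cmp_gt, v_add_masked, wave_if_add) are hypothetical and are not SI ISA mnemonics, and the assumption that a compare only sets VCC bits for active threads is made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64  /* threads per wave, as described above */

/* Hypothetical model of a vector compare: bit i of the returned VCC-style
 * mask is set when thread i is active in exec and a[i] > b[i]. */
static uint64_t v_cmp_gt(const float *a, const float *b, uint64_t exec)
{
    uint64_t vcc = 0;
    assert(a && b);
    for (int i = 0; i < WAVE_SIZE; ++i) {
        if (((exec >> i) & 1u) && a[i] > b[i])
            vcc |= (uint64_t)1 << i;
    }
    return vcc;
}

/* Hypothetical model of a vector add: only threads whose EXEC bit is set
 * write a result; inactive threads keep the old value of dst[i]. */
static void v_add_masked(float *dst, const float *a, const float *b,
                         uint64_t exec)
{
    assert(dst && a && b);
    for (int i = 0; i < WAVE_SIZE; ++i) {
        if ((exec >> i) & 1u)
            dst[i] = a[i] + b[i];
    }
}

/* Intra-wave "if (a > b) dst = a + b;": the body runs with EXEC narrowed
 * to the threads that passed the compare. Returns the compare mask. */
static uint64_t wave_if_add(float *dst, const float *a, const float *b,
                            uint64_t exec)
{
    uint64_t taken = v_cmp_gt(a, b, exec);   /* plays the role of VCC */
    v_add_masked(dst, a, b, exec & taken);   /* body under EXEC & VCC  */
    return taken;
}
```

A caller would pass the wave's current EXEC value; threads outside the returned mask are left untouched, which is how a divergent branch avoids a real jump.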
VGPR Initialization: VGPRs are loaded with shader inputs that vary per-thread. The initialization
depends heavily on the shader type. For example, VSes load VGPR0 with the thread’s vertex index,
PSes load VGPRs with barycentric coordinates, etc. The full details are explained in the Shader
Programming section.
SGPR Initialization: SGPRs are loaded with shader inputs that do not vary per-thread, and this process
also depends on the shader type.
o Shader type specific data. For example, CSes can load the thread group ID into SGPRs, GSes
can load a GS/VS ring offset, etc.
The full SGPR loading details are described in the Shader Programming section.
Internal Regs: Several architectural registers will also be loaded by SPI. For example, the program
counter (PC) will be loaded with the driver-written shader address, and the EXEC mask will be loaded
with the valid threads.
1.2.1.6 SH Registers
A new class of hardware register has been introduced for communicating with the shader core. NI has config (1-
state) registers and context (8-state) registers. SI adds the concept of SH registers, which allow more than 8 sets
of state at once, limited by the total number of register updates in flight. Frequently updated shader registers are
SH regs – user data, program bases, etc. There is a new PM4 packet for setting SH regs, SET_SH_REG,
analogous to SET_CONTEXT_REG and SET_CONFIG_REG.
The layout of individual constants in memory (buffer, image, or sampler) is described in the register spec entries
for SQ_BUF_RSRC_WORD0-3, SQ_IMG_RSRC_WORD0-7, and SQ_IMG_SAMP_WORD0-3. These are
dummy entries; they do not correspond to real registers.
These registers are written with SET_SH_REG PM4 commands in the driver’s indirect buffer.
At shader launch, SPI loads values from these 16 hardware registers into SGPRs 0 – 15 to be read by the
shader. The actual number of values loaded by SPI is controlled by the USER_SGPR field in the
SPI_SHADER_PGM_RSRC2_xx register.
One entry in the user element table specifies mmSPI_SHADER_USER_DATA_VS_8-9 should be loaded
with a 64-bit GPU memory pointer to a table of all constant buffer SRDs:
At shader launch, SPI will load the address written by the driver into SGPRs 8 and 9. In the shader itself,
s[8:9] is dereferenced to lookup SRDs for each constant buffer.
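The pointer handling described above can be sketched in C as follows. The helper names are hypothetical, and two details are assumptions made for illustration: that the lower SGPR (s8) holds the low dword of the address, and that constant buffer SRDs are packed back to back at 4 dwords each (the size of SQ_BUF_RSRC_WORD0-3).

```c
#include <assert.h>
#include <stdint.h>

#define BUF_SRD_DWORDS 4u  /* SQ_BUF_RSRC_WORD0-3 */

/* Combine two 32-bit user-data values into a 64-bit GPU address.
 * Assumption: the lower-numbered SGPR carries the low dword. */
static uint64_t sgpr_pair_to_addr(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Byte address of the SRD for constant buffer 'slot' in a flat table
 * of consecutive buffer SRDs (assumed layout). */
static uint64_t cb_srd_addr(uint64_t table_addr, uint32_t slot)
{
    assert(table_addr % 4 == 0);  /* SRDs are dword data */
    return table_addr + (uint64_t)slot * BUF_SRD_DWORDS * sizeof(uint32_t);
}
```

For example, with s8 = 0x00001000 and s9 = 0x2 the table would start at 0x200001000, and the SRD for slot 3 would sit 48 bytes further on.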
This user data scheme is ultimately flexible for getting SRDs to the shader. We could specify SRD tables
hierarchically, in efficient sparse structures, etc. In practice, though, we currently only support two modes:
Immediate Mode. In immediate mode, a single SRD is written directly into user data hardware registers.
This makes the SRD available for use (in SGPRs) immediately, without having to read it from memory
first. This mode is severely limited since there are so few user data registers and SRDs take up 4 to 8
dwords.
Flat Table Mode. In flat table mode, the specified user data registers are programmed with an address to
GPU memory containing a table of all SRDs of the specified type. E.g., PTR_RESOURCE_TABLE
requires a pointer to a table with all SRV SRDs stored consecutively. The shader will have to explicitly
load SRDs from the table before performing a fetch.
Draw Engine (DE): The standard graphics engine is now referred to as the Draw Engine. Most PM4 commands
continue to be submitted to the DE command buffer.
Constant Engine (CE): The constant engine uses a second, separate command buffer to control constant uploads.
The engine runs in parallel with the draw engine, allowing constant updates to get sufficiently ahead of the
draws/dispatches that will use them.
Additionally, CP has 64KB of on-chip RAM (CE RAM) that acts as a staging buffer for constant updates.
Shaders cannot read directly from the CE RAM.
The 64KB is carved up between the 3 rings, with ring-0 (gfx) having 32KB. The driver is responsible for further
subdividing the partitions to store an on-chip copy of the most up-to-date version of every SRD.
Writes: Write operations update a specified location in the CE RAM with inline data from the CE
command buffer. As described above, these writes are used at resource bind to keep the CE RAM image
up to date.
Dumps: Dump operations copy from CE RAM to GPU memory. Dumps are performed during validation
to update a newly allocated chunk with the most up to date SRD table.
Loads: Load operations copy data from GPU memory to CE RAM.
The driver must explicitly manage synchronization between the CE and DE command buffers. To handle this,
there are two counters:
CE Counter: The CE counter can be incremented with a packet in the CE command buffer. Before each
draw/dispatch, we insert a CE increment packet.
DE Counter: The DE counter can be incremented with a packet in the DE command buffer. We issue a
DE counter increment after every draw/dispatch.
Finally, CP supports a PM4 command allowing the DE engine to wait for (CE – DE > 0). We issue this packet
before each draw/dispatch, ensuring that the updated constants are ready in memory before the draw is executed.
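The counter protocol above can be modelled on the host as a pair of counters and a wait predicate. This is a sketch only: the struct and function names are hypothetical, the 32-bit counter width is an assumption, and the signed difference is a design choice made here so that the comparison still behaves if the counters wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the two synchronization counters. */
typedef struct {
    uint32_t ce;  /* bumped by a packet in the CE command buffer */
    uint32_t de;  /* bumped by a packet in the DE command buffer */
} ce_de_counters;

/* CE side: issued once the constants for a draw have been dumped. */
static void ce_increment(ce_de_counters *c) { assert(c); c->ce++; }

/* DE side: issued after every draw/dispatch. */
static void de_increment(ce_de_counters *c) { assert(c); c->de++; }

/* DE-side wait condition from the text: proceed only when (CE - DE > 0),
 * i.e. the CE has finished at least one more draw's worth of constants
 * than the DE has consumed. */
static bool de_may_draw(const ce_de_counters *c)
{
    assert(c);
    return (int32_t)(c->ce - c->de) > 0;
}
```

In this model the DE stalls whenever it has caught up with the CE, and resumes as soon as the CE signals one more completed constant update.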
Fetch Shader Subroutine (FS): Implement a subroutine called by the VS to load all input VGPRs, based
on input layout.
Monolithic VS with Embedded Vertex Fetch (MVS): Compile vertex fetch as part of the VS. A new
hardware VS would need to be compiled for every associated input layout.
Fetch Operation Per Vertex Element (FOPE): Implement a set of small subroutines that each fetch a
single input element, VS would call subroutines to fetch input elements as necessary.
1.4.2 FS Subroutine
The fetch shader subroutine attempts to emulate the fetch shader functionality on NI. As on NI, the driver allocates
video memory for the FS and generates hardware shader instructions directly mapping the vertex layout state to the
shader inputs. Key changes from NI:
FS is a true VS subroutine.
o No SQ_PGM_START_FS register. Compiler specifies a user data register to be programmed
with the FS address.
o No SQ_PGM_RESOURCES_FS register. Driver may need to update hardware VS registers
based on which FS is bound (account for FS VGPR/SGPR usage, etc).
o FS is responsible for explicitly returning to the VS.
No semantic fetch support. Without semantic fetch, the compiler returns a map from logical input
registers to physical GPRs. A unique FS is generated for every combination of vertex layout state and
hardware VS.
No fixed function BaseVertexLocation or StartInstanceLocation support. NI exposed
SQ_VTX_BASE_VTX_LOC and SQ_VTX_START_INST_LOC CTL constant registers for these
parameters which modify the vertex index and instance index, respectively, before invoking the FS. We
must now specify these parameters through user data registers and compute the offset values in the FS.
On SI, this responsibility falls on the driver, which must program the SPI_PS_INPUT_CNTL_0-31 registers with
the absolute parameter cache location of each PS input. This must be handled at validation time by examining the
VS output declarations and PS input declarations as returned by the compiler.
values as needed. This feature was controlled by the SOURCE_FORMAT field in the CB_COLORn_INFO
register.
On SI, the SX hardware block no longer provides this fixed function support. Instead, the PS must issue shader
instructions to reduce precision before executing export instructions. This requires slightly different versions of
each PS based on the formats of the bound render targets.
At draw time, the driver will examine the bound RTs, and determine which format should be used for each export.
There are 10 possible export formats, selected per-RT in the SPI_SHADER_COL_FORMAT register (e.g.,
32_ABGR, UNORM16_ABGR, 32_R, etc). Based on the format chosen for each RT, the driver will potentially
create a new version of the PS, patching all export sequences with the appropriate export format.
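To make the idea of shader-side packing concrete, the following host-side C sketch packs two color components into one dword as 16-bit UNORM values, as a PS might do before exporting to a UNORM16_ABGR target. The function names are hypothetical, and the rounding rule, the clamping, and the choice of which component lands in the low half are assumptions made for illustration; the actual conversion instructions are defined by the ISA document.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a float to a 16-bit UNORM value.
 * Assumptions: clamp to [0, 1], NaN maps to 0, round to nearest. */
static uint32_t float_to_unorm16(float x)
{
    uint32_t v;
    if (!(x > 0.0f))
        return 0;                 /* negatives, zero and NaN */
    if (x > 1.0f)
        x = 1.0f;
    v = (uint32_t)(x * 65535.0f + 0.5f);
    assert(v <= 0xFFFFu);
    return v;
}

/* Pack two components into one export dword.
 * Assumption: 'lo' occupies bits 15:0 and 'hi' occupies bits 31:16. */
static uint32_t pack_unorm16x2(float lo, float hi)
{
    return float_to_unorm16(lo) | (float_to_unorm16(hi) << 16);
}
```

With this packing a four-component color fits in two dwords instead of four, which is the kind of reduction the SX used to perform in fixed function.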
1.7.1 Gotchas
There are some gotchas with the tile index scheme worth noting:
CB and SRDs allow a 5-bit index, allowing the choice of any tile config in the table. DB only has a 3-bit
index, so the Z/stencil tile configs have to be in the first 8 entries.
There is no LINEAR_GENERAL entry. Linear general is implied for all buffer SRVs. When rendering to
a buffer RT, an override bit, CB_COLORn_INFO.LINEAR_GENERAL, must be set telling the CB to
ignore the programmed tile index.
Z/stencil buffers have two tile indices:
o DB_Z_INFO.TILE_MODE_INDEX defines all tile parameters for the depth plane, and most
parameters for the stencil plane.
o DB_STENCIL_INFO.TILE_MODE_INDEX specifies an entry from which the stencil’s
TILE_SPLIT parameter will be read (others are shared with the depth plane).
MSAA color surfaces also have two tile indices. CB_COLORn_ATTRIB.FMASK_TILE_MODE_INDEX
specifies an entry from which the fmask’s tile parameters will be read.
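The index-width restriction in the first gotcha lends itself to a simple driver-side check. The sketch below is hypothetical (the macro and function names are not from the register spec); it only encodes the widths stated above: a 32-entry table, a 5-bit index for CB and SRDs, and a 3-bit index for DB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TILE_TABLE_ENTRIES 32u  /* 32-entry global tiling table */
#define CB_TILE_INDEX_BITS 5u   /* CB and SRDs */
#define DB_TILE_INDEX_BITS 3u   /* DB: only the first 8 entries */

/* Returns true when 'index' is a valid table entry that can also be
 * encoded in an index field of the given width. */
static bool tile_index_fits(uint32_t index, uint32_t bits)
{
    assert(bits > 0 && bits < 32);
    return index < TILE_TABLE_ENTRIES && index < (1u << bits);
}
```

A table layout that places a Z/stencil configuration at entry 8 or above would fail this check for the DB, even though the same entry is reachable from the CB.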
2. PM4
2.1 Overview
PM4 is the packet API used to program the GPU to perform a variety of tasks. The driver does not write directly to
the GPU registers to carry out drawing operations on the screen. Instead, it prepares data in the format of PM4
Command Packets in either system or video (a.k.a. local) memory, and lets the Micro Engine do the rest of the
job.
Three types of PM4 command packets are currently defined. They are types 0, 2 and 3 as shown in the following
figure. A PM4 command packet consists of a packet header, identified by field HEADER, and an information body,
identified by IT_BODY, that follows the header. The packet header defines the operations to be carried out by the
PM4 micro-engine, and the information body contains the data to be used by the engine in carrying out the
operation. In the following, we use brackets [.] to denote a 32-bit field (referred to as DWord) in a packet, and
braces {.} to denote a size-varying field that may consist of a number of DWords. If a DWord consists of more than
one field, the fields are separated by "|". The field that appears on the far left takes the most significant bits, and the
field that appears on the far right takes the least significant bits. For example, DWord [HI_WORD | LO_WORD] denotes that
HI_WORD is defined on bits 16-31, and LO_WORD on bits 0-15. A C-style notation of referencing an element of a
structure is used to refer to a sub-field of a main field. For example, MAIN_FIELD.SUBFIELD refers to the sub-
field SUBFIELD of MAIN_FIELD.
The use of this packet requires the complete understanding of the registers to be written. The register address is split
into two areas: the first 32K bytes is system registers and beyond that is graphics and multi-media. For graphics and
multi-media registers there is an alternative, called SET_*. For the first 32KB of register space (system registers)
there is no SET_* type packet and TYPE-0 packets should be used.
Type-3 packets have a common format for their headers. However, the size of their information body may vary
depending on the value of field IT_OPCODE. The size of the information body is indicated by field COUNT. If the
size of the information is N DWords, the value of COUNT is N-1. In the following packet definitions, we will
describe the field IT_BODY for each packet with respect to a given IT_OPCODE, and omit the header.
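The COUNT rule can be sketched as a small header builder. The bit positions used here (packet type in bits 31:30, COUNT in bits 29:16, IT_OPCODE in bits 15:8) follow the PACKET3 macro in the open-source Linux radeon driver and should be treated as an assumption, since the header figure is not reproduced in this section. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a Type-3 header for an information body of 'body_dwords' DWords.
 * Assumed layout: [31:30] type = 3, [29:16] COUNT, [15:8] IT_OPCODE. */
static uint32_t pm4_type3_header(uint32_t it_opcode, uint32_t body_dwords)
{
    uint32_t count;
    assert(it_opcode <= 0xFFu);
    assert(body_dwords >= 1 && body_dwords <= 0x4000u);
    count = body_dwords - 1;  /* COUNT is N - 1 for an N-DWord body */
    return (3u << 30) | (count << 16) | (it_opcode << 8);
}
```

For instance, a packet with a three-DWord body carries COUNT = 2 in its header, whatever its opcode.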
ME_INITIALIZE 0 X X
PREAMBLE_CNTL 0-2 X
Command Buffer Packets
INDIRECT_BUFFER 0-2 X
INDIRECT_BUFFER_CONST 0-2 X
Draw/Dispatch Packets
DRAW_INDEX 0 X
DRAW_INDEX_2 0 X
DRAW_INDEX_AUTO 0 X
DRAW_INDEX_MULTI_AUTO 0 X
DRAW_INDEX_IMMED 0 X
DRAW_INDEX_INDIRECT 0 X
INDEX_BUFFER_SIZE 0 X
DRAW_INDEX_OFFSET 0 X
DRAW_INDEX_OFFSET_2 0 X
DRAW_INDIRECT 0 X
INDEX_BASE 0 X
INDEX_TYPE 0 X
NUM_INSTANCES 0 X
MPEG_INDEX 0 X
DISPATCH_DIRECT 0-2 X
DISPATCH_INDIRECT 0-2 X
State Management Packets
CLEAR_STATE 0-2 X
CONTEXT_CONTROL 0-2 X
LOAD_CONFIG_REG 0 X
LOAD_CONTEXT_REG 0-2 X
LOAD_SH_REG 0-2 X
ALLOC_GDS 0-2 X
SET_BASE 0-2 X X
SET_CONFIG_REG 0 X
SET_CONTEXT_REG 0-2 X
SET_CONTEXT_REG_INDIRECT 0-2 X
SET_SH_REG 0-2 X
LOAD_CONST_RAM 0-2 X
WRITE_CONST_RAM 0-2 X
WRITE_CONST_RAM_OFFSET 0-2 X
DUMP_CONST_RAM 0-2 X
SET_CE_DE_COUNTERS 0-2 X
INCR_DE_COUNTER 0-2 X
INCR_CE_COUNTER 0-2 X
WAIT_ON_DE_COUNTER 0-2 X
WAIT_ON_CE_COUNTER 0-2 X
Command Predication Packets
COND_EXEC 0-2 X
COND_WRITE 0-2 X
SET_PREDICATION 0-2 X
PRED_EXEC 0-2 X
OCCLUSION_QUERY 0-2 X
Synchronization Packets
EVENT_WRITE 0-2 X
EVENT_WRITE_EOP 0-2 X
EVENT_WRITE_EOS 0-2 X
MEM_SEMAPHORE 0-2 X
PFP_SYNC_ME 0-2 X
STRMOUT_BUFFER_UPDATE 0-2 X
SURFACE_SYNC 0-2 X
WAIT_REG_MEM 0-2 X
Atomic
ATOMIC 0-2 X
ATOMIC_GDS 0-2 X
Misc Packets
COPY_DW 0-2 X
COPY_DATA 0-2 X
ME_WRITE 0-2 X
CE_WRITE 0-2 X
MEM_WRITE 0-2 X
NOP 0-2 X X
ONE_REG_WRITE 0-2 X
The ME_INITIALIZE packet should be sent to the CP immediately after loading the microcode and enabling
the Micro Engine (ME).
This Type-3 packet is used by the ME to initialize internal state information that is used by other packets.
If the ME_INITIALIZE packet changes the MAX_CONTEXT value, then it needs to be followed by a
CONTEXT_CONTROL packet with a full load mask to force a reload of shadowed registers and constants.
If the device supports more than one ring buffer for a single GPU, only the primary ring (3D ring) should have
this packet.
Max context of 0 is not valid since that context is now used for the clear
state context. For example, 3 means the GPU uses contexts 0-3, i.e., it
utilizes 4 contexts.
5 DEV_ID | EXTERNAL_MEM_SWAP
    31:24 - Reserved
    23:16 - One-hot Device-ID
    15:2 - Reserved
    1:0 - Swap Code used for the following transactions: Load_*, Set_*, PM4 headers - debug
6 Header_Dump_Base | Header_Dump_Swap
    31:4 - Header_Dump_Base: a 4 Kbyte aligned address, i.e. base memory address [47:12] of the
           external memory location where CP will dump PM4 Headers.
    3:2 - Reserved: should be set to zero.
    1:0 - Header_Dump_Swap: the 2 bit Swap Code used when writing headers to memory.
7 Header_Dump_Enable | Header_Dump_Size
    31 - Header_Dump_Enable: Enable Writing PM4 Headers to Memory for Debug (Degrades Performance).
    30 - Reserved.
    29:0 - Header_Dump_Size: Size in DWords for the Header Dump Ring in External Memory.
2.2.2 SET_CONFIG_REG
The SET_CONFIG_REG packet loads the single-context-configuration register data, which is embedded in the
packet, into the chip. The REG_OFFSET field is a DWord-offset from the starting address. All the register data in
the packet is written to consecutive register addresses beginning at the starting address. The starting address for
register data is computed as follows:
Reg_Start_Address[17:2] = 0x2000 + REG_OFFSET (Note: Byte Offset 0x8000; DWord Offset 0x2000)
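As a worked instance of the formula above, the following sketch converts a REG_OFFSET into the byte address of the first register written. The function name is hypothetical, and the 16-bit limit on REG_OFFSET is an assumption carried over from the [15:0] field width given for the other SET_* packets.

```c
#include <assert.h>
#include <stdint.h>

#define CONFIG_REG_BASE_DW 0x2000u  /* DWord offset; byte offset 0x8000 */

/* Byte address of the first register written by SET_CONFIG_REG. */
static uint32_t config_reg_byte_addr(uint32_t reg_offset)
{
    assert(reg_offset <= 0xFFFFu);  /* assumed 16-bit REG_OFFSET field */
    /* Reg_Start_Address[17:2] is a DWord address, so scale by 4. */
    return (CONFIG_REG_BASE_DW + reg_offset) * 4u;
}
```

A REG_OFFSET of 0 therefore targets byte address 0x8000, and each increment of REG_OFFSET advances the target by 4 bytes.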
The CP will write the data to external memory if the corresponding shadow enable is set. This allows the register
data to be reloaded into the chip later with the LOAD_CONFIG_REG packet. The LOAD_CONFIG_REG packet
sets the REG_CONFIG_BASE and the CONTEXT_CONTROL packet enables/disables write shadowing to external
memory (see these packets for more details). The starting external memory address that the register data is written to
is computed as follows:
2.2.3 SET_BASE
The SET_BASE packet is used to generically specify the starting, or base, memory address of a buffer used by the CP
for unrelated features.
BASE_INDEX detail:
0010 (GDS_Partition_Bases), the packet sets the partition boundaries for each section in the GDS. There
are 3 sections: Ring 0 Gfx/CS0, CS1 & CS2. GDS_RING0_INDEX is always 0, but the other
boundaries are programmable as indicated in the table below.
0011 (CE_Partition_Bases), the packet sets the partition boundaries for each section in the CE. There are 3
sections: Ring 0 Gfx/CS0, CS1 & CS2. CE_RING0_INDEX is always 0, but the other boundaries are
programmable as indicated in the table below.
2.2.4 LOAD_CONFIG_REG
The LOAD_CONFIG_REG packet will only be processed if the CP processes a CONTEXT_CONTROL packet
with the appropriate shadow bit set.
Initialize the CONFIG_REG_BASE (BASE_ADDR_* fields) internally for later use when a
SET_CONFIG_REG packet is processed and shadowing is enabled (via the CONTEXT_CONTROL
packet). For this case, there are 5 DWs in the packet and DWs 4 and 5, REG_OFFSET and NUM_DWords
are programmed to zero.
Fetch single-context-configuration register data from external memory into the chip that was previously
shadowed. For this case, there are 5 or more DWs and all are meaningful.
Reg_Start_Address[17:2] = 0x2000 + REG_OFFSET (Note: Byte Offset 0x8000 = DWord Offset 0x2000)
To preserve coherency between shadowed Set_* packet writes and Load_* packet reads from external memory, the
CP first waits until all prior SET_* data has been shadowed to memory before issuing the Load's memory read
requests for the register data. The CP will fetch Num_DWords from external memory. If, however, the driver only
needs to change the Base_Addr, then Num_DWords can be set to zero and no data will be fetched.
2 OFFSET [15:0] Starting byte offset into the Constant RAM. The minimum granularity is 4 bytes, so
bits[1:0] must be zero.
3 NUM_DW [14:0] Number of DWs to read from the constant RAM. The minimum granularity is DWs, so
any bit can be ‘1’.
4 ADDR_LO [31:0] Byte Address[31:0]. The Address granularity is 4 bytes, so bits[1:0] must be zero.
2.3.2 INCREMENT_CE_COUNTER
In the ME this packet creates a pipelined event that causes the CP EOP block to increment its counter. If the packet
command specifies to clear the counter, the ME does this at the top of pipe and clears the associated counter. The
counter is double buffered.
2.3.3 INDIRECT_BUFFER_CONST
This packet is used for dispatching Constant Indirect Buffers, new in SI. A separate constant buffer allows the CP to
process constants ahead of and concurrently with the “draw command buffer”. The driver effectively creates two
command buffers where it created one prior to SI. The command buffer pointed to by this packet has new
packets specifying the constants.
The KMD is required to specify the VMID in each Indirect_Buffer_Const packet that it places in the ring buffer.
2.3.4 LOAD_CONST_RAM
Can be conditionally executed to initialize or re-prime the CE's constant RAM from memory at the beginning of an
app or at a switch from one application to another.
2.3.5 SET_CE_DE_COUNTERS
This packet initializes the ME counter called DE_COUNT and the CE version called CE_COMPARE_COUNT. The
WAIT_ON_DE_COUNTER packet only compares to the DE_COUNT, whereas the WAIT_ON_DE_COUNTER_DIFF
packet subtracts the DE_COUNT from the CE_COMPARE_COUNT. This packet does not clear the CE’s
up/down counter called CE_COUNTER since that will always be zero already by design.
2.3.6 WAIT_ON_DE_COUNTER
Instructs the CE to wait for the counter from the DE to be greater than or equal to COUNTER_HI:COUNTER_LO.
DW Field Description
1 HEADER Header of the packet.
2 COUNTER_LO [31:0] Lower 32 bits of the counter.
3 COUNTER_HI [31:0] Upper 32 bits of the counter.
2.3.7 WAIT_ON_DE_COUNTER_DIFF
Instructs the CE to wait for the difference between the ME’s copy of the DE counter (DE_COUNT) and the CE’s copy
(CE_COMPARE_COUNT) to be less than the DIFF.
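The wait condition can be written as a one-line predicate. This sketch uses the subtraction order given under SET_CE_DE_COUNTERS (DE_COUNT subtracted from CE_COMPARE_COUNT); the function name and the 32-bit counter width are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the CE may proceed past WAIT_ON_DE_COUNTER_DIFF, i.e. when
 * (CE_COMPARE_COUNT - DE_COUNT) < DIFF. Unsigned arithmetic keeps the
 * difference meaningful if both counters wrap (assumed 32-bit width). */
static bool ce_wait_diff_satisfied(uint32_t ce_compare_count,
                                   uint32_t de_count, uint32_t diff)
{
    assert(diff > 0);  /* a DIFF of 0 could never be satisfied */
    return (uint32_t)(ce_compare_count - de_count) < diff;
}
```

Read this way, DIFF would act as a cap on how many draws ahead of the DE the CE is allowed to run, although the text above does not state that intent explicitly.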
2.3.8 WRITE_CONST_RAM
Write DWs from the PM4 stream to the CE's constant RAM.
DW Field Description
1 HEADER Header
2 OFFSET [15:0] Starting DW granularity offset into the constant RAM. Thus, bits[1:0] are zero.
2.3.9 WRITE_CONST_RAM_OFFSET
Packet is similar to the WRITE_CONST_RAM packet, except that the DATA ordinal needs to be modified. The
data supplied is an offset from the partition base, but the partition base is unknown. The microcode needs to replace
the DATA with 'DATA + Partition_Base' before writing it into the CE RAM. The corresponding partition is
determined from the Ring to which the packet is submitted (see GDS_*_INDEX as described in SET_BASE packet
for more details).
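The patching step performed by the microcode can be sketched as below. The struct, the function name, and the example base values are hypothetical; the only facts taken from the text are that there are three partitions selected by ring, that the ring 0 base is always 0, and that the value written is DATA + Partition_Base.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RINGS 3  /* Ring 0 Gfx/CS0, CS1 and CS2 */

/* Hypothetical record of the partition bases programmed via SET_BASE. */
typedef struct {
    uint32_t base[NUM_RINGS];
} partition_bases;

/* Replace DATA with DATA + Partition_Base for the submitting ring. */
static uint32_t patch_const_ram_data(const partition_bases *p,
                                     unsigned ring, uint32_t data)
{
    assert(p && ring < NUM_RINGS);
    assert(p->base[0] == 0);  /* ring 0 partition always starts at 0 */
    return data + p->base[ring];
}
```

The benefit of this scheme is that the same command buffer contents remain valid on any ring, because the ring-specific base is only applied when the packet is processed.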
1 HEADER Header
The Shadow Enable bits are used to turn shadowing on and off for the SET_* packets. When set, the
memory-mapped register writes will be shadowed to memory and when reset, they will not. There is a bit for enabling/disabling
shadowing for each type of SET packet (review the SET packets for more details on shadowing). The Load and
Shadow DWs each have a DW Enable bit. When not set, the DW will be discarded so that when needed, the driver
may update one without affecting the other.
2.4.2 CLEAR_STATE
The purpose of the Clear_State packet is to reduce command buffer preamble setup time for all driver versions of
both DX and OpenGL and to specifically support DX11’s Display Lists requirements. The definition of Clear State
is essentially everything off, resources all NULL, other values set to a defined default state.
2.4.3 LOAD_CONTEXT_REG
This packet provides the ability to have the CP:
Initialize the CONTEXT_REG_BASE (BASE_ADDR_* fields) internally for later use when a
SET_CONTEXT_REG packet is processed and shadowing is enabled (via the CONTEXT_CONTROL
packet). For this case, there are 5 DWs in the packet and DWs 4 and 5, REG_OFFSET and NUM_DWords
are programmed to zero.
Fetch eight-context-configuration register data from external memory into the chip that was previously
shadowed. For this case, there are 5 or more DWs and all are meaningful.
The CP computes the DWord-aligned external memory read address as follows:
2.4.4 SET_CONTEXT_REG
This packet loads the eight-context-renderstate register data, which is embedded in the packet, into the chip. Note:
This packet checks if the context needs to be updated and rolls the context as required. The REG_OFFSET field is a
DWord-offset from the starting address. All the render state data in the packet is written to consecutive register
addresses beginning at the starting address. The starting address for register data is computed as follows:
• Reg_Start_Address[17:2] = 0xA000 + REG_OFFSET (Note: Byte Offset 0x28000; DWord Offset 0xA000)
The CP will write the data to external memory if the corresponding shadow enable is set. This allows the register
data to be reloaded into the chip after a context switch with the LOAD_CONTEXT_REG (LCTX) packet. The
LCTX packet sets the REG_CONTEXT_BASE and the CONTEXT_CONTROL packet enables/disables write
shadowing to external memory (see these packets for more details). The starting external memory address that the
render state data is written to is computed as follows:
To preserve coherency between shadowed Set packet writes and Load packet reads from external memory, the
CP.PFP first waits until all prior SET_* data has been shadowed to memory before issuing the Load's memory read
requests for the register data.
1 HEADER Header of the packet. Shader_Type in bit 1 of the Header will correspond to the shader
type of the Load, see Type-3 Packet.
2 REG_OFFSET [15:0] - Offset in DWords from the register base address (0xA000 in DWs) and memory
base address (CONTEXT_REG_BASE).
3 to N REG_DATA DWord Data for Registers or DW Offset into the Patch Table.
2.4.5 SET_CONTEXT_REG_INDIRECT
This packet loads the eight-context-renderstate register data, which the CP fetches from the Patch Table, starting at
offset REG_INDEX. Note: This packet checks if the context needs to be updated and rolls the context as required.
The REG_OFFSET field is a DWord-offset from the starting address. The write address for the register data is
computed as follows:
• Reg_Start_Address[17:2] = 0xA000 + REG_OFFSET (Note: Byte Offset 0x28000; DWord Offset 0xA000)
The CP will write the data to external memory if the corresponding shadow enable is set. This allows the register
data to be reloaded into the chip later with the LOAD_CONTEXT_REG packet. The LOAD_CONTEXT_REG
packet sets the REG_CONTEXT_BASE and the CONTEXT_CONTROL packet enables/disables write shadowing
to external memory (see these packets for more details). The starting external memory address that the render state
data is written to is computed as follows:
To preserve coherency between shadowed Set packet writes and Load packet reads from external memory, the
CP.PFP first waits until all prior SET_* data has been shadowed to memory before issuing the Load's memory read
requests for the register data.
2.4.6 LOAD_SH_REG
This packet provides the ability to have the CP:
Initialize the per-ring SH_REG_BASE (BASE_ADDR_* fields) internally for later use when a
SET_SH_REG packet is processed and shadowing is enabled (via the CONTEXT_CONTROL packet). For
this case, there are 5 DWs in the packet and DWs 4 and 5, REG_OFFSET and NUM_DWords are
programmed to zero.
Fetch SH REG data from external memory into the chip that was previously shadowed. For this case, there
are 5 or more DWs and all are meaningful.
The CP computes the DWord-aligned external memory read address as follows:
To preserve coherency between shadowed Set packet writes and Load packet reads from external memory, the CP
first waits until all prior SET_* data has been shadowed to memory before issuing the Load's memory read requests
for the register data. The CP will fetch Num_DWords from external memory. If, however, the driver only needs to
change the Base_Addr, then Num_DWords can be set to zero and no data will be fetched.
2.4.7 SET_SH_REG
This packet loads the shader persistent register state, which is embedded in the packet, into the SPI on the chip.
The REG_OFFSET field is a DWord-offset from the starting address. All the persistent state data in the packet is
written to consecutive register addresses beginning at the starting address. The starting address for register data is
computed as follows:
Reg_Start_Address[17:2] = 0x2C00 + REG_OFFSET (Note: Byte Offset 0xB000; DWord Offset 0x2C00)
The CP will write the data to external memory if the corresponding shadow enable is set. This allows the register
data to be reloaded into the chip later with the LOAD_SH_REG packet. The LOAD_SH_REG packet sets the
SH_REG_BASE and the CONTEXT_CONTROL packet enables/disables write shadowing to external memory (see
these packets for more details). The starting external memory address that the render state data is written to is
computed as follows:
To preserve coherency between shadowed Set packet writes and Load packet reads from external memory, the
CP.PFP first waits until all prior SET_* data has been shadowed to memory before issuing the Load's memory read
requests for the register data.
2 REG_OFFSET [15:0] - Offset in DWords from the register base address (0x2C00 in DWs) and memory
base address (SH_REG_BASE).
3 to N REG_DATA DWord Data for Registers.
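The register start address formula given for SET_SH_REG can be sketched in C as follows. This is only an illustration of the arithmetic stated above; the function and macro names are hypothetical and are not part of any driver interface.

```c
#include <assert.h>
#include <stdint.h>

/* DWord offset of the persistent shader register space (byte offset 0xB000). */
#define SH_REG_SPACE_START_DW 0x2C00u

/* DWord address of the first register written by a SET_SH_REG packet,
 * given the REG_OFFSET field (bits [15:0] of DW 2). Hypothetical helper. */
static uint32_t sh_reg_start_dw(uint32_t reg_offset)
{
    assert(reg_offset <= 0xFFFFu);   /* REG_OFFSET is a 16-bit field */
    return SH_REG_SPACE_START_DW + reg_offset;
}

/* The same address expressed in bytes (DWord address shifted left by 2). */
static uint32_t sh_reg_start_byte(uint32_t reg_offset)
{
    return sh_reg_start_dw(reg_offset) << 2;
}
```

With REG_OFFSET = 0 the helpers return DWord address 0x2C00 and byte address 0xB000, matching the note in the formula.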
2.5.2 DRAW_INDEX_AUTO
Draws a set of primitives using indices auto-generated by the VGT.
2.5.3 DRAW_INDEX_IMMED
Draws a set of primitives using indices in the packet.
2.5.4 DRAW_INDEX_INDIRECT
The information needed for the Draw is embedded in a buffer (rather than in the packet) and therefore the CP must
fetch the information from memory before executing the Draw. The data structure in memory has this format:
struct DrawIndexedInstancedArgs
{
UINT IndexCountPerInstance;
UINT InstanceCount;
UINT StartIndexLocation;
UINT BaseVertexLocation;
UINT StartInstanceLocation;
};
Definition of Parameters:
• IndexCountPerInstance: The number of indexes per instance of the index buffer that indexes are read from to
draw the primitives.
• InstanceCount: The number of instances of the index buffer that indexes are read from to draw the primitives.
• StartIndexLocation: Index of the first index to use when accessing the vertex buffer; begin at StartIndexLocation
to index vertices from the vertex buffer.
• BaseVertexLocation: The number that should be added to each index that is referenced by the various primitives
to determine the actual index of the vertex elements in each vertex stream.
• StartInstanceLocation: The first instance of the index buffer that indexes are read from to draw the primitives.
The driver must send the following packets before sending the DRAW_INDEX_INDIRECT packet: SET_BASE
packet to specify the start address of BufferForArgs, the INDEX_BASE packet to specify where the index buffer
starts and the INDEX_BUFFER_SIZE packet to specify the number of indices.
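As a rough illustration of the argument layout described above, the following C sketch places a DrawIndexedInstancedArgs structure into a CPU-visible buffer at a DWord-aligned offset. It assumes UINT is an unsigned 32-bit value; the helper name and the buffer handling are hypothetical, and how the buffer reaches GPU-visible memory is outside the scope of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t UINT;   /* assumption: UINT is an unsigned 32-bit value */

struct DrawIndexedInstancedArgs {
    UINT IndexCountPerInstance;
    UINT InstanceCount;
    UINT StartIndexLocation;
    UINT BaseVertexLocation;
    UINT StartInstanceLocation;
};

/* Copies the five arguments as consecutive DWords into `buffer`, starting at
 * byte offset `offset`. Hypothetical helper for illustration only. */
static void write_draw_indexed_args(uint8_t *buffer, size_t offset,
                                    const struct DrawIndexedInstancedArgs *args)
{
    assert(offset % 4 == 0);   /* keep the structure DWord-aligned */
    memcpy(buffer + offset, args, sizeof *args);
}
```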
2.5.5 DRAW_INDEX_MULTI_AUTO
Combines several individual packets into a single one to allow fast draws of primitives where the number of indices
is small.
2.5.6 DRAW_INDEX_OFFSET_2
This packet, in conjunction with the INDEX_TYPE and INDEX_BASE packets, draws a set of primitives using
indices fetched from a bounded index buffer while minimizing the amount of address patching that the driver
must do for Vista BDM. The base of the index buffer, supplied in the INDEX_BASE packet, and the
index type (16 bit or 32 bit), supplied in the INDEX_TYPE packet, must have already been sent when this packet
arrives at the CP.
2.5.7 DRAW_INDIRECT
The information needed for the Draw is embedded in a buffer (rather than in the packet) and therefore the CP must
fetch the information from memory before executing the Draw. The data structure in memory has this format:
struct DrawInstancedArgs
{
UINT VertexCountPerInstance;
UINT InstanceCount;
UINT StartVertexLocation;
UINT StartInstanceLocation;
};
Note: See the DRAW_INDEX_INDIRECT packet for the definitions of the terms. The driver must send the following
packet before sending the DRAW_INDIRECT packet: the SET_BASE packet to specify the start address of
BufferForArgs. This packet will write to two SGPRs, so they need to be included in the total number the SPI loads,
and this must be coordinated with the shader compiler.
2.5.8 INCREMENT_DE_COUNTER
In the ME, this packet creates a pipelined event that causes the CP EOP block to increment its counter. If the packet
command specifies to clear the counter, the ME does this at the top of pipe and clears the associated counter. The
counter is double buffered.
2.5.9 INDEX_BASE
The CP saves the INDEX_BASE address in this packet so that, when it processes the DRAW_INDEX_OFFSET
packet, it can add the base address to the offset shifted left by one or two bits, depending on the size of the index
(16 bits or 32 bits) specified previously in the INDEX_TYPE packet. This packet is considered part of the draw
packet sequence, so the INDEX_BASE is not shadowed. If this packet is not sent before each draw, then it will need
to be in the preamble of each command buffer to ensure it gets set correctly before the first draw.
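The address computation described for INDEX_BASE can be sketched as below. This is a minimal model of the arithmetic only (base address plus index offset scaled by the index size); the function name and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of the first index to fetch: the saved INDEX_BASE plus the
 * index offset shifted left by 1 (16-bit indices) or 2 (32-bit indices).
 * Hypothetical helper for illustration only. */
static uint64_t index_fetch_address(uint64_t index_base,
                                    uint32_t index_offset,
                                    unsigned index_bits)
{
    assert(index_bits == 16 || index_bits == 32);
    unsigned shift = (index_bits == 16) ? 1u : 2u;
    return index_base + ((uint64_t)index_offset << shift);
}
```

For example, with a base of 0x1000 and an offset of 3 indices, the fetch starts at 0x1006 for 16-bit indices and at 0x100C for 32-bit indices.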
2.5.10 INDEX_BUFFER_SIZE
The purpose of the INDEX_BUFFER_SIZE packet, in conjunction with the INDEX_TYPE, INDEX_BASE and
DRAW_INDEX_INDIRECT packets, is to allow the CP to calculate the value to write to the
VGT_DMA_MAX_SIZE register. See the DRAW_INDEX_INDIRECT packet for how it is used for that packet.
2.5.11 INDEX_TYPE
This packet is considered part of the draw packet sequence, so the VGT_INDEX_TYPE is not shadowed. If this
packet is not sent before each draw then it will need to be in the preamble of each command buffer to ensure it gets
set correctly before the first draw.
2.5.12 INDIRECT_BUFFER
This packet is used for dispatching Indirect Buffers. The KMD is required to specify the VMID in each
Indirect_Buffer packet.
2.5.13 MPEG_INDEX
MPEG_INDEX: Packed register writes for MPEG and Generation of Indices.
2 NUM_INDICES Number of Indices the VGT will actually fetch + 3 * number of base indices
given at end of this packet. Valid values are 0x0003 to 0x3FFF.
3 DRAW_INITIATOR Written unconditionally to the VGT_DRAW_INITIATOR register.
4 to 4 + ((NUM_INDICES/3) - 1) 32-Bit INDEX First Index of Rect (0x00000000 to 0xFFFFFFFD). For each
First Index, the CP will generate the other 2 indices and output: FIRST_INDEX, FIRST_INDEX+1,
FIRST_INDEX+2. All indices are written to the VGT_IMMED_DATA register.
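The index generation described in the table above (each first index expands to three consecutive indices) can be modelled with the short C sketch below. It only mirrors the expansion rule; the function name is hypothetical and the output array stands in for the writes to VGT_IMMED_DATA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expands each first index into FIRST_INDEX, FIRST_INDEX+1, FIRST_INDEX+2.
 * `out` must have room for 3 * num_first entries. Returns the number of
 * indices produced. Hypothetical model for illustration only. */
static size_t mpeg_expand_indices(const uint32_t *first, size_t num_first,
                                  uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < num_first; ++i) {
        assert(first[i] <= 0xFFFFFFFDu);   /* valid range from the table */
        out[n++] = first[i];
        out[n++] = first[i] + 1u;
        out[n++] = first[i] + 2u;
    }
    return n;
}
```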
2.5.14 NUM_INSTANCES
NUM_INSTANCES is used to specify the number of instances for the subsequent draw command.
This packet is considered part of the draw packet sequence, so the VGT_NUM_INSTANCES is not shadowed. If
this packet is not sent before each draw then it will need to be in the preamble of each command buffer to ensure it
gets set correctly before the first draw.
2.5.15 WAIT_ON_AVAIL_BUFFER
This packet is inserted by the driver into the draw command buffer when, and only when, it inserts the
SET_CE_DE_COUNTERS packet into the constant command buffer. This indicates that the following indirect
buffer should switch to using the other ping-pong buffer, but it must wait until the buffer is available. The DE is
not allowed to get ahead of the CE. The CE forwards both of its flags to the DE. The algorithm for the DE is as
follows:
1 0 1 1 (Invalid #1)
1 1 0 0 CE is working both buffers, but the DE has not yet even processed the
WAIT_ON_AVAIL_BUFFER for the first buffer.
1 1 0 1 CE and DE are both active on buffer 0. CE set buffer 1; when the DE receives
WAIT_ON_AVAIL_BUFFER it will set DE1.
1 1 1 0 CE and DE are both active on buffer 1. CE set buffer 0; when the DE receives
WAIT_ON_AVAIL_BUFFER it will set DE0.
1 1 1 1 CE and DE are both active on buffer 0.
CE and DE are both active on buffer 1.
2.5.16 WAIT_ON_CE_COUNTER
Instructs the ME to wait on CE Counter (Write Confirm Constant Set Counts) to be greater than zero.
2.6.2 DISPATCH_INDIRECT
Dispatches a compute job using the parameters fetched from memory.
// At the specified offset, the following data members will be in this order.
struct GroupDimensions
{
UINT DIM_X;
UINT DIM_Y;
UINT DIM_Z;
};
Before sending a COND_EXEC packet, the driver allocates a memory location and sets it to 0x00000001. It will
clear it to zero if it needs the CP to stop and perform the command in the packet the next time the CP encounters a
COND_EXEC packet.
Note: Care must be taken to make certain that EXEC_COUNT contains the exact number of DWords for the
subsequent packets that are to be predicated if the Boolean value is zero. The CP will start parsing the DWord
immediately following EXEC_COUNT DWords. If this is not a packet header, the device will encounter corruption
or hang.
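The note above on EXEC_COUNT can be illustrated with a small model of where parsing resumes. This is a sketch under the assumption, taken from the note, that the EXEC_COUNT DWords are skipped when the Boolean value is zero; the function name and the use of a DWord index into the command buffer are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the DWord index at which parsing resumes after a COND_EXEC packet.
 * `next_dw` is the index of the DWord immediately following the packet.
 * Assumption: a zero Boolean skips exactly EXEC_COUNT DWords.
 * Hypothetical model for illustration only. */
static size_t cond_exec_resume_dw(size_t next_dw, uint32_t bool_value,
                                  uint32_t exec_count, size_t buffer_dws)
{
    size_t resume = (bool_value == 0u) ? next_dw + exec_count : next_dw;
    assert(resume <= buffer_dws);   /* must not skip past the buffer end */
    return resume;
}
```

The resume index must land on a packet header, which is why EXEC_COUNT has to match the size of the predicated packets exactly.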
2.7.2 COND_WRITE
The CP reads either a memory or a register location (indicated by POLL_SPACE) and tests the polled value with the
reference value provided in the command packet. The test is qualified by both the specified function and mask. If the
test passes, the write occurs to either a register or memory depending on WRITE_SPACE. If the test fails, the CP
skips the write. In either case, the CP then continues parsing the command stream.
2.7.3 PRED_EXEC
Functionality: Perform a predicated execution of a sequence of packets (type 0, type 2, and type 3) on select devices.
Notes: The ME_INITIALIZE packet includes a GPU unique Device ID. Care must be taken to make certain that
EXEC_COUNT contains the exact number of DWords for the subsequent packets that are to be predicated. The
CP.PFP will start parsing the DWord immediately following EXEC_COUNT DWords. If this is not a packet header,
the device will encounter corruption or hang.
2.7.4 SET_PREDICATION
The SET_PREDICATION packet provides a single flexible packet for the driver to specify the type of predication
check for previous events: ZPASS, PRIMCOUNT, etc.
2.8 Synchronization
2.8.1 ATOMIC
Sent only in the DE command buffer to request that the CP does either a single atomic operation or an atomic loop.
All supported atomics can perform a single atomic operation, and only CMPSWP atomics support loop CMD. The
CP will wait until the preop value is returned and place it into the CP_ATOMIC_PREOP_LO, and for 64 bit atomics
the CP_ATOMIC_PREOP_HI register before proceeding to the next packet or next loop. For CMPSWP atomics
with CMD = Loop, the CP takes the preop RTN value and compares it to the packet compare value
(CMP_DATA_*); if they are equal the packet ends, otherwise it loops again. Subsequent packets may operate on the value
returned. For context switching this register needs to be saved and restored. Only opcodes that return preop values
are supported for this packet.
Combinations
OP CMD ADDR_LO/HI SRC_DATA CMP_DATA LOOP_INTERVAL
Non-CMPSWP 0 yes yes n/a (0x0) n/a (0x0)
CMPSWP 0 yes yes n/a (0x0) n/a (0x0)
CMPSWP 1 yes yes yes yes
2.8.2 ATOMIC_GDS
The purpose of this packet is to support Atomic operations in the GDS from the CP.
If the Atomic Op returns pre-op source data, the CP will read the data and store it in the CP_GDS_ATOMIC*
registers. The driver must indicate that the CP needs to do this by setting the “ATOM_READ” and
“ATOM_RD_CNTL” control bits in the packet. Reads take a long time to complete, therefore the
“ATOM_RD_CNTL” bits allow the driver to optimize for size (32-bits vs. 64-bits) and number of return values (1
or 2). The COPY_DATA packet can be used to “copy” the read return data to various destinations (see
COPY_DATA packet for more details).
The GDS supports Compare-Swap Atomic operations. For these ops, the compare data is placed in the
ATOM_SRC0* ordinals and the source data is placed in the ATOM_SRC1* ordinals. If the CP should repeat the
compare-swap operation until it passes, then the “ATOM_CMP_SWAP” control bit should be set. If the CP does
not need to repeat until it passes, then it should not be set. Whenever the “ATOM_CMP_SWAP” control bit is set,
the “ATOM_READ” control bit should also be set and the “ATOM_RD_CNTL” bit should be set equal to either ‘0’
or ‘2’.
If the Atomic Op does not return pre-op source data and the driver wants confirmation that the Atomic Op has
completed, it must set the ATOM_COMPLETE control bit. The ATOM_COMPLETE and ATOM_READ bits
should never both be set. Setting neither of the bits is also valid.
The GDS_ATOM_SRC0 triggers the Atomic operation in the GDS and the microcode will therefore write it last.
Any fields not used by the Atomic operation specified can be set to 0.
[7:6] Reserved.
[8] DMODE – controls flushing of denorms.
4 ATOM_BASE [15:0] ATOM_BASE – See byte granularity GDS_ATOM_BASE register for more
details.
Base address for Atomic operation relative to the GDS partition base. See the
SET_BASE packet for details on setting the GDS partition bases.
5 ATOM_SIZE [15:0] ATOM_SIZE – See GDS_ATOM_SIZE register for more details.
Size in bytes of the DS memory. Determines where clamping begins.
6 ATOM_OFFSET0 [7:0] ATOM_OFFSET0 – See GDS_ATOM_OFFSET0 register for more details.
Used to calculate the address of the corresponding source operation.
ATOM_OFFSET1 [23:16] ATOM_OFFSET1 – See GDS_ATOM_OFFSET1 register for more details.
Used to calculate the address of the corresponding source operation.
7 ATOM_DST [31:0] ATOM_DST – See GDS_ATOM_DST register for more details.
DS Memory address to perform the Atomic operation.
8 ATOM_SRC0 [31:0] ATOM_SRC0 – See GDS_ATOM_SRC0 register for more details.
Lower 32-bits of the atomic source0 data for non compare-swap atomic ops.
Lower 32-bits of the atomic compare data for compare-swap atomic ops.
9 ATOM_SRC0_U [31:0] ATOM_SRC0_U – See GDS_ATOM_SRC0_U register for more details.
Upper 32-bits of the atomic source0 data for non compare-swap atomic ops.
Upper 32-bits of the atomic compare data for compare-swap atomic ops.
10 ATOM_SRC1 [31:0] ATOM_SRC1 – See GDS_ATOM_SRC1 register for more details.
Lower 32-bits of the atomic source1 data and source data for compare-swap atomic
ops.
11 ATOM_SRC1_U [31:0] ATOM_SRC1_U – See GDS_ATOM_SRC1_U register for more details.
Upper 32-bits of the atomic source1 data and source data for compare-swap atomic
ops.
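The control-bit rules stated in the ATOMIC_GDS description can be collected into a small consistency check, sketched below. The structure and function names are hypothetical and the bit positions within the packet are deliberately not modelled; only the relationships between ATOM_READ, ATOM_COMPLETE, ATOM_CMP_SWAP and ATOM_RD_CNTL described in the text are encoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the ATOMIC_GDS control bits. */
struct gds_atomic_ctrl {
    bool    atom_read;
    bool    atom_complete;
    bool    atom_cmp_swap;
    uint8_t atom_rd_cntl;
};

/* Returns true if the settings follow the rules in the text:
 *  - ATOM_COMPLETE and ATOM_READ are never both set (neither is fine);
 *  - when ATOM_CMP_SWAP is set, ATOM_READ is set and ATOM_RD_CNTL is 0 or 2. */
static bool gds_atomic_ctrl_ok(const struct gds_atomic_ctrl *c)
{
    assert(c != NULL);
    if (c->atom_read && c->atom_complete)
        return false;
    if (c->atom_cmp_swap) {
        if (!c->atom_read)
            return false;
        if (c->atom_rd_cntl != 0 && c->atom_rd_cntl != 2)
            return false;
    }
    return true;
}
```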
2.8.3 EVENT_WRITE
This packet is used when the driver wants to create a non-TimeStamp/Fence event. See EVENT_WRITE_EOP to
send timestamps and fences. The EVENT_WRITE supports two categories of events. Those are:
4 DW (DW) event where special handling is required: ZPASS, SAMPLE_PIPELINESTATS,
SAMPLE_STREAMOUTSTATS[,1,2,3].
2 DW (DW) event where no special handling is required; CP just writes EVENT_TYPE (bits[5:0] of DW 2
from the packet) into VGT_EVENT_INITIATOR register and DWs 3 and 4 do not exist, i.e., the packet is
only 2 DWs for these events. These include all other events.
When the EVENT_INDEX is set to ‘0111’ for the CACHE_FLUSH* events, there is also an option to invalidate the
TC’s L2 cache: INV_L2.
EVENT_WRITE Packet Description
DW Field Name Description
1 HEADER Header of the packet
2 EVENT_CNTL INV_L2[20]
Send WBINVL2 op to the TC L2 cache when EVENT_INDEX = 0111.
EVENT_INDEX[11:8]
0000: Any non-Time Stamp/non-Fence/non-Trap EVENT_TYPE not listed.
0001: ZPASS_DONE
0010: SAMPLE_PIPELINESTAT
0011: SAMPLE_STREAMOUTSTAT[S|S1|S2|S3]
0100: [CS|VS|PS]_PARTIAL_FLUSH
0101: Reserved for EVENT_WRITE_EOP time stamp/fence event types
0110: Reserved for EVENT_WRITE_EOS packet
0111: CACHE_FLUSH, CACHE_FLUSH_AND_INV_EVENT
1000 - 1111: Reserved.
EVENT_TYPE[5:0]
The CP writes this value to the VGT_EVENT_INITIATOR register for the assigned
context.
3 ADDRESS_LO ADDRESS_LO[31:3]
Lower bits of QWORD-Aligned Address. [2:0] - Reserved & must be programmed to
zero. Driver should only supply this DW for Sample_PipelineStats,
Sample_StreamoutStats, and Zpass (Occlusion).
4 ADDRESS_HI ADDRESS_HI[15:0]
Upper bits of Address [47:32] Driver should only supply this DW for
Sample_PipelineStats, Sample_StreamoutStats, and Zpass (Occlusion).
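The field positions in the EVENT_WRITE table above can be combined as in the following C sketch. It packs only DW 2 (EVENT_CNTL) and splits a 48-bit QWORD-aligned address into DWs 3 and 4; the header DW is not covered here. The function names are hypothetical, and the event type values passed in are the caller's responsibility.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Packs EVENT_CNTL: EVENT_TYPE[5:0], EVENT_INDEX[11:8], INV_L2[20]. */
static uint32_t event_write_cntl(uint32_t event_type, uint32_t event_index,
                                 bool inv_l2)
{
    assert(event_type <= 0x3Fu);
    assert(event_index <= 0xFu);
    return event_type | (event_index << 8) | ((inv_l2 ? 1u : 0u) << 20);
}

/* Splits a QWORD-aligned 48-bit address into ADDRESS_LO[31:3] (low three
 * bits zero) and ADDRESS_HI[15:0] (address bits [47:32]). */
static void event_write_address(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    assert((addr & 0x7u) == 0);    /* bits [2:0] must be zero */
    assert((addr >> 48) == 0);     /* only 48 address bits are carried */
    *lo = (uint32_t)(addr & 0xFFFFFFF8u);
    *hi = (uint32_t)((addr >> 32) & 0xFFFFu);
}
```

Only the 4 DW events (ZPASS, SAMPLE_PIPELINESTATS, SAMPLE_STREAMOUTSTATS) carry the address DWs; for all other events the packet ends after EVENT_CNTL.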
2.8.4 EVENT_WRITE_EOP
The EVENT_WRITE_EOP packet is used when the driver wants to create any end-of-pipe event. TS used below is
historical and indicates either fence data, trap or actual timestamp will be written back. Supported Events are:
Cache Flush TS: provides the driver with a pipelined fence/timestamp indicating that the CBs and DBs
have completed flushing their caches.
Cache Flush And Inval TS: same as above but the CBs and DBs also invalidate their caches before sending
the pulse back to the CP.
Bottom Of Pipe TS: provides the driver with a pipelined timestamp indicating that the CBs and DBs have
completed all work before the time stamp. This can be considered a read EOP event in that all reads have
occurred but the CBs/DBs have not written out all the data in their caches.
Use the EVENT_WRITE packet for all others. Supported actions when requested event has completed are:
Timestamps - 64-bit global GPU clock counter value or CP_PERFCOUNTER_HI/LO, either with optional
interrupt.
Fences - 32 or 64 bit embedded data in the packet with optional interrupt. The privileged vs. unprivileged
designation is based on the privilege level of the DMA buffer that included the EVENT_WRITE_EOP
packet, not anything to do with the packet itself.
Traps (interrupt only).
There is also an option to invalidate the TC’s L2 cache: INV_L2.
EVENT_WRITE_EOP Packet Description
DW Field Name Description
1 HEADER Header of the packet
2 EVENT_CNTL INV_L2[20]
Send WBINVL2 op to the TC L2 cache.
EVENT_INDEX[11:8]
0000 - 0100: Reserved for EVENT_WRITE packet.
2.8.5 EVENT_WRITE_EOS
The EVENT_WRITE_EOS packet is used when the driver wants to create any end-of-shader event (end of CS or
end of PS). Supported Events are CS Done and PS Done.
When the CMD = 001, the CP will copy SIZE dwords starting from the partition base (see GDS_*_INDEX as
described in SET_BASE packet for more details) plus GDS_INDEX to the memory address specified. The
corresponding partition is determined from the Ring to which the packet is submitted.
2.8.6 MEM_SEMAPHORE
The MEM_SEMAPHORE packet supports Signal and Wait Semaphores. Wait Semaphores are executed at the top
of pipe (CP) and Signal Semaphores are executed at the bottom of pipe (after whatever work before them has been
completed). If the CP processes a Wait Semaphore, there could be a Signal Semaphore still in the Gfx pipe behind
draws still being rendered.
2.8.7 OCCLUSION_QUERY
The motivation for this packet is to allow the application to access the accumulated query counts from the shader.
Before this packet, the application had to do the query at the driver level.
Available Controls
1. Specify the 4-byte aligned 40-bit MC address where the current 64-bit accumulation value is stored.
2. Specify the 16-byte aligned 40-bit MC starting address where a set of eight DB Zpass count pairs are
stored.
1. The Driver must initialize Begin and End occlusion data for DBs that do exist to 0x00000000 and for those
that don’t exist to 0x80000000.
2. The Driver must initialize AccumCnt to zero before the OCCLUSION_QUERY packet is sent to the GPU.
3. The CP will keep reading the DB ZPASS occlusion data until they are all valid.
4. The CP will write-confirm the final accumulated value before proceeding to the next packet.
2.8.8 PFP_SYNC_ME
This packet is inserted by the driver when it needs the PFP to stall or wait until the ME is synced up to the
PFP.
2.8.9 STRMOUT_BUFFER_UPDATE
The STRMOUT_BUFFER_UPDATE packet is expected to be used in a variety of streamout scenarios. When a
streamout operation spreads across two command buffers, the driver needs to ensure BufferFilledSize is captured for
each streamout buffer at the end of the first command buffer and restart streamout buffers with the captured values
in the next command buffer.
2.8.10 SURFACE_SYNC
The SURFACE_SYNC packet will allow the driver to place the surface sync commands as one atomic packet.
2.8.11 WAIT_REG_MEM
The WAIT_REG_MEM packet can be processed by either the CP.PFP or the CP.ME, as indicated by the ENGINE
field. Zero was chosen for the ME for backward compatibility. The CP.PFP is limited to polling a memory location,
whereas the ME can be programmed to poll either a memory location or a register (indicated by MEM_SPACE). The
polled value is then tested against the reference value given in the command packet. The test is qualified by both the
specified function and mask. If the test passes, parsing continues. If it fails, the CP waits for Poll_Interval *
16 clocks, then tests the poll address again.
Note: The driver should always insert a packet that re-programs the CP_WAIT_REG_MEM_TIMEOUT register to
the expected value before submitting the WAIT_REG_MEM packet.
- 0=ME ,
- 1=PFP
[7:5] - Reserved
MEM_SPACE [4] - MEM_SPACE:
- 0=Register,
- 1=Memory. If ENGINE == PFP, only Memory is valid.
[3] - Reserved
FUNCTION [2:0] - FUNCTION
- 000 - Always (Compare Passes). Still does read operation and waits for read
results to come back.
- 001 - Less Than (<) the Reference Value.
- 010 - Less Than or Equal (<=) to the Reference Value.
- 011 - Equal (==) to the Reference Value.
- 100 - Not Equal (!=) to the Reference Value.
- 101 - Greater Than or Equal (>=) to the Reference Value.
- 110 - Greater Than (>) the Reference Value.
- 111 - Reserved.
If ENGINE==PFP, only 101/Greater Than or Equal is valid.
3 POLL_ADDRESS_LO Lower portion of the address to poll. If the address is a memory location, then bits
[31:2] specify the lower bits of the address and
[1:0] specify SWAP used for memory read. If the address is a memory-mapped
register, then bits [15:0] are the DWord memory-mapped register address that the
CP will read.
4 POLL_ADDRESS_HI Higher portion of the address to poll. If the address is a memory location, then bits
[15:0] specify bits 47:32 of the address. If the address is a memory-mapped
register, then this DW is a don't care.
5 REFERENCE [31:0] - Reference Value.
6 MASK [31:0] - Mask for Comparison.
7 POLL_INTERVAL [15:0] - Poll_Interval: Interval to wait between the time an unsuccessful polling
result is returned and a new poll is issued. Time between these is
16*Poll_Interval clocks. The minimum value is 0x04. A value less than 0x04 will
be forced to 0x04.
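The FUNCTION encodings and the POLL_INTERVAL clamping listed above can be modelled as in the C sketch below. It assumes, as one plausible reading of "qualified by both the specified function and mask", that the mask is ANDed with the polled value before the comparison against the reference value; that detail and the function names are assumptions, not statements from the packet definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Evaluates one poll result. Assumption: (polled & mask) is compared
 * against the reference value using the FUNCTION[2:0] encoding. */
static bool wait_reg_mem_test(uint32_t polled, uint32_t reference,
                              uint32_t mask, unsigned function)
{
    uint32_t v = polled & mask;
    assert(function <= 7u);
    switch (function) {
    case 0: return true;              /* 000: always passes */
    case 1: return v <  reference;    /* 001: less than */
    case 2: return v <= reference;    /* 010: less than or equal */
    case 3: return v == reference;    /* 011: equal */
    case 4: return v != reference;    /* 100: not equal */
    case 5: return v >= reference;    /* 101: greater than or equal */
    case 6: return v >  reference;    /* 110: greater than */
    default: return false;            /* 111: reserved, treated as fail here */
    }
}

/* Clocks between an unsuccessful poll and the next one:
 * 16 * Poll_Interval, with values below 0x04 forced to 0x04. */
static uint32_t wait_reg_mem_poll_clocks(uint32_t poll_interval)
{
    poll_interval &= 0xFFFFu;         /* POLL_INTERVAL is bits [15:0] */
    if (poll_interval < 0x04u)
        poll_interval = 0x04u;
    return 16u * poll_interval;
}
```

With this model, a Poll_Interval of 0 still yields a 64-clock wait because of the forced minimum of 0x04.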
2.9.2 COPY_DATA
The purpose of this packet is to provide a generic and flexible way for the CP to copy data by reading it from any
source and writing it to any destination to which it has access. When applicable, it can copy either 32-bits or 64-bits
of data. All read and write addresses auto-increment for 64-bit operations. The write to destination phase can stall
the CP until the write has completed by setting the write confirm bit. The CP can efficiently copy multiple DWs to
and from any combination of memory, registers or GDS.
Support Tables
Engine: ME
N/A
Mem: wait for wc
Engine: PFP
Engine: CE
2.9.3 WRITE_GDS_RAM
This packet writes the embedded immediate data into the GDS starting at the indexed offset from the partition base
(see GDS_*_INDEX as described in SET_BASE packet for more details). The corresponding partition is
determined from the Shader_Type bit in the header along with the Ring to which the packet is submitted.
2.9.4 WRITE_DATA
The purpose of this packet is to provide a generic and flexible way for the CP to write N Dwords of data to any
destination to which it has access. As applicable, the writes can be sent from the CE, PFP, ME or DE (Dispatch
Engine). The CE (and PFP) are limited to the GRBM and MC as destinations. Optionally, the writes can be
“confirmed” before continuing.
Mem: wait for wc 0 or 1 Mem sync, Low addr bits Hi addr bits
Mem: wait for wc 0 or 1 Mem async Low addr bits Hi addr bits
DE Reg: do read 0 or 1 Reg, Reg address -
Mem: wait for wc 0 or 1 Mem sync, Low addr bits Hi addr bits
TC/L2: wait for wc 0 or 1 TC/L2, Low addr bits Hi addr bits
GDS: do read 0 or 1 GDS, GDS offset -
Mem: wait for wc 0 or 1 Mem async Low addr bits Hi addr bits
Notes: “wc”: Write Confirm, Ack, or acknowledge from the destination.
2.9.5 NOP
Skip a number of DWords to get to the next packet.