Tracking Issue for NVPTX shared memory #135516
Comments
@rustbot label O-nvptx |
One of the concerns here is that we likely will want to consider an API that is amenable to being used on both amdgpu and ptx targets. The concept is identical to both, as far as I am aware. In that sense, the main issue with simply using inline assembly or intrinsics is that it might be too architecture-specific for what we want. |
In my crate rust-cuda I have wrapper types for both static (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/static.rs) and dynamic (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/slice.rs) shared memory, perhaps some of its Assembly hacks might be useful :) |
@workingjubilee Does that comment apply to the rest of `core::arch::nvptx` as well? |
Hmm. No, I don't think it applies to the rest of the architectural intrinsics, because many of them encode more specific ideas about how the GPU's execution is organized. Shared memory, by contrast, is the more common concept of "there are areas of memory that threads 0..N can access but threads N..N*2 cannot (because they have their own area)", and it may also require more special consideration for how it is elaborated relative to our opsem. |
There are quite a few intrinsics that exist on both nvptx and amdgpu (and surely also intel). After all, they are built to run the same applications (thinking of Vulkan, DirectX, OpenCL, etc.). 🙂 I guess it would be nice to have the “common” intrinsics available in a target-independent form. Of course, then the question comes up of what the naming scheme should be 😉
I lean towards open standards, which would be the Khronos nomenclature. |
Sorry, I misspoke: the reason not to pursue those immediately in this issue is that intrinsics which are not directly about shared memory would be yet another topic. |
It's nigh impossible to use shared memory (or write any useful kernel) without the concept of thread indices, which are in `core::arch::nvptx`. Note that operational semantics (e.g., with respect to forward progress) differ between vendors. Suppose one writes a kernel using … |
I agree that an ambition for a future user-friendly, safe way to deal with shared memory across GPU targets (one that ideally would also be consistent on targets without thread groups) should not stand in the way of exposing nvptx intrinsics related to shared memory. I think there's also precedent for this already, as in how … |
I don't think anyone relying on forward progress guarantees in a language without forward progress guarantees in its operational semantics is relying on anything to do with Rust's operational semantics. |
Regarding implementing shared memory for nvptx: LLVM exposes it as address space 3 (LLVM nvptx docs: https://llvm.org/docs/NVPTXUsage.html#address-spaces). For static shared allocations, we can declare a global in LLVM IR:

```llvm
%MyStruct = type { i32, i32, float }
@Gbl = internal addrspace(3) global [1024 x %MyStruct] zeroinitializer
```

Results in e.g. this nvptx (compiled with llc):
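A sketch of the kind of PTX to expect for this global, following the `.shared` directive syntax that appears later in this thread; the exact alignment and symbol naming are assumptions:

```ptx
// [1024 x %MyStruct] is 1024 * 12 bytes, lowered as an opaque byte array
.shared .align 4 .b8 Gbl[12288];
```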
For dynamic shared allocations, we declare the global as external:

```llvm
%MyStruct = type { i32, i32, float }
@ExternGbl = external addrspace(3) global [1024 x %MyStruct]
```

Resulting nvptx:
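Again only a sketch of the expected shape, assuming the external global is lowered to an `.extern .shared` byte array (whether llc also emits an explicit size for the declared type is an assumption):

```ptx
.extern .shared .align 4 .b8 ExternGbl[];
```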
Test in LLVM that uses it: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/NVPTX/globals_lowering.ll |
If getting the pointer is implemented as a separate intrinsic, I think it could work like this:

```rust
// Declare an intrinsic
#[rustc_nounwind]
#[rustc_intrinsic]
#[rustc_intrinsic_must_be_overridden]
pub fn _dynamic_smem() -> *mut u8 {
    unreachable!()
}
```

And the implementation for LLVM creates an external `addrspace(3)` global and returns a pointer to it. The thing I’m a little unhappy about is hard-coding the alignment. 16 seems ok for most things, but I guess many use-cases would also be fine with a smaller alignment. A different way that lets the user specify the alignment would be a generic intrinsic:

```rust
// Declare an intrinsic
#[rustc_nounwind]
#[rustc_intrinsic]
#[rustc_intrinsic_must_be_overridden]
pub fn _dynamic_smem<T>() -> *mut T {
    unreachable!()
}
```

And the implementation for LLVM creates an external `addrspace(3)` global with the alignment of `T`. For reference, because I wondered about this, the LLVM language reference says that external globals can be larger than their specified type, so dereferencing them outside the declared type is valid (as long as the definition/implementation is large enough).
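For illustration, a hypothetical caller of the generic variant could then choose the alignment through the type parameter; the helper below is made up and assumes the sketched intrinsic as-is:

```rust
// Ask for a 16-byte-aligned element type, then reinterpret the base as bytes.
// (u128 has 16-byte alignment on nvptx64, so the backing shared global would
// be emitted with align 16.)
fn dynamic_smem_16_aligned() -> *mut u8 {
    _dynamic_smem::<u128>() as *mut u8
}
```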
|
I wrote some code that implements an intrinsic for dynamic shared memory (Flakebi@026563a): `pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T`. A lot of cleanup is missing (docs, tests, …), but from a quick try it seems to work on amdgpu. I expect that it works on nvptx as well, though I don’t have the hardware to try that. |
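As a rough illustration of how such an intrinsic might be used from device code (hypothetical helper; the intrinsic name follows the prototype above, and the launch is assumed to provide at least one `f32` of dynamic shared memory per thread):

```rust
use core::arch::nvptx::{_syncthreads, _thread_idx_x};

// Stage one element per thread into dynamic shared memory for group-local work.
unsafe fn stage_into_shared(input: *const f32) -> *mut f32 {
    unsafe {
        let smem: *mut f32 = dynamic_shared_memory::<f32>();
        let tid = _thread_idx_x() as usize;
        *smem.add(tid) = *input.add(tid);
        _syncthreads();
        smem
    }
}
```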
It's great that you've been able to make progress here. I had the chance to test your solution with nvptx. Unfortunately, the created global identifier contains dots, which are not allowed in PTX. If I change the name of the global when invoking … |
I took the opportunity to extend your prototype and implement an intrinsic for static shared memory. Since the attributes necessary for intrinsics have also changed in the meantime, I performed a rebase onto the current main branch and resolved the conflicts. Additionally, I set … For testing with my NVIDIA GPU I am currently using this repository. Two of the kernels for matrix multiplication use static shared memory, the stencil and reduction kernels use dynamic shared memory. So far I have not encountered any problems. |
Nice! For static shared memory, I was thinking it makes more sense to declare them as static variables in Rust:

```rust
// Same in all executions (existing)
static GLOBAL_I: i32 = 0;

// Available only in the current thread (existing)
#[thread_local]
static THREAD_I: i32 = 0;

// Same for all threads in the current group (new)
#[group_shared]
static SHARED_I: i32 = 0;
```

There is a bit more to this, like currently neither nvptx nor amdgpu allows initializing shared variables, so we want to enforce this being uninitialized (e.g. `MaybeUninit`).
I guess a simple … would do. No matter which name we pick, one can always write a Rust program that creates a symbol collision, though only through the use of unsafe (since the 2024 edition):

```rust
#[unsafe(export_name = "$_dynamic_shared_memory_$")]
static CONFLICTING_SYMBOL: i32 = 0;
```
|
That is definitely valid. I was curious if static shared memory could be handled similarly to dynamic shared memory, but I agree that we already have precedent with the existing kinds of statics. To expose it to the user and also handle the `MaybeUninit`, we could have a macro:

```rust
group_shared! {
    static BUFFER: [i32; 256];
}
```

which expands to

```rust
#[group_shared]
static BUFFER: MaybeUninit<[i32; 256]> = MaybeUninit::uninit();
```

Another way would be to let the attribute do the type transformation. However, I do not know of any other attribute that does something like this, so it doesn't feel right to me.
Yes, it was the intention to prevent naming conflicts, but I didn't think about `export_name`. |
I think these would need to be … But also, it's typical to have several kernels in the same module. Would these static declarations only be allowed within an …? I had envisioned just having an intrinsic return … |
|
So …? Atomics are nontrivial overhead and don't apply to general types. |
A quick test with `ptxas -arch sm_61 -v -ias` on the following PTX

```ptx
.version 5.0
.target sm_61
.address_size 64

.shared .align 4 .b8 static_shared_mem[2048];

.visible .entry no_shared_kernel()
{
    ret;
}
.visible .entry shared_kernel_1()
{
    .reg .b64 %rd<1>;
    cvta.shared.u64 %rd0, static_shared_mem;
    ret;
}
.visible .entry shared_kernel_2()
{
    .reg .b64 %rd<1>;
    cvta.shared.u64 %rd0, static_shared_mem;
    ret;
}
```

shows that only …
Hm, yes the user would probably need to create a pointer anyway, so it would not change much, except being a little more descriptive in the first place. A signature like `fn alloc_static_smem<T: Sized>() -> *mut T;` or something similar. I also think it would be easier to implement. My static shared memory implementation, which is based on Flakebi's dynamic shared memory implementation, is already basically working. |
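For illustration only, a hypothetical helper built on that proposed signature (the tile size is made up, and the semantics of the intrinsic are assumed, not settled):

```rust
// A 16x16 f32 tile in static shared memory, one allocation shared by the
// whole thread group.
const TILE: usize = 16;

unsafe fn shared_tile() -> *mut [[f32; TILE]; TILE] {
    alloc_static_smem::<[[f32; TILE]; TILE]>()
}
```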
Feature gate:
#![feature(stdarch_nvptx)]
(probably; see #111199)

This is a tracking issue for access to shared memory from NVPTX device code.
There are two flavors of shared memory in NVPTX (see this NVIDIA Technical Blog (2013) for a friendly introduction):

- Static shared memory, whose size is fixed at compile time (declared in CUDA C++ with the `__shared__` attribute).
- Dynamic shared memory, whose size is chosen at kernel launch time (declared in CUDA C++ as `extern __shared__`).

In practice, the required amount of shared memory for GPU algorithms will depend on the block size. Since many domains expect to choose the block size at run-time, static shared memory is often considered to be of limited use.
Public API
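A rough sketch of the kind of API under discussion, inferred from the unresolved questions below; the name and exact signature are not settled:

```rust
// Sketch only: whether the size is returned alongside the pointer or via a
// separate intrinsic is one of the unresolved questions.
pub unsafe fn _dynamic_smem() -> (*mut u8, u32) {
    unimplemented!("provided by the compiler backend for NVPTX targets")
}
```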
Note that other `core::arch::nvptx` intrinsics are covered by #111199.

A possible implementation for dynamic shared memory
Shared memory can be exposed using inline assembly (reference).
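A minimal sketch of that approach, assuming the `_shared_data` symbol referenced in the unresolved questions below and the `.extern .shared` / `cvta.shared` directives discussed earlier in the thread; the operand constraints and the placement of the extern declaration inside the asm block are untested assumptions, and inline asm on nvptx64 is itself still unstable:

```rust
use core::arch::asm;

// Returns a generic-address-space pointer to the base of the dynamic shared
// memory allocation provided at kernel launch.
pub unsafe fn dynamic_smem_base() -> *mut u8 {
    let ptr: u64;
    unsafe {
        asm!(
            // Declare the dynamically sized shared allocation...
            ".extern .shared .align 16 .b8 _shared_data[];",
            // ...and convert its shared-space address to a generic address.
            "cvta.shared.u64 {ptr}, _shared_data;",
            ptr = out(reg64) ptr,
            options(nostack),
        );
    }
    ptr as *mut u8
}
```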
Steps / History
Unresolved Questions
- Should a single intrinsic return both the base pointer and the size, e.g. `_dynamic_smem() -> (*mut u8, u32)`, or should we have two separate intrinsics for accessing the base pointer and the size?
- Should we use `usize` for the length even though NVPTX does not (yet, and maybe ever) support that? I'm not sure about other vendors, if we're looking for the intrinsics to be as close as possible.
- Inline assembly versus a compiler intrinsic: LLVM already exposes the size as `llvm.nvvm.read.ptx.sreg.dynamic_smem_size`. That could in principle allow accessing the shared memory base pointer without introducing a new symbol (`_shared_data` above). I lean toward using the inline assembly for now and possibly migrating it later, but no other inline assembly is used in `core::arch::nvptx` and I don't know if it should be kept that way.
- A safe, ergonomic API is not covered here under the `stdarch_nvptx` feature gate, since it would need to verify launch parameters on the host (better user experience) or handle failure on the device (i.e., prevent UB if an insufficient amount of shared memory is provided). I prefer to handle only the unsafe primitives here since there are many open questions about such ergonomics.
- Whether these intrinsics should land under the existing `stdarch_nvptx` feature gate.

Footnotes
https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html ↩