
Rust-CUDA is being rebooted! #130

Open
LegNeato opened this issue Jan 27, 2025 · 38 comments
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · question (Further information is requested)

Comments

@LegNeato
Contributor

LegNeato commented Jan 27, 2025

See https://rust-gpu.github.io/blog/2025/01/27/rust-cuda-reboot.

@RDambrosio016 has made me a new maintainer (I'm also a maintainer of rust-gpu).

Please comment here if you would like to be involved, or better yet put up some PRs or direct me to what needs to be done from your perspective!

LegNeato added the enhancement, help wanted, and question labels on Jan 27, 2025
@AnubhabB

This is exciting! I'd love to contribute (if I can). Any areas to dig deep into? That would probably help with figuring out starting points and whether I'd be capable enough to contribute at all!

In any case .. cheers .. will closely follow how this evolves!

@LegNeato
Contributor Author

I'm still orienting myself as to the current state. I'd start with just trying to get the examples running on your machine and see if you hit anything I do not! Thank you so much for (potentially) helping. 🍻

@David-OConnor

David-OConnor commented Jan 27, 2025

The big thing is to make it work. Try it on a few different machines (OS, GPUs, CUDA versions, etc.), and make it work on modern rustc and CUDA versions without errors. I switched to Cudarc because that is in a working state, and this isn't.

Dropping support for older versions of CUDA is fine if that makes it easier.

@apriori

apriori commented Jan 27, 2025

The big thing is to make it work. Try it on a few different machines (OS, GPUs, CUDA versions, etc.), and make it work on modern rustc and CUDA versions without errors. I switched to Cudarc because that is in a working state, and this isn't.

That will be quite some work. rustc has changed significantly, and so has libNVVM.

@LegNeato As you are a maintainer of rust-gpu, I would be curious to know what in the end led you to rust-cuda. AFAIK rust-gpu did not enter the compute kernel area much.

@LegNeato
Contributor Author

@apriori Actually, Rust-GPU does have pretty good support for Vulkan compute! It's just that Embark and most current contributors are focused on graphics use-cases. I personally care more about GPGPU.

What led me here is that I see a lot of opportunity and overlap between the two projects. As an end user writing GPU code in Rust, what I really want is to not care about Vulkan vs CUDA as the output target at all, similar to how I don't care about Linux vs Windows when writing CPU Rust (or arm vs x86_64 for that matter). Of course, we also need to expose platform-specific stuff for those wanting to get the most out of their hardware or ecosystem (similar to how Rust on the CPU exposes platform-specific APIs or ISA-specific escape hatches), but the progressive disclosure of complexity is key.

This wasn't going to happen as two completely separate projects that only peek over the fence occasionally, or with rust-cuda no longer being developed. So I am involved in both and can hopefully bring them together where they differ only for difference's sake.

@txbm

txbm commented Jan 27, 2025

Will contribute

@buk0vec

buk0vec commented Jan 27, 2025

Would definitely love to help out, I think this is a really cool project

@Schmiedium
Contributor

I'd definitely like to contribute and get involved if I can. I'm currently a Master's student at Georgia Tech taking a parallel algorithms course this semester. I have a few different machines, cards, and operating systems I can try the current iteration on to see what issues pop up.

@LegNeato
Contributor Author

@Schmiedium Awesome! I think one thing everyone is going to hit is that we are on a super old version of Rust, and cargo automatically upgrading dependency versions will cause issues. I'm trying to untangle that a bit currently.

@mooreniemi

mooreniemi commented Jan 28, 2025

I'm noticing this too @LegNeato - do you have a branch going or no?

@danielglin

I'm a Rust beginner without any GPU programming experience, but I'd love to learn and help out where I can.

@Schmiedium
Contributor

Schmiedium commented Jan 28, 2025

So I had some time to play around with it. I'm running into two main issues, and they seem to be Windows-specific. This is on Windows 10 with a 2080 Ti, CUDA 12.8, and OptiX 8.1.

The first issue, which I think is on me, has to do with NVIDIA OptiX. I'm probably just having trouble getting it set up correctly, but the OptiX examples fail to compile with an error that the OPTIX_ROOT_DIR or OPTIX_ROOT env variable isn't found. This points to the FindCUDAHelper looking for environment variables, but even with those set it still fails.

The second is ntapi. It looks like ntapi 0.3.7 includes code that is no longer valid Rust. This issue seems to have first cropped up in 2022 and was fixed. You can see the issue here. I guess one of the dependencies somewhere in this project's dependency tree may be using that version, causing the build error. I haven't yet been able to look into where it's being brought in, so I'm not sure how difficult that would be to fix.

I should be able to try this out on NixOS tomorrow with the same hardware, so I'll check in if I find anything there.

One more comment, not an issue per se: as of right now this project still requires nightly Rust to build because it uses #![feature] attributes, so be aware of that as well.

I'd be interested to know what you guys find

@BurtonQin

Glad to hear Rust CUDA is making a comeback! I went through the cust portion of the project last year. Since I work with both Rust and CUDA regularly and have experience with cudarc, I’d love to contribute. I also have an NVIDIA 4090 that could be useful for testing. Once the roadmap and contribution guidelines are ready, count me in to help out!

@ctrl-z-9000-times

ctrl-z-9000-times commented Jan 30, 2025

Hello, I've been trying to use cust for the past few weeks and I have some *ideas* for how the library could be improved. I think now, if ever, would be a good time to break compatibility to polish the existing API.

In particular: some of the flags are useless, and none of them implement the Default trait (see the sketch after this list).

  • StreamWaitEventFlags does nothing, because the underlying CUDA function (cudaStreamWaitEvent) doesn't use the flags (future proofing, I guess?). I would remove this flags type from the API; in the future, if you need to add options, you can add another function with a distinct name (e.g. stream.wait_event_foobar).
  • cust::init(cust::CudaFlags): same story as StreamWaitEventFlags.
  • StreamFlags does nothing because of an unsoundness issue ("StreamFlags::NON_BLOCKING is unsound because of fringe asynchronous memory copy behavior in CUDA", #15). We should decide how we're going to resolve this issue. This project can either accept the memory-safety hazard, or prohibit the option and lose those features. IMO we should prioritize fixing memory safety and document the potential inefficiency. Either way, the current API was left in a somewhat broken state.
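
A rough, hypothetical sketch of the kind of simplification I mean (the types here are stand-ins, not the real cust API):

```rust
// Stand-in types, not the real cust::stream / cust::event items.
pub struct Stream;
pub struct Event;
pub type CudaResult<T> = Result<T, Box<dyn std::error::Error>>;

// Any flags type that survives the cleanup would at least implement Default.
#[derive(Clone, Copy, Debug, Default)]
pub struct StreamFlags(u32);

impl Stream {
    // Hypothetical: no StreamWaitEventFlags parameter at all, since the
    // underlying CUDA call currently ignores its flags. If CUDA later adds
    // meaningful flags, a separately named method can carry them.
    pub fn wait_event(&self, _event: &Event) -> CudaResult<()> {
        // Would forward to cuStreamWaitEvent(stream, event, 0) internally.
        Ok(())
    }
}

fn main() -> CudaResult<()> {
    let stream = Stream;
    stream.wait_event(&Event)?;
    println!("default flags: {:?}", StreamFlags::default());
    Ok(())
}
```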

Edit: another potential compatibility break is issue #110.

I'm looking forward to seeing where this project will go!
Sincerely


P.P.S. Here's what I think you should do with the cust library (current version: 0.3.2).
Plan to do two releases:

  1. 0.3.3, a final patch release with any easy bug fixes that have accumulated over the past 3 years. Are any of the outstanding PRs worth the effort of a patch release?
  2. 0.4.0, which breaks compatibility.

@LegNeato, you should make a tracking issue to discuss what will be included in each release you plan to do.

@Schmiedium
Contributor

I got the rest of the issues with my environment resolved. The main thing not building right now is the nvvm_codegen. It looks like there was another issue in this repo for resolving that, so I can play around with that and see if I can get it to build.

I also agree with ctrl-z: if we want to do some redesign or break compatibility, now would be the best time.

@LegNeato
Contributor Author

Yep, open to breaking whatever; let's get to latest. The plan was to switch off NVVM and onto PTX directly, but after talking with NVIDIA I am not so sure that is the best way forward.

@LegNeato
Contributor Author

LegNeato commented Jan 31, 2025

@Schmiedium you might want to look at rust-gpu's forward porting, as it has to deal with similar issues. I plan to take a look later this week since I largely did the other forward port, but if you get time go for it (just comment or start a draft or issue so we don't duplicate) 😁

@devillove084

@LegNeato I'd like to share some observations on potential challenges with direct PTX usage and offer concrete ways I can try to help address them:

Key Challenges with PTX

  1. Toolchain Immaturity

    • Current PTX assembly workflows (e.g. ptxas integration) may lack Rust-friendly abstractions
    • Example: manual memory alignment directives are required for #[repr(C)] structs (a small illustration follows this list)
  2. Debugging Friction

    • No mature PTX-level debugger integrated with rust-gdb/LLDB
    • Crash analysis requires manual mapping between PTX instructions and Rust source
  3. Optimization Burden

    • Missing auto-vectorization equivalent to NVVM's -opt=3
    • Developers must manually insert PTX pragmas (e.g. .reqntid 256)
  4. Cross-Architecture Support

    • JIT compilation via GPU driver may conflict with Rust's ABI stability goals
    • Need per-SM versioned PTX bundles (e.g. sm_80 vs sm_90)
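
For the alignment point under (1), a small illustration of what that manual layout control looks like on the Rust side (the 16-byte figure is just an example, not a requirement of any particular tool):

```rust
// A struct shared between host code and a PTX kernel needs a layout both
// sides agree on; #[repr(C)] plus an explicit alignment makes that manual.
#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, Default)]
pub struct Particle {
    pub position: [f32; 3],
    pub mass: f32,
}

fn main() {
    // 12 bytes of position + 4 bytes of mass, aligned and sized to 16.
    assert_eq!(core::mem::size_of::<Particle>(), 16);
    assert_eq!(core::mem::align_of::<Particle>(), 16);
    println!("{:?}", Particle::default());
}
```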

@LegNeato
Contributor Author

Great info! The topic came up because it was mentioned by @RDambrosio016 in #98 (comment), and @kjetilkjeka is actively working on / using the nvptx backend in rustc, so it is worth exploring the tradeoffs.

@jorge-ortega
Collaborator

The second is ntapi. It looks like ntapi 0.3.7 includes code that is no longer valid Rust. ... I haven't yet been able to look into where that's being brought in, so not sure how difficult that would be to fix.

This is being pulled in through the path_tracer example. Added details to #120.

@skinnyBat

I would love to contribute. I will first try and get the existing examples working on my setup.

@Schmiedium
Contributor

@jorge-ortega Thanks! I found the package; it looks like it was an old version of sysinfo. I'm going to publish a branch for the forward port of the project to try and get all the dependencies updated.

And @LegNeato, thanks for the info on the rust-gpu forward port; I'll check that out to see how they went about it.

@kulst

kulst commented Feb 2, 2025

Hey, great to see this crate being rebooted.

I am interested in contributing as well. I have some experience using the nvptx backend from Rust. I think it really could be a viable alternative to the NVVM codegen that Rust-CUDA currently uses. My observations so far are:

  • It is possible to implement CUDA kernels with the nvptx backend in Rust. I implemented some simple ones like stencil operations, matrix multiplication, and reduction.
  • There are some simple intrinsics already available from Rust (like _syncthreads() or _block_idx_x()); a minimal kernel sketch follows this list.
  • For other instructions (like texture fetching or atomic operations on floats), it is possible either to link against the corresponding llvm.nvvm intrinsics or to use inline assembly.
  • Using shared memory requires inline assembly at the moment, but a solution for this is being actively discussed.
  • Debugging such kernels should be possible with cuda-gdb. The llvm-bitcode-linker Rust tool (not the LLVM application) currently strips out all debug information. However, I was able to debug a simple kernel by compiling it with -g -O1 and manually running opt and llc without stripping the debug information. In my case, compiling with -O0 produced invalid PTX regardless of whether debug information was included.
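
For reference, a minimal kernel of the kind described above might look roughly like this (nightly only, built for the nvptx64-nvidia-cuda target; exact feature-gate names and intrinsic paths have shifted between nightlies, so treat this as a sketch rather than a recipe):

```rust
// Minimal element-wise add kernel for the rustc nvptx backend (sketch).
#![no_std]
#![feature(abi_ptx, stdarch_nvptx)]

use core::arch::nvptx;

#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *const f32, b: *const f32, out: *mut f32, n: usize) {
    // Global thread index from the block/thread intrinsics.
    let i = nvptx::_block_idx_x() as usize * nvptx::_block_dim_x() as usize
        + nvptx::_thread_idx_x() as usize;
    if i < n {
        *out.add(i) = *a.add(i) + *b.add(i);
    }
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```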

@mratsim

mratsim commented Feb 4, 2025

Hello there,

I've been developing GPU kernels in Nim, CUDA, and LLVM IR → NVPTX for a while, including an LLVM-based JIT compiler with both NVVM and NVPTX backends (see my Nim hello world with both backends: https://github.com/mratsim/constantine/blob/v0.2.0/tests/gpu/hello_world_nvidia.nim#L107-L152).

Yep, open to breaking whatever; let's get to latest. The plan was to switch off NVVM and onto PTX directly, but after talking with NVIDIA I am not so sure that is the best way forward.

The issue with NVVM is that it uses LLVM IR 7.0.1 from December 2018, and the very next version, 7.1.0, was a breaking change. Quoting myself:

⚠ NVVM IR is based on LLVM 7.0.1 IR, which dates from December 2018.
There are a couple of caveats:

  • LLVM 7.0.1 is usually not available in distro repos, making installation difficult
  • There was an ABI-breaking bug making the 7.0.1 and 7.1.0 versions messy (https://www.phoronix.com/news/LLVM-7.0.1-Released)
  • LLVM 7.0.1 does not have LLVMBuildCall2 and relies on the deprecated LLVMBuildCall, meaning supporting both it and the latest LLVM (for AMDGPU and SPIR-V backends) will likely have heavy costs
  • When generating an add-with-carry kernel with inline ASM calls from LLVM 14, if the LLVM IR is passed as bitcode, the kernel content is silently discarded; this does not happen with the built-in add. It is unclear whether it is the call2 or the inline ASM incompatibility that causes the issue
  • When generating an add-with-carry kernel with inline ASM calls from LLVM 14, if the LLVM IR is passed as textual IR, the code is refused with NVVM_ERROR_INVALID_IR

Hence, using the LLVM NVPTX backend instead of libNVVM is likely the sustainable way forward.

There is a way to downgrade LLVM IR, which is what Julia does through https://github.com/JuliaGPU/GPUCompiler.jl via the following package: https://github.com/JuliaLLVM/llvm-downgrade. But they have to maintain a branch per LLVM release, and it seems quite cumbersome.

@jorge-ortega
Collaborator

jorge-ortega commented Feb 4, 2025

FWIW, the latest CUDA 12.8 introduced a second dialect of NVVM IR based on LLVM 18.1.8 (see NVVM IR docs)

NVVM IR can be in one of two dialects. The LLVM 7 dialect is based on LLVM 7.0.1. The modern dialect is based on a more recent public release version of LLVM (LLVM 18.1.8). The modern dialect only supports Blackwell and later architectures (compute capability compute_100 or greater).

@LegNeato
Contributor Author

LegNeato commented Feb 4, 2025

Yeah, that's what NVIDIA pointed out to me that made me reassess! It sounds like they are treating NVVM as the stable, recommended layer, and PTX as the discouraged hard mode.

I'm also not sure if there is more interop or optimization potential with NVVM and if it is worth the problems hit previously. We'd certainly get work "for free" from NVIDIA's tools, but it is not clear if we'll be fighting upstream on the Rust or DX side. I know the MS shader compilers are notoriously annoying to work with, for example.

There are also considerations like the autodiff support in nightly, which operates at the LLVM layer and might be easier to interface with if we are at the NVVM layer. On the flip side, there is the nvptx rustc backend, and perhaps targeting PTX will let us all better reuse work.

If anyone thinks they have insights or thoughts, please chime in. Lots to figure out!

@dssgabriel

I have access to A100 and H100 GPUs so I can help with testing. If I have some free time, I could also try to help with development and porting to newer versions of rustc/libNVVM.

@mratsim

mratsim commented Feb 8, 2025

I'm also not sure if there is more interop or optimization potential with NVVM and if it is worth the problems hit previously.

They have better optimization passes, and the driver is optimized to lower PTX from NVVM to binary code. But I think using LLVM will significantly ease development and also deployment. And requiring Blackwell for NVVM 2nd gen is meh.

Ultimately, if someone has a perf bottleneck, I believe they would use inline PTX (or, if crazy enough, go the Nervana way and reverse engineer the GPU SASS: https://github.com/NervanaSystems/maxas/wiki/Introduction). I think NVPTX would go 90% of the way, and if even higher perf is needed, it's for a commercial product and they would dedicate dev time (or buy faster hardware or distribute compute).
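
As a concrete illustration of that inline-PTX escape hatch, here is a hypothetical add-with-carry helper (nightly only, nvptx64-nvidia-cuda target; add.cc/addc is the carry-chain pattern mentioned above):

```rust
use core::arch::asm;

/// Adds two 64-bit values split into 32-bit limbs using PTX carry-chain
/// instructions: add.cc sets the carry flag, addc consumes it.
#[inline(always)]
pub unsafe fn add_with_carry(a_lo: u32, a_hi: u32, b_lo: u32, b_hi: u32) -> (u32, u32) {
    let lo: u32;
    let hi: u32;
    asm!(
        "add.cc.u32  {lo}, {alo}, {blo};",
        "addc.u32    {hi}, {ahi}, {bhi};",
        lo = out(reg32) lo,
        hi = out(reg32) hi,
        alo = in(reg32) a_lo,
        blo = in(reg32) b_lo,
        ahi = in(reg32) a_hi,
        bhi = in(reg32) b_hi,
    );
    (lo, hi)
}
```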

There are also considerations like the autodiff support in nightly operates at the llvm layer and might be easier to interface with if we are at the NVVM layer?

Which autodiff are you talking about? Is it Enzyme? (https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32466/)

@LegNeato
Contributor Author

LegNeato commented Feb 9, 2025

rust-lang/rust#124509

@ZuseZ4

ZuseZ4 commented Feb 12, 2025

Hi, I just saw someone reference autodiff support! Enzyme supports CPU and GPU parallelism and has a PreserveNVVM pass that I could schedule; if someone starts running some experiments with it, I can probably enable it for nightly: https://github.com/EnzymeAD/Enzyme/blob/main/enzyme/Enzyme/PreserveNVVM.cpp

I'm about to finish upstreaming autodiff (see here), so I hope to soon get back to my offload project goal, which also intends to run pure/safe Rust code on the GPU: rust-lang/rust#131513. It's not vendor-specific, so writing unsafe, NVIDIA-specific kernels is probably going to be faster in some cases, but I hope to offer good-enough performance purely based on LLVM for most cases, such that all vendors and Enzyme can be supported. That's not too different from the Julia world (since someone already mentioned KA.jl), where KernelAbstractions.jl is the generic frontend that supports all vendors, and then some people decide to instead or additionally write code based on CUDA.jl, for better performance that KA.jl can't provide at the moment.

@dabacircle

Happy to see this project being rebooted! I know this project is called Rust-CUDA. But you also mentioned that Rust-GPU is like the sibling project and you may look into backend-agnostic integration in the future. What about other backends? I personally work for a Chinese corp that is building our own GPGPU ecosystem and I know there are several other new players that are trying to break the CUDA monopoly. I am interested in contributing to the backend-agnostic features and possibly adding new backend support if it's planned.

@cell-scape

I'm very excited to see this project getting rebooted, and I would love to contribute if I can. I have a couple of 3070 Tis and some experience writing kernels with CUDA C/C++, PyCUDA, and Julia's CUDA.jl. I'd be happy to do testing, docs or examples, or development if I'm able. I've been using Rust as a hobby for a couple of years; the community has been really great, and I'd like to get involved to the extent that I'm able.

@igor-semyonov

I am excited too.
When testing, which version should I use? The latest published one? Or the one from the main branch?

@Schmiedium
Contributor

So I made some progress: I was able to update the dependencies to support a much newer version of Rust. I played around with CI and got it working, except that nvvm_codegen doesn't build. I also updated gpu_rand to be consistent with the newest rand_core API, and that crate now builds successfully as well.

The Windows part of CI also seems to take forever; it was over 40 minutes in and still hadn't gotten past installing CUDA, so that's something to dig into later.
The Ubuntu version of CI seems to fail at building the project, which makes sense. However, it only succeeds in installing the CUDA toolkit the second time CI is run. I'm not sure what the deal is there, so more stuff to look into.

I think the next thing to work on is getting nvvm_codegen to build, and the rest of the project as well. Once that's done, and the remaining CI kinks are worked out, I think we'll have an updated, functioning project on our hands

Thanks to @juntyr for reviewing my pull request earlier

@trigpolynom
Contributor

Hi all, looking into building a Rust-based simulator with GPU support; needless to say, I would love to help contribute to this project.

@msharmavikram

@LegNeato would love this project to run within the GPUMODE community. Interested?

@LegNeato
Contributor Author

LegNeato commented Mar 6, 2025

@LegNeato would love this project to run within the GPUMODE community. Interested?

@msharmavikram not sure what that means.

@boardwalkjoe

@LegNeato what kinda CI machines do you need?
