Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.
Checkpointing overview
Transparent, per-process checkpointing offers a middle ground between virtual machine checkpointing and application-driven checkpointing. Per-process checkpointing can be used in combination with containers to checkpoint the state of a complex application, facilitating use cases such as the following:
- Fault tolerance, with periodic checkpoints
- Preemption of lower-priority work on a single node by checkpointing the preempted task
- Cluster scheduling with migration

CRIU
CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility for Linux, maintained outside of NVIDIA, that can checkpoint and restore process trees. CRIU exposes its functionality through a command-line program called criu and operates by checkpointing and restoring every kernel-mode resource associated with a process. These resources include:
- Anonymous memory
- Threads
- Regular files
- Sockets
- Pipes between checkpointed processes
As the behavior of these resources is specified by Linux and is independent of the underlying hardware, CRIU knows how to checkpoint and restore them.
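For example, a CPU-only process tree can be checkpointed and restored with criu alone. Here is a minimal sketch, assuming a long-running process with PID 1234 that was started from a shell session (the PID and image directory are illustrative):

```
# Checkpoint the process tree into an image directory
localhost# mkdir -p /tmp/ckpt
localhost# criu dump --shell-job --images-dir /tmp/ckpt --tree 1234

# Later, recreate the process from the saved images
localhost# criu restore --shell-job --restore-detached --images-dir /tmp/ckpt
```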
In contrast, NVIDIA GPUs provide functionality beyond that of a standard Linux kernel, so CRIU is not able to manage them. cuda-checkpoint adds this capability and can be used with CRIU to checkpoint and restore a CUDA application.
cuda-checkpoint
cuda-checkpoint checkpoints and restores the CUDA state of a single Linux process. It requires display driver version 550 or higher and can be downloaded from the bin directory of the NVIDIA/cuda-checkpoint GitHub repo.
```
localhost$ cuda-checkpoint --help
CUDA checkpoint and restore utility.
Toggles the state of CUDA within a process between suspended and running.
Version 550.54.09. Copyright (C) 2024 NVIDIA Corporation. All rights reserved.

    --toggle --pid <value>
        Toggle the state of CUDA in the specified process.
    --help
        Print help message.
```
The cuda-checkpoint binary can toggle the CUDA state of a process, specified by PID, between suspended and running. A running-to-suspended transition is called a suspend and the opposite transition is called a resume.
A process’s CUDA state is initially running. When cuda-checkpoint is used to suspend CUDA in a process, it follows these steps:
1. Any CUDA driver APIs that launch work, manage resources, or otherwise impact GPU state are locked.
2. Already-submitted CUDA work, including stream callbacks, is completed.
3. Device memory is copied to the host, into allocations managed by the CUDA driver.
4. All CUDA GPU resources are released.
Figure: cuda-checkpoint used to suspend CUDA

cuda-checkpoint does not suspend CPU threads, which may continue to safely interact with CUDA in one of the following ways:
- Calling runtime or driver APIs, which may block until CUDA is resumed
- Accessing host memory allocated by cudaMallocHost and similar APIs, which remains valid
A suspended CUDA process no longer directly refers to any GPU hardware at the OS level and may therefore be checkpointed by a CPU checkpointing utility such as CRIU.
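Putting the two tools together, a complete checkpoint and restore cycle looks like the following sketch, where $PID identifies the target process and demo is an image directory (the checkpointing example below walks through each step in detail):

```
localhost# cuda-checkpoint --toggle --pid $PID                             # suspend CUDA
localhost# criu dump --shell-job --images-dir demo --tree $PID             # checkpoint the process
localhost# criu restore --shell-job --restore-detached --images-dir demo   # restore the process
localhost# cuda-checkpoint --toggle --pid $PID                             # resume CUDA
```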
When a process’s CUDA state is resumed using cuda-checkpoint, it follows these steps:
1. GPUs are re-acquired by the process.
2. Device memory is copied back to the GPU, and GPU memory mappings are restored at their original addresses.
3. CUDA objects, such as streams and contexts, are restored.
4. CUDA driver APIs are unlocked.
Figure: cuda-checkpoint used to resume CUDA
At this point, CUDA calls unblock and CUDA can begin running on the GPU again.
Checkpointing example
This example uses cuda-checkpoint and CRIU to checkpoint a CUDA application called counter. Every time counter receives a packet, it increments GPU memory and replies with the updated value. The example code is also available in the GitHub repo.
```
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 10000

// Device-side counter, initialized to 100
__device__ int counter = 100;

__global__ void increment()
{
    counter++;
}

int main(void)
{
    cudaFree(0);  // force CUDA initialization

    // Listen for UDP packets on localhost:10000
    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    sockaddr_in addr = {AF_INET, htons(PORT), inet_addr("127.0.0.1")};
    bind(sock, (sockaddr *)&addr, sizeof addr);

    while (true) {
        char buffer[16] = {0};
        sockaddr_in peer = {0};
        socklen_t inetSize = sizeof peer;
        int hCounter = 0;

        // Wait for a packet, increment the device counter,
        // then reply with the updated value
        recvfrom(sock, buffer, sizeof buffer, 0, (sockaddr *)&peer, &inetSize);
        increment<<<1,1>>>();
        cudaMemcpyFromSymbol(&hCounter, counter, sizeof counter);
        size_t bytes = sprintf(buffer, "%d\n", hCounter);
        sendto(sock, buffer, bytes, 0, (sockaddr *)&peer, inetSize);
    }

    return 0;
}
```
You can build the counter application using nvcc.
```
localhost$ nvcc counter.cu -o counter
```
Launch counter in the background and save its PID for reference in subsequent commands:

```
localhost# ./counter &
localhost# PID=$!
```
Send counter a packet and observe the returned value. The initial value was 100, but the response is 101, showing that the GPU memory has changed since initialization.
```
localhost# echo hello | nc -u localhost 10000 -W 1
101
```
Use nvidia-smi to confirm that counter is running on a GPU:
```
localhost# nvidia-smi --query --display=PIDS | grep $PID
        Process ID                        : 298027
```
Use cuda-checkpoint to suspend the CUDA state of counter:
```
localhost# cuda-checkpoint --toggle --pid $PID
```
Use nvidia-smi to confirm that counter is no longer running on a GPU:
```
localhost# nvidia-smi --query --display=PIDS | grep $PID
```
Create a directory to hold the checkpoint image:
```
localhost# mkdir -p demo
```
Use criu to checkpoint counter:
```
localhost# criu dump --shell-job --images-dir demo --tree $PID
[1]+  Killed                 ./counter
```
Confirm that counter is no longer running:
```
localhost# ps --pid $PID
    PID TTY          TIME CMD
```
Use criu to restore counter:
```
localhost# criu restore --shell-job --restore-detached --images-dir demo
```
Use cuda-checkpoint to resume the CUDA state of counter:
```
localhost# cuda-checkpoint --toggle --pid $PID
```
Now that counter is fully restored, send it another packet. The response is 102, showing that earlier GPU operations were persisted correctly.
```
localhost# echo hello | nc -u localhost 10000 -W 1
102
```
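As a final experiment, you can observe the blocking behavior described earlier: while CUDA is suspended, counter still receives packets on its CPU thread, but its kernel launch blocks until CUDA is resumed, so no reply arrives. A sketch of the idea, with illustrative timeout and output values:

```
localhost# cuda-checkpoint --toggle --pid $PID

# counter accepts the packet, but its kernel launch blocks while
# CUDA is suspended, so this request times out with no reply
localhost# echo hello | timeout 2 nc -u localhost 10000 -W 1

localhost# cuda-checkpoint --toggle --pid $PID

# the blocked request completes on resume; a fresh request returns
# the next value in the sequence
localhost# echo hello | nc -u localhost 10000 -W 1
104
```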
Functionality
As of display driver version 550, checkpoint and restore functionality is still being actively developed. In particular, cuda-checkpoint has the following characteristics:
- x64 only.
- Acts upon a single process, not a process tree.
- Doesn’t support UVM or IPC memory.
- Doesn’t support GPU migration.
- Waits for already-submitted CUDA work to finish before completing a checkpoint.
- Doesn’t attempt to keep the process in a good state if an error (such as the presence of a UVM allocation) is encountered during checkpoint or restore.
These limitations will be addressed in subsequent display driver releases and will not require an update to the cuda-checkpoint utility itself; the utility simply exposes functionality that is contained in the driver.
Summary
The cuda-checkpoint utility, when combined with CRIU, enables transparent, per-process checkpointing of CUDA applications on Linux. For more information, see the NVIDIA/cuda-checkpoint GitHub repo.
Try checkpointing the counter application, or any other compatible CUDA application, on your own machine!