Doing More with Slurm

Advanced Capabilities
Nick Ihli, Director - Cloud and Sales Engineering - SchedMD
[email protected]
Most people know Slurm…
● Policy-driven, open source, fault-tolerant, and highly scalable workload
management and job scheduling system
● Three Key Functions
○ Allocates exclusive and/or non-exclusive access to resources to users for
some duration of time for a workload
○ Provides a framework for starting, executing, and monitoring work on the
set of allocated nodes
○ Arbitrates contention for resources by managing a queue of pending work
Slurm on Top500
● 5 of top 10
● More than 50% of Top100

Rank  System                              Cores        Rpeak (TFlop/s)
1     Supercomputer Fugaku                7,630,848    537,212.0
2     Summit - IBM                        2,414,592    200,794.9
      DOE/SC/Oak Ridge National Laboratory - United States
3     Sierra - IBM / NVIDIA / Mellanox    1,572,480    125,712.0
      DOE/NNSA/LLNL - United States
4     Sunway TaihuLight - NRCPC           10,649,600   125,435.9
      National Supercomputing Center in Wuxi - China
5     Perlmutter - HPE                    761,856      93,750.0
      DOE/SC/LBNL/NERSC - United States
6     Selene - Nvidia                     555,520      79,215.0
      NVIDIA Corporation - United States
7     Tianhe-2A - NUDT                    4,981,760    100,678.7
      National Super Computer Center in Guangzhou - China
8     JUWELS Booster Module - Atos        449,280      70,980.0
      Forschungszentrum Juelich (FZJ) - Germany
9     HPC5 - Dell EMC                     669,760      51,720.8
      Eni S.p.A. - Italy
10    Voyager-EUS2 - Microsoft Azure      253,440      39,531.2


But what is SchedMD?
● Maintainers and Supporters of Slurm
● Only organization providing level-3 support
● Training
● Consultation
● Custom Development
Industry Trends
Sectors: Manufacturing & EDA, Healthcare & Lifesciences, Financial Services &
Insurance, Energy, Government, Academic
Common trends across them:
● GPUs - AI Workloads
● Hybrid Cloud
● AI Tooling Integration
GPU Scheduling for AI Workloads
Fine-Grained GPU Control
Same options apply to salloc, sbatch and srun commands:

● --cpus-per-gpu= CPUs required per allocated GPU
● -G/--gpus= GPU count across entire job allocation
● --gpu-bind= Task/GPU binding option
● --gpu-freq= Specify GPU and memory frequency
● --gpus-per-node= Works like “--gres=gpu:#” option today
● --gpus-per-socket= GPUs per allocated socket
● --gpus-per-task= GPUs per spawned task
● --mem-per-gpu= Memory per allocated GPU
Examples of Use
$ sbatch --ntasks=16 --gpus-per-task=2 my.bash

$ sbatch --ntasks=8 --ntasks-per-socket=2 --gpus-per-socket=k80:1 my.bash

$ sbatch --gpus=16 --gpu-bind=closest --nodes=2 my.bash

$ sbatch --gpus=k80:8,a100:2 --nodes=1 my.bash


Configuring GPUs
● GPUs fall under the Generic Resource (GRES) plugin
○ Node-specific resources
● Requires definition in slurm.conf and gres.conf on the node (see the sketch below)
● GRES can be associated with specific device files (e.g. specific GPUs)
● GPUs can be autodetected with NVML or RSMI libraries
● Sets CUDA_VISIBLE_DEVICES environment variable for the job
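As a rough sketch of how these pieces fit together (node names, GPU type and
counts below are illustrative assumptions, not a reference configuration):

# slurm.conf (illustrative)
GresTypes=gpu
NodeName=node[01-04] Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN

# gres.conf on each GPU node - either autodetect via NVML/RSMI...
AutoDetect=nvml
# ...or list the device files explicitly:
# Name=gpu Type=a100 File=/dev/nvidia[0-3]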
Restricting Devices with Cgroups
● Uses the devices subsystem (configuration sketched below)
○ devices.allow and devices.deny control access to devices
○ All devices in gres.conf that the job does not request are added to
devices.deny so the job can’t use them
● Must be a Unix device file. Cgroups restrict devices based on major/minor
number, not file path
● GPUs are the most common use case, but any Unix device file can be
restricted with cgroups
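A minimal sketch of the configuration that enables device constraint (values
shown are typical, not mandatory):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainDevices=yes    # populate devices.allow / devices.deny per job
ConstrainCores=yes
ConstrainRAMSpace=yes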
NVIDIA MIG Support
● Configured like regular GPUs in Slurm
● Fully supported by task/cgroup and --gpu-bind
● AutoDetect support
● Exposed to jobs through CUDA_VISIBLE_DEVICES
● MIGs must be manually partitioned outside of Slurm beforehand via
nvidia-smi (example below)
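For example, the partitioning might be done roughly like this before Slurm is
reconfigured; the GPU index and profile IDs are assumptions that depend on the
GPU model:

# Enable MIG mode on GPU 0, then create GPU and compute instances
$ sudo nvidia-smi -i 0 -mig 1
$ sudo nvidia-smi mig -cgi 19,19,19 -C    # e.g. three 1g.5gb slices on an A100
$ sudo nvidia-smi mig -lgi                # list the resulting GPU instances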
Hybrid Cloud Autoscaling
Hybrid Cloud
Cloud Enablement
● Power Saving module (slurm.conf sketch below)
○ Requires 3 parameters to enable
■ ResumeProgram
■ SuspendProgram
■ SuspendTime (either global or per-partition)
○ Other important parameters
■ ResumeTimeout
■ SuspendTimeout
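A minimal sketch of the slurm.conf pieces; script paths, node names and timings
are illustrative assumptions:

# slurm.conf - power saving / cloud bursting (illustrative values)
SuspendProgram=/usr/local/sbin/slurm_suspend.sh   # powers down / deletes cloud nodes
ResumeProgram=/usr/local/sbin/slurm_resume.sh     # creates / powers up cloud nodes
SuspendTime=600        # idle seconds before a node is suspended (or set per partition)
ResumeTimeout=300      # max seconds for a resumed node to boot and register
SuspendTimeout=120     # max seconds for a node to finish powering down

NodeName=cloud[001-100] State=CLOUD CPUs=16 RealMemory=64000
PartitionName=cloud Nodes=cloud[001-100] SuspendTime=300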
Power State Transition - Resume
[Diagram: a job allocated to a POWERED_DOWN (~) node waits in the configuring
state while the node is POWERING_UP (#); once the node registers, its state
becomes ALLOCATED/MIXED and the job runs, then completes.]
Power State Transition - Suspend
[Diagram: a node left IDLE for SuspendTime moves to POWERING_DOWN (%) and then
POWERED_DOWN (~); SuspendTimeout bounds how long the power-down may take.]
What about the Data?
● Most common question - how do we get data from on-prem to the cloud?
● Previous best option - a mini-workflow with job dependencies (see the sketch
after this list)

Stage-in job > Application job > Stage-out job

● Benefit: easy to increase the number of nodes involved in moving the data
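One way to express that mini-workflow with plain job dependencies; the script
names are placeholders:

$ stage_in=$(sbatch --parsable stage_in.sh)
$ app=$(sbatch --parsable --dependency=afterok:$stage_in app.sh)
$ sbatch --dependency=afterany:$app stage_out.sh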
New Option: Lua Burst Buffer plugin
● Originally developed for Cray Datawarp
○ Intermediate storage - in between slow long-term storage and the fast memory
on compute nodes
● Asynchronously calls an external script to not interfere with the scheduler
● Generalized this function so you don’t need Cray Datawarp or actual
hardware “burst buffers” or Cray’s API
● Good for data movement or provisioning cloud nodes
○ Anything you might want to do while the job is pending (or at other
job states) - see the sketch after the stage list below
Asynchronous “stages”
● Stage in - called before the job is scheduled, job state == pending
○ Best time for Cloud data staging
● Pre run - called after the job is scheduled, job state == running + configuring
○ Job not actually running yet
● Stage out - called after the job completes, job state == stage out
○ Job cannot be purged until this is done
● Teardown - called after stage out, job state == complete
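A fragment of what a burst_buffer.lua mapping of those stages onto data movement
might look like. This is a sketch, not a complete script: the real script must
define all the hooks from the example shipped with Slurm, argument lists vary by
release (left as varargs here), and the rsync paths are invented.

-- burst_buffer.lua (fragment); enabled with BurstBufferType=burst_buffer/lua
function slurm_bb_data_in(job_id, job_script, ...)
    -- runs while the job is still pending: pull input data toward the cloud
    os.execute("rsync -a onprem:/data/job_" .. job_id .. "/ /scratch/job_" .. job_id .. "/")
    return slurm.SUCCESS
end

function slurm_bb_pre_run(job_id, job_script, ...)
    -- job is scheduled but still configuring: last-minute checks go here
    return slurm.SUCCESS
end

function slurm_bb_data_out(job_id, job_script, ...)
    -- job finished: push results back before the job record can be purged
    os.execute("rsync -a /scratch/job_" .. job_id .. "/ onprem:/results/job_" .. job_id .. "/")
    return slurm.SUCCESS
end

function slurm_bb_job_teardown(job_id, job_script, ...)
    -- final cleanup after stage out
    os.execute("rm -rf /scratch/job_" .. job_id)
    return slurm.SUCCESS
end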
AI Tooling Integration:
Enter the REST API
New Integration Requirements
What is Slurm REST API
● Client sends an HTTP request (GET, POST, PUT or DELETE) with a JSON/YAML body
● Server sends a response
● The server is slurmrestd (NOT srun, sbatch, salloc)
slurmrestd
A tool that runs inside of the Slurm perimeter and translates JSON/YAML REST
API requests from clients into Slurm RPC requests to slurmctld and slurmdbd.
Slurm REST API Architecture (rest_auth/jwt)
[Diagram: clients outside the Munge perimeter authenticate to slurmrestd with
JWT (the AuthAltTypes perimeter); slurmrestd forwards their requests to
slurmctld and slurmdbd, which manage the slurmd daemons on the cluster network.]
Slurm REST API Architecture (rest_auth/jwt + Proxy)
[Diagram: authenticated clients connect over TLS to an authenticating HTTP proxy
backed by the site authentication server; the proxy forwards requests into the
AuthAltTypes/JWT perimeter to slurmrestd, which talks to slurmctld, slurmdbd and
the slurmd daemons inside the Munge perimeter.]
An example request is sketched below.
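As a hedged example of what a JWT-authenticated request can look like - the
host name and port are assumptions, and the endpoint follows the v0.0.37 plugin
shown on the next slide:

# Obtain a token (requires AuthAltTypes=auth/jwt on the cluster)
$ unset SLURM_JWT; export $(scontrol token lifespan=3600)

# Query the controller through slurmrestd
$ curl -s -H "X-SLURM-USER-NAME: $USER" \
       -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
       http://restd.example.com:6820/slurm/v0.0.37/diag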
JSON/YAML output
● slurmrestd uses content (a.k.a. openapi) plugins. These plugins have been made
global so other parts of Slurm can dump JSON/YAML output.
● New output formatting (limited to these binaries only):
○ sacct --json or sacct --yaml
○ sinfo --json or sinfo --yaml
○ squeue --json or squeue --yaml
● Output always uses the same format as the latest version of the slurmrestd output.
○ Formatting arguments are ignored for JSON or YAML output as it is expected
that clients can easily pick and choose what they want.
$ sinfo --json
{
  "meta": {
    "plugin": {
      "type": "openapi\/v0.0.37",
      "name": "Slurm OpenAPI v0.0.37"
    },
    "Slurm": {
      "version": {
        "major": 22,
        "micro": 0,
        "minor": 5
      },
      "release": "21.08.6"
    }
  },
  "errors": [
  ],
  "nodes": [
    {
      "architecture": "x86_64",
      "burstbuffer_network_address": "",
      "boards": 1,
      "boot_time": 1646380817,
      "comment": "",
      "cores": 6,
      "cpu_binding": 0,
      "cpu_load": 64,
      "extra": "",
      "free_memory": 3208,
      "cpus": 12,
      "last_busy": 1646430364,
      "features": "",
      "active_features": "",
      "gres": "",
      "gres_drained": "N\/A",
      "gres_used": "scratch:0",
      "mcs_label": "",
      "name": "node00",
      "next_state_after_reboot": "invalid",
      "address": "node00",
      "hostname": "node00",
      "state": "idle",
      "state_flags": [
      ],
      "next_state_after_reboot_flags": [
      ],
      "operating_system": "Linux 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022",
      "owner": null,
      "partitions": [
        "debug"
      ],
      "port": 6818,
      "real_memory": 31856,
      "reason": "",
      "reason_changed_at": 0,
      "reason_set_by_user": null,
      "slurmd_start_time": 1646430151,
      "sockets": 1,
      "threads": 2,
      "temporary_disk": 0,
      "weight": 1,
      "tres": "cpu=12,mem=31856M,billing=12",
      "slurmd_version": "22.05.0-0pre1",
      "alloc_memory": 0,
      "alloc_cpus": 0,
      "idle_cpus": 12,
      "tres_used": null,
      "tres_weighted": 0.0
    }
  ]
}

A Migration Journey
Large Energy Company
● Using their scheduler for many years
○ Can’t just flip a switch and go to production
● Massive scale - multiple international sites, nodes and workloads
● Many integrations required

3-4 Months to Production

Three Migration Steps
● Admin/User education
○ Training - Help admins identify the commonalities and learn the Slurm way
○ Wrappers - a bridge to migration, not a crutch
■ LSF, Grid Engine - command and submission
■ PBS - command, submission, environment variables, #PBS scripts
● Policy replication
○ Reevaluate policies
■ Are we continuing to produce technical debt due to “doing things how we’ve always
done them?”
○ Optimizing for scale and throughput - 1 million jobs/day
■ Some Financial sites doing up to 15 million/day
● Tooling integration
○ Most time-consuming part of the journey
Questions?

Thank You
schedmd.com slurm.schedmd.com [email protected]
