# SLURM Multi-Node Training
This directory contains an example SLURM script for running Axolotl training jobs across multiple nodes in a SLURM cluster.
## Prerequisites
- Access to a SLURM cluster with GPU nodes
- Axolotl installed on all nodes (see installation docs)
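To sanity-check that every node can see the installation, you can run a quick import test through SLURM (a minimal sketch assuming a two-node test allocation; adjust `--nodes` and add a partition flag as your cluster requires):

```bash
# run one task per node and confirm axolotl imports everywhere
srun --nodes=2 --ntasks-per-node=1 python -c "import axolotl; print('axolotl OK')"
```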
## Usage
### Standard SLURM Clusters
1. Copy `axolotl.slurm` to your working directory.

2. Place your Axolotl config file (`train.yaml`) in the same directory.
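   If you do not have a config yet, a minimal one looks roughly like this (illustrative only; the model, dataset, and hyperparameters below are placeholders, see the Axolotl config reference for the full set of options):

   ```yaml
   base_model: NousResearch/Llama-2-7b-hf   # any supported base model
   datasets:
     - path: mhenrichsen/alpaca_2k_test     # placeholder dataset
       type: alpaca
   output_dir: ./outputs
   sequence_len: 2048
   micro_batch_size: 1
   gradient_accumulation_steps: 4
   num_epochs: 1
   learning_rate: 0.0002
   bf16: auto
   ```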
3. Set the appropriate environment variables for the job:

   ```bash
   export HF_TOKEN="your-huggingface-token"

   # metric tracking
   # export WANDB_API_KEY="your-wandb-api-key"
   # ...
   ```
4. Submit the job:

   ```bash
   sbatch --export=ALL,NUM_NODES=2,NUM_TRAINERS=8,PRIMARY_ADDR=<master-node>,PRIMARY_PORT=29400 axolotl.slurm
   ```

   Where:

   - `NUM_NODES`: Number of nodes to use
   - `NUM_TRAINERS`: GPUs per node (typically 8)
   - `PRIMARY_ADDR`: Hostname/IP of the master node
   - `PRIMARY_PORT`: Port for distributed training (default: 29400)
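   For orientation, the heart of a script like `axolotl.slurm` typically feeds these variables into a `torchrun` rendezvous (a simplified sketch, not the shipped script; the real SBATCH directives, paths, and launch command may differ):

   ```bash
   #!/bin/bash
   #SBATCH --job-name=axolotl-cli
   #SBATCH --ntasks-per-node=1

   # NUM_NODES, NUM_TRAINERS, PRIMARY_ADDR, and PRIMARY_PORT arrive via
   # the --export flag passed to sbatch above.
   srun torchrun \
       --nnodes="${NUM_NODES}" \
       --nproc-per-node="${NUM_TRAINERS}" \
       --rdzv-backend=c10d \
       --rdzv-endpoint="${PRIMARY_ADDR}:${PRIMARY_PORT}" \
       -m axolotl.cli.train train.yaml
   ```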
5. (Optional) Run other SLURM commands to monitor the job:

   ```bash
   # check job info (replace <job-id> with the ID printed by sbatch)
   scontrol show job <job-id>

   # check job queue
   squeue

   # check cluster status
   sinfo
   ```
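By default, SLURM streams the job's stdout/stderr to `slurm-<job-id>.out` in the submission directory (unless the script redirects output elsewhere), so you can follow training progress with:

```bash
# follow the training log as it streams in
tail -f slurm-<job-id>.out
```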
### RunPod Instant Clusters
Axolotl works with RunPod Instant Clusters, which provide managed SLURM clusters with zero configuration.
1. Deploy a SLURM cluster:

   - Go to RunPod Instant Clusters
   - Click "Create a Cluster"
   - Choose your GPU type, node count, and region
   - Choose an Axolotl cloud Docker image
   - Deploy the cluster
2. Connect to the controller node: find the controller node in the RunPod console and connect via SSH.
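   A typical connection looks like the following (illustrative; copy the exact user, host, and port shown in the RunPod console for your cluster):

   ```bash
   # SSH into the controller node using your registered key
   ssh root@<controller-ip> -i ~/.ssh/id_ed25519
   ```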
3. Follow the instructions in Standard SLURM Clusters above.