
SLURM Multi-Node Training

This directory contains an example SLURM script for running Axolotl training jobs across multiple nodes in a SLURM cluster.

Prerequisites

  • Access to a SLURM cluster with GPU nodes
  • Axolotl installed on all nodes (see installation docs)

Usage

Standard SLURM Clusters

  1. Copy axolotl.slurm to your working directory.

  2. Place your Axolotl config file (train.yaml) in the same directory.

  3. Set the appropriate environment variables for the job:

    export HF_TOKEN="your-huggingface-token"
    
    # metric tracking
    # export WANDB_API_KEY="your-wandb-api-key"
    # ...
    
  4. Submit the job:

    sbatch --export=ALL,NUM_NODES=2,NUM_TRAINERS=8,PRIMARY_ADDR=<master-node>,PRIMARY_PORT=29400 axolotl.slurm
    

    Where:

    • NUM_NODES: Number of nodes to use
    • NUM_TRAINERS: GPUs per node (typically 8)
    • PRIMARY_ADDR: Hostname/IP of the master node
    • PRIMARY_PORT: Port for distributed training (default: 29400)
  5. (Optional) Run other slurm commands:

    # check job info (use the job ID printed by sbatch)
    scontrol show job <job-id>
    
    # check job queue
    squeue
    
    # check cluster status
    sinfo
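
The four variables exported at submission time ultimately drive a distributed launcher inside axolotl.slurm. A minimal sketch of what that wiring might look like, assuming a torchrun-based launch (the variable names match the sbatch line above; SLURM_NODEID is set on each node by SLURM; the exact script contents may differ):

```shell
# Hypothetical sketch of the launcher wiring inside a script like
# axolotl.slurm -- not the actual script contents.
NUM_NODES=${NUM_NODES:-2}        # total nodes in the job
NUM_TRAINERS=${NUM_TRAINERS:-8}  # GPUs (trainer processes) per node
PRIMARY_ADDR=${PRIMARY_ADDR:-localhost}
PRIMARY_PORT=${PRIMARY_PORT:-29400}

# SLURM sets SLURM_NODEID on every allocated node (0 on the first node),
# which serves as the node rank for the rendezvous.
NODE_RANK=${SLURM_NODEID:-0}

CMD="torchrun --nnodes=$NUM_NODES --nproc-per-node=$NUM_TRAINERS \
  --node-rank=$NODE_RANK --rdzv-backend=c10d \
  --rdzv-endpoint=$PRIMARY_ADDR:$PRIMARY_PORT \
  -m axolotl.cli.train train.yaml"
echo "$CMD"
```

You would not run this by hand: sbatch executes the script once per node (via srun), and SLURM_NODEID is what distinguishes the ranks so that every node rendezvous at PRIMARY_ADDR:PRIMARY_PORT.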
    

RunPod Instant Clusters

Axolotl works with RunPod Instant Clusters, which provide managed SLURM clusters with zero configuration.

  1. Deploy a SLURM Cluster: Create a SLURM Instant Cluster from the RunPod console

  2. Connect to the Controller Node: Find the controller node in the RunPod console and connect via SSH

  3. Follow the steps in the Standard SLURM Clusters section above

Additional Resources