
SLURM Multi-Node Training

This directory contains an example SLURM script for running Axolotl training jobs across multiple nodes in a SLURM cluster.

Prerequisites

  • Access to a SLURM cluster with GPU nodes
  • Axolotl installed on all nodes (see installation docs)

Usage

Standard SLURM Clusters

  1. Copy axolotl.slurm to your working directory.

  2. Place your Axolotl config file (train.yaml) in the same directory.

  3. Set the appropriate environment variables for the job:

    export HF_TOKEN="your-huggingface-token"
    
    # metric tracking
    # export WANDB_API_KEY="your-wandb-api-key"
    # ...
    
  4. Submit the job:

    sbatch --export=ALL,NUM_NODES=2,NUM_TRAINERS=8,PRIMARY_ADDR=<master-node>,PRIMARY_PORT=29400 axolotl.slurm
    

    Where:

    • NUM_NODES: Number of nodes to use
    • NUM_TRAINERS: GPUs per node (typically 8)
    • PRIMARY_ADDR: Hostname/IP of the master node
    • PRIMARY_PORT: Port for distributed training (default: 29400)
  5. (Optional) Run other slurm commands:

    # check job info (use the job ID printed by sbatch)
    scontrol show job <job-id>
    
    # check job queue
    squeue
    
    # check cluster status
    sinfo
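
The four variables exported at submission time ultimately drive a distributed launcher inside axolotl.slurm. A minimal sketch of what that wiring might look like, assuming a torchrun-based launch (the variable names match the sbatch line above; SLURM_NODEID is set on each node by SLURM; the exact script contents may differ):

```shell
# Hypothetical sketch of the launcher wiring inside a script like
# axolotl.slurm -- not the actual script contents.
NUM_NODES=${NUM_NODES:-2}        # total nodes in the job
NUM_TRAINERS=${NUM_TRAINERS:-8}  # GPUs (trainer processes) per node
PRIMARY_ADDR=${PRIMARY_ADDR:-localhost}
PRIMARY_PORT=${PRIMARY_PORT:-29400}

# SLURM sets SLURM_NODEID on every allocated node (0 on the first node),
# which serves as the node rank for the rendezvous.
NODE_RANK=${SLURM_NODEID:-0}

CMD="torchrun --nnodes=$NUM_NODES --nproc-per-node=$NUM_TRAINERS \
  --node-rank=$NODE_RANK --rdzv-backend=c10d \
  --rdzv-endpoint=$PRIMARY_ADDR:$PRIMARY_PORT \
  -m axolotl.cli.train train.yaml"
echo "$CMD"
```

You would not run this by hand: sbatch executes the script once per node (via srun), and SLURM_NODEID is what distinguishes the ranks so that every node rendezvous at PRIMARY_ADDR:PRIMARY_PORT.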
    

RunPod Instant Clusters

Axolotl works with RunPod Instant Clusters, which provide managed SLURM clusters with zero configuration.

  1. Deploy a SLURM Cluster: Create a SLURM Instant Cluster from the RunPod console

  2. Connect to the Controller Node: Find the controller node in the RunPod console and connect via SSH

  3. Follow the steps in the Standard SLURM Clusters section above

Additional Resources