* slurm example and make preprocess play nicely * start slurm if it init file exists * remove incorrect comment * feat: add slurm docs --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>
67 lines
2.1 KiB
Markdown
67 lines
2.1 KiB
Markdown
# SLURM Multi-Node Training
|
|
|
|
This directory contains an example SLURM script for running Axolotl training jobs across multiple nodes in a SLURM cluster.
|
|
|
|
## Prerequisites
|
|
|
|
- Access to a SLURM cluster with GPU nodes
|
|
- Axolotl installed on all nodes (see [installation docs](https://docs.axolotl.ai/docs/installation.html))
|
|
|
|
## Usage
|
|
|
|
### Standard SLURM Clusters
|
|
|
|
1. Copy [`axolotl.slurm`](./axolotl.slurm) to your working directory.
|
|
2. Place your Axolotl config file (`train.yaml`) in the same directory.
|
|
3. Set the appropriate environment variables for the job:
|
|
```bash
|
|
export HF_TOKEN="your-huggingface-token"
|
|
|
|
# metric tracking
|
|
# export WANDB_API_KEY="your-wandb-api-key"
|
|
# ...
|
|
```
|
|
4. Submit the job:
|
|
```bash
|
|
sbatch --export=ALL,NUM_NODES=2,NUM_TRAINERS=8,PRIMARY_ADDR=<master-node>,PRIMARY_PORT=29400 axolotl.slurm
|
|
```
|
|
|
|
Where:
|
|
- `NUM_NODES`: Number of nodes to use
|
|
- `NUM_TRAINERS`: GPUs per node (typically 8)
|
|
- `PRIMARY_ADDR`: Hostname/IP of the master node
|
|
- `PRIMARY_PORT`: Port for distributed training (default: 29400)
|
|
|
|
5. (Optional) Run other slurm commands:
|
|
```bash
|
|
# check job info
|
|
scontrol show job axolotl-cli
|
|
|
|
# check job queue
|
|
squeue
|
|
|
|
# check cluster status
|
|
sinfo
|
|
```
|
|
|
|
### RunPod Instant Clusters
|
|
|
|
Axolotl works with RunPod Instant Clusters. This feature provides managed SLURM clusters with zero configuration.
|
|
|
|
1. **Deploy a SLURM Cluster**:
|
|
- Go to [RunPod Instant Clusters](https://console.runpod.io/cluster)
|
|
- Click "Create a Cluster"
|
|
- Choose your GPU type, node count, and region
|
|
- Choose an [Axolotl cloud docker image](https://docs.axolotl.ai/docs/docker.html#cloud)
|
|
- Deploy the cluster
|
|
|
|
2. **Connect to the Controller Node**: Find the controller node in the RunPod console and connect via SSH
|
|
|
|
3. **Follow the instructions in [Standard SLURM Clusters](#standard-slurm-clusters)**
|
|
|
|
## Additional Resources
|
|
|
|
- [Axolotl Multi-Node Training](https://docs.axolotl.ai/docs/multi-node.html)
|
|
- [SLURM Documentation](https://slurm.schedmd.com/documentation.html)
|
|
- [RunPod SLURM Clusters Guide](https://docs.runpod.io/instant-clusters/slurm-clusters)
|