---
title: Multi Node
description: How to use Axolotl on multiple machines
---

Below are three ways to train multi-node in Axolotl.

::: {.callout-important}
Each machine needs a copy of Axolotl; we suggest using the same commit on every machine to ensure compatibility.

You will also need the same model configuration file on each machine.

Make sure the main machine is reachable by the other machines.
:::

## Accelerate

You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using one of the presets below:

`~/.cache/huggingface/accelerate/default_config.yaml`

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
machine_rank: 0 # Set to 0 for the main machine; increment by one for each other machine
main_process_ip: 10.0.0.4 # Set to the main machine's IP
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2 # Change to the number of machines
num_processes: 4 # Total number of GPUs (for example: 2 machines with 4 GPUs each -> 8)
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```


Configure your model to use FSDP in the Axolotl yaml. For example:

```yaml
fsdp_version: 2
fsdp_config:
  offload_params: true
  state_dict_type: FULL_STATE_DICT
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  reshard_after_forward: true
```


All you have to do now is launch with accelerate on each machine as you normally would; training will start once accelerate has been launched on every machine.
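
For example, assuming your Axolotl config file is named `config.yaml` (a placeholder; use your own file name), the per-machine launch looks like this — accelerate reads `machine_rank`, `main_process_ip`, etc. from each machine's `default_config.yaml`:

```shell
# Run this same command on every machine; the accelerate config on each
# machine (default_config.yaml) determines its rank and the rendezvous address.
accelerate launch -m axolotl.cli.train config.yaml
```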

## Ray Train

Please see the Ray Train docs [here](ray-integration.qmd).

## Torchrun

If you are using InfiniBand, we recommend torchrun to utilize the full bandwidth.

Set the following environment variables (change the buffer size / socket interface name depending on your system):

```bash
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
```

Run the following on each node:

### Option 1: New Axolotl CLI with launcher args (Recommended)

```bash
axolotl train config.yaml --launcher torchrun -- \
    --nnodes $num_nodes --nproc_per_node $gpu_per_node \
    --rdzv_id $rdzv_id --rdzv_backend c10d \
    --rdzv_endpoint "$head_node_ip:$head_node_port"
```

### Option 2: Direct torchrun (Legacy)

```bash
torchrun --nnodes $num_nodes --nproc_per_node $gpu_per_node \
    --rdzv_id $rdzv_id --rdzv_backend c10d \
    --rdzv_endpoint "$head_node_ip:$head_node_port" \
    -m axolotl.cli.train config.yaml
```

Please make sure to substitute the placeholder variables:

- `num_nodes`: Number of nodes (machines containing GPUs)
- `gpu_per_node`: Number of GPUs per node
- `head_node_ip`: IP of the head node (make sure the other machines can connect to it)
- `head_node_port`: Port of the head node (make sure the other machines can connect to it; default 29400)
- `rdzv_id`: A unique job ID that is shared by all nodes in the job
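
As a concrete illustration, here is the Option 2 command with hypothetical values filled in (2 nodes, 4 GPUs each; every value below is a placeholder — substitute your own). The command is built into a variable and printed so you can inspect it before running it on a real cluster:

```shell
# Hypothetical 2-node cluster, 4 GPUs per node -- substitute your own values.
num_nodes=2
gpu_per_node=4
rdzv_id=axolotl-job-42          # any string, must be identical on every node
head_node_ip=10.0.0.4
head_node_port=29400

# The fully substituted launch command (printed for inspection; on a real
# cluster, run it directly on each node instead of echoing it).
cmd="torchrun --nnodes $num_nodes --nproc_per_node $gpu_per_node \
--rdzv_id $rdzv_id --rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:$head_node_port \
-m axolotl.cli.train config.yaml"
echo "$cmd"
```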

The new CLI approach (Option 1) is recommended, as it provides consistent argument handling and works seamlessly with other Axolotl CLI features.

More info on the available configs can be found in the PyTorch docs [here](https://pytorch.org/docs/stable/elastic/run.html).