---
title: "Multi-GPU"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-tools: true
execute:
  enabled: false
---

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

## Overview {#sec-overview}

Axolotl supports several methods for multi-GPU training:

- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- Sequence parallelism
- FSDP + QLoRA

## DeepSpeed {#sec-deepspeed}

### Configuration {#sec-deepspeed-config}

Add to your YAML config:

```{.yaml}
deepspeed: deepspeed_configs/zero1.json
```

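For reference, a ZeRO Stage 1 DeepSpeed JSON config looks roughly like the sketch below. This is an illustrative approximation, not the exact file we ship; run `axolotl fetch deepspeed_configs` to get the real configs. The `"auto"` values are resolved by the Hugging Face Trainer integration at runtime.

```{.json}
{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```
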
### Usage {#sec-deepspeed-usage}

```{.bash}
# Fetch DeepSpeed configs (if not already present)
axolotl fetch deepspeed_configs

# Pass the config path via your YAML config
axolotl train config.yml

# Or pass it via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```

### ZeRO Stages {#sec-zero-stages}

We provide default configurations for:

- ZeRO Stage 1 (`zero1.json`)
- ZeRO Stage 1 with torch compile (`zero1_torch_compile.json`)
- ZeRO Stage 2 (`zero2.json`)
- ZeRO Stage 3 (`zero3.json`)
- ZeRO Stage 3 with bf16 (`zero3_bf16.json`)
- ZeRO Stage 3 with bf16 and CPU offloading of parameters (`zero3_bf16_cpuoffload_params.json`)
- ZeRO Stage 3 with bf16 and CPU offloading of parameters and optimizer state (`zero3_bf16_cpuoffload_all.json`)

::: {.callout-tip}
For best performance, choose the configuration that offloads the least while still fitting within your available VRAM.

Start with Stage 1, and move to Stage 2 and then Stage 3 only if you run out of memory.
:::

::: {.callout-tip}
**Using ZeRO Stage 3 with single-GPU training**

ZeRO Stage 3 can also be used for training on a single GPU by manually setting the environment variables:

`WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500`
:::

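Putting the pieces together, a single-GPU ZeRO Stage 3 launch might look like the following sketch (`config.yml` is a placeholder for your own config):

```{.bash}
# Set the distributed env vars for a single process, then launch training
# with a ZeRO-3 DeepSpeed config.
WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500 \
  axolotl train config.yml --deepspeed deepspeed_configs/zero3.json
```
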
## Fully Sharded Data Parallel (FSDP) {#sec-fsdp}

::: {.callout-note}
FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.
:::

### Migrating from FSDP1 to FSDP2 {#sec-migrate-fsdp1-fsdp2}

To migrate your config from FSDP1 to FSDP2, set the top-level `fsdp_version` config field to specify the FSDP version, and rename the fields under `fsdp_config` according to the mapping below.

#### Config mapping

FSDP1 | FSDP2
-------- | --------
`fsdp_sharding_strategy` | `reshard_after_forward`
`fsdp_backward_prefetch_policy` | **REMOVED**
`fsdp_backward_prefetch` | **REMOVED**
`fsdp_forward_prefetch` | **REMOVED**
`fsdp_sync_module_states` | **REMOVED**
`fsdp_cpu_ram_efficient_loading` | `cpu_ram_efficient_loading`
`fsdp_state_dict_type` | `state_dict_type`
`fsdp_use_orig_params` | **REMOVED**

Other `fsdp_`-prefixed fields (e.g. `fsdp_offload_params`, `fsdp_auto_wrap_policy`, `fsdp_transformer_layer_cls_to_wrap`) simply drop the `fsdp_` prefix, as shown in the example below.

For example, if you were using the following FSDP1 config:

```{.yaml}
fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```

you can migrate to the following FSDP2 config:

```{.yaml}
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```

### FSDP1 (deprecated) {#sec-fsdp-config}

::: {.callout-note}
Using `fsdp` to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use `fsdp_config` as above instead.
:::

```{.yaml}
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

## Sequence parallelism {#sec-sequence-parallelism}

We support sequence parallelism (SP) via the [ring-flash-attention](https://github.com/zhuzilin/ring-flash-attention) project. This allows you to split sequences across GPUs, which is useful when a single long sequence causes OOM errors during training.

See our [dedicated guide](sequence_parallelism.qmd) for more information.

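As a rough sketch, enabling SP in your YAML config may look like the following. The field name `sequence_parallel_degree` is an assumption based on recent Axolotl versions; check the dedicated guide for the exact form your version expects.

```{.yaml}
# Assumed field name -- split each sequence across 4 GPUs.
sequence_parallel_degree: 4
# SP is built on ring-flash-attention, so flash attention must be enabled.
flash_attention: true
```
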
### FSDP + QLoRA {#sec-fsdp-qlora}

For combining FSDP with QLoRA, see our [dedicated guide](fsdp_qlora.qmd).

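As a brief sketch, an FSDP2 + QLoRA config combines the adapter settings with `fsdp_config` (adapted from the FSDP2 example above; the `transformer_layer_cls_to_wrap` value is model-specific and assumed here — see the dedicated guide for a complete config):

```{.yaml}
adapter: qlora
load_in_4bit: true

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```
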
## Performance Optimization {#sec-performance}

### Liger Kernel Integration {#sec-liger}

Please see the [Liger integration docs](custom_integrations.qmd#liger) for more info.

## Troubleshooting {#sec-troubleshooting}

### NCCL Issues {#sec-nccl}

For NCCL-related problems, see our [NCCL troubleshooting guide](nccl.qmd).

### Common Problems {#sec-common-problems}

::: {.panel-tabset}

## Memory Issues

- Reduce `micro_batch_size`
- Reduce `eval_batch_size`
- Increase `gradient_accumulation_steps` to preserve the effective batch size
- Consider using a higher ZeRO stage

## Training Instability

- Start with DeepSpeed ZeRO-2
- Monitor loss values
- Check learning rates

:::

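An illustrative combination of the memory knobs above (the values are examples, not recommendations):

```{.yaml}
micro_batch_size: 1             # smaller per-GPU batch to cut activation memory
eval_batch_size: 1
gradient_accumulation_steps: 8  # keeps the effective batch size up
deepspeed: deepspeed_configs/zero3_bf16.json  # higher ZeRO stage
```
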

For more detailed troubleshooting, see our [debugging guide](debugging.qmd).