---
title: "Multi-GPU"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-tools: true
execute:
  enabled: false
---

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

## Overview {#sec-overview}

Axolotl supports several methods for multi-GPU training:

- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- FSDP + QLoRA

## DeepSpeed {#sec-deepspeed}

DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It offers several levels of memory optimization through its ZeRO stages.

### Configuration {#sec-deepspeed-config}

Add the following to your YAML config:

```{.yaml}
deepspeed: deepspeed_configs/zero1.json
```

### Usage {#sec-deepspeed-usage}

```{.bash}
# Pass the DeepSpeed config via the YAML file
axolotl train config.yml

# Pass the DeepSpeed config via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```

### ZeRO Stages {#sec-zero-stages}

We provide default configurations for:

- ZeRO Stage 1 (`zero1.json`)
- ZeRO Stage 2 (`zero2.json`)
- ZeRO Stage 3 (`zero3.json`)

Choose a stage based on your memory requirements and performance needs.

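
As a rough guide, each ZeRO stage shards more training state across GPUs: Stage 1 partitions optimizer states, Stage 2 also partitions gradients, and Stage 3 also partitions the model parameters. Higher stages generally lower per-GPU memory use at the cost of more communication. A minimal sketch of switching to Stage 3, assuming the bundled `zero3.json` sits in the same `deepspeed_configs/` directory as the `zero1.json` shown in the configuration example:

```{.yaml}
# Assumption: zero3.json lives alongside zero1.json in deepspeed_configs/
deepspeed: deepspeed_configs/zero3.json
```

If training already fits in memory at a lower stage, staying there is often the simpler choice.
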

## FSDP {#sec-fsdp}

### Basic FSDP Configuration {#sec-fsdp-config}

```{.yaml}
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

### FSDP + QLoRA {#sec-fsdp-qlora}

For combining FSDP with QLoRA, see our [dedicated guide](fsdp_qlora.qmd).

## Performance Optimization {#sec-performance}

### Liger Kernel Integration {#sec-liger}

See the [custom integrations docs](custom_integrations.qmd#liger) for more information.

## Troubleshooting {#sec-troubleshooting}

### NCCL Issues {#sec-nccl}

For NCCL-related problems, see our [NCCL troubleshooting guide](nccl.qmd).

### Common Problems {#sec-common-problems}

::: {.panel-tabset}

## Memory Issues

- Reduce `micro_batch_size`
- Reduce `eval_batch_size`
- Adjust `gradient_accumulation_steps`
- Consider using a higher ZeRO stage

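
The first three settings interact: the effective batch size per optimizer step is typically `micro_batch_size` × `gradient_accumulation_steps` × number of GPUs, so lowering `micro_batch_size` while raising `gradient_accumulation_steps` by the same factor reduces per-GPU memory without changing the effective batch size. A sketch with illustrative values (assumptions, not recommendations):

```{.yaml}
# Illustrative values only; tune for your model and hardware.
micro_batch_size: 1             # fewer samples per GPU in each forward/backward pass
eval_batch_size: 1              # smaller evaluation batches
gradient_accumulation_steps: 8  # raised to compensate for the smaller micro batch
```
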

## Training Instability

- Start with DeepSpeed ZeRO-2
- Monitor loss values
- Check learning rates

:::

For more detailed troubleshooting, see our [debugging guide](debugging.qmd).