diff --git a/docs/nd_parallelism.qmd b/docs/nd_parallelism.qmd
index 9b3eae890..435e53e21 100644
--- a/docs/nd_parallelism.qmd
+++ b/docs/nd_parallelism.qmd
@@ -73,6 +73,10 @@ Note: We recommend FSDP. DeepSpeed is only compatible with `tensor_parallel_size
 
 ## Examples
 
+::: {.callout-tip}
+See our example configs [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/distributed-parallel).
+:::
+
 1. HSDP on 2 nodes with 4 GPUs each (8 GPUs total):
     - You want FSDP within each node and DDP across nodes.
     - Set `dp_shard_size: 4` and `dp_replicate_size: 2`.
diff --git a/examples/distributed-parallel/README.md b/examples/distributed-parallel/README.md
index 5aff54cd1..ad7c48d5f 100644
--- a/examples/distributed-parallel/README.md
+++ b/examples/distributed-parallel/README.md
@@ -1,8 +1,52 @@
-# Distributed Parallel
+# ND Parallelism Examples
 
-See the accompanying blog post: [Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training](https://huggingface.co/blog/accelerate-nd-parallel)
+This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies (FSDP, TP, CP, HSDP) for efficient multi-GPU training.
 
-The examples provided are suitable for single node (8xGPU) SFT.
+## Quick Start
 
-- Qwen 3 8B w/ FSDP + TP + CP: [YAML](./qwen3-8b-fsdp-tp-cp.yaml)
-- Llama 3.1 8B w/ HSDP + TP: [YAML](./llama-3_1-8b-hdsp-tp.yaml)
+1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
+
+2. Run one of the commands below:
+
+```bash
+# Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
+axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml
+
+# Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total)
+axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
+```
+
+## Example Configurations
+
+### Single Node (8 GPUs)
+
+**Qwen3 8B with FSDP + TP + CP** ([qwen3-8b-fsdp-tp-cp.yaml](./qwen3-8b-fsdp-tp-cp.yaml))
+- Uses all 3 parallelism dimensions on a single node
+- Ideal for: when model weights, activations, and/or context are too large to fit on a single GPU
+
+```yaml
+dp_shard_size: 2          # FSDP across 2 GPUs
+tensor_parallel_size: 2   # TP across 2 GPUs
+context_parallel_size: 2  # CP across 2 GPUs
+# Total: 2 × 2 × 2 = 8 GPUs
+```
+
+### Multi-Node
+
+**Llama 3.1 8B with HSDP + TP** ([llama-3_1-8b-hsdp-tp.yaml](./llama-3_1-8b-hsdp-tp.yaml))
+- FSDP & TP within nodes, DDP across nodes to minimize inter-node communication
+- Ideal for: scaling to multiple nodes while maintaining training efficiency
+
+```yaml
+dp_shard_size: 4        # FSDP within each 4-GPU group
+tensor_parallel_size: 2 # TP within each node
+dp_replicate_size: 2    # DDP across 2 groups
+# Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
+```
+
+## Learn More
+
+- [ND Parallelism Documentation](https://docs.axolotl.ai/docs/nd_parallelism.html)
+- [Blog: Accelerate ND-Parallel Guide](https://huggingface.co/blog/accelerate-nd-parallel)
+- [Multi-GPU Training Guide](https://docs.axolotl.ai/docs/multi-gpu.html)
+- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
diff --git a/examples/distributed-parallel/llama-3_1-8b-hdsp-tp.yaml b/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
similarity index 98%
rename from examples/distributed-parallel/llama-3_1-8b-hdsp-tp.yaml
rename to examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
index 5b3246f74..f10dc9bd2 100644
--- a/examples/distributed-parallel/llama-3_1-8b-hdsp-tp.yaml
+++ b/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
@@ -3,7 +3,7 @@ base_model: meta-llama/Llama-3.1-8B
 
 plugins:
   - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
 
-dp_shard_size: 2
+dp_shard_size: 4
 dp_replicate_size: 2
 tensor_parallel_size: 2
 # context_parallel_size: 2
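
For the two-node HSDP + TP configuration above, the README's `axolotl train` command still has to be started on each node through a distributed launcher that knows the node count and rendezvous address. The sketch below assumes an `accelerate launch` invocation; the main-node IP, port, and machine rank shown are placeholders to adapt to your cluster, and the exact launch mechanism may differ in your environment.

```bash
# Sketch of a 2-node x 8-GPU launch (16 processes total) for the HSDP + TP example.
# Run on every node; set --machine_rank to 0 on the main node and 1 on the second node.
# The IP address and port are placeholders for the main node's rendezvous endpoint.
accelerate launch \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  -m axolotl.cli.train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
```

On managed clusters these settings are usually supplied through an accelerate config file or the scheduler's environment rather than CLI flags, but the total process count must still equal `dp_shard_size × tensor_parallel_size × dp_replicate_size` (4 × 2 × 2 = 16 here).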