feat: update nd parallelism readme (#3039)

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 18:45:36 +07:00
parent c5e5aba547
commit 4273d5cf7e
3 changed files with 54 additions and 6 deletions
--- a/docs/nd_parallelism.qmd
+++ b/docs/nd_parallelism.qmd
@@ -73,6 +73,10 @@ Note: We recommend FSDP. DeepSpeed is only compatible with `tensor_parallel_size

 ## Examples

+::: {.callout-tip}
+See our example configs [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/distributed-parallel).
+:::
+
 1.  HSDP on 2 nodes with 4 GPUs each (8 GPUs total):
    - You want FSDP within each node and DDP across nodes.
    - Set `dp_shard_size: 4` and `dp_replicate_size: 2`.
--- a/examples/distributed-parallel/README.md
+++ b/examples/distributed-parallel/README.md
@@ -1,8 +1,52 @@
-# Distributed Parallel
+# ND Parallelism Examples

-See the accompanying blog post: [Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training](https://huggingface.co/blog/accelerate-nd-parallel)
+This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies (FSDP, TP, CP, HSDP) for efficient multi-GPU training.

-The examples provided are suitable for single node (8xGPU) SFT.
+## Quick Start

- Qwen 3 8B w/ FSDP + TP + CP: [YAML](./qwen3-8b-fsdp-tp-cp.yaml)
- Llama 3.1 8B w/ HSDP + TP: [YAML](./llama-3_1-8b-hdsp-tp.yaml)
+1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
+
+2. Run one of the commands below:
+
+```bash
+# Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
+axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml
+
+# Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total); requires a multi-node launch
+axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
+```
+
+## Example Configurations
+
+### Single Node (8 GPUs)
+
+**Qwen3 8B with FSDP + TP + CP** ([qwen3-8b-fsdp-tp-cp.yaml](./qwen3-8b-fsdp-tp-cp.yaml))
+- Combines three parallelism dimensions (FSDP, TP, and CP) on a single node
+- Ideal for: models whose weights, activations, and/or context are too large to fit on a single GPU
+
+```yaml
+dp_shard_size: 2         # FSDP across 2 GPUs
+tensor_parallel_size: 2  # TP across 2 GPUs
+context_parallel_size: 2 # CP across 2 GPUs
+# Total: 2 × 2 × 2 = 8 GPUs
+```
+
+### Multi-Node
+
+**Llama 3.1 8B with HSDP + TP** ([llama-3_1-8b-hsdp-tp.yaml](./llama-3_1-8b-hsdp-tp.yaml))
+- FSDP & TP within nodes, DDP across nodes to minimize inter-node communication
+- Ideal for: scaling to multiple nodes while maintaining training efficiency
+
+```yaml
+dp_shard_size: 4        # FSDP within each 4-GPU group
+tensor_parallel_size: 2 # TP within each node
+dp_replicate_size: 2    # DDP across 2 groups
+# Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
+```
+
+## Learn More
+
+- [ND Parallelism Documentation](https://docs.axolotl.ai/docs/nd_parallelism.html)
+- [Blog: Accelerate ND-Parallel Guide](https://huggingface.co/blog/accelerate-nd-parallel)
+- [Multi-GPU Training Guide](https://docs.axolotl.ai/docs/multi-gpu.html)
+- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
--- a/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
+++ b/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
@@ -3,7 +3,7 @@ base_model: meta-llama/Llama-3.1-8B
 plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

-dp_shard_size: 2
+dp_shard_size: 4
 dp_replicate_size: 2
 tensor_parallel_size: 2
 # context_parallel_size: 2
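
Note on the `dp_shard_size` change: the README comments in this commit state that the parallelism sizes multiply to the total GPU count (2 × 2 × 2 = 8 and (4 × 2) × 2 = 16). A minimal sketch of that arithmetic is below. The `required_gpus` helper is hypothetical and not part of Axolotl; it also assumes that any dimension left out of a config counts as 1, which may not match how Axolotl resolves unset values.

```python
from math import prod


def required_gpus(
    dp_shard_size: int = 1,
    dp_replicate_size: int = 1,
    tensor_parallel_size: int = 1,
    context_parallel_size: int = 1,
) -> int:
    """Return the GPU count implied by the parallelism sizes.

    Hypothetical helper for illustration only. Assumes an unset
    dimension behaves as size 1.
    """
    dims = (
        dp_shard_size,
        dp_replicate_size,
        tensor_parallel_size,
        context_parallel_size,
    )
    if any(d < 1 for d in dims):
        raise ValueError("every parallelism size must be >= 1")
    # The device mesh has one axis per dimension, so the sizes multiply.
    return prod(dims)


# Values taken from the two example configs in this commit.
qwen = required_gpus(dp_shard_size=2, tensor_parallel_size=2, context_parallel_size=2)
llama = required_gpus(dp_shard_size=4, dp_replicate_size=2, tensor_parallel_size=2)
print(qwen, llama)
```

With the previous `dp_shard_size: 2`, the same product for the Llama config would be 2 × 2 × 2 = 8, which does not match the 16 GPUs described in the README.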