feat: update nd parallelism readme (#3039)

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 18:45:36 +07:00
parent c5e5aba547
commit 4273d5cf7e
3 changed files with 54 additions and 6 deletions
--- a/docs/nd_parallelism.qmd
+++ b/docs/nd_parallelism.qmd
@@ -73,6 +73,10 @@ Note: We recommend FSDP. DeepSpeed is only compatible with `tensor_parallel_size
 ## Examples
 ::: {.callout-tip}
 See our example configs [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/distributed-parallel).
 :::
 1.  HSDP on 2 nodes with 4 GPUs each (8 GPUs total):
    - You want FSDP within each node and DDP across nodes.
    - Set `dp_shard_size: 4` and `dp_replicate_size: 2`.
--- a/examples/distributed-parallel/README.md
+++ b/examples/distributed-parallel/README.md
@@ -1,8 +1,52 @@
-# Distributed Parallel
+# ND Parallelism Examples
-See the accompanying blog post: [Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training](https://huggingface.co/blog/accelerate-nd-parallel)
+This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies (FSDP, TP, CP, HSDP) for efficient multi-GPU training.
-The examples provided are suitable for single node (8xGPU) SFT.
+## Quick Start
- Qwen 3 8B w/ FSDP + TP + CP: [YAML](./qwen3-8b-fsdp-tp-cp.yaml)
+1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
- Llama 3.1 8B w/ HSDP + TP: [YAML](./llama-3_1-8b-hdsp-tp.yaml)
+
 2. Run one of the commands below, depending on your hardware:
 ```bash
 # Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
 axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml
 # Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total)
 axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
 ```
 ## Example Configurations
 ### Single Node (8 GPUs)
 **Qwen3 8B with FSDP + TP + CP** ([qwen3-8b-fsdp-tp-cp.yaml](./qwen3-8b-fsdp-tp-cp.yaml))
 - Uses three parallelism dimensions (FSDP, TP, and CP) on a single node
 - Ideal for: cases where model weights, activations, and/or context are too large to fit on a single GPU
 ```yaml
 dp_shard_size: 2         # FSDP across 2 GPUs
 tensor_parallel_size: 2  # TP across 2 GPUs
 context_parallel_size: 2 # CP across 2 GPUs
 # Total: 2 × 2 × 2 = 8 GPUs
 ```
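The GPU count in the comment above is the product of the parallelism sizes. A minimal Python sketch of that arithmetic follows. The helper `required_gpus` and the choice to treat unset dimensions as size 1 are assumptions made for illustration; this is not Axolotl's own validation code.

```python
from math import prod

# Config keys taken from the examples in this README.
PARALLEL_KEYS = (
    "dp_shard_size",
    "dp_replicate_size",
    "tensor_parallel_size",
    "context_parallel_size",
)


def required_gpus(config: dict) -> int:
    """Multiply the parallelism sizes; unset dimensions count as 1 (assumption)."""
    return prod(int(config.get(key, 1)) for key in PARALLEL_KEYS)


# Values from the Qwen3 8B FSDP + TP + CP example.
qwen3_config = {
    "dp_shard_size": 2,
    "tensor_parallel_size": 2,
    "context_parallel_size": 2,
}

print(required_gpus(qwen3_config))  # 2 * 2 * 2 = 8
```

Comparing this product against the number of GPUs you plan to launch on is a quick sanity check before starting a run.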
 ### Multi-Node
 **Llama 3.1 8B with HSDP + TP** ([llama-3_1-8b-hsdp-tp.yaml](./llama-3_1-8b-hsdp-tp.yaml))
 - FSDP & TP within nodes, DDP across nodes to minimize inter-node communication
 - Ideal for: scaling to multiple nodes while maintaining training efficiency
 ```yaml
 dp_shard_size: 4        # FSDP within each 4-GPU group
 tensor_parallel_size: 2 # TP within each node
 dp_replicate_size: 2    # DDP across 2 groups
 # Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
 ```
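To see why this layout keeps FSDP and TP traffic inside a node, the sketch below maps each global rank to a (replicate, shard, TP) coordinate. It assumes 8 GPUs per node, contiguous rank numbering within a node, and a mesh ordering with `dp_replicate` outermost and TP innermost. The helper `rank_to_coords` is hypothetical, and the real device mesh ordering may differ.

```python
GPUS_PER_NODE = 8  # assumption: 2 nodes with 8 GPUs each


def rank_to_coords(rank, dp_replicate_size=2, dp_shard_size=4, tensor_parallel_size=2):
    """Map a global rank to (replicate, shard, tp) indices, with TP varying fastest."""
    world_size = dp_replicate_size * dp_shard_size * tensor_parallel_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    rest, tp = divmod(rank, tensor_parallel_size)
    replicate, shard = divmod(rest, dp_shard_size)
    return replicate, shard, tp


# Under these assumptions, every rank in a replica sits on the same node,
# so only the DDP (replicate) dimension crosses the node boundary.
for rank in range(16):
    replicate, shard, tp = rank_to_coords(rank)
    node = rank // GPUS_PER_NODE
    print(f"rank {rank:2d} -> node {node}, replicate {replicate}, shard {shard}, tp {tp}")
```

With these sizes each replica spans 4 × 2 = 8 ranks, which is exactly one node, so sharding and tensor-parallel collectives stay on intra-node links.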
 ## Learn More
 - [ND Parallelism Documentation](https://docs.axolotl.ai/docs/nd_parallelism.html)
 - [Blog: Accelerate ND-Parallel Guide](https://huggingface.co/blog/accelerate-nd-parallel)
 - [Multi-GPU Training Guide](https://docs.axolotl.ai/docs/multi-gpu.html)
 - [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
--- a/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
+++ b/examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
@@ -3,7 +3,7 @@ base_model: meta-llama/Llama-3.1-8B
 plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
-dp_shard_size: 2
+dp_shard_size: 4
 dp_replicate_size: 2
 tensor_parallel_size: 2
 # context_parallel_size: 2