feat: update nd parallelism readme (#3039)
Co-authored-by: salman <salman.mohammadi@outlook.com>
@@ -73,6 +73,10 @@ Note: We recommend FSDP. DeepSpeed is only compatible with `tensor_parallel_size
 
 ## Examples
 
+::: {.callout-tip}
+See our example configs [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/distributed-parallel).
+:::
+
 1. HSDP on 2 nodes with 4 GPUs each (8 GPUs total):
 - You want FSDP within each node and DDP across nodes.
 - Set `dp_shard_size: 4` and `dp_replicate_size: 2`.
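Reviewer note on the HSDP example in the hunk above: the sketch below shows one way to picture which ranks end up in the same FSDP (shard) group and which in the same DDP (replicate) group for `dp_shard_size: 4` and `dp_replicate_size: 2`. It assumes a row-major `(replicate, shard)` mesh with ranks numbered contiguously per node (0-3 on node 0, 4-7 on node 1). The helper name `hsdp_groups` is hypothetical and not part of Axolotl; the real layout is decided by the launcher and the device mesh.

```python
def hsdp_groups(dp_replicate_size: int, dp_shard_size: int):
    """Return (shard_groups, replicate_groups) as lists of global ranks.

    Assumption (illustrative only): a row-major mesh whose outer dimension
    is the replicate (DDP) axis and whose inner dimension is the shard
    (FSDP) axis, so consecutive ranks share a shard group.
    """
    mesh = [
        [r * dp_shard_size + s for s in range(dp_shard_size)]
        for r in range(dp_replicate_size)
    ]
    shard_groups = mesh                                    # rows: FSDP within a node
    replicate_groups = [list(col) for col in zip(*mesh)]   # columns: DDP across nodes
    return shard_groups, replicate_groups


shard_groups, replicate_groups = hsdp_groups(dp_replicate_size=2, dp_shard_size=4)
print("FSDP groups:", shard_groups)
print("DDP groups: ", replicate_groups)
```

Under these assumptions each FSDP group stays inside one node, and each DDP group pairs the same local rank across the two nodes, which matches the intent stated in the docs ("FSDP within each node and DDP across nodes").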
@@ -1,8 +1,52 @@
-# Distributed Parallel
+# ND Parallelism Examples
 
-See the accompanying blog post: [Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training](https://huggingface.co/blog/accelerate-nd-parallel)
+This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies (FSDP, TP, CP, HSDP) for efficient multi-GPU training.
 
-The examples provided are suitable for single node (8xGPU) SFT.
+## Quick Start
 
-- Qwen 3 8B w/ FSDP + TP + CP: [YAML](./qwen3-8b-fsdp-tp-cp.yaml)
-- Llama 3.1 8B w/ HSDP + TP: [YAML](./llama-3_1-8b-hdsp-tp.yaml)
+1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
+
+2. Run the command below:
+
+```bash
+# Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
+axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml
+
+# Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total)
+axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
+```
+
+## Example Configurations
+
+### Single Node (8 GPUs)
+
+**Qwen3 8B with FSDP + TP + CP** ([qwen3-8b-fsdp-tp-cp.yaml](./qwen3-8b-fsdp-tp-cp.yaml))
+- Uses all 3 parallelism dimensions on a single node
+- Ideal for: when model weights, activations, and/or context are too large to fit on single GPU
+
+```yaml
+dp_shard_size: 2 # FSDP across 2 GPUs
+tensor_parallel_size: 2 # TP across 2 GPUs
+context_parallel_size: 2 # CP across 2 GPUs
+# Total: 2 × 2 × 2 = 8 GPUs
+```
+
+### Multi-Node
+
+**Llama 3.1 8B with HSDP + TP** ([llama-3_1-8b-hsdp-tp.yaml](./llama-3_1-8b-hsdp-tp.yaml))
+- FSDP & TP within nodes, DDP across nodes to minimize inter-node communication
+- Ideal for: Scaling to multiple nodes while maintaining training efficiency
+
+```yaml
+dp_shard_size: 4 # FSDP within each 4-GPU group
+tensor_parallel_size: 2 # TP within each node
+dp_replicate_size: 2 # DDP across 2 groups
+# Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
+```
+
+## Learn More
+
+- [ND Parallelism Documentation](https://docs.axolotl.ai/docs/nd_parallelism.html)
+- [Blog: Accelerate ND-Parallel Guide](https://huggingface.co/blog/accelerate-nd-parallel)
+- [Multi-GPU Training Guide](https://docs.axolotl.ai/docs/multi-gpu.html)
+- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
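Reviewer note on the README hunk above: both example configs state their GPU totals in comments (`2 × 2 × 2 = 8` and `(4 × 2) × 2 = 16`). The arithmetic can be checked with a minimal sketch like the one below. The helper `required_gpus` is hypothetical, and treating an unset dimension as size 1 is an assumption of this sketch rather than confirmed Axolotl behavior.

```python
from math import prod

# Parallelism keys that appear in the example configs of this commit.
PARALLEL_KEYS = (
    "dp_shard_size",
    "dp_replicate_size",
    "tensor_parallel_size",
    "context_parallel_size",
)


def required_gpus(cfg: dict) -> int:
    """Multiply the parallelism sizes; a missing key counts as 1 (assumption)."""
    return prod(cfg.get(key, 1) for key in PARALLEL_KEYS)


# Values copied from the two YAML snippets in the README.
qwen3_fsdp_tp_cp = {
    "dp_shard_size": 2,
    "tensor_parallel_size": 2,
    "context_parallel_size": 2,
}
llama_hsdp_tp = {
    "dp_shard_size": 4,
    "tensor_parallel_size": 2,
    "dp_replicate_size": 2,
}

print("qwen3 GPUs:", required_gpus(qwen3_fsdp_tp_cp))
print("llama GPUs:", required_gpus(llama_hsdp_tp))
```

A check of this kind is handy before launching, since a product that does not match the number of available GPUs means the config and the cluster disagree.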
@@ -3,7 +3,7 @@ base_model: meta-llama/Llama-3.1-8B
 plugins:
 - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
 
-dp_shard_size: 2
+dp_shard_size: 4
 dp_replicate_size: 2
 tensor_parallel_size: 2
 # context_parallel_size: 2