# ND Parallelism Examples
This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies for efficient multi-GPU training: fully sharded data parallel (FSDP), tensor parallel (TP), context parallel (CP), and hybrid sharded data parallel (HSDP).
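
The dimensions compose multiplicatively: the product of all enabled sizes must equal the total number of GPUs (the world size). Below is a minimal sketch using the option names from the examples in this directory; defaults for omitted keys may vary by Axolotl version, so treat it as illustrative rather than a complete config:

```yaml
# world_size = dp_replicate_size × dp_shard_size × tensor_parallel_size × context_parallel_size
dp_replicate_size: 1      # HSDP replication (DDP-style) across shard groups
dp_shard_size: 2          # FSDP sharding within each replica group
tensor_parallel_size: 2   # shards weights within attention/MLP layers
context_parallel_size: 2  # shards the sequence (context) dimension
# 1 × 2 × 2 × 2 = 8 GPUs
```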
## Quick Start
1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
2. Run one of the commands below:
```bash
# Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml
# Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total)
axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
```
## Example Configurations
### Single Node (8 GPUs)
**Qwen3 8B with FSDP + TP + CP** ([qwen3-8b-fsdp-tp-cp.yaml](./qwen3-8b-fsdp-tp-cp.yaml))
- Uses all 3 parallelism dimensions on a single node
- Ideal for: models whose weights, activations, and/or context are too large to fit on a single GPU
```yaml
dp_shard_size: 2 # FSDP across 2 GPUs
tensor_parallel_size: 2 # TP across 2 GPUs
context_parallel_size: 2 # CP across 2 GPUs
# Total: 2 × 2 × 2 = 8 GPUs
```
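
The split is flexible as long as the product still equals the GPU count. For instance, if your sequence lengths fit in memory without context parallelism, a hypothetical variant (not an included example file) could reassign those GPUs to FSDP:

```yaml
dp_shard_size: 4        # FSDP across 4 GPUs
tensor_parallel_size: 2 # TP across 2 GPUs
# Total: 4 × 2 = 8 GPUs
```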
### Multi-Node
**Llama 3.1 8B with HSDP + TP** ([llama-3_1-8b-hsdp-tp.yaml](./llama-3_1-8b-hsdp-tp.yaml))
- FSDP & TP within nodes, DDP across nodes to minimize inter-node communication
- Ideal for: scaling to multiple nodes while maintaining training efficiency
```yaml
dp_shard_size: 4 # FSDP within each 4-GPU group
tensor_parallel_size: 2 # TP within each node
dp_replicate_size: 2 # DDP across 2 groups
# Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
```
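
How you launch across nodes depends on your cluster. As a hedged sketch only, assuming a torchrun-style launcher (`MASTER_ADDR` and `NODE_RANK` are placeholders you would set per node; adapt to your scheduler, or use `accelerate launch` instead):

```bash
# Run on each of the 2 nodes; NODE_RANK is 0 on the rendezvous host, 1 on the other.
torchrun --nnodes 2 --nproc-per-node 8 \
  --node-rank "$NODE_RANK" \
  --rdzv-backend c10d --rdzv-endpoint "$MASTER_ADDR:29500" \
  -m axolotl.cli.train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
```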
## Learn More
- [ND Parallelism Documentation](https://docs.axolotl.ai/docs/nd_parallelism.html)
- [Blog: Accelerate ND-Parallel Guide](https://huggingface.co/blog/accelerate-nd-parallel)
- [Multi-GPU Training Guide](https://docs.axolotl.ai/docs/multi-gpu.html)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)