# ND Parallelism Examples

This directory contains example configurations for training models using ND Parallelism in Axolotl. These examples demonstrate how to compose different parallelism strategies — Fully Sharded Data Parallel (FSDP), Tensor Parallel (TP), Context Parallel (CP), and Hybrid Sharded Data Parallel (HSDP) — for efficient multi-GPU training.

## Quick Start

  1. Install Axolotl following the installation guide.

2. Run one of the commands below:

```bash
# Train Qwen3 8B with FSDP + TP + CP on a single 8-GPU node
axolotl train examples/distributed-parallel/qwen3-8b-fsdp-tp-cp.yaml

# Train Llama 3.1 8B with HSDP + TP on 2 nodes (16 GPUs total)
axolotl train examples/distributed-parallel/llama-3_1-8b-hsdp-tp.yaml
```

## Example Configurations

### Single Node (8 GPUs)

#### Qwen3 8B with FSDP + TP + CP (`qwen3-8b-fsdp-tp-cp.yaml`)

- Uses all three parallelism dimensions on a single node
- Ideal when model weights, activations, and/or context are too large to fit on a single GPU

```yaml
dp_shard_size: 2         # FSDP across 2 GPUs
tensor_parallel_size: 2  # TP across 2 GPUs
context_parallel_size: 2 # CP across 2 GPUs
# Total: 2 × 2 × 2 = 8 GPUs
```
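The parallelism degrees compose multiplicatively: their product must equal the number of GPUs in the job. A minimal sketch of that sanity check (the helper name is hypothetical — Axolotl performs an equivalent validation internally):

```python
# Hypothetical helper illustrating how ND parallelism dims compose.
# Every dimension multiplies the required device count.
def required_world_size(dp_shard: int = 1, tensor_parallel: int = 1,
                        context_parallel: int = 1, dp_replicate: int = 1) -> int:
    """Total GPUs needed for the given parallelism degrees."""
    return dp_shard * tensor_parallel * context_parallel * dp_replicate

# The single-node example above: FSDP=2, TP=2, CP=2 -> 8 GPUs
assert required_world_size(dp_shard=2, tensor_parallel=2, context_parallel=2) == 8
```

If the product does not match the launched world size, training cannot partition the devices and will fail at startup, so it is worth checking before submitting a job.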

### Multi-Node

#### Llama 3.1 8B with HSDP + TP (`llama-3_1-8b-hsdp-tp.yaml`)

- FSDP and TP within nodes, DDP-style replication across nodes to minimize inter-node communication
- Ideal for scaling to multiple nodes while maintaining training efficiency

```yaml
dp_shard_size: 4        # FSDP within each 4-GPU group
tensor_parallel_size: 2 # TP within each node
dp_replicate_size: 2    # replication (DDP) across 2 groups
# Total: (4 × 2) × 2 = 16 GPUs (2 nodes)
```
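To see why this keeps heavy traffic inside each node, it helps to map the 16 global ranks onto the mesh dimensions. The toy sketch below assumes a row-major (replicate, shard, tp) layout; the real assignment is decided by PyTorch's device mesh construction and may differ:

```python
# Toy illustration of mapping 16 global ranks onto an HSDP + TP mesh
# with dims (dp_replicate=2, dp_shard=4, tp=2). Row-major layout is an
# assumption, not Axolotl's guaranteed ordering.
DP_REPLICATE, DP_SHARD, TP = 2, 4, 2

def mesh_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (replicate_group, shard_index, tp_index)."""
    tp_idx = rank % TP
    shard_idx = (rank // TP) % DP_SHARD
    rep_idx = rank // (TP * DP_SHARD)
    return rep_idx, shard_idx, tp_idx

# Under this layout, ranks 0-7 form replicate group 0 (node 0) and
# ranks 8-15 form replicate group 1 (node 1): FSDP all-gathers and TP
# collectives stay within a node, and only the gradient all-reduce
# between the two replicate groups crosses the node boundary.
```

This is why HSDP is attractive for multi-node jobs: the bandwidth-hungry sharding collectives run over fast intra-node links, while the slower inter-node network carries only replica synchronization.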

## Learn More