add 120b and deepspeed zero3 examples (#3035) [skip ci]

* add 120b and deepspeed zero3 examples

* add a bit of flavor and cleanup gpt oss readme

* fix: remove expert vram usage

* fix: remove redundant EOS token from eot_tokens

* feat: add 120B to docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Author: Wing Lian
Date: 2025-08-08 08:04:56 -04:00
Committed by GitHub
parent eb2c87b525
commit 50f2b94d50
6 changed files with 141 additions and 10 deletions

@@ -16,11 +16,10 @@ pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
```
-2. Choose one of the following configs below for training the 20B model.
+2. Choose one of the following configs below for training the 20B model. (for 120B, see [below](#training-120b))
```bash
-# LoRA SFT linear layers & 2 experts (1x48GB @ ~47GiB)
-# (only linear layers @ ~44GiB)
+# LoRA SFT linear layers (1x48GB @ ~44GiB)
axolotl train examples/gpt-oss/gpt-oss-20b-sft-lora-singlegpu.yaml
# FFT SFT with offloading (2x24GB @ ~21GiB/GPU)
@@ -30,9 +29,16 @@ axolotl train examples/gpt-oss/gpt-oss-20b-fft-fsdp2-offload.yaml
axolotl train examples/gpt-oss/gpt-oss-20b-fft-fsdp2.yaml
```
-Notes:
-- 120B coming soon!
-- Memory usage taken from `device_mem_reserved(gib)` from logs.
+Note: Memory usage taken from `device_mem_reserved(gib)` from logs.
+### Training 120B
+On 8xH100s
+```bash
+# FFT SFT with offloading (8x80GB @ ~49GiB/GPU)
+axolotl train examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
+```
### Tool use