add 120b and deepspeed zero3 examples (#3035) [skip ci]

* add 120b and deepspeed zero3 examples
* add a bit of flavor and cleanup gpt oss readme
* fix: remove expert vram usage
* fix: remove redundant EOS token from eot_tokens
* feat: add 120B to docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-08 08:04:56 -04:00
parent eb2c87b525
commit 50f2b94d50
6 changed files with 141 additions and 10 deletions
--- a/examples/gpt-oss/README.md
+++ b/examples/gpt-oss/README.md
@@ -16,11 +16,10 @@ pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
 ```

-2. Choose one of the following configs below for training the 20B model.
+2. Choose one of the following configs for training the 20B model (for 120B, see [below](#training-120b)).

 ```bash
-# LoRA SFT linear layers & 2 experts (1x48GB @ ~47GiB)
-# (only linear layers @ ~44GiB)
+# LoRA SFT linear layers (1x48GB @ ~44GiB)
 axolotl train examples/gpt-oss/gpt-oss-20b-sft-lora-singlegpu.yaml

 # FFT SFT with offloading (2x24GB @ ~21GiB/GPU)
@@ -30,9 +29,16 @@ axolotl train examples/gpt-oss/gpt-oss-20b-fft-fsdp2-offload.yaml
 axolotl train examples/gpt-oss/gpt-oss-20b-fft-fsdp2.yaml
 ```

-Notes:
- 120B coming soon!
- Memory usage taken from `device_mem_reserved(gib)` from logs.
+Note: Memory usage is taken from `device_mem_reserved(gib)` in the logs.
+
+### Training 120B
+
+On 8xH100s
+
+```bash
+# FFT SFT with offloading (8x80GB @ ~49GiB/GPU)
+axolotl train examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
+```

 ### Tool use