[GPT-OSS] improve FSDP shard merging and documentation for GPT-OSS (#3073)

* improve fsdp shard merging
* improve logging
* update information on merging and inferencing GPT-OSS
* cleanup readme
* automate cleanup of FSDP prefix
* import GRPO only if necessary
* only modify config.json on rank0
* merge final checkpoint at end of training
* prevent circular import
* Fix saving for sharded state dict
* devx, move merged to output dir
* move import back to top
* Fix stuck merge
* fix conditionals from pr feedback and add test
2025-08-15 21:25:01 -04:00
parent 130ef7c51a
commit ecbe8b2b61
7 changed files with 170 additions and 34 deletions
--- a/examples/gpt-oss/README.md
+++ b/examples/gpt-oss/README.md
@@ -33,13 +33,44 @@ Note: Memory usage taken from `device_mem_reserved(gib)` from logs.

 ### Training 120B

-On 8xH100s
+On 8xH100s, make sure you have ~3TB of free disk space. Each checkpoint takes ~720GB, so keeping 2 checkpoints
+alongside the base model and the final model output requires roughly that much.

 ```bash
 # FFT SFT with offloading (8x80GB @ ~49GiB/GPU)
 axolotl train examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
 ```

+ERRATA: Transformers saves the model architecture name with an `FSDP` prefix, which must be manually renamed in `config.json`.
+See https://github.com/huggingface/transformers/pull/40207 for the status of this issue.
+
+```bash
+sed -i 's/FSDPGptOssForCausalLM/GptOssForCausalLM/g' ./outputs/gpt-oss-out/config.json
+```
+
+When using SHARDED_STATE_DICT with FSDP, the sharded weights of the final checkpoint should be merged automatically
+into your configured `output_dir`. However, if that step fails due to a disk space error, you can merge the sharded
+weights manually with the command below. It automatically determines the last checkpoint directory and writes the
+merged weights to `{output_dir}/merged`.
+
+```bash
+axolotl merge-sharded-fsdp-weights examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
+mv ./outputs/gpt-oss-out/merged/* ./outputs/gpt-oss-out/
+```
+
+
+### Inferencing your fine-tuned model
+
+GPT-OSS support in vLLM does not exist in a stable release yet. See https://x.com/MaziyarPanahi/status/1955741905515323425
+for more information about using a special vllm-openai docker image for inferencing with vLLM.
+
+SGLang has 0-day support in main. See https://github.com/sgl-project/sglang/issues/8833 for information on installing
+SGLang from source. Once you've installed SGLang, run the following command to launch an SGLang server:
+
+```bash
+python3 -m sglang.launch_server --model ./outputs/gpt-oss-out/ --served-model-name axolotl/gpt-oss-120b --host 0.0.0.0 --port 8888 --tp 8
+```
+
 ### Tool use

 GPT-OSS has a comprehensive tool understanding. Axolotl supports tool calling datasets for Supervised Fine-tuning.
--- a/examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
+++ b/examples/gpt-oss/gpt-oss-120b-fft-fsdp2-offload.yaml
@@ -20,6 +20,7 @@ datasets:
 dataset_prepared_path: last_run_prepared
 val_set_size: 0
 output_dir: ./outputs/gpt-oss-out/
+save_total_limit: 2  # the 120B model can use up to 720GB of disk space per checkpoint, so let's only keep the last 2

 sequence_len: 4096
 sample_packing: true
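
The disk budget behind `save_total_limit: 2` can be checked with a small preflight script before launching a run. This is a minimal sketch, not part of the PR: the function names are hypothetical, the sizes are the approximate figures quoted in the README (~720GB per checkpoint, ~3TB recommended), and decimal gigabytes are assumed. The base model and final output sizes are not stated in the source, so the script only reports how much space is left for them.

```python
import shutil

# Approximate figures quoted in the README and YAML; decimal GB is an assumption.
CHECKPOINT_GB = 720          # size of one 120B checkpoint
SAVE_TOTAL_LIMIT = 2         # save_total_limit from the example config
RECOMMENDED_FREE_GB = 3000   # "~3TB" recommendation


def checkpoint_footprint_gb(checkpoint_gb=CHECKPOINT_GB, keep=SAVE_TOTAL_LIMIT):
    """Disk space taken by the retained checkpoints alone."""
    return checkpoint_gb * keep


def headroom_gb(free_gb, checkpoint_gb=CHECKPOINT_GB, keep=SAVE_TOTAL_LIMIT):
    """Space left for the base model, final output, and anything else."""
    return free_gb - checkpoint_footprint_gb(checkpoint_gb, keep)


def free_disk_gb(path="."):
    """Free space on the filesystem that holds `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    free = free_disk_gb(".")
    print(f"free: {free:.0f} GB, retained checkpoints: {checkpoint_footprint_gb()} GB")
    if free < RECOMMENDED_FREE_GB:
        print(f"warning: below the ~{RECOMMENDED_FREE_GB} GB recommendation")
    print(f"headroom after checkpoints: {headroom_gb(free):.0f} GB")
```

With the README's numbers, two retained checkpoints take 1440GB, which leaves about 1560GB of a 3TB budget for the base model and the final model output.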