# GRPO — Agent Reference

Online RL with verifiable reward functions. For the full config reference, async features, and scaling, see `grpo.qmd`. For vLLM setup, see `vllm_serving.qmd`.

## Architecture

```
Terminal 1 (GPU 0)                    Terminal 2 (GPU 1)
┌──────────────────────┐              ┌──────────────────────────────────┐
│  vLLM Server         │   HTTP       │  Trainer                         │
│  Serves base model   │◄────────────►│  1. Send prompts to vLLM         │
│  + LoRA adapter      │  /generate   │  2. Score completions (rewards)  │
│                      │  /set_lora   │  3. Compute advantages           │
│  Punica kernels for  │              │  4. PPO-clip gradient update     │
│  LoRA inference      │              │  5. Sync LoRA weights to vLLM    │
└──────────────────────┘              └──────────────────────────────────┘
```
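A typical launch matching the diagram. Pinning GPUs with `CUDA_VISIBLE_DEVICES` is illustrative; both commands read the same config:

```bash
# Terminal 1: serve the model for generation (GPU 0)
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve config.yaml

# Terminal 2: run the GRPO trainer (GPU 1)
CUDA_VISIBLE_DEVICES=1 axolotl train config.yaml
```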

## Components Required

1. A YAML config with `rl: grpo` (minimal sketch below)
2. A reward module (a Python file with reward functions)
3. A running vLLM server (`axolotl vllm-serve config.yaml`)
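A minimal config sketch for item 1, assuming the reward settings live under the `trl:` block noted in the File Map; values are illustrative, and `grpo.qmd` has the authoritative field list:

```yaml
rl: grpo
trl:
  use_vllm: true              # generate completions via the vLLM server
  num_generations: 4          # completions sampled per prompt (one group)
  max_completion_length: 256  # illustrative
  reward_funcs:
    - rewards.my_reward       # module.function path to a reward function
```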

## Reward Function Signature

```python
def my_reward(completions, **kwargs) -> list[float]:
    # completions[i][0]["content"] is the text of the i-th completion
    # **kwargs carries dataset columns not removed by the transform
    # Example scoring: 1.0 when the completion contains an answer marker
    return [1.0 if "####" in c[0]["content"] else 0.0 for c in completions]
```

Multiple rewards: `reward_funcs: [r1, r2]` with `reward_weights: [1.0, 0.5]`.
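As a concrete sketch (the module paths are placeholders, and nesting under `trl:` is an assumption):

```yaml
trl:
  reward_funcs:
    - rewards.correctness     # placeholder module.function paths
    - rewards.format_check
  reward_weights: [1.0, 0.5]  # correctness weighted 2x over formatting
```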

## Key Async Features

| Feature | Config | Purpose |
|---|---|---|
| Async prefetch | `async_prefetch: true` | Overlap generation with training |
| LoRA sync | `vllm_lora_sync: true` | Fast adapter sync via filesystem |
| Streaming scoring | `streaming_partial_batch: true` | Score one group at a time |
| Zero-adv skip | `skip_zero_advantage_batches: true` | Skip batches with no learning signal |
| Replay buffer | `replay_buffer_size: 100` | Cache high-signal groups |
| IS correction | `vllm_importance_sampling_correction: true` | Fix off-policy distribution shift |
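The keys combine freely; a copy-paste starting point using the values from the table (where these keys nest in the config is an assumption here, check `grpo.qmd` for the authoritative placement):

```yaml
async_prefetch: true
vllm_lora_sync: true
streaming_partial_batch: true
skip_zero_advantage_batches: true
replay_buffer_size: 100
vllm_importance_sampling_correction: true
```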

## Health Checks

- `rewards/*/mean` > 0.15 within 20 steps (else: test the reward function standalone; see the sketch below)
- `reward_std` > 0 on most steps (else: no learning signal)
- entropy 0.05-0.5 (< 0.01 = mode collapse)
- `grad_norm` 0.001-1.0 (> 10 = unstable, 0.0 = zero-advantage skip)

See `training_stability.qmd` for detailed diagnostics.
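To test a reward function standalone (the first health check above), call it with a hand-built batch before wiring it into training. A sketch; `rewards.my_reward`, the sample texts, and the `answer` column are placeholders:

```python
from rewards import my_reward  # placeholder module named in the config

# Two fake completions in the conversational shape the trainer passes in
completions = [
    [{"role": "assistant", "content": "The answer is #### 42"}],
    [{"role": "assistant", "content": "I am not sure."}],
]
# Keyword args stand in for dataset columns surviving the transform
scores = my_reward(completions, answer=["42", "42"])

assert len(scores) == len(completions)
assert all(isinstance(s, float) for s in scores)
print(scores)  # expect a spread of values, not all zeros
```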

## File Map

```
src/axolotl/
  cli/train.py                     # Entry point
  cli/vllm_serve.py                # Entry point for vLLM server
  core/trainers/grpo/
    trainer.py                     # AxolotlGRPOTrainer
    sampler.py                     # Sampling utilities
  core/builders/rl.py              # HFRLTrainerBuilder — routes rl type → trainer
  scripts/vllm_serve_lora.py       # vLLM serve script with LoRA sync support
  utils/schemas/trl.py             # TRL config schema (all trl: options)

docs/grpo.qmd                      # Full user docs: async, rewards, scaling, config reference
docs/vllm_serving.qmd              # vLLM server modes, LoRA sync, weight sync
```