# GRPO — Agent Reference

Online RL with verifiable reward functions. For the full config reference, async features, and scaling, see `grpo.qmd`. For vLLM setup, see `vllm_serving.qmd`.
## Architecture

```
Terminal 1 (GPU 0)                   Terminal 2 (GPU 1)
┌──────────────────────┐             ┌──────────────────────────────────┐
│  vLLM Server         │    HTTP     │  Trainer                         │
│  Serves base model   │◄───────────►│  1. Send prompts to vLLM         │
│  + LoRA adapter      │  /generate  │  2. Score completions (rewards)  │
│                      │  /set_lora  │  3. Compute advantages           │
│  Punica kernels for  │             │  4. PPO-clip gradient update     │
│  LoRA inference      │             │  5. Sync LoRA weights to vLLM    │
└──────────────────────┘             └──────────────────────────────────┘
```
## Components Required

- A YAML config with `rl: grpo`
- A reward module (Python file with reward functions)
- A running vLLM server (`axolotl vllm-serve config.yaml`)
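A minimal sketch of such a config. Only `rl: grpo` and the reward fields come from this doc; `base_model` and the `trl:` keys shown here are assumptions based on typical axolotl/TRL configs — see `grpo.qmd` for the authoritative reference:

```yaml
# Illustrative sketch, not a complete config — see grpo.qmd
base_model: Qwen/Qwen2.5-0.5B-Instruct  # hypothetical model choice
rl: grpo
trl:
  use_vllm: true                         # assumed key; generation goes to the vLLM server
  reward_funcs: ["rewards.my_reward"]    # module.function path to your reward
  reward_weights: [1.0]
```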
## Reward Function Signature

```python
def my_reward(completions, **kwargs) -> list[float]:
    # completions[i][0]["content"] = text of i-th completion
    # **kwargs contains dataset columns not removed by transform
    return [score_for_each_completion]
```

Multiple rewards: `reward_funcs: [r1, r2]` with `reward_weights: [1.0, 0.5]`.
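A concrete (and purely illustrative) reward under this signature — a hypothetical length penalty; the function name and scoring rule are examples, not part of axolotl:

```python
def length_reward(completions, **kwargs) -> list[float]:
    """Hypothetical reward: 1.0 for completions up to 200 chars, scaled down beyond that."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]  # per the signature above
        scores.append(1.0 if len(text) <= 200 else 200.0 / len(text))
    return scores
```

Keep rewards deterministic and fast; they run on every generated group, and a silently broken reward (all zeros) produces no learning signal.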
## Key Async Features

| Feature | Config | Purpose |
|---|---|---|
| Async prefetch | `async_prefetch: true` | Overlap generation with training |
| LoRA sync | `vllm_lora_sync: true` | Fast adapter sync via filesystem |
| Streaming scoring | `streaming_partial_batch: true` | Score one group at a time |
| Zero-adv skip | `skip_zero_advantage_batches: true` | Skip batches with no learning signal |
| Replay buffer | `replay_buffer_size: 100` | Cache high-signal groups |
| IS correction | `vllm_importance_sampling_correction: true` | Fix off-policy distribution shift |
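The keys above can be combined in one config. This sketch assumes they nest under `trl:` like other TRL options (`utils/schemas/trl.py` holds the schema); verify placement against `grpo.qmd`:

```yaml
# Sketch: async GRPO options combined (nesting under trl: is an assumption)
trl:
  async_prefetch: true
  vllm_lora_sync: true
  skip_zero_advantage_batches: true
  vllm_importance_sampling_correction: true
```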
## Health Checks

- `rewards/*/mean` > 0.15 within 20 steps (else: test reward function standalone)
- `reward_std` > 0 on most steps (else: no learning signal)
- `entropy` 0.05-0.5 (< 0.01 = mode collapse)
- `grad_norm` 0.001-1.0 (> 10 = unstable, 0.0 = zero-advantage skip)

See `training_stability.qmd` for detailed diagnostics.
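The thresholds above can be encoded as a quick standalone check over logged metrics. The function and the metric-key names are illustrative (axolotl's actual log keys may differ); only the numeric thresholds come from this doc:

```python
def check_grpo_health(metrics: dict, step: int) -> list[str]:
    """Flag metrics outside the healthy ranges above. Keys are hypothetical."""
    warnings = []
    if step >= 20 and metrics.get("reward_mean", 0.0) <= 0.15:
        warnings.append("reward mean <= 0.15 after 20 steps: test reward function standalone")
    if metrics.get("reward_std", 1.0) == 0.0:
        warnings.append("reward_std is 0: no learning signal")
    entropy = metrics.get("entropy")
    if entropy is not None and entropy < 0.01:
        warnings.append("entropy < 0.01: likely mode collapse")
    grad_norm = metrics.get("grad_norm")
    if grad_norm is not None and grad_norm > 10:
        warnings.append("grad_norm > 10: unstable update")
    return warnings
```

A `grad_norm` of exactly 0.0 is expected when `skip_zero_advantage_batches` skips a step, so it is deliberately not flagged here.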
## File Map

```
src/axolotl/
  cli/train.py                # Entry point
  cli/vllm_serve.py           # Entry point for vLLM server
  core/trainers/grpo/
    trainer.py                # AxolotlGRPOTrainer
    sampler.py                # Sampling utilities
  core/builders/rl.py         # HFRLTrainerBuilder — routes rl type → trainer
  scripts/vllm_serve_lora.py  # vLLM serve script with LoRA sync support
  utils/schemas/trl.py        # TRL config schema (all trl: options)
docs/grpo.qmd                 # Full user docs: async, rewards, scaling, config reference
docs/vllm_serving.qmd         # vLLM server modes, LoRA sync, weight sync
```