# GRPO — Agent Reference

Online RL with verifiable reward functions. For full config reference, async features, and scaling, see [grpo.qmd](../grpo.qmd). For vLLM setup, see [vllm_serving.qmd](../vllm_serving.qmd).

## Architecture

```
Terminal 1 (GPU 0)                    Terminal 2 (GPU 1)
┌──────────────────────┐              ┌──────────────────────────────────┐
│ vLLM Server          │     HTTP     │ Trainer                          │
│ Serves base model    │◄────────────►│ 1. Send prompts to vLLM          │
│ + LoRA adapter       │  /generate   │ 2. Score completions (rewards)   │
│                      │  /set_lora   │ 3. Compute advantages            │
│ Punica kernels for   │              │ 4. PPO-clip gradient update      │
│ LoRA inference       │              │ 5. Sync LoRA weights to vLLM     │
└──────────────────────┘              └──────────────────────────────────┘
```
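
A sketch of the two-terminal launch the diagram implies. The `axolotl vllm-serve` command comes from this doc; pinning each process to its own GPU with `CUDA_VISIBLE_DEVICES` is an assumption matching the one-GPU-each layout, not a requirement.

```bash
# Terminal 1: vLLM server that serves generations (assumed on GPU 0)
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve config.yaml

# Terminal 2: trainer, which talks to the server over HTTP (assumed on GPU 1)
CUDA_VISIBLE_DEVICES=1 axolotl train config.yaml
```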
## Components Required

1. A YAML config with `rl: grpo`
2. A reward module (Python file with reward functions)
3. A running vLLM server (`axolotl vllm-serve config.yaml`)
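
A minimal sketch of how these pieces wire together in the YAML. Only `rl: grpo`, `reward_funcs`, and `reward_weights` come from this document; the `trl:` nesting, the `use_vllm` key, and the module-path form of the reward entry are assumptions to verify against [grpo.qmd](../grpo.qmd) and `utils/schemas/trl.py`. Generic training params are omitted.

```yaml
# Sketch only; keys beyond rl / reward_funcs / reward_weights are assumptions.
rl: grpo

trl:                             # nesting assumed from utils/schemas/trl.py
  use_vllm: true                 # rollouts come from the external vLLM server
  reward_funcs:
    - rewards.my_reward          # Python path to a function with the signature below
  reward_weights: [1.0]
```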
## Reward Function Signature

```python
def my_reward(completions, **kwargs) -> list[float]:
    # completions[i][0]["content"] = text of i-th completion
    # **kwargs contains dataset columns not removed by transform
    return [score_for_each_completion]
```

Multiple rewards: `reward_funcs: [r1, r2]` with `reward_weights: [1.0, 0.5]`.
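
Two concrete examples in the shape above. The `answer` keyword is a hypothetical dataset column used for illustration; swap in whatever columns your dataset actually carries via `**kwargs`.

```python
def exact_match(completions, answer=None, **kwargs) -> list[float]:
    """Hypothetical reward: 1.0 when the completion text equals the `answer` column."""
    scores = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]
        scores.append(1.0 if text.strip() == str(gold).strip() else 0.0)
    return scores


def brevity(completions, **kwargs) -> list[float]:
    """Hypothetical shaping reward: mildly prefer shorter completions."""
    return [max(0.0, 1.0 - len(c[0]["content"]) / 2000) for c in completions]
```

Wired up as `reward_funcs: [exact_match, brevity]` with `reward_weights: [1.0, 0.5]`, correctness dominates and brevity only breaks ties.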
## Key Async Features

| Feature | Config | Purpose |
|---------|--------|---------|
| Async prefetch | `async_prefetch: true` | Overlap generation with training |
| LoRA sync | `vllm_lora_sync: true` | Fast adapter sync via filesystem |
| Streaming scoring | `streaming_partial_batch: true` | Score one group at a time |
| Zero-adv skip | `skip_zero_advantage_batches: true` | Skip batches with no learning signal |
| Replay buffer | `replay_buffer_size: 100` | Cache high-signal groups |
| IS correction | `vllm_importance_sampling_correction: true` | Fix off-policy distribution shift |
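
The same flags grouped as they might appear in a config. The key names are taken from the table; whether they sit at the top level or under `trl:` is not stated here, so treat the nesting as an assumption and confirm against the config reference in [grpo.qmd](../grpo.qmd).

```yaml
# Illustrative grouping of the flags above; nesting under trl: is assumed.
trl:
  async_prefetch: true
  vllm_lora_sync: true
  streaming_partial_batch: true
  skip_zero_advantage_batches: true
  replay_buffer_size: 100
  vllm_importance_sampling_correction: true
```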
## Health Checks

- `rewards/*/mean` > 0.15 within 20 steps (else: test reward function standalone)
- `reward_std` > 0 on most steps (else: no learning signal)
- `entropy` 0.05-0.5 (< 0.01 = mode collapse)
- `grad_norm` 0.001-1.0 (> 10 = unstable, 0.0 = zero-advantage skip)
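
A spot-check sketch for these thresholds, assuming the run's metrics land in the HF Trainer's `trainer_state.json` under the names used above; both the file location and the exact metric keys are assumptions to adjust to what your run actually logs.

```python
import json

def check_grpo_health(state_path: str) -> None:
    """Flag log entries that violate the rough thresholds listed above."""
    with open(state_path) as f:
        history = json.load(f).get("log_history", [])
    for entry in history:
        step = entry.get("step")
        if step and step >= 20:
            for key, value in entry.items():
                if key.startswith("rewards/") and key.endswith("/mean") and value <= 0.15:
                    print(f"step {step}: {key} = {value:.3f} (still <= 0.15 after 20 steps)")
        if entry.get("entropy", 1.0) < 0.01:
            print(f"step {step}: entropy {entry['entropy']:.4f} (possible mode collapse)")
        if entry.get("grad_norm", 0.0) > 10:
            print(f"step {step}: grad_norm {entry['grad_norm']:.2f} (unstable update)")
        if entry.get("reward_std") == 0:
            print(f"step {step}: reward_std = 0 (no learning signal this step)")
```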
See [training_stability.qmd](../training_stability.qmd) for detailed diagnostics.
## File Map

```
src/axolotl/
  cli/train.py                  # Entry point
  cli/vllm_serve.py             # Entry point for vLLM server
  core/trainers/grpo/
    trainer.py                  # AxolotlGRPOTrainer
    sampler.py                  # Sampling utilities
  core/builders/rl.py           # HFRLTrainerBuilder — routes rl type → trainer
  scripts/vllm_serve_lora.py    # vLLM serve script with LoRA sync support
  utils/schemas/trl.py          # TRL config schema (all trl: options)

docs/grpo.qmd                   # Full user docs: async, rewards, scaling, config reference
docs/vllm_serving.qmd           # vLLM server modes, LoRA sync, weight sync
```