# GRPO — Agent Reference

Online RL with verifiable reward functions. For full config reference, async features, and scaling, see [grpo.qmd](../grpo.qmd). For vLLM setup, see [vllm_serving.qmd](../vllm_serving.qmd).

## Architecture

```
Terminal 1 (GPU 0)                    Terminal 2 (GPU 1)
┌──────────────────────┐              ┌──────────────────────────────────┐
│ vLLM Server          │     HTTP     │ Trainer                          │
│ Serves base model    │◄────────────►│ 1. Send prompts to vLLM          │
│ + LoRA adapter       │  /generate   │ 2. Score completions (rewards)   │
│                      │  /set_lora   │ 3. Compute advantages            │
│ Punica kernels for   │              │ 4. PPO-clip gradient update      │
│ LoRA inference       │              │ 5. Sync LoRA weights to vLLM     │
└──────────────────────┘              └──────────────────────────────────┘
```
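
A sketch of the two-terminal launch the diagram implies. The `axolotl vllm-serve` command comes from this doc; pinning each process to its own GPU with `CUDA_VISIBLE_DEVICES` is an assumption matching the one-GPU-each layout, not a requirement.

```bash
# Terminal 1: vLLM server that serves generations (assumed on GPU 0)
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve config.yaml

# Terminal 2: trainer, which talks to the server over HTTP (assumed on GPU 1)
CUDA_VISIBLE_DEVICES=1 axolotl train config.yaml
```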
## Components Required

1. A YAML config with `rl: grpo`
2. A reward module (Python file with reward functions)
3. A running vLLM server (`axolotl vllm-serve config.yaml`)
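
A minimal sketch of how these pieces wire together in the YAML. Only `rl: grpo`, `reward_funcs`, and `reward_weights` come from this document; the `trl:` nesting, the `use_vllm` key, and the module-path form of the reward entry are assumptions to verify against [grpo.qmd](../grpo.qmd) and `utils/schemas/trl.py`. Generic training params are omitted.

```yaml
# Sketch only; keys beyond rl / reward_funcs / reward_weights are assumptions.
rl: grpo

trl:                             # nesting assumed from utils/schemas/trl.py
  use_vllm: true                 # rollouts come from the external vLLM server
  reward_funcs:
    - rewards.my_reward          # Python path to a function with the signature below
  reward_weights: [1.0]
```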
## Reward Function Signature

```python
def my_reward(completions, **kwargs) -> list[float]:
    # completions[i][0]["content"] = text of i-th completion
    # **kwargs contains dataset columns not removed by transform
    return [score_for_each_completion]
```

Multiple rewards: `reward_funcs: [r1, r2]` with `reward_weights: [1.0, 0.5]`.
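
Two concrete examples in the shape above. The `answer` keyword is a hypothetical dataset column used for illustration; swap in whatever columns your dataset actually carries via `**kwargs`.

```python
def exact_match(completions, answer=None, **kwargs) -> list[float]:
    """Hypothetical reward: 1.0 when the completion text equals the `answer` column."""
    scores = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]
        scores.append(1.0 if text.strip() == str(gold).strip() else 0.0)
    return scores


def brevity(completions, **kwargs) -> list[float]:
    """Hypothetical shaping reward: mildly prefer shorter completions."""
    return [max(0.0, 1.0 - len(c[0]["content"]) / 2000) for c in completions]
```

Wired up as `reward_funcs: [exact_match, brevity]` with `reward_weights: [1.0, 0.5]`, correctness dominates and brevity only breaks ties.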
## Key Async Features

| Feature | Config | Purpose |
|---------|--------|---------|
| Async prefetch | `async_prefetch: true` | Overlap generation with training |
| LoRA sync | `vllm_lora_sync: true` | Fast adapter sync via filesystem |
| Streaming scoring | `streaming_partial_batch: true` | Score one group at a time |
| Zero-adv skip | `skip_zero_advantage_batches: true` | Skip batches with no learning signal |
| Replay buffer | `replay_buffer_size: 100` | Cache high-signal groups |
| IS correction | `vllm_importance_sampling_correction: true` | Fix off-policy distribution shift |
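
The same flags grouped as they might appear in a config. The key names are taken from the table; whether they sit at the top level or under `trl:` is not stated here, so treat the nesting as an assumption and confirm against the config reference in [grpo.qmd](../grpo.qmd).

```yaml
# Illustrative grouping of the flags above; nesting under trl: is assumed.
trl:
  async_prefetch: true
  vllm_lora_sync: true
  streaming_partial_batch: true
  skip_zero_advantage_batches: true
  replay_buffer_size: 100
  vllm_importance_sampling_correction: true
```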
## Health Checks

- `rewards/*/mean` > 0.15 within 20 steps (else: test reward function standalone)
- `reward_std` > 0 on most steps (else: no learning signal)
- `entropy` 0.05-0.5 (< 0.01 = mode collapse)
- `grad_norm` 0.001-1.0 (> 10 = unstable, 0.0 = zero-advantage skip)
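
A spot-check sketch for these thresholds, assuming the run's metrics land in the HF Trainer's `trainer_state.json` under the names used above; both the file location and the exact metric keys are assumptions to adjust to what your run actually logs.

```python
import json

def check_grpo_health(state_path: str) -> None:
    """Flag log entries that violate the rough thresholds listed above."""
    with open(state_path) as f:
        history = json.load(f).get("log_history", [])
    for entry in history:
        step = entry.get("step")
        if step and step >= 20:
            for key, value in entry.items():
                if key.startswith("rewards/") and key.endswith("/mean") and value <= 0.15:
                    print(f"step {step}: {key} = {value:.3f} (still <= 0.15 after 20 steps)")
        if entry.get("entropy", 1.0) < 0.01:
            print(f"step {step}: entropy {entry['entropy']:.4f} (possible mode collapse)")
        if entry.get("grad_norm", 0.0) > 10:
            print(f"step {step}: grad_norm {entry['grad_norm']:.2f} (unstable update)")
        if entry.get("reward_std") == 0:
            print(f"step {step}: reward_std = 0 (no learning signal this step)")
```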
See [training_stability.qmd](../training_stability.qmd) for detailed diagnostics.
## File Map

```
src/axolotl/
  cli/train.py                  # Entry point
  cli/vllm_serve.py             # Entry point for vLLM server
  core/trainers/grpo/
    trainer.py                  # AxolotlGRPOTrainer
    sampler.py                  # Sampling utilities
  core/builders/rl.py           # HFRLTrainerBuilder — routes rl type → trainer
  scripts/vllm_serve_lora.py    # vLLM serve script with LoRA sync support
  utils/schemas/trl.py          # TRL config schema (all trl: options)

docs/grpo.qmd                   # Full user docs: async, rewards, scaling, config reference
docs/vllm_serving.qmd           # vLLM server modes, LoRA sync, weight sync
```