# Reward Modelling — Agent Reference
Train models to score responses for use as reward signals in RL. For full docs, see [reward_modelling.qmd](../reward_modelling.qmd).
## Types
### Outcome Reward Models (ORM)
Train a classifier to predict preference over entire interactions. Uses `AutoModelForSequenceClassification`.
```yaml
base_model: google/gemma-2-2b
model_type: AutoModelForSequenceClassification
num_labels: 1
reward_model: true
chat_template: gemma
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template
```
Dataset format: `{"system": "...", "input": "...", "chosen": "...", "rejected": "..."}`
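A concrete row in this format (values invented for illustration):

```json
{
  "system": "You are a helpful assistant.",
  "input": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "France's capital city is Lyon."
}
```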
### Process Reward Models (PRM)
Train a token classifier to score each reasoning step. Uses `AutoModelForTokenClassification`.
```yaml
base_model: Qwen/Qwen2.5-3B
model_type: AutoModelForTokenClassification
num_labels: 2
process_reward_model: true
datasets:
  - path: trl-lib/math_shepherd
    type: stepwise_supervised
```
Dataset format: see [stepwise_supervised.qmd](../dataset-formats/stepwise_supervised.qmd).
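For quick orientation, a row in this format looks roughly like the following (field names follow the `trl-lib/math_shepherd` convention of one boolean label per completion step; values are invented):

```json
{
  "prompt": "Janet has 3 apples and buys 2 more. How many does she have?",
  "completions": ["Step 1: 3 + 2 = 5.", "Step 2: The answer is 5."],
  "labels": [true, true]
}
```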
## File Map
```
src/axolotl/
  core/builders/causal.py                   # Handles reward_model flag in trainer builder
  prompt_strategies/bradley_terry/          # Bradley-Terry prompt strategies
  prompt_strategies/stepwise_supervised.py  # PRM dataset strategy
  utils/schemas/config.py                   # reward_model, process_reward_model config fields
```