feat(docs): comprehensive improvement (#3564)
* docs: comprehensive documentation improvements for humans and agents New human docs: - grpo.qmd: GRPO deep dive (async, rewards, IS correction, scaling) - ebft.qmd: EBFT guide (structured/strided modes, feature extraction) - choosing_method.qmd: decision tree for SFT vs LoRA vs DPO vs GRPO - vllm_serving.qmd: vLLM setup for GRPO (server/colocate, LoRA sync) - training_stability.qmd: monitoring, NaN debugging, OOM, healthy metrics New agent docs: - AGENTS_SFT.md: agent reference for supervised fine-tuning - AGENTS_DPO.md: agent reference for preference learning (DPO/KTO/ORPO) Updated existing docs: - rlhf.qmd: cross-references to new GRPO/EBFT/choosing-method guides - getting-started.qmd: reorganized Next Steps with links to new guides - debugging.qmd: link to training stability guide - _quarto.yml: added new pages to sidebar navigation Removed: - bak.agents.md: stale backup that confused agents * docs: trim duplicated generic config from AGENTS_DPO.md Remove boilerplate training params (optimizer, gradient_checkpointing, flash_attention, etc.) from each method template. These are not preference-learning-specific and are already covered in AGENTS_SFT.md. Config templates now show only method-specific fields with a reference to AGENTS_SFT.md for the rest. * docs: deduplicate across new doc pages - grpo.qmd: collapse vLLM setup section to brief config + link to vllm_serving.qmd; collapse IS correction to essentials + link; replace full monitoring tables with summary + link to training_stability.qmd - vllm_serving.qmd: remove duplicated async/IS config reference tables (already in grpo.qmd config reference); replace full example config with link to grpo.qmd quick start - ebft.qmd: trim generic training params in quick start config * fix: train scripts * feat: split files into cleaner parts * fix: cleanup pretraining docs --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>
This commit is contained in:
121
docs/agents/preference_tuning.md
Normal file
121
docs/agents/preference_tuning.md
Normal file
@@ -0,0 +1,121 @@
|
||||
# Preference Learning (RLHF) — Agent Reference
|
||||
|
||||
Reference for DPO, IPO, KTO, ORPO, and SimPO. For config templates and dataset format examples, see [rlhf.qmd](../rlhf.qmd). For GRPO, see [grpo.qmd](../grpo.qmd). For EBFT, see [ebft.qmd](../ebft.qmd).
|
||||
|
||||
## Method Overview
|
||||
|
||||
| Method | Data Requirement | Key Idea | Best For |
|
||||
|--------|-----------------|----------|----------|
|
||||
| **DPO** | Paired (chosen + rejected) | Implicit reward via preference pairs | General alignment, most common |
|
||||
| **IPO** | Paired (chosen + rejected) | DPO with different loss (avoids overfitting) | When DPO overfits |
|
||||
| **KTO** | Unpaired (completion + binary label) | Kahneman-Tversky loss, no pairs needed | When you only have thumbs-up/down |
|
||||
| **ORPO** | Paired (chosen + rejected) | Combined SFT + preference, no ref model | Single-stage alignment, saves VRAM |
|
||||
| **SimPO** | Paired (chosen + rejected) | Length-normalized, no ref model | Simple setup, length-robust |
|
||||
|
||||
Default: start with DPO. All methods require `sample_packing: false`.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌───────────────┐ ┌───────────────┐
|
||||
│ Policy Model │ │ Reference │ │ Preference │
|
||||
│ (trainable) │ │ Model (frozen)│ │ Dataset │
|
||||
└──────┬───────┘ └──────┬────────┘ └──────┬────────┘
|
||||
└──────────┬───────┘ │
|
||||
v │
|
||||
Forward pass on chosen + rejected <─────┘
|
||||
│
|
||||
Preference Loss (DPO/IPO/KTO/...)
|
||||
│
|
||||
Backprop + Update
|
||||
|
||||
Exception: ORPO and SimPO do NOT use a reference model (~50% less VRAM).
|
||||
```
|
||||
|
||||
No vLLM server needed (unlike GRPO). Offline RL with pre-collected preference data.
|
||||
|
||||
## Method Selection
|
||||
|
||||
1. Paired preference data (chosen + rejected)?
|
||||
- Default → `rl: dpo`
|
||||
- Overfitting → `rl: ipo`
|
||||
- VRAM-limited → `rl: orpo` (no ref model)
|
||||
- Length-sensitive → `rl: simpo` (no ref model)
|
||||
2. Only binary labels (good/bad)? → `rl: kto`
|
||||
3. Single-stage training (no separate SFT)? → `rl: orpo`
|
||||
|
||||
| | DPO | IPO | KTO | ORPO | SimPO |
|
||||
|---|---|---|---|---|---|
|
||||
| **Reference model** | Yes | Yes | Yes | No | No |
|
||||
| **VRAM overhead** | ~2x model | ~2x model | ~2x model | ~1x model | ~1x model |
|
||||
| **TRL trainer class** | DPOTrainer | DPOTrainer | KTOTrainer | ORPOTrainer | CPOTrainer |
|
||||
|
||||
## Prompt Strategy Resolution
|
||||
|
||||
The `type` field resolves to a Python function:
|
||||
|
||||
```
|
||||
type: "chatml.intel"
|
||||
→ axolotl.prompt_strategies.dpo.chatml.intel(cfg, **kwargs)
|
||||
→ returns transform_fn(sample) → {"prompt", "chosen", "rejected"}
|
||||
|
||||
type: "chat_template.default"
|
||||
→ axolotl.prompt_strategies.dpo.chat_template.default(cfg, dataset_idx, **kwargs)
|
||||
|
||||
type: {"field_prompt": "prompt", ...} (dict)
|
||||
→ axolotl.prompt_strategies.dpo.user_defined.default(...)
|
||||
```
|
||||
|
||||
Module base: `axolotl.prompt_strategies.{rl_method}` — replace `dpo` with `kto` or `orpo`.
|
||||
|
||||
## Healthy Training Indicators
|
||||
|
||||
| Metric | Healthy Range | Problem |
|
||||
|--------|--------------|---------|
|
||||
| `train/loss` | Decreasing, 0.3-0.7 | Flat or increasing = broken data or too high LR |
|
||||
| `rewards/chosen` | Increasing | Flat = model not learning preferences |
|
||||
| `rewards/rejected` | Decreasing | Increasing = model prefers wrong responses |
|
||||
| `rewards/margins` | Positive and increasing | Negative = prefers rejected over chosen |
|
||||
| `rewards/accuracies` | > 0.5, toward 0.7+ | < 0.5 = worse than random |
|
||||
| `logps/rejected` | Decreasing | Increasing = reward hacking |
|
||||
| `grad_norm` | 0.01 - 10.0 | > 100 = exploding gradients |
|
||||
|
||||
Method-specific: DPO/IPO watch `rewards/margins`; KTO loss is noisier; ORPO monitor SFT + odds ratio components; SimPO check length-normalized reward separation.
|
||||
|
||||
## Known Issues
|
||||
|
||||
| Issue | Fix |
|
||||
|-------|-----|
|
||||
| Sample packing crash | Set `sample_packing: false` (required for all preference methods) |
|
||||
| KTO `KeyError: 'label'` | Ensure dataset has boolean `label` column |
|
||||
| ORPO/KTO `KeyError` during tokenization | Add `remove_unused_columns: false` |
|
||||
| ORPO template not applied | ORPO requires explicit `chat_template` setting |
|
||||
| OOM with ref model (DPO/IPO/KTO) | Use LoRA/QLoRA, or switch to ORPO/SimPO (no ref model) |
|
||||
| IPO + label_smoothing | Do not set `dpo_label_smoothing` when `rl: ipo` |
|
||||
|
||||
Full troubleshooting: [training_stability.qmd](../training_stability.qmd)
|
||||
|
||||
## File Map
|
||||
|
||||
```
|
||||
src/axolotl/
|
||||
core/trainers/dpo/ # DPO trainer, args, strategy
|
||||
core/builders/rl.py # HFRLTrainerBuilder — routes rl type → trainer class
|
||||
core/training_args.py # AxolotlKTOConfig, AxolotlORPOConfig, AxolotlCPOConfig
|
||||
prompt_strategies/
|
||||
dpo/ # DPO/IPO/SimPO dataset strategies
|
||||
chat_template.py # chat_template.default, chat_template.argilla_chat
|
||||
chatml.py # chatml.default/intel/icr/argilla_chat/prompt_pairs/ultra
|
||||
llama3.py # llama3 variants (same subtypes as chatml)
|
||||
user_defined.py # Custom field mapping
|
||||
passthrough.py # No transform
|
||||
kto/ # KTO dataset strategies (chatml, llama3, user_defined)
|
||||
orpo/ # ORPO dataset strategies (chat_template.argilla)
|
||||
utils/schemas/enums.py # RLType enum (dpo, ipo, kto, orpo, simpo, grpo, gdpo, ebft)
|
||||
utils/schemas/config.py # All rl/dpo/kto/orpo/simpo config fields
|
||||
|
||||
docs/rlhf.qmd # Full user docs: all dataset formats, config templates
|
||||
docs/choosing_method.qmd # SFT vs DPO vs GRPO decision guide
|
||||
examples/qwen2/dpo.yaml # DPO example
|
||||
examples/llama-3/qlora-1b-kto.yaml # KTO example
|
||||
```
|
||||
Reference in New Issue
Block a user