Preference Learning (RLHF) — Agent Reference

Reference for DPO, IPO, KTO, ORPO, and SimPO. For config templates and dataset format examples, see rlhf.qmd. For GRPO, see grpo.qmd. For EBFT, see ebft.qmd.

Method Overview

| Method | Data Requirement | Key Idea | Best For |
|--------|------------------|----------|----------|
| DPO | Paired (chosen + rejected) | Implicit reward via preference pairs | General alignment, most common |
| IPO | Paired (chosen + rejected) | DPO with different loss (avoids overfitting) | When DPO overfits |
| KTO | Unpaired (completion + binary label) | Kahneman-Tversky loss, no pairs needed | When you only have thumbs-up/down |
| ORPO | Paired (chosen + rejected) | Combined SFT + preference, no ref model | Single-stage alignment, saves VRAM |
| SimPO | Paired (chosen + rejected) | Length-normalized, no ref model | Simple setup, length-robust |

Default: start with DPO. All methods require sample_packing: false.
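
A minimal DPO config sketch for orientation; the base_model and dataset path are placeholders and only method-relevant fields are shown (full templates are in rlhf.qmd):

```yaml
# Minimal DPO sketch; base_model and dataset path are placeholders.
base_model: NousResearch/Meta-Llama-3.1-8B
rl: dpo
datasets:
  - path: your-org/preference-pairs    # rows provide prompt, chosen, rejected
    type: chat_template.default
sample_packing: false                  # required for all preference methods
```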

Architecture

┌──────────────┐   ┌───────────────┐   ┌───────────────┐
│ Policy Model │   │ Reference     │   │ Preference    │
│ (trainable)  │   │ Model (frozen)│   │ Dataset       │
└──────┬───────┘   └──────┬────────┘   └──────┬────────┘
       └──────────┬───────┘                    │
                  v                            │
       Forward pass on chosen + rejected <─────┘
                  │
       Preference Loss (DPO/IPO/KTO/...)
                  │
       Backprop + Update

Exception: ORPO and SimPO do NOT use a reference model (~50% less VRAM).

No vLLM server needed (unlike GRPO). Offline RL with pre-collected preference data.
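
For the DPO case, the preference loss in the diagram is the standard DPO objective, where sigma is the sigmoid, beta the preference temperature, and y_w / y_l the chosen / rejected completions:

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$

ORPO and SimPO replace the reference-model log-ratios with an odds-ratio term and a length-normalized policy term respectively, which is why they can drop the frozen reference model.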

Method Selection

  1. Paired preference data (chosen + rejected)?
    • Default → rl: dpo
    • Overfitting → rl: ipo
    • VRAM-limited → rl: orpo (no ref model)
    • Length-sensitive → rl: simpo (no ref model)
  2. Only binary labels (good/bad)? → rl: kto
  3. Single-stage training (no separate SFT)? → rl: orpo

| | DPO | IPO | KTO | ORPO | SimPO |
|---|---|---|---|---|---|
| Reference model | Yes | Yes | Yes | No | No |
| VRAM overhead | ~2x model | ~2x model | ~2x model | ~1x model | ~1x model |
| TRL trainer class | DPOTrainer | DPOTrainer | KTOTrainer | ORPOTrainer | CPOTrainer |
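
KTO is the one method here that takes unpaired data, so its config differs mainly in dataset shape. A sketch, with the dataset path as a placeholder and the strategy subtype chosen purely for illustration (check prompt_strategies/kto/ and examples/llama-3/qlora-1b-kto.yaml for the exact names):

```yaml
# KTO sketch; dataset path is a placeholder and the strategy subtype is illustrative.
rl: kto
datasets:
  - path: your-org/binary-feedback   # rows provide prompt, completion, boolean "label"
    type: chatml.argilla             # illustrative; pick a strategy from prompt_strategies/kto/
sample_packing: false
remove_unused_columns: false         # avoids the KTO tokenization KeyError (see Known Issues)
```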

Prompt Strategy Resolution

The type field resolves to a Python function:

type: "chatml.intel"
  → axolotl.prompt_strategies.dpo.chatml.intel(cfg, **kwargs)
  → returns transform_fn(sample) → {"prompt", "chosen", "rejected"}

type: "chat_template.default"
  → axolotl.prompt_strategies.dpo.chat_template.default(cfg, dataset_idx, **kwargs)

type: {"field_prompt": "prompt", ...}   (dict)
  → axolotl.prompt_strategies.dpo.user_defined.default(...)

Module base: axolotl.prompt_strategies.{rl_method} — replace dpo with kto or orpo.
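
For the dict form, a sketch mapping a hypothetical dataset whose columns are question / good_answer / bad_answer; the field_* key names follow the user_defined strategy, so verify them in prompt_strategies/dpo/user_defined.py:

```yaml
# Hypothetical column names; the field_* keys map them onto prompt/chosen/rejected.
datasets:
  - path: your-org/custom-pairs
    split: train
    type:
      field_prompt: question
      field_chosen: good_answer
      field_rejected: bad_answer
```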

Healthy Training Indicators

| Metric | Healthy Range | Problem |
|---|---|---|
| train/loss | Decreasing, 0.3-0.7 | Flat or increasing = broken data or too high LR |
| rewards/chosen | Increasing | Flat = model not learning preferences |
| rewards/rejected | Decreasing | Increasing = model prefers wrong responses |
| rewards/margins | Positive and increasing | Negative = prefers rejected over chosen |
| rewards/accuracies | > 0.5, toward 0.7+ | < 0.5 = worse than random |
| logps/rejected | Decreasing | Increasing = reward hacking |
| grad_norm | 0.01 - 10.0 | > 100 = exploding gradients |
Method-specific: for DPO/IPO, watch rewards/margins; KTO loss is noisier; for ORPO, monitor the SFT and odds-ratio loss components; for SimPO, check length-normalized reward separation.

Known Issues

| Issue | Fix |
|---|---|
| Sample packing crash | Set sample_packing: false (required for all preference methods) |
| KTO KeyError: 'label' | Ensure dataset has boolean label column |
| ORPO/KTO KeyError during tokenization | Add remove_unused_columns: false |
| ORPO template not applied | ORPO requires explicit chat_template setting |
| OOM with ref model (DPO/IPO/KTO) | Use LoRA/QLoRA, or switch to ORPO/SimPO (no ref model) |
| IPO + label_smoothing | Do not set dpo_label_smoothing when rl: ipo |

Full troubleshooting: training_stability.qmd
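
Consolidated, the config-level fixes from the table come down to a few lines that are reasonable defaults for most preference runs; the chat template value here is illustrative:

```yaml
# Defensive defaults drawn from the Known Issues table above.
sample_packing: false          # required for all preference methods
remove_unused_columns: false   # prevents the ORPO/KTO tokenization KeyError
chat_template: chatml          # ORPO needs an explicit template; chatml is illustrative
```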

File Map

src/axolotl/
  core/trainers/dpo/              # DPO trainer, args, strategy
  core/builders/rl.py             # HFRLTrainerBuilder — routes rl type → trainer class
  core/training_args.py           # AxolotlKTOConfig, AxolotlORPOConfig, AxolotlCPOConfig
  prompt_strategies/
    dpo/                          # DPO/IPO/SimPO dataset strategies
      chat_template.py            # chat_template.default, chat_template.argilla_chat
      chatml.py                   # chatml.default/intel/icr/argilla_chat/prompt_pairs/ultra
      llama3.py                   # llama3 variants (same subtypes as chatml)
      user_defined.py             # Custom field mapping
      passthrough.py              # No transform
    kto/                          # KTO dataset strategies (chatml, llama3, user_defined)
    orpo/                         # ORPO dataset strategies (chat_template.argilla)
  utils/schemas/enums.py          # RLType enum (dpo, ipo, kto, orpo, simpo, grpo, gdpo, ebft)
  utils/schemas/config.py         # All rl/dpo/kto/orpo/simpo config fields

docs/rlhf.qmd                       # Full user docs: all dataset formats, config templates
docs/choosing_method.qmd            # SFT vs DPO vs GRPO decision guide
examples/qwen2/dpo.yaml             # DPO example
examples/llama-3/qlora-1b-kto.yaml  # KTO example