Preference Learning (RLHF) — Agent Reference

Reference for DPO, IPO, KTO, ORPO, and SimPO. For config templates and dataset format examples, see rlhf.qmd. For GRPO, see grpo.qmd. For EBFT, see ebft.qmd.

Method Overview

| Method | Data Requirement | Key Idea | Best For |
|--------|------------------|----------|----------|
| DPO | Paired (chosen + rejected) | Implicit reward via preference pairs | General alignment, most common |
| IPO | Paired (chosen + rejected) | DPO with different loss (avoids overfitting) | When DPO overfits |
| KTO | Unpaired (completion + binary label) | Kahneman-Tversky loss, no pairs needed | When you only have thumbs-up/down |
| ORPO | Paired (chosen + rejected) | Combined SFT + preference, no ref model | Single-stage alignment, saves VRAM |
| SimPO | Paired (chosen + rejected) | Length-normalized, no ref model | Simple setup, length-robust |

Default: start with DPO. All methods require sample_packing: false.
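
A minimal DPO config sketch for orientation; the base_model and dataset path are placeholders and only method-relevant fields are shown (full templates are in rlhf.qmd):

```yaml
# Minimal DPO sketch; base_model and dataset path are placeholders.
base_model: NousResearch/Meta-Llama-3.1-8B
rl: dpo
datasets:
  - path: your-org/preference-pairs    # rows provide prompt, chosen, rejected
    type: chat_template.default
sample_packing: false                  # required for all preference methods
```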

Architecture

┌──────────────┐   ┌───────────────┐   ┌───────────────┐
│ Policy Model │   │ Reference     │   │ Preference    │
│ (trainable)  │   │ Model (frozen)│   │ Dataset       │
└──────┬───────┘   └──────┬────────┘   └──────┬────────┘
       └──────────┬───────┘                    │
                  v                            │
       Forward pass on chosen + rejected <─────┘
                  │
       Preference Loss (DPO/IPO/KTO/...)
                  │
       Backprop + Update

Exception: ORPO and SimPO do NOT use a reference model (~50% less VRAM).

No vLLM server needed (unlike GRPO). Offline RL with pre-collected preference data.
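
For the DPO case, the preference loss in the diagram is the standard DPO objective, where sigma is the sigmoid, beta the preference temperature, and y_w / y_l the chosen / rejected completions:

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$

ORPO and SimPO replace the reference-model log-ratios with an odds-ratio term and a length-normalized policy term respectively, which is why they can drop the frozen reference model.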

Method Selection

  1. Paired preference data (chosen + rejected)?
    • Default → rl: dpo
    • Overfitting → rl: ipo
    • VRAM-limited → rl: orpo (no ref model)
    • Length-sensitive → rl: simpo (no ref model)
  2. Only binary labels (good/bad)? → rl: kto
  3. Single-stage training (no separate SFT)? → rl: orpo

| | DPO | IPO | KTO | ORPO | SimPO |
|---|---|---|---|---|---|
| Reference model | Yes | Yes | Yes | No | No |
| VRAM overhead | ~2x model | ~2x model | ~2x model | ~1x model | ~1x model |
| TRL trainer class | DPOTrainer | DPOTrainer | KTOTrainer | ORPOTrainer | CPOTrainer |
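
KTO is the one method here that takes unpaired data, so its config differs mainly in dataset shape. A sketch, with the dataset path as a placeholder and the strategy subtype chosen purely for illustration (check prompt_strategies/kto/ and examples/llama-3/qlora-1b-kto.yaml for the exact names):

```yaml
# KTO sketch; dataset path is a placeholder and the strategy subtype is illustrative.
rl: kto
datasets:
  - path: your-org/binary-feedback   # rows provide prompt, completion, boolean "label"
    type: chatml.argilla             # illustrative; pick a strategy from prompt_strategies/kto/
sample_packing: false
remove_unused_columns: false         # avoids the KTO tokenization KeyError (see Known Issues)
```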

Prompt Strategy Resolution

The type field resolves to a Python function:

type: "chatml.intel"
  → axolotl.prompt_strategies.dpo.chatml.intel(cfg, **kwargs)
  → returns transform_fn(sample) → {"prompt", "chosen", "rejected"}

type: "chat_template.default"
  → axolotl.prompt_strategies.dpo.chat_template.default(cfg, dataset_idx, **kwargs)

type: {"field_prompt": "prompt", ...}   (dict)
  → axolotl.prompt_strategies.dpo.user_defined.default(...)

Module base: axolotl.prompt_strategies.{rl_method} — replace dpo with kto or orpo.
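
For the dict form, a sketch mapping a hypothetical dataset whose columns are question / good_answer / bad_answer; the field_* key names follow the user_defined strategy, so verify them in prompt_strategies/dpo/user_defined.py:

```yaml
# Hypothetical column names; the field_* keys map them onto prompt/chosen/rejected.
datasets:
  - path: your-org/custom-pairs
    split: train
    type:
      field_prompt: question
      field_chosen: good_answer
      field_rejected: bad_answer
```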

Healthy Training Indicators

| Metric | Healthy Range | Problem |
|---|---|---|
| train/loss | Decreasing, 0.3-0.7 | Flat or increasing = broken data or too high LR |
| rewards/chosen | Increasing | Flat = model not learning preferences |
| rewards/rejected | Decreasing | Increasing = model prefers wrong responses |
| rewards/margins | Positive and increasing | Negative = prefers rejected over chosen |
| rewards/accuracies | > 0.5, toward 0.7+ | < 0.5 = worse than random |
| logps/rejected | Decreasing | Increasing = reward hacking |
| grad_norm | 0.01 - 10.0 | > 100 = exploding gradients |
Method-specific: for DPO/IPO, watch rewards/margins; KTO loss is noisier; for ORPO, monitor the SFT and odds-ratio loss components; for SimPO, check length-normalized reward separation.

Known Issues

| Issue | Fix |
|---|---|
| Sample packing crash | Set sample_packing: false (required for all preference methods) |
| KTO KeyError: 'label' | Ensure dataset has boolean label column |
| ORPO/KTO KeyError during tokenization | Add remove_unused_columns: false |
| ORPO template not applied | ORPO requires explicit chat_template setting |
| OOM with ref model (DPO/IPO/KTO) | Use LoRA/QLoRA, or switch to ORPO/SimPO (no ref model) |
| IPO + label_smoothing | Do not set dpo_label_smoothing when rl: ipo |

Full troubleshooting: training_stability.qmd
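
Consolidated, the config-level fixes from the table come down to a few lines that are reasonable defaults for most preference runs; the chat template value here is illustrative:

```yaml
# Defensive defaults drawn from the Known Issues table above.
sample_packing: false          # required for all preference methods
remove_unused_columns: false   # prevents the ORPO/KTO tokenization KeyError
chat_template: chatml          # ORPO needs an explicit template; chatml is illustrative
```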

File Map

src/axolotl/
  core/trainers/dpo/              # DPO trainer, args, strategy
  core/builders/rl.py             # HFRLTrainerBuilder — routes rl type → trainer class
  core/training_args.py           # AxolotlKTOConfig, AxolotlORPOConfig, AxolotlCPOConfig
  prompt_strategies/
    dpo/                          # DPO/IPO/SimPO dataset strategies
      chat_template.py            # chat_template.default, chat_template.argilla_chat
      chatml.py                   # chatml.default/intel/icr/argilla_chat/prompt_pairs/ultra
      llama3.py                   # llama3 variants (same subtypes as chatml)
      user_defined.py             # Custom field mapping
      passthrough.py              # No transform
    kto/                          # KTO dataset strategies (chatml, llama3, user_defined)
    orpo/                         # ORPO dataset strategies (chat_template.argilla)
  utils/schemas/enums.py          # RLType enum (dpo, ipo, kto, orpo, simpo, grpo, gdpo, ebft)
  utils/schemas/config.py         # All rl/dpo/kto/orpo/simpo config fields

docs/rlhf.qmd                       # Full user docs: all dataset formats, config templates
docs/choosing_method.qmd            # SFT vs DPO vs GRPO decision guide
examples/qwen2/dpo.yaml             # DPO example
examples/llama-3/qlora-1b-kto.yaml  # KTO example