# Axolotl
Fine-tuning framework for LLMs. Config-driven: every training run is defined by a single YAML file.
## Tech Stack
Python, PyTorch, HuggingFace Transformers, TRL, PEFT (LoRA/QLoRA), DeepSpeed, FSDP, vLLM (for GRPO generation).
## Commands

```bash
axolotl train config.yaml             # Train (single or multi-GPU, auto-detected)
axolotl preprocess config.yaml        # Tokenize dataset and validate config
axolotl preprocess config.yaml --debug  # Inspect tokenized samples and label masking
axolotl inference config.yaml         # Interactive inference
axolotl merge-lora config.yaml        # Merge LoRA adapter into base model
axolotl vllm-serve config.yaml        # Start vLLM server for GRPO/EBFT training
axolotl fetch examples                # Download example configs
axolotl agent-docs                    # Show agent-optimized docs (bundled with pip package)
axolotl agent-docs grpo               # Topic-specific agent reference
axolotl config-schema                 # Dump config JSON schema
```
## Training Methods

| Method | Config Key | When to Use |
|---|---|---|
| SFT | (default) | Input-output pairs, instruction tuning |
| DPO/IPO | `rl: dpo` / `rl: dpo` + `dpo_loss_type: ["ipo"]` | Paired preference data (chosen vs rejected) |
| KTO | `rl: kto` | Unpaired binary preference labels |
| ORPO | `rl: orpo` | Single-stage alignment, no ref model |
| GRPO | `rl: grpo` | RL with verifiable reward functions (math, code) |
| EBFT | `rl: ebft` | Feature-matching rewards from internal representations |
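As a concrete illustration, IPO runs through the DPO path with a loss-type override. A minimal sketch (dataset path and output dir are placeholders; verify exact key names with `axolotl config-schema`):

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
rl: dpo
dpo_loss_type: ["ipo"]            # IPO via the DPO path, per the table above
datasets:
  - path: my_preference_dataset   # paired chosen/rejected records
output_dir: ./outputs/ipo-out
```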
Agent-specific references:
- `docs/agents/sft.md` — supervised fine-tuning
- `docs/agents/preference_tuning.md` — DPO, IPO, KTO, ORPO, SimPO
- `docs/agents/grpo.md` — GRPO online RL with reward functions
- `docs/agents/reward_modelling.md` — outcome and process reward models
- `docs/agents/pretraining.md` — continual pretraining
- `docs/agents/model_architectures.md` — model-specific quirks (Gemma4, Qwen3.5 MoE, etc.)
- `docs/agents/new_model_support.md` — debugging and adding support for new model architectures
## Config Pattern
All training is config-driven. A YAML file specifies model, adapter, dataset(s), and hyperparameters:
```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora              # or qlora, or omit for full fine-tune
datasets:
  - path: my_dataset
    type: chat_template    # prompt strategy (see docs/dataset-formats/)
output_dir: ./outputs/lora-out
```
Config schema: `src/axolotl/utils/schemas/config.py` (`AxolotlInputConfig`).
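A QLoRA run differs only in the adapter and quantization keys. A sketch, assuming the standard Axolotl key names (`load_in_4bit`, `lora_r`, `lora_alpha`); check them against `axolotl config-schema` before use:

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: qlora
load_in_4bit: true     # quantize base weights to 4-bit for QLoRA
lora_r: 32             # LoRA rank
lora_alpha: 16         # LoRA scaling factor
datasets:
  - path: my_dataset
    type: chat_template
output_dir: ./outputs/qlora-out
```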
## Project Structure

```
src/axolotl/
  cli/                 # CLI entry points (train, preprocess, inference, merge_lora, vllm_serve)
  core/
    builders/          # TrainerBuilder classes (causal.py for SFT, rl.py for RLHF)
    trainers/          # Trainer classes, mixins (optimizer, scheduler, packing)
      dpo/             # DPO trainer and config
      grpo/            # GRPO trainer and sampler
  loaders/             # Model, tokenizer, adapter, processor loading
  prompt_strategies/   # Dataset format handlers (chat_template, alpaca, dpo/, kto/, orpo/)
  utils/schemas/       # Pydantic config schemas (config, model, training, peft, trl, fsdp)
  integrations/        # Plugins (liger, cut_cross_entropy, swanlab, nemo_gym)
  monkeypatch/         # Runtime patches for HF transformers
examples/              # Example YAML configs by model (llama-3/, qwen2/, mistral/, ebft/)
deepspeed_configs/     # DeepSpeed JSON configs (zero2, zero3)
docs/                  # Quarto documentation site
```
## Code Conventions

- Config-driven: features are toggled via YAML, not code changes
- Prompt strategies: `src/axolotl/prompt_strategies/` — each `type:` value maps to a function
- Plugin system: `plugins:` list in config loads integration modules
- Trainer mixins: `core/trainers/mixins/` for composable trainer behaviors
- Schemas: all config validation via Pydantic in `utils/schemas/`
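For example, an integration from `integrations/` is enabled entirely in YAML. A sketch assuming the Liger plugin's standard entry point (`axolotl.integrations.liger.LigerPlugin`) and its option names, which should be confirmed against the plugin's docs:

```yaml
plugins:
  - axolotl.integrations.liger.LigerPlugin
# Plugin-specific options live alongside the plugins list:
liger_rope: true
liger_rms_norm: true
```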
## Key Documentation
- Getting Started — quickstart tutorial
- Choosing a Method — SFT vs DPO vs GRPO decision guide
- Config Reference — all config options
- Dataset Formats — chat_template, alpaca, input_output, completion
- RLHF — DPO, KTO, ORPO, GRPO, EBFT configs and dataset formats
- GRPO Deep Dive — async training, custom rewards, scaling
- vLLM Serving — vLLM setup for GRPO/EBFT
- Multi-GPU — FSDP and DeepSpeed
- Training Stability — debugging loss, NaN, OOM
- Debugging — VSCode setup, Docker debugging