# Axolotl
Fine-tuning framework for LLMs. Config-driven: every training run is defined by a single YAML file.
## Tech Stack
Python, PyTorch, HuggingFace Transformers, TRL, PEFT (LoRA/QLoRA), DeepSpeed, FSDP, vLLM (for GRPO generation).
## Commands

```bash
axolotl train config.yaml               # Train (single or multi-GPU, auto-detected)
axolotl preprocess config.yaml          # Tokenize dataset and validate config
axolotl preprocess config.yaml --debug  # Inspect tokenized samples and label masking
axolotl inference config.yaml           # Interactive inference
axolotl merge-lora config.yaml          # Merge LoRA adapter into base model
axolotl vllm-serve config.yaml          # Start vLLM server for GRPO/EBFT training
axolotl fetch examples                  # Download example configs
```
## Training Methods

| Method | Config Key | When to Use |
|---|---|---|
| SFT | (default) | Input-output pairs, instruction tuning |
| DPO/IPO | `rl: dpo` / `rl: ipo` | Paired preference data (chosen vs rejected) |
| KTO | `rl: kto` | Unpaired binary preference labels |
| ORPO | `rl: orpo` | Single-stage alignment, no ref model |
| GRPO | `rl: grpo` | RL with verifiable reward functions (math, code) |
| EBFT | `rl: ebft` | Feature-matching rewards from internal representations |
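
For example, switching a run from SFT to DPO is mostly a matter of setting the `rl:` key and pointing at a preference dataset. A minimal sketch (the dataset path is a placeholder and the `type:` value is an assumption; see `docs/agents/preference_tuning.md` for the exact DPO prompt strategies):

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora

rl: dpo                          # or ipo / kto / orpo, per the table above

datasets:
  - path: my_preference_dataset  # placeholder: DPO expects chosen/rejected pairs
    type: chat_template          # assumption: see prompt_strategies/dpo/ for the real type names

output_dir: ./outputs/dpo-out
```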
Agent-specific references:

- `docs/agents/sft.md` — supervised fine-tuning
- `docs/agents/preference_tuning.md` — DPO, IPO, KTO, ORPO, SimPO
- `docs/agents/grpo.md` — GRPO online RL with reward functions
- `docs/agents/reward_modelling.md` — outcome and process reward models
- `docs/agents/pretraining.md` — continual pretraining
## Config Pattern
All training is config-driven. A YAML file specifies model, adapter, dataset(s), and hyperparameters:
```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora   # or qlora, or omit for full fine-tune
datasets:
  - path: my_dataset
    type: chat_template  # prompt strategy (see docs/dataset-formats/)
output_dir: ./outputs/lora-out
```

Config schema: `src/axolotl/utils/schemas/config.py` (`AxolotlInputConfig`).
## Project Structure

```
src/axolotl/
  cli/                 # CLI entry points (train, preprocess, inference, merge_lora, vllm_serve)
  core/
    builders/          # TrainerBuilder classes (causal.py for SFT, rl.py for RLHF)
    trainers/          # Trainer classes, mixins (optimizer, scheduler, packing)
      dpo/             # DPO trainer and config
      grpo/            # GRPO trainer and sampler
  loaders/             # Model, tokenizer, adapter, processor loading
  prompt_strategies/   # Dataset format handlers (chat_template, alpaca, dpo/, kto/, orpo/)
  utils/schemas/       # Pydantic config schemas (config, model, training, peft, trl, fsdp)
  integrations/        # Plugins (liger, cut_cross_entropy, swanlab, nemo_gym)
  monkeypatch/         # Runtime patches for HF transformers
examples/              # Example YAML configs by model (llama-3/, qwen2/, mistral/, ebft/)
deepspeed_configs/     # DeepSpeed JSON configs (zero2, zero3)
docs/                  # Quarto documentation site
```
## Code Conventions

- Config-driven: features are toggled via YAML, not code changes
- Prompt strategies: `src/axolotl/prompt_strategies/` — each `type:` value maps to a function
- Plugin system: `plugins:` list in config loads integration modules (see the sketch after this list)
- Trainer mixins: `core/trainers/mixins/` for composable trainer behaviors
- Schemas: all config validation via Pydantic in `utils/schemas/`
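
As a rough illustration of the plugin convention, a config might enable the `liger` integration like this (the class path and the feature toggles are assumptions based on the directory names above; check `integrations/` and the config schema for the actual keys):

```yaml
# Hedged sketch: load an integration via the plugins: list, then switch on
# its features with plugin-specific top-level keys (key names are assumptions).
plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
```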
## Key Documentation
- Getting Started — quickstart tutorial
- Choosing a Method — SFT vs DPO vs GRPO decision guide
- Config Reference — all config options
- Dataset Formats — chat_template, alpaca, input_output, completion
- RLHF — DPO, KTO, ORPO, GRPO, EBFT configs and dataset formats
- GRPO Deep Dive — async training, custom rewards, scaling
- vLLM Serving — vLLM setup for GRPO/EBFT
- Multi-GPU — FSDP and DeepSpeed
- Training Stability — debugging loss, NaN, OOM
- Debugging — VSCode setup, Docker debugging