axolotl/docs/agents/pretraining.md
NanoCode012 16e32232fb feat(docs): comprehensive improvement (#3564)
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2026-04-02 08:01:26 -04:00


Pretraining / Continual Pretraining — Agent Reference

Train on raw text with no input masking. There are two approaches, and the choice depends on dataset size.

When to Use

  • Continual pretraining on domain-specific corpora
  • Adapting a base model to a new language or domain before fine-tuning
  • Pretraining-style data where the entire text is the training signal

Choosing an Approach

| | Non-streaming (type: completion) | Streaming (pretraining_dataset) |
|---|---|---|
| Dataset size | Fits in memory | Too large to fit in memory |
| Tokenization | Pre-tokenized before training | On-demand during training |
| Config key | datasets: | pretraining_dataset: |
| Long text handling | Splits texts exceeding sequence_len | Concatenates into fixed-length sequences |
| Benefit | Can preprocess on CPU, transfer to GPU | Start training immediately, no preprocessing |
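
The "Long text handling" difference can be made concrete with a toy sketch. This is not Axolotl's code: `split_long` and `concat_fixed` are hypothetical helpers, token ids are plain integers, and details such as EOS insertion or how the incomplete tail is handled are assumptions made only for illustration.

```python
def split_long(tokens, sequence_len):
    """Splitting: cut one document into chunks of at most sequence_len tokens."""
    return [tokens[i:i + sequence_len] for i in range(0, len(tokens), sequence_len)]


def concat_fixed(docs, sequence_len):
    """Concatenating: join documents end to end, then emit full fixed-length windows."""
    stream = [tok for doc in docs for tok in doc]
    # Toy simplification (assumption): drop the incomplete tail window.
    full = len(stream) - len(stream) % sequence_len
    return [stream[i:i + sequence_len] for i in range(0, full, sequence_len)]


# Two fake "tokenized documents": 10 tokens and 3 tokens.
docs = [list(range(10)), list(range(100, 103))]

# Splitting keeps document boundaries, so chunk lengths vary.
split_chunks = [chunk for doc in docs for chunk in split_long(doc, 4)]

# Concatenating ignores boundaries, so every chunk has the same length.
concat_chunks = concat_fixed(docs, 4)
```

In the split version each chunk belongs to a single document, while in the concatenated version the last window mixes tokens from both documents.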

Non-Streaming: type: completion

For smaller datasets that fit in memory. Pre-tokenizes the entire dataset.

datasets:
  - path: my_corpus
    type: completion
    # field: text              # Column name (default: "text")

Streaming: pretraining_dataset

For large corpora. Streams data on-demand without loading everything into memory.

pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

max_steps: 1000                          # Required — axolotl can't infer dataset size
streaming_multipack_buffer_size: 10000   # Buffer for sample packing
pretrain_multipack_attn: true            # Prevent cross-attention between packed samples

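max_steps is required for streaming because axolotl cannot infer the size of a streamed dataset. With packed, fixed-length sequences, one step processes sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus tokens.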
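
As a worked example of the tokens-per-step formula, the sketch below derives a `max_steps` value from a target token budget. The helper names and all numeric settings (2048, 4, 8, 8, one billion tokens) are illustrative assumptions, not recommended values.

```python
def tokens_per_step(sequence_len, micro_batch_size, gradient_accumulation_steps, num_gpus):
    """Tokens consumed by one optimizer step, per the formula above."""
    return sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus


def steps_for_budget(token_budget, **settings):
    """Smallest max_steps whose total tokens cover token_budget (ceiling division)."""
    per_step = tokens_per_step(**settings)
    return -(-token_budget // per_step)


# Hypothetical run settings.
settings = dict(
    sequence_len=2048,
    micro_batch_size=4,
    gradient_accumulation_steps=8,
    num_gpus=8,
)

per_step = tokens_per_step(**settings)                    # 524288 tokens per step
max_steps = steps_for_budget(1_000_000_000, **settings)   # 1908 steps for ~1B tokens
```

The resulting number is what would go into `max_steps` in the config, assuming every sequence is filled to `sequence_len`.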

Full streaming docs: streaming.qmd

Dataset Format

{"text": "The complete document text goes here."}
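
A minimal sketch of producing records in this shape with the Python standard library. Writing one JSON object per line (JSON Lines) and the file name `my_corpus.jsonl` are assumptions for illustration; the source only specifies the `{"text": ...}` record shape.

```python
import json
import os
import tempfile

# Hypothetical raw documents; newlines inside a document are escaped by json.dumps.
documents = ["First document text.", "Second document,\nwith a newline."]

path = os.path.join(tempfile.mkdtemp(), "my_corpus.jsonl")

# Write one {"text": ...} record per line.
with open(path, "w", encoding="utf-8") as f:
    for text in documents:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses and has a "text" field.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```

Using `json.dumps` rather than string formatting keeps quotes and newlines inside the text correctly escaped.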

Key Settings

  • sample_packing: true + pad_to_sequence_len: true — pack documents into fixed-length sequences
  • flash_attention: true — required for sample packing
  • No adapter — typically full fine-tune for pretraining
  • train_on_inputs: true — default for completion (all tokens trained on)
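
To illustrate what packing documents into fixed-length sequences means, and why packed samples should not attend to each other, here is a toy sketch. It is not Axolotl's packing algorithm: `pack` and `attention_allowed` are hypothetical helpers, the greedy strategy and truncation of overlong documents are simplifying assumptions, and real implementations usually avoid building a dense mask.

```python
def pack(docs, sequence_len, pad_id=0):
    """Greedily pack token lists into rows of exactly sequence_len tokens.

    Returns (rows, segment_ids). Segment ids number the documents within
    each row starting at 1; segment id 0 marks padding.
    """
    rows, segs = [], []
    row, seg, next_id = [], [], 1
    for doc in docs:
        doc = doc[:sequence_len]  # toy simplification: truncate overlong documents
        if len(row) + len(doc) > sequence_len:
            # Current row is full enough: pad it out and start a new one.
            pad = sequence_len - len(row)
            rows.append(row + [pad_id] * pad)
            segs.append(seg + [0] * pad)
            row, seg, next_id = [], [], 1
        row = row + doc
        seg = seg + [next_id] * len(doc)
        next_id += 1
    if row:
        pad = sequence_len - len(row)
        rows.append(row + [pad_id] * pad)
        segs.append(seg + [0] * pad)
    return rows, segs


def attention_allowed(seg, q, k):
    """Causal attention restricted to tokens of the same (non-padding) document."""
    return seg[q] != 0 and seg[q] == seg[k] and k <= q


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
rows, segs = pack(docs, sequence_len=6)
# rows -> [[11, 12, 13, 21, 22, 0], [31, 32, 33, 34, 0, 0]]
# segs -> [[1, 1, 1, 2, 2, 0],      [1, 1, 1, 1, 0, 0]]
```

In the first row, position 3 (the first token of the second document) may not attend to position 2 (the last token of the first document), which is the separation that `pretrain_multipack_attn` is described as enforcing.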

File Map

src/axolotl/
  prompt_strategies/completion.py    # Non-streaming: completion prompt strategy (no masking)
  utils/data/sft.py                  # Non-streaming: dataset loading and processing
  utils/data/streaming.py            # Streaming: encode_streaming(), wrap_streaming_dataset()
  utils/schemas/config.py            # Config fields: pretraining_dataset, pretrain_multipack_attn, etc.

examples/streaming/pretrain.yaml     # Full streaming pretraining example config