# Pretraining / Continual Pretraining — Agent Reference

Train on raw text with no input masking. Two approaches depending on dataset size.

## When to Use

- Continual pretraining on domain-specific corpora
- Adapting a base model to a new language or domain before fine-tuning
- Pretraining-style data where the entire text is the training signal

## Choosing an Approach

| | Non-streaming (`type: completion`) | Streaming (`pretraining_dataset`) |
|---|---|---|
| **Dataset size** | Fits in memory | Too large to fit in memory |
| **Tokenization** | Pre-tokenized before training | On-demand during training |
| **Config key** | `datasets:` | `pretraining_dataset:` |
| **Long text handling** | Splits texts exceeding `sequence_len` | Concatenates into fixed-length sequences |
| **Benefit** | Can preprocess on CPU, transfer to GPU | Start training immediately, no preprocessing |

## Non-Streaming: `type: completion`

For smaller datasets that fit in memory. Pre-tokenizes the entire dataset.

```yaml
datasets:
  - path: my_corpus
    type: completion
    # field: text  # Column name (default: "text")
```
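
A slightly fuller sketch that combines the dataset entry with the packing settings listed under Key Settings below; the corpus path and `sequence_len` value are illustrative placeholders.

```yaml
# Sketch: non-streaming continual pretraining (values are illustrative).
datasets:
  - path: my_corpus            # placeholder dataset path
    type: completion

sequence_len: 4096             # documents longer than this are split
sample_packing: true           # pack documents into fixed-length sequences
pad_to_sequence_len: true
flash_attention: true          # required for sample packing
train_on_inputs: true          # all tokens contribute to the loss
```
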
## Streaming: `pretraining_dataset`

For large corpora. Streams and tokenizes data on demand without loading everything into memory.

```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

max_steps: 1000  # Required — axolotl can't infer dataset size
streaming_multipack_buffer_size: 10000  # Buffer for sample packing
pretrain_multipack_attn: true  # Prevent cross-attention between packed samples
```

`max_steps` is required for streaming — one step = `sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus` tokens.
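
A worked sizing example, with all values illustrative: pick batch and hardware settings, compute tokens per step, then divide the target token budget by that figure.

```yaml
# Illustrative max_steps sizing (placeholder values, adjust to your run).
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
# On 8 GPUs: tokens per step = 2048 * 2 * 4 * 8 = 131,072.
# For a ~1B-token budget: 1,000,000,000 / 131,072 ≈ 7,630 steps.
max_steps: 7630
```
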

Full streaming docs: [streaming.qmd](../streaming.qmd)

## Dataset Format

```json
{"text": "The complete document text goes here."}
```
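
If the corpus is a local JSONL file rather than a Hub dataset, the dataset entry can point at the file directly. A sketch, assuming axolotl's generic `ds_type` dataset field and a placeholder path:

```yaml
datasets:
  - path: ./data/corpus.jsonl   # placeholder local path
    ds_type: json               # load the path as a local JSON/JSONL file
    type: completion
```
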
## Key Settings

- `sample_packing: true` + `pad_to_sequence_len: true` — pack documents into fixed-length sequences
- `flash_attention: true` — required for sample packing
- No adapter — typically full fine-tune for pretraining
- `train_on_inputs: true` — default for completion (all tokens trained on)

## File Map

```
src/axolotl/
  prompt_strategies/completion.py  # Non-streaming: completion prompt strategy (no masking)
  utils/data/sft.py                # Non-streaming: dataset loading and processing
  utils/data/streaming.py          # Streaming: encode_streaming(), wrap_streaming_dataset()
  utils/schemas/config.py          # Config fields: pretraining_dataset, pretrain_multipack_attn, etc.

examples/streaming/pretrain.yaml   # Full streaming pretraining example config
```