feat(docs): comprehensive improvement (#3564)
* docs: comprehensive documentation improvements for humans and agents New human docs: - grpo.qmd: GRPO deep dive (async, rewards, IS correction, scaling) - ebft.qmd: EBFT guide (structured/strided modes, feature extraction) - choosing_method.qmd: decision tree for SFT vs LoRA vs DPO vs GRPO - vllm_serving.qmd: vLLM setup for GRPO (server/colocate, LoRA sync) - training_stability.qmd: monitoring, NaN debugging, OOM, healthy metrics New agent docs: - AGENTS_SFT.md: agent reference for supervised fine-tuning - AGENTS_DPO.md: agent reference for preference learning (DPO/KTO/ORPO) Updated existing docs: - rlhf.qmd: cross-references to new GRPO/EBFT/choosing-method guides - getting-started.qmd: reorganized Next Steps with links to new guides - debugging.qmd: link to training stability guide - _quarto.yml: added new pages to sidebar navigation Removed: - bak.agents.md: stale backup that confused agents * docs: trim duplicated generic config from AGENTS_DPO.md Remove boilerplate training params (optimizer, gradient_checkpointing, flash_attention, etc.) from each method template. These are not preference-learning-specific and are already covered in AGENTS_SFT.md. Config templates now show only method-specific fields with a reference to AGENTS_SFT.md for the rest. * docs: deduplicate across new doc pages - grpo.qmd: collapse vLLM setup section to brief config + link to vllm_serving.qmd; collapse IS correction to essentials + link; replace full monitoring tables with summary + link to training_stability.qmd - vllm_serving.qmd: remove duplicated async/IS config reference tables (already in grpo.qmd config reference); replace full example config with link to grpo.qmd quick start - ebft.qmd: trim generic training params in quick start config * fix: train scripts * feat: split files into cleaner parts * fix: cleanup pretraining docs --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>
This commit is contained in:
75
docs/agents/pretraining.md
Normal file
75
docs/agents/pretraining.md
Normal file
@@ -0,0 +1,75 @@
|
||||
# Pretraining / Continual Pretraining — Agent Reference
|
||||
|
||||
Train on raw text with no input masking. Two approaches depending on dataset size.
|
||||
|
||||
## When to Use
|
||||
|
||||
- Continual pretraining on domain-specific corpora
|
||||
- Adapting a base model to a new language or domain before fine-tuning
|
||||
- Pretraining-style data where the entire text is the training signal
|
||||
|
||||
## Choosing an Approach
|
||||
|
||||
| | Non-streaming (`type: completion`) | Streaming (`pretraining_dataset`) |
|
||||
|---|---|---|
|
||||
| **Dataset size** | Fits in memory | Too large to fit in memory |
|
||||
| **Tokenization** | Pre-tokenized before training | On-demand during training |
|
||||
| **Config key** | `datasets:` | `pretraining_dataset:` |
|
||||
| **Long text handling** | Splits texts exceeding `sequence_len` | Concatenates into fixed-length sequences |
|
||||
| **Benefit** | Can preprocess on CPU, transfer to GPU | Start training immediately, no preprocessing |
|
||||
|
||||
## Non-Streaming: `type: completion`
|
||||
|
||||
For smaller datasets that fit in memory. Pre-tokenizes the entire dataset.
|
||||
|
||||
```yaml
|
||||
datasets:
|
||||
- path: my_corpus
|
||||
type: completion
|
||||
# field: text # Column name (default: "text")
|
||||
```
|
||||
|
||||
## Streaming: `pretraining_dataset`
|
||||
|
||||
For large corpora. Streams data on-demand without loading everything into memory.
|
||||
|
||||
```yaml
|
||||
pretraining_dataset:
|
||||
- path: HuggingFaceFW/fineweb-edu
|
||||
type: pretrain
|
||||
text_column: text
|
||||
split: train
|
||||
|
||||
max_steps: 1000 # Required — axolotl can't infer dataset size
|
||||
streaming_multipack_buffer_size: 10000 # Buffer for sample packing
|
||||
pretrain_multipack_attn: true # Prevent cross-attention between packed samples
|
||||
```
|
||||
|
||||
`max_steps` is required for streaming — one step = `sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus` tokens.
|
||||
|
||||
Full streaming docs: [streaming.qmd](../streaming.qmd)
|
||||
|
||||
## Dataset Format
|
||||
|
||||
```json
|
||||
{"text": "The complete document text goes here."}
|
||||
```
|
||||
|
||||
## Key Settings
|
||||
|
||||
- `sample_packing: true` + `pad_to_sequence_len: true` — pack documents into fixed-length sequences
|
||||
- `flash_attention: true` — required for sample packing
|
||||
- No adapter — typically full fine-tune for pretraining
|
||||
- `train_on_inputs: true` — default for completion (all tokens trained on)
|
||||
|
||||
## File Map
|
||||
|
||||
```
|
||||
src/axolotl/
|
||||
prompt_strategies/completion.py # Non-streaming: completion prompt strategy (no masking)
|
||||
utils/data/sft.py # Non-streaming: dataset loading and processing
|
||||
utils/data/streaming.py # Streaming: encode_streaming(), wrap_streaming_dataset()
|
||||
utils/schemas/config.py # Config fields: pretraining_dataset, pretrain_multipack_attn, etc.
|
||||
|
||||
examples/streaming/pretrain.yaml # Full streaming pretraining example config
|
||||
```
|
||||
Reference in New Issue
Block a user