Resolve merge conflicts: unify pretraining utils imports, add alias handling; fix rl.py per new RL dataset API; resolve config schema conflict and add sequence_len_overflow_handling field

2025-08-12 20:45:26 +02:00
parent f5a3e3529e 3d45620008
commit 47b3fe8af3
603 changed files with 37614 additions and 14002 deletions
--- a/docs/.gitignore
+++ b/docs/.gitignore
@@ -2,3 +2,4 @@
 _site/
 /api/*.qmd
 /api/*.html
+config-reference.qmd
--- a/docs/cli.qmd
+++ b/docs/cli.qmd
@@ -23,6 +23,20 @@ axolotl <command> [config.yml] [options]

 The config file can be local or a URL to a raw YAML file.

+### Launcher Arguments
+
+For commands that support multi-GPU (`train`, `evaluate`, ...), you can pass launcher-specific arguments using the `--` separator:
+
+```bash
+# Pass torchrun arguments
+axolotl train config.yml --launcher torchrun -- --nproc_per_node=2 --nnodes=1
+
+# Pass accelerate arguments
+axolotl train config.yml --launcher accelerate -- --config_file=accelerate_config.yml --num_processes=4
+```
+
+Arguments after `--` are passed directly to the launcher (torchrun, accelerate launch, etc.).
+
 ## Command Reference

 ### fetch
@@ -80,7 +94,11 @@ axolotl train config.yml \
    --num-epochs 3

 # Training without accelerate
-axolotl train config.yml --no-accelerate
+axolotl train config.yml --launcher python
+
+# Pass launcher-specific arguments using -- separator
+axolotl train config.yml --launcher torchrun -- --nproc_per_node=2 --nnodes=1
+axolotl train config.yml --launcher accelerate -- --config_file=accelerate_config.yml

 # Resume training from checkpoint
 axolotl train config.yml --resume-from-checkpoint path/to/checkpoint
@@ -175,6 +193,9 @@ Evaluates a model's performance (loss etc) on the train and eval datasets.
 ```bash
 # Basic evaluation
 axolotl evaluate config.yml
+
+# Evaluation with launcher arguments
+axolotl evaluate config.yml --launcher torchrun -- --nproc_per_node=2
 ```

 ### lm-eval
@@ -209,6 +230,16 @@ axolotl delinearize-llama4 --model path/to/model_dir --output path/to/output_dir

 This would be necessary to use with other frameworks. If you have an adapter, merge it with the non-quantized linearized model before delinearizing.

+### quantize
+
+Quantizes a model using the quantization configuration specified in your YAML file.
+
+```bash
+axolotl quantize config.yml
+```
+
+See [Quantization](./quantize.qmd) for more details.
+

 ## Legacy CLI Usage

@@ -277,9 +308,6 @@ axolotl preprocess config.yml --cloud cloud_config.yml
 # Train on cloud
 axolotl train config.yml --cloud cloud_config.yml

-# Train without accelerate on cloud
-axolotl train config.yml --cloud cloud_config.yml --no-accelerate
-
 # Run lm-eval on cloud
 axolotl lm-eval config.yml --cloud cloud_config.yml
 ```
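The `--` separator convention documented in the `docs/cli.qmd` hunks above can be sketched in a few lines of Python. This is only an illustration of the splitting rule ("arguments after `--` are passed directly to the launcher"); `split_launcher_args` is a hypothetical helper name and is not taken from Axolotl's CLI code.

```python
from typing import List, Tuple


def split_launcher_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split an argument list at the first bare `--`.

    Hypothetical helper for illustration only; Axolotl's real parser may differ.
    Everything before `--` is kept for the axolotl command, everything after
    it is handed to the launcher (torchrun, accelerate launch, ...).
    """
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    # No separator: nothing is forwarded to the launcher.
    return list(argv), []


axolotl_args, launcher_args = split_launcher_args(
    ["train", "config.yml", "--launcher", "torchrun", "--", "--nproc_per_node=2", "--nnodes=1"]
)
print("axolotl:", axolotl_args)
print("launcher:", launcher_args)
```

Splitting on the first `--` only means a launcher argument that itself contains `--` later in the list is forwarded untouched.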
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -7,6 +7,7 @@ toc-depth: 3
 ```{python}
 #| echo: false

+import os
 import re

 def process_readme(integration_name):
@@ -53,6 +54,24 @@ sections = [
    ("LLMCompressor", "llm_compressor")
 ]

+for folder_name in os.listdir("../src/axolotl/integrations/"):
+    if folder_name in [path for name, path in sections]:
+        # skip if already in sections
+        continue
+    if os.path.exists(f"../src/axolotl/integrations/{folder_name}/README.md"):
+        # grab the first heading in README.md as the section name
+        with open(f"../src/axolotl/integrations/{folder_name}/README.md", "r") as f:
+            txt = f.read()
+            matches = re.search(r'^# (.*)\n?', txt, flags=re.MULTILINE)
+            if matches:
+                name = matches.group(1)
+            else:
+                continue
+            sections.append((name, folder_name))
+
+# sort sections by name
+sections = sorted(sections, key=lambda x: x[0])
+
 for section_name, folder_name in sections:
    print(print_section(section_name, folder_name))
 ```
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -9,10 +9,10 @@ order: 3
 Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. Support using tokenizer's template, a supported template, or custom jinja2.

 ```{.json filename="data.jsonl"}
-{"conversations": [{"role": "...", "content": "..."}]}
+{"messages": [{"role": "...", "content": "..."}, {"role": "...", "content": "..."}, ...]}
 ```

-See [configs](../config.qmd) for full configs and supported templates.
+See [configs](../config-reference.qmd) for full configs and supported templates.

 ### Migrating from sharegpt

@@ -52,7 +52,9 @@ We recommend checking the below examples for other usecases.

 ### Examples

-1. (Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
+#### Training on last message
+
+(Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.

 ```yaml
 datasets:
@@ -66,7 +68,9 @@ datasets:
 If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
 :::

-2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
+#### Overriding default chat template
+
+Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.

 ```yaml
 chat_template: gemma # this overwrites the tokenizer's chat_template
@@ -76,7 +80,13 @@ datasets:
    roles_to_train: ["assistant"]  # default value
 ```

-3. Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
+::: {.callout-note}
+To use the tokenizer's built-in chat template, set `chat_template: tokenizer_default` (this is the default).
+:::
+
+#### Using default chat template with fallback
+
+Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.

 ```yaml
 chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer's chat_template
@@ -85,7 +95,9 @@ datasets:
    type: chat_template
 ```

-4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.
+#### Custom Jinja template
+
+Using a custom jinja template on OpenAI messages format, training on all assistant messages.

 ```yaml
 # chat_template: jinja # `jinja` will be implied if the `chat_template_jinja` is set and this field is empty
@@ -100,7 +112,9 @@ datasets:
 Please make sure that your `tokenizer.eos_token` is same as EOS (End-of-Sequence) token in template. Otherwise, set `eos_token` under `special_tokens: `.
 :::

-5. If you are using a template that has a different EOT (End-of-Turn) token from EOS token or multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: ` which defaults to turn.
+#### Using template with different token for EOT and EOS
+
+- If you are using a template that has a different EOT (End-of-Turn) token from EOS token or multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: `, which defaults to `turn`.

 ```yaml
 eot_tokens:
@@ -116,16 +130,16 @@ datasets:
 ```

 ::: {.callout-tip}
-See [config documentation](../config.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
+See [config documentation](../config-reference.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
 :::

 ::: {.callout-note}
 Using `eot_tokens` requires each token that exists in `chat_template` to be a single token in the tokenizer. Otherwise, the tokenizer will split the token and cause unexpected behavior.

-You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config.qmd) for more details.
+You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config-reference.qmd) for more details.
 :::

-6. Continuing from the previous example, if you want to train on all EOT token trainable turns but only last EOS token, set `train_on_eos: last`.
+- Continuing from the previous example, if you want to train on all EOT token trainable turns but only last EOS token, set `train_on_eos: last`.

 ```yaml
 eot_tokens:
@@ -145,7 +159,76 @@ If EOS token only appears at the end of a prompt, `train_on_eos: last` is equiva
 :::


-7. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
+#### Using tool use
+
+Instead of passing `tools` via the system prompt, you can store them in a separate `tools` column and let the `chat_template` load them and build the tool section of the prompt dynamically.
+
+```json
+{
+    "tools": [
+        {
+            "type": "...",
+            "function": {
+                "name": "...",
+                "description": "...",
+                "parameters": {
+                    "type": "...",
+                    "properties": {
+                        // ...
+                    },
+                    "required": ["..."],
+                },
+            },
+        },
+    ],
+    "messages": [
+        // ...
+        {
+            "role": "assistant", // call the function via assistant
+            "tool_calls": [
+                {
+                    "id": "...",  // required only for mistral
+                    "type": "function",
+                    "function": {
+                        "name": "...",
+                        "arguments": {
+                            "...": "...",
+                        }
+                    }
+                }
+            ]
+        },
+        {
+            "role": "tool",
+            "tool_call_id": "...",  // required only for mistral
+            "name": "...",
+            "content": "..."
+        },
+    ],
+}
+```
+
+::: {.callout-note}
+Tools need to follow [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
+:::
+
+Example config for Llama4:
+```yaml
+chat_template: llama4
+datasets:
+  - path: Nanobit/text-tools-2k-test
+    type: chat_template
+    # field_tools: tools # default is `tools`
+```
+
+::: {.callout-tip}
+Look into the `chat_template` you are using to see if it supports `tools` and what the expected role is for the tool answer. In the example above, the tool answer is expected to be in the `tool` or `ipython` role for `llama4` template.
+:::
+
+
+#### Using fine-grained control over token masking
+
+(Advanced) Using fine-grained control over tokens and turns to train in a conversation

 For a data sample that looks like:

@@ -196,7 +279,9 @@ datasets:
 It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
 :::

-8. (For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.
+#### Reasoning split
+
+(For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.

 ```yaml
 datasets:
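The tool-use sample in the "Using tool use" hunk above is written with `// ...` placeholders and trailing commas, so it is not strict JSON as shown. As a rough sketch, one such row could be built and serialized to a `.jsonl` line as below. Every concrete value (the `get_weather` function, its arguments, the `call00001` id, the message texts) is made up for illustration and is not taken from the Axolotl docs or the example dataset.

```python
import json

# All concrete names and values below are illustrative assumptions.
row = {
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {
            "role": "assistant",  # the assistant turn issues the tool call
            "tool_calls": [
                {
                    "id": "call00001",
                    "type": "function",
                    "function": {"name": "get_weather", "arguments": {"city": "Paris"}},
                }
            ],
        },
        {"role": "tool", "tool_call_id": "call00001", "name": "get_weather", "content": "18C and sunny"},
        {"role": "assistant", "content": "It is 18C and sunny in Paris."},
    ],
}

# Sanity check: every tool call must reference a tool declared in the `tools` column.
declared = {tool["function"]["name"] for tool in row["tools"]}
called = {
    call["function"]["name"]
    for message in row["messages"]
    for call in message.get("tool_calls", [])
}
assert called <= declared, f"undeclared tools called: {called - declared}"

line = json.dumps(row)  # one strict-JSON line, ready to append to data.jsonl
print(line)
```

The `id` and `tool_call_id` fields are included here although the sample marks them as required only for Mistral.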
--- a/docs/dataset-formats/index.qmd
+++ b/docs/dataset-formats/index.qmd
@@ -36,10 +36,6 @@ It is typically recommended to save your dataset as `.jsonl` due to its flexibil

 Axolotl supports loading from a Hugging Face hub repo or from local files.

-::: {.callout-important}
-For pre-training only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts.
-:::
-
 ### Pre-training from Hugging Face hub datasets

 As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:
@@ -77,18 +73,21 @@ datasets:
    type: completion
 ```

-From local files (either example works):
+From local files:

 ```yaml
 datasets:
  - path: A.jsonl
    type: completion

-  - path: json
-    data_files: ["A.jsonl", "B.jsonl", "C.jsonl"]
+  - path: B.jsonl
    type: completion
 ```

+::: {.callout-important}
+For `completion` only, Axolotl splits texts that exceed the context length into multiple smaller prompts. If you would like this for `pretraining_dataset` too, please let us know or open a PR!
+:::
+
 ### Pre-training dataset configuration tips

 #### Setting max_steps
--- a/docs/dataset-formats/inst_tune.qmd
+++ b/docs/dataset-formats/inst_tune.qmd
@@ -186,4 +186,4 @@ datasets:
      no_input_format: "[INST] {instruction} [/INST]"
 ```

-See full config options under [here](../config.qmd).
+See full config options under [here](../config-reference.qmd).
--- a/docs/dataset_loading.qmd
+++ b/docs/dataset_loading.qmd
@@ -36,7 +36,7 @@ This matches the API of [`datasets.load_dataset`](https://github.com/huggingface

 For HuggingFace's guide to load different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

-For full details on the config, see [config.qmd](config.qmd).
+For full details on the config, see [config-reference.qmd](config-reference.qmd).

 ::: {.callout-note}

@@ -54,7 +54,7 @@ datasets:

 #### Files

-Usually, to load a JSON file, you would do something like this:
+To load a JSON file, you would do something like this:

 ```python
 from datasets import load_dataset
@@ -66,20 +66,12 @@ Which translates to the following config:

 ```yaml
 datasets:
-  - path: json
-    data_files: /path/to/your/file.jsonl
-```
-
-However, to make things easier, we have added a few shortcuts for loading local dataset files.
-
-You can just point the `path` to the file or directory along with the `ds_type` to load the dataset. The below example shows for a JSON file:
-
-```yaml
-datasets:
-  - path: /path/to/your/file.jsonl
+  - path: data.json
    ds_type: json
 ```

+As the example above shows, you can point `path` to the file or directory and set `ds_type` to load the dataset.
+
 This works for CSV, JSON, Parquet, and Arrow files.

 ::: {.callout-tip}
--- a/docs/docker.qmd
+++ b/docs/docker.qmd
@@ -9,7 +9,7 @@ format:
 This section describes the different Docker images that are released by AxolotlAI at [Docker Hub](https://hub.docker.com/u/axolotlai).

 ::: {.callout-important}
-For Blackwell GPUs, please use the tags with Pytorch 2.7.0 and CUDA 12.8.
+For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.
 :::

 ## Base
@@ -32,11 +32,11 @@ main-base-py{python_version}-cu{cuda_version}-{pytorch_version}

 Tags examples:

- `main-base-py3.11-cu128-2.7.0`
+- `main-base-py3.11-cu128-2.7.1`
+- `main-base-py3.11-cu126-2.7.1`
 - `main-base-py3.11-cu126-2.7.0`
+- `main-base-py3.11-cu126-2.6.0`
 - `main-base-py3.11-cu124-2.6.0`
- `main-base-py3.11-cu124-2.5.1`
- `main-base-py3.11-cu124-2.4.1`

 ## Main

@@ -74,15 +74,15 @@ There may be some extra tags appended to the image, like `-vllm` which installs

 Tags examples:

+- `main-py3.11-cu128-2.7.1`
+- `main-py3.11-cu126-2.7.1`
 - `main-py3.11-cu126-2.7.0`
+- `main-py3.11-cu126-2.6.0`
 - `main-py3.11-cu124-2.6.0`
- `main-py3.11-cu124-2.5.1`
- `main-py3.11-cu124-2.4.1`
 - `main-latest`
 - `main-20250303-py3.11-cu124-2.6.0`
- `main-20250303-py3.11-cu124-2.5.1`
- `main-20250303-py3.11-cu124-2.4.1`
- `0.7.1`
+- `main-20250303-py3.11-cu126-2.6.0`
+- `0.10.1`

 ## Cloud

--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -9,11 +9,11 @@ description: Frequently asked questions

 > A: Usually an issue with the GPUs communicating with each other. See the [NCCL doc](nccl.qmd)

-**Q: Exitcode -9**
+**Q: exitcode: -9**

 > A: This usually happens when you run out of system RAM.

-**Q: Exitcode -7 while using deepspeed**
+**Q: exitcode: -7 while using deepspeed**

 > A: Try upgrading deepspeed w: `pip install -U deepspeed`

@@ -51,6 +51,18 @@ description: Frequently asked questions
 >   pad_token: "..."
 > ```

+**Q: `IterableDataset error` or `KeyError: 'input_ids'` when using `preprocess` CLI**
+
+> A: This usually means you are running the `preprocess` CLI with `pretraining_dataset:` (for the `IterableDataset` error) or with `skip_prepare_dataset: true` (for the `KeyError`). Use the `axolotl train` CLI directly instead, as these datasets are prepared on demand.
+
+**Q: vLLM is not working with Axolotl**
+
+> A: We currently recommend torch 2.6.0 for use with `vllm`. Please ensure you use the right version. For Docker, please use the `main-py3.11-cu124-2.6.0` tag.
+
+**Q: FA2 2.8.0 `undefined symbol` runtime error on CUDA 12.4**
+
+> A: There seems to be a wheel issue with FA2 2.8.0 on CUDA 12.4. Try CUDA 12.6 instead or downgrade to FA2 2.7.4. Please refer to the upstream issue: https://github.com/Dao-AILab/flash-attention/issues/1717.
+
 ### Chat templates

 **Q: `jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____`**
@@ -110,3 +122,21 @@ description: Frequently asked questions
 > A: If `eot_tokens: ` is not provided, the default behavior is the same as before. EOS tokens used to delimit turns are masked/unmasked depending on whether the turn is trainable.

 > Internally, `eot_tokens: tokenizer.eos_token` and `train_on_eot: train_on_eos` (which defaults to `turn`). This transition helps clarify the naming and behavior of EOT/EOS tokens.
+
+**Q: `Data processing error: CAS service error`**
+
+> A: Try disabling XET with `export HF_HUB_DISABLE_XET=1`
+
+**Q: `torch._inductor.exc.LoweringException: NoValidChoicesError: No choices to select, please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice. `**
+
+> A: Depending on the version of torch, you may need to include this in your YAML:
+
+> ```yaml
+> flex_attn_compile_kwargs:
+>   dynamic: false
+>   mode: max-autotune-no-cudagraphs
+> ```
+
+**Q: `ValueError("Backward pass should have cleared tracker of all tensors")`**
+
+> A: This may happen due to edge cases in the modern OffloadActivations context manager, which uses CUDA streams. If you encounter this error, you may have success using the naive implementation with `activation_offloading: legacy` in your YAML.
--- a/docs/fsdp_qlora.qmd
+++ b/docs/fsdp_qlora.qmd
@@ -20,7 +20,7 @@ To enable `QLoRA` with `FSDP`, you need to perform the following steps:
 > See the [example config](#example-config) file in addition to reading these instructions.

 1. Set `adapter: qlora` in your axolotl config file.
-2. Enable FSDP in your axolotl config, as [described here](https://github.com/axolotl-ai-cloud/axolotl?tab=readme-ov-file#fsdp).
+2. Enable FSDP in your axolotl config, as [described here](multi-gpu.qmd#sec-fsdp).
 3. Use one of the supported model types: `llama`, `mistral` or `mixtral`.

 ## Example Config
--- a/docs/getting-started.qmd
+++ b/docs/getting-started.qmd
@@ -55,7 +55,7 @@ output_dir: ./outputs/lora-out
 - To perform QLoRA finetuning, replace with `load_in_4bit: true` and `adapter: qlora`.
 :::

-See our [Config options](config.qmd) for more details.
+See our [config options](config-reference.qmd) for more details.

 ### Training {#sec-training}

@@ -179,7 +179,7 @@ Now that you have the basics, you might want to:

 Check our other guides for details on these topics:

- [Configuration Guide](config.qmd) - Full configuration options
+- [Configuration Guide](config-reference.qmd) - Full configuration options
 - [Dataset Loading](dataset_loading.qmd) - Loading datasets from various sources
 - [Dataset Formats](dataset-formats) - Working with different data formats
 - [Multi-GPU Training](multi-gpu.qmd)
--- a/docs/gradient_checkpointing.qmd
+++ b/docs/gradient_checkpointing.qmd
@@ -0,0 +1,29 @@
+---
+title: Gradient Checkpointing and Activation Offloading
+---
+
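+Gradient checkpointing and activation offloading reduce the GPU memory footprint of training. Checkpointing recomputes activations during the backward pass
+instead of storing them, and offloading moves stored activations off the GPU. Both trade some extra compute or transfer time for lower memory use.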
+
+### Enabling Gradient Checkpointing
+
+```yaml
+gradient_checkpointing: true
+```
+
+### Enabling Activation Offloading
+
+```yaml
+gradient_checkpointing: true  # required for activation offloading
+activation_offloading: true
+```
+
+Activation offloading variants:
+
+The default `activation_offloading: true` offloads activations to CPU and uses CUDA streams
+to overlap the communications and computations when offloading.
+
+`activation_offloading: legacy` naively offloads activations to CPU without additional optimizations.
+
+For resource-constrained environments with limited CPU memory, `activation_offloading: disk` offloads
+activations to disk instead of CPU RAM, so that much longer context lengths can be trained with minimal memory.
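The relationship between the two options in `docs/gradient_checkpointing.qmd` above (offloading requires checkpointing, and offloading accepts `true`, `legacy`, or `disk`) can be expressed as a small pre-flight check. This is a sketch under assumptions: `check_offloading_config` is a hypothetical helper operating on an already-parsed config dict, and it does not reproduce Axolotl's own schema validation.

```python
from typing import Any, Dict

# Values described in the doc above, plus False for "disabled" (an assumption).
ALLOWED_OFFLOADING = (False, True, "legacy", "disk")


def check_offloading_config(cfg: Dict[str, Any]) -> None:
    """Raise ValueError if the offloading settings are inconsistent.

    Hypothetical helper for illustration; not part of Axolotl.
    """
    offloading = cfg.get("activation_offloading", False)
    if offloading not in ALLOWED_OFFLOADING:
        raise ValueError(f"unsupported activation_offloading value: {offloading!r}")
    # The YAML comment in the doc notes that checkpointing is required for offloading.
    if offloading and not cfg.get("gradient_checkpointing", False):
        raise ValueError("activation_offloading requires gradient_checkpointing: true")


check_offloading_config({"gradient_checkpointing": True, "activation_offloading": "disk"})
print("config is consistent")
```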
--- a/docs/installation.qmd
+++ b/docs/installation.qmd
@@ -14,8 +14,8 @@ This guide covers all the ways you can install and set up Axolotl for your envir
 ## Requirements {#sec-requirements}

 - NVIDIA GPU (Ampere architecture or newer for `bf16` and Flash Attention) or AMD GPU
- Python ≥3.10
- PyTorch ≥2.4.1
+- Python ≥3.11
+- PyTorch ≥2.6.0

 ## Installation Methods {#sec-installation-methods}

@@ -41,6 +41,40 @@ installed) in order not to clobber it, and so that we set the correct version of
 dependencies that are specific to the PyTorch version or other installed
 co-dependencies.

+### uv Installation {#sec-uv}
+
+uv is a fast, reliable Python package installer and resolver built in Rust. It offers significant performance improvements over pip and provides better dependency resolution, making it an excellent choice for complex environments.
+
+Install uv if not already installed
+```{.bash}
+curl -LsSf https://astral.sh/uv/install.sh | sh
+source $HOME/.local/bin/env
+```
+
+Choose your CUDA version to use with PyTorch; e.g. `cu124`, `cu126`, `cu128`,
+then create the venv and activate
+```{.bash}
+export UV_TORCH_BACKEND=cu126
+uv venv --no-project --relocatable
+source .venv/bin/activate
+```
+
+Install PyTorch
+- PyTorch 2.6.0 recommended
+```{.bash}
+uv pip install packaging setuptools wheel
+uv pip install torch==2.6.0
+uv pip install awscli pydantic
+```
+
+Install axolotl from PyPI
+```{.bash}
+uv pip install --no-build-isolation axolotl[deepspeed,flash-attn]
+
+# optionally install with vLLM if you're using torch==2.6.0 and want to train w/ GRPO
+uv pip install --no-build-isolation axolotl[deepspeed,flash-attn,vllm]
+```
+
 ### Edge/Development Build {#sec-edge-build}

 For the latest features between releases:
@@ -90,10 +124,13 @@ For providers supporting Docker:

 - Use `axolotlai/axolotl-cloud:main-latest`
 - Available on:
-  - [Latitude.sh](https://latitude.sh/blueprint/989e0e79-3bf6-41ea-a46b-1f246e309d5c)
-  - [JarvisLabs.ai](https://jarvislabs.ai/templates/axolotl)
-  - [RunPod](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)
-  - [Novita](https://novita.ai/gpus-console?templateId=311)
+    - [RunPod](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)
+    - [Vast.ai](https://cloud.vast.ai?ref_id=62897&template_id=bdd4a49fa8bce926defc99471864cace&utm_source=axolotl&utm_medium=partner&utm_campaign=template_launch_july2025&utm_content=docs_link)
+    - [PRIME Intellect](https://app.primeintellect.ai/dashboard/create-cluster?image=axolotl&location=Cheapest&security=Cheapest&show_spot=true)
+    - [Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl)
+    - [Novita](https://novita.ai/gpus-console?templateId=311)
+    - [JarvisLabs.ai](https://jarvislabs.ai/templates/axolotl)
+    - [Latitude.sh](https://latitude.sh/blueprint/989e0e79-3bf6-41ea-a46b-1f246e309d5c)

 ### Google Colab {#sec-colab}

@@ -119,7 +156,7 @@ We recommend using WSL2 (Windows Subsystem for Linux) or Docker.

 ### Conda/Pip venv {#sec-conda}

-1. Install Python ≥3.10
+1. Install Python ≥3.11
 2. Install PyTorch: https://pytorch.org/get-started/locally/
 3. Install Axolotl:
   ```{.bash}
--- a/docs/lora_optims.qmd
+++ b/docs/lora_optims.qmd
@@ -84,6 +84,10 @@ lora_qkv_kernel: true
 lora_o_kernel: true
 ```

+::: {.callout-note}
+Currently, LoRA kernels are not supported for RLHF training, only SFT.
+:::
+
 ## Requirements

 - One or more NVIDIA or AMD GPUs (in order to use the Triton kernels)
--- a/docs/mixed_precision.qmd
+++ b/docs/mixed_precision.qmd
@@ -0,0 +1,149 @@
+---
+title: "Mixed Precision Training"
+format:
+  html:
+    toc: true
+    toc-depth: 3
+    number-sections: true
+    code-tools: true
+execute:
+  enabled: false
+---
+
+Mixed precision training uses lower precision data types to reduce memory usage and increase training speed while maintaining model quality. Axolotl supports several mixed precision formats:
+
+- **FP16** - Half precision 16-bit (Pascal generation+)
+- **BF16** - Brain Float 16-bit (Ampere generation+)
+- **FP8** - 8-bit floating point (Hopper generation+)
+
+## FP16 Mixed Precision {#sec-fp16}
+
+### Overview {#sec-fp16-overview}
+
+FP16 is the traditional half-precision format. It is supported on older GPUs but can be less numerically stable than BF16.
+
+### Configuration {#sec-fp16-config}
+
+```{.yaml}
+fp16: true
+```
+
+### FP16 Considerations {#sec-fp16-considerations}
+
+- May require gradient scaling to prevent underflow
+- Less numerically stable than BF16
+- Can cause training instability with some model architectures
+- Consider using BF16 if your hardware supports it
+
+## BF16 Mixed Precision {#sec-bf16}
+
+### Overview {#sec-bf16-overview}
+
+BF16 (Brain Float 16) offers better numerical stability than FP16 and is the recommended mixed precision format for modern GPUs. It provides the same dynamic range as FP32 while using half the memory.
+
+### Configuration {#sec-bf16-config}
+
+```{.yaml}
+# Automatic BF16 detection (recommended)
+bf16: auto
+
+# Or explicitly enable
+bf16: true
+
+# For evaluation with BF16
+bf16: full  # Equivalent to bf16_full_eval in the HF trainer
+```
+
+## FP8 Mixed Precision {#sec-fp8}
+
+::: {.callout-note}
+FP8 support is experimental and requires compatible hardware (H100, H200) and recent PyTorch versions with TorchAO.
+:::
+
+### What is FP8? {#sec-fp8-overview}
+
+FP8 (8-bit floating point) can provide significant time savings compared to FP16/BF16 while maintaining training stability. Axolotl's implementation uses PyTorch's TorchAO library with "tensorwise" scaling strategy.
+
+### Requirements {#sec-fp8-software}
+
+- Hopper+ GPUs (H100/H200)
+- PyTorch 2.7+ (+ compatible TorchAO version)
+- CUDA 12.4+
+
+### Configuration {#sec-fp8-config}
+
+Add to your YAML config:
+
+```{.yaml}
+# Enable FP8 mixed precision
+fp8: true
+
+# Optional: Enable FP8 for FSDP all-gather operations
+fp8_enable_fsdp_float8_all_gather: true
+
+# Enable torch.compile (almost always necessary for FP8 speedups)
+torch_compile: true
+```
+
+::: {.callout-important}
+**torch.compile is critical for FP8 performance**
+
+FP8 training requires `torch_compile: true` to see meaningful speedups. Without compilation, FP8 may actually be slower and use more memory than FP16/BF16.
+:::
+
+### Advanced FP8 Configs {#sec-fp8-advanced}
+
+For [FSDP](multi-gpu.qmd#sec-fsdp) (Fully Sharded Data Parallel) training:
+
+```{.yaml}
+fp8: true
+fp8_enable_fsdp_float8_all_gather: true
+
+torch_compile: true
+
+# FSDP configuration
+fsdp_version: 2
+fsdp_config:
+  offload_params: false
+  cpu_ram_efficient_loading: true
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  state_dict_type: FULL_STATE_DICT
+  reshard_after_forward: true
+```
+
+## Best Practices {#sec-best-practices}
+
+### Choosing Precision Format {#sec-choosing-format}
+
+- **Start with automatic detection**: `bf16: auto`
+- **For Hopper+ (H100/H200)**: Try FP8 + torch.compile for maximum speed
+- **For Ampere (A100/RTX 30/40)**: Use BF16
+- **For older Pascal/Turing GPUs**: Use FP16 with caution
+- **For very old or unsupported GPUs**: Use FP32
+
+### Validation and Testing {#sec-validation}
+
+Always validate your mixed precision setup:
+
+- **Start with a small dataset** to verify stability
+- **Monitor loss curves** for irregularities
+- **Compare with FP32 baseline** when possible
+- **Test evaluation metrics** match expectations
+
+### FP8 Particulars {#sec-fp8-details}
+
+- Use cases
+  - Single GPU training
+  - Multi GPU training with FSDP2 or Deepspeed
+- Speedups
+  - Please refer to the [TorchAO FP8 training benchmarks](https://github.com/pytorch/ao/tree/main/torchao/float8#rowwise-scaling) for expected matmul speedups for different (M, K, N) settings
+  - Concrete numbers for LLaMA 3 8B training can be found [here](https://github.com/pytorch/ao/tree/main/torchao/float8#training-benchmarks)
+- Known issues:
+  - FP8 + DDP + `torch.compile` (causes [error](https://gist.github.com/djsaunde/0c1664c32e44a64d31b5e01b4aafe5c4))
+  - FP8 + FSDP2 + `torch.compile` + FSDP2 activation checkpointing tends to be _slower_ than the BF16 equivalent training
+  - Flash Attention 2 does not play nicely with `torch.compile`
+
+See `examples/llama-3/3b-fp8-fsdp2.yaml` for an optimized example config. Enabling FP8 mixed precision + FP8 all-gather training results in ~10% faster iterations per second vs. BF16 for a relatively small (3B param) model.
+
+For more information on multi-GPU training, see our [Multi-GPU guide](multi-gpu.qmd).
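The claim in `docs/mixed_precision.qmd` above that BF16 keeps the dynamic range of FP32 while FP16 does not can be checked with the standard library alone. The sketch below is an emulation under stated assumptions: FP16 is round-tripped through `struct`'s half-precision format, and BF16 is approximated by truncating a float32 to its top 16 bits (real hardware typically rounds rather than truncates). The helper names are illustrative and unrelated to Axolotl or PyTorch.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x is representable as a finite FP16 value."""
    try:
        return math.isfinite(to_fp16(x))
    except (OverflowError, struct.error):
        return False


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    This truncates the mantissa to 7 bits and keeps all 8 exponent bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Range: 70000 exceeds the FP16 maximum (65504) but is fine in BF16.
print(fits_fp16(70000.0))  # False
print(to_bf16(70000.0))    # 69632.0

# Precision: near 1.0, FP16 resolves smaller steps than BF16.
print(to_fp16(1.001))      # 1.0009765625
print(to_bf16(1.001))      # 1.0
```

The two halves of the output show the trade-off: BF16 avoids overflow at large magnitudes, while FP16 keeps more mantissa bits and therefore finer resolution within its narrower range.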
--- a/docs/multi-gpu.qmd
+++ b/docs/multi-gpu.qmd
@@ -23,8 +23,6 @@ Axolotl supports several methods for multi-GPU training:

 ## DeepSpeed {#sec-deepspeed}

-DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It provides various optimization levels through ZeRO stages.
-
 ### Configuration {#sec-deepspeed-config}

 Add to your YAML config:
@@ -32,7 +30,6 @@ Add to your YAML config:
 ```{.yaml}
 deepspeed: deepspeed_configs/zero1.json
 ```
-
 ### Usage {#sec-deepspeed-usage}

 ```{.bash}
@@ -66,9 +63,75 @@ Start from Stage 1 -> Stage 2 -> Stage 3.

 :::

-## FSDP {#sec-fsdp}
+::: {.callout-tip}

-### Basic FSDP Configuration {#sec-fsdp-config}
+Using ZeRO Stage 3 with Single-GPU training
+
+ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
+`WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500`
+
+:::
+
+## Fully Sharded Data Parallel (FSDP) {#sec-fsdp}
+
+::: {.callout-note}
+
+FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.
+
+:::
+
+### Migrating from FSDP1 to FSDP2 {#sec-migrate-fsdp1-fsdp2}
+
+To migrate your config from FSDP1 to FSDP2, you must use the `fsdp_version` top-level config field to specify the FSDP version, and
+also follow the config field mapping below to update field names.
+
+#### Config mapping
+
+FSDP1 | FSDP2
+-------- | --------
+fsdp_sharding_strategy | reshard_after_forward
+fsdp_backward_prefetch_policy | **REMOVED**
+fsdp_backward_prefetch | **REMOVED**
+fsdp_forward_prefetch | **REMOVED**
+fsdp_sync_module_states | **REMOVED**
+fsdp_cpu_ram_efficient_loading | cpu_ram_efficient_loading
+fsdp_state_dict_type | state_dict_type
+fsdp_use_orig_params | **REMOVED**
+
+For more details, please see the migration guide in the [torchtitan repo](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md). In Axolotl,
+if you were using the following FSDP1 config:
+
+```{.yaml}
+fsdp_version: 1
+fsdp_config:
+  fsdp_offload_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+```
+
+You can migrate to the following FSDP2 config:
+
+```{.yaml}
+fsdp_version: 2
+fsdp_config:
+  offload_params: false
+  cpu_ram_efficient_loading: true
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
+  state_dict_type: FULL_STATE_DICT
+  reshard_after_forward: true
+```
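The mapping table and the two configs above can be sketched as a small conversion helper. This is only an illustration, not part of Axolotl: `migrate_fsdp_config` and `REMOVED_KEYS` are hypothetical names, stripping the `fsdp_` prefix from the remaining keys is generalized from the example configs, and translating `FULL_SHARD` to `reshard_after_forward: true` (any other strategy to `false`) is an assumption.

```python
# Hypothetical helper, not an Axolotl API: rewrites FSDP1-style keys to FSDP2 names.
REMOVED_KEYS = {
    "fsdp_backward_prefetch_policy",
    "fsdp_backward_prefetch",
    "fsdp_forward_prefetch",
    "fsdp_sync_module_states",
    "fsdp_use_orig_params",
}


def migrate_fsdp_config(fsdp1: dict) -> dict:
    """Return an FSDP2-style copy of an FSDP1 `fsdp_config` mapping."""
    fsdp2 = {}
    for key, value in fsdp1.items():
        if key in REMOVED_KEYS:
            continue  # listed as REMOVED in the mapping table
        if key == "fsdp_sharding_strategy":
            # Assumption: only FULL_SHARD corresponds to resharding after forward.
            fsdp2["reshard_after_forward"] = value == "FULL_SHARD"
            continue
        # Remaining fields keep their meaning and drop the `fsdp_` prefix.
        new_key = key[len("fsdp_"):] if key.startswith("fsdp_") else key
        fsdp2[new_key] = value
    return fsdp2


old_config = {
    "fsdp_offload_params": False,
    "fsdp_cpu_ram_efficient_loading": True,
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_transformer_layer_cls_to_wrap": "Qwen3DecoderLayer",
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_sharding_strategy": "FULL_SHARD",
}
new_config = migrate_fsdp_config(old_config)
```

Remember to also set the top-level `fsdp_version: 2`, which lives outside `fsdp_config` and is therefore not handled by this sketch.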
+
+### FSDP1 (deprecated) {#sec-fsdp-config}
+
+::: {.callout-note}
+
+Using `fsdp` to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use `fsdp_config` as above instead.
+
+:::

 ```{.yaml}
 fsdp:
@@ -80,6 +143,7 @@ fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
 ```

+
 ## Sequence parallelism {#sec-sequence-parallelism}

 We support sequence parallelism (SP) via the
--- a/docs/multi-node.qmd
+++ b/docs/multi-node.qmd
@@ -40,13 +40,13 @@ use_cpu: false

 Configure your model to use FSDP in the Axolotl yaml. For example:
 ```yaml
-fsdp:
-  - full_shard
-  - auto_wrap
+fsdp_version: 2
 fsdp_config:
-  fsdp_offload_params: true
-  fsdp_state_dict_type: FULL_STATE_DICT
-  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  offload_params: true
+  state_dict_type: FULL_STATE_DICT
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  reshard_after_forward: true
 ```

 All you have to do now is launch using accelerate on each machine as you usually would. The processes will start once you have launched accelerate on every machine.
@@ -69,11 +69,19 @@ export NCCL_BUFFSIZE=2097152

 Run the following on each node:

+### Option 1: New Axolotl CLI with launcher args (Recommended)
+
+```bash
+axolotl train config.yaml --launcher torchrun -- --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port"
+```
+
+### Option 2: Direct torchrun (Legacy)
+
 ```bash
 torchrun --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port" -m axolotl.cli.train config.yaml
 ```

-Please make sure to substitute the placeholder variables.
+Please make sure to substitute the placeholder variables:

 - `num_nodes`: Number of nodes (containing GPUs)
 - `gpu_per_node`: Number of GPUs per node
@@ -81,8 +89,6 @@ Please make sure to substitute the placeholder variables.
 - `head_node_port`: Port of the head node (make sure other machines can connect to this. Default 29400)
 - `rdzv_id`: A unique job ID that is used by the job across nodes.

-::: {.callout-note}
-You need to call `axolotl.cli.train` instead of `axolotl train` as the latter calls accelerate under the hood
-:::
+The new CLI approach (Option 1) is recommended as it provides consistent argument handling and works seamlessly with other Axolotl CLI features.

 More info on the available configs can be found in the [PyTorch docs](https://pytorch.org/docs/stable/elastic/run.html).
--- a/docs/multimodal.qmd
+++ b/docs/multimodal.qmd
@@ -14,6 +14,7 @@ format:
 - [Llava-1.5](#sec-llava-15)
 - [Mistral-Small-3.1](#sec-mistral-small-31)
 - [Gemma-3](#sec-gemma-3)
+- [Gemma-3n](#sec-gemma-3n)
 - [Qwen2-VL](#sec-qwen2-vl)
 - [Qwen2.5-VL](#sec-qwen25-vl)

@@ -43,7 +44,7 @@ datasets:
 # leave the vision model and vision tower frozen
 # load_in_8bit: true
 adapter: lora
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

 # (optional) if you want to resize images to a set size
 image_size: 512
@@ -110,6 +111,22 @@ base_model: google/gemma-3-4b-it
 chat_template: gemma3
 ```

+### Gemma-3n {#sec-gemma-3n}
+
+::: {.callout-warning}
+The model's initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.
+:::
+
+::: {.callout-tip}
+Please make sure to install `timm` via `pip3 install timm==1.0.17`
+:::
+
+```yaml
+base_model: google/gemma-3n-E2B-it
+
+chat_template: gemma3n
+```
+
 ### Qwen2-VL {#sec-qwen2-vl}

 ```yaml
@@ -132,7 +149,9 @@ For multi-modal datasets, we adopt an extended `chat_template` format similar to

 - A message is a list of `role` and `content`.
 - `role` can be `system`, `user`, `assistant`, etc.
- `content` is a list of `type` and (`text` or `image` or `path` or `url` or `base64`).
+- `content` is a list of `type` and (`text`, `image`, `path`, `url`, `base64`, or `audio`).
+
+### Image

 ::: {.callout-note}
 For backwards compatibility:
@@ -141,15 +160,29 @@ For backwards compatibility:
 - If `content` is a string, it will be converted to a list with `type` as `text`.
 :::

-::: {.callout-tip}
 For image loading, you can use the following keys within `content` alongside `"type": "image"`:

 - `"path": "/path/to/image.jpg"`
 - `"url": "https://example.com/image.jpg"`
 - `"base64": "..."`
 - `"image": PIL.Image`
+
+### Audio
+
+For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:
+
+- `"path": "/path/to/audio.mp3"`
+- `"url": "https://example.com/audio.mp3"`
+- `"audio": np.ndarray`
+
+::: {.callout-tip}
+
+You may need to install `librosa` via `pip3 install librosa==0.11.0`.
+
 :::

+### Example
+
 Here is an example of a multi-modal dataset:
 ```json
 [
@@ -178,3 +211,9 @@ Here is an example of a multi-modal dataset:
  }
 ]
 ```
+
+## FAQ
+
+1. `PIL.UnidentifiedImageError: cannot identify image file ...`
+
+`PIL` could not retrieve the file at `url` using `requests`. Please check the URL for typos. Another possible cause is that the server blocked the request.
--- a/docs/nd_parallelism.qmd
+++ b/docs/nd_parallelism.qmd
@@ -0,0 +1,108 @@
+---
+title: "N-D Parallelism (Beta)"
+---
+
+Axolotl enables training models at scale by composing different parallelism techniques. This is essential when:
+
+- A model's weights are too large to fit on a single GPU's memory.
+- A model's activations, especially with very long contexts, are too large for a single GPU.
+- You want to accelerate training by using multiple GPUs or nodes.
+
+or combinations of the above!
+
+## Core Concepts
+
+Parallelism strategies can be combined. The key is understanding how each one divides the workload. PyTorch's `DeviceMesh` is the modern way to manage these combinations, creating a logical grid of your GPUs and assigning different parallel strategies to different dimensions of the grid.
+
+### Data Parallelism {#sec-dp}
+
+Data Parallelism focuses on splitting the global data batch across GPUs.
+
+- Distributed Data Parallel (DDP): The classic approach. The full model is replicated on every GPU. Each GPU processes a different slice of the data batch. Gradients are then averaged across all GPUs after the backward pass to keep the models synchronized. This can substantially improve data throughput compared to single-device training, but requires that each GPU is able to hold the entire model, its gradients, and optimizer states.
+
+- [Fully Sharded Data Parallel (FSDP)](multi-gpu.qmd#fully-sharded-data-parallel-(fsdp)): A highly memory-efficient form of data parallelism (inspired by DeepSpeed's ZeRO). Instead of replicating the model, FSDP shards the model's *parameters, gradients, and optimizer states* across the GPUs in the data-parallel group. During computation, each GPU receives the specific parameters it needs via an `all_gather` operation just before they are used, and they can be discarded immediately after (`reshard_after_forward`).
+    - FSDP maps to ZeRO stages:
+        - ZeRO-2 (`reshard_after_forward=False`): Shards gradients and optimizer states. Model weights are replicated on each GPU.
+        - ZeRO-3 (`reshard_after_forward=True`): Shards gradients, optimizer states, AND model parameters. This provides the most memory savings at the cost of more communication (re-gathering parameters for both forward and backward passes).
+
+### [Experimental] Tensor Parallelism (TP) {#sec-tp}
+
+Also known as "horizontal model parallelism," as described in the [Megatron-LM paper](https://arxiv.org/pdf/1909.08053.pdf). Instead of splitting the batch, TP splits the model's layers themselves across GPUs.
+
+- How it works: For a linear layer `Y = XA`, the weight matrix `A` is split column-wise (`A = [A_1, A_2]`). The computation becomes `Y_1 = XA_1` and `Y_2 = XA_2`, which can happen in parallel on different GPUs. The final output `Y` is simply the concatenation of `Y_1` and `Y_2`. Check [this comment](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530) for more detailed info.
+- Requirement: TP involves frequent, small communications within a forward/backward pass. It requires a very fast interconnect between GPUs (e.g., NVLink) and is typically not recommended across different nodes.
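The column-wise split described above can be checked numerically. The following is a minimal pure-Python sketch on a single machine, with no GPUs or communication involved; `matmul` and `split_columns` are illustrative helpers, not Axolotl or PyTorch functions.

```python
def matmul(x, a):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x_ik * a_kj for x_ik, a_kj in zip(row, col)) for col in zip(*a)]
        for row in x
    ]


def split_columns(a, parts):
    """Split matrix `a` column-wise into `parts` equally wide blocks."""
    width = len(a[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in a] for p in range(parts)]


X = [[1, 2], [3, 4]]
A = [[1, 0, 2, 1], [0, 1, 1, 3]]

A_1, A_2 = split_columns(A, 2)
Y_1 = matmul(X, A_1)  # would be computed on the first GPU
Y_2 = matmul(X, A_2)  # would be computed on the second GPU

# Concatenating the partial outputs column-wise recovers the full product Y = XA.
Y = [row_1 + row_2 for row_1, row_2 in zip(Y_1, Y_2)]
```

Each partial product needs the full input `X` but only its own slice of `A`, which is why the weight memory is divided across the TP group.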
+
+### Context Parallelism (CP) {#sec-cp}
+
+Context Parallelism, also called [Sequence Parallelism](sequence_parallelism.qmd), addresses the memory bottleneck from long sequences. The input sequence itself is split along the sequence length dimension and distributed across GPUs.
+
+- How it works: If you have a sequence of 8192 tokens and a `context_parallel_size` of 4, each GPU will only handle a chunk of 2048 tokens.
+- The Challenge: Attention is not local; every token needs to "attend to" every other token. Splitting the sequence breaks this.
+- The Solution (`ring-flash-attention`): An efficient communication protocol is used. To compute attention for its local sequence chunk, each GPU passes its Key-Value (KV) cache to its neighbor in a "ring." After `N-1` steps, every GPU has seen the KV-cache from all other GPUs, allowing it to compute the correct attention values for its chunk. This is implemented using the highly optimized `flash-attention` kernel at each step.
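The chunking and the ring exchange described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: `split_sequence` and `ring_schedule` are hypothetical helpers, the split is contiguous (a real implementation may order tokens differently for load balancing), and no attention is actually computed.

```python
def split_sequence(tokens, context_parallel_size):
    """Split a token sequence into one contiguous chunk per GPU."""
    if len(tokens) % context_parallel_size != 0:
        raise ValueError("sequence length must be divisible by context_parallel_size")
    chunk = len(tokens) // context_parallel_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(context_parallel_size)]


def ring_schedule(world_size):
    """For each rank, list whose KV block it holds at each step of the ring.

    Step 0 is the rank's own block; each later step receives the block that
    the previous neighbor held one step earlier.
    """
    return [
        [(rank - step) % world_size for step in range(world_size)]
        for rank in range(world_size)
    ]


tokens = list(range(8192))
chunks = split_sequence(tokens, 4)
chunk_lengths = [len(c) for c in chunks]  # 2048 tokens per GPU

schedule = ring_schedule(4)
# After N-1 passes every rank has seen the KV block of every rank exactly once.
all_seen = all(sorted(blocks) == list(range(4)) for blocks in schedule)
```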
+
+### Hybrid Sharded Data Parallel (HSDP) {#sec-hsdp}
+
+HSDP is a 2D strategy that intelligently combines FSDP and DDP, typically for multi-node training.
+
+- Intra-Node (within a machine): Use FSDP. This is efficient because GPUs on the same node have fast interconnects (NVLink), making the `all_gather` operations for sharded parameters fast.
+- Inter-Node (across machines): Use DDP. The gradient synchronization between nodes is less frequent than FSDP's parameter gathering, making it a better fit for the slower node-to-node network (e.g., Ethernet/Infiniband).
+- Example: With 2 nodes of 8 GPUs each (16 total), you could have `dp_shard_size=8` (FSDP within each node) and `dp_replicate_size=2` (DDP across the two nodes).
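To make the 2 x 8 example concrete, the sketch below maps a global rank to its position in a 2D grid. It assumes that ranks are numbered node by node and that the replicate dimension is the outer one; `hsdp_coordinates` is a hypothetical helper, not an Axolotl or PyTorch function.

```python
def hsdp_coordinates(rank, dp_replicate_size, dp_shard_size):
    """Return (replica_index, shard_index) for a global rank.

    Assumes a row-major grid with the replicate dimension outermost.
    """
    world_size = dp_replicate_size * dp_shard_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    return divmod(rank, dp_shard_size)


# 2 nodes x 8 GPUs: dp_replicate_size=2, dp_shard_size=8
coords = {rank: hsdp_coordinates(rank, 2, 8) for rank in range(16)}

# Ranks that shard one model copy together (FSDP group on node 0).
shard_group_node0 = [r for r, (replica, _) in coords.items() if replica == 0]

# Ranks that hold the same shard on different nodes (DDP group for shard 3).
replicate_group_shard3 = [r for r, (_, shard) in coords.items() if shard == 3]
```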
+
+## Usage
+
+```yaml
+# FSDP config. See https://docs.axolotl.ai/docs/multi-gpu.html#sec-fsdp
+fsdp_version: 2
+fsdp_config:
+  # ...
+
+# The number of GPUs to shard the model parameters across (FSDP dimension).
+dp_shard_size: 4
+
+# The number of times to replicate the sharded model (DDP dimension).
+dp_replicate_size: 2
+
+# Number of GPUs for Tensor Parallelism.
+tensor_parallel_size: 1  # (default is 1, no TP)
+
+# Number of GPUs for Context/Sequence Parallelism.
+context_parallel_size: 1 # (default is 1, no CP)
+```
+
+Note: We recommend FSDP. DeepSpeed is only compatible with `tensor_parallel_size`.
+
+## Examples
+
+::: {.callout-tip}
+See our example configs [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/distributed-parallel).
+:::
+
+1.  HSDP on 2 nodes with 4 GPUs each (8 GPUs total):
+    - You want FSDP within each node and DDP across nodes.
+    - Set `dp_shard_size: 4` and `dp_replicate_size: 2`.
+
+2.  FSDP + TP on a single 8-GPU node:
+    - You want to split the model across 4 GPUs using FSDP, and further split each layer across 2 GPUs with TP.
+    - Set `dp_shard_size: 4` and `tensor_parallel_size: 2`.
+
+3.  FSDP + CP on a single 8-GPU node for long context:
+    - You want to shard the model across all 8 GPUs and also split the sequence length across all 8 GPUs.
+    - Set `dp_shard_size: 8` and `context_parallel_size: 8`. Note: this means the data parallel group and context parallel group are the same. A more common setup might be to shard across a smaller group.
+
+## Support Matrix
+
+This matrix describes how different parallelism methods can be combined in Axolotl.
+
+| Combination | `dp_replicate_size` | `dp_shard_size` | `tp_size` | `cp_size` | Status & Notes |
+| --- | :---: | :---: |:---:|:---:|---|
+| **FSDP** (ZeRO-3) | 1 | >1 | 1 | 1 | ✅ Fully supported. Shards model across all GPUs. |
+| **HSDP** | >1 | >1 | 1 | 1 | ✅ Fully supported. FSDP intra-node, DDP inter-node. |
+| **FSDP + TP** | 1 | >1 | >1 | 1 | ✅ **2D Parallelism**. Shards the model across a `dp_shard` group, and TP-splits layers within the `tp` group. |
+| **HSDP + TP** | >1 | >1 | >1 | 1 | ✅ **3D Parallelism**. A powerful but complex combination. |
+| **FSDP + CP** | 1 | >1 | 1 | >1 | ✅ **2D Parallelism**. Combines FSDP with context parallelism. |
+| **FSDP + TP + CP**| 1 | >1 | >1| >1| ✅ **3D Parallelism**. Another advanced combination. |
+| DDP + TP/CP | >1 | 1 | >1 | >1 | ❌ **Not Supported**. The `ParallelismConfig` explicitly prevents this, as composing pure DDP with TP or CP is currently not supported. You should use FSDP + TP/CP instead (`dp_shard_size > 1`). |
+| Just TP / CP | 1 | 1 | >1 | >1 | ✅ Supported. Useful for inference or when the model fits on one GPU but context is too long. |
+
+- `tp_size` refers to `tensor_parallel_size`
+- `cp_size` refers to `context_parallel_size`
--- a/docs/optimizers.qmd
+++ b/docs/optimizers.qmd
@@ -0,0 +1,129 @@
+---
+title: Optimizers
+description: Configuring optimizers
+---
+
+## Overview
+
+Axolotl supports all optimizers supported by [transformers OptimizerNames](https://github.com/huggingface/transformers/blob/51f94ea06d19a6308c61bbb4dc97c40aabd12bad/src/transformers/training_args.py#L142-L187).
+
+Here is a list of optimizers supported by transformers as of `v4.54.0`:
+
+- `adamw_torch`
+- `adamw_torch_fused`
+- `adamw_torch_xla`
+- `adamw_torch_npu_fused`
+- `adamw_apex_fused`
+- `adafactor`
+- `adamw_anyprecision`
+- `adamw_torch_4bit`
+- `adamw_torch_8bit`
+- `ademamix`
+- `sgd`
+- `adagrad`
+- `adamw_bnb_8bit`
+- `adamw_8bit`  # alias for adamw_bnb_8bit
+- `ademamix_8bit`
+- `lion_8bit`
+- `lion_32bit`
+- `paged_adamw_32bit`
+- `paged_adamw_8bit`
+- `paged_ademamix_32bit`
+- `paged_ademamix_8bit`
+- `paged_lion_32bit`
+- `paged_lion_8bit`
+- `rmsprop`
+- `rmsprop_bnb`
+- `rmsprop_bnb_8bit`
+- `rmsprop_bnb_32bit`
+- `galore_adamw`
+- `galore_adamw_8bit`
+- `galore_adafactor`
+- `galore_adamw_layerwise`
+- `galore_adamw_8bit_layerwise`
+- `galore_adafactor_layerwise`
+- `lomo`
+- `adalomo`
+- `grokadamw`
+- `schedule_free_radam`
+- `schedule_free_adamw`
+- `schedule_free_sgd`
+- `apollo_adamw`
+- `apollo_adamw_layerwise`
+- `stable_adamw`
+
+
+## Custom Optimizers
+
+Enable custom optimizers by passing a string to the `optimizer` argument. Each optimizer receives beta and epsilon args; some accept additional args, which are detailed below.
+
+### optimi_adamw
+
+```yaml
+optimizer: optimi_adamw
+```
+
+### ao_adamw_4bit
+
+Deprecated: Please use `adamw_torch_4bit`.
+
+### ao_adamw_8bit
+
+Deprecated: Please use `adamw_torch_8bit`.
+
+### ao_adamw_fp8
+
+
+```yaml
+optimizer: ao_adamw_fp8
+```
+
+### adopt_adamw
+
+GitHub: [https://github.com/iShohei220/adopt](https://github.com/iShohei220/adopt)
+Paper: [https://arxiv.org/abs/2411.02853](https://arxiv.org/abs/2411.02853)
+
+```yaml
+optimizer: adopt_adamw
+```
+
+### came_pytorch
+
+GitHub: [https://github.com/yangluo7/CAME/tree/master](https://github.com/yangluo7/CAME/tree/master)
+Paper: [https://arxiv.org/abs/2307.02047](https://arxiv.org/abs/2307.02047)
+
+```yaml
+optimizer: came_pytorch
+
+# optional args (defaults below)
+adam_beta1: 0.9
+adam_beta2: 0.999
+adam_beta3: 0.9999
+adam_epsilon: 1e-30
+adam_epsilon2: 1e-16
+```
+
+### muon
+
+Blog: [https://kellerjordan.github.io/posts/muon/](https://kellerjordan.github.io/posts/muon/)
+Paper: [https://arxiv.org/abs/2502.16982v1](https://arxiv.org/abs/2502.16982v1)
+
+```yaml
+optimizer: muon
+```
+
+### dion
+
+Microsoft's Dion (DIstributed OrthoNormalization) optimizer is a scalable and communication-efficient
+orthonormalizing optimizer that uses low-rank approximations to reduce gradient communication.
+
+GitHub: [https://github.com/microsoft/dion](https://github.com/microsoft/dion)
+Paper: [https://arxiv.org/pdf/2504.05295](https://arxiv.org/pdf/2504.05295)
+Note: Implementation written for PyTorch 2.7+ for DTensor
+
+```yaml
+optimizer: dion
+dion_lr: 0.01
+dion_momentum: 0.95
+lr: 0.00001  # learning rate for embeddings and parameters that fall back to AdamW
+```
--- a/docs/qat.qmd
+++ b/docs/qat.qmd
@@ -0,0 +1,32 @@
+---
+title: "Quantization Aware Training (QAT)"
+back-to-top-navigation: true
+toc: true
+toc-expand: 2
+toc-depth: 4
+---
+
+## Overview
+
+[Quantization Aware Training](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#quantization-aware-training) (QAT) is a technique for improving the accuracy of quantized models
+by applying "fake" quantization to the model's weights (and optionally, activations) during training. Fake
+quantization lets the model adjust to the noise that quantization introduces, so that accuracy loss is minimized when the model is eventually
+quantized. We use the quantization techniques implemented in [torchao](https://github.com/pytorch/ao) to provide
+support for QAT and post-training quantization (PTQ) in axolotl.
+
+We recommend reviewing the excellent QAT tutorial in the [torchtune library](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#quantizing-the-qat-model),
+and the QAT documentation in the [torchao library](https://github.com/pytorch/ao/tree/main/torchao/quantization/qat), for more details.
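As a rough illustration of what "fake" quantization means, the sketch below quantizes and immediately dequantizes a list of values using one scale per group. It is a simplified symmetric scheme written for this explanation, not torchao's implementation; `fake_quantize` is a hypothetical name, and real QAT additionally needs a way to pass gradients through the rounding step.

```python
def fake_quantize(values, group_size=32, bits=8):
    """Quantize then dequantize `values`, using one symmetric scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # An all-zero group has nothing to scale; any non-zero scale works.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(-qmax - 1, min(qmax, round(v / scale)))  # integer grid point
            out.append(q * scale)  # back to float, now carrying rounding noise
    return out


weights = [0.5, -1.27, 0.013, 1.0]
noisy = fake_quantize(weights, group_size=4)
max_error = max(abs(w - n) for w, n in zip(weights, noisy))
```

The values stay in floating point throughout, but they only take values that the integer format could represent, which is the noise the model learns to tolerate during training.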
+
+## Configuring QAT in Axolotl
+
+To enable QAT in axolotl, add the following to your configuration file:
+
+```yaml
+qat:
+  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
+  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"
+  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
+  fake_quant_after_n_steps: # Optional[int] = None. The number of steps to apply fake quantization after
+```
+
+Once you have finished training, you must quantize your model using the same quantization configuration that you used to train it. You can use the [`quantize`](./quantize.qmd) command to do this.
--- a/docs/quantize.qmd
+++ b/docs/quantize.qmd
@@ -0,0 +1,53 @@
+---
+title: "Quantization with torchao"
+back-to-top-navigation: true
+toc: true
+toc-expand: 2
+toc-depth: 4
+---
+
+Quantization is a technique to lower the memory footprint of your model, potentially at the cost of accuracy or model performance. We support quantizing your model using the [torchao](https://github.com/pytorch/ao) library. Quantization is supported for both post-training quantization (PTQ) and quantization-aware training (QAT).
+
+
+::: {.callout-note}
+
+We do not currently support quantization techniques such as GGUF, GPTQ, or EXL2.
+
+:::
+
+## Configuring Quantization in Axolotl
+
+Quantization is configured using the `quantization` key in your configuration file.
+
+```yaml
+base_model: # The path to the model to quantize.
+quantization:
+  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8
+  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
+  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
+  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
+
+output_dir:  # The path to the output directory.
+```
+
+Once quantization is complete, your quantized model will be saved in the `{output_dir}/quantized` directory.
+
+You may also use the `quantize` command to quantize a model that has been trained with [QAT](./qat.qmd) by reusing the QAT configuration file that
+you used to train the model:
+
+```yaml
+# qat.yml
+qat:
+  activation_dtype: int8
+  weight_dtype: int8
+  group_size: 256
+  quantize_embedding: true
+
+output_dir: # The path to the output directory used during training where the final checkpoint has been saved.
+```
+
+```bash
+axolotl quantize qat.yml
+```
+
+This ensures that the model is quantized with the same configuration that was used to train it.
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -16,7 +16,7 @@ feedback. Various methods include, but not limited to:
 - [Identity Preference Optimization (IPO)](#ipo)
 - [Kahneman-Tversky Optimization (KTO)](#kto)
 - [Odds Ratio Preference Optimization (ORPO)](#orpo)
- Proximal Policy Optimization (PPO) (not yet supported in axolotl)
+- [Group Relative Policy Optimization (GRPO)](#grpo)


 ## RLHF using Axolotl
@@ -274,15 +274,14 @@ rl: dpo
 datasets:
  - path: ...
    split: train
-    type: user_defined.default
-
-    field_prompt: "prompt"
-    field_system: "system"
-    field_chosen: "chosen"
-    field_rejected: "rejected"
-    prompt_format: "{prompt}"
-    chosen_format: "{chosen}"
-    rejected_format: "{rejected}"
+    type:
+      field_prompt: "prompt"
+      field_system: "system"
+      field_chosen: "chosen"
+      field_rejected: "rejected"
+      prompt_format: "{prompt}"
+      chosen_format: "{chosen}"
+      rejected_format: "{rejected}"
 ```

 The input format is a simple JSON input with customizable fields based on the above config.
@@ -475,14 +474,13 @@ rl: kto
 datasets:
  - path: ...
    split: train
-    type: user_defined.default
-
-    field_prompt: "prompt"
-    field_system: "system"
-    field_completion: "completion"
-    field_label: "label"
-    prompt_format: "{prompt}"
-    completion_format: "{completion}"
+    type:
+      field_prompt: "prompt"
+      field_system: "system"
+      field_completion: "completion"
+      field_label: "label"
+      prompt_format: "{prompt}"
+      completion_format: "{completion}"
 ```

 The input format is a simple JSON input with customizable fields based on the above config.
@@ -499,7 +497,7 @@ The input format is a simple JSON input with customizable fields based on the ab
 ### GRPO

 ::: {.callout-tip}
-Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
+Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/grpo_code).
 :::

 In the latest GRPO implementation, `vLLM` is used to significantly speed up trajectory generation during training. In this example, we're using 4 GPUs - 2 for training, and 2 for vLLM:
@@ -582,7 +580,20 @@ datasets:

 To see other examples of custom reward functions, please see [TRL GRPO Docs](https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md#using-a-custom-reward-function).

-To see description of the configs, please see [TRLConfig](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/config/models/input/v0_4_1/trl.py).
+To see all configs, please see [TRLConfig](https://github.com/axolotl-ai-cloud/axolotl/blob/v0.9.2/src/axolotl/utils/schemas/trl.py).
+
+#### GRPO with DAPO/Dr. GRPO loss
+
+The DAPO paper, and subsequently the Dr. GRPO paper, proposed alternative loss functions for GRPO to remediate the penalty on longer responses.
+
+```yaml
+trl:
+  loss_type: dr_grpo
+  # Normalizes loss based on max completion length (default: 256)
+  max_completion_length:
+```
+
+For more information, see [GRPO docs](https://huggingface.co/docs/trl/v0.17.0/en/grpo_trainer#loss-types).
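The difference in normalization can be sketched in plain Python. This reflects our reading of the loss types described in the TRL docs linked above and is an assumption rather than TRL's code: `grpo_loss` and `dr_grpo_loss` are hypothetical helpers that take per-token losses for a batch of non-empty completions.

```python
def grpo_loss(per_token_losses):
    """Average each completion over its own length, then average over completions."""
    per_sequence = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_sequence) / len(per_sequence)


def dr_grpo_loss(per_token_losses, max_completion_length):
    """Sum all token losses and divide by a constant: batch size x max length."""
    total = sum(sum(seq) for seq in per_token_losses)
    return total / (len(per_token_losses) * max_completion_length)


# One short and one long completion, every token with the same loss.
losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

# Per-token weight under each scheme.
grpo_weights = [1 / (len(seq) * len(losses)) for seq in losses]  # depends on length
dr_grpo_weight = 1 / (len(losses) * 4)                           # same for every token

standard = grpo_loss(losses)
length_normalized = dr_grpo_loss(losses, max_completion_length=4)
```

With per-sequence averaging, a token in the long completion carries half the weight of a token in the short one; with the constant normalizer, every token carries the same weight regardless of completion length.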

 ### SimPO

--- a/docs/scripts/generate_config_docs.py
+++ b/docs/scripts/generate_config_docs.py
@@ -0,0 +1,752 @@
+# type: ignore
+
+"""
+Quarto documentation generation from Pydantic models. Uses Pydantic model source code
+to automatically group fields, including inherited fields from parent classes.
+"""
+
+import ast
+import inspect
+import textwrap
+import types
+import typing
+from typing import Any, FrozenSet, Type, Union
+
+from pydantic import BaseModel
+
+from axolotl.utils.schemas.config import AxolotlInputConfig
+
+
+class QuartoGenerator:
+    """Generate Quarto documentation from Pydantic models."""
+
+    def __init__(self):
+        self._class_fields_cache = {}
+        self._inheritance_map_cache = {}
+        self._nested_models_cache = {}
+
+    def _get_direct_fields(self, cls: Type[BaseModel]) -> FrozenSet[str]:
+        """Get fields defined directly in a single class (not inherited)."""
+        if cls in self._class_fields_cache:
+            return self._class_fields_cache[cls]
+
+        fields = set()
+
+        # Get annotated fields
+        if hasattr(cls, "__annotations__"):
+            fields.update(cls.__annotations__.keys())
+
+        # Filter out private/special methods
+        fields = {f for f in fields if not f.startswith("_")}
+
+        result = frozenset(fields)
+        self._class_fields_cache[cls] = result
+        return result
+
+    def _is_pydantic_model(self, type_obj) -> bool:
+        """Check if a type is a Pydantic BaseModel."""
+        return inspect.isclass(type_obj) and issubclass(type_obj, BaseModel)
+
+    # pylint: disable=too-many-return-statements
+    def _extract_nested_type(self, field_type) -> Any:
+        """Extract the actual type from complex type annotations."""
+        # Handle Annotated types (Python 3.9+)
+        if hasattr(typing, "get_origin") and hasattr(typing, "get_args"):
+            origin = typing.get_origin(field_type)
+            args = typing.get_args(field_type)
+
+            if origin is not None:
+                # Handle Annotated[SomeType, ...] - extract the first argument
+                if hasattr(typing, "Annotated") and origin is typing.Annotated:
+                    if args:
+                        return self._extract_nested_type(
+                            args[0]
+                        )  # Recursively process the actual type
+
+                # Handle list[SomeType], List[SomeType], etc.
+                elif origin in (list, typing.List):
+                    if args:
+                        return self._extract_nested_type(
+                            args[0]
+                        )  # Extract element type
+
+                # Handle Union types (including | syntax)
+                elif origin is typing.Union:
+                    # Get non-None types from the Union
+                    non_none_types = [arg for arg in args if arg is not type(None)]
+                    if len(non_none_types) >= 1:
+                        # Prioritize Pydantic models over primitive types
+                        pydantic_models = [
+                            arg
+                            for arg in non_none_types
+                            if self._is_pydantic_model(arg)
+                        ]
+                        if pydantic_models:
+                            # Return the first Pydantic model found
+                            return self._extract_nested_type(pydantic_models[0])
+
+                        # No Pydantic models, return the first non-None type
+                        return self._extract_nested_type(non_none_types[0])
+
+        # Handle new Python 3.10+ union syntax (PeftConfig | None)
+        if hasattr(field_type, "__class__") and field_type.__class__ is types.UnionType:
+            # Get non-None types from the Union
+            non_none_types = [
+                arg for arg in field_type.__args__ if arg is not type(None)
+            ]
+            if len(non_none_types) >= 1:
+                # Prioritize Pydantic models over primitive types
+                pydantic_models = [
+                    arg for arg in non_none_types if self._is_pydantic_model(arg)
+                ]
+                if pydantic_models:
+                    return self._extract_nested_type(pydantic_models[0])
+                return self._extract_nested_type(non_none_types[0])
+
+        # Handle old typing.Union syntax (fallback)
+        if hasattr(field_type, "__origin__"):
+            if field_type.__origin__ is Union:
+                # Get non-None types from the Union
+                non_none_types = [
+                    arg for arg in field_type.__args__ if arg is not type(None)
+                ]
+                if len(non_none_types) >= 1:
+                    # Prioritize Pydantic models over primitive types
+                    pydantic_models = [
+                        arg for arg in non_none_types if self._is_pydantic_model(arg)
+                    ]
+                    if pydantic_models:
+                        return self._extract_nested_type(pydantic_models[0])
+                    return self._extract_nested_type(non_none_types[0])
+            # Handle other generic types like dict[str, Any], etc.
+            elif hasattr(field_type, "__args__"):
+                return field_type
+
+        return field_type
+
+    # pylint: disable=too-many-return-statements
+    def _extract_all_pydantic_models_from_type(
+        self, field_type
+    ) -> list[type[BaseModel]]:
+        """Extract all Pydantic models from a type annotation, including from Unions."""
+        models = []
+
+        if field_type is None:
+            return models
+
+        # Handle Annotated types
+        if hasattr(typing, "get_origin") and hasattr(typing, "get_args"):
+            origin = typing.get_origin(field_type)
+            args = typing.get_args(field_type)
+
+            if origin is not None:
+                # Handle Annotated[SomeType, ...] - extract from the first argument
+                if hasattr(typing, "Annotated") and origin is typing.Annotated:
+                    if args:
+                        models.extend(
+                            self._extract_all_pydantic_models_from_type(args[0])
+                        )
+                    return models
+
+                # Handle list[SomeType], List[SomeType], etc.
+                if origin in (list, typing.List):
+                    if args:
+                        models.extend(
+                            self._extract_all_pydantic_models_from_type(args[0])
+                        )
+                    return models
+
+                # Handle Union types
+                if origin is typing.Union:
+                    for arg in args:
+                        if arg is not type(None):  # Skip None type
+                            models.extend(
+                                self._extract_all_pydantic_models_from_type(arg)
+                            )
+                    return models
+
+        # Handle new Python 3.10+ union syntax
+        if hasattr(field_type, "__class__") and field_type.__class__ is types.UnionType:
+            for arg in field_type.__args__:
+                if arg is not type(None):  # Skip None type
+                    models.extend(self._extract_all_pydantic_models_from_type(arg))
+            return models
+
+        # Handle old typing.Union syntax (fallback)
+        if hasattr(field_type, "__origin__") and field_type.__origin__ is Union:
+            for arg in field_type.__args__:
+                if arg is not type(None):  # Skip None type
+                    models.extend(self._extract_all_pydantic_models_from_type(arg))
+            return models
+
+        # Check if this type itself is a Pydantic model
+        if self._is_pydantic_model(field_type):
+            models.append(field_type)
+
+        return models
+
+    def _get_nested_models(
+        self, model_class: type[BaseModel], visited=None
+    ) -> dict[str, type[BaseModel]]:
+        """Get all nested Pydantic models from a model class."""
+        if visited is None:
+            visited = set()
+
+        # Avoid infinite recursion
+        if model_class in visited:
+            return {}
+
+        if model_class in self._nested_models_cache:
+            return self._nested_models_cache[model_class]
+
+        visited.add(model_class)
+        nested_models = {}
+
+        # Check all fields in the model
+        for field_info in model_class.model_fields.values():
+            field_type = self._extract_nested_type(field_info.annotation)
+
+            if self._is_pydantic_model(field_type):
+                nested_models[field_type.__name__] = field_type
+                # Recursively get nested models from this nested model
+                deeper_nested = self._get_nested_models(field_type, visited.copy())
+                nested_models.update(deeper_nested)
+
+        self._nested_models_cache[model_class] = nested_models
+        return nested_models
+
+    def _build_inheritance_map(self, child_class: Type[BaseModel]):
+        """Build inheritance map for a class and all its parents."""
+        if child_class in self._inheritance_map_cache:
+            return self._inheritance_map_cache[child_class]
+
+        inheritance_map = {}
+
+        # Get MRO and filter out BaseModel and object
+        mro_classes = [
+            cls
+            for cls in child_class.__mro__
+            if cls not in (BaseModel, object) and hasattr(cls, "__annotations__")
+        ]
+
+        # Process each class in the MRO
+        for cls in mro_classes:
+            inheritance_map[cls] = self._get_direct_fields(cls)
+
+        self._inheritance_map_cache[child_class] = inheritance_map
+        return inheritance_map
+
+    def _wrap_comment(self, text: str, width: int = 88) -> list[str]:
+        """Wrap a comment to specified width, accounting for '# ' prefix."""
+        if not text.strip():
+            return ["#"]
+
+        # Account for "# " prefix (2 characters)
+        content_width = width - 2
+        wrapped_lines = textwrap.wrap(text, width=content_width)
+        return [f"# {line}" for line in wrapped_lines]
+
+    def _extract_type_from_source(
+        self, model_class: type[BaseModel], field_name: str
+    ) -> str:
+        """Extract the actual type annotation text from source code, checking inheritance chain."""
+        # Use inheritance map to check classes efficiently
+        inheritance_map = self._build_inheritance_map(model_class)
+
+        # Check classes in MRO order
+        for cls in model_class.__mro__:
+            if cls in inheritance_map and field_name in inheritance_map[cls]:
+                type_annotation = self._get_type_from_class_source(cls, field_name)
+                if type_annotation != "unknown":
+                    return type_annotation
+
+        return "unknown"
+
+    def _get_type_from_class_source(self, class_obj: type, field_name: str) -> str:
+        """Extract type annotation from a specific class's source code."""
+        try:
+            source = inspect.getsource(class_obj)
+            tree = ast.parse(source)
+        except (OSError, TypeError):
+            return "unknown"
+
+        # Find the class definition
+        for node in tree.body:
+            if isinstance(node, ast.ClassDef) and node.name == class_obj.__name__:
+                # Find the field assignment
+                for body_node in node.body:
+                    if isinstance(body_node, ast.AnnAssign) and isinstance(
+                        body_node.target, ast.Name
+                    ):
+                        if body_node.target.id == field_name and body_node.annotation:
+                            return ast.unparse(body_node.annotation)
+                break
+
+        return "unknown"
+
+    def _extract_field_groups_from_all_classes(
+        self, model_class: type[BaseModel]
+    ) -> list[dict]:
+        """Extract field groups from all classes in the inheritance hierarchy."""
+        all_groups = []
+        inheritance_map = self._build_inheritance_map(model_class)
+
+        # Get all Pydantic base classes in MRO order (most specific first)
+        # This puts AxolotlInputConfig fields first, then parent class fields
+        pydantic_classes = [
+            cls
+            for cls in model_class.__mro__
+            if cls in inheritance_map and inheritance_map[cls]
+        ]
+
+        # Extract groups from each class
+        for cls in pydantic_classes:
+            class_groups = self._extract_field_groups_from_source(cls)
+            all_groups.extend(class_groups)
+
+        # If no groups found, create a default grouping by class
+        if not all_groups:
+            for cls in pydantic_classes:
+                fields_in_class = inheritance_map[cls]
+                if fields_in_class:
+                    all_groups.append(
+                        {
+                            "fields": list(fields_in_class),
+                        }
+                    )
+
+        return all_groups
+
+    # pylint: disable=too-many-return-statements
+    def _extract_field_groups_from_source(
+        self, model_class: type[BaseModel]
+    ) -> list[dict]:
+        """Extract field groups from source code based on blank lines and comments."""
+        try:
+            source = inspect.getsource(model_class)
+            tree = ast.parse(source)
+        except (OSError, TypeError):
+            # Fallback if we can't get source code
+            fields_in_class = self._get_direct_fields(model_class)
+            if fields_in_class:
+                return [
+                    {
+                        "fields": list(fields_in_class),
+                    }
+                ]
+            return []
+
+        groups = []
+        current_group_fields = []
+        current_group_comment = None
+
+        # Find the class definition
+        class_node = None
+        for node in ast.walk(tree):
+            if isinstance(node, ast.ClassDef) and node.name == model_class.__name__:
+                class_node = node
+                break
+
+        if not class_node:
+            fields_in_class = self._get_direct_fields(model_class)
+            if fields_in_class:
+                return [
+                    {
+                        "fields": list(fields_in_class),
+                    }
+                ]
+            return []
+
+        # Parse the source lines to detect groupings
+        source_lines = source.split("\n")
+
+        # Get fields that are actually defined in this specific class
+        fields_in_class = self._get_direct_fields(model_class)
+
+        # Find assignments that correspond to model fields for THIS class only
+        field_assignments = []
+        for node in class_node.body:
+            if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
+                field_name = node.target.id
+                if field_name in fields_in_class:
+                    field_assignments.append(
+                        {
+                            "name": field_name,
+                            "lineno": node.lineno,
+                            "end_lineno": getattr(node, "end_lineno", node.lineno),
+                        }
+                    )
+
+        if not field_assignments:
+            if fields_in_class:
+                return [
+                    {
+                        "fields": list(fields_in_class),
+                    }
+                ]
+            return []
+
+        # Sort by line number
+        field_assignments.sort(key=lambda x: x["lineno"])
+
+        # Group fields based on blank lines and comments
+        for i, field_info in enumerate(field_assignments):
+            field_name = field_info["name"]
+            current_line = field_info["lineno"]
+
+            # Check if this starts a new group (blank line before or significant gap)
+            is_new_group = False
+
+            if i == 0:
+                is_new_group = True
+            else:
+                prev_end_line = field_assignments[i - 1]["end_lineno"]
+
+                # Check for blank lines or comments between fields
+                lines_between = source_lines[prev_end_line : current_line - 1]
+                has_blank_line = any(line.strip() == "" for line in lines_between)
+                has_comment = any(
+                    line.strip().startswith("#") for line in lines_between
+                )
+
+                # Start new group if there's a blank line or comment, or significant gap
+                if has_blank_line or has_comment or (current_line - prev_end_line > 3):
+                    is_new_group = True
+
+            if is_new_group and current_group_fields:
+                # Save the previous group
+                groups.append(
+                    {
+                        "fields": current_group_fields.copy(),
+                        "description": current_group_comment,
+                    }
+                )
+                current_group_fields = []
+                current_group_comment = None
+
+            current_group_fields.append(field_name)
+
+        # Add the final group
+        if current_group_fields:
+            groups.append(
+                {
+                    "fields": current_group_fields,
+                    "description": current_group_comment,
+                }
+            )
+
+        return groups
+
+    def _generate_field_documentation(
+        self,
+        model_class: type[BaseModel],
+        field_name: str,
+        field_info: dict,
+        field_type_str: str,
+        is_required: bool,
+        indent_level: int = 0,
+        visited_models: set | None = None,
+    ) -> list[str]:
+        """Generate documentation for a single field, expanding nested models inline."""
+        if visited_models is None:
+            visited_models = set()
+
+        lines = []
+        indent = "  " * indent_level
+
+        # Get the actual field type for nested model detection
+        if field_name in model_class.model_fields:
+            pydantic_field_info = model_class.model_fields[field_name]
+            actual_field_type = pydantic_field_info.annotation
+        else:
+            actual_field_type = None
+
+        # Add description comment if available
+        description = field_info.get("description", "")
+        if description:
+            wrapped_lines = self._wrap_comment(description, width=88 - len(indent))
+            for line in wrapped_lines:
+                lines.append(f"{indent}{line}")
+
+        # Extract nested Pydantic models from the type annotation
+        nested_models = self._extract_all_pydantic_models_from_type(actual_field_type)
+
+        # Filter out already visited models to prevent infinite recursion
+        expandable_models = [
+            model for model in nested_models if model not in visited_models
+        ]
+
+        if expandable_models:
+            # This field contains Pydantic models that can be expanded
+
+            # Show the field with its full type annotation
+            field_line = f"{indent}{field_name}: {field_type_str}"
+            if field_info.get("default") is not None:
+                field_line += f" = {field_info['default']}"
+            if is_required:
+                field_line += " (required)"
+            lines.append(field_line)
+
+            # Add to visited to prevent infinite recursion
+            new_visited = visited_models.copy()
+            new_visited.update(expandable_models)
+
+            # Expand each nested Pydantic model
+            for i, nested_model in enumerate(expandable_models):
+                if i > 0:
+                    lines.append("")
+                lines.append(f"{indent}  # For {nested_model.__name__}:")
+
+                # Get nested model schema
+                try:
+                    nested_schema = nested_model.model_json_schema()
+                    nested_properties = nested_schema.get("properties", {})
+                    nested_required = nested_schema.get("required", [])
+                except Exception:  # pylint: disable=broad-exception-caught
+                    # Fallback: use model fields directly
+                    nested_properties = {}
+                    nested_required = []
+                    for (
+                        nested_field_name,
+                        nested_field_info,
+                    ) in nested_model.model_fields.items():
+                        nested_description = ""
+                        if (
+                            hasattr(nested_field_info, "json_schema_extra")
+                            and nested_field_info.json_schema_extra
+                        ):
+                            nested_description = (
+                                nested_field_info.json_schema_extra.get(
+                                    "description", ""
+                                )
+                            )
+                        elif (
+                            hasattr(nested_field_info, "description")
+                            and nested_field_info.description
+                        ):
+                            nested_description = nested_field_info.description
+
+                        nested_default_val = None
+                        if (
+                            hasattr(nested_field_info, "default")
+                            and nested_field_info.default is not None
+                        ):
+                            if str(nested_field_info.default) != "PydanticUndefined":
+                                nested_default_val = nested_field_info.default
+
+                        nested_properties[nested_field_name] = {
+                            "type": "unknown",
+                            "description": nested_description,
+                            "default": nested_default_val,
+                        }
+
+                        if nested_field_info.is_required():
+                            nested_required.append(nested_field_name)
+
+                # Get field groups for the nested model
+                nested_field_groups = self._extract_field_groups_from_all_classes(
+                    nested_model
+                )
+
+                # Generate nested fields with increased indentation
+                for group_idx, group in enumerate(nested_field_groups):
+                    if not group["fields"]:
+                        continue
+
+                    # Add blank line between groups (except before first group)
+                    if group_idx > 0:
+                        lines.append("")
+
+                    # Process nested fields
+                    for nested_field_name in group["fields"]:
+                        if nested_field_name not in nested_properties:
+                            continue
+
+                        nested_field_info = nested_properties[nested_field_name]
+                        nested_field_type = self._extract_type_from_source(
+                            nested_model, nested_field_name
+                        )
+                        nested_is_required = nested_field_name in nested_required
+
+                        # Recursively generate documentation for nested field
+                        nested_lines = self._generate_field_documentation(
+                            nested_model,
+                            nested_field_name,
+                            nested_field_info,
+                            nested_field_type,
+                            nested_is_required,
+                            indent_level + 1,
+                            new_visited,
+                        )
+                        lines.extend(nested_lines)
+        else:
+            # Regular field (no expandable nested models)
+            field_line = f"{indent}{field_name}: {field_type_str}"
+            if field_info.get("default") is not None:
+                field_line += f" = {field_info['default']}"
+            if is_required:
+                field_line += " (required)"
+            lines.append(field_line)
+
+        return lines
+
+    def generate_qmd(
+        self,
+        model_class: type[BaseModel],
+        title: str | None = None,
+        expand_nested: bool = True,
+    ) -> str:
+        """Auto-generate config reference documentation including inherited fields."""
+
+        if title is None:
+            title = f"{model_class.__name__} Reference"
+
+        # Try to get JSON schema, with fallback for serialization issues
+        try:
+            schema = model_class.model_json_schema()
+            properties = schema.get("properties", {})
+            required = schema.get("required", [])
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            print(
+                f"Warning: Could not generate JSON schema ({e}). Using model fields instead."
+            )
+            # Fallback: use model fields directly
+            properties = {}
+            required = []
+            for field_name, field_info in model_class.model_fields.items():
+                # Extract description from json_schema_extra or field info
+                description = ""
+                if (
+                    hasattr(field_info, "json_schema_extra")
+                    and field_info.json_schema_extra
+                ):
+                    description = field_info.json_schema_extra.get("description", "")
+                elif hasattr(field_info, "description") and field_info.description:
+                    description = field_info.description
+
+                # Get default value
+                default_val = None
+                if hasattr(field_info, "default") and field_info.default is not None:
+                    # Handle special Pydantic default markers
+                    if str(field_info.default) != "PydanticUndefined":
+                        default_val = field_info.default
+
+                properties[field_name] = {
+                    "type": "unknown",
+                    "description": description,
+                    "default": default_val,
+                }
+
+                if field_info.is_required():
+                    required.append(field_name)
+
+        # Extract field groups from all classes in inheritance hierarchy
+        field_groups = self._extract_field_groups_from_all_classes(model_class)
+
+        # Start building QMD content
+        qmd_lines = [
+            "---",
+            f"title: {title}",
+            "description: A complete list of all configuration options.",
+            "---",
+            "",
+        ]
+
+        # Generate one big code block with all fields (inline nested expansion)
+        qmd_lines.append("```yaml")
+
+        for i, group in enumerate(field_groups):
+            if not group["fields"]:
+                continue
+
+            # Add blank line between groups (except before first group)
+            if i > 0:
+                qmd_lines.append("")
+
+            # Process fields in the order they appear in source
+            for field_name in group["fields"]:
+                if field_name not in properties:
+                    continue
+
+                field_info = properties[field_name]
+                field_type = self._extract_type_from_source(model_class, field_name)
+                is_required = field_name in required
+
+                if expand_nested:
+                    # Check if this field has nested models
+                    if field_name in model_class.model_fields:
+                        pydantic_field_info = model_class.model_fields[field_name]
+                        nested_models = self._extract_all_pydantic_models_from_type(
+                            pydantic_field_info.annotation
+                        )
+                        has_nested = bool(nested_models)
+                    else:
+                        has_nested = False
+
+                    # Add blank line before nested config
+                    if has_nested:
+                        qmd_lines.append("")
+
+                    # Use the new inline generation method
+                    field_lines = self._generate_field_documentation(
+                        model_class,
+                        field_name,
+                        field_info,
+                        field_type,
+                        is_required,
+                        indent_level=0,
+                        visited_models=set(),
+                    )
+                    qmd_lines.extend(field_lines)
+
+                    # Add blank line after nested config
+                    if has_nested:
+                        qmd_lines.append("")
+                else:
+                    # Original simple approach
+                    description = field_info.get("description", "")
+                    default = field_info.get("default")
+
+                    # Add wrapped comment for description
+                    if description:
+                        wrapped_lines = self._wrap_comment(description)
+                        qmd_lines.extend(wrapped_lines)
+
+                    line = f"{field_name}: {field_type}"
+                    if default is not None:
+                        line += f" = {default}"
+                    if is_required:
+                        line += " (required)"
+                    qmd_lines.append(line)
+
+        qmd_lines.append("```")
+
+        # Join all lines and clean up any double newlines
+        content = "\n".join(qmd_lines)
+
+        # Replace multiple consecutive newlines with just two newlines (one blank line)
+        import re
+
+        content = re.sub(r"\n{3,}", "\n\n", content)
+
+        # Ensure single newline at the very end
+        content = content.rstrip("\n") + "\n"
+
+        return content
+
+
+def main():
+    """Generate docs/config-reference.qmd from the AxolotlInputConfig schema."""
+    generator = QuartoGenerator()
+
+    print("Generating config reference content...")
+    qmd_content = generator.generate_qmd(
+        AxolotlInputConfig, title="Config Reference", expand_nested=True
+    )
+
+    print("Writing to file...")
+    with open("docs/config-reference.qmd", "w", encoding="utf-8") as f:
+        f.write(qmd_content)
+    print("Done!")
+
+
+if __name__ == "__main__":
+    main()
--- a/docs/sequence_parallelism.qmd
+++ b/docs/sequence_parallelism.qmd
@@ -22,7 +22,7 @@ To enable sequence parallelism, add the following to your configuration file:

 ```yaml
 # Set to a divisor (> 1) of the number of GPUs available
-sequence_parallel_degree: 4  # Split sequences across 4 GPUs
+context_parallel_size: 4  # Split sequences across 4 GPUs
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 heads_k_stride: 1
 # Optional; one of "varlen_llama3" or "batch_ring". Defaults to
@@ -30,7 +30,7 @@ heads_k_stride: 1
 ring_attn_func:
 ```

-The `sequence_parallel_degree` should be a divisor of the total number of GPUs. For example:
+The `context_parallel_size` should be a divisor of the total number of GPUs. For example:

 - With 8 GPUs, valid values would be 2, 4, or 8
 - With 4 GPUs, valid values would be 2 or 4
@@ -66,7 +66,7 @@ sequence_len: 8192

 ...

-sequence_parallel_degree: 4  # Split each sequence into 4 parts, one per GPU
+context_parallel_size: 4  # Split each sequence into 4 parts, one per GPU
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 heads_k_stride: 1
 # Optional; one of "varlen_llama3" or "batch_ring". Defaults to
@@ -89,12 +89,12 @@ Sequence parallelism is compatible with Axolotl's sample packing functionality.

 ## Effect on Batch Size

-When using sequence parallelism, your effective global batch size is **divided** by the `sequence_parallel_degree`. This happens because:
+When using sequence parallelism, your effective global batch size is **divided** by the `context_parallel_size`. This happens because:

-- Each group of `sequence_parallel_degree` GPUs works on the same batch (just different parts of each sequence)
+- Each group of `context_parallel_size` GPUs works on the same batch (just different parts of each sequence)
 - The number of batches processed per step decreases

 For example:
 - With 8 GPUs and no sequence parallelism: 8 different batches processed per step
-- With 8 GPUs and `sequence_parallel_degree=4`: Only 2 different batches processed per step (each split across 4 GPUs)
+- With 8 GPUs and `context_parallel_size=4`: Only 2 different batches processed per step (each split across 4 GPUs)
 - If your per-GPU `micro_batch_size` is 2, the global batch size decreases from 16 to 4
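
Note on the batch-size arithmetic in the hunk above: the example figures (8 GPUs, `micro_batch_size` of 2, global batch size dropping from 16 to 4) can be reproduced with a short calculation. This is a sketch that assumes the only factors are the GPU count, the per-GPU micro batch size, and `context_parallel_size`. The function name `global_batch_size` is hypothetical and is not part of Axolotl.

```python
def global_batch_size(
    num_gpus: int, micro_batch_size: int, context_parallel_size: int = 1
) -> int:
    """Samples processed per step when each group of GPUs shares one batch."""
    if context_parallel_size < 1 or num_gpus % context_parallel_size != 0:
        raise ValueError("context_parallel_size must be a divisor of num_gpus")
    # GPUs in the same context-parallel group see the same batch, so only
    # the number of groups contributes distinct batches.
    independent_groups = num_gpus // context_parallel_size
    return independent_groups * micro_batch_size


print(global_batch_size(8, 2))     # no sequence parallelism
print(global_batch_size(8, 2, 4))  # context_parallel_size=4
```

The two calls correspond to the "16 to 4" example in the documentation text.
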