Compare commits
33 Commits
kd-fix-202
...
chore/docs
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
159f0531f9 | ||
|
|
0494359c6c | ||
|
|
26c39e1ca7 | ||
|
|
45adf1bfb9 | ||
|
|
eb3a57eb17 | ||
|
|
34da391391 | ||
|
|
0bb9077553 | ||
|
|
a85efffbef | ||
|
|
06a648263b | ||
|
|
9d5bfc127e | ||
|
|
da8f6c32b9 | ||
|
|
88c0e8d048 | ||
|
|
d8e8cd8558 | ||
|
|
ccc94da8ad | ||
|
|
ba62aa65ee | ||
|
|
21388cf615 | ||
|
|
80d5b066ec | ||
|
|
a3c82e8cbb | ||
|
|
b2274d430b | ||
|
|
eac4a61f55 | ||
|
|
ace9287c96 | ||
|
|
f5fbc82f2b | ||
|
|
706c677cad | ||
|
|
468580d18e | ||
|
|
3634d8ff9d | ||
|
|
bcc108efc1 | ||
|
|
581dd324cc | ||
|
|
00cda8cc70 | ||
|
|
52a0452acb | ||
|
|
83632f71d8 | ||
|
|
92afa4fa27 | ||
|
|
dd660c2ed0 | ||
|
|
09c685fd2c |
8
.github/workflows/base.yml
vendored
8
.github/workflows/base.yml
vendored
@@ -16,6 +16,7 @@ on:
|
||||
jobs:
|
||||
build-base:
|
||||
if: github.repository_owner == 'axolotl-ai-cloud'
|
||||
timeout-minutes: 480
|
||||
# this job needs to be run on self-hosted GPU runners...
|
||||
runs-on: ubuntu-latest-m
|
||||
strategy:
|
||||
@@ -47,14 +48,14 @@ jobs:
|
||||
cuda_version: 12.6.3
|
||||
cudnn_version: ""
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
|
||||
dockerfile: "Dockerfile-base"
|
||||
- cuda: "128"
|
||||
cuda_version: 12.6.3
|
||||
cudnn_version: ""
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
|
||||
dockerfile: "Dockerfile-base"
|
||||
- cuda: "128"
|
||||
@@ -106,6 +107,7 @@ jobs:
|
||||
TORCH_CUDA_ARCH_LIST=${{ matrix.torch_cuda_arch_list }}
|
||||
build-base-uv:
|
||||
if: github.repository_owner == 'axolotl-ai-cloud'
|
||||
timeout-minutes: 480
|
||||
runs-on: ubuntu-latest-m
|
||||
strategy:
|
||||
fail-fast: false
|
||||
@@ -122,7 +124,7 @@ jobs:
|
||||
cuda_version: 12.8.1
|
||||
cudnn_version: ""
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
|
||||
dockerfile: "Dockerfile-uv-base"
|
||||
steps:
|
||||
|
||||
2
.github/workflows/docs.yml
vendored
2
.github/workflows/docs.yml
vendored
@@ -23,7 +23,7 @@ jobs:
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python3 -m pip install jupyter quartodoc
|
||||
python3 -m pip install -e . --no-deps
|
||||
python3 -m pip install -e .
|
||||
- name: Build autodoc
|
||||
run: quartodoc build
|
||||
- name: Publish to GitHub Pages (and render)
|
||||
|
||||
8
.github/workflows/main.yml
vendored
8
.github/workflows/main.yml
vendored
@@ -29,12 +29,12 @@ jobs:
|
||||
- cuda: 126
|
||||
cuda_version: 12.6.3
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
axolotl_extras:
|
||||
- cuda: 128
|
||||
cuda_version: 12.8.1
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
axolotl_extras:
|
||||
runs-on: axolotl-gpu-runner
|
||||
steps:
|
||||
@@ -97,12 +97,12 @@ jobs:
|
||||
- cuda: 126
|
||||
cuda_version: 12.6.3
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
axolotl_extras:
|
||||
- cuda: 128
|
||||
cuda_version: 12.8.1
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
axolotl_extras:
|
||||
runs-on: axolotl-gpu-runner
|
||||
steps:
|
||||
|
||||
2
.github/workflows/multi-gpu-e2e.yml
vendored
2
.github/workflows/multi-gpu-e2e.yml
vendored
@@ -43,7 +43,7 @@ jobs:
|
||||
- cuda: 126
|
||||
cuda_version: 12.6.3
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
axolotl_extras:
|
||||
num_gpus: 2
|
||||
nightly_build: "true"
|
||||
|
||||
6
.github/workflows/preview-docs.yml
vendored
6
.github/workflows/preview-docs.yml
vendored
@@ -8,7 +8,9 @@ on:
|
||||
paths:
|
||||
- '**/*.md' # any Markdown file
|
||||
- '**/*.qmd' # any Quarto file
|
||||
- '_quarto.yaml'
|
||||
- '_quarto.yml'
|
||||
- docs/scripts/generate_config_docs.py
|
||||
- src/axolotl/utils/schemas/**.py
|
||||
|
||||
permissions:
|
||||
checks: write
|
||||
@@ -38,7 +40,7 @@ jobs:
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python3 -m pip install jupyter quartodoc
|
||||
python3 -m pip install -e . --no-deps
|
||||
python3 -m pip install -e .
|
||||
|
||||
- name: Build autodoc
|
||||
run: quartodoc build
|
||||
|
||||
12
.github/workflows/tests.yml
vendored
12
.github/workflows/tests.yml
vendored
@@ -52,7 +52,7 @@ jobs:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
python_version: ["3.11"]
|
||||
pytorch_version: ["2.5.1", "2.6.0", "2.7.0"]
|
||||
pytorch_version: ["2.5.1", "2.6.0", "2.7.1"]
|
||||
timeout-minutes: 20
|
||||
|
||||
steps:
|
||||
@@ -125,7 +125,7 @@ jobs:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
python_version: ["3.11"]
|
||||
pytorch_version: ["2.5.1", "2.6.0", "2.7.0"]
|
||||
pytorch_version: ["2.5.1", "2.6.0", "2.7.1"]
|
||||
timeout-minutes: 20
|
||||
|
||||
steps:
|
||||
@@ -188,7 +188,7 @@ jobs:
|
||||
if: ${{ ! contains(github.event.commits[0].message, '[skip e2e]') && github.repository_owner == 'axolotl-ai-cloud' }}
|
||||
# this job needs to be run on self-hosted GPU runners...
|
||||
runs-on: [self-hosted, modal]
|
||||
timeout-minutes: 90
|
||||
timeout-minutes: 120
|
||||
needs: [pre-commit, pytest, pytest-sdist]
|
||||
|
||||
strategy:
|
||||
@@ -238,7 +238,7 @@ jobs:
|
||||
if: github.repository_owner == 'axolotl-ai-cloud'
|
||||
# this job needs to be run on self-hosted GPU runners...
|
||||
runs-on: [self-hosted, modal]
|
||||
timeout-minutes: 90
|
||||
timeout-minutes: 120
|
||||
# Only run the remainder of the matrix if the first e2e check passed;
|
||||
# this is to save on wasted compute costs for known failures that get caught in the first run
|
||||
needs: [pre-commit, pytest, docker-e2e-tests-1st]
|
||||
@@ -262,13 +262,13 @@ jobs:
|
||||
- cuda: 126
|
||||
cuda_version: 12.6.3
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
num_gpus: 1
|
||||
axolotl_extras:
|
||||
- cuda: 128
|
||||
cuda_version: 12.8.1
|
||||
python_version: "3.11"
|
||||
pytorch: 2.7.0
|
||||
pytorch: 2.7.1
|
||||
num_gpus: 1
|
||||
axolotl_extras:
|
||||
steps:
|
||||
|
||||
@@ -328,7 +328,7 @@ The following optimizers are supported:
|
||||
- Use `gradient_checkpointing: true` to reduce memory usage
|
||||
- Adjust `micro_batch_size` and `gradient_accumulation_steps` based on your GPU memory
|
||||
|
||||
For more detailed information, please refer to the [documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html).
|
||||
For more detailed information, please refer to the [documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config-reference.html).
|
||||
|
||||
### Errors:
|
||||
|
||||
|
||||
76
README.md
76
README.md
@@ -22,28 +22,32 @@
|
||||
<img src="https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/multi-gpu-e2e.yml/badge.svg" alt="multigpu-semi-weekly tests">
|
||||
</p>
|
||||
|
||||
Axolotl is a tool designed to streamline post-training for various AI models.
|
||||
Post-training refers to any modifications or additional training performed on
|
||||
pre-trained models - including full model fine-tuning, parameter-efficient tuning (like
|
||||
LoRA and QLoRA), supervised fine-tuning (SFT), instruction tuning, and alignment
|
||||
techniques. With support for multiple model architectures and training configurations,
|
||||
Axolotl makes it easy to get started with these techniques.
|
||||
|
||||
Axolotl is designed to work with YAML config files that contain everything you need to
|
||||
preprocess a dataset, train or fine-tune a model, run model inference or evaluation,
|
||||
and much more.
|
||||
## 🎉 Latest Updates
|
||||
|
||||
- 2025/06: Magistral with mistral-common tokenizer support has been added to Axolotl. See [examples](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/magistral) to start training your own Magistral models with Axolotl!
|
||||
- 2025/05: Quantization Aware Training (QAT) support has been added to Axolotl. Explore the [docs](https://docs.axolotl.ai/docs/qat.html) to learn more!
|
||||
- 2025/04: Llama 4 support has been added in Axolotl. See [examples](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/llama-4) to start training your own Llama 4 models with Axolotl's linearized version!
|
||||
- 2025/03: Axolotl has implemented Sequence Parallelism (SP) support. Read the [blog](https://huggingface.co/blog/axolotl-ai-co/long-context-with-sequence-parallelism-in-axolotl) and [docs](https://docs.axolotl.ai/docs/sequence_parallelism.html) to learn how to scale your context length when fine-tuning.
|
||||
- 2025/03: (Beta) Fine-tuning Multimodal models is now supported in Axolotl. Check out the [docs](https://docs.axolotl.ai/docs/multimodal.html) to fine-tune your own!
|
||||
- 2025/02: Axolotl has added LoRA optimizations to reduce memory usage and improve training speed for LoRA and QLoRA in single GPU and multi-GPU training (DDP and DeepSpeed). Jump into the [docs](https://docs.axolotl.ai/docs/lora_optims.html) to give it a try.
|
||||
- 2025/02: Axolotl has added GRPO support. Dive into our [blog](https://huggingface.co/blog/axolotl-ai-co/training-llms-w-interpreter-feedback-wasm) and [GRPO example](https://github.com/axolotl-ai-cloud/grpo_code) and have some fun!
|
||||
- 2025/01: Axolotl has added Reward Modelling / Process Reward Modelling fine-tuning support. See [docs](https://docs.axolotl.ai/docs/reward_modelling.html).
|
||||
|
||||
## ✨ Overview
|
||||
|
||||
Axolotl is a tool designed to streamline post-training for various AI models.
|
||||
|
||||
Features:
|
||||
|
||||
- Train various Huggingface models such as llama, pythia, falcon, mpt
|
||||
- Supports fullfinetune, lora, qlora, relora, and gptq
|
||||
- Customize configurations using a simple yaml file or CLI overwrite
|
||||
- Load different dataset formats, use custom formats, or bring your own tokenized datasets
|
||||
- Integrated with [xformers](https://github.com/facebookresearch/xformers), flash attention, [liger kernel](https://github.com/linkedin/Liger-Kernel), rope scaling, and multipacking
|
||||
- Works with single GPU or multiple GPUs via FSDP or Deepspeed
|
||||
- Easily run with Docker locally or on the cloud
|
||||
- Log results and optionally checkpoints to wandb, mlflow or Comet
|
||||
- And more!
|
||||
- **Multiple Model Support**: Train various models like LLaMA, Mistral, Mixtral, Pythia, and more. We are compatible with HuggingFace transformers causal language models.
|
||||
- **Training Methods**: Full fine-tuning, LoRA, QLoRA, GPTQ, QAT, Preference Tuning (DPO, IPO, KTO, ORPO), RL (GRPO), Multimodal, and Reward Modelling (RM) / Process Reward Modelling (PRM).
|
||||
- **Easy Configuration**: Re-use a single YAML file between dataset preprocess, training, evaluation, quantization, and inference.
|
||||
- **Performance Optimizations**: [Multipacking](https://docs.axolotl.ai/docs/multipack.html), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Xformers](https://github.com/facebookresearch/xformers), [Flex Attention](https://pytorch.org/blog/flexattention/), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), [Cut Cross Entropy](https://github.com/apple/ml-cross-entropy/tree/main), Sequence Parallelism (SP), LoRA optimizations, Multi-GPU training (FSDP1, FSDP2, DeepSpeed), Multi-node training (Torchrun, Ray), and many more!
|
||||
- **Flexible Dataset Handling**: Load from local, HuggingFace, and cloud (S3, Azure, GCP, OCI) datasets.
|
||||
- **Cloud Ready**: We ship [Docker images](https://hub.docker.com/u/axolotlai) and also [PyPI packages](https://pypi.org/project/axolotl/) for use on cloud platforms and local hardware.
|
||||
|
||||
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
@@ -81,19 +85,12 @@ axolotl train examples/llama-3/lora-1b.yml
|
||||
|
||||
That's it! Check out our [Getting Started Guide](https://docs.axolotl.ai/docs/getting-started.html) for a more detailed walkthrough.
|
||||
|
||||
## ✨ Key Features
|
||||
|
||||
- **Multiple Model Support**: Train various models like LLaMA, Mistral, Mixtral, Pythia, and more
|
||||
- **Training Methods**: Full fine-tuning, LoRA, QLoRA, and more
|
||||
- **Easy Configuration**: Simple YAML files to control your training setup
|
||||
- **Performance Optimizations**: Flash Attention, xformers, multi-GPU training
|
||||
- **Flexible Dataset Handling**: Use various formats and custom datasets
|
||||
- **Cloud Ready**: Run on cloud platforms or local hardware
|
||||
|
||||
## 📚 Documentation
|
||||
|
||||
- [Installation Options](https://docs.axolotl.ai/docs/installation.html) - Detailed setup instructions for different environments
|
||||
- [Configuration Guide](https://docs.axolotl.ai/docs/config.html) - Full configuration options and examples
|
||||
- [Configuration Guide](https://docs.axolotl.ai/docs/config-reference.html) - Full configuration options and examples
|
||||
- [Dataset Loading](https://docs.axolotl.ai/docs/dataset_loading.html) - Loading datasets from various sources
|
||||
- [Dataset Guide](https://docs.axolotl.ai/docs/dataset-formats/) - Supported formats and how to use them
|
||||
- [Multi-GPU Training](https://docs.axolotl.ai/docs/multi-gpu.html)
|
||||
- [Multi-Node Training](https://docs.axolotl.ai/docs/multi-node.html)
|
||||
@@ -112,31 +109,6 @@ That's it! Check out our [Getting Started Guide](https://docs.axolotl.ai/docs/ge
|
||||
|
||||
Contributions are welcome! Please see our [Contributing Guide](https://github.com/axolotl-ai-cloud/axolotl/blob/main/.github/CONTRIBUTING.md) for details.
|
||||
|
||||
## Supported Models
|
||||
|
||||
| | fp16/fp32 | lora | qlora | gptq | gptq w/flash attn | flash attn | xformers attn |
|
||||
|-------------|:----------|:-----|-------|------|-------------------|------------|--------------|
|
||||
| llama | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| Mistral | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| Mixtral-MoE | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
||||
| Mixtral8X22 | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
||||
| Pythia | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
|
||||
| cerebras | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
|
||||
| btlm | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
|
||||
| mpt | ✅ | ❌ | ❓ | ❌ | ❌ | ❌ | ❓ |
|
||||
| falcon | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❓ |
|
||||
| gpt-j | ✅ | ✅ | ✅ | ❌ | ❌ | ❓ | ❓ |
|
||||
| XGen | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ✅ |
|
||||
| phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
||||
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
|
||||
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
||||
| Gemma | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
|
||||
| Jamba | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
|
||||
|
||||
✅: supported
|
||||
❌: not supported
|
||||
❓: untested
|
||||
|
||||
## ❤️ Sponsors
|
||||
|
||||
Thank you to our sponsors who help make Axolotl possible:
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
project:
|
||||
type: website
|
||||
pre-render: docs/scripts/generate_config_docs.py
|
||||
|
||||
quartodoc:
|
||||
dir: docs/api
|
||||
@@ -235,7 +236,7 @@ website:
|
||||
- docs/installation.qmd
|
||||
- docs/inference.qmd
|
||||
- docs/cli.qmd
|
||||
- docs/config.qmd
|
||||
- docs/config-reference.qmd
|
||||
- text: "API Reference"
|
||||
href: docs/api
|
||||
|
||||
|
||||
@@ -6,7 +6,7 @@ from .single_gpu import GPU_CONFIG, VOLUME_CONFIG, app, cicd_image, run_cmd
|
||||
@app.function(
|
||||
image=cicd_image,
|
||||
gpu=GPU_CONFIG,
|
||||
timeout=90 * 60, # 90 min
|
||||
timeout=120 * 60, # 90 min
|
||||
cpu=8.0,
|
||||
memory=131072,
|
||||
volumes=VOLUME_CONFIG,
|
||||
|
||||
@@ -69,7 +69,7 @@ def run_cmd(cmd: str, run_folder: str):
|
||||
@app.function(
|
||||
image=cicd_image,
|
||||
gpu=GPU_CONFIG,
|
||||
timeout=90 * 60,
|
||||
timeout=120 * 60,
|
||||
cpu=16.0,
|
||||
memory=131072 * N_GPUS,
|
||||
volumes=VOLUME_CONFIG,
|
||||
|
||||
@@ -38,6 +38,6 @@ RUN git lfs install --skip-repo && \
|
||||
# The base image ships with `pydantic==1.8.2` which is not working
|
||||
pip3 install -U --no-cache-dir pydantic==1.10.10
|
||||
|
||||
RUN if [ "$PYTORCH_VERSION" = "2.7.0" ] ; then \
|
||||
RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
|
||||
pip3 install flash-attn==2.7.4.post1; \
|
||||
fi
|
||||
|
||||
@@ -29,7 +29,7 @@ ENV PATH="/root/miniconda3/envs/py${PYTHON_VERSION}/bin:${PATH}"
|
||||
WORKDIR /workspace
|
||||
|
||||
RUN python3 -m pip install --upgrade pip && pip3 install packaging && \
|
||||
python3 -m pip install --no-cache-dir -U torch==2.7.0 --extra-index-url https://download.pytorch.org/whl/test/cu$CUDA && \
|
||||
python3 -m pip install --no-cache-dir -U torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/test/cu$CUDA && \
|
||||
python3 -m pip install --no-cache-dir "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" && \
|
||||
python3 -m pip install --no-cache-dir "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main"
|
||||
|
||||
|
||||
@@ -29,8 +29,12 @@ RUN uv venv --no-project --relocatable axolotl-venv
|
||||
|
||||
ENV PATH="/workspace/axolotl-venv/bin:${PATH}"
|
||||
|
||||
RUN uv pip install packaging setuptools wheel \
|
||||
RUN uv pip install packaging setuptools wheel psutil \
|
||||
&& uv pip install torch==${PYTORCH_VERSION} \
|
||||
&& uv pip install --no-build-isolation "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" \
|
||||
&& uv pip install "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main" \
|
||||
&& uv pip install awscli pydantic
|
||||
|
||||
RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
|
||||
uv pip install --no-build-isolation flash-attn==2.7.4.post1; \
|
||||
fi
|
||||
|
||||
1
docs/.gitignore
vendored
1
docs/.gitignore
vendored
@@ -2,3 +2,4 @@
|
||||
_site/
|
||||
/api/*.qmd
|
||||
/api/*.html
|
||||
config-reference.qmd
|
||||
|
||||
795
docs/config.qmd
795
docs/config.qmd
@@ -1,795 +0,0 @@
|
||||
---
|
||||
title: Config Reference
|
||||
description: A complete list of all configuration options.
|
||||
---
|
||||
|
||||
```yaml
|
||||
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
|
||||
# This can also be a relative path to a model on disk
|
||||
base_model: ./llama-7b-hf
|
||||
# You can specify an ignore pattern if the model repo contains more than 1 model type (*.pt, etc)
|
||||
base_model_ignore_patterns:
|
||||
# If the base_model repo on hf hub doesn't include configuration .json files,
|
||||
# You can set that here, or leave this empty to default to base_model
|
||||
base_model_config: ./llama-7b-hf
|
||||
# You can specify to choose a specific model revision from huggingface hub
|
||||
revision_of_model:
|
||||
# Optional tokenizer configuration path in case you want to use a different tokenizer
|
||||
# than the one defined in the base model
|
||||
tokenizer_config:
|
||||
# If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too
|
||||
model_type: AutoModelForCausalLM
|
||||
# Corresponding tokenizer for the model AutoTokenizer is a good choice
|
||||
tokenizer_type: AutoTokenizer
|
||||
# Trust remote code for untrusted source
|
||||
trust_remote_code:
|
||||
# use_fast option for tokenizer loading from_pretrained, default to True
|
||||
tokenizer_use_fast:
|
||||
# Whether to use the legacy tokenizer setting, defaults to True
|
||||
tokenizer_legacy:
|
||||
# Resize the model embeddings when new tokens are added to multiples of 32
|
||||
# This is reported to improve training speed on some models
|
||||
resize_token_embeddings_to_32x:
|
||||
# Optional[bool] Whether to shrink the embeddings to len(tokenizer). By default, we won't shrink.
|
||||
shrink_embeddings:
|
||||
# Optional[bool] Don't upcast the embeddings to float32 when using PEFT. Useful for low-VRAM GPUs
|
||||
embeddings_skip_upcast:
|
||||
# Whether to load the model with randomly initialized weights. Useful for
|
||||
# pre-training a model from scratch or debugging purposes.
|
||||
random_init_weights:
|
||||
|
||||
# (Internal use only)
|
||||
# Used to identify which the model is based on
|
||||
is_falcon_derived_model:
|
||||
is_llama_derived_model:
|
||||
is_qwen_derived_model:
|
||||
# Please note that if you set this to true, `padding_side` will be set to "left" by default
|
||||
is_mistral_derived_model:
|
||||
|
||||
# optional overrides to the base model configuration
|
||||
overrides_of_model_config:
|
||||
# RoPE Scaling https://github.com/huggingface/transformers/pull/24653
|
||||
rope_scaling:
|
||||
type: # linear | dynamic
|
||||
factor: # float
|
||||
|
||||
# optional overrides the base model loading from_pretrained
|
||||
overrides_of_model_kwargs:
|
||||
# use_cache: False
|
||||
|
||||
# optional overrides to the bnb 4bit quantization configuration
|
||||
# https://huggingface.co/docs/transformers/main/main_classes/quantization#transformers.BitsAndBytesConfig
|
||||
bnb_config_kwargs:
|
||||
# These are default values
|
||||
llm_int8_has_fp16_weight: false
|
||||
bnb_4bit_quant_type: nf4
|
||||
bnb_4bit_use_double_quant: true
|
||||
|
||||
# quantization aware training
|
||||
qat:
|
||||
activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
|
||||
weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"
|
||||
group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
|
||||
fake_quant_after_n_steps: # Optional[int] = None. The number of steps to apply fake quantization after
|
||||
|
||||
# post-training quantization
|
||||
quantization:
|
||||
weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8
|
||||
activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
|
||||
group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
|
||||
quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
|
||||
|
||||
|
||||
# Whether you are training a 4-bit GPTQ quantized model
|
||||
gptq: true
|
||||
|
||||
# This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
|
||||
load_in_8bit: true
|
||||
# Use bitsandbytes 4 bit
|
||||
load_in_4bit:
|
||||
|
||||
# Use CUDA bf16
|
||||
bf16: true # bool or 'full' for `bf16_full_eval`, or 'auto' for automatic detection. require >=ampere
|
||||
# Use CUDA fp16
|
||||
fp16: true
|
||||
# Use CUDA tf32
|
||||
tf32: true # require >=ampere
|
||||
# Note: if bf16 is set to 'auto', and fp16 is set to true, we will prefer the explict fp16 setting
|
||||
|
||||
# No AMP (automatic mixed precision)
|
||||
bfloat16: true # require >=ampere
|
||||
float16: true
|
||||
|
||||
# Limit the memory for all available GPUs to this amount (if an integer, expressed in gigabytes); default: unset
|
||||
gpu_memory_limit: 20GiB
|
||||
# Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
|
||||
lora_on_cpu: true
|
||||
|
||||
# List[str]. Add plugins to extend the pipeline.
|
||||
# See `src/axolotl/integrations` for the available plugins or doc below for more details.
|
||||
# https://docs.axolotl.ai/docs/custom_integrations.html
|
||||
plugins:
|
||||
# - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
|
||||
|
||||
# A list of one or more datasets to finetune the model with
|
||||
# See https://docs.axolotl.ai/docs/dataset_loading.html for guide on loading datasets
|
||||
# See https://docs.axolotl.ai/docs/dataset-formats/ for guide on dataset formats
|
||||
datasets:
|
||||
# HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
|
||||
- path: vicgalle/alpaca-gpt4
|
||||
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
|
||||
type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
|
||||
ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
|
||||
data_files: # Optional[str] path to source data files
|
||||
|
||||
shards: # Optional[int] split dataset into N pieces (use with shards_idx)
|
||||
shards_idx: # Optional[int] = 0 the index of sharded dataset to use
|
||||
|
||||
preprocess_shards: # Optional[int] process dataset in N sequential chunks for memory efficiency (exclusive with `shards`)
|
||||
|
||||
name: # Optional[str] name of dataset configuration to load
|
||||
split: train # Optional[str] name of dataset split to load from
|
||||
revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
|
||||
trust_remote_code: # Optional[bool] Trust remote code for untrusted source
|
||||
|
||||
# Custom user instruction prompt
|
||||
- path: repo
|
||||
type:
|
||||
# The below are defaults. only set what's needed if you use a different column name.
|
||||
system_prompt: ""
|
||||
system_format: "{system}"
|
||||
field_system: system
|
||||
field_instruction: instruction
|
||||
field_input: input
|
||||
field_output: output
|
||||
|
||||
# Customizable to be single line or multi-line
|
||||
# Use {instruction}/{input} as key to be replaced
|
||||
# 'format' can include {input}
|
||||
format: |-
|
||||
User: {instruction} {input}
|
||||
Assistant:
|
||||
# 'no_input_format' cannot include {input}
|
||||
no_input_format: "{instruction} "
|
||||
|
||||
# For `completion` datsets only, uses the provided field instead of `text` column
|
||||
field:
|
||||
|
||||
# Using chat template
|
||||
- path: ...
|
||||
# Set type to `chat_template` to use this strategy
|
||||
type: chat_template
|
||||
# Specify the name of the chat template to use
|
||||
# The name of the chat template to use for training, following values are supported:
|
||||
# - tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default.
|
||||
# - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py
|
||||
# - tokenizer_default_fallback_*: where * is the name of the chat template to fallback to if the tokenizer does not have a chat template else default to tokenizer. E.g. tokenizer_default_fallback_chatml.
|
||||
# - jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field.
|
||||
chat_template: tokenizer_default
|
||||
|
||||
# Custom jinja chat template. Used only if `chat_template: jinja` or empty.
|
||||
chat_template_jinja:
|
||||
|
||||
# Key containing the messages (default: "messages")
|
||||
field_messages: messages
|
||||
|
||||
# Key containing the system message (default: "system")
|
||||
# If the system message is not present in the dataset sample, it will be loaded from the field_system property.
|
||||
field_system: system
|
||||
|
||||
# Mapping of properties from the input dataset to the chat template.
|
||||
# (default: message_property_mappings={'role':'role', 'content':'content'})
|
||||
# If a property exists in the template but not in this mapping, the system will attempt
|
||||
# to load it directly from the message using the property name as the key.
|
||||
# Example: In the mapping below, 'from' is loaded from input dataset and used as 'role',
|
||||
# while 'value' is loaded and used as 'content' in the chat template.
|
||||
message_property_mappings:
|
||||
role: from
|
||||
content: value
|
||||
# ...
|
||||
|
||||
# Optional[Dict[str, List]]. Roles mapping in the messages.
|
||||
# The format is {target_role: [source_roles]}. All source roles will be mapped to the target role.
|
||||
# The default is:
|
||||
roles:
|
||||
user: ["human", "user"]
|
||||
assistant: ["gpt", "assistant"]
|
||||
system: ["system"]
|
||||
tool: ["tool"]
|
||||
|
||||
# Optional[bool]. Whether to drop the system turn from the dataset. Only works with chat_template.
|
||||
# This does not drop the default system message from chat_template if it exists. If you wish to,
|
||||
# we recommend using a custom jinja template with the default system message removed or
|
||||
# adding a system turn with empty content.
|
||||
drop_system_message:
|
||||
|
||||
# Optional[bool]. (for Qwen3 template only) Whether to split the assistant content based on a reasoning trace inside delimited tags
|
||||
# See example at `docs/dataset-formats/conversation.qmd`
|
||||
split_thinking:
|
||||
|
||||
# IMPORTANT: The following fields determine which parts of the conversation to train on.
|
||||
# Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
|
||||
# See examples at `docs/dataset-formats/conversation.qmd`
|
||||
# Note: If the below 5 fields are empty, defaults to training only on the last message.
|
||||
|
||||
# Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
|
||||
roles_to_train: ["assistant"] # default
|
||||
# Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
|
||||
# - all: train on all EOS tokens
|
||||
# - turn (default): train on the EOS token at the end of each trainable turn
|
||||
# - last: train on the last EOS token in the conversation
|
||||
# TIP: Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
|
||||
train_on_eos: turn
|
||||
# Optional[str]. Which EOT (End-of-Turn) tokens to train on in the conversation. Possible values are:
|
||||
# - all: train on all EOT tokens
|
||||
# - turn: train on the EOT token at the end of each trainable turn
|
||||
# - last: train on the last EOT token in the conversation
|
||||
# If not specified, defaults to the value of train_on_eos for backward compatibility.
|
||||
train_on_eot:
|
||||
# The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
|
||||
message_field_training: training
|
||||
# The key in the message turn that contains the training details. Useful to selectively train on certain tokens in a turn.
|
||||
# The value of the key is a List[Dict] containing `begin_offset` (start character index in content), `end_offset` (end character index in content), and `train` (boolean whether to train).
|
||||
message_field_training_detail: train_detail
|
||||
|
||||
|
||||
# If false, the datasets will not be shuffled and will keep their original order in `datasets`.
|
||||
# The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
|
||||
shuffle_merged_datasets: true
|
||||
|
||||
# Deduplicates datasets and test_datasets with identical entries.
|
||||
dataset_exact_deduplication: true
|
||||
|
||||
# A list of one or more datasets to eval the model with.
|
||||
# You can use either test_datasets, or val_set_size, but not both.
|
||||
test_datasets:
|
||||
- path: /workspace/data/eval.jsonl
|
||||
ds_type: json
|
||||
# You need to specify a split. For "json" datasets the default split is called "train".
|
||||
split: train
|
||||
type: completion
|
||||
data_files:
|
||||
- /workspace/data/eval.jsonl
|
||||
|
||||
# use RL training: 'dpo', 'ipo', 'kto', 'simpo', 'orpo', 'grpo'
|
||||
rl:
|
||||
rl_beta: # Optional[float]. The beta parameter for the RL training.
|
||||
|
||||
# dpo
|
||||
dpo_use_weighting: # Optional[bool]. Whether to perform weighting.
|
||||
rpo_alpha: # Optional[float]. Weighting of NLL term in loss from RPO paper.
|
||||
|
||||
# orpo
|
||||
orpo_alpha: 0.1 # Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to `beta` in `ORPOConfig` due to trl mapping.
|
||||
|
||||
# kto
|
||||
kto_desirable_weight: # Optional[float]. Factor for desirable loss term in KTO loss.
|
||||
kto_undesirable_weight: # Optional[float]. Factor for undesirable loss term in KTO loss.
|
||||
|
||||
# simpo
|
||||
cpo_alpha: 1.0 # Weight of the BC regularizer
|
||||
simpo_gamma: 0.5 # Target reward margin for the SimPO loss
|
||||
|
||||
# grpo
|
||||
trl:
|
||||
use_vllm: # Optional[bool]. Whether to use VLLM for RL training.
|
||||
vllm_server_host: # Optional[str]. Host of the vLLM server to connect to.
|
||||
vllm_server_port: # Optional[int]. Port of the vLLM server to connect to.
|
||||
vllm_server_timeout: # Optional[int]. Total timeout (in seconds) to wait for the vLLM server to respond.
|
||||
vllm_guided_decoding_regex: # Optional[str]. Regex for vLLM guided decoding.
|
||||
|
||||
beta: # Optional[float]. Beta parameter for the RL training. Same as `rl_beta`. Use
|
||||
max_completion_length: # Optional[int]. Maximum length of the completion for RL training.
|
||||
|
||||
reward_funcs: # Optional[list[str]]. List of reward functions to load. Paths must be importable from current dir.
|
||||
reward_weights: # Optional[list[float]]. List of reward weights for the reward functions.
|
||||
|
||||
num_generations: # Optional[int]. Number of generations to sample.
|
||||
log_completions: # Optional[bool]. Whether to log completions.
|
||||
num_completions_to_print: # Optional[int]. Number of completions to print when log_completions is True.
|
||||
|
||||
sync_ref_model: # Optional[bool]. Whether to sync the reference model.
|
||||
ref_model_mixup_alpha: # Optional[float]. Mixup alpha for the reference model.
|
||||
ref_model_sync_steps: # Optional[int]. Sync steps for the reference model.
|
||||
scale_rewards: # Optional[bool]. Whether to scale rewards by their standard deviation.
|
||||
|
||||
temperature: # Optional[float]. Sampling temperature for the GRPO policy.
|
||||
top_p: # Optional[float]. Top-p sampling probability for the generation policy.
|
||||
top_k: # Optional[int]. Top-k sampling for the generation policy.
|
||||
min_p: # Optional[float]. Minimum probability for the generation policy.
|
||||
repetition_penalty: # Optional[float]. Penalty for tokens that appear in prompt and generated text.
|
||||
|
||||
num_iterations: # Optional[int]. Number of iterations per batch (μ) for GRPO.
|
||||
epsilon: # Optional[float]. Epsilon value for clipping in the GRPO algorithm.
|
||||
epsilon_high: # Optional[float]. Upper-bound epsilon value for clipping in the GRPO algorithm.
|
||||
use_liger_loss: # Optional[bool]. Whether to use Liger loss for GRPO.
|
||||
loss_type: # Optional[str]. Loss formulation to use. Supported values: grpo, bnpo, dr_grpo.
|
||||
mask_truncated_completions: # Optional[bool]. Whether to exclude truncated completions from loss calculation.
|
||||
|
||||
|
||||
# reward modelling: `True` or `False`
|
||||
reward_model:
|
||||
|
||||
# process reward modelling: `True` or `False`
|
||||
process_reward_model:
|
||||
|
||||
# The name of the chat template to use for training, following values are supported:
|
||||
# - tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default value.
|
||||
# - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py
|
||||
# - tokenizer_default_fallback_*: where * is the name of the chat template to fallback to. E.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not available in the tokenizer.
|
||||
# - jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field.
|
||||
# The selected chat template will be saved to the tokenizer_config.json for easier inferencing
|
||||
# Note: It is recommended to set train_on_inputs to true when using a chat template that is different from the model's default chat template.
|
||||
chat_template: tokenizer_default
|
||||
# custom jinja template for chat template. This will be only used if chat_template is set to `jinja` or `null` (in which case chat_template is automatically set to `jinja`). Default is null.
|
||||
chat_template_jinja: null
|
||||
# Optional[List[str]]. Custom EOT (End-of-Turn) tokens to mask/unmask during training.
|
||||
# These tokens mark the boundaries between conversation turns.
|
||||
# For example: ["/INST", "</s>", "[/SYSTEM_PROMPT]"]
|
||||
# If not specified, defaults to just the model's eos_token.
|
||||
# This is useful for templates that use multiple delimiter tokens.
|
||||
eot_tokens:
|
||||
# - "</s>"
|
||||
# - "[/INST]"
|
||||
# - "[/SYSTEM_PROMPT]"
|
||||
# Changes the default system message
|
||||
default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
|
||||
# Axolotl attempts to save the dataset as an arrow after packing the data together so
|
||||
# subsequent training attempts load faster, relative path
|
||||
dataset_prepared_path: data/last_run_prepared
|
||||
# Push prepared dataset to hub
|
||||
push_dataset_to_hub: # Optional[str] repo_org/repo_name
|
||||
# The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
|
||||
# if not set.
|
||||
dataset_processes: # defaults to os.cpu_count() if not set
|
||||
# Keep dataset in memory while preprocessing
|
||||
# Only needed if cached dataset is taking too much storage
|
||||
dataset_keep_in_memory:
|
||||
# push checkpoints to hub
|
||||
hub_model_id: # private repo path to push finetuned model
|
||||
# how to push checkpoints to hub
|
||||
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
|
||||
hub_strategy:
|
||||
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
|
||||
# Required to be true when used in combination with `push_dataset_to_hub`
|
||||
hf_use_auth_token: # boolean
|
||||
# How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for no eval.
|
||||
val_set_size: 0.04
|
||||
# Num shards for whole dataset
|
||||
dataset_shard_num:
|
||||
# Index of shard to use for whole dataset
|
||||
dataset_shard_idx:
|
||||
|
||||
# The maximum length of an input to train with, this should typically be less than 2048
|
||||
# as most models have a token/context limit of 2048
|
||||
sequence_len: 2048
|
||||
# Pad inputs so each step uses constant sized buffers
|
||||
# This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
|
||||
pad_to_sequence_len:
|
||||
# Use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommend set to 'true'
|
||||
sample_packing:
|
||||
# Set to 'false' if getting errors during eval with sample_packing on.
|
||||
eval_sample_packing:
|
||||
# You can set these packing optimizations AFTER starting a training at least once.
|
||||
# The trainer will provide recommended values for these values.
|
||||
sample_packing_eff_est:
|
||||
total_num_tokens:
|
||||
# Increasing the following values helps with packing, but usually only slightly (<%1.)
|
||||
# The number of samples packed at a time.
|
||||
sample_packing_group_size: 100000
|
||||
# The number of samples which can be packed into one sequence. Increase if using a large sequence_len with many short samples.
|
||||
sample_packing_bin_size: 200
|
||||
sample_pack_sequentially: # Optional[bool]. Whether to pack samples sequentially.
|
||||
|
||||
# whether to concatenate samples during pretraining
|
||||
pretraining_sample_concatenation:
|
||||
|
||||
curriculum_sampling: # Optional[bool]. Whether to use sequential sampling for curriculum learning
|
||||
|
||||
# Use batch flattening for speedups when not using sample_packing
|
||||
batch_flattening:
|
||||
|
||||
# Passed through to transformers when loading the model when launched without accelerate
|
||||
# Use `sequential` when training w/ model parallelism to limit memory
|
||||
device_map:
|
||||
# Defines the max memory usage per gpu on the system. Passed through to transformers when loading the model.
|
||||
max_memory:
|
||||
|
||||
# If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model
|
||||
adapter: lora
|
||||
# If you already have a lora model trained that you want to load, put that here.
|
||||
# This means after training, if you want to test the model, you should set this to the value of `output_dir`.
|
||||
# Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`.
|
||||
lora_model_dir:
|
||||
|
||||
# LoRA hyperparameters
|
||||
# For more details about the following options, see:
|
||||
# https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
|
||||
lora_r: 8
|
||||
lora_alpha: 16
|
||||
lora_dropout: 0.05
|
||||
lora_target_modules:
|
||||
- q_proj
|
||||
- v_proj
|
||||
# - k_proj
|
||||
# - o_proj
|
||||
# - gate_proj
|
||||
# - down_proj
|
||||
# - up_proj
|
||||
lora_target_linear: # If true, will target all linear modules
|
||||
|
||||
# List[int] | int. # The layer indices to transform, otherwise, apply to all layers
|
||||
# https://huggingface.co/docs/peft/v0.15.0/en/package_reference/lora#peft.LoraConfig.layers_to_transform
|
||||
peft_layers_to_transform:
|
||||
|
||||
# Optional[bool]. Whether to use DoRA.
|
||||
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#weight-decomposed-low-rank-adaptation-dora
|
||||
peft_use_dora:
|
||||
|
||||
# Optional[bool]. Whether to use RSLoRA.
|
||||
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#rank-stabilized-lora
|
||||
peft_use_rslora:
|
||||
|
||||
# Optional[list[tuple[int, int]]]. List of layer indices to replicate.
|
||||
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#memory-efficient-layer-replication-with-lora
|
||||
peft_layer_replication:
|
||||
|
||||
# bool | Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq"]
|
||||
# How to initialize LoRA weights. Default to True which is MS original implementation.
|
||||
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#initialization
|
||||
peft_init_lora_weights:
|
||||
|
||||
# If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
|
||||
# For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
|
||||
# `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
|
||||
# https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
|
||||
lora_modules_to_save:
|
||||
# - embed_tokens
|
||||
# - lm_head
|
||||
|
||||
lora_fan_in_fan_out: false
|
||||
|
||||
# Apply custom LoRA autograd functions and activation function Triton kernels for
|
||||
# speed and memory savings
|
||||
# See: https://docs.axolotl.ai/docs/lora_optims.html
|
||||
lora_mlp_kernel: true
|
||||
lora_qkv_kernel: true
|
||||
lora_o_kernel: true
|
||||
|
||||
# LoRA+ hyperparameters
|
||||
# For more details about the following options, see:
|
||||
# https://arxiv.org/abs/2402.12354 and `src/axolotl/core/train_builder.py`
|
||||
loraplus_lr_ratio: # loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
|
||||
loraplus_lr_embedding: # loraplus learning rate for lora embedding layers. Default value is 1e-6.
|
||||
|
||||
peft:
|
||||
# Configuration options for loftq initialization for LoRA
|
||||
# https://huggingface.co/docs/peft/developer_guides/quantization#loftq-initialization
|
||||
loftq_config:
|
||||
loftq_bits: # typically 4 bits
|
||||
|
||||
# ReLoRA configuration
|
||||
# Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
|
||||
relora_steps: # Number of steps per ReLoRA restart
|
||||
relora_warmup_steps: # Number of per-restart warmup steps
|
||||
relora_anneal_steps: # Number of anneal steps for each relora cycle
|
||||
relora_prune_ratio: # threshold for optimizer magnitude when pruning
|
||||
relora_cpu_offload: # True to perform lora weight merges on cpu during restarts, for modest gpu memory savings
|
||||
|
||||
# wandb configuration if you're using it
|
||||
# Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
|
||||
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
|
||||
wandb_project: # Your wandb project name
|
||||
wandb_entity: # A wandb Team name if using a Team
|
||||
wandb_watch:
|
||||
wandb_name: # Set the name of your wandb run
|
||||
wandb_run_id: # Set the ID of your wandb run
|
||||
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
|
||||
|
||||
# mlflow configuration if you're using it
|
||||
mlflow_tracking_uri: # URI to mlflow
|
||||
mlflow_experiment_name: # Your experiment name
|
||||
mlflow_run_name: # Your run name
|
||||
hf_mlflow_log_artifacts: # set to true to copy each saved checkpoint on each save to mlflow artifact registry
|
||||
|
||||
# Comet configuration if you're using it
|
||||
# Make sure your `COMET_API_KEY` environment variable is set (recommended) or you login to Comet with `comet login`.
|
||||
# Check out our documentation for more details https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/Experiment-Creation/#comet_ml.start
|
||||
use_comet: # Enable or disable Comet integration.
|
||||
comet_api_key: # API key for Comet. Recommended to set via `comet login`.
|
||||
comet_workspace: # Workspace name in Comet. Defaults to the user's default workspace.
|
||||
comet_project_name: # Project name in Comet. Defaults to Uncategorized.
|
||||
comet_experiment_key: # Identifier for the experiment. Used to append data to an existing experiment or control the key of new experiments. Default to a random key.
|
||||
comet_mode: # Create a new experiment ("create") or log to an existing one ("get"). Default ("get_or_create") auto-selects based on configuration.
|
||||
comet_online: # Set to True to log data to Comet server, or False for offline storage. Default is True.
|
||||
comet_experiment_config: # Dictionary for additional configuration settings, see the doc for more details.
|
||||
|
||||
# Tensorboard
|
||||
use_tensorboard: # Optional[bool]
|
||||
|
||||
# Where to save the full-finetuned model to
|
||||
output_dir: ./completed-model
|
||||
|
||||
# Whether to use torch.compile and which backend to use
|
||||
# setting to `auto` will enable torch compile when torch>=2.5.1
|
||||
torch_compile: # Optional[Union[Literal["auto"], bool]]
|
||||
torch_compile_backend: # Optional[str]
|
||||
torch_compile_mode: # 'default' | 'reduce-overhead' | 'max-autotune'
|
||||
|
||||
# Training hyperparameters
|
||||
|
||||
# If greater than 1, backpropagation will be skipped and the gradients will be accumulated for the given number of steps.
|
||||
gradient_accumulation_steps: 1
|
||||
# The number of samples to include in each batch. This is the number of samples sent to each GPU.
|
||||
# Batch size per gpu = micro_batch_size * gradient_accumulation_steps
|
||||
micro_batch_size: 2
|
||||
eval_batch_size:
|
||||
num_epochs: 4
|
||||
warmup_steps: 100 # cannot use with warmup_ratio
|
||||
warmup_ratio: 0.05 # cannot use with warmup_steps
|
||||
learning_rate: 0.00003
|
||||
lr_quadratic_warmup:
|
||||
logging_steps:
|
||||
eval_steps: # Leave empty to eval at each epoch, integer for every N steps. float for fraction of total steps
|
||||
evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
|
||||
eval_strategy: # Set to `"no"` to skip evaluation, `"epoch"` at end of each epoch, leave empty to infer from `eval_steps`.
|
||||
save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of each epoch, `"best"` when better result is achieved, leave empty to infer from `save_steps`.
|
||||
save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
|
||||
saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
|
||||
save_total_limit: # Checkpoints saved at a time
|
||||
save_only_model: # Save only the model weights, skipping the optimizer. Using this means you can't resume from checkpoints.
|
||||
# Maximum number of iterations to train for. It precedes num_epochs which means that
|
||||
# if both are set, num_epochs will not be guaranteed.
|
||||
# e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
|
||||
max_steps:
|
||||
|
||||
# bool of whether to include tokens trainer per second in the training metrics. This iterates over the entire dataset once, so it takes some time.
|
||||
include_tokens_per_second: # Optional[bool]
|
||||
|
||||
# whether to find batch size that fits in memory. Passed to underlying transformers Trainer
|
||||
auto_find_batch_size: # Optional[bool]
|
||||
|
||||
eval_table_size: # Approximate number of predictions sent to wandb depending on batch size. Enabled above 0. Default is 0
|
||||
eval_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
|
||||
do_causal_lm_eval: # Whether to run causal language model evaluation for metrics in `eval_causal_lm_metrics`.
|
||||
eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", "chrf", "perplexity"]
|
||||
|
||||
profiler_steps: # enable the pytorch profiler to capture the first N steps of training to the output_dir.
|
||||
# see https://pytorch.org/blog/understanding-gpu-memory-1/ for more information
|
||||
# snapshots can be visualized @ https://pytorch.org/memory_viz
|
||||
|
||||
loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
|
||||
loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
|
||||
|
||||
# Save model as safetensors (require safetensors package). Default True
|
||||
save_safetensors:
|
||||
|
||||
# Whether to mask out or include the human's prompt from the training labels
|
||||
train_on_inputs: false
|
||||
# Group similarly sized data to minimize padding.
|
||||
# May be slower to start, as it must download and sort the entire dataset.
|
||||
# Note that training loss may have an oscillating pattern with this enabled.
|
||||
group_by_length: false
|
||||
|
||||
# Whether to use gradient checkpointing. Available options are: true, false, "offload", "offload_disk".
|
||||
# https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
|
||||
gradient_checkpointing: false
|
||||
# additional kwargs to pass to the trainer for gradient checkpointing
|
||||
# gradient_checkpointing_kwargs:
|
||||
# use_reentrant: true
|
||||
|
||||
# Stop training after this many evaluation losses have increased in a row
|
||||
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
|
||||
early_stopping_patience: 3
|
||||
|
||||
# Specify a scheduler and kwargs to use with the optimizer
|
||||
# Valid values are driven by the Transformers SchedulerType class, see:
|
||||
# https://github.com/huggingface/transformers/blob/5f4ecf2d9f867a1255131d2461d75793c0cf1db2/src/transformers/trainer_utils.py#L420
|
||||
# Valid values include
|
||||
# - 'linear'
|
||||
# - 'cosine' (default)
|
||||
# - 'cosine_with_restarts'
|
||||
# - 'polynomial'
|
||||
# - 'constant'
|
||||
# - 'constant_with_warmup'
|
||||
# - 'inverse_sqrt'
|
||||
# - 'reduce_lr_on_plateau'
|
||||
# - 'cosine_with_min_lr'
|
||||
# - 'warmup_stable_decay'
|
||||
|
||||
# Additional schedulers include:
|
||||
# - 'one_cycle'
|
||||
# - 'rex'
|
||||
lr_scheduler:
|
||||
lr_scheduler_kwargs:
|
||||
cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
|
||||
cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
|
||||
|
||||
# For one_cycle optim
|
||||
lr_div_factor: # Learning rate div factor
|
||||
|
||||
# Specify optimizer
|
||||
# Valid values are driven by the Transformers OptimizerNames class, see:
|
||||
# https://github.com/huggingface/transformers/blob/cbf924b76c03828101a34069a96d209314114fd5/src/transformers/training_args.py#L144-L189
|
||||
#
|
||||
# Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
|
||||
# torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
|
||||
# in the examples/ for your model and fine-tuning use case.
|
||||
#
|
||||
# Valid values for 'optimizer' include:
|
||||
# - adamw_torch
|
||||
# - adamw_torch_fused (default)
|
||||
# - adamw_torch_xla
|
||||
# - adamw_torch_npu_fused
|
||||
# - adamw_apex_fused
|
||||
# - adopt_adamw (an EXPERIMENTAL optimizer, only for torch version >= 2.5.1)
|
||||
# - adafactor
|
||||
# - adamw_anyprecision
|
||||
# - adamw_torch_4bit
|
||||
# - ademamix
|
||||
# - sgd
|
||||
# - adagrad
|
||||
# - adamw_bnb_8bit
|
||||
# - adamw_8bit # alias for adamw_bnb_8bit
|
||||
# - ademamix_8bit
|
||||
# - lion_8bit
|
||||
# - lion_32bit
|
||||
# - paged_adamw_32bit
|
||||
# - paged_adamw_8bit
|
||||
# - paged_ademamix_32bit
|
||||
# - paged_ademamix_8bit
|
||||
# - paged_lion_32bit
|
||||
# - paged_lion_8bit
|
||||
# - rmsprop
|
||||
# - rmsprop_bnb
|
||||
# - rmsprop_bnb_8bit
|
||||
# - rmsprop_bnb_32bit
|
||||
# - galore_adamw
|
||||
# - galore_adamw_8bit
|
||||
# - galore_adafactor
|
||||
# - galore_adamw_layerwise
|
||||
# - galore_adamw_8bit_layerwise
|
||||
# - galore_adafactor_layerwise
|
||||
# - lomo
|
||||
# - adalomo
|
||||
# - grokadamw
|
||||
# - schedule_free_adamw
|
||||
# - schedule_free_sgd
|
||||
# - apollo_adamw
|
||||
# - apollo_adamw_layerwise
|
||||
#
|
||||
# Additional custom optimizers include:
|
||||
# - optimi_adamw
|
||||
# - ao_adamw_8bit
|
||||
# - ao_adamw_fp8
|
||||
# - came_pytorch
|
||||
optimizer:
|
||||
# Dictionary of arguments to pass to the optimizer
|
||||
optim_args:
|
||||
# For Galore Optimizers the following optim_args are available
|
||||
# rank: # type: int
|
||||
# update_proj_gap # type: int
|
||||
# scale # type: float
|
||||
# proj_type: # type: str, default = std
|
||||
|
||||
# The target modules to optimize, i.e. the module names that you would like to train, right now this is used only for GaLore algorithm
|
||||
optim_target_modules:
|
||||
# - self_attn # for llama
|
||||
# - mlp
|
||||
|
||||
# Specify weight decay
|
||||
weight_decay:
|
||||
# adamw hyperparams
|
||||
adam_beta1:
|
||||
adam_beta2:
|
||||
adam_beta3: # only used for CAME Optimizer
|
||||
adam_epsilon:
|
||||
adam_epsilon2: # only used for CAME Optimizer
|
||||
# Gradient clipping max norm
|
||||
max_grad_norm:
|
||||
|
||||
# Augmentation techniques
|
||||
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
|
||||
# currently only supported on Llama and Mistral
|
||||
neftune_noise_alpha:
|
||||
|
||||
# Optional[bool]. Whether to bettertransformers
|
||||
flash_optimum:
|
||||
|
||||
# Note: Only one of the following attention patches can be used at a time.
|
||||
# For example, if you set `xformers_attention` to `true`, do not set `flash_attention` to `true`.
|
||||
|
||||
# Optional[bool]. Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
|
||||
xformers_attention:
|
||||
# Optional[bool]. Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
|
||||
flash_attention:
|
||||
flash_attn_cross_entropy: # Optional[bool]. Whether to use flash-attention cross entropy implementation - advanced use only
|
||||
flash_attn_rms_norm: # Optional[bool]. Whether to use flash-attention rms norm implementation - advanced use only
|
||||
flash_attn_fuse_qkv: # Optional[bool]. Whether to fuse QKV into a single operation
|
||||
flash_attn_fuse_mlp: # Optional[bool]. Whether to fuse part of the MLP into a single operation
|
||||
# Optional[bool]. Whether to use scaled-dot-product attention
|
||||
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
|
||||
sdp_attention:
|
||||
# Optional[bool]. Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
|
||||
s2_attention:
|
||||
|
||||
# Optional[bool]. Whether to use low_cpu_mem_usage
|
||||
low_cpu_mem_usage:
|
||||
# Optional[str]. Resume from a specific checkpoint dir
|
||||
resume_from_checkpoint:
|
||||
# Optional[bool]. Resume from the most recent checkpoint in the output directory when resume_from_checkpoint isn't set and you simply want training to pick up where it left off.
# Be careful leaving this turned on when switching between different models.
|
||||
auto_resume_from_checkpoints: false
|
||||
|
||||
## Multimodal section
|
||||
# int | tuple[int, int] | None . Size to resize images to, width x height.
|
||||
# Will read from model/processor config if not set.
|
||||
image_size:
|
||||
# str. Algorithm to use for image resizing. "bilinear", "bicubic", "lanczos". Default is "bilinear".
|
||||
image_resize_algorithm: 'bilinear'
|
||||
## End of multimodal section
|
||||
|
||||
# Don't mess with this, it's here for accelerate and torchrun
|
||||
local_rank:
|
||||
|
||||
# Add or change special tokens.
|
||||
# If you add tokens here, you don't need to add them to the `tokens` list.
|
||||
special_tokens:
|
||||
# bos_token: "<s>"
|
||||
# eos_token: "</s>"
|
||||
# unk_token: "<unk>"
|
||||
# pad_token: "[PAD]"
|
||||
|
||||
# Optional[list[str]]. Add extra tokens to the tokenizer.
|
||||
tokens:
|
||||
# - "<|startoftext|>"
|
||||
# - "<|endoftext|>"
|
||||
|
||||
# Mapping token_id to new_token_string to override reserved added_tokens in the tokenizer.
|
||||
# Only works for tokens that are not part of the base vocab (aka are added_tokens).
|
||||
# Can be checked if they exist in tokenizer.json added_tokens.
|
||||
added_tokens_overrides: # Dict[int, str]
|
||||
# 128041: "<|im_start|>"
|
||||
# 128042: "<|im_end|>"
|
||||
|
||||
# FSDP
|
||||
fsdp:
|
||||
fsdp_config:
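# Illustrative FSDP sketch (assumes a Llama-style model; swap the decoder
# layer class to match your architecture):
# fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_activation_checkpointing: true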
|
||||
|
||||
# Deepspeed config path. e.g., deepspeed_configs/zero3.json
|
||||
deepspeed:
|
||||
|
||||
# Advanced DDP Arguments
|
||||
ddp_timeout:
|
||||
ddp_bucket_cap_mb:
|
||||
ddp_broadcast_buffers:
|
||||
|
||||
# Sequence parallelism
|
||||
# Set to a divisor of the number of GPUs available to split sequences into chunks of equal size.
|
||||
# Use in long context training to prevent OOM when sequences cannot fit into a single GPU's VRAM.
|
||||
# E.g., if 4 GPUs are available, set this value to 2 to split each sequence into two equal-sized
|
||||
# subsequences, or set to 4 to split into four equal-sized subsequences.
|
||||
# See https://docs.axolotl.ai/docs/sequence_parallelism.html for more details.
|
||||
sequence_parallel_degree:
|
||||
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
|
||||
# Must evenly divide the number of KV heads in your model.
|
||||
heads_k_stride: 1
|
||||
# One of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to "varlen_llama3"
|
||||
# in the sample packing case, and "batch_ring" in the non-sample packing case.
|
||||
ring_attn_func:
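#
# Illustrative sequence-parallel sketch for a 4-GPU node (splits each
# sequence across 2 GPUs; default values shown for the other keys):
# sequence_parallel_degree: 2
# heads_k_stride: 1
# ring_attn_func: varlen_llama3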
|
||||
|
||||
# Path to torch distx for optim 'adamw_anyprecision'
|
||||
torchdistx_path:
|
||||
|
||||
# Set to an HF dataset to stream type: 'completion' data during training instead of pre-tokenizing
|
||||
pretraining_dataset:
|
||||
|
||||
# Debug mode
|
||||
debug:
|
||||
|
||||
# Seed
|
||||
seed:
|
||||
|
||||
# Allow overwriting the yml config from the CLI
|
||||
strict:
|
||||
```
|
||||
@@ -12,7 +12,7 @@ Chat Template strategy uses a jinja2 template that converts a list of messages i
|
||||
{"conversations": [{"role": "...", "content": "..."}]}
|
||||
```
|
||||
|
||||
See [configs](../config.qmd) for full configs and supported templates.
|
||||
See [configs](../config-reference.qmd) for full configs and supported templates.
|
||||
|
||||
### Migrating from sharegpt
|
||||
|
||||
@@ -52,7 +52,9 @@ We recommend checking the below examples for other usecases.
|
||||
|
||||
### Examples
|
||||
|
||||
1. (Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
|
||||
#### Training on last message
|
||||
|
||||
(Legacy) Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
|
||||
|
||||
```yaml
|
||||
datasets:
|
||||
@@ -66,7 +68,9 @@ datasets:
|
||||
If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
|
||||
:::
|
||||
|
||||
2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
|
||||
#### Overriding default chat template
|
||||
|
||||
Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
|
||||
|
||||
```yaml
|
||||
chat_template: gemma # this overwrites the tokenizer's chat_template
|
||||
@@ -76,7 +80,13 @@ datasets:
|
||||
roles_to_train: ["assistant"] # default value
|
||||
```
|
||||
|
||||
3. Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
|
||||
::: {.callout-note}
|
||||
If you want to use built-in chat_template, use `chat_template: tokenizer_default` (this is set by default).
|
||||
:::
|
||||
|
||||
#### Using default chat template with fallback
|
||||
|
||||
Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
|
||||
|
||||
```yaml
|
||||
chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer's chat_template
|
||||
@@ -85,7 +95,9 @@ datasets:
|
||||
type: chat_template
|
||||
```
|
||||
|
||||
4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.
|
||||
#### Custom Jinja template
|
||||
|
||||
Using a custom jinja template on OpenAI messages format, training on all assistant messages.
|
||||
|
||||
```yaml
|
||||
# chat_template: jinja # `jinja` will be implied if the `chat_template_jinja` is set and this field is empty
|
||||
@@ -100,7 +112,9 @@ datasets:
|
||||
Please make sure that your `tokenizer.eos_token` is same as EOS (End-of-Sequence) token in template. Otherwise, set `eos_token` under `special_tokens: `.
|
||||
:::
|
||||
|
||||
5. If you are using a template that has a different EOT (End-of-Turn) token from EOS token or multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: ` which defaults to turn.
|
||||
#### Using template with different token for EOT and EOS
|
||||
|
||||
- If you are using a template that has a different EOT (End-of-Turn) token from EOS token or multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: ` which defaults to turn.
|
||||
|
||||
```yaml
|
||||
eot_tokens:
|
||||
@@ -116,16 +130,16 @@ datasets:
|
||||
```
|
||||
|
||||
::: {.callout-tip}
|
||||
See [config documentation](../config.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
|
||||
See [config documentation](../config-reference.qmd) for detailed explanations of "turn", "last", and "all" options for training on tokens.
|
||||
:::
|
||||
|
||||
::: {.callout-note}
|
||||
Using `eot_tokens` requires each token that exists in `chat_template` to be a single token in the tokenizer. Otherwise, the tokenizer will split the token and cause unexpected behavior.
|
||||
|
||||
You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config.qmd) for more details.
|
||||
You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config-reference.qmd) for more details.
|
||||
:::
|
||||
|
||||
6. Continuing from the previous example, if you want to train on all EOT token trainable turns but only last EOS token, set `train_on_eos: last`.
|
||||
- Continuing from the previous example, if you want to train on all EOT token trainable turns but only last EOS token, set `train_on_eos: last`.
|
||||
|
||||
```yaml
|
||||
eot_tokens:
|
||||
@@ -145,7 +159,73 @@ If EOS token only appears at the end of a prompt, `train_on_eos: last` is equiva
|
||||
:::
|
||||
|
||||
|
||||
7. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
|
||||
#### Using tool use
|
||||
|
||||
Instead of passing `tools` via the system prompt, an alternative is to keep `tools` in a separate column and load them via `chat_template`, letting the template build the tool prompt dynamically.
|
||||
|
||||
```json
|
||||
{
|
||||
"tools": [
|
||||
{
|
||||
"type": "...",
|
||||
"function": {
|
||||
"name": "...",
|
||||
"description": "...",
|
||||
"parameters": {
|
||||
"type": "...",
|
||||
"properties": {
|
||||
// ...
|
||||
},
|
||||
"required": ["..."],
|
||||
},
|
||||
},
|
||||
},
|
||||
],
|
||||
"messages": [
|
||||
// ...
|
||||
{
|
||||
"role": "assistant", // call the function via assistant
|
||||
"tool_calls": [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "...",
|
||||
"arguments": {
|
||||
"...": "...",
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "tool",
|
||||
"name": "...",
|
||||
"content": "..."
|
||||
},
|
||||
],
|
||||
}
|
||||
```
|
||||
|
||||
::: {.callout-note}
|
||||
Tools need to follow [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
|
||||
:::
|
||||
|
||||
```yaml
|
||||
chat_template: llama4
|
||||
datasets:
|
||||
- path: ...
|
||||
type: chat_template
|
||||
# field_tools: tools # default is `tools`
|
||||
```
|
||||
|
||||
::: {.callout-tip}
|
||||
Look into the `chat_template` you are using to see if it supports `tools` and what the expected role is for the tool answer. In the example above, the tool answer is expected to be in the `tool` or `ipython` role for `llama4` template.
|
||||
:::
|
||||
|
||||
|
||||
#### Using fine-grained control over token masking
|
||||
|
||||
(Advanced) Using fine-grained control over tokens and turns to train in a conversation
|
||||
|
||||
For a data sample that looks like:
|
||||
|
||||
@@ -196,7 +276,9 @@ datasets:
|
||||
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
|
||||
:::
|
||||
|
||||
8. (For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.
|
||||
#### Reasoning split
|
||||
|
||||
(For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.
|
||||
|
||||
```yaml
|
||||
datasets:
|
||||
|
||||
@@ -186,4 +186,4 @@ datasets:
|
||||
no_input_format: "[INST] {instruction} [/INST]"
|
||||
```
|
||||
|
||||
See full config options under [here](../config.qmd).
|
||||
See full config options under [here](../config-reference.qmd).
|
||||
|
||||
@@ -36,7 +36,7 @@ This matches the API of [`datasets.load_dataset`](https://github.com/huggingface
|
||||
|
||||
For HuggingFace's guide to load different dataset types, see [here](https://huggingface.co/docs/datasets/loading).
|
||||
|
||||
For full details on the config, see [config.qmd](config.qmd).
|
||||
For full details on the config, see [config-reference.qmd](config-reference.qmd).
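As a quick orientation, a minimal `datasets` entry might look like the sketch below (the dataset path and `type` are placeholders; swap in your own):

```yaml
datasets:
  - path: mhenrichsen/alpaca_2k_test  # HF Hub id or local path (placeholder)
    type: alpaca
```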
|
||||
|
||||
::: {.callout-note}
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@ format:
|
||||
This section describes the different Docker images that are released by AxolotlAI at [Docker Hub](https://hub.docker.com/u/axolotlai).
|
||||
|
||||
::: {.callout-important}
|
||||
For Blackwell GPUs, please use the tags with Pytorch 2.7.0 and CUDA 12.8.
|
||||
For Blackwell GPUs, please use the tags with Pytorch 2.7.1 and CUDA 12.8.
|
||||
:::
|
||||
|
||||
## Base
|
||||
@@ -32,8 +32,8 @@ main-base-py{python_version}-cu{cuda_version}-{pytorch_version}
|
||||
|
||||
Tags examples:
|
||||
|
||||
- `main-base-py3.11-cu128-2.7.0`
|
||||
- `main-base-py3.11-cu126-2.7.0`
|
||||
- `main-base-py3.11-cu128-2.7.1`
|
||||
- `main-base-py3.11-cu126-2.7.1`
|
||||
- `main-base-py3.11-cu124-2.6.0`
|
||||
- `main-base-py3.11-cu124-2.5.1`
|
||||
|
||||
|
||||
@@ -9,11 +9,11 @@ description: Frequently asked questions
|
||||
|
||||
> A: Usually an issue with the GPUs communicating with each other. See the [NCCL doc](nccl.qmd)
|
||||
|
||||
**Q: Exitcode -9**
|
||||
**Q: exitcode: -9**
|
||||
|
||||
> A: This usually happens when you run out of system RAM.
|
||||
|
||||
**Q: Exitcode -7 while using deepspeed**
|
||||
**Q: exitcode: -7 while using deepspeed**
|
||||
|
||||
> A: Try upgrading deepspeed w: `pip install -U deepspeed`
|
||||
|
||||
|
||||
@@ -55,7 +55,7 @@ output_dir: ./outputs/lora-out
|
||||
- To perform QLoRA finetuning, replace with `load_in_4bit: true` and `adapter: qlora`, as sketched below.
|
||||
:::
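A minimal sketch of that swap (only the adapter-related keys are shown; everything else in the quickstart config stays the same):

```yaml
load_in_8bit: false
load_in_4bit: true
adapter: qlora
```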
|
||||
|
||||
See our [Config options](config.qmd) for more details.
|
||||
See our [config options](config-reference.qmd) for more details.
|
||||
|
||||
### Training {#sec-training}
|
||||
|
||||
@@ -179,7 +179,7 @@ Now that you have the basics, you might want to:
|
||||
|
||||
Check our other guides for details on these topics:
|
||||
|
||||
- [Configuration Guide](config.qmd) - Full configuration options
|
||||
- [Configuration Guide](config-reference.qmd) - Full configuration options
|
||||
- [Dataset Loading](dataset_loading.qmd) - Loading datasets from various sources
|
||||
- [Dataset Formats](dataset-formats) - Working with different data formats
|
||||
- [Multi-GPU Training](multi-gpu.qmd)
|
||||
|
||||
@@ -14,7 +14,7 @@ This guide covers all the ways you can install and set up Axolotl for your envir
|
||||
## Requirements {#sec-requirements}
|
||||
|
||||
- NVIDIA GPU (Ampere architecture or newer for `bf16` and Flash Attention) or AMD GPU
|
||||
- Python ≥3.10
|
||||
- Python ≥3.11
|
||||
- PyTorch ≥2.5.1
|
||||
|
||||
## Installation Methods {#sec-installation-methods}
|
||||
@@ -153,7 +153,7 @@ We recommend using WSL2 (Windows Subsystem for Linux) or Docker.
|
||||
|
||||
### Conda/Pip venv {#sec-conda}
|
||||
|
||||
1. Install Python ≥3.10
|
||||
1. Install Python ≥3.11
|
||||
2. Install PyTorch: https://pytorch.org/get-started/locally/
|
||||
3. Install Axolotl:
|
||||
```{.bash}
|
||||
|
||||
@@ -29,4 +29,4 @@ qat:
|
||||
fake_quant_after_n_steps: # Optional[int] = None. The number of steps to apply fake quantization after
|
||||
```
|
||||
|
||||
Once you have finished training, you must quantize your model by using the same quantization configuration which you used to train the model with. You can use the [`quantize` command](./quantize.md) to do this.
|
||||
Once you have finished training, you must quantize your model by using the same quantization configuration which you used to train the model with. You can use the [`quantize`](./quantize.qmd) command to do this.
|
||||
|
||||
@@ -32,7 +32,7 @@ output_dir: # The path to the output directory.
|
||||
|
||||
Once quantization is complete, your quantized model will be saved in the `{output_dir}/quantized` directory.
|
||||
|
||||
You may also use the `quantize` command to quantize a model which has been trained with [QAT](./qat.md) - you can do this by using the existing QAT configuration file which
|
||||
You may also use the `quantize` command to quantize a model which has been trained with [QAT](./qat.qmd) - you can do this by using the existing QAT configuration file which
|
||||
you used to train the model:
|
||||
|
||||
```yaml
|
||||
|
||||
@@ -500,7 +500,7 @@ The input format is a simple JSON input with customizable fields based on the ab
|
||||
### GRPO
|
||||
|
||||
::: {.callout-tip}
|
||||
Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
|
||||
Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/grpo_code).
|
||||
:::
|
||||
|
||||
In the latest GRPO implementation, `vLLM` is used to significantly speed up trajectory generation during training. In this example, we're using 4 GPUs: 2 for training and 2 for vLLM:
|
||||
|
||||
752
docs/scripts/generate_config_docs.py
Normal file
@@ -0,0 +1,752 @@
|
||||
# type: ignore
|
||||
|
||||
"""
|
||||
Quarto documentation generation from Pydantic models. Uses Pydantic model source code
|
||||
to automatically group fields, including inherited fields from parent classes.
|
||||
"""
|
||||
|
||||
import ast
|
||||
import inspect
import re
|
||||
import textwrap
|
||||
import types
|
||||
import typing
|
||||
from typing import Any, FrozenSet, Type, Union
|
||||
|
||||
from pydantic import BaseModel
|
||||
|
||||
from axolotl.utils.schemas.config import AxolotlInputConfig
|
||||
|
||||
|
||||
class QuartoGenerator:
|
||||
"""Generate Quarto documentation from Pydantic models."""
|
||||
|
||||
def __init__(self):
|
||||
self._class_fields_cache = {}
|
||||
self._inheritance_map_cache = {}
|
||||
self._nested_models_cache = {}
|
||||
|
||||
def _get_direct_fields(self, cls: Type[BaseModel]) -> FrozenSet[str]:
|
||||
"""Get fields defined directly in a single class (not inherited)."""
|
||||
if cls in self._class_fields_cache:
|
||||
return self._class_fields_cache[cls]
|
||||
|
||||
fields = set()
|
||||
|
||||
# Get annotated fields
|
||||
if hasattr(cls, "__annotations__"):
|
||||
fields.update(cls.__annotations__.keys())
|
||||
|
||||
# Filter out private/special methods
|
||||
fields = {f for f in fields if not f.startswith("_")}
|
||||
|
||||
result = frozenset(fields)
|
||||
self._class_fields_cache[cls] = result
|
||||
return result
|
||||
|
||||
def _is_pydantic_model(self, type_obj) -> bool:
|
||||
"""Check if a type is a Pydantic BaseModel."""
|
||||
return inspect.isclass(type_obj) and issubclass(type_obj, BaseModel)
|
||||
|
||||
# pylint: disable=too-many-return-statements
|
||||
def _extract_nested_type(self, field_type) -> Any:
|
||||
"""Extract the actual type from complex type annotations."""
|
||||
# Handle Annotated types (Python 3.9+)
|
||||
if hasattr(typing, "get_origin") and hasattr(typing, "get_args"):
|
||||
origin = typing.get_origin(field_type)
|
||||
args = typing.get_args(field_type)
|
||||
|
||||
if origin is not None:
|
||||
# Handle Annotated[SomeType, ...] - extract the first argument
|
||||
if hasattr(typing, "Annotated") and origin is typing.Annotated:
|
||||
if args:
|
||||
return self._extract_nested_type(
|
||||
args[0]
|
||||
) # Recursively process the actual type
|
||||
|
||||
# Handle list[SomeType], List[SomeType], etc.
|
||||
elif origin in (list, typing.List):
|
||||
if args:
|
||||
return self._extract_nested_type(
|
||||
args[0]
|
||||
) # Extract element type
|
||||
|
||||
# Handle Union types (including | syntax)
|
||||
elif origin is typing.Union:
|
||||
# Get non-None types from the Union
|
||||
non_none_types = [arg for arg in args if arg is not type(None)]
|
||||
if len(non_none_types) >= 1:
|
||||
# Prioritize Pydantic models over primitive types
|
||||
pydantic_models = [
|
||||
arg
|
||||
for arg in non_none_types
|
||||
if self._is_pydantic_model(arg)
|
||||
]
|
||||
if pydantic_models:
|
||||
# Return the first Pydantic model found
|
||||
return self._extract_nested_type(pydantic_models[0])
|
||||
|
||||
# No Pydantic models, return the first non-None type
|
||||
return self._extract_nested_type(non_none_types[0])
|
||||
|
||||
# Handle new Python 3.10+ union syntax (PeftConfig | None)
|
||||
if hasattr(field_type, "__class__") and field_type.__class__ is types.UnionType:
|
||||
# Get non-None types from the Union
|
||||
non_none_types = [
|
||||
arg for arg in field_type.__args__ if arg is not type(None)
|
||||
]
|
||||
if len(non_none_types) >= 1:
|
||||
# Prioritize Pydantic models over primitive types
|
||||
pydantic_models = [
|
||||
arg for arg in non_none_types if self._is_pydantic_model(arg)
|
||||
]
|
||||
if pydantic_models:
|
||||
return self._extract_nested_type(pydantic_models[0])
|
||||
return self._extract_nested_type(non_none_types[0])
|
||||
|
||||
# Handle old typing.Union syntax (fallback)
|
||||
if hasattr(field_type, "__origin__"):
|
||||
if field_type.__origin__ is Union:
|
||||
# Get non-None types from the Union
|
||||
non_none_types = [
|
||||
arg for arg in field_type.__args__ if arg is not type(None)
|
||||
]
|
||||
if len(non_none_types) >= 1:
|
||||
# Prioritize Pydantic models over primitive types
|
||||
pydantic_models = [
|
||||
arg for arg in non_none_types if self._is_pydantic_model(arg)
|
||||
]
|
||||
if pydantic_models:
|
||||
return self._extract_nested_type(pydantic_models[0])
|
||||
return self._extract_nested_type(non_none_types[0])
|
||||
# Handle other generic types like dict[str, Any], etc.
|
||||
elif hasattr(field_type, "__args__"):
|
||||
return field_type
|
||||
|
||||
return field_type
|
||||
|
||||
# pylint: disable=too-many-return-statements
|
||||
def _extract_all_pydantic_models_from_type(
|
||||
self, field_type
|
||||
) -> list[type[BaseModel]]:
|
||||
"""Extract all Pydantic models from a type annotation, including from Unions."""
|
||||
models = []
|
||||
|
||||
if field_type is None:
|
||||
return models
|
||||
|
||||
# Handle Annotated types
|
||||
if hasattr(typing, "get_origin") and hasattr(typing, "get_args"):
|
||||
origin = typing.get_origin(field_type)
|
||||
args = typing.get_args(field_type)
|
||||
|
||||
if origin is not None:
|
||||
# Handle Annotated[SomeType, ...] - extract from the first argument
|
||||
if hasattr(typing, "Annotated") and origin is typing.Annotated:
|
||||
if args:
|
||||
models.extend(
|
||||
self._extract_all_pydantic_models_from_type(args[0])
|
||||
)
|
||||
return models
|
||||
|
||||
# Handle list[SomeType], List[SomeType], etc.
|
||||
if origin in (list, typing.List):
|
||||
if args:
|
||||
models.extend(
|
||||
self._extract_all_pydantic_models_from_type(args[0])
|
||||
)
|
||||
return models
|
||||
|
||||
# Handle Union types
|
||||
if origin is typing.Union:
|
||||
for arg in args:
|
||||
if arg is not type(None): # Skip None type
|
||||
models.extend(
|
||||
self._extract_all_pydantic_models_from_type(arg)
|
||||
)
|
||||
return models
|
||||
|
||||
# Handle new Python 3.10+ union syntax
|
||||
if hasattr(field_type, "__class__") and field_type.__class__ is types.UnionType:
|
||||
for arg in field_type.__args__:
|
||||
if arg is not type(None): # Skip None type
|
||||
models.extend(self._extract_all_pydantic_models_from_type(arg))
|
||||
return models
|
||||
|
||||
# Handle old typing.Union syntax (fallback)
|
||||
if hasattr(field_type, "__origin__") and field_type.__origin__ is Union:
|
||||
for arg in field_type.__args__:
|
||||
if arg is not type(None): # Skip None type
|
||||
models.extend(self._extract_all_pydantic_models_from_type(arg))
|
||||
return models
|
||||
|
||||
# Check if this type itself is a Pydantic model
|
||||
if self._is_pydantic_model(field_type):
|
||||
models.append(field_type)
|
||||
|
||||
return models
|
||||
|
||||
def _get_nested_models(
|
||||
self, model_class: type[BaseModel], visited=None
|
||||
) -> dict[str, type[BaseModel]]:
|
||||
"""Get all nested Pydantic models from a model class."""
|
||||
if visited is None:
|
||||
visited = set()
|
||||
|
||||
# Avoid infinite recursion
|
||||
if model_class in visited:
|
||||
return {}
|
||||
|
||||
if model_class in self._nested_models_cache:
|
||||
return self._nested_models_cache[model_class]
|
||||
|
||||
visited.add(model_class)
|
||||
nested_models = {}
|
||||
|
||||
# Check all fields in the model
|
||||
for field_info in model_class.model_fields.values():
|
||||
field_type = self._extract_nested_type(field_info.annotation)
|
||||
|
||||
if self._is_pydantic_model(field_type):
|
||||
nested_models[field_type.__name__] = field_type
|
||||
# Recursively get nested models from this nested model
|
||||
deeper_nested = self._get_nested_models(field_type, visited.copy())
|
||||
nested_models.update(deeper_nested)
|
||||
|
||||
self._nested_models_cache[model_class] = nested_models
|
||||
return nested_models
|
||||
|
||||
def _build_inheritance_map(self, child_class: Type[BaseModel]):
|
||||
"""Build inheritance map for a class and all its parents."""
|
||||
if child_class in self._inheritance_map_cache:
|
||||
return self._inheritance_map_cache[child_class]
|
||||
|
||||
inheritance_map = {}
|
||||
|
||||
# Get MRO and filter out BaseModel and object
|
||||
mro_classes = [
|
||||
cls
|
||||
for cls in child_class.__mro__
|
||||
if cls not in (BaseModel, object) and hasattr(cls, "__annotations__")
|
||||
]
|
||||
|
||||
# Process each class in the MRO
|
||||
for cls in mro_classes:
|
||||
inheritance_map[cls] = self._get_direct_fields(cls)
|
||||
|
||||
self._inheritance_map_cache[child_class] = inheritance_map
|
||||
return inheritance_map
|
||||
|
||||
def _wrap_comment(self, text: str, width: int = 88) -> list[str]:
|
||||
"""Wrap a comment to specified width, accounting for '# ' prefix."""
|
||||
if not text.strip():
|
||||
return ["#"]
|
||||
|
||||
# Account for "# " prefix (2 characters)
|
||||
content_width = width - 2
|
||||
wrapped_lines = textwrap.wrap(text, width=content_width)
|
||||
return [f"# {line}" for line in wrapped_lines]
|
||||
|
||||
def _extract_type_from_source(
|
||||
self, model_class: type[BaseModel], field_name: str
|
||||
) -> str:
|
||||
"""Extract the actual type annotation text from source code, checking inheritance chain."""
|
||||
# Use inheritance map to check classes efficiently
|
||||
inheritance_map = self._build_inheritance_map(model_class)
|
||||
|
||||
# Check classes in MRO order
|
||||
for cls in model_class.__mro__:
|
||||
if cls in inheritance_map and field_name in inheritance_map[cls]:
|
||||
type_annotation = self._get_type_from_class_source(cls, field_name)
|
||||
if type_annotation != "unknown":
|
||||
return type_annotation
|
||||
|
||||
return "unknown"
|
||||
|
||||
def _get_type_from_class_source(self, class_obj: type, field_name: str) -> str:
|
||||
"""Extract type annotation from a specific class's source code."""
|
||||
try:
|
||||
source = inspect.getsource(class_obj)
|
||||
tree = ast.parse(source)
|
||||
except (OSError, TypeError):
|
||||
return "unknown"
|
||||
|
||||
# Find the class definition
|
||||
for node in tree.body:
|
||||
if isinstance(node, ast.ClassDef) and node.name == class_obj.__name__:
|
||||
# Find the field assignment
|
||||
for body_node in node.body:
|
||||
if isinstance(body_node, ast.AnnAssign) and isinstance(
|
||||
body_node.target, ast.Name
|
||||
):
|
||||
if body_node.target.id == field_name and body_node.annotation:
|
||||
return ast.unparse(body_node.annotation)
|
||||
break
|
||||
|
||||
return "unknown"
|
||||
|
||||
def _extract_field_groups_from_all_classes(
|
||||
self, model_class: type[BaseModel]
|
||||
) -> list[dict]:
|
||||
"""Extract field groups from all classes in the inheritance hierarchy."""
|
||||
all_groups = []
|
||||
inheritance_map = self._build_inheritance_map(model_class)
|
||||
|
||||
# Get all Pydantic base classes in MRO order (most specific first)
|
||||
# This puts AxolotlInputConfig fields first, then parent class fields
|
||||
pydantic_classes = [
|
||||
cls
|
||||
for cls in model_class.__mro__
|
||||
if cls in inheritance_map and inheritance_map[cls]
|
||||
]
|
||||
|
||||
# Extract groups from each class
|
||||
for cls in pydantic_classes:
|
||||
class_groups = self._extract_field_groups_from_source(cls)
|
||||
for group in class_groups:
|
||||
all_groups.append(group)
|
||||
|
||||
# If no groups found, create a default grouping by class
|
||||
if not all_groups:
|
||||
for cls in pydantic_classes:
|
||||
fields_in_class = inheritance_map[cls]
|
||||
if fields_in_class:
|
||||
all_groups.append(
|
||||
{
|
||||
"fields": list(fields_in_class),
|
||||
}
|
||||
)
|
||||
|
||||
return all_groups
|
||||
|
||||
# pylint: disable=too-many-return-statements
|
||||
def _extract_field_groups_from_source(
|
||||
self, model_class: type[BaseModel]
|
||||
) -> list[dict]:
|
||||
"""Extract field groups from source code based on blank lines and comments."""
|
||||
try:
|
||||
source = inspect.getsource(model_class)
|
||||
tree = ast.parse(source)
|
||||
except (OSError, TypeError):
|
||||
# Fallback if we can't get source code
|
||||
fields_in_class = self._get_direct_fields(model_class)
|
||||
if fields_in_class:
|
||||
return [
|
||||
{
|
||||
"fields": list(fields_in_class),
|
||||
}
|
||||
]
|
||||
return []
|
||||
|
||||
groups = []
|
||||
current_group_fields = []
|
||||
current_group_comment = None
|
||||
|
||||
# Find the class definition
|
||||
class_node = None
|
||||
for node in ast.walk(tree):
|
||||
if isinstance(node, ast.ClassDef) and node.name == model_class.__name__:
|
||||
class_node = node
|
||||
break
|
||||
|
||||
if not class_node:
|
||||
fields_in_class = self._get_direct_fields(model_class)
|
||||
if fields_in_class:
|
||||
return [
|
||||
{
|
||||
"fields": list(fields_in_class),
|
||||
}
|
||||
]
|
||||
return []
|
||||
|
||||
# Parse the source lines to detect groupings
|
||||
source_lines = source.split("\n")
|
||||
|
||||
# Get fields that are actually defined in this specific class
|
||||
fields_in_class = self._get_direct_fields(model_class)
|
||||
|
||||
# Find assignments that correspond to model fields for THIS class only
|
||||
field_assignments = []
|
||||
for node in class_node.body:
|
||||
if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
|
||||
field_name = node.target.id
|
||||
if field_name in fields_in_class:
|
||||
field_assignments.append(
|
||||
{
|
||||
"name": field_name,
|
||||
"lineno": node.lineno,
|
||||
"end_lineno": getattr(node, "end_lineno", node.lineno),
|
||||
}
|
||||
)
|
||||
|
||||
if not field_assignments:
|
||||
if fields_in_class:
|
||||
return [
|
||||
{
|
||||
"fields": list(fields_in_class),
|
||||
}
|
||||
]
|
||||
return []
|
||||
|
||||
# Sort by line number
|
||||
field_assignments.sort(key=lambda x: x["lineno"])
|
||||
|
||||
# Group fields based on blank lines and comments
|
||||
for i, field_info in enumerate(field_assignments):
|
||||
field_name = field_info["name"]
|
||||
current_line = field_info["lineno"]
|
||||
|
||||
# Check if this starts a new group (blank line before or significant gap)
|
||||
is_new_group = False
|
||||
|
||||
if i == 0:
|
||||
is_new_group = True
|
||||
else:
|
||||
prev_end_line = field_assignments[i - 1]["end_lineno"]
|
||||
|
||||
# Check for blank lines or comments between fields
|
||||
lines_between = source_lines[prev_end_line : current_line - 1]
|
||||
has_blank_line = any(line.strip() == "" for line in lines_between)
|
||||
has_comment = any(
|
||||
line.strip().startswith("#") for line in lines_between
|
||||
)
|
||||
|
||||
# Start new group if there's a blank line or comment, or significant gap
|
||||
if has_blank_line or has_comment or (current_line - prev_end_line > 3):
|
||||
is_new_group = True
|
||||
|
||||
if is_new_group and current_group_fields:
|
||||
# Save the previous group
|
||||
groups.append(
|
||||
{
|
||||
"fields": current_group_fields.copy(),
|
||||
"description": current_group_comment,
|
||||
}
|
||||
)
|
||||
current_group_fields = []
|
||||
current_group_comment = None
|
||||
|
||||
current_group_fields.append(field_name)
|
||||
|
||||
# Add the final group
|
||||
if current_group_fields:
|
||||
groups.append(
|
||||
{
|
||||
"fields": current_group_fields,
|
||||
"description": current_group_comment,
|
||||
}
|
||||
)
|
||||
|
||||
return groups
|
||||
|
||||
def _generate_field_documentation(
|
||||
self,
|
||||
model_class: type[BaseModel],
|
||||
field_name: str,
|
||||
field_info: dict,
|
||||
field_type_str: str,
|
||||
is_required: bool,
|
||||
indent_level: int = 0,
|
||||
visited_models: set = None,
|
||||
) -> list[str]:
|
||||
"""Generate documentation for a single field, expanding nested models inline."""
|
||||
if visited_models is None:
|
||||
visited_models = set()
|
||||
|
||||
lines = []
|
||||
indent = " " * indent_level
|
||||
|
||||
# Get the actual field type for nested model detection
|
||||
if field_name in model_class.model_fields:
|
||||
pydantic_field_info = model_class.model_fields[field_name]
|
||||
actual_field_type = pydantic_field_info.annotation
|
||||
else:
|
||||
actual_field_type = None
|
||||
|
||||
# Add description comment if available
|
||||
description = field_info.get("description", "")
|
||||
if description:
|
||||
wrapped_lines = self._wrap_comment(description, width=88 - len(indent))
|
||||
for line in wrapped_lines:
|
||||
lines.append(f"{indent}{line}")
|
||||
|
||||
# Extract nested Pydantic models from the type annotation
|
||||
nested_models = self._extract_all_pydantic_models_from_type(actual_field_type)
|
||||
|
||||
# Filter out already visited models to prevent infinite recursion
|
||||
expandable_models = [
|
||||
model for model in nested_models if model not in visited_models
|
||||
]
|
||||
|
||||
if expandable_models:
|
||||
# This field contains Pydantic models that can be expanded
|
||||
|
||||
# Show the field with its full type annotation
|
||||
field_line = f"{indent}{field_name}: {field_type_str}"
|
||||
if field_info.get("default") is not None:
|
||||
field_line += f" = {field_info['default']}"
|
||||
if is_required:
|
||||
field_line += " (required)"
|
||||
lines.append(field_line)
|
||||
|
||||
# Add to visited to prevent infinite recursion
|
||||
new_visited = visited_models.copy()
|
||||
new_visited.update(expandable_models)
|
||||
|
||||
# Expand each nested Pydantic model
|
||||
for i, nested_model in enumerate(expandable_models):
|
||||
if i > 0:
|
||||
lines.append("\n")
|
||||
lines.append(f"{indent} # For {nested_model.__name__}:")
|
||||
|
||||
# Get nested model schema
|
||||
try:
|
||||
nested_schema = nested_model.model_json_schema()
|
||||
nested_properties = nested_schema.get("properties", {})
|
||||
nested_required = nested_schema.get("required", [])
|
||||
except Exception: # pylint: disable=broad-exception-caught
|
||||
# Fallback: use model fields directly
|
||||
nested_properties = {}
|
||||
nested_required = []
|
||||
for (
|
||||
nested_field_name,
|
||||
nested_field_info,
|
||||
) in nested_model.model_fields.items():
|
||||
nested_description = ""
|
||||
if (
|
||||
hasattr(nested_field_info, "json_schema_extra")
|
||||
and nested_field_info.json_schema_extra
|
||||
):
|
||||
nested_description = (
|
||||
nested_field_info.json_schema_extra.get(
|
||||
"description", ""
|
||||
)
|
||||
)
|
||||
elif (
|
||||
hasattr(nested_field_info, "description")
|
||||
and nested_field_info.description
|
||||
):
|
||||
nested_description = nested_field_info.description
|
||||
|
||||
nested_default_val = None
|
||||
if (
|
||||
hasattr(nested_field_info, "default")
|
||||
and nested_field_info.default is not None
|
||||
):
|
||||
if str(nested_field_info.default) != "PydanticUndefined":
|
||||
nested_default_val = nested_field_info.default
|
||||
|
||||
nested_properties[nested_field_name] = {
|
||||
"type": "unknown",
|
||||
"description": nested_description,
|
||||
"default": nested_default_val,
|
||||
}
|
||||
|
||||
if nested_field_info.is_required():
|
||||
nested_required.append(nested_field_name)
|
||||
|
||||
# Get field groups for the nested model
|
||||
nested_field_groups = self._extract_field_groups_from_all_classes(
|
||||
nested_model
|
||||
)
|
||||
|
||||
# Generate nested fields with increased indentation
|
||||
for i, group in enumerate(nested_field_groups):
|
||||
if not group["fields"]:
|
||||
continue
|
||||
|
||||
# Add blank line between groups (except before first group)
|
||||
if i > 0:
|
||||
lines.append("")
|
||||
|
||||
# Process nested fields
|
||||
for nested_field_name in group["fields"]:
|
||||
if nested_field_name not in nested_properties:
|
||||
continue
|
||||
|
||||
nested_field_info = nested_properties[nested_field_name]
|
||||
nested_field_type = self._extract_type_from_source(
|
||||
nested_model, nested_field_name
|
||||
)
|
||||
nested_is_required = nested_field_name in nested_required
|
||||
|
||||
# Recursively generate documentation for nested field
|
||||
nested_lines = self._generate_field_documentation(
|
||||
nested_model,
|
||||
nested_field_name,
|
||||
nested_field_info,
|
||||
nested_field_type,
|
||||
nested_is_required,
|
||||
indent_level + 1,
|
||||
new_visited,
|
||||
)
|
||||
lines.extend(nested_lines)
|
||||
else:
|
||||
# Regular field (no expandable nested models)
|
||||
field_line = f"{indent}{field_name}: {field_type_str}"
|
||||
if field_info.get("default") is not None:
|
||||
field_line += f" = {field_info['default']}"
|
||||
if is_required:
|
||||
field_line += " (required)"
|
||||
lines.append(field_line)
|
||||
|
||||
return lines
|
||||
|
||||
def generate_qmd(
|
||||
self,
|
||||
model_class: type[BaseModel],
|
||||
title: str | None = None,
|
||||
expand_nested: bool = True,
|
||||
) -> str:
|
||||
"""Auto-generate config reference documentation including inherited fields."""
|
||||
|
||||
if title is None:
|
||||
title = f"{model_class.__name__} Reference"
|
||||
|
||||
# Try to get JSON schema, with fallback for serialization issues
|
||||
try:
|
||||
schema = model_class.model_json_schema()
|
||||
properties = schema.get("properties", {})
|
||||
required = schema.get("required", [])
|
||||
except Exception as e: # pylint: disable=broad-exception-caught
|
||||
print(
|
||||
f"Warning: Could not generate JSON schema ({e}). Using model fields instead."
|
||||
)
|
||||
# Fallback: use model fields directly
|
||||
properties = {}
|
||||
required = []
|
||||
for field_name, field_info in model_class.model_fields.items():
|
||||
# Extract description from json_schema_extra or field info
|
||||
description = ""
|
||||
if (
|
||||
hasattr(field_info, "json_schema_extra")
|
||||
and field_info.json_schema_extra
|
||||
):
|
||||
description = field_info.json_schema_extra.get("description", "")
|
||||
elif hasattr(field_info, "description") and field_info.description:
|
||||
description = field_info.description
|
||||
|
||||
# Get default value
|
||||
default_val = None
|
||||
if hasattr(field_info, "default") and field_info.default is not None:
|
||||
# Handle special Pydantic default markers
|
||||
if str(field_info.default) != "PydanticUndefined":
|
||||
default_val = field_info.default
|
||||
|
||||
properties[field_name] = {
|
||||
"type": "unknown",
|
||||
"description": description,
|
||||
"default": default_val,
|
||||
}
|
||||
|
||||
if field_info.is_required():
|
||||
required.append(field_name)
|
||||
|
||||
# Extract field groups from all classes in inheritance hierarchy
|
||||
field_groups = self._extract_field_groups_from_all_classes(model_class)
|
||||
|
||||
# Start building QMD content
|
||||
qmd_lines = [
|
||||
"---",
|
||||
f"title: {title}",
|
||||
"description: A complete list of all configuration options.",
|
||||
"---",
|
||||
"",
|
||||
]
|
||||
|
||||
# Generate one big code block with all fields (inline nested expansion)
|
||||
qmd_lines.append("```yaml")
|
||||
|
||||
for i, group in enumerate(field_groups):
|
||||
if not group["fields"]:
|
||||
continue
|
||||
|
||||
# Add blank line between groups (except before first group)
|
||||
if i > 0:
|
||||
qmd_lines.append("")
|
||||
|
||||
# Process fields in the order they appear in source
|
||||
for field_name in group["fields"]:
|
||||
if field_name not in properties:
|
||||
continue
|
||||
|
||||
field_info = properties[field_name]
|
||||
field_type = self._extract_type_from_source(model_class, field_name)
|
||||
is_required = field_name in required
|
||||
|
||||
if expand_nested:
|
||||
# Check if this field has nested models
|
||||
if field_name in model_class.model_fields:
|
||||
pydantic_field_info = model_class.model_fields[field_name]
|
||||
nested_models = self._extract_all_pydantic_models_from_type(
|
||||
pydantic_field_info.annotation
|
||||
)
|
||||
has_nested = bool(nested_models)
|
||||
else:
|
||||
has_nested = False
|
||||
|
||||
# Add blank line before nested config
|
||||
if has_nested:
|
||||
qmd_lines.append("")
|
||||
|
||||
# Use the new inline generation method
|
||||
field_lines = self._generate_field_documentation(
|
||||
model_class,
|
||||
field_name,
|
||||
field_info,
|
||||
field_type,
|
||||
is_required,
|
||||
indent_level=0,
|
||||
visited_models=set(),
|
||||
)
|
||||
qmd_lines.extend(field_lines)
|
||||
|
||||
# Add blank line after nested config
|
||||
if has_nested:
|
||||
qmd_lines.append("")
|
||||
else:
|
||||
# Original simple approach
|
||||
description = field_info.get("description", "")
|
||||
default = field_info.get("default")
|
||||
|
||||
# Add wrapped comment for description
|
||||
if description:
|
||||
wrapped_lines = self._wrap_comment(description)
|
||||
qmd_lines.extend(wrapped_lines)
|
||||
|
||||
line = f"{field_name}: {field_type}"
|
||||
if default is not None:
|
||||
line += f" = {default}"
|
||||
if is_required:
|
||||
line += " (required)"
|
||||
qmd_lines.append(line)
|
||||
|
||||
qmd_lines.append("```")
|
||||
|
||||
# Join all lines and clean up any double newlines
|
||||
content = "\n".join(qmd_lines)
|
||||
|
||||
# Replace multiple consecutive newlines with just two newlines (one blank line)
|
||||
|
||||
|
||||
content = re.sub(r"\n{3,}", "\n\n", content)
|
||||
|
||||
# Ensure single newline at the very end
|
||||
content = content.rstrip("\n") + "\n"
|
||||
|
||||
return content
|
||||
|
||||
|
||||
def main():
|
||||
generator = QuartoGenerator()
|
||||
|
||||
print("Generating config reference content...")
|
||||
qmd_content = generator.generate_qmd(AxolotlInputConfig, "Config Reference", True)
|
||||
|
||||
print("Writing to file...")
|
||||
with open("docs/config-reference.qmd", "w", encoding="utf-8") as f:
|
||||
f.write(qmd_content)
|
||||
print("Done!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -5,6 +5,10 @@ tokenizer_type: AutoTokenizer
|
||||
# Automatically upload checkpoint and final model to HF
|
||||
# hub_model_id: username/custom_model_name
|
||||
|
||||
special_tokens:
|
||||
pad_token: <|finetune_right_pad_id|>
|
||||
eos_token: <|eot_id|>
|
||||
|
||||
load_in_8bit: true
|
||||
load_in_4bit: false
|
||||
|
||||
|
||||
71
examples/magistral/README.md
Normal file
@@ -0,0 +1,71 @@
|
||||
# Finetune Magistral Small with Axolotl
|
||||
|
||||
Magistral Small is a 24B-parameter open-source model from MistralAI, available on [HuggingFace](https://huggingface.co/mistralai/Magistral-Small-2506). This guide shows how to fine-tune it with Axolotl on multi-turn conversations with proper masking.
|
||||
|
||||
MistralAI has also released a proprietary medium-sized version called Magistral Medium.
|
||||
|
||||
Thanks to the team at MistralAI for giving us early access to prepare for this release.
|
||||
|
||||
## Getting started
|
||||
|
||||
1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html). Magistral support is currently only available on nightly, so install from `main` or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).
|
||||
|
||||
Here is an example of how to install from main for pip:
|
||||
|
||||
```bash
|
||||
# Ensure you have PyTorch installed (PyTorch 2.6.0 recommended)
|
||||
git clone https://github.com/axolotl-ai-cloud/axolotl.git
|
||||
cd axolotl
|
||||
|
||||
pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
|
||||
pip3 install --no-build-isolation -e '.[flash-attn,mistral]'
|
||||
```
|
||||
|
||||
2. Download the example config:
|
||||
|
||||
```bash
|
||||
axolotl fetch examples
|
||||
```
|
||||
|
||||
3. Run the finetuning example:
|
||||
|
||||
```bash
|
||||
axolotl train examples/magistral/magistral-small-qlora.yaml
|
||||
```
|
||||
|
||||
This config uses about 24GB VRAM.
|
||||
|
||||
Let us know how it goes. Happy finetuning! 🚀
|
||||
|
||||
### TIPS
|
||||
|
||||
- For inference, the official MistralAI team recommends `top_p: 0.95` and `temperature: 0.7` with `max_tokens: 40960`.
|
||||
- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the config.
|
||||
- Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
|
||||
- The dataset format is the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
|
||||
|
||||
## Optimization Guides
|
||||
|
||||
- [Multi-GPU Training](https://docs.axolotl.ai/docs/multi-gpu.html)
|
||||
- [Multi-Node Training](https://docs.axolotl.ai/docs/multi-node.html)
|
||||
- [LoRA Optimizations](https://docs.axolotl.ai/docs/lora_optims.html)
|
||||
|
||||
## Limitations
|
||||
|
||||
At the moment, the `mistral-common` tokenizer is only supported for supervised fine-tuning with `type: chat_template`.

The tokenizer does not work with `dataset.map` when multiprocessing is enabled, so multiprocessing is disabled for it. In addition, overriding tokens is not supported yet.
|
||||
|
||||
## Related Resources
|
||||
|
||||
- [MistralAI Magistral Blog](https://mistral.ai/news/magistral/)
|
||||
- [Axolotl Docs](https://docs.axolotl.ai)
|
||||
- [Axolotl Website](https://axolotl.ai)
|
||||
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
|
||||
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
|
||||
|
||||
|
||||
## Future Work
|
||||
|
||||
- Add parity to Preference Tuning, RL, Multi-modal, etc.
|
||||
- Add parity to other tokenizer configs like overriding tokens.
|
||||
72
examples/magistral/magistral-small-fsdp-qlora.yaml
Normal file
@@ -0,0 +1,72 @@
|
||||
base_model: mistralai/Magistral-Small-2506
|
||||
|
||||
# Enable to use mistral-common tokenizer
|
||||
tokenizer_use_mistral_common: true
|
||||
|
||||
# Automatically upload checkpoint and final model to HF
|
||||
# hub_model_id: username/custom_model_name
|
||||
|
||||
load_in_8bit: false
|
||||
load_in_4bit: true
|
||||
|
||||
datasets:
|
||||
- path: fozziethebeat/alpaca_messages_2k_test
|
||||
type: chat_template
|
||||
|
||||
dataset_prepared_path: last_run_prepared
|
||||
val_set_size: 0.1
|
||||
output_dir: ./outputs/lora-out
|
||||
|
||||
adapter: qlora
|
||||
lora_model_dir:
|
||||
|
||||
sequence_len: 2048
|
||||
sample_packing: true
|
||||
eval_sample_packing: false
|
||||
pad_to_sequence_len: true
|
||||
|
||||
lora_r: 32
|
||||
lora_alpha: 16
|
||||
lora_dropout: 0.05
|
||||
lora_target_linear: true
|
||||
lora_target_modules:
|
||||
- gate_proj
|
||||
- down_proj
|
||||
- up_proj
|
||||
- q_proj
|
||||
- v_proj
|
||||
- k_proj
|
||||
- o_proj
|
||||
|
||||
wandb_project:
|
||||
wandb_entity:
|
||||
wandb_watch:
|
||||
wandb_name:
|
||||
wandb_log_model:
|
||||
|
||||
gradient_accumulation_steps: 4
|
||||
micro_batch_size: 2
|
||||
num_epochs: 1
|
||||
optimizer: adamw_torch_fused
|
||||
lr_scheduler: cosine
|
||||
learning_rate: 0.0002
|
||||
|
||||
bf16: auto
|
||||
tf32: false
|
||||
|
||||
gradient_checkpointing:
|
||||
resume_from_checkpoint:
|
||||
logging_steps: 1
|
||||
flash_attention: true
|
||||
|
||||
warmup_ratio: 0.1
|
||||
evals_per_epoch: 1
|
||||
saves_per_epoch: 1
|
||||
|
||||
fsdp:
|
||||
- full_shard
|
||||
- auto_wrap
|
||||
fsdp_config:
|
||||
fsdp_state_dict_type: FULL_STATE_DICT
|
||||
fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
|
||||
fsdp_activation_checkpointing: true
|
||||
63
examples/magistral/magistral-small-qlora.yaml
Normal file
@@ -0,0 +1,63 @@
|
||||
base_model: mistralai/Magistral-Small-2506
|
||||
|
||||
# Enable to use mistral-common tokenizer
|
||||
tokenizer_use_mistral_common: true
|
||||
|
||||
# Automatically upload checkpoint and final model to HF
|
||||
# hub_model_id: username/custom_model_name
|
||||
|
||||
load_in_8bit: false
|
||||
load_in_4bit: true
|
||||
|
||||
datasets:
|
||||
- path: fozziethebeat/alpaca_messages_2k_test
|
||||
type: chat_template
|
||||
|
||||
dataset_prepared_path: last_run_prepared
|
||||
val_set_size: 0.1
|
||||
output_dir: ./outputs/lora-out
|
||||
|
||||
adapter: qlora
|
||||
lora_model_dir:
|
||||
|
||||
sequence_len: 2048
|
||||
sample_packing: true
|
||||
pad_to_sequence_len: true
|
||||
|
||||
lora_r: 32
|
||||
lora_alpha: 16
|
||||
lora_dropout: 0.05
|
||||
lora_target_linear: true
|
||||
lora_target_modules:
|
||||
- gate_proj
|
||||
- down_proj
|
||||
- up_proj
|
||||
- q_proj
|
||||
- v_proj
|
||||
- k_proj
|
||||
- o_proj
|
||||
|
||||
wandb_project:
|
||||
wandb_entity:
|
||||
wandb_watch:
|
||||
wandb_name:
|
||||
wandb_log_model:
|
||||
|
||||
gradient_accumulation_steps: 4
|
||||
micro_batch_size: 2
|
||||
num_epochs: 1
|
||||
optimizer: adamw_bnb_8bit
|
||||
lr_scheduler: cosine
|
||||
learning_rate: 0.0002
|
||||
|
||||
bf16: auto
|
||||
tf32: false
|
||||
|
||||
gradient_checkpointing: true
|
||||
resume_from_checkpoint:
|
||||
logging_steps: 1
|
||||
flash_attention: true
|
||||
|
||||
warmup_ratio: 0.1
|
||||
evals_per_epoch: 1
|
||||
saves_per_epoch: 1
|
||||
@@ -25,7 +25,7 @@ pad_to_sequence_len: false
|
||||
lora_r: 32
|
||||
lora_alpha: 16
|
||||
lora_dropout: 0.05
|
||||
lora_target_modules: 'model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
|
||||
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
|
||||
|
||||
wandb_project:
|
||||
wandb_entity:
|
||||
|
||||
BIN favicon.jpg (binary file not shown; 4.5 KiB → 4.7 KiB)
@@ -13,12 +13,12 @@ packaging==23.2
|
||||
|
||||
huggingface_hub==0.32.2
|
||||
peft==0.15.2
|
||||
transformers==4.52.3
|
||||
transformers==4.52.4
|
||||
tokenizers>=0.21.1
|
||||
accelerate==1.7.0
|
||||
datasets==3.6.0
|
||||
deepspeed>=0.17.0
|
||||
trl==0.18.1
|
||||
trl==0.18.2
|
||||
hf_xet==1.1.2
|
||||
|
||||
optimum==1.16.2
|
||||
@@ -67,3 +67,5 @@ schedulefree==1.4.1
|
||||
|
||||
axolotl-contribs-lgpl==0.0.6
|
||||
axolotl-contribs-mit==0.0.3
|
||||
|
||||
mistral-common==1.6.0
|
||||
|
||||
2
setup.py
@@ -118,7 +118,7 @@ extras_require = {
|
||||
"yunchang==0.6.0",
|
||||
],
|
||||
"deepspeed": [
|
||||
"deepspeed==0.17.0",
|
||||
"deepspeed==0.17.1",
|
||||
"deepspeed-kernels",
|
||||
],
|
||||
"mamba-ssm": [
|
||||
|
||||
@@ -4,4 +4,4 @@ import pkgutil
|
||||
|
||||
__path__ = pkgutil.extend_path(__path__, __name__) # Make this a namespace package
|
||||
|
||||
__version__ = "0.10.0.dev0"
|
||||
__version__ = "0.11.0.dev"
|
||||
|
||||
@@ -26,7 +26,7 @@ from axolotl.utils.mlflow_ import setup_mlflow_env_vars
|
||||
from axolotl.utils.trainer import prepare_opinionated_env, prepare_optim_env
|
||||
from axolotl.utils.wandb_ import setup_wandb_env_vars
|
||||
|
||||
LOG = get_logger(__name__, use_environ=True)
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
def check_remote_config(config: Union[str, Path]) -> Union[str, Path]:
|
||||
|
||||
@@ -1,5 +1,3 @@
|
||||
"""
|
||||
Various shared constants
|
||||
"""
|
||||
"""Various shared constants"""
|
||||
|
||||
DEFAULT_DATASET_PREPARED_PATH = "last_run_prepared"
|
||||
|
||||
@@ -3,15 +3,13 @@
|
||||
import math
|
||||
import random
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional, Union
|
||||
|
||||
from datasets import Dataset
|
||||
|
||||
import axolotl.monkeypatch.data.batch_dataset_fetcher # pylint: disable=unused-import # noqa: F401
|
||||
from axolotl.cli.args import PreprocessCliArgs, TrainerCliArgs
|
||||
from axolotl.loaders import load_processor, load_tokenizer
|
||||
from axolotl.utils.data import prepare_dataset
|
||||
from axolotl.utils.data.rl import load_prepare_preference_datasets
|
||||
from axolotl.utils.data import prepare_datasets, prepare_preference_datasets
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.schemas.enums import RLType
|
||||
@@ -30,16 +28,7 @@ class TrainDatasetMeta:
|
||||
|
||||
|
||||
def sample_dataset(dataset: Dataset, num_samples: int) -> Dataset:
|
||||
"""
|
||||
Randomly sample `num_samples` samples from `dataset`.
|
||||
|
||||
Args:
|
||||
dataset: Dataset.
|
||||
num_samples: Number of samples to return.
|
||||
|
||||
Returns:
|
||||
Random sample (with replacement) of examples in `dataset`.
|
||||
"""
|
||||
"""Randomly sample `num_samples` samples with replacement from `dataset`."""
|
||||
return dataset.select(
|
||||
[random.randrange(0, len(dataset) - 1) for _ in range(num_samples)] # nosec
|
||||
)
|
||||
@@ -51,44 +40,37 @@ def load_datasets(
|
||||
cli_args: PreprocessCliArgs | TrainerCliArgs | None = None,
|
||||
debug: bool = False,
|
||||
) -> TrainDatasetMeta:
|
||||
"""
|
||||
Loads one or more training or evaluation datasets, calling
|
||||
`axolotl.utils.data.prepare_dataset`. Optionally, logs out debug information.
|
||||
"""Loads one or more training or evaluation datasets, calling
|
||||
`axolotl.utils.data.prepare_datasets`. Optionally, logs out debug information.
|
||||
|
||||
Args:
|
||||
cfg: Dictionary mapping `axolotl` config keys to values.
|
||||
cli_args: Command-specific CLI arguments.
|
||||
debug: Whether to print out tokenization of sample
|
||||
debug: Whether to print out tokenization of sample. This is duplicated in
|
||||
`cfg` and `cli_args`, but is kept due to use in our Colab notebooks.
|
||||
|
||||
Returns:
|
||||
Dataclass with fields for training and evaluation datasets and the computed
|
||||
`total_num_steps`.
|
||||
`total_num_steps`.
|
||||
"""
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
|
||||
preprocess_iterable = (
|
||||
cli_args
|
||||
and hasattr(cli_args, "iterable")
|
||||
and cli_args.iterable is not None
|
||||
and cli_args.iterable
|
||||
)
|
||||
preprocess_iterable = getattr(cli_args, "iterable", False)
|
||||
|
||||
train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
|
||||
train_dataset, eval_dataset, total_num_steps, prompters = prepare_datasets(
|
||||
cfg,
|
||||
tokenizer,
|
||||
processor=processor,
|
||||
preprocess_iterable=preprocess_iterable,
|
||||
)
|
||||
|
||||
if ( # pylint: disable=too-many-boolean-expressions
|
||||
cli_args
|
||||
and (
|
||||
cli_args.debug
|
||||
or cfg.debug
|
||||
or cli_args.debug_text_only
|
||||
or int(cli_args.debug_num_examples) > 0
|
||||
)
|
||||
) or debug:
|
||||
if (
|
||||
cfg.debug
|
||||
or getattr(cli_args, "debug", False)
|
||||
or getattr(cli_args, "debug_text_only", False)
|
||||
or getattr(cli_args, "debug_num_examples", 0) > 0
|
||||
or debug
|
||||
):
|
||||
LOG.info("check_dataset_labels...")
|
||||
|
||||
num_examples = cli_args.debug_num_examples if cli_args else 1
|
||||
@@ -113,13 +95,10 @@ def load_datasets(
|
||||
|
||||
|
||||
def load_preference_datasets(
|
||||
*,
|
||||
cfg: DictDefault,
|
||||
cli_args: Union[PreprocessCliArgs, TrainerCliArgs],
|
||||
*, cfg: DictDefault, cli_args: PreprocessCliArgs | TrainerCliArgs | None = None
|
||||
) -> TrainDatasetMeta:
|
||||
"""
|
||||
Loads one or more training or evaluation datasets for RL training using paired
|
||||
preference data, calling `axolotl.utils.data.rl.load_prepare_preference_datasets`.
|
||||
"""Loads one or more training or evaluation datasets for RL training using paired
|
||||
preference data, calling `axolotl.utils.data.rl.prepare_preference_datasets`.
|
||||
Optionally, logs out debug information.
|
||||
|
||||
Args:
|
||||
@@ -130,23 +109,28 @@ def load_preference_datasets(
|
||||
Dataclass with fields for training and evaluation datasets and the computed
|
||||
`total_num_steps`.
|
||||
"""
|
||||
train_dataset, eval_dataset = load_prepare_preference_datasets(cfg)
|
||||
total_num_steps: Optional[int] = int(
|
||||
math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
|
||||
)
|
||||
if cfg.rl is RLType.GRPO:
|
||||
total_num_steps = None
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
train_dataset, eval_dataset = prepare_preference_datasets(cfg, tokenizer)
|
||||
|
||||
if cli_args.debug or cfg.debug:
|
||||
total_num_steps: int | None = None
|
||||
if cfg.rl is not RLType.GRPO:
|
||||
total_num_steps = int(
|
||||
math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
|
||||
)
|
||||
|
||||
if (cli_args and cli_args.debug) or cfg.debug:
|
||||
LOG.info("check_dataset_labels...")
|
||||
|
||||
num_examples = cli_args.debug_num_examples if cli_args else 1
|
||||
text_only = cli_args.debug_text_only if cli_args else False
|
||||
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
train_samples = sample_dataset(train_dataset, cli_args.debug_num_examples)
|
||||
train_samples = sample_dataset(train_dataset, num_examples)
|
||||
check_dataset_labels(
|
||||
train_samples,
|
||||
tokenizer,
|
||||
num_examples=cli_args.debug_num_examples,
|
||||
text_only=cli_args.debug_text_only,
|
||||
dataset=train_samples,
|
||||
tokenizer=tokenizer,
|
||||
num_examples=num_examples,
|
||||
text_only=text_only,
|
||||
rl_mode=True,
|
||||
)
|
||||
|
||||
|
||||
@@ -380,14 +380,16 @@ class TrainerBuilderBase(abc.ABC):
|
||||
)
|
||||
|
||||
# eval_strategy and eval_steps
|
||||
if not self.eval_dataset or self.cfg.val_set_size == 0:
|
||||
# do not eval if no eval_dataset or val_set_size=0
|
||||
if not self.eval_dataset and self.cfg.val_set_size == 0:
|
||||
# do not eval if no eval_dataset and val_set_size=0
|
||||
training_args_kwargs["eval_strategy"] = "no"
|
||||
elif self.cfg.eval_steps:
|
||||
training_args_kwargs["eval_strategy"] = "steps"
|
||||
training_args_kwargs["eval_steps"] = self.cfg.eval_steps
|
||||
training_args_kwargs["eval_on_start"] = True
|
||||
elif self.cfg.eval_strategy:
|
||||
training_args_kwargs["eval_strategy"] = self.cfg.eval_strategy
|
||||
training_args_kwargs["eval_on_start"] = True
|
||||
|
||||
def _configure_reporting(self, training_args_kwargs: dict):
|
||||
report_to = []
|
||||
@@ -490,6 +492,9 @@ class TrainerBuilderBase(abc.ABC):
|
||||
training_args_kwargs["max_steps"] = self.cfg.max_steps or total_num_steps or -1
|
||||
training_args_kwargs["num_train_epochs"] = self.cfg.num_epochs
|
||||
|
||||
if self.cfg.dataset_processes:
|
||||
training_args_kwargs["dataset_num_proc"] = self.cfg.dataset_processes
|
||||
|
||||
# max_length is not used in CausalTrainer
|
||||
if self.cfg.reward_model or self.cfg.rl:
|
||||
training_args_kwargs["max_length"] = self.cfg.sequence_len
|
||||
|
||||
@@ -27,7 +27,6 @@ from axolotl.monkeypatch.relora import ReLoRACallback
|
||||
from axolotl.processing_strategies import get_processing_strategy
|
||||
from axolotl.utils import is_comet_available, is_mlflow_available
|
||||
from axolotl.utils.callbacks import (
|
||||
EvalFirstStepCallback,
|
||||
LossWatchDogCallback,
|
||||
SaveBetterTransformerModelCallback,
|
||||
bench_eval_callback_factory,
|
||||
@@ -58,7 +57,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
|
||||
def get_callbacks(self):
|
||||
callbacks = super().get_callbacks()
|
||||
callbacks.append(EvalFirstStepCallback())
|
||||
|
||||
if self.cfg.relora_steps:
|
||||
callbacks.append(ReLoRACallback(self.cfg))
|
||||
@@ -377,7 +375,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
elif "tokenizer" in sig.parameters:
|
||||
trainer_kwargs["tokenizer"] = self.tokenizer
|
||||
if (
|
||||
not (trainer_cls in [AxolotlRewardTrainer, AxolotlPRMTrainer])
|
||||
trainer_cls not in [AxolotlRewardTrainer, AxolotlPRMTrainer]
|
||||
and self.cfg.datasets is not None
|
||||
):
|
||||
trainer_kwargs["dataset_tags"] = [
|
||||
@@ -437,6 +435,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
]
|
||||
collator_args = [self.tokenizer]
|
||||
|
||||
collator_cls_and_kwargs = None
|
||||
if self.cfg.plugins:
|
||||
plugin_manager = PluginManager.get_instance()
|
||||
collator_cls_and_kwargs = plugin_manager.get_collator_cls_and_kwargs(
|
||||
|
||||
@@ -14,6 +14,7 @@ from axolotl.core.trainers.dpo.args import AxolotlDPOConfig
|
||||
from axolotl.core.trainers.grpo import GRPOStrategy
|
||||
from axolotl.integrations.base import PluginManager
|
||||
from axolotl.loaders.utils import ensure_dtype
|
||||
from axolotl.utils.callbacks.qat import QATCallback
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.schemas.enums import RLType
|
||||
|
||||
@@ -26,6 +27,9 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
|
||||
def get_callbacks(self):
|
||||
callbacks = super().get_callbacks()
|
||||
|
||||
if self.cfg.qat:
|
||||
callbacks.append(QATCallback(self.cfg.qat))
|
||||
|
||||
return callbacks
|
||||
|
||||
def get_post_trainer_create_callbacks(self, trainer):
|
||||
@@ -91,10 +95,6 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
|
||||
else:
|
||||
training_args_kwargs["remove_unused_columns"] = False
|
||||
|
||||
# only rlhf
|
||||
if self.cfg.dataset_processes:
|
||||
training_args_kwargs["dataset_num_proc"] = self.cfg.dataset_processes
|
||||
|
||||
if self.cfg.trl and self.cfg.trl.beta is not None:
|
||||
training_args_kwargs["beta"] = self.cfg.trl.beta
|
||||
elif self.cfg.rl_beta is not None:
|
||||
@@ -143,22 +143,7 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
|
||||
|
||||
elif self.cfg.rl in [RLType.DPO, RLType.IPO]:
|
||||
training_args_cls = AxolotlDPOConfig
|
||||
if self.cfg.rl is RLType.IPO:
|
||||
training_args_kwargs["loss_type"] = "ipo"
|
||||
|
||||
# Not compatible with IPO
|
||||
if self.cfg.rl is RLType.DPO and self.cfg.dpo_label_smoothing:
|
||||
training_args_kwargs["label_smoothing"] = self.cfg.dpo_label_smoothing
|
||||
|
||||
training_args_kwargs["max_completion_length"] = None
|
||||
training_args_kwargs["max_prompt_length"] = self.cfg.sequence_len
|
||||
training_args_kwargs["generate_during_eval"] = self.cfg.use_wandb
|
||||
if self.cfg.dpo_use_weighting is not None:
|
||||
training_args_kwargs["use_weighting"] = self.cfg.dpo_use_weighting
|
||||
if self.cfg.dpo_use_logits_to_keep is not None:
|
||||
training_args_kwargs["use_logits_to_keep"] = (
|
||||
self.cfg.dpo_use_logits_to_keep
|
||||
)
|
||||
training_args_kwargs.update(DPOStrategy.set_training_args_kwargs(self.cfg))
|
||||
else:
|
||||
raise ValueError(f"Unsupported RL: {self.cfg.rl}")
|
||||
|
||||
@@ -166,7 +151,6 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
|
||||
if blocklist_key in training_args_kwargs:
|
||||
del training_args_kwargs[blocklist_key]
|
||||
|
||||
|
||||
if self.cfg.plugins:
|
||||
plugin_manager = PluginManager.get_instance()
|
||||
plugin_training_args = plugin_manager.get_training_args(self.cfg)
|
||||
|
||||
@@ -25,6 +25,7 @@ from trl.trainer.utils import pad_to_length
|
||||
from typing_extensions import override
|
||||
|
||||
from axolotl.core.trainers.mixins import (
|
||||
CheckpointSaveMixin,
|
||||
OptimizerMixin,
|
||||
RngLoaderMixin,
|
||||
SchedulerMixin,
|
||||
@@ -40,7 +41,9 @@ from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
|
||||
class AxolotlTrainer(
|
||||
SchedulerMixin, OptimizerMixin, RngLoaderMixin, CheckpointSaveMixin, Trainer
|
||||
):
|
||||
"""Extend the base Trainer for axolotl helpers"""
|
||||
|
||||
args = None # type: "AxolotlTrainingArguments" # type: ignore[name-defined]
|
||||
@@ -112,6 +115,7 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
|
||||
bin_size=self.args.sample_packing_bin_size,
|
||||
sequential=self.args.sample_packing_sequentially,
|
||||
drop_last=True,
|
||||
num_processes=self.args.dataset_num_proc,
|
||||
)
|
||||
|
||||
len(sampler)
|
||||
|
||||
@@ -22,10 +22,19 @@ class DPOStrategy:
|
||||
training_args_kwargs = {}
|
||||
if cfg.rl is RLType.IPO:
|
||||
training_args_kwargs["loss_type"] = "ipo"
|
||||
training_args_kwargs["max_length"] = cfg.sequence_len
|
||||
# Label smoothing is not compatible with IPO
|
||||
if cfg.rl is RLType.DPO and cfg.dpo_label_smoothing:
|
||||
training_args_kwargs["label_smoothing"] = cfg.dpo_label_smoothing
|
||||
training_args_kwargs["max_completion_length"] = None
|
||||
training_args_kwargs["max_length"] = cfg.sequence_len
|
||||
training_args_kwargs["max_prompt_length"] = cfg.sequence_len
|
||||
training_args_kwargs["generate_during_eval"] = cfg.use_wandb
|
||||
if cfg.dpo_use_weighting is not None:
|
||||
training_args_kwargs["use_weighting"] = cfg.dpo_use_weighting
|
||||
if cfg.dpo_padding_free is not None:
|
||||
training_args_kwargs["padding_free"] = cfg.dpo_padding_free
|
||||
if cfg.dpo_norm_loss is not None:
|
||||
training_args_kwargs["dpo_norm_loss"] = cfg.dpo_norm_loss
|
||||
if cfg.dpo_use_logits_to_keep is not None:
|
||||
training_args_kwargs["use_logits_to_keep"] = cfg.dpo_use_logits_to_keep
|
||||
return training_args_kwargs
|
||||
|
||||
@@ -14,3 +14,5 @@ class AxolotlDPOConfig(AxolotlTrainingMixins, DPOConfig):
|
||||
"""
|
||||
DPO config for DPO training
|
||||
"""
|
||||
|
||||
dpo_norm_loss: bool | None = False
|
||||
|
||||
@@ -83,3 +83,20 @@ class AxolotlDPOTrainer(
|
||||
gc.collect()
|
||||
torch.cuda.empty_cache()
|
||||
return loss
|
||||
|
||||
def concatenated_forward(
|
||||
self,
|
||||
model: nn.Module,
|
||||
batch: dict[str, Union[list, torch.LongTensor]],
|
||||
is_ref_model: bool = False,
|
||||
) -> dict[str, torch.Tensor]:
|
||||
if self.args.dpo_norm_loss:
|
||||
# fmt: off
|
||||
loss_type: str = self.loss_type # type: ignore[has-type] # pylint: disable=access-member-before-definition
|
||||
# fmt: on
|
||||
# concatenated_forward handles avg token logprob for ipo case already
|
||||
self.loss_type = "ipo" # pylint: disable=attribute-defined-outside-init
|
||||
res = super().concatenated_forward(model, batch, is_ref_model=is_ref_model)
|
||||
self.loss_type = loss_type # pylint: disable=attribute-defined-outside-init
|
||||
return res
|
||||
return super().concatenated_forward(model, batch, is_ref_model=is_ref_model)
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
# pylint: disable=too-many-lines,duplicate-code,protected-access,no-member
|
||||
|
||||
import warnings
|
||||
from functools import partial
|
||||
from typing import Any
|
||||
|
||||
import datasets
|
||||
@@ -58,6 +59,42 @@ class AxolotlGRPOTrainer(
|
||||
|
||||
_tag_names = ["trl", "grpo", "axolotl"]
|
||||
|
||||
def get_train_dataloader(self):
|
||||
if self.train_dataset is None:
|
||||
raise ValueError("Trainer: training requires a train_dataset.")
|
||||
|
||||
train_dataset = self.train_dataset
|
||||
data_collator = self.data_collator
|
||||
if isinstance(train_dataset, datasets.Dataset):
|
||||
train_dataset = self._remove_unused_columns(
|
||||
train_dataset, description="training"
|
||||
)
|
||||
else:
|
||||
data_collator = self._get_collator_with_removed_columns(
|
||||
data_collator, description="training"
|
||||
)
|
||||
|
||||
dataloader_params = {
|
||||
"batch_size": self._train_batch_size
|
||||
* self.args.steps_per_generation, # < this is the change
|
||||
"collate_fn": data_collator,
|
||||
"num_workers": self.args.dataloader_num_workers,
|
||||
"pin_memory": self.args.dataloader_pin_memory,
|
||||
"persistent_workers": self.args.dataloader_persistent_workers,
|
||||
}
|
||||
|
||||
if not isinstance(train_dataset, torch.utils.data.IterableDataset):
|
||||
dataloader_params["sampler"] = self._get_train_sampler()
|
||||
dataloader_params["drop_last"] = self.args.dataloader_drop_last
|
||||
dataloader_params["worker_init_fn"] = partial(
|
||||
seed_worker,
|
||||
num_workers=self.args.dataloader_num_workers,
|
||||
rank=self.args.process_index,
|
||||
)
|
||||
dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor
|
||||
|
||||
return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
|
||||
|
||||
|
||||
class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
|
||||
"""Extend the base GRPOTrainer for sequence parallelism handling"""
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
# pylint: disable=unused-import
|
||||
# flake8: noqa
|
||||
|
||||
from .checkpoints import CheckpointSaveMixin
|
||||
from .optimizer import OptimizerMixin
|
||||
from .rng_state_loader import RngLoaderMixin
|
||||
from .scheduler import SchedulerMixin
|
||||
|
||||
21
src/axolotl/core/trainers/mixins/checkpoints.py
Normal file
@@ -0,0 +1,21 @@
|
||||
"""Custom handling to not fail training if fsdp optimizer is not savable"""
|
||||
|
||||
from transformers import Trainer
|
||||
|
||||
from axolotl.utils.logging import get_logger
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class CheckpointSaveMixin(Trainer):
|
||||
"""Mixin to handle saving the optimizer and scheduler if they are not savable."""
|
||||
|
||||
def _save_optimizer_and_scheduler(self, output_dir):
|
||||
try:
|
||||
super()._save_optimizer_and_scheduler(output_dir)
|
||||
except NotImplementedError as exc:
|
||||
LOG.warning(
|
||||
f"Trainer does not support saving optimizer and scheduler: {exc}\n"
|
||||
"Optimizer and scheduler states were not saved - resuming from checkpoints "
|
||||
"for this training run will not be possible."
|
||||
)
|
||||
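The sketch below (not part of this diff) shows how the new mixin is intended to be composed, mirroring the updated `AxolotlTrainer` bases elsewhere in this changeset: the mixin must precede `Trainer` so its `_save_optimizer_and_scheduler` override runs first. The `MyTrainer` name and the commented-out calls are illustrative only.

```python
# Minimal sketch; assumes the import re-exported from the mixins __init__ shown above.
from transformers import Trainer

from axolotl.core.trainers.mixins import CheckpointSaveMixin


class MyTrainer(CheckpointSaveMixin, Trainer):
    """Hypothetical trainer: placing the mixin before Trainer makes the MRO route
    _save_optimizer_and_scheduler through the lenient override."""


# trainer = MyTrainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()  # if FSDP optimizer state cannot be saved, a warning is logged
#                  # instead of the checkpoint save raising NotImplementedError
```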
@@ -66,6 +66,10 @@ class AxolotlTrainingMixins:
|
||||
default=2048,
|
||||
metadata={"help": "The maximum sequence length the model can handle"},
|
||||
)
|
||||
dataset_num_proc: int | None = field(
|
||||
default=None,
|
||||
metadata={"help": "The number of processes to use for data processing"},
|
||||
)
|
||||
relora_steps: Optional[int] = field(
|
||||
default=None,
|
||||
metadata={"help": "how often to reset for ReLoRA"},
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
"""Module containing Dataset functionality"""
|
||||
|
||||
import os
|
||||
from typing import List, Optional, Union
|
||||
|
||||
import torch
|
||||
from datasets import Dataset, IterableDataset
|
||||
@@ -20,21 +19,21 @@ LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class TokenizedPromptDataset(Dataset):
|
||||
"""
|
||||
Dataset that returns tokenized prompts from a stream of text files.
|
||||
Args:
|
||||
prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
|
||||
dataset (dataset.Dataset): Dataset with text files.
|
||||
process_count (int): Number of processes to use for tokenizing.
|
||||
keep_in_memory (bool): Whether to keep the tokenized dataset in memory.
|
||||
"""Dataset that returns tokenized prompts from a stream of text files.
|
||||
|
||||
Args:
|
||||
prompt_tokenizer: The prompt tokenizing method for processing the data.
|
||||
dataset: Dataset with text files.
|
||||
process_count: Number of processes to use for tokenizing.
|
||||
keep_in_memory: Whether to keep the tokenized dataset in memory.
|
||||
"""
|
||||
|
||||
def __init__( # pylint: disable=super-init-not-called
|
||||
self,
|
||||
prompt_tokenizer: PromptTokenizingStrategy,
|
||||
dataset: Dataset,
|
||||
process_count: Optional[int] = None,
|
||||
keep_in_memory: Optional[bool] = False,
|
||||
process_count: int | None = None,
|
||||
keep_in_memory: bool | None = False,
|
||||
**kwargs,
|
||||
):
|
||||
self.prompt_tokenizer = prompt_tokenizer
|
||||
@@ -49,6 +48,13 @@ class TokenizedPromptDataset(Dataset):
|
||||
features = dataset.features.keys()
|
||||
num_proc = min(64, self.process_count if self.process_count else os.cpu_count())
|
||||
|
||||
# Disable multiprocessing if the tokenizer doesn't support it (e.g., mistral_common)
|
||||
if not getattr(self.prompt_tokenizer, "supports_multiprocessing", True):
|
||||
LOG.info(
|
||||
"Disabling multiprocessing for tokenizer as it doesn't support it (e.g., mistral_common)"
|
||||
)
|
||||
num_proc = 1
|
||||
|
||||
map_kwargs = {}
|
||||
if self.prompt_tokenizer.supports_batched:
|
||||
map_kwargs["batched"] = True
|
||||
@@ -76,14 +82,14 @@ class TokenizedPromptDataset(Dataset):
|
||||
|
||||
def wrap_dataset_for_tokenized_prompt(
|
||||
prompt_tokenizer: PromptTokenizingStrategy,
|
||||
dataset: Union[Dataset, IterableDataset],
|
||||
dataset: Dataset | IterableDataset,
|
||||
**kwargs,
|
||||
):
|
||||
if isinstance(dataset, IterableDataset):
|
||||
map_kwargs = {}
|
||||
if prompt_tokenizer.supports_batched:
|
||||
map_kwargs["batched"] = True
|
||||
features = dataset.features.keys()
|
||||
features = list(dataset.features.keys())
|
||||
return dataset.map(
|
||||
prompt_tokenizer.tokenize_prompt,
|
||||
remove_columns=features,
|
||||
@@ -94,12 +100,13 @@ def wrap_dataset_for_tokenized_prompt(
|
||||
|
||||
# TODO this isn't the best since it can't interleave datasets
|
||||
class ConstantLengthDataset(IterableDataset):
|
||||
"""
|
||||
Iterable dataset that returns constant length chunks of tokens from stream of text files.
|
||||
Args:
|
||||
tokenizer (Tokenizer): The processor used for processing the data.
|
||||
dataset (dataset.Dataset): Dataset with text files.
|
||||
seq_length (int): Length of token sequences to return.
|
||||
"""Iterable dataset that returns constant length chunks of tokens from stream of
|
||||
text files.
|
||||
|
||||
Args:
|
||||
tokenizer: The processor used for processing the data.
|
||||
dataset: Dataset with text files.
|
||||
seq_length: Length of token sequences to return.
|
||||
"""
|
||||
|
||||
def __init__( # pylint: disable=super-init-not-called
|
||||
@@ -110,7 +117,7 @@ class ConstantLengthDataset(IterableDataset):
|
||||
):
|
||||
self.tokenizer = tokenizer
|
||||
self.concat_token_id = tokenizer.eos_token_id
|
||||
self.datasets: List[IterableDataset] = datasets
|
||||
self.datasets: list[IterableDataset] = datasets
|
||||
self.seq_length = seq_length
|
||||
|
||||
vocab_size = len(tokenizer.get_vocab())
|
||||
@@ -174,7 +181,10 @@ class ConstantLengthDataset(IterableDataset):
|
||||
}
|
||||
else:
|
||||
LOG.warning(
|
||||
f"dropping batch due to tensor size mismatch input_ids: {input_ids.size()}, labels: {labels.size()}, attention_mask: {attention_mask.size()}"
|
||||
"Dropping batch due to tensor size mismatch "
|
||||
f"input_ids: {input_ids.size()}, "
|
||||
f"labels: {labels.size()}, "
|
||||
f"attention_mask: {attention_mask.size()}"
|
||||
)
|
||||
buffer = {
|
||||
"input_ids": [],
|
||||
|
||||
@@ -7,7 +7,6 @@ from pathlib import Path
|
||||
from typing import Dict, Optional
|
||||
|
||||
import torch
|
||||
from accelerate.logging import get_logger
|
||||
from datasets import Dataset
|
||||
from transformers.trainer import Trainer
|
||||
|
||||
@@ -17,6 +16,7 @@ from axolotl.train import (
|
||||
)
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.distributed import cleanup_distributed
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.trainer import setup_trainer
|
||||
|
||||
project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
@@ -33,7 +33,7 @@ from transformers import PreTrainedModel, Trainer
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.logging import get_logger
|
||||
|
||||
LOG = get_logger(__name__, use_environ=True)
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from axolotl.common.datasets import TrainDatasetMeta
|
||||
|
||||
@@ -24,6 +24,14 @@ pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transform
|
||||
|
||||
## Usage
|
||||
|
||||
**NOTE**: If you are training a VLM, please use an older version of Axolotl, as upstream has applied a major VLM refactor and our patches have not been updated yet.
|
||||
|
||||
```bash
|
||||
git checkout 787880215b3ab32ccaf81c1b2e9588c6f3e6e764
|
||||
|
||||
pip3 install --no-build-isolation -e .
|
||||
```
|
||||
|
||||
```yaml
|
||||
plugins:
|
||||
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
|
||||
|
||||
@@ -28,7 +28,7 @@ from axolotl.utils.logging import get_logger
|
||||
|
||||
from .args import CutCrossEntropyArgs # pylint: disable=unused-import. # noqa: F401
|
||||
|
||||
LOG = get_logger(__name__, use_environ=True)
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
_CCE_INSTALL_MESSAGE = (
|
||||
"Please install cut_cross_entropy with transformers support using "
|
||||
|
||||
@@ -21,32 +21,3 @@ datasets:
|
||||
```
|
||||
|
||||
An example dataset can be found at [`axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample`](https://huggingface.co/datasets/axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample)
|
||||
|
||||
## Online KD (sglang)
|
||||
|
||||
```bash
|
||||
export UV_TORCH_BACKEND=cu124
|
||||
uv venv sglang --python 3.11
|
||||
source sglang/bin/activate
|
||||
uv pip install --upgrade pip
|
||||
uv pip install setuptools
|
||||
uv pip install torch~=2.5.1 --index-url https://download.pytorch.org/whl/cu124
|
||||
uv pip install sgl-kernel --force-reinstall --no-deps
|
||||
uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
|
||||
```
|
||||
|
||||
## Online KD (vllm)
|
||||
|
||||
```bash
|
||||
VLLM_USE_V1=0 vllm serve open-r1/OlympicCoder-32B --max-model-len 16400 --port 8888 --max-logprobs 128 --return-tokens-as-token-ids --tensor-parallel-size 8 --max-num-seqs 256 --gpu_memory_utilization 0.2 --enable-chunked-prefill
|
||||
```
|
||||
|
||||
```bash
|
||||
vllm serve open-r1/OlympicCoder-32B --max-model-len 16400 --port 8888 --max-logprobs 128 --return-tokens-as-token-ids --tensor-parallel-size 8 --no-enable-prefix-caching --gpu-memory-utilization 0.3 --max-num-batched-tokens 131072 --host 0.0.0.0
|
||||
```
|
||||
|
||||
|
||||
```bash
|
||||
python -m sglang.launch_server --model-path open-r1/OlympicCoder-32B --tensor-parallel-size 8 --port 8080 --host 0.0.0.0 --max-running-requests 256 --context-length 16400 --mem-fraction-static 0.2 --schedule-conservativeness 0.3 --chunked-prefill-size 131072 --schedule-policy fcfs --skip-tokenizer-init
|
||||
```
|
||||
|
||||
@@ -41,7 +41,7 @@ class KDArgs(BaseModel):
|
||||
)
|
||||
kd_alpha: float | None = None # loss coefficient for KD loss
|
||||
kd_temperature: float | None = None # temperature for sampling during KD
|
||||
kd_beta: float | None = None # beta coefficient for ratio of fwd and reverse KL
|
||||
kd_beta: float | None = 0.0 # beta coefficient for ratio of fwd and reverse KL
|
||||
kd_normalize_topk: bool | None = (
|
||||
None # whether to normalize student logits during KD
|
||||
)
|
||||
|
||||
@@ -299,8 +299,8 @@ class KDStrategyLoader(StrategyLoader):
|
||||
Load ChatTemplateStrategy with KD support using StrategyLoader.
|
||||
"""
|
||||
|
||||
def _get_strategy_cls(self):
|
||||
return ChatTemplateStrategyWithKDv2
|
||||
def _get_strategy_cls(self, cfg): # pylint: disable=unused-argument
|
||||
return ChatTemplateStrategyWithKD
|
||||
|
||||
def _get_strategy_params(self, cfg, ds_cfg: Dict[str, Any]):
|
||||
strategy_params = super()._get_strategy_params(cfg, ds_cfg)
|
||||
@@ -314,4 +314,14 @@ class KDStrategyLoader(StrategyLoader):
|
||||
return strategy_params
|
||||
|
||||
|
||||
load = KDStrategyLoader()
|
||||
class KDStrategyLoaderV2(KDStrategyLoader):
|
||||
"""
|
||||
Load KD chat template datasets with pre-tokenized logprob data
|
||||
"""
|
||||
|
||||
def _get_strategy_cls(self, cfg): # pylint: disable=unused-argument
|
||||
return ChatTemplateStrategyWithKDv2
|
||||
|
||||
|
||||
load_legacy = KDStrategyLoader()
|
||||
load = KDStrategyLoaderV2()
|
||||
|
||||
@@ -0,0 +1,8 @@
|
||||
"""
|
||||
Liger Chunked loss optimizations module
|
||||
"""
|
||||
|
||||
from .liger import LigerFusedLinearKLTopKLogprobLoss
|
||||
from .models import apply_kernel
|
||||
|
||||
__all__ = ["LigerFusedLinearKLTopKLogprobLoss", "apply_kernel"]
|
||||
|
||||
@@ -33,7 +33,7 @@ class AxolotlKDTrainer(AxolotlTrainer):
|
||||
self.args.kd_ce_alpha, # hard label loss
|
||||
self.args.kd_alpha, # kd loss
|
||||
self.args.kd_temperature,
|
||||
self.args.kd_beta,
|
||||
self.args.kd_beta or 0.0,
|
||||
compute_ce_loss=bool(self.args.kd_ce_alpha),
|
||||
normalize_topk=self.args.kd_normalize_topk,
|
||||
)
|
||||
|
||||
@@ -27,7 +27,7 @@ from axolotl.utils.logging import get_logger
|
||||
from .args import LigerArgs # pylint: disable=unused-import. # noqa: F401
|
||||
from .utils import patch_with_compile_disable
|
||||
|
||||
LOG = get_logger(__name__, use_environ=True)
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class LigerPlugin(BasePlugin):
|
||||
|
||||
@@ -15,6 +15,7 @@
|
||||
"""
|
||||
Module for handling LIGER input arguments.
|
||||
"""
|
||||
|
||||
from typing import Optional
|
||||
|
||||
from pydantic import BaseModel, model_validator
|
||||
|
||||
@@ -166,6 +166,17 @@ class PatchManager:
|
||||
def _apply_self_attention_lora_patch(self):
|
||||
"""Apply self-attention LoRA patches if configured."""
|
||||
if self.cfg.lora_qkv_kernel or self.cfg.lora_o_kernel:
|
||||
# Only patch if conditions are met
|
||||
can_patch = (
|
||||
self.cfg.lora_dropout == 0
|
||||
if hasattr(self.cfg, "lora_dropout")
|
||||
else True
|
||||
) # default to True if lora_dropout is not set
|
||||
|
||||
if not can_patch:
|
||||
LOG.warning("Cannot patch self-attention - requires no dropout")
|
||||
return
|
||||
|
||||
from axolotl.monkeypatch.lora_kernels import patch_self_attn_lora
|
||||
|
||||
patch_self_attn_lora(self.cfg)
|
||||
|
||||
@@ -7,12 +7,14 @@ import transformers
|
||||
from transformers import (
|
||||
AddedToken,
|
||||
AutoTokenizer,
|
||||
PreTrainedTokenizer,
|
||||
)
|
||||
|
||||
from axolotl.integrations.base import PluginManager
|
||||
from axolotl.loaders.utils import get_linear_embedding_layers, load_model_config
|
||||
from axolotl.prompt_tokenizers import LLAMA_DEFAULT_EOS_TOKEN
|
||||
from axolotl.utils.chat_templates import get_chat_template_from_config
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.distributed import (
|
||||
barrier,
|
||||
is_local_main_process,
|
||||
@@ -117,8 +119,21 @@ def modify_tokenizer_files(
|
||||
return tokenizer_dir
|
||||
|
||||
|
||||
def load_tokenizer(cfg):
|
||||
def load_tokenizer(cfg: DictDefault) -> PreTrainedTokenizer:
|
||||
"""Load and configure the tokenizer based on the provided config."""
|
||||
|
||||
def _load_mistral_common_tokenizer(cfg: DictDefault):
|
||||
"""Load mistral-common tokenizer"""
|
||||
from axolotl.utils.mistral_tokenizer import HFMistralTokenizer
|
||||
|
||||
# Load the HF-compatible wrapper around MistralTokenizer
|
||||
tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config)
|
||||
|
||||
return tokenizer
|
||||
|
||||
if cfg.tokenizer_use_mistral_common:
|
||||
return _load_mistral_common_tokenizer(cfg)
|
||||
|
||||
model_config = load_model_config(cfg)
|
||||
tokenizer_kwargs = {}
|
||||
use_fast = True # this is the default
|
||||
@@ -207,11 +222,12 @@ def load_tokenizer(cfg):
|
||||
)
|
||||
and k != "pad_token"
|
||||
):
|
||||
lora_modules_to_save = ", ".join(
|
||||
lora_modules_to_save_str = ", ".join(
|
||||
[f"`{x}`" for x in lora_modules_to_save]
|
||||
)
|
||||
raise ValueError(
|
||||
f"Please set lora_modules_to_save to [{lora_modules_to_save}] when using an adapter and changing the special tokens."
|
||||
f"Please set lora_modules_to_save to [{lora_modules_to_save_str}] "
|
||||
"when using an adapter and changing the special tokens."
|
||||
)
|
||||
|
||||
tokenizer.add_special_tokens(
|
||||
@@ -257,7 +273,7 @@ def load_tokenizer(cfg):
|
||||
{"additional_special_tokens": additional_special_tokens}
|
||||
)
|
||||
|
||||
if is_main_process(use_environ=True):
|
||||
if is_main_process():
|
||||
LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}")
|
||||
LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}")
|
||||
LOG.debug(f"PAD: {tokenizer.pad_token_id} / {tokenizer.pad_token}")
|
||||
|
||||
@@ -25,12 +25,20 @@ class AxolotlOrWarnErrorFilter(logging.Filter):
|
||||
def __init__(self, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self.axolotl_level = logging.getLevelNamesMapping()[
|
||||
os.getenv("AXOLOTL_LOG_LEVEL", DEFAULT_AXOLOTL_LOG_LEVEL)
|
||||
]
|
||||
self.other_level = logging.getLevelNamesMapping()[
|
||||
os.getenv("LOG_LEVEL", DEFAULT_LOG_LEVEL)
|
||||
]
|
||||
axolotl_log_level = os.getenv(
|
||||
"AXOLOTL_LOG_LEVEL", DEFAULT_AXOLOTL_LOG_LEVEL
|
||||
).upper()
|
||||
other_log_level = os.getenv("LOG_LEVEL", DEFAULT_LOG_LEVEL).upper()
|
||||
|
||||
try:
|
||||
# py311+ only
|
||||
level_mapping = logging.getLevelNamesMapping()
|
||||
self.axolotl_level = level_mapping[axolotl_log_level]
|
||||
self.other_level = level_mapping[other_log_level]
|
||||
except AttributeError:
|
||||
# For py310, use getLevelName directly
|
||||
self.axolotl_level = logging.getLevelName(axolotl_log_level)
|
||||
self.other_level = logging.getLevelName(other_log_level)
|
||||
|
||||
def filter(self, record: LogRecord) -> bool:
|
||||
# General filter
|
||||
|
||||
@@ -145,6 +145,11 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:
|
||||
|
||||
return Qwen2Attention
|
||||
|
||||
if model_type == "mllama":
|
||||
from transformers.models.mllama.modeling_mllama import MllamaTextSelfAttention
|
||||
|
||||
return MllamaTextSelfAttention
|
||||
|
||||
try:
|
||||
# Dynamically import the module and attention class
|
||||
module_path = f"transformers.models.{model_type}.modeling_{model_type}"
|
||||
@@ -269,6 +274,29 @@ def find_mlp_in_layer(
|
||||
)
|
||||
|
||||
|
||||
def get_layers(model: PeftModelForCausalLM) -> list[nn.Module]:
|
||||
"""
|
||||
Get the layers of the model. Handles text-only and multimodal models.
|
||||
|
||||
Args:
|
||||
model: A PEFT model.
|
||||
|
||||
Returns:
|
||||
A list of layers.
|
||||
"""
|
||||
pretrained_model = model.model
|
||||
|
||||
# check for multimodal models first
|
||||
if hasattr(pretrained_model, "language_model"):
|
||||
return pretrained_model.language_model.layers
|
||||
if hasattr(pretrained_model, "model"):
|
||||
return pretrained_model.model.layers
|
||||
|
||||
raise NotImplementedError(
|
||||
f"Model type {model.config.model_type} is not supported yet. Please create an Issue."
|
||||
)
|
||||
|
||||
|
||||
def apply_lora_kernel_patches(
|
||||
model: PeftModelForCausalLM, cfg: DictDefault
|
||||
) -> PeftModelForCausalLM:
|
||||
@@ -340,17 +368,7 @@ def apply_lora_kernel_patches(
|
||||
if activation not in SUPPORTED_ACTIVATIONS:
|
||||
raise NotImplementedError(f"Activation {activation} is not supported")
|
||||
|
||||
layers = []
|
||||
# check for multimodal models first
|
||||
pretrained_model = model.model
|
||||
if hasattr(pretrained_model, "language_model"):
|
||||
layers = pretrained_model.language_model.layers
|
||||
elif hasattr(pretrained_model, "model"):
|
||||
layers = pretrained_model.model.layers
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Model type {model.config.model_type} is not supported yet. Please create an Issue."
|
||||
)
|
||||
layers = get_layers(model)
|
||||
|
||||
# Patch each layer
|
||||
for layer in layers:
|
||||
|
||||
@@ -13,9 +13,9 @@ import inspect
|
||||
import accelerate
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
from accelerate.logging import get_logger
|
||||
|
||||
from axolotl.monkeypatch.utils import get_cu_seqlens_from_pos_ids
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.schemas.enums import RingAttnFunc
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
@@ -4,12 +4,12 @@ import inspect
|
||||
import types
|
||||
|
||||
import torch
|
||||
from accelerate.logging import get_logger
|
||||
from peft import PeftModelForCausalLM
|
||||
from torch import nn
|
||||
from transformers.models.llama.modeling_llama import LlamaFlashAttention2
|
||||
|
||||
from axolotl.monkeypatch.utils import detab_code
|
||||
from axolotl.utils.logging import get_logger
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
@@ -17,7 +17,10 @@ def load(strategy, tokenizer, cfg, ds_cfg, processor=None):
|
||||
return messages_load(tokenizer, cfg, ds_cfg, processor=processor)
|
||||
load_fn = "load"
|
||||
package = "axolotl.prompt_strategies"
|
||||
if strategy.split(".")[-1].startswith("load_"):
|
||||
if (
|
||||
strategy.split(".")[-1].startswith("load_")
|
||||
or strategy.split(".")[-1] == "load"
|
||||
):
|
||||
load_fn = strategy.split(".")[-1]
|
||||
strategy = ".".join(strategy.split(".")[:-1])
|
||||
elif len(strategy.split(".")) > 1:
|
||||
|
||||
@@ -2,8 +2,10 @@
|
||||
HF Chat Templates prompt strategy
|
||||
"""
|
||||
|
||||
# pylint: disable=too-many-lines
|
||||
|
||||
from collections import defaultdict
|
||||
from typing import Any, Dict, List, Set, Union
|
||||
from typing import TYPE_CHECKING, Any, Dict, List, Set, Union
|
||||
|
||||
from pydantic import BaseModel
|
||||
from transformers import ProcessorMixin
|
||||
@@ -15,6 +17,9 @@ from axolotl.utils.chat_templates import get_chat_template_from_config
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.schemas.datasets import DatasetConfig
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from axolotl.utils.mistral_tokenizer import HFMistralTokenizer
|
||||
|
||||
# Configure the logger
|
||||
LOG = get_logger(__name__)
|
||||
LOG.setLevel("INFO")
|
||||
@@ -34,6 +39,7 @@ class ChatTemplatePrompter(Prompter):
|
||||
message_field_training_detail: str | None = None,
|
||||
field_messages: str = "messages",
|
||||
field_system: str = "system",
|
||||
field_tools: str = "tools",
|
||||
roles: dict[str, list[str]] | None = None,
|
||||
chat_template_kwargs: dict[str, Any] | None = None,
|
||||
drop_system_message: bool = False,
|
||||
@@ -66,6 +72,7 @@ class ChatTemplatePrompter(Prompter):
|
||||
self.message_field_training_detail = message_field_training_detail
|
||||
self.field_messages = field_messages
|
||||
self.field_system = field_system
|
||||
self.field_tools = field_tools
|
||||
self.tokenizer = tokenizer
|
||||
self.processor: ProcessorMixin | None = processor
|
||||
self.chat_template = chat_template
|
||||
@@ -77,17 +84,38 @@ class ChatTemplatePrompter(Prompter):
|
||||
def chat_template_msg_variables(self) -> Set[str]:
|
||||
return self._chat_template_msg_variables
|
||||
|
||||
def build_prompt(self, conversation, add_generation_prompt=False, images=None):
|
||||
def build_prompt(
|
||||
self,
|
||||
conversation: list[dict],
|
||||
add_generation_prompt=False,
|
||||
images=None,
|
||||
tools=None,
|
||||
):
|
||||
"""
|
||||
Build a prompt from a conversation.
|
||||
|
||||
Args:
|
||||
conversation: A list of messages.
|
||||
add_generation_prompt: Whether to add a generation prompt.
|
||||
images: A list of images. (optional)
|
||||
tools: A list of tools. (optional)
|
||||
"""
|
||||
chat_template_kwargs = {
|
||||
"chat_template": self.chat_template,
|
||||
"add_generation_prompt": add_generation_prompt,
|
||||
}
|
||||
|
||||
if tools:
|
||||
chat_template_kwargs["tools"] = tools
|
||||
|
||||
if self.processor:
|
||||
if not callable(self.processor):
|
||||
raise TypeError("Processor must be callable")
|
||||
|
||||
text = self.processor.apply_chat_template(
|
||||
conversation,
|
||||
chat_template=self.chat_template,
|
||||
tokenize=False,
|
||||
add_generation_prompt=add_generation_prompt,
|
||||
**self.chat_template_kwargs,
|
||||
**chat_template_kwargs,
|
||||
)
|
||||
batch = self.processor(
|
||||
text=text,
|
||||
@@ -104,9 +132,7 @@ class ChatTemplatePrompter(Prompter):
|
||||
|
||||
return self.tokenizer.apply_chat_template(
|
||||
conversation,
|
||||
add_generation_prompt=add_generation_prompt,
|
||||
chat_template=self.chat_template,
|
||||
**self.chat_template_kwargs,
|
||||
**chat_template_kwargs,
|
||||
)
|
||||
|
||||
def get_offsets_for_train_detail(
|
||||
@@ -250,9 +276,15 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
self.train_on_eot = train_on_eot if train_on_eot is not None else train_on_eos
|
||||
|
||||
# Default to eos_token if eot_tokens not provided
|
||||
self.eot_tokens = (
|
||||
eot_tokens if eot_tokens is not None else [self.tokenizer.eos_token]
|
||||
)
|
||||
self.eot_tokens = []
|
||||
if eot_tokens is not None:
|
||||
self.eot_tokens = eot_tokens
|
||||
elif (
|
||||
hasattr(self.tokenizer, "eos_token")
|
||||
and self.tokenizer.eos_token is not None
|
||||
):
|
||||
self.eot_tokens = [self.tokenizer.eos_token]
|
||||
|
||||
self.split_thinking = split_thinking
|
||||
|
||||
self.images = "images"
|
||||
@@ -376,7 +408,7 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
and not self.prompter.message_field_training_detail # type: ignore
|
||||
):
|
||||
turns = self.get_conversation_thread(prompt)
|
||||
images = self.get_images(prompt)
|
||||
images = self._get_images(prompt)
|
||||
prompt_ids = self.prompter.build_prompt( # type: ignore
|
||||
turns[:-1],
|
||||
add_generation_prompt=True,
|
||||
@@ -405,7 +437,8 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
return tokenized_prompt
|
||||
|
||||
turns = self.get_conversation_thread(prompt)
|
||||
input_ids = self.prompter.build_prompt(turns) # type: ignore
|
||||
tools = self._get_tools(prompt)
|
||||
input_ids = self.prompter.build_prompt(turns, tools=tools) # type: ignore
|
||||
labels = [IGNORE_TOKEN_ID] * len(input_ids)
|
||||
|
||||
last_eos_idx = -1
|
||||
@@ -444,7 +477,9 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
|
||||
continue
|
||||
|
||||
turn_start_idx, turn_end_idx = self.find_turn(turns=turns, turn_idx=index)
|
||||
turn_start_idx, turn_end_idx = self.find_turn(
|
||||
turns=turns, turn_idx=index, tools=tools
|
||||
)
|
||||
|
||||
LOG.debug(f"Turn indices: start={turn_start_idx}, end={turn_end_idx}")
|
||||
|
||||
@@ -546,7 +581,9 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
return i
|
||||
return -1
|
||||
|
||||
def find_turn(self, turns: list[dict], turn_idx: int):
|
||||
def find_turn(
|
||||
self, turns: list[dict], turn_idx: int, tools: list[dict] | None = None
|
||||
):
|
||||
"""
|
||||
Locate the starting and ending indices of the specified turn in a conversation.
|
||||
"""
|
||||
@@ -559,11 +596,7 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
if (
|
||||
turn_idx == 0
|
||||
and turns[0].get("role") == "system"
|
||||
and (
|
||||
"mistral" in self.tokenizer.name_or_path.lower()
|
||||
or "gemma"
|
||||
in self.tokenizer.name_or_path.lower() # gemma3 uses gemma tokenizer
|
||||
)
|
||||
and ("mistral" in self.tokenizer.name_or_path.lower())
|
||||
):
|
||||
return -1, -1
|
||||
|
||||
@@ -577,10 +610,10 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
turns_with_content = turns[: turn_idx + 1]
|
||||
|
||||
# Generate the conversation up to the turn, with final turn replaced with dummy content
|
||||
dummy_ids = self.prompter.build_prompt(turns_with_empty) # type: ignore
|
||||
dummy_ids = self.prompter.build_prompt(turns_with_empty, tools=tools) # type: ignore
|
||||
|
||||
# Generate the conversation up to the turn, with final turn included
|
||||
full_ids = self.prompter.build_prompt(turns_with_content) # type: ignore
|
||||
full_ids = self.prompter.build_prompt(turns_with_content, tools=tools) # type: ignore
|
||||
|
||||
if not full_ids or not dummy_ids:
|
||||
LOG.warning(f"Empty template generated for turn {turn_idx}")
|
||||
@@ -633,9 +666,10 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
def get_conversation_thread(self, prompt):
|
||||
turns = []
|
||||
|
||||
possible_sys_turn = self.transform_message(
|
||||
prompt[self.prompter.field_messages][0]
|
||||
)
|
||||
messages = self._get_messages(prompt)
|
||||
|
||||
possible_sys_turn = self.transform_message(messages[0])
|
||||
|
||||
if (
|
||||
possible_sys_turn["role"] != "system"
|
||||
and self.prompter.field_system in prompt
|
||||
@@ -643,7 +677,7 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
turn = {"role": "system", "content": prompt[self.prompter.field_system]}
|
||||
turns.append(turn)
|
||||
|
||||
for message in prompt[self.prompter.field_messages]:
|
||||
for message in messages:
|
||||
transformed_message = self.transform_message(message)
|
||||
|
||||
turn = {
|
||||
@@ -661,7 +695,7 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
|
||||
return turns
|
||||
|
||||
def transform_message(self, message):
|
||||
def transform_message(self, message: dict) -> dict:
|
||||
# Build the initial transformed message from the mappings
|
||||
transformed_message = {}
|
||||
for key, value in self.prompter.message_property_mappings.items():
|
||||
@@ -738,18 +772,135 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
|
||||
|
||||
return transformed_message
|
||||
|
||||
def get_images(self, prompt):
|
||||
def _get_images(self, prompt):
|
||||
return prompt.get(self.images, None)
|
||||
|
||||
def _get_tools(self, prompt) -> list[dict] | None:
|
||||
"""Get tools from prompt if available."""
|
||||
tools = prompt.get(self.prompter.field_tools, None)
|
||||
if tools is None:
|
||||
return None
|
||||
|
||||
if isinstance(tools, list):
|
||||
return tools
|
||||
|
||||
raise ValueError(
|
||||
"Unknown tools format. Please convert it into a list[dict].\n"
|
||||
f"Current format: {type(tools)}"
|
||||
)
|
||||
|
||||
def _get_messages(self, prompt):
|
||||
messages = prompt.get(self.prompter.field_messages, None)
|
||||
if messages is None:
|
||||
raise ValueError("Messages is null. Please check `field_messages`.")
|
||||
|
||||
if isinstance(messages, list):
|
||||
return messages
|
||||
|
||||
raise ValueError(
|
||||
"Unknown messages format. Please convert it into a list[dict].\n"
|
||||
f"Current format: {type(messages)}"
|
||||
)
|
||||
|
||||
|
||||
class MistralStrategy(ChatTemplateStrategy):
|
||||
"""
|
||||
Mistral strategy for chat template.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
prompter: "ChatTemplatePrompter",
|
||||
tokenizer: "HFMistralTokenizer",
|
||||
train_on_inputs: bool,
|
||||
sequence_len: int,
|
||||
roles_to_train: list[str] | None = None,
|
||||
train_on_eos: str | None = None,
|
||||
train_on_eot: str | None = None,
|
||||
eot_tokens: list[str] | None = None,
|
||||
split_thinking: bool | None = False,
|
||||
):
|
||||
# Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation
|
||||
# pylint: disable=non-parent-init-called,super-init-not-called
|
||||
PromptTokenizingStrategy.__init__(
|
||||
self, prompter, tokenizer, train_on_inputs, sequence_len
|
||||
)
|
||||
self.prompter: ChatTemplatePrompter = prompter
|
||||
|
||||
self.roles_to_train = []
|
||||
if roles_to_train:
|
||||
# map roles if exist in prompter.roles else use the role as is
|
||||
self.roles_to_train = [
|
||||
prompter.roles.get(role, role) for role in roles_to_train
|
||||
]
|
||||
|
||||
self.train_on_eos = train_on_eos
|
||||
# Backward compatibility, load from train_on_eos
|
||||
self.train_on_eot = train_on_eot if train_on_eot is not None else train_on_eos
|
||||
|
||||
# Default to eos_token if eot_tokens not provided
|
||||
self.eot_tokens = []
|
||||
if eot_tokens is not None:
|
||||
self.eot_tokens = eot_tokens
|
||||
else:
|
||||
# set eot_tokens to the eos_token
|
||||
self.eot_tokens = [self.tokenizer.eos_token]
|
||||
|
||||
self.split_thinking = split_thinking
|
||||
|
||||
self.images = "images"
|
||||
|
||||
LOG.debug(
|
||||
f"The chat template uses the following properites on the message: {self.prompter.chat_template_msg_variables}"
|
||||
)
|
||||
|
||||
# Skip the validation that ChatTemplateStrategy calls
|
||||
# TODO: address this in the future with mistral-specific checks
|
||||
# self._validate_eot_and_eos_tokens()
|
||||
|
||||
@property
|
||||
def supports_multiprocessing(self) -> bool:
|
||||
"""
|
||||
Whether this tokenizing strategy supports multiprocessing.
|
||||
mistral_common tokenizers cannot be pickled for multiprocessing.
|
||||
"""
|
||||
|
||||
return False
|
||||
|
||||
def find_first_eot_token(self, input_ids, start_idx):
|
||||
"""Find the first EOT token in the input_ids starting from start_idx."""
|
||||
# mistral-common tokenizer does not support eot_tokens
|
||||
return self.find_first_eos_token(input_ids, start_idx)
|
||||
|
||||
|
||||
class MistralPrompter(ChatTemplatePrompter):
|
||||
"""
|
||||
Mistral prompter for chat template.
|
||||
"""
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
self._chat_template_msg_variables = set(["tool_call_id", "name", "tool_calls"])
|
||||
|
||||
|
||||
class StrategyLoader:
|
||||
"""
|
||||
Load chat template strategy based on configuration.
|
||||
"""
|
||||
|
||||
def _get_strategy_cls(self):
|
||||
def _get_strategy_cls(self, cfg):
|
||||
if cfg.tokenizer_use_mistral_common:
|
||||
return MistralStrategy
|
||||
|
||||
return ChatTemplateStrategy
|
||||
|
||||
def _get_prompter_cls(self, cfg):
|
||||
if cfg.tokenizer_use_mistral_common:
|
||||
return MistralPrompter
|
||||
|
||||
return ChatTemplatePrompter
|
||||
|
||||
def _get_strategy_params(self, cfg, ds_cfg: Dict[str, Any]):
|
||||
return {
|
||||
"train_on_inputs": cfg.train_on_inputs,
|
||||
@@ -775,9 +926,14 @@ class StrategyLoader:
|
||||
else:
|
||||
dataset_config = ds_cfg
|
||||
|
||||
chat_template_string = get_chat_template_from_config(
|
||||
cfg=cfg, ds_cfg=dataset_config, tokenizer=tokenizer
|
||||
)
|
||||
if cfg.tokenizer_use_mistral_common:
|
||||
# mistral-common does not use this, so we pass an empty string
|
||||
chat_template_string = ""
|
||||
else:
|
||||
chat_template_string = get_chat_template_from_config(
|
||||
cfg=cfg, ds_cfg=dataset_config, tokenizer=tokenizer
|
||||
)
|
||||
|
||||
LOG.info(f"Using chat template:\n---\n{chat_template_string!s}\n---")
|
||||
|
||||
prompter_params = {
|
||||
@@ -803,10 +959,11 @@ class StrategyLoader:
|
||||
}
|
||||
|
||||
strategy_params = self._get_strategy_params(cfg, dataset_config)
|
||||
strategy_cls = self._get_strategy_cls()
|
||||
strategy_cls = self._get_strategy_cls(cfg)
|
||||
prompter_cls = self._get_prompter_cls(cfg)
|
||||
|
||||
strategy = strategy_cls(
|
||||
ChatTemplatePrompter(**prompter_params),
|
||||
prompter_cls(**prompter_params),
|
||||
tokenizer=tokenizer,
|
||||
**strategy_params,
|
||||
)
|
||||
|
||||
@@ -46,6 +46,14 @@ def default(
|
||||
)
|
||||
|
||||
messages = sample[field_messages]
|
||||
if isinstance(messages, str):
|
||||
messages = [
|
||||
{
|
||||
message_property_mappings["role"]: "user",
|
||||
message_property_mappings["content"]: messages,
|
||||
}
|
||||
]
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": role_map[m[message_property_mappings["role"]]],
|
||||
@@ -53,13 +61,35 @@ def default(
|
||||
}
|
||||
for m in messages
|
||||
]
|
||||
|
||||
chosen_raw = sample[field_chosen]
|
||||
if isinstance(chosen_raw, str):
|
||||
chosen_msg = {
|
||||
message_property_mappings["role"]: "assistant",
|
||||
message_property_mappings["content"]: chosen_raw,
|
||||
}
|
||||
elif isinstance(chosen_raw, dict):
|
||||
chosen_msg = chosen_raw
|
||||
else:
|
||||
chosen_msg = chosen_raw[-1]
|
||||
chosen = {
|
||||
"role": role_map[sample[field_chosen][message_property_mappings["role"]]],
|
||||
"content": sample[field_chosen][message_property_mappings["content"]],
|
||||
"role": role_map[chosen_msg[message_property_mappings["role"]]],
|
||||
"content": chosen_msg[message_property_mappings["content"]],
|
||||
}
|
||||
|
||||
rejected_raw = sample[field_rejected]
|
||||
if isinstance(rejected_raw, str):
|
||||
rejected_msg = {
|
||||
message_property_mappings["role"]: "assistant",
|
||||
message_property_mappings["content"]: rejected_raw,
|
||||
}
|
||||
elif isinstance(rejected_raw, dict):
|
||||
rejected_msg = rejected_raw
|
||||
else:
|
||||
rejected_msg = rejected_raw[-1]
|
||||
rejected = {
|
||||
"role": role_map[sample[field_rejected][message_property_mappings["role"]]],
|
||||
"content": sample[field_rejected][message_property_mappings["content"]],
|
||||
"role": role_map[rejected_msg[message_property_mappings["role"]]],
|
||||
"content": rejected_msg[message_property_mappings["content"]],
|
||||
}
|
||||
dummy_user_message = {"role": "user", "content": "[[dummy_message]]"}
|
||||
|
||||
|
||||
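For clarity, a hedged illustration (not from the diff) of the three `chosen`/`rejected` shapes the updated transform accepts: a plain string, a single message dict, or a message list whose last entry is used. Keys assume the default field names and `role`/`content` property mappings; a real sample would also carry the prompt under the configured messages field, omitted here for brevity.

```python
# Illustrative preference samples only; field names are the assumed defaults.
string_sample = {"chosen": "Paris.", "rejected": "Lyon."}  # wrapped as assistant messages

dict_sample = {
    "chosen": {"role": "assistant", "content": "Paris."},
    "rejected": {"role": "assistant", "content": "Lyon."},
}

list_sample = {
    "chosen": [
        {"role": "user", "content": "Capital of France?"},
        {"role": "assistant", "content": "Paris."},  # last message is used
    ],
    "rejected": [
        {"role": "user", "content": "Capital of France?"},
        {"role": "assistant", "content": "Lyon."},  # last message is used
    ],
}
```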
@@ -3,6 +3,7 @@
|
||||
from typing import Dict, Optional, Set, TypedDict, Union
|
||||
|
||||
from jinja2 import Environment, meta, nodes
|
||||
from jinja2.ext import Extension
|
||||
|
||||
|
||||
class JinjaTemplateAnalysis(TypedDict):
|
||||
@@ -27,6 +28,18 @@ class JinjaTemplateAnalysis(TypedDict):
|
||||
iteration_target: Optional[Union[str, list[str]]]
|
||||
|
||||
|
||||
class GenerationTagIgnore(Extension):
|
||||
"""
|
||||
Ignores the generation and endgeneration tags in Jinja templates.
|
||||
"""
|
||||
|
||||
tags = {"generation", "endgeneration"}
|
||||
|
||||
def parse(self, parser):
|
||||
parser.stream.skip(1)
|
||||
return nodes.Const("")
|
||||
|
||||
|
||||
class JinjaTemplateAnalyzer:
|
||||
"""
|
||||
Analyzes Jinja templates to extract information about variable usage,
|
||||
@@ -57,7 +70,9 @@ class JinjaTemplateAnalyzer:
|
||||
"""
|
||||
|
||||
def __init__(self, template: str):
|
||||
self.env: Environment = Environment(autoescape=True)
|
||||
self.env: Environment = Environment(
|
||||
autoescape=True, extensions=[GenerationTagIgnore]
|
||||
)
|
||||
self.property_access: Dict[str, Set[str]] = {}
|
||||
self.iteration_targets: Dict[str, Union[str, list[str]]] = {}
|
||||
self.index_access: Dict[str, Set[Union[int, float]]] = {}
|
||||
|
||||
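A short runnable sketch of why the extension matters (the template string is made up): chat templates that wrap assistant spans in `{% generation %}` / `{% endgeneration %}` are not valid vanilla Jinja, so a plain `Environment` fails to parse them; registering the extension lets the analyzer treat the tags as no-ops.

```python
from jinja2 import Environment, nodes
from jinja2.ext import Extension


class GenerationTagIgnore(Extension):
    """Same no-op tag handling as the class added above, inlined to keep the sketch runnable."""

    tags = {"generation", "endgeneration"}

    def parse(self, parser):
        parser.stream.skip(1)
        return nodes.Const("")


template = (
    "{% for m in messages %}"
    "{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% endfor %}"
)

# Environment(autoescape=True).parse(template) raises TemplateSyntaxError on the
# unknown tags; with the extension registered, parsing succeeds and the analyzer
# can inspect variable usage as before.
env = Environment(autoescape=True, extensions=[GenerationTagIgnore])
ast = env.parse(template)
```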
@@ -32,4 +32,3 @@ def load(tokenizer, cfg, ds_cfg, processor=None):
|
||||
except Exception as exc: # pylint: disable=broad-exception-caught
|
||||
LOG.error(f"Failed to load prompt strategy `{strategy}`: {str(exc)}")
|
||||
raise exc
|
||||
return None
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
import abc
|
||||
from typing import Callable, Dict, List, Optional, Tuple, Union
|
||||
|
||||
from datasets import Dataset
|
||||
from transformers import BatchEncoding, PreTrainedTokenizer
|
||||
|
||||
from axolotl.prompters import Prompter
|
||||
@@ -28,6 +29,16 @@ class DatasetWrappingStrategy(abc.ABC):
|
||||
Abstract class for wrapping datasets for Chat Messages
|
||||
"""
|
||||
|
||||
@abc.abstractmethod
|
||||
def wrap_dataset(
|
||||
self,
|
||||
dataset,
|
||||
process_count: int | None = None,
|
||||
keep_in_memory: bool | None = False,
|
||||
**kwargs,
|
||||
) -> Dataset:
|
||||
pass
|
||||
|
||||
|
||||
class PromptTokenizingStrategy(abc.ABC):
|
||||
"""
|
||||
@@ -59,6 +70,14 @@ class PromptTokenizingStrategy(abc.ABC):
|
||||
def supports_batched(self):
|
||||
return False
|
||||
|
||||
@property
|
||||
def supports_multiprocessing(self):
|
||||
"""
|
||||
Whether this tokenizing strategy supports multiprocessing.
|
||||
Should return False if the tokenizer has unpicklable objects.
|
||||
"""
|
||||
return True
|
||||
|
||||
def _tokenize(
|
||||
self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False
|
||||
) -> BatchEncoding:
|
||||
|
||||
@@ -58,8 +58,8 @@ def setup_model_and_tokenizer(
|
||||
) -> tuple[
|
||||
PreTrainedModel, PreTrainedTokenizer, PeftConfig | None, ProcessorMixin | None
|
||||
]:
|
||||
"""
|
||||
Load the tokenizer, processor (for multimodal models), and model based on configuration.
|
||||
"""Load the tokenizer, processor (for multimodal models), and model based on
|
||||
configuration.
|
||||
|
||||
Args:
|
||||
cfg: Dictionary mapping `axolotl` config keys to values.
|
||||
|
||||
@@ -53,25 +53,6 @@ IGNORE_INDEX = -100
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class EvalFirstStepCallback(
|
||||
TrainerCallback
|
||||
): # pylint: disable=too-few-public-methods disable=unused-argument
|
||||
"""
|
||||
Callback to trigger evals on the first step
|
||||
"""
|
||||
|
||||
def on_step_end(
|
||||
self,
|
||||
args: TrainingArguments,
|
||||
state: TrainerState,
|
||||
control: TrainerControl,
|
||||
**kwargs,
|
||||
):
|
||||
if args.eval_strategy == IntervalStrategy.STEPS and state.global_step == 1:
|
||||
control.should_evaluate = True
|
||||
return control
|
||||
|
||||
|
||||
class SaveBetterTransformerModelCallback(
|
||||
TrainerCallback
|
||||
): # pylint: disable=too-few-public-methods
|
||||
|
||||
File diff suppressed because one or more lines are too long
@@ -81,9 +81,11 @@ class DataCollatorForSeq2Seq:
|
||||
|
||||
padding_side = self.tokenizer.padding_side
|
||||
for feature in features:
|
||||
remainder = [pad_token_id] * (
|
||||
max_feature_length - len(feature[feature_name])
|
||||
)
|
||||
remainder_len = max_feature_length - len(feature[feature_name])
|
||||
if feature_name == "position_ids":
|
||||
remainder = list(range(remainder_len))
|
||||
else:
|
||||
remainder = [pad_token_id] * remainder_len
|
||||
if isinstance(feature[feature_name], list):
|
||||
feature[feature_name] = (
|
||||
feature[feature_name] + remainder
|
||||
|
||||
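A hedged, self-contained illustration of the padding change above: `position_ids` are now padded with a fresh incrementing range rather than repeated `pad_token_id`, so downstream code that reads positions never sees pad ids masquerading as positions. The values below are made up.

```python
# Stand-alone re-statement of the collator's new remainder logic.
pad_token_id = 0
feature_name = "position_ids"
feature = {"position_ids": [0, 1, 2, 3]}
max_feature_length = 7

remainder_len = max_feature_length - len(feature[feature_name])
if feature_name == "position_ids":
    remainder = list(range(remainder_len))      # [0, 1, 2]
else:
    remainder = [pad_token_id] * remainder_len

feature[feature_name] = feature[feature_name] + remainder
print(feature[feature_name])  # [0, 1, 2, 3, 0, 1, 2]
```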
@@ -21,7 +21,7 @@ from axolotl.utils.schemas.config import (
|
||||
from axolotl.utils.schemas.config import AxolotlInputConfig as AxolotlInputConfigBase
|
||||
from axolotl.utils.schemas.datasets import DPODataset, KTODataset, SFTDataset
|
||||
|
||||
LOG = get_logger(__name__, use_environ=True)
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
def choose_device(cfg):
|
||||
|
||||
@@ -1,16 +1,21 @@
|
||||
"""
|
||||
Data processing modules
|
||||
"""
|
||||
"""Init for `axolotl.utils.data` module."""
|
||||
|
||||
from axolotl.utils.data.pretraining import ( # noqa: F401
|
||||
from axolotl.utils.data.pretraining import (
|
||||
encode_pretraining,
|
||||
wrap_pretraining_dataset,
|
||||
)
|
||||
from axolotl.utils.data.rl import load_prepare_preference_datasets # noqa: F401
|
||||
from axolotl.utils.data.sft import ( # noqa: F401
|
||||
from axolotl.utils.data.rl import prepare_preference_datasets
|
||||
from axolotl.utils.data.sft import (
|
||||
get_dataset_wrapper,
|
||||
load_prepare_datasets,
|
||||
load_tokenized_prepared_datasets,
|
||||
prepare_dataset,
|
||||
prepare_datasets,
|
||||
)
|
||||
from axolotl.utils.data.utils import md5 # noqa: F401
|
||||
from axolotl.utils.data.utils import md5
|
||||
|
||||
__all__ = [
|
||||
"encode_pretraining",
|
||||
"wrap_pretraining_dataset",
|
||||
"prepare_preference_datasets",
|
||||
"get_dataset_wrapper",
|
||||
"prepare_datasets",
|
||||
"md5",
|
||||
]
|
||||
|
||||
66
src/axolotl/utils/data/lock.py
Normal file
@@ -0,0 +1,66 @@
|
||||
"""Logic for loading / preparing a dataset once over all processes."""
|
||||
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Any, Callable
|
||||
|
||||
from filelock import FileLock
|
||||
|
||||
from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
LOCK_FILE_NAME = "datasets_prep.lock"
|
||||
READY_FILE_NAME = "datasets_ready.flag"
|
||||
PROCESS_COUNTER_FILE_NAME = "process_counter.txt"
|
||||
|
||||
|
||||
class FileLockLoader:
|
||||
"""
|
||||
Simple class for abstracting single process data loading / processing. The first
|
||||
process that creates a lock file does the work; the remaining processes simply load
|
||||
the preprocessed dataset once the first process is done.
|
||||
"""
|
||||
|
||||
def __init__(self, cfg: DictDefault):
|
||||
self.cfg = cfg
|
||||
self.dataset_prepared_path = (
|
||||
cfg.dataset_prepared_path or DEFAULT_DATASET_PREPARED_PATH
|
||||
)
|
||||
self.lock_file_path = Path(self.dataset_prepared_path) / LOCK_FILE_NAME
|
||||
self.ready_flag_path = Path(self.dataset_prepared_path) / READY_FILE_NAME
|
||||
self.counter_path = Path(self.dataset_prepared_path) / PROCESS_COUNTER_FILE_NAME
|
||||
|
||||
def load(self, load_fn: Callable[[], Any]) -> Any:
|
||||
with FileLock(str(self.lock_file_path)):
|
||||
self._increment_counter()
|
||||
|
||||
if not self.ready_flag_path.exists():
|
||||
result = load_fn()
|
||||
self.ready_flag_path.touch()
|
||||
return result
|
||||
|
||||
while not self.ready_flag_path.exists():
|
||||
time.sleep(1)
|
||||
return load_fn()
|
||||
|
||||
def _increment_counter(self):
|
||||
"""Safely increment the process counter."""
|
||||
if self.counter_path.exists():
|
||||
count = int(self.counter_path.read_text().strip())
|
||||
else:
|
||||
count = 0
|
||||
self.counter_path.write_text(str(count + 1))
|
||||
|
||||
def cleanup(self):
|
||||
"""Clean up ready flag when last process is done."""
|
||||
with FileLock(str(self.lock_file_path)):
|
||||
count = int(self.counter_path.read_text().strip())
|
||||
count -= 1
|
||||
|
||||
if count == 0:
|
||||
# Last process cleans everything up
|
||||
self.ready_flag_path.unlink(missing_ok=True)
|
||||
self.counter_path.unlink(missing_ok=True)
|
||||
else:
|
||||
# Still have active processes
|
||||
self.counter_path.write_text(str(count))
|
||||
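A usage sketch (not part of the diff): the loader wraps whatever dataset-preparation callable each rank would normally run. The first process to take the lock executes it and touches the ready flag; the others wait on the flag and then call the same function, which by then returns the already-prepared dataset from the cache path. The config value and callable below are placeholders.

```python
from axolotl.utils.data.lock import FileLockLoader
from axolotl.utils.dict import DictDefault

cfg = DictDefault({"dataset_prepared_path": "last_run_prepared"})  # placeholder path


def load_or_prepare():
    # Expensive tokenization / preprocessing; assumed to cache its result under
    # cfg.dataset_prepared_path so repeat calls after the first are cheap loads.
    ...


loader = FileLockLoader(cfg)
try:
    dataset = loader.load(load_or_prepare)
finally:
    loader.cleanup()  # last process to finish removes the ready flag and counter
```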
@@ -250,7 +250,7 @@ def encode_packed_pretraining(
|
||||
# pylint: disable=duplicate-code
|
||||
# tokenize all the examples
|
||||
# rows get split with stride (overlap)
|
||||
train_dataset = ds_wrapper(Dataset.from_dict(examples))[0]
|
||||
train_dataset = ds_wrapper(dataset=Dataset.from_dict(examples))[0]
|
||||
|
||||
train_dataset = process_pretraining_datasets_for_packing(
|
||||
train_dataset,
|
||||
|
||||
@@ -1,75 +1,117 @@
|
||||
"""data handling specific to DPO"""
|
||||
"""Data handling specific to RL trainers."""
|
||||
|
||||
import inspect
|
||||
from functools import partial
|
||||
from pathlib import Path
|
||||
from typing import Any, List, Union
|
||||
from typing import Any, Callable, Literal
|
||||
|
||||
import yaml
|
||||
from datasets import Dataset, DatasetDict, concatenate_datasets, load_from_disk
|
||||
from datasets import Dataset, DatasetDict
|
||||
from transformers import PreTrainedTokenizer
|
||||
|
||||
from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
|
||||
from axolotl.loaders import load_tokenizer
|
||||
from axolotl.prompt_strategies.dpo import load as load_dpo
|
||||
from axolotl.prompt_strategies.kto import load as load_kto
|
||||
from axolotl.prompt_strategies.orpo import load as load_orpo
|
||||
from axolotl.utils.data.shared import datasets_w_name_generator, load_dataset_w_config
|
||||
from axolotl.utils.data.utils import deduplicate_and_log_datasets, md5
|
||||
from axolotl.utils.data.lock import FileLockLoader
|
||||
from axolotl.utils.data.shared import (
|
||||
create_train_validation_split,
|
||||
datasets_with_name_generator,
|
||||
generate_dataset_hash_from_config,
|
||||
load_dataset_with_config,
|
||||
load_preprocessed_dataset,
|
||||
merge_datasets,
|
||||
save_preprocessed_dataset,
|
||||
try_load_from_hub,
|
||||
)
|
||||
from axolotl.utils.data.utils import (
|
||||
deduplicate_and_log_datasets,
|
||||
retry_on_request_exceptions,
|
||||
)
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.distributed import is_main_process, zero_first
|
||||
from axolotl.utils.logging import get_logger
|
||||
from axolotl.utils.schemas.enums import RLType
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
def _get_path(ds_hash, cfg):
|
||||
prepared_ds_path = (
|
||||
Path(cfg.dataset_prepared_path) / ds_hash
|
||||
if cfg.dataset_prepared_path
|
||||
else Path(DEFAULT_DATASET_PREPARED_PATH) / ds_hash
|
||||
)
|
||||
@retry_on_request_exceptions(max_retries=3, delay=5)
|
||||
def prepare_preference_datasets(
|
||||
cfg: DictDefault, tokenizer: PreTrainedTokenizer
|
||||
) -> tuple[Dataset, Dataset | None]:
|
||||
"""Load and prepare preference datasets for RL training.
|
||||
|
||||
return prepared_ds_path
|
||||
Loads training and evaluation datasets, handling preprocessing, caching, and
|
||||
deduplication as configured. Uses FileLock for distributed coordination.
|
||||
|
||||
Args:
|
||||
cfg: Configuration object containing dataset and training settings.
|
||||
tokenizer: Tokenizer to use for processing text.
|
||||
|
||||
Returns:
|
||||
Tuple of (train_dataset, eval_dataset). eval_dataset may be None
|
||||
if no evaluation dataset is configured.
|
||||
"""
|
||||
|
||||
def _load_datasets():
|
||||
# Load training dataset
|
||||
train_dataset = _load_or_create_dataset_split(cfg, tokenizer, split="train")
|
||||
|
||||
# Load or create evaluation dataset
|
||||
eval_dataset: Dataset | None = None
|
||||
if cfg.test_datasets:
|
||||
eval_dataset = _load_or_create_dataset_split(cfg, tokenizer, split="test")
|
||||
elif cfg.val_set_size:
|
||||
# Create validation split from training data
|
||||
train_dataset, eval_dataset = create_train_validation_split(
|
||||
train_dataset, cfg, cfg.val_set_size
|
||||
)
|
||||
|
||||
return train_dataset, eval_dataset
|
||||
|
||||
# Prepare datasets (with file locking logic for multiple ranks)
|
||||
loader = FileLockLoader(cfg)
|
||||
try:
|
||||
train_dataset, eval_dataset = loader.load(_load_datasets)
|
||||
finally:
|
||||
loader.cleanup()
|
||||
|
||||
# Apply deduplication if configured
|
||||
if cfg.dataset_exact_deduplication:
|
||||
train_dataset, eval_dataset = deduplicate_and_log_datasets(
|
||||
dataset=train_dataset, other_dataset=eval_dataset
|
||||
)
|
||||
|
||||
return train_dataset, eval_dataset
|
||||
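As a hedged usage sketch (not part of this diff), the new entry point that replaces load_prepare_preference_datasets is called with a config plus a tokenizer:

from axolotl.loaders import load_tokenizer
from axolotl.utils.data.rl import prepare_preference_datasets

tokenizer = load_tokenizer(cfg)  # cfg: an axolotl DictDefault config, assumed to be in scope
train_dataset, eval_dataset = prepare_preference_datasets(cfg, tokenizer)
# eval_dataset is None unless test_datasets or val_set_size is configured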
|
||||
|
||||
def _load_preprocessed_ds(cfg, sub_cfg):
|
||||
ds_hash = md5(yaml.dump(sub_cfg, Dumper=yaml.Dumper))
|
||||
prepared_ds_path = _get_path(ds_hash, cfg)
|
||||
dataset = None
|
||||
def _map_dataset(
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | DatasetDict,
|
||||
ds_transform_fn: Callable[..., Any],
|
||||
tokenizer: Any | None = None,
|
||||
**map_kwargs: Any,
|
||||
) -> Dataset:
|
||||
"""Apply transformation function to dataset.
|
||||
|
||||
# pylint: disable=duplicate-code
|
||||
if (
|
||||
cfg.dataset_prepared_path
|
||||
and any(prepared_ds_path.glob("*"))
|
||||
and not cfg.is_preprocess
|
||||
):
|
||||
LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
|
||||
dataset = load_from_disk(str(prepared_ds_path))
|
||||
Args:
|
||||
cfg: Configuration object.
|
||||
dataset: Dataset to transform.
|
||||
ds_transform_fn: Transformation function to apply.
|
||||
tokenizer: Optional tokenizer for transformation.
|
||||
**map_kwargs: Additional arguments for dataset mapping.
|
||||
|
||||
return dataset
|
||||
|
||||
|
||||
def _save_preprocessed_ds(cfg, sub_cfg, dataset):
|
||||
ds_hash = md5(yaml.dump(sub_cfg, Dumper=yaml.Dumper))
|
||||
prepared_ds_path = _get_path(ds_hash, cfg)
|
||||
|
||||
if cfg.is_preprocess and is_main_process():
|
||||
LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
|
||||
dataset.save_to_disk(str(prepared_ds_path))
|
||||
|
||||
|
||||
def map_dataset(cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs):
|
||||
Returns:
|
||||
Transformed dataset.
|
||||
"""
|
||||
sig = inspect.signature(ds_transform_fn)
|
||||
if "tokenizer" in sig.parameters:
|
||||
if not tokenizer:
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
ds_transform_fn = partial(ds_transform_fn, tokenizer=tokenizer)
|
||||
|
||||
if isinstance(data_set, DatasetDict):
|
||||
data_set = data_set["train"]
|
||||
if isinstance(dataset, DatasetDict):
|
||||
dataset = dataset["train"]
|
||||
|
||||
data_set = data_set.map(
|
||||
dataset = dataset.map(
|
||||
ds_transform_fn,
|
||||
num_proc=cfg.dataset_processes,
|
||||
load_from_cache_file=not cfg.is_preprocess,
|
||||
@@ -77,13 +119,27 @@ def map_dataset(cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs):
|
||||
**map_kwargs,
|
||||
)
|
||||
|
||||
return data_set
|
||||
return dataset
|
||||
|
||||
|
||||
def drop_long_rl_seq(
|
||||
sample, rl, tokenizer, sequence_len # pylint: disable=invalid-name
|
||||
):
|
||||
if rl in (RLType.DPO, RLType.IPO, RLType.ORPO, RLType.SIMPO):
|
||||
def _drop_long_sequences(
|
||||
sample: dict[str, Any], rl: RLType, tokenizer: Any, sequence_len: int
|
||||
) -> bool:
|
||||
"""Filter out samples that exceed maximum sequence length.
|
||||
|
||||
Args:
|
||||
sample: Dataset sample to check.
|
||||
rl: Reinforcement learning type.
|
||||
tokenizer: Tokenizer for length calculation.
|
||||
sequence_len: Maximum allowed sequence length.
|
||||
|
||||
Returns:
|
||||
True if sample should be kept, False if it should be dropped.
|
||||
|
||||
Raises:
|
||||
ValueError: If required keys are missing or RL type is unknown.
|
||||
"""
|
||||
if rl in {RLType.DPO, RLType.IPO, RLType.ORPO, RLType.SIMPO}:
|
||||
if not (
|
||||
sample.get("prompt") and sample.get("chosen") and sample.get("rejected")
|
||||
):
|
||||
@@ -123,132 +179,115 @@ def drop_long_rl_seq(
|
||||
raise ValueError("Unknown RL type")
|
||||
|
||||
|
||||
def load_prepare_preference_datasets(cfg):
|
||||
def load_split(dataset_cfgs, _cfg):
|
||||
split_datasets: List[Any] = []
|
||||
use_auth_token = _cfg.hf_use_auth_token
|
||||
for config_dataset in datasets_w_name_generator(dataset_cfgs):
|
||||
ds: Union[Dataset, DatasetDict] = load_dataset_w_config(
|
||||
config_dataset, use_auth_token, streaming=False
|
||||
)
|
||||
split_datasets.append(ds)
|
||||
def _load_split(cfg: DictDefault, split: Literal["train", "test"]) -> Dataset:
|
||||
"""Load and process dataset split for RL training.
|
||||
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
Args:
|
||||
cfg: Configuration object containing dataset settings.
|
||||
split: Dataset split to load ("train" or "test").
|
||||
|
||||
for i, data_set in enumerate(split_datasets):
|
||||
_type = dataset_cfgs[i]["type"]
|
||||
if _type:
|
||||
if isinstance(_type, DictDefault):
|
||||
_type = "user_defined.default"
|
||||
if _cfg.rl is RLType.ORPO:
|
||||
ds_transform_fn = load_orpo(_type, _cfg, dataset_idx=i)
|
||||
elif _cfg.rl is RLType.KTO:
|
||||
ds_transform_fn = load_kto(_type, _cfg, dataset_idx=i)
|
||||
else:
|
||||
ds_transform_fn = load_dpo(_type, _cfg, dataset_idx=i)
|
||||
Returns:
|
||||
Combined and processed dataset for the specified split.
|
||||
"""
|
||||
datasets_configs = cfg.datasets if split == "train" else cfg.test_datasets
|
||||
split_datasets: list[Dataset | DatasetDict] = []
|
||||
|
||||
map_kwargs = {}
|
||||
if isinstance(ds_transform_fn, tuple):
|
||||
ds_transform_fn, map_kwargs = ds_transform_fn
|
||||
split_datasets[i] = map_dataset(
|
||||
cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs
|
||||
)
|
||||
elif _cfg.rl is RLType.KTO:
|
||||
ds_transform_fn = load_kto(_type, _cfg, dataset_idx=i)
|
||||
map_kwargs = {}
|
||||
if isinstance(ds_transform_fn, tuple):
|
||||
ds_transform_fn, map_kwargs = ds_transform_fn
|
||||
split_datasets[i] = map_dataset(
|
||||
cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs
|
||||
)
|
||||
else:
|
||||
# If no `type` is provided, assume the dataset is already in the expected format with
|
||||
# "prompt", "chosen" and "rejected" already preprocessed
|
||||
split_datasets[i] = data_set
|
||||
|
||||
if not cfg.skip_prepare_dataset:
|
||||
drop_long = partial(
|
||||
drop_long_rl_seq,
|
||||
rl=_cfg.rl,
|
||||
tokenizer=tokenizer,
|
||||
sequence_len=cfg.sequence_len,
|
||||
)
|
||||
|
||||
prior_len = len(split_datasets[i])
|
||||
split_datasets[i] = split_datasets[i].filter(
|
||||
drop_long,
|
||||
num_proc=cfg.dataset_processes,
|
||||
load_from_cache_file=not cfg.is_preprocess,
|
||||
desc="Dropping Long Sequences",
|
||||
)
|
||||
dropped = prior_len - len(split_datasets[i])
|
||||
if dropped:
|
||||
LOG.warning(
|
||||
f"Dropped {dropped} long samples from dataset index {i}"
|
||||
)
|
||||
|
||||
combined_datasets = concatenate_datasets(split_datasets)
|
||||
combined_datasets = combined_datasets.shuffle(seed=cfg.seed or 42)
|
||||
|
||||
return combined_datasets
|
||||
|
||||
with zero_first(is_main_process()):
|
||||
train_is_preprocessed = False
|
||||
eval_is_preprocessed = False
|
||||
if train_dataset := _load_preprocessed_ds(cfg, cfg.datasets):
|
||||
train_is_preprocessed = True
|
||||
else:
|
||||
train_dataset = load_split(cfg.datasets, cfg)
|
||||
|
||||
eval_dataset = None
|
||||
if cfg.test_datasets:
|
||||
if eval_dataset := _load_preprocessed_ds(cfg, cfg.test_datasets):
|
||||
eval_is_preprocessed = True
|
||||
else:
|
||||
eval_dataset = load_split(cfg.test_datasets, cfg)
|
||||
if not eval_dataset:
|
||||
if cfg.val_set_size:
|
||||
seed = cfg.seed if cfg.seed is not None else 42
|
||||
|
||||
# ensure we end up with the same fingerprint by doing rank0 first and being able to cache
|
||||
to_hash_train = (
|
||||
train_dataset._fingerprint # pylint: disable=protected-access
|
||||
+ "|"
|
||||
+ str(cfg.val_set_size)
|
||||
+ "|"
|
||||
+ "train"
|
||||
+ "|"
|
||||
+ str(cfg.seed or 42)
|
||||
)
|
||||
to_hash_test = (
|
||||
train_dataset._fingerprint # pylint: disable=protected-access
|
||||
+ "|"
|
||||
+ str(cfg.val_set_size)
|
||||
+ "|"
|
||||
+ "test"
|
||||
+ "|"
|
||||
+ str(cfg.seed or 42)
|
||||
)
|
||||
train_fingerprint = md5(to_hash_train)
|
||||
test_fingerprint = md5(to_hash_test)
|
||||
ds_w_test_split = train_dataset.train_test_split(
|
||||
test_size=cfg.val_set_size,
|
||||
seed=seed,
|
||||
shuffle=False,
|
||||
train_new_fingerprint=train_fingerprint,
|
||||
test_new_fingerprint=test_fingerprint,
|
||||
)
|
||||
eval_dataset = ds_w_test_split["test"]
|
||||
train_dataset = ds_w_test_split["train"]
|
||||
|
||||
if not train_is_preprocessed:
|
||||
_save_preprocessed_ds(cfg, cfg.datasets, train_dataset)
|
||||
if eval_dataset and not eval_is_preprocessed:
|
||||
_save_preprocessed_ds(cfg, cfg.test_datasets, eval_dataset)
|
||||
|
||||
if cfg.dataset_exact_deduplication:
|
||||
train_dataset, eval_dataset, _ = deduplicate_and_log_datasets(
|
||||
train_dataset=train_dataset, eval_dataset=eval_dataset
|
||||
for dataset_config in datasets_with_name_generator(datasets_configs):
|
||||
dataset: Dataset | DatasetDict = load_dataset_with_config(
|
||||
dataset_config, cfg.hf_use_auth_token, streaming=False
|
||||
)
|
||||
split_datasets.append(dataset)
|
||||
|
||||
return train_dataset, eval_dataset
|
||||
tokenizer = load_tokenizer(cfg)
|
||||
|
||||
for i, dataset in enumerate(split_datasets):
|
||||
_type = datasets_configs[i]["type"]
|
||||
if _type:
|
||||
if isinstance(_type, DictDefault):
|
||||
_type = "user_defined.default"
|
||||
if cfg.rl is RLType.ORPO:
|
||||
ds_transform_fn = load_orpo(_type, cfg, dataset_idx=i)
|
||||
elif cfg.rl is RLType.KTO:
|
||||
ds_transform_fn = load_kto(_type, cfg, dataset_idx=i)
|
||||
else:
|
||||
ds_transform_fn = load_dpo(_type, cfg, dataset_idx=i)
|
||||
|
||||
map_kwargs: dict[str, Any] = {}
|
||||
if isinstance(ds_transform_fn, tuple):
|
||||
ds_transform_fn, map_kwargs = ds_transform_fn
|
||||
split_datasets[i] = _map_dataset(
|
||||
cfg, dataset, ds_transform_fn, tokenizer, **map_kwargs
|
||||
)
|
||||
else:
|
||||
# If no `type` is provided, assume the dataset is already in the expected format with
|
||||
# "prompt", "chosen", and "rejected" already preprocessed
|
||||
split_datasets[i] = dataset
|
||||
|
||||
if not cfg.skip_prepare_dataset:
|
||||
drop_long = partial(
|
||||
_drop_long_sequences,
|
||||
rl=cfg.rl,
|
||||
tokenizer=tokenizer,
|
||||
sequence_len=cfg.sequence_len,
|
||||
)
|
||||
|
||||
prior_len = len(split_datasets[i])
|
||||
split_datasets[i] = split_datasets[i].filter(
|
||||
drop_long,
|
||||
num_proc=cfg.dataset_processes,
|
||||
load_from_cache_file=not cfg.is_preprocess,
|
||||
desc="Dropping Long Sequences",
|
||||
)
|
||||
dropped = prior_len - len(split_datasets[i])
|
||||
if dropped:
|
||||
LOG.warning(f"Dropped {dropped} long samples from dataset index {i}")
|
||||
|
||||
# Merge datasets
|
||||
dataset = merge_datasets(split_datasets, cfg)
|
||||
|
||||
if not cfg.skip_prepare_dataset:
|
||||
# Save preprocessed dataset
|
||||
dataset_hash = generate_dataset_hash_from_config(
|
||||
cfg, datasets_configs, tokenizer.name_or_path
|
||||
)
|
||||
save_preprocessed_dataset(cfg, dataset, dataset_hash, split)
|
||||
|
||||
return dataset
|
||||
|
||||
|
||||
# pylint: disable=duplicate-code
|
||||
def _load_or_create_dataset_split(
|
||||
cfg: DictDefault, tokenizer: PreTrainedTokenizer, split: Literal["train", "test"]
|
||||
) -> Dataset:
|
||||
"""Load preprocessed dataset or create new one for given split.
|
||||
|
||||
Args:
|
||||
cfg: Configuration object.
|
||||
tokenizer: Tokenizer to use for processing text.
|
||||
split: Dataset split to load.
|
||||
|
||||
Returns:
|
||||
Dataset for the given split, loaded from the cache when available or newly created.
|
||||
"""
|
||||
# Select correct dataset configuration based on split
|
||||
datasets_config = cfg.datasets if split == "train" else cfg.test_datasets
|
||||
|
||||
# Generate dataset hash for caching
|
||||
dataset_hash = generate_dataset_hash_from_config(
|
||||
cfg, datasets_config, tokenizer.name_or_path
|
||||
)
|
||||
|
||||
# Try loading from hub if push_dataset_to_hub is configured
|
||||
dataset = None
|
||||
if cfg.push_dataset_to_hub:
|
||||
dataset = try_load_from_hub(cfg, dataset_hash, split)
|
||||
|
||||
# Attempt to load preprocessed dataset
|
||||
if dataset is None:
|
||||
dataset = load_preprocessed_dataset(cfg, dataset_hash)
|
||||
|
||||
# Otherwise, load it
|
||||
if dataset is None:
|
||||
dataset = _load_split(cfg, split=split)
|
||||
|
||||
return dataset
|
||||
|
||||
File diff suppressed because it is too large
@@ -1,11 +1,21 @@
|
||||
"""
|
||||
dataset loading shared utils
|
||||
"""
|
||||
"""Dataset loading shared utils."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import functools
|
||||
import os
|
||||
from pathlib import Path
|
||||
from typing import Optional, Union
|
||||
from typing import TYPE_CHECKING, Any, Generator
|
||||
|
||||
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
|
||||
from datasets import (
|
||||
Dataset,
|
||||
DatasetDict,
|
||||
IterableDataset,
|
||||
IterableDatasetDict,
|
||||
concatenate_datasets,
|
||||
load_dataset,
|
||||
load_from_disk,
|
||||
)
|
||||
from huggingface_hub import hf_hub_download, snapshot_download
|
||||
from huggingface_hub.errors import (
|
||||
HFValidationError,
|
||||
@@ -13,78 +23,141 @@ from huggingface_hub.errors import (
|
||||
RevisionNotFoundError,
|
||||
)
|
||||
|
||||
from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
|
||||
from axolotl.utils.data.utils import deduplicate_and_log_datasets, md5
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.logging import get_logger
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from adlfs import AzureBlobFileSystem
|
||||
from gcsfs import GCSFileSystem
|
||||
from ocifs import OCIFileSystem
|
||||
from s3fs import S3FileSystem
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
EXTENSIONS_TO_DATASET_TYPES = {
|
||||
".parquet": "parquet",
|
||||
".arrow": "arrow",
|
||||
".csv": "csv",
|
||||
".txt": "text",
|
||||
}
|
||||
|
||||
|
||||
def get_ds_type(config_dataset: DictDefault):
|
||||
"""
|
||||
Get the dataset type from the path if it's not specified
|
||||
"""
|
||||
ds_type = "json"
|
||||
if config_dataset.ds_type:
|
||||
ds_type = config_dataset.ds_type
|
||||
elif ".parquet" in config_dataset.path:
|
||||
ds_type = "parquet"
|
||||
elif ".arrow" in config_dataset.path:
|
||||
ds_type = "arrow"
|
||||
elif ".csv" in config_dataset.path:
|
||||
ds_type = "csv"
|
||||
elif ".txt" in config_dataset.path:
|
||||
ds_type = "text"
|
||||
return ds_type
|
||||
def get_dataset_type(dataset_config: DictDefault) -> str:
|
||||
"""Get the dataset type from the path if it's not specified."""
|
||||
if dataset_config.ds_type:
|
||||
return dataset_config.ds_type
|
||||
|
||||
for extension, dataset_type in EXTENSIONS_TO_DATASET_TYPES.items():
|
||||
if extension in dataset_config.path:
|
||||
return dataset_type
|
||||
|
||||
return "json"
|
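For illustration only (assuming DictDefault returns None for unset keys, as it is used elsewhere in the codebase), the helper resolves types like this:

from axolotl.utils.data.shared import get_dataset_type
from axolotl.utils.dict import DictDefault

assert get_dataset_type(DictDefault({"path": "data/train.parquet"})) == "parquet"
assert get_dataset_type(DictDefault({"path": "data/train.jsonl"})) == "json"  # fallback default
assert get_dataset_type(DictDefault({"path": "data/rows.txt", "ds_type": "csv"})) == "csv"  # explicit override wins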
||||
|
||||
|
||||
def datasets_w_name_generator(dataset_configs: list[DictDefault]):
|
||||
"""
|
||||
Yields dataset configs handling multiple names or preprocess_shards
|
||||
def datasets_with_name_generator(
|
||||
dataset_configs: list[DictDefault],
|
||||
) -> Generator[DictDefault, None, None]:
|
||||
"""Yields expanded dataset configurations based on multiple names or preprocessing
|
||||
shards.
|
||||
|
||||
When a dataset config has a list of names, it yields separate configs for each
|
||||
name. When a dataset config specifies preprocessing shards, it yields configs for
|
||||
each shard.
|
||||
|
||||
Args:
|
||||
dataset_configs: list of dataset configs (equivalent to cfg.datasets)
|
||||
dataset_configs: List of dataset configuration objects.
|
||||
|
||||
Yields:
|
||||
Individual dataset configurations, expanded as needed for names or shards.
|
||||
"""
|
||||
for dataset in dataset_configs:
|
||||
if dataset.name and isinstance(dataset.name, list):
|
||||
# load_dataset doesn't properly handle multiple named configurations
|
||||
# at the same time for a given dataset
|
||||
for name in dataset.name:
|
||||
yield DictDefault({**dataset, "name": name})
|
||||
elif dataset.preprocess_shards and not dataset.shards:
|
||||
for shard in range(dataset.preprocess_shards):
|
||||
for config in dataset_configs:
|
||||
if config.name and isinstance(config.name, list):
|
||||
for name in config.name:
|
||||
yield DictDefault({**config, "name": name})
|
||||
elif config.preprocess_shards and not config.shards:
|
||||
for shard_idx in range(config.preprocess_shards):
|
||||
yield DictDefault(
|
||||
{
|
||||
**dataset,
|
||||
"shards": dataset.preprocess_shards,
|
||||
"shards_idx": shard,
|
||||
**config,
|
||||
"shards": config.preprocess_shards,
|
||||
"shards_idx": shard_idx,
|
||||
}
|
||||
)
|
||||
else:
|
||||
yield dataset
|
||||
yield config
|
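A small sketch of the expansion behavior (the dataset paths are made up):

from axolotl.utils.data.shared import datasets_with_name_generator
from axolotl.utils.dict import DictDefault

configs = [
    DictDefault({"path": "org/multilingual-ds", "name": ["en", "de"]}),
    DictDefault({"path": "org/huge-ds", "preprocess_shards": 2}),
]
expanded = list(datasets_with_name_generator(configs))
# yields one config per name ("en", "de"), then one config per shard with
# shards=2 and shards_idx in {0, 1}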
||||
|
||||
|
||||
def load_dataset_w_config(
|
||||
config_dataset: DictDefault, use_auth_token: bool, streaming=False
|
||||
) -> Union[Dataset, DatasetDict]:
|
||||
"""
|
||||
Load a dataset from a config
|
||||
def load_dataset_with_config(
|
||||
dataset_config: DictDefault, use_auth_token: bool, streaming=False
|
||||
) -> Dataset | IterableDataset:
|
||||
"""Load a dataset from a config. Handles datasets that are stored locally, in the
|
||||
HuggingFace Hub, in a remote filesystem (S3, GCS, Azure, OCI), a URL, or
|
||||
`data_files`.
|
||||
|
||||
Args:
|
||||
config_dataset: single dataset config
|
||||
use_auth_token: whether to use HF auth token
|
||||
streaming: whether to stream the dataset
|
||||
dataset_config: Single dataset config.
|
||||
use_auth_token: Whether to use HF auth token.
|
||||
streaming: Whether to stream the dataset.
|
||||
|
||||
Returns:
|
||||
Loaded dataset.
|
||||
"""
|
||||
# pylint: disable=invalid-name
|
||||
ds: Optional[Union[Dataset, DatasetDict]] = None # pylint: disable=invalid-name
|
||||
ds_from_hub = False
|
||||
# Set up common kwargs for dataset loading
|
||||
load_dataset_kwargs = {
|
||||
"split": dataset_config.split if dataset_config.split else None,
|
||||
"name": dataset_config.name,
|
||||
"streaming": streaming,
|
||||
"trust_remote_code": dataset_config.trust_remote_code,
|
||||
}
|
||||
|
||||
# First check if it's a local path
|
||||
if Path(dataset_config.path).exists():
|
||||
return _load_from_local_path(dataset_config, load_dataset_kwargs)
|
||||
|
||||
# Check if it's a HuggingFace dataset
|
||||
is_hub_dataset = _check_if_hub_dataset(dataset_config, use_auth_token)
|
||||
|
||||
# Check if it's a cloud storage path and get appropriate filesystem
|
||||
remote_fs, storage_options = _get_remote_filesystem(dataset_config.path)
|
||||
is_cloud_dataset = False
|
||||
if remote_fs:
|
||||
try:
|
||||
is_cloud_dataset = remote_fs.exists(dataset_config.path)
|
||||
except (FileNotFoundError, ConnectionError):
|
||||
pass
|
||||
|
||||
# Load from appropriate source
|
||||
if is_hub_dataset:
|
||||
return _load_from_hub(dataset_config, use_auth_token, load_dataset_kwargs)
|
||||
if is_cloud_dataset:
|
||||
return _load_from_cloud(
|
||||
dataset_config, remote_fs, storage_options, load_dataset_kwargs
|
||||
)
|
||||
if dataset_config.path.startswith("https://"):
|
||||
return _load_from_url(dataset_config, load_dataset_kwargs)
|
||||
if dataset_config.data_files:
|
||||
return _load_from_data_files(dataset_config, load_dataset_kwargs)
|
||||
|
||||
raise ValueError(
|
||||
f"The dataset could not be loaded. This could be due to a misconfigured dataset path "
|
||||
f"({dataset_config.path}). Try double-check your path / name / data_files. "
|
||||
f"This is not caused by the dataset type."
|
||||
)
|
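A hedged example of calling the loader for a local parquet file (path and split are illustrative; unset config keys are assumed to resolve to None):

from axolotl.utils.data.shared import load_dataset_with_config
from axolotl.utils.dict import DictDefault

ds_config = DictDefault({"path": "data/preferences.parquet", "split": "train"})
dataset = load_dataset_with_config(ds_config, use_auth_token=False, streaming=False)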
||||
|
||||
|
||||
def _check_if_hub_dataset(dataset_config: DictDefault, use_auth_token: bool) -> bool:
|
||||
"""Check if a dataset exists on the HuggingFace Hub."""
|
||||
try:
|
||||
# this is just a basic check to see if the path is a
|
||||
# valid HF dataset that's loadable
|
||||
snapshot_download(
|
||||
repo_id=config_dataset.path,
|
||||
repo_id=dataset_config.path,
|
||||
repo_type="dataset",
|
||||
token=use_auth_token,
|
||||
revision=config_dataset.revision,
|
||||
revision=dataset_config.revision,
|
||||
ignore_patterns=["*"],
|
||||
)
|
||||
ds_from_hub = True
|
||||
return True
|
||||
except (
|
||||
RepositoryNotFoundError,
|
||||
RevisionNotFoundError,
|
||||
@@ -93,198 +166,373 @@ def load_dataset_w_config(
|
||||
HFValidationError,
|
||||
ValueError,
|
||||
):
|
||||
pass
|
||||
return False
|
||||
|
||||
ds_from_cloud = False
|
||||
storage_options: dict = {}
|
||||
remote_file_system = None
|
||||
if config_dataset.path.startswith("s3://"):
|
||||
|
||||
def _get_remote_filesystem(
|
||||
path: str,
|
||||
) -> tuple[
|
||||
S3FileSystem | GCSFileSystem | AzureBlobFileSystem | OCIFileSystem | None, dict
|
||||
]:
|
||||
"""Get the appropriate filesystem for a remote path."""
|
||||
if path.startswith("s3://"):
|
||||
try:
|
||||
import s3fs # type: ignore
|
||||
import s3fs
|
||||
|
||||
storage_options = {"anon": False}
|
||||
return s3fs.S3FileSystem(**storage_options), storage_options
|
||||
except ImportError as exc:
|
||||
raise ImportError("s3:// paths require s3fs to be installed") from exc
|
||||
|
||||
# Reads env, credentials from ~/.aws/credentials, or IAM metadata provider
|
||||
# https://s3fs.readthedocs.io/en/latest/index.html?highlight=storage_options#credentials
|
||||
storage_options = {"anon": False}
|
||||
remote_file_system = s3fs.S3FileSystem(**storage_options)
|
||||
elif config_dataset.path.startswith("gs://") or config_dataset.path.startswith(
|
||||
"gcs://"
|
||||
):
|
||||
elif path.startswith(("gs://", "gcs://")):
|
||||
try:
|
||||
import gcsfs # type: ignore
|
||||
import gcsfs
|
||||
|
||||
storage_options = {"token": None} # type: ignore
|
||||
return gcsfs.GCSFileSystem(**storage_options), storage_options
|
||||
except ImportError as exc:
|
||||
raise ImportError(
|
||||
"gs:// or gcs:// paths require gcsfs to be installed"
|
||||
) from exc
|
||||
|
||||
# gcsfs will use default credentials from the environment else anon
|
||||
# https://gcsfs.readthedocs.io/en/latest/#credentials
|
||||
storage_options = {"token": None}
|
||||
remote_file_system = gcsfs.GCSFileSystem(**storage_options)
|
||||
elif (
|
||||
config_dataset.path.startswith("adl://")
|
||||
or config_dataset.path.startswith("abfs://")
|
||||
or config_dataset.path.startswith("az://")
|
||||
):
|
||||
elif path.startswith(("adl://", "abfs://", "az://")):
|
||||
try:
|
||||
import adlfs
|
||||
|
||||
storage_options = {"anon": False}
|
||||
return adlfs.AzureBlobFileSystem(**storage_options), storage_options
|
||||
except ImportError as exc:
|
||||
raise ImportError(
|
||||
"adl:// or abfs:// paths require adlfs to be installed"
|
||||
) from exc
|
||||
|
||||
# # Ensure you have the following environment variables set:
|
||||
# # Gen 1
|
||||
# storage_options = {
|
||||
# "tenant_id": AZURE_STORAGE_TENANT_ID,
|
||||
# "client_id": AZURE_STORAGE_CLIENT_ID,
|
||||
# "client_secret": AZURE_STORAGE_CLIENT_SECRET,
|
||||
# }
|
||||
# # Gen 2
|
||||
# storage_options = {
|
||||
# "account_name": AZURE_STORAGE_ACCOUNT_NAME,
|
||||
# "account_key": AZURE_STORAGE_ACCOUNT_KEY,
|
||||
# }
|
||||
|
||||
# Reads env
|
||||
# https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials
|
||||
storage_options = {"anon": False}
|
||||
remote_file_system = adlfs.AzureBlobFileSystem(**storage_options)
|
||||
elif config_dataset.path.startswith("oci://"):
|
||||
elif path.startswith("oci://"):
|
||||
try:
|
||||
import ocifs
|
||||
|
||||
storage_options = {}
|
||||
return ocifs.OCIFileSystem(**storage_options), storage_options
|
||||
except ImportError as exc:
|
||||
raise ImportError("oci:// paths require ocifs to be installed") from exc
|
||||
|
||||
# https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables
|
||||
remote_file_system = ocifs.OCIFileSystem(**storage_options)
|
||||
return None, {}
|
||||
|
||||
try:
|
||||
if remote_file_system and remote_file_system.exists(config_dataset.path):
|
||||
ds_from_cloud = True
|
||||
except (FileNotFoundError, ConnectionError):
|
||||
pass
|
||||
|
||||
# gather extra args from the config
|
||||
load_ds_kwargs = {}
|
||||
if config_dataset.split:
|
||||
load_ds_kwargs["split"] = config_dataset.split
|
||||
def _load_from_local_path(
|
||||
dataset_config: DictDefault, load_dataset_kwargs: dict
|
||||
) -> Dataset | IterableDataset | DatasetDict | IterableDatasetDict:
|
||||
"""Load a dataset from a local path."""
|
||||
local_path = Path(dataset_config.path)
|
||||
|
||||
if local_path.is_dir():
|
||||
if dataset_config.data_files:
|
||||
dataset_type = get_dataset_type(dataset_config)
|
||||
return load_dataset(
|
||||
dataset_type,
|
||||
data_files=dataset_config.data_files,
|
||||
**load_dataset_kwargs,
|
||||
)
|
||||
try:
|
||||
return load_from_disk(dataset_config.path)
|
||||
except FileNotFoundError:
|
||||
load_dataset_kwargs["streaming"] = False
|
||||
return load_dataset(dataset_config.path, **load_dataset_kwargs)
|
||||
elif local_path.is_file():
|
||||
dataset_type = get_dataset_type(dataset_config)
|
||||
load_dataset_kwargs["streaming"] = False
|
||||
return load_dataset(
|
||||
dataset_type,
|
||||
data_files=dataset_config.path,
|
||||
**load_dataset_kwargs,
|
||||
)
|
||||
else:
|
||||
load_ds_kwargs["split"] = None
|
||||
|
||||
# prefer local dataset, even if hub exists
|
||||
local_path = Path(config_dataset.path)
|
||||
if local_path.exists():
|
||||
if local_path.is_dir():
|
||||
if config_dataset.data_files:
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
ds = load_dataset( # pylint: disable=invalid-name
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
data_files=config_dataset.data_files,
|
||||
streaming=streaming,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
else:
|
||||
try:
|
||||
ds = load_from_disk(
|
||||
config_dataset.path
|
||||
) # pylint: disable=invalid-name
|
||||
except FileNotFoundError:
|
||||
ds = load_dataset(
|
||||
config_dataset.path,
|
||||
name=config_dataset.name,
|
||||
streaming=False,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
elif local_path.is_file():
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
|
||||
ds = load_dataset( # pylint: disable=invalid-name
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
data_files=config_dataset.path,
|
||||
streaming=False,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
else:
|
||||
raise ValueError(
|
||||
"unhandled dataset load: local path exists, but is neither a directory or a file"
|
||||
)
|
||||
elif ds_from_hub:
|
||||
ds = load_dataset(
|
||||
config_dataset.path,
|
||||
name=config_dataset.name,
|
||||
streaming=streaming,
|
||||
data_files=config_dataset.data_files,
|
||||
token=use_auth_token,
|
||||
revision=config_dataset.revision,
|
||||
trust_remote_code=config_dataset.trust_remote_code,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
elif ds_from_cloud and remote_file_system:
|
||||
if remote_file_system.isdir(config_dataset.path):
|
||||
ds = load_from_disk(
|
||||
config_dataset.path,
|
||||
storage_options=storage_options,
|
||||
)
|
||||
elif remote_file_system.isfile(config_dataset.path):
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
ds = load_dataset(
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
data_files=config_dataset.path,
|
||||
streaming=streaming,
|
||||
storage_options=storage_options,
|
||||
trust_remote_code=config_dataset.trust_remote_code,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
elif config_dataset.path.startswith("https://"):
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
ds = load_dataset(
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
data_files=config_dataset.path,
|
||||
streaming=streaming,
|
||||
storage_options=storage_options,
|
||||
trust_remote_code=config_dataset.trust_remote_code,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
elif config_dataset.data_files:
|
||||
fp: str | list[str] | None = None
|
||||
if isinstance(config_dataset.data_files, str):
|
||||
fp = hf_hub_download(
|
||||
repo_id=config_dataset.path,
|
||||
repo_type="dataset",
|
||||
filename=config_dataset.data_files,
|
||||
revision=config_dataset.revision,
|
||||
)
|
||||
elif isinstance(config_dataset.data_files, list):
|
||||
fp = []
|
||||
for file in config_dataset.data_files:
|
||||
fp.append(
|
||||
hf_hub_download(
|
||||
repo_id=config_dataset.path,
|
||||
repo_type="dataset",
|
||||
filename=file,
|
||||
revision=config_dataset.revision,
|
||||
)
|
||||
)
|
||||
else:
|
||||
raise ValueError("data_files must be either a string or list of strings")
|
||||
ds = load_dataset(
|
||||
"json",
|
||||
name=config_dataset.name,
|
||||
data_files=fp,
|
||||
streaming=streaming,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
if not ds:
|
||||
raise ValueError(
|
||||
"The dataset could not be loaded. This could be due to a misconfigured dataset path "
|
||||
f"({config_dataset.path}). Try double-check your path / name / data_files. "
|
||||
"This is not caused by the dataset type."
|
||||
"Unhandled dataset load: local path exists, but is neither a directory or a file"
|
||||
)
|
||||
|
||||
return ds
|
||||
|
||||
def _load_from_hub(
|
||||
dataset_config: DictDefault, use_auth_token: bool, load_dataset_kwargs: dict
|
||||
) -> Dataset | IterableDataset | DatasetDict | IterableDatasetDict:
|
||||
"""Load a dataset from the HuggingFace Hub."""
|
||||
return load_dataset(
|
||||
dataset_config.path,
|
||||
data_files=dataset_config.data_files,
|
||||
token=use_auth_token,
|
||||
revision=dataset_config.revision,
|
||||
**load_dataset_kwargs,
|
||||
)
|
||||
|
||||
|
||||
def _load_from_cloud(
|
||||
dataset_config: DictDefault,
|
||||
remote_fs: S3FileSystem | GCSFileSystem | AzureBlobFileSystem | OCIFileSystem,
|
||||
storage_options: dict,
|
||||
load_dataset_kwargs: dict,
|
||||
) -> Dataset | IterableDataset | DatasetDict | IterableDatasetDict:
|
||||
"""Load a dataset from cloud storage."""
|
||||
if remote_fs.isdir(dataset_config.path):
|
||||
return load_from_disk(
|
||||
dataset_config.path,
|
||||
storage_options=storage_options,
|
||||
)
|
||||
|
||||
if remote_fs.isfile(dataset_config.path):
|
||||
dataset_type = get_dataset_type(dataset_config)
|
||||
return load_dataset(
|
||||
dataset_type,
|
||||
data_files=dataset_config.path,
|
||||
storage_options=storage_options,
|
||||
**load_dataset_kwargs,
|
||||
)
|
||||
|
||||
raise ValueError(
|
||||
f"Cloud path {dataset_config.path} is neither a directory nor a file"
|
||||
)
|
||||
|
||||
|
||||
def _load_from_url(
|
||||
dataset_config: DictDefault, load_dataset_kwargs: dict
|
||||
) -> Dataset | IterableDataset | DatasetDict | IterableDatasetDict:
|
||||
"""Load a dataset from a URL."""
|
||||
dataset_type = get_dataset_type(dataset_config)
|
||||
return load_dataset(
|
||||
dataset_type,
|
||||
data_files=dataset_config.path,
|
||||
**load_dataset_kwargs,
|
||||
)
|
||||
|
||||
|
||||
def _load_from_data_files(
|
||||
dataset_config: DictDefault, load_dataset_kwargs: dict
|
||||
) -> Dataset | IterableDataset | DatasetDict | IterableDatasetDict:
|
||||
"""Load a dataset from data files."""
|
||||
file_path = None
|
||||
|
||||
if isinstance(dataset_config.data_files, str):
|
||||
file_path = hf_hub_download(
|
||||
repo_id=dataset_config.path,
|
||||
repo_type="dataset",
|
||||
filename=dataset_config.data_files,
|
||||
revision=dataset_config.revision,
|
||||
)
|
||||
elif isinstance(dataset_config.data_files, list):
|
||||
file_path = [
|
||||
hf_hub_download(
|
||||
repo_id=dataset_config.path,
|
||||
repo_type="dataset",
|
||||
filename=file,
|
||||
revision=dataset_config.revision,
|
||||
)
|
||||
for file in dataset_config.data_files
|
||||
]
|
||||
else:
|
||||
raise ValueError("data_files must be either a string or list of strings")
|
||||
|
||||
return load_dataset("json", data_files=file_path, **load_dataset_kwargs)
|
||||
|
||||
|
||||
def generate_split_fingerprints(
|
||||
dataset: Dataset, val_set_size: int | float, seed: int
|
||||
) -> tuple[str, str]:
|
||||
"""Generate consistent fingerprints for train/test splits."""
|
||||
fingerprint = dataset._fingerprint # pylint: disable=protected-access
|
||||
|
||||
train_hash_input = f"{fingerprint}|{val_set_size}|train|{seed}"
|
||||
test_hash_input = f"{fingerprint}|{val_set_size}|test|{seed}"
|
||||
|
||||
train_fingerprint = md5(train_hash_input)
|
||||
test_fingerprint = md5(test_hash_input)
|
||||
|
||||
return train_fingerprint, test_fingerprint
|
||||
|
||||
|
||||
def get_prepared_dataset_path(cfg: DictDefault, dataset_hash: str) -> Path:
|
||||
"""Get standardized path for prepared datasets.
|
||||
|
||||
Args:
|
||||
cfg: Configuration object.
|
||||
dataset_hash: Hash identifying the specific dataset configuration.
|
||||
|
||||
Returns:
|
||||
Path where the prepared dataset should be stored.
|
||||
"""
|
||||
base_path = cfg.dataset_prepared_path or DEFAULT_DATASET_PREPARED_PATH
|
||||
return Path(base_path) / dataset_hash
|
||||
|
||||
|
||||
def create_train_validation_split(
|
||||
dataset: Dataset, cfg: DictDefault, val_set_size: int | float
|
||||
) -> tuple[Dataset, Dataset]:
|
||||
"""Create train/validation split with consistent fingerprinting.
|
||||
|
||||
Args:
|
||||
dataset: Dataset to split.
|
||||
cfg: Configuration object containing seed and other settings.
|
||||
val_set_size: Size of validation set (absolute number or fraction).
|
||||
|
||||
Returns:
|
||||
Tuple of (train_dataset, eval_dataset).
|
||||
"""
|
||||
train_fingerprint, test_fingerprint = generate_split_fingerprints(
|
||||
dataset, val_set_size, cfg.seed
|
||||
)
|
||||
|
||||
# Apply deduplication before splitting if configured
|
||||
if cfg.dataset_exact_deduplication:
|
||||
dataset, _ = deduplicate_and_log_datasets(dataset=dataset)
|
||||
|
||||
split_dataset = dataset.train_test_split(
|
||||
test_size=val_set_size,
|
||||
shuffle=False,
|
||||
seed=cfg.seed,
|
||||
train_new_fingerprint=train_fingerprint,
|
||||
test_new_fingerprint=test_fingerprint,
|
||||
)
|
||||
|
||||
return split_dataset["train"], split_dataset["test"]
|
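For example (values assumed, not taken from this diff), a 5% validation split with a fixed seed keeps the split fingerprints stable across ranks:

from axolotl.utils.data.shared import create_train_validation_split
from axolotl.utils.dict import DictDefault

cfg = DictDefault({"seed": 42, "dataset_exact_deduplication": False})  # illustrative config
train_ds, eval_ds = create_train_validation_split(dataset, cfg, val_set_size=0.05)  # dataset assumed in scope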
||||
|
||||
|
||||
def _generate_from_iterable_dataset(
|
||||
dataset: IterableDataset, worker_id: list[int], num_workers: list[int]
|
||||
) -> Generator[Any, None, None]:
|
||||
"""Generator function to correctly split the dataset for each worker"""
|
||||
for i, item in enumerate(dataset):
|
||||
if i % num_workers[0] == worker_id[0]:
|
||||
yield item
|
||||
|
||||
|
||||
def save_preprocessed_dataset(
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset,
|
||||
dataset_hash: str,
|
||||
split: str,
|
||||
) -> None:
|
||||
"""Save preprocessed dataset to disk and optionally push to the HF Hub."""
|
||||
prepared_ds_path = get_prepared_dataset_path(cfg, dataset_hash)
|
||||
if isinstance(dataset, IterableDataset):
|
||||
num_workers = cfg.dataset_processes
|
||||
|
||||
ds_from_iter = Dataset.from_generator(
|
||||
functools.partial(_generate_from_iterable_dataset, dataset),
|
||||
features=dataset.features,
|
||||
num_proc=num_workers,
|
||||
split=split,
|
||||
gen_kwargs={
|
||||
"worker_id": list(range(num_workers)),
|
||||
"num_workers": [num_workers] * num_workers,
|
||||
},
|
||||
)
|
||||
ds_from_iter.save_to_disk(str(prepared_ds_path))
|
||||
else:
|
||||
os.makedirs(prepared_ds_path, exist_ok=True)
|
||||
dataset.save_to_disk(str(prepared_ds_path))
|
||||
if cfg.push_dataset_to_hub:
|
||||
LOG.info(
|
||||
"Pushing merged prepared dataset to Huggingface hub at "
|
||||
f"{cfg.push_dataset_to_hub} (version {dataset_hash})...",
|
||||
main_process_only=False,
|
||||
)
|
||||
dataset.push_to_hub(
|
||||
cfg.push_dataset_to_hub,
|
||||
dataset_hash,
|
||||
private=True,
|
||||
)
|
||||
|
||||
|
||||
def load_preprocessed_dataset(cfg: DictDefault, dataset_hash: str) -> Dataset | None:
|
||||
"""Load preprocessed dataset from disk if available.
|
||||
|
||||
Args:
|
||||
cfg: Configuration object.
|
||||
dataset_hash: Hash identifying the dataset configuration.
|
||||
|
||||
Returns:
|
||||
Loaded dataset if found and conditions are met, None otherwise.
|
||||
"""
|
||||
prepared_ds_path = get_prepared_dataset_path(cfg, dataset_hash)
|
||||
|
||||
if (
|
||||
cfg.dataset_prepared_path
|
||||
and any(prepared_ds_path.glob("*"))
|
||||
and not cfg.skip_prepare_dataset
|
||||
and not cfg.is_preprocess
|
||||
):
|
||||
LOG.info(
|
||||
f"Loading prepared dataset from disk at {prepared_ds_path}...",
|
||||
main_process_only=False,
|
||||
)
|
||||
return load_from_disk(str(prepared_ds_path))
|
||||
|
||||
LOG.info(
|
||||
f"Unable to find prepared dataset in {prepared_ds_path}",
|
||||
main_process_only=False,
|
||||
)
|
||||
return None
|
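A sketch of the save/load round trip these helpers provide; cfg and tokenizer are assumed to be in scope, and build_dataset is a placeholder rather than an axolotl API:

from axolotl.utils.data.shared import (
    generate_dataset_hash_from_config,
    load_preprocessed_dataset,
    save_preprocessed_dataset,
)

dataset_hash = generate_dataset_hash_from_config(cfg, cfg.datasets, tokenizer.name_or_path)
cached = load_preprocessed_dataset(cfg, dataset_hash)
if cached is None:
    dataset = build_dataset()  # placeholder for the actual tokenization / preparation step
    save_preprocessed_dataset(cfg, dataset, dataset_hash, split="train")
else:
    dataset = cached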
||||
|
||||
|
||||
def try_load_from_hub(
|
||||
cfg: DictDefault, dataset_hash: str, split: str
|
||||
) -> Dataset | None:
|
||||
"""Try to load the prepared dataset from HuggingFace Hub."""
|
||||
try:
|
||||
LOG.info(
|
||||
"Attempting to load prepared dataset from HuggingFace Hub at "
|
||||
f"{cfg.push_dataset_to_hub} (version {dataset_hash})..."
|
||||
)
|
||||
dataset = load_dataset(
|
||||
cfg.push_dataset_to_hub,
|
||||
dataset_hash,
|
||||
token=cfg.hf_use_auth_token,
|
||||
)
|
||||
return dataset[split]
|
||||
except Exception: # pylint: disable=broad-except # nosec
|
||||
LOG.info("Unable to find prepared dataset in HuggingFace Hub")
|
||||
return None
|
||||
|
||||
|
||||
def generate_dataset_hash_from_config(
|
||||
cfg: DictDefault, cfg_datasets: list, tokenizer_name: str
|
||||
) -> str:
|
||||
"""Generate a hash to uniquely identify a dataset configuration for SFT.
|
||||
|
||||
Args:
|
||||
cfg: Main configuration object.
|
||||
cfg_datasets: List of dataset configurations.
|
||||
tokenizer_name: Name of the tokenizer being used.
|
||||
|
||||
Returns:
|
||||
MD5 hash string representing the configuration.
|
||||
"""
|
||||
config_str = (
|
||||
f"{cfg.sequence_len}@{cfg.sample_packing}@{cfg.eval_sample_packing}@"
|
||||
f"{cfg.group_by_length}@{cfg.kd_temperature or 1.0}|"
|
||||
f"{'|'.join(sorted([f'{d.path}:{d.type}:{d.shards}:{d.conversation}:{d.split}:{d.temperature or 1.0}' for d in cfg_datasets]))}"
|
||||
f"|{tokenizer_name}"
|
||||
)
|
||||
return str(md5(config_str))
|
||||
|
||||
|
||||
def merge_datasets(datasets: list[Dataset], cfg: DictDefault) -> Dataset:
|
||||
"""Merge multiple datasets into one with optional shuffling.
|
||||
|
||||
Args:
|
||||
datasets: List of datasets to merge.
|
||||
cfg: Configuration object containing shuffle settings.
|
||||
|
||||
Returns:
|
||||
Merged dataset.
|
||||
"""
|
||||
if len(datasets) == 1:
|
||||
return datasets[0]
|
||||
|
||||
LOG.info("Merging datasets...")
|
||||
merged_dataset = concatenate_datasets(datasets)
|
||||
|
||||
if cfg.shuffle_merged_datasets:
|
||||
LOG.debug("Shuffling merged datasets...")
|
||||
merged_dataset = merged_dataset.shuffle(seed=cfg.seed)
|
||||
else:
|
||||
LOG.debug("Not shuffling merged datasets.")
|
||||
|
||||
return merged_dataset
|
||||
|
||||
@@ -1,9 +1,11 @@
|
||||
"""data handling helpers"""
|
||||
"""Data handling helpers"""
|
||||
|
||||
import contextlib
|
||||
import functools
|
||||
import hashlib
|
||||
import time
|
||||
from enum import Enum
|
||||
from typing import Callable
|
||||
|
||||
import huggingface_hub
|
||||
import numpy as np
|
||||
@@ -19,9 +21,7 @@ LOG = get_logger(__name__)
|
||||
|
||||
|
||||
class RetryStrategy(Enum):
|
||||
"""
|
||||
Enum for retry strategies.
|
||||
"""
|
||||
"""Enum for retry strategies."""
|
||||
|
||||
CONSTANT = 1
|
||||
LINEAR = 2
|
||||
@@ -30,7 +30,18 @@ class RetryStrategy(Enum):
|
||||
|
||||
def retry_on_request_exceptions(
|
||||
max_retries=3, delay=1, retry_strategy: RetryStrategy = RetryStrategy.LINEAR
|
||||
):
|
||||
) -> Callable:
|
||||
"""Decorator that retries function calls on specific request exceptions.
|
||||
|
||||
Args:
|
||||
max_retries: Maximum number of retry attempts.
|
||||
delay: Base delay between retries in seconds.
|
||||
retry_strategy: Strategy for calculating retry delays.
|
||||
|
||||
Returns:
|
||||
Decorated function with retry logic.
|
||||
"""
|
||||
|
||||
def decorator(func):
|
||||
@functools.wraps(func)
|
||||
def wrapper(*args, **kwargs): # pylint: disable=inconsistent-return-statements
|
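A hedged usage sketch of the decorator; the wrapped function is hypothetical, and only the CONSTANT and LINEAR strategies are visible in this diff:

from axolotl.utils.data.utils import RetryStrategy, retry_on_request_exceptions

@retry_on_request_exceptions(max_retries=5, delay=2, retry_strategy=RetryStrategy.CONSTANT)
def pull_remote_dataset():
    # hypothetical loader that may hit transient Hub / network errors
    ...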
||||
@@ -60,6 +71,7 @@ def retry_on_request_exceptions(
|
||||
|
||||
|
||||
def md5(to_hash: str, encoding: str = "utf-8") -> str:
|
||||
"""Generate MD5 hash of a string."""
|
||||
try:
|
||||
return hashlib.md5(to_hash.encode(encoding), usedforsecurity=False).hexdigest()
|
||||
except TypeError:
|
||||
@@ -67,102 +79,89 @@ def md5(to_hash: str, encoding: str = "utf-8") -> str:
|
||||
|
||||
|
||||
def sha256(to_hash: str, encoding: str = "utf-8") -> str:
|
||||
"""Generate SHA256 hash of a string."""
|
||||
return hashlib.sha256(to_hash.encode(encoding)).hexdigest()
|
||||
|
||||
|
||||
def deduplicate_dataset(
|
||||
dataset: Dataset, seen_hashes: dict[str, list[int]], other_dataset: Dataset = None
|
||||
) -> Dataset:
|
||||
unique_indices = []
|
||||
def _deduplicate_dataset(
|
||||
dataset: Dataset,
|
||||
seen_hashes: set[str] | None = None,
|
||||
) -> tuple[Dataset, set[str]]:
|
||||
"""Remove duplicate rows from a dataset using SHA256 hashes.
|
||||
|
||||
Args:
|
||||
dataset: Dataset to deduplicate.
|
||||
seen_hashes: Set of previously seen row hashes (for cross-deduplication).
|
||||
|
||||
Returns:
|
||||
Tuple of deduplicated dataset and the set of seen hashes.
|
||||
"""
|
||||
if seen_hashes is None:
|
||||
seen_hashes = set()
|
||||
|
||||
unique_indices = []
|
||||
for idx, row in enumerate(dataset):
|
||||
row_hash = sha256(str(row)) # Using SHA256 for collision resistance.
|
||||
row_hash = sha256(str(row)) # Using SHA256 for collision resistance
|
||||
if row_hash not in seen_hashes:
|
||||
seen_hashes[row_hash] = [idx]
|
||||
seen_hashes.add(row_hash)
|
||||
unique_indices.append(idx)
|
||||
else:
|
||||
# Check for collision by looking up the original dataset indices
|
||||
original_indices = seen_hashes[row_hash]
|
||||
is_duplicate = False
|
||||
for original_idx in original_indices:
|
||||
if (
|
||||
not idx == original_idx
|
||||
and original_idx < len(dataset)
|
||||
and str(dataset[original_idx]) == str(row)
|
||||
):
|
||||
is_duplicate = True
|
||||
break
|
||||
# Check in the other dataset if provided
|
||||
if other_dataset is not None:
|
||||
if original_idx < len(other_dataset) and str(
|
||||
other_dataset[original_idx]
|
||||
) == str(row):
|
||||
is_duplicate = True
|
||||
break
|
||||
if not is_duplicate:
|
||||
seen_hashes[row_hash].append(idx)
|
||||
unique_indices.append(idx)
|
||||
continue
|
||||
return dataset.select(unique_indices)
|
||||
|
||||
return dataset.select(unique_indices), seen_hashes
|
||||
|
||||
|
||||
def deduplicate_and_log_datasets(
|
||||
*,
|
||||
train_dataset: Dataset = None,
|
||||
eval_dataset: Dataset = None,
|
||||
dataset: Dataset = None,
|
||||
) -> tuple[Dataset, Dataset, Dataset]:
|
||||
"""
|
||||
Deduplicates train, eval, and an optional dataset if provided, logging original and new sizes.
|
||||
dataset: Dataset,
|
||||
other_dataset: Dataset | None = None,
|
||||
dataset_name: str | None = "train",
|
||||
other_name: str | None = "eval",
|
||||
) -> tuple[Dataset, Dataset | None]:
|
||||
"""Deduplicate datasets, with optional cross-dataset deduplication.
|
||||
|
||||
Args:
|
||||
dataset: Primary dataset to deduplicate.
|
||||
other_dataset: Optional second dataset to deduplicate against the first.
|
||||
dataset_name: Name for the primary dataset (for logging).
|
||||
other_name: Name for the second dataset (for logging).
|
||||
|
||||
Returns:
|
||||
tuple: Deduplicated train, eval, and additional datasets.
|
||||
Tuple of (deduplicated_dataset, deduplicated_other_dataset).
|
||||
"""
|
||||
seen_hashes: dict[str, list[int]] = {}
|
||||
# Deduplicate primary dataset
|
||||
LOG.info(
|
||||
f"Starting deduplication for {dataset_name} dataset. Original size: {len(dataset)}"
|
||||
)
|
||||
dataset, seen_rows = _deduplicate_dataset(dataset)
|
||||
LOG.info(
|
||||
f"Deduplication complete for {dataset_name} dataset. New size: {len(dataset)}"
|
||||
)
|
||||
|
||||
# Handle cases where datasets are None
|
||||
if train_dataset is not None:
|
||||
# Deduplicate second dataset if provided
|
||||
if other_dataset is not None:
|
||||
LOG.info(
|
||||
f"Starting deduplication for train dataset. Original size: {len(train_dataset)}"
|
||||
)
|
||||
train_dataset = deduplicate_dataset(
|
||||
dataset=train_dataset, seen_hashes=seen_hashes
|
||||
f"Starting deduplication for {other_name} dataset. Original size: {len(other_dataset)}"
|
||||
)
|
||||
other_dataset, _ = _deduplicate_dataset(other_dataset, seen_rows)
|
||||
LOG.info(
|
||||
f"Deduplication complete for train dataset. New size: {len(train_dataset)}"
|
||||
)
|
||||
else:
|
||||
LOG.info("Train dataset is None. Skipping deduplication.")
|
||||
|
||||
if eval_dataset is not None:
|
||||
LOG.info(
|
||||
f"Starting deduplication for eval dataset. Original size: {len(eval_dataset)}"
|
||||
)
|
||||
eval_dataset = deduplicate_dataset(
|
||||
dataset=eval_dataset, seen_hashes=seen_hashes, other_dataset=train_dataset
|
||||
)
|
||||
LOG.info(
|
||||
f"Deduplication complete for eval dataset. New size: {len(eval_dataset)}"
|
||||
)
|
||||
else:
|
||||
LOG.info("Eval dataset is None. Skipping deduplication.")
|
||||
|
||||
if dataset is not None and (eval_dataset is None and train_dataset is None):
|
||||
LOG.info(
|
||||
f"Starting deduplication for combined dataset. Original size: {len(dataset)}"
|
||||
)
|
||||
dataset = deduplicate_dataset(dataset=dataset, seen_hashes=seen_hashes)
|
||||
LOG.info(
|
||||
f"Deduplication complete for combined dataset. New size: {len(dataset)}"
|
||||
f"Deduplication complete for {other_name} dataset. New size: {len(other_dataset)}"
|
||||
)
|
||||
|
||||
return train_dataset, eval_dataset, dataset
|
||||
return dataset, other_dataset
|
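The new keyword-only call shape, sketched with assumed in-memory datasets.Dataset objects:

train_ds, eval_ds = deduplicate_and_log_datasets(
    dataset=train_ds,
    other_dataset=eval_ds,  # may be None; returned unchanged in that case
)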
||||
|
||||
|
||||
def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault):
|
||||
def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault) -> Dataset:
|
||||
"""Remove sequences longer than configured maximum from dataset.
|
||||
|
||||
Args:
|
||||
dataset: Dataset to filter.
|
||||
cfg: Dictionary mapping `axolotl` config keys to values.
|
||||
|
||||
Returns:
|
||||
Filtered dataset with long sequences removed.
|
||||
"""
|
||||
if "input_ids" not in dataset.column_names:
|
||||
LOG.warning(
|
||||
"Dataset does not contain 'input_ids' column. Skip drop long seq. This is expected for RewardModeling."
|
||||
"Dataset does not contain 'input_ids' column. Skip drop long seq. This is "
|
||||
"expected for reward modeling."
|
||||
)
|
||||
return dataset
|
||||
|
||||
@@ -172,20 +171,14 @@ def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault):
|
||||
min_sequence_len=cfg.min_sample_len,
|
||||
)
|
||||
|
||||
try:
|
||||
with contextlib.suppress(AttributeError):
|
||||
ds_lengths = get_dataset_lengths(dataset, from_arrow=True)
|
||||
min_input_len = np.min(ds_lengths)
|
||||
LOG.info(f"min_input_len: {min_input_len}")
|
||||
max_input_len = np.max(ds_lengths)
|
||||
LOG.info(f"max_input_len: {max_input_len}")
|
||||
except AttributeError:
|
||||
pass
|
||||
|
||||
try:
|
||||
prior_len = len(dataset)
|
||||
except TypeError:
|
||||
# handle iterable datasets case
|
||||
prior_len = None
|
||||
prior_len = len(dataset) if hasattr(dataset, "__len__") else None
|
||||
|
||||
filter_map_kwargs = {}
|
||||
if not isinstance(dataset, IterableDataset):
|
||||
|
||||
425
src/axolotl/utils/data/wrappers.py
Normal file
@@ -0,0 +1,425 @@
|
||||
"""Data handling specific to SFT."""
|
||||
|
||||
import logging
|
||||
from typing import Any, NoReturn, cast
|
||||
|
||||
from datasets import (
|
||||
Dataset,
|
||||
IterableDataset,
|
||||
Sequence,
|
||||
Value,
|
||||
)
|
||||
from transformers import PreTrainedTokenizer
|
||||
from transformers.processing_utils import ProcessorMixin
|
||||
|
||||
from axolotl.datasets import TokenizedPromptDataset, wrap_dataset_for_tokenized_prompt
|
||||
from axolotl.prompt_strategies import load
|
||||
from axolotl.prompt_strategies.bradley_terry import load as bradley_terry_load
|
||||
from axolotl.prompt_tokenizers import (
|
||||
AlpacaMultipleChoicePromptTokenizingStrategy,
|
||||
AlpacaPromptTokenizingStrategy,
|
||||
AlpacaReflectionPTStrategy,
|
||||
DatasetWrappingStrategy,
|
||||
GPTeacherPromptTokenizingStrategy,
|
||||
JeopardyPromptTokenizingStrategy,
|
||||
OpenAssistantPromptTokenizingStrategy,
|
||||
PromptTokenizingStrategy,
|
||||
SummarizeTLDRPromptTokenizingStrategy,
|
||||
)
|
||||
from axolotl.prompters import (
|
||||
AlpacaPrompter,
|
||||
GPTeacherPrompter,
|
||||
JeopardyPrompter,
|
||||
MultipleChoiceConcisePrompter,
|
||||
MultipleChoiceExplainPrompter,
|
||||
Prompter,
|
||||
ReflectAlpacaPrompter,
|
||||
SummarizeTLDRPrompter,
|
||||
UnsupportedPrompter,
|
||||
)
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
LOG = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def handle_unknown_dataset_strategy(dataset_config: DictDefault) -> NoReturn:
|
||||
"""Raise error for unknown dataset strategy."""
|
||||
ds_type = dataset_config.type
|
||||
suffix = ""
|
||||
if ":load_" in ds_type:
|
||||
suffix = f"Did you mean {ds_type.replace(':load_', '.load_')}?"
|
||||
|
||||
error_message = f"unhandled prompt tokenization strategy: {ds_type}. {suffix}"
|
||||
LOG.error(error_message)
|
||||
raise ValueError(error_message)
|
||||
|
||||
|
||||
# pylint: disable=too-many-return-statements
|
||||
def get_dataset_wrapper(
|
||||
dataset_config: DictDefault,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset_base_type: str | None,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_prompt_style: str | None = None,
|
||||
processor: ProcessorMixin | None = None, # pylint: disable=unused-argument
|
||||
) -> tuple[Dataset | IterableDataset, Prompter | None]:
|
||||
"""Create an appropriate dataset wrapper and prompter based on dataset
|
||||
configuration.
|
||||
|
||||
Args:
|
||||
dataset_config: Configuration for the dataset.
|
||||
tokenizer: Tokenizer to use for processing text.
|
||||
cfg: Global configuration object.
|
||||
dataset_base_type: The base type of the dataset.
|
||||
dataset: The actual dataset object.
|
||||
dataset_prompt_style: Optional prompt style specification.
|
||||
processor: Optional processor for multimodal datasets.
|
||||
|
||||
Returns:
|
||||
Tuple of (dataset_wrapper, dataset_prompter).
|
||||
"""
|
||||
# Common parameters for dataset wrapping
|
||||
dataset_kwargs: dict[str, Any] = {
|
||||
"process_count": cfg.dataset_processes,
|
||||
"keep_in_memory": cfg.dataset_keep_in_memory is True,
|
||||
}
|
||||
|
||||
LOG.info(
|
||||
f"Loading dataset: {dataset_config['path']} with base_type: "
|
||||
f"{dataset_base_type} and prompt_style: {dataset_prompt_style}"
|
||||
)
|
||||
|
||||
# Dataset is already tokenized
|
||||
if _is_dataset_already_tokenized(dataset):
|
||||
return dataset, UnsupportedPrompter()
|
||||
|
||||
# Custom dataset type definition
|
||||
if isinstance(dataset_config.type, DictDefault):
|
||||
return _handle_custom_dataset_type(
|
||||
dataset_config, tokenizer, cfg, dataset, dataset_kwargs
|
||||
)
|
||||
|
||||
# Skip preparation if configured
|
||||
if cfg.skip_prepare_dataset:
|
||||
return dataset, None
|
||||
|
||||
# Bradley-Terry dataset
|
||||
if dataset_config.type.startswith("bradley_terry"):
|
||||
return _handle_bradley_terry_dataset(
|
||||
dataset_config, tokenizer, cfg, dataset, dataset_kwargs
|
||||
)
|
||||
|
||||
# Stepwise supervised dataset
|
||||
if dataset_config.type.startswith("stepwise_supervised"):
|
||||
return _handle_stepwise_supervised_dataset(
|
||||
dataset_config, tokenizer, cfg, dataset, dataset_kwargs
|
||||
)
|
||||
|
||||
# Try to load prompt tokenizer / dataset wrapper strategy from registry
|
||||
dataset_strategy = load(
|
||||
dataset_config.type, tokenizer, cfg, dataset_config, processor=processor
|
||||
)
|
||||
if dataset_strategy:
|
||||
return _handle_loaded_strategy(dataset_strategy, dataset, dataset_kwargs)
|
||||
|
||||
# Known dataset types with specific handling
|
||||
if dataset_base_type in DATASET_HANDLERS:
|
||||
handler = DATASET_HANDLERS[dataset_base_type]
|
||||
return handler(dataset_prompt_style, tokenizer, cfg, dataset, dataset_kwargs)
|
||||
|
||||
# Unhandled dataset type
|
||||
handle_unknown_dataset_strategy(dataset_config)
|
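A hedged example of calling the dispatcher for an alpaca-style dataset; the tokenizer, cfg, and raw_dataset objects are assumed to already exist, and the path/type values are illustrative:

from axolotl.utils.data.wrappers import get_dataset_wrapper
from axolotl.utils.dict import DictDefault

ds_config = DictDefault({"path": "tatsu-lab/alpaca", "type": "alpaca"})
wrapped_ds, prompter = get_dataset_wrapper(
    dataset_config=ds_config,
    tokenizer=tokenizer,
    cfg=cfg,
    dataset_base_type="alpaca",
    dataset=raw_dataset,
)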
||||
|
||||
|
||||
def _is_dataset_already_tokenized(dataset: Dataset | IterableDataset) -> bool:
|
||||
"""Check if the dataset is already tokenized."""
|
||||
return (
|
||||
isinstance(dataset, Dataset)
|
||||
and "input_ids" in dataset.features
|
||||
and "attention_mask" in dataset.features
|
||||
and "labels" in dataset.features
|
||||
)
|
||||
|
||||
|
||||
def _handle_custom_dataset_type(
|
||||
dataset_config: DictDefault,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a custom dataset type defined in the configuration."""
|
||||
dataset_strategy = cast(
|
||||
PromptTokenizingStrategy,
|
||||
load("user_defined", tokenizer, cfg, dataset_config.type.to_dict()),
|
||||
)
|
||||
dataset_prompter = UnsupportedPrompter()
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_bradley_terry_dataset(
|
||||
dataset_config: DictDefault,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter | None]:
|
||||
"""Handle a Bradley-Terry dataset."""
|
||||
bt_type = dataset_config.type.split(".", 1)[1]
|
||||
dataset_strategy = bradley_terry_load(bt_type, tokenizer, cfg, dataset_config)
|
||||
|
||||
if not dataset_strategy:
|
||||
handle_unknown_dataset_strategy(dataset_config)
|
||||
|
||||
dataset_prompter = UnsupportedPrompter()
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_stepwise_supervised_dataset(
|
||||
dataset_config: DictDefault,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a stepwise supervised dataset."""
|
||||
dataset_prompter = UnsupportedPrompter()
|
||||
dataset_strategy = load(dataset_config.type, tokenizer, cfg, dataset_config)
|
||||
|
||||
# We need to explicitly cast boolean labels to int
|
||||
# for compatibility with how trl's PRMTrainer works
|
||||
if isinstance(dataset, Dataset):
|
||||
dataset = dataset.cast_column("labels", Sequence(Value("int64")))
|
||||
|
||||
dataset_wrapper = TokenizedPromptDataset(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_loaded_strategy(
|
||||
dataset_strategy: PromptTokenizingStrategy | DatasetWrappingStrategy,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter | None]:
|
||||
"""Handle a dataset with a strategy loaded from the registry."""
|
||||
if isinstance(dataset_strategy, DatasetWrappingStrategy):
|
||||
return dataset_strategy.wrap_dataset(dataset, **dataset_kwargs), None
|
||||
|
||||
dataset_prompter = UnsupportedPrompter()
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_alpaca_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle an Alpaca dataset."""
|
||||
dataset_prompter = AlpacaPrompter(dataset_prompt_style)
|
||||
dataset_strategy = AlpacaPromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_explainchoice_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle an ExplainChoice dataset."""
|
||||
dataset_prompter = MultipleChoiceExplainPrompter(dataset_prompt_style)
|
||||
dataset_strategy = AlpacaMultipleChoicePromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_concisechoice_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a ConciseChoice dataset."""
|
||||
dataset_prompter = MultipleChoiceConcisePrompter(dataset_prompt_style)
|
||||
dataset_strategy = AlpacaMultipleChoicePromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_summarizetldr_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a SummarizeTLDR dataset."""
|
||||
dataset_prompter = SummarizeTLDRPrompter(dataset_prompt_style)
|
||||
dataset_strategy = SummarizeTLDRPromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_jeopardy_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a Jeopardy dataset."""
|
||||
dataset_prompter = JeopardyPrompter(dataset_prompt_style)
|
||||
dataset_strategy = JeopardyPromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_oasst_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle an OpenAssistant dataset."""
|
||||
dataset_prompter = AlpacaPrompter(dataset_prompt_style)
|
||||
dataset_strategy = OpenAssistantPromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_gpteacher_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a GPTeacher dataset."""
|
||||
dataset_prompter = GPTeacherPrompter(dataset_prompt_style)
|
||||
dataset_strategy = GPTeacherPromptTokenizingStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
def _handle_reflection_dataset(
|
||||
dataset_prompt_style: str | None,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
cfg: DictDefault,
|
||||
dataset: Dataset | IterableDataset,
|
||||
dataset_kwargs: dict[str, Any],
|
||||
) -> tuple[Dataset | IterableDataset, Prompter]:
|
||||
"""Handle a Reflection dataset."""
|
||||
dataset_prompter = ReflectAlpacaPrompter(dataset_prompt_style)
|
||||
dataset_strategy = AlpacaReflectionPTStrategy(
|
||||
dataset_prompter,
|
||||
tokenizer,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
)
|
||||
dataset_wrapper = wrap_dataset_for_tokenized_prompt(
|
||||
dataset_strategy,
|
||||
dataset,
|
||||
**dataset_kwargs,
|
||||
)
|
||||
return dataset_wrapper, dataset_prompter
|
||||
|
||||
|
||||
DATASET_HANDLERS = {
|
||||
"alpaca": _handle_alpaca_dataset,
|
||||
"explainchoice": _handle_explainchoice_dataset,
|
||||
"concisechoice": _handle_concisechoice_dataset,
|
||||
"summarizetldr": _handle_summarizetldr_dataset,
|
||||
"jeopardy": _handle_jeopardy_dataset,
|
||||
"oasst": _handle_oasst_dataset,
|
||||
"gpteacher": _handle_gpteacher_dataset,
|
||||
"reflection": _handle_reflection_dataset,
|
||||
}
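For orientation, a hedged sketch of how a configured `type` value resolves against this table; the `alpaca.chat` example and the base/style split mirror the `split(".", 1)` convention used by the Bradley-Terry branch above and are illustrative rather than authoritative.

# Within this module's context (tokenizer, cfg, dataset, dataset_kwargs built as in get_dataset_wrapper):
config_type = "alpaca.chat"                     # hypothetical `type:` value from the dataset config
base_type, _, prompt_style = config_type.partition(".")
handler = DATASET_HANDLERS[base_type]           # -> _handle_alpaca_dataset
# wrapped, prompter = handler(prompt_style or None, tokenizer, cfg, dataset, dataset_kwargs)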
|
||||
@@ -1,6 +1,4 @@
|
||||
"""
|
||||
utility helpers for distributed checks
|
||||
"""
|
||||
"""Utilities for distributed functionality."""
|
||||
|
||||
import os
|
||||
import pickle # nosec
|
||||
@@ -19,7 +17,7 @@ from transformers.utils.import_utils import (
|
||||
distributed_state = None # pylint: disable=invalid-name
|
||||
|
||||
|
||||
def get_device_type():
|
||||
def get_device_type() -> torch.device:
|
||||
device = torch.device("cpu")
|
||||
if is_torch_cuda_available():
|
||||
device = torch.device("cuda")
|
||||
@@ -30,7 +28,7 @@ def get_device_type():
|
||||
return device
|
||||
|
||||
|
||||
def get_device_count():
|
||||
def get_device_count() -> int:
|
||||
cur_device = get_device_type()
|
||||
if "cuda" in str(cur_device):
|
||||
return torch.cuda.device_count()
|
||||
@@ -39,7 +37,7 @@ def get_device_count():
|
||||
return 1
|
||||
|
||||
|
||||
def get_current_device():
|
||||
def get_current_device() -> int:
|
||||
cur_device = get_device_type()
|
||||
if "cuda" in str(cur_device):
|
||||
return torch.cuda.current_device()
|
||||
@@ -48,12 +46,14 @@ def get_current_device():
|
||||
return 0
|
||||
|
||||
|
||||
def is_distributed():
|
||||
"""
|
||||
Check if distributed training is initialized.
|
||||
"""
|
||||
def get_distributed_state() -> PartialState | None:
|
||||
return distributed_state
|
||||
|
||||
|
||||
def is_distributed() -> bool:
|
||||
"""Check if distributed training is initialized."""
|
||||
global distributed_state # pylint: disable=global-statement
|
||||
if not distributed_state:
|
||||
if distributed_state is None:
|
||||
timeout = int(os.environ.get("AXOLOTL_NCCL_TIMEOUT", 1800))
|
||||
distributed_state = PartialState(timeout=timedelta(seconds=timeout))
|
||||
|
||||
@@ -69,31 +69,28 @@ def barrier():
|
||||
dist.barrier()
|
||||
|
||||
|
||||
def is_main_process(use_environ=False):
|
||||
def is_main_process() -> bool:
|
||||
"""
|
||||
Check if the current process is the main process. If not in distributed mode,
|
||||
always return `True`.
|
||||
|
||||
Args:
|
||||
- use_environ (bool, optional): Use environment variable to determine main process.
|
||||
|
||||
Returns:
|
||||
- bool: `True` if the current process is the main process, `False` otherwise.
|
||||
`True` if the current process is the main process, `False` otherwise.
|
||||
"""
|
||||
if use_environ:
|
||||
if get_distributed_state() is None:
|
||||
return os.environ.get("LOCAL_RANK", "0") == "0"
|
||||
if not is_distributed():
|
||||
return True
|
||||
return dist.get_rank() == 0
|
||||
|
||||
|
||||
def is_local_main_process(use_environ=False):
|
||||
if use_environ:
|
||||
def is_local_main_process() -> bool:
|
||||
if get_distributed_state() is None:
|
||||
return os.environ.get("LOCAL_RANK", "0") == "0"
|
||||
return PartialState().is_local_main_process
|
||||
|
||||
|
||||
def get_world_size():
|
||||
def get_world_size() -> int:
|
||||
return int(os.getenv("WORLD_SIZE", "1"))
|
||||
|
||||
|
||||
@@ -115,7 +112,7 @@ def cleanup_distributed():
|
||||
|
||||
|
||||
@contextmanager
|
||||
def zero_first(is_main):
|
||||
def zero_first(is_main: bool):
|
||||
"""
|
||||
runs the wrapped context so that rank 0 runs first before other ranks
|
||||
"""
|
||||
|
||||
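A typical use of `zero_first`, sketched on the assumption that it is exported from `axolotl.utils.distributed` alongside the helpers above; `prepare_cache` is a hypothetical stand-in for any work that should run on rank 0 before the other ranks proceed.

from axolotl.utils.distributed import is_main_process, zero_first

def prepare_cache():
    ...  # hypothetical: download or tokenize a dataset exactly once

with zero_first(is_main_process()):
    prepare_cache()  # rank 0 executes first; the remaining ranks wait, then continue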
@@ -5,9 +5,8 @@ module to freeze/unfreeze parameters by name
|
||||
import re
|
||||
from typing import Callable, List, Tuple, Union
|
||||
|
||||
from accelerate.logging import get_logger
|
||||
|
||||
from axolotl.utils.distributed import is_main_process
|
||||
from axolotl.utils.logging import get_logger
|
||||
|
||||
LOG = get_logger(__name__)
|
||||
|
||||
|
||||
@@ -1,6 +1,4 @@
|
||||
"""
|
||||
logging helpers to only log on main process
|
||||
"""
|
||||
"""Logging helpers to only log on main process."""
|
||||
|
||||
import functools
|
||||
import logging
|
||||
@@ -14,27 +12,18 @@ from axolotl.utils.distributed import is_main_process
|
||||
|
||||
class MultiProcessAdapter(logging.LoggerAdapter):
|
||||
"""
|
||||
logger adapter for distributed logging, specifically to only log on main process
|
||||
Logger adapter for distributed logging, specifically to only log on main process.
|
||||
"""
|
||||
|
||||
def __init__(self, logger, use_environ=False, extra=None):
|
||||
super().__init__(logger, extra)
|
||||
self.use_environ = use_environ
|
||||
|
||||
@staticmethod
|
||||
def _should_log(main_process_only, use_environ=False):
|
||||
return not main_process_only or (
|
||||
main_process_only and is_main_process(use_environ=use_environ)
|
||||
)
|
||||
def _should_log(main_process_only: bool):
|
||||
return not main_process_only or is_main_process()
|
||||
|
||||
def log(self, level, msg, *args, **kwargs):
|
||||
use_environ = kwargs.pop("use_environ", self.use_environ)
|
||||
main_process_only = kwargs.pop("main_process_only", True)
|
||||
kwargs.setdefault("stacklevel", 2)
|
||||
|
||||
if self.isEnabledFor(level) and self._should_log(
|
||||
main_process_only, use_environ=use_environ
|
||||
):
|
||||
if self.isEnabledFor(level) and self._should_log(main_process_only):
|
||||
msg, kwargs = self.process(msg, kwargs)
|
||||
self.logger.log(level, msg, *args, **kwargs)
|
||||
|
||||
@@ -50,13 +39,11 @@ class MultiProcessAdapter(logging.LoggerAdapter):
|
||||
self.warning(*args, **kwargs)
|
||||
|
||||
|
||||
def get_logger(
|
||||
name: str, log_level: str | None = None, use_environ: bool = False
|
||||
) -> MultiProcessAdapter:
|
||||
def get_logger(name: str, log_level: str | None = None) -> MultiProcessAdapter:
|
||||
if log_level is None:
|
||||
log_level = os.environ.get("AXOLOTL_LOG_LEVEL", None)
|
||||
logger = logging.getLogger(name)
|
||||
if log_level is not None:
|
||||
logger.setLevel(log_level.upper())
|
||||
logger.root.setLevel(log_level.upper())
|
||||
return MultiProcessAdapter(logger, use_environ=use_environ, extra={})
|
||||
return MultiProcessAdapter(logger, extra={})
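With the `use_environ` plumbing gone, per-rank gating is controlled solely by `main_process_only`; a short usage sketch:

from axolotl.utils.logging import get_logger

LOG = get_logger(__name__)
LOG.info("logged on the main process only")                 # main_process_only defaults to True
LOG.info("logged on every rank", main_process_only=False)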
|
||||
|
||||
567
src/axolotl/utils/mistral_tokenizer.py
Normal file
567
src/axolotl/utils/mistral_tokenizer.py
Normal file
@@ -0,0 +1,567 @@
|
||||
"""Wrapper for MistralTokenizer from mistral-common"""
|
||||
|
||||
import math
|
||||
import os
|
||||
from shutil import copyfile
|
||||
from typing import TYPE_CHECKING, Optional
|
||||
|
||||
import numpy as np
|
||||
from huggingface_hub import hf_hub_download
|
||||
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
|
||||
from mistral_common.tokens.tokenizers.tekken import Tekkenizer
|
||||
from torch import Tensor
|
||||
from transformers.utils import PaddingStrategy
|
||||
|
||||
from axolotl.utils.collators.core import IGNORE_INDEX
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from mistral_common.protocol.instruct.request import ChatCompletionRequest
|
||||
|
||||
|
||||
def _get_file_path(path_or_repo_id: str, filename: str) -> str:
|
||||
"""Get the file path from local or HF Hub"""
|
||||
if os.path.exists(path_or_repo_id):
|
||||
maybe_file_path = os.path.join(path_or_repo_id, filename)
|
||||
if os.path.exists(maybe_file_path):
|
||||
return maybe_file_path
|
||||
|
||||
raise FileNotFoundError(f"File not found at {path_or_repo_id}")
|
||||
|
||||
return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
|
||||
|
||||
|
||||
class HFMistralTokenizer:
|
||||
"""
|
||||
Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer
|
||||
and exposes HuggingFace API for special tokens.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, mistral: MistralTokenizer, name_or_path: str, tokenizer_path: str
|
||||
):
|
||||
"""
|
||||
Args:
|
||||
mistral: The mistral-common tokenizer to wrap.
|
||||
name_or_path: The name or path to the tokenizer files or the repo id.
tokenizer_path: The local path to the resolved tokenizer file (e.g. tekken.json).
|
||||
"""
|
||||
self._mistral = mistral
|
||||
self._padding_side = "right"
|
||||
self._name_or_path = name_or_path
|
||||
self._tokenizer_path = tokenizer_path
|
||||
|
||||
# Manually switch the request validator to finetuning mode
|
||||
from mistral_common.protocol.instruct.validator import (
|
||||
MistralRequestValidator,
|
||||
ValidationMode,
|
||||
)
|
||||
|
||||
# Check if MistralRequestValidator has a _mode attribute.
|
||||
# This is a private API and may change in the future.
|
||||
# pylint: disable=protected-access
|
||||
if not (
|
||||
hasattr(self._mistral, "_chat_completion_request_validator")
|
||||
and isinstance(
|
||||
self._mistral._chat_completion_request_validator,
|
||||
MistralRequestValidator,
|
||||
)
|
||||
and hasattr(self._mistral._chat_completion_request_validator, "_mode")
|
||||
):
|
||||
raise RuntimeError(
|
||||
"Unable to switch mistral tokenizer to finetuning mode – "
|
||||
"private API `_chat_completion_request_validator._mode` missing."
|
||||
)
|
||||
|
||||
self._mistral._chat_completion_request_validator._mode = (
|
||||
ValidationMode.finetuning
|
||||
)
|
||||
|
||||
def _load_system_prompt(self, path_or_repo_id: str) -> str:
|
||||
"""Load system prompt from local or HF Hub.
|
||||
|
||||
Note: Unused for now as we don't want to explicitly set the system prompt if a user does
|
||||
not provide one.
|
||||
|
||||
Args:
|
||||
path_or_repo_id: The path to the tokenizer files or the repo id.
|
||||
|
||||
Returns:
|
||||
The system prompt.
|
||||
"""
|
||||
file_path = _get_file_path(path_or_repo_id, "SYSTEM_PROMPT.txt")
|
||||
|
||||
if not os.path.exists(file_path):
|
||||
raise FileNotFoundError(f"System prompt file not found at {file_path}")
|
||||
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
return file.read()
|
||||
|
||||
@property
|
||||
def bos_token_id(self) -> int:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.bos_id
|
||||
|
||||
@property
|
||||
def eos_token_id(self) -> int:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.eos_id
|
||||
|
||||
@property
|
||||
def pad_token_id(self) -> int:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.pad_id
|
||||
|
||||
@property
|
||||
def unk_token_id(self) -> int:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.unk_id
|
||||
|
||||
@property
|
||||
def bos_token(self) -> str:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.id_to_piece(self.bos_token_id)
|
||||
|
||||
@property
|
||||
def eos_token(self) -> str:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.id_to_piece(self.eos_token_id)
|
||||
|
||||
@property
|
||||
def pad_token(self) -> str:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.id_to_piece(self.pad_token_id)
|
||||
|
||||
@property
|
||||
def unk_token(self) -> str:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.id_to_piece(self.unk_token_id)
|
||||
|
||||
@property
|
||||
def padding_side(self) -> str:
|
||||
return self._padding_side
|
||||
|
||||
@property
|
||||
def name_or_path(self) -> str:
|
||||
return self._name_or_path
|
||||
|
||||
@property
|
||||
def chat_template(self) -> str | None:
|
||||
"""Chat template is not supported. Dummy method to satisfy HuggingFace API."""
|
||||
return None
|
||||
|
||||
def __len__(self) -> int:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.n_words
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(
|
||||
cls,
|
||||
name_or_path: str,
|
||||
*,
|
||||
revision: Optional[str] = None,
|
||||
**kwargs, # pylint: disable=unused-argument
|
||||
) -> "HFMistralTokenizer":
|
||||
"""
|
||||
Load a mistral tekken tokenizer from a local file or HF Hub and wrap it.
|
||||
|
||||
Args:
|
||||
name_or_path: The path to the tokenizer files or the repo id.
|
||||
revision: The revision of the tokenizer to download.
|
||||
kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
A HFMistralTokenizer instance.
|
||||
"""
|
||||
if revision:
|
||||
raise NotImplementedError(
|
||||
"Revision not supported yet for mistral-common tokenizer"
|
||||
)
|
||||
|
||||
# only support Tekken tokenizer for now
|
||||
# downloads from HF Hub if not local
|
||||
tokenizer_path = _get_file_path(name_or_path, "tekken.json")
|
||||
|
||||
base = MistralTokenizer.from_file(tokenizer_path)
|
||||
|
||||
return cls(
|
||||
base,
|
||||
name_or_path=name_or_path,
|
||||
tokenizer_path=tokenizer_path,
|
||||
)
|
||||
|
||||
def save_pretrained(self, save_directory: str) -> None:
|
||||
"""
|
||||
Save the Tekken/SentencePiece model file so that from_pretrained can pick it up again.
|
||||
|
||||
Only Tekken models are supported.
|
||||
|
||||
Args:
|
||||
save_directory: The directory to save the tokenizer files.
|
||||
"""
|
||||
inner = self._mistral.instruct_tokenizer.tokenizer
|
||||
if isinstance(inner, Tekkenizer):
|
||||
# Create the directory and save the model
|
||||
try:
|
||||
os.makedirs(save_directory, exist_ok=True)
|
||||
|
||||
# Verify directory was created
|
||||
if not os.path.exists(save_directory):
|
||||
raise RuntimeError(f"Failed to create directory: {save_directory}")
|
||||
|
||||
# Verify source file exists
|
||||
if not os.path.exists(self._tokenizer_path):
|
||||
raise FileNotFoundError(
|
||||
f"Source tokenizer file not found: {self._tokenizer_path}"
|
||||
)
|
||||
|
||||
destination_path = os.path.join(save_directory, "tekken.json")
|
||||
copyfile(self._tokenizer_path, destination_path)
|
||||
|
||||
except Exception as e:
|
||||
raise RuntimeError(
|
||||
f"Failed to save tokenizer to {save_directory}: {e}. "
|
||||
f"Source path: {self._tokenizer_path}, "
|
||||
f"Directory exists: {os.path.exists(save_directory)}"
|
||||
) from e
|
||||
|
||||
else:
|
||||
raise RuntimeError(f"Unknown tokenizer type: {type(inner)}")
|
||||
|
||||
def encode(self, text: str, add_special_tokens: bool = True) -> list[int]:
|
||||
"""
|
||||
Encode a text string into a list of token IDs.
|
||||
|
||||
Args:
|
||||
text: The text string to encode.
|
||||
add_special_tokens: Whether to add special tokens to the encoded tokens.
|
||||
|
||||
Returns:
|
||||
A list of token IDs.
|
||||
"""
|
||||
return self._mistral.instruct_tokenizer.tokenizer.encode(
|
||||
text,
|
||||
bos=add_special_tokens,
|
||||
eos=add_special_tokens,
|
||||
)
|
||||
|
||||
def decode(
|
||||
self, token_ids: int | list[int], skip_special_tokens: bool = False
|
||||
) -> str:
|
||||
"""
|
||||
Decode a list of token IDs into a text string.
|
||||
|
||||
Args:
|
||||
token_ids: The int or list of token IDs to decode.
|
||||
skip_special_tokens: Whether to skip special tokens in the decoded text.
|
||||
|
||||
Returns:
|
||||
The decoded text string.
|
||||
"""
|
||||
if isinstance(token_ids, int):
|
||||
token_ids = [token_ids]
|
||||
|
||||
if skip_special_tokens:
|
||||
return self._mistral.instruct_tokenizer.tokenizer.decode(token_ids)
|
||||
|
||||
# to_string returns a string with special tokens
|
||||
return self._mistral.instruct_tokenizer.tokenizer.to_string(token_ids)
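A round-trip sketch of `encode`/`decode`; the import path follows the file location above and the repo id is hypothetical (it only needs to ship a `tekken.json`).

from axolotl.utils.mistral_tokenizer import HFMistralTokenizer

tok = HFMistralTokenizer.from_pretrained("org/tekken-model")   # hypothetical repo id
ids = tok.encode("Hello, world!", add_special_tokens=True)     # BOS/EOS included
print(tok.decode(ids))                                         # string with special tokens
print(tok.decode(ids, skip_special_tokens=True))               # plain text only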
|
||||
|
||||
def _create_mistral_chat_completion_request(
|
||||
self, conversation: list[dict], tools: list[dict] | None = None
|
||||
) -> "ChatCompletionRequest":
|
||||
from mistral_common.protocol.instruct.messages import (
|
||||
AssistantMessage,
|
||||
SystemMessage,
|
||||
ToolMessage,
|
||||
UserMessage,
|
||||
)
|
||||
from mistral_common.protocol.instruct.request import ChatCompletionRequest
|
||||
from mistral_common.protocol.instruct.tool_calls import Function, Tool
|
||||
|
||||
messages: list[UserMessage | AssistantMessage | ToolMessage | SystemMessage] = (
|
||||
[]
|
||||
)
|
||||
for turn in conversation:
|
||||
role = turn.get("role")
|
||||
|
||||
if role == "user":
|
||||
messages.append(UserMessage(content=turn["content"]))
|
||||
elif role == "assistant":
|
||||
messages.append(
|
||||
AssistantMessage(
|
||||
content=turn.get("content"),
|
||||
tool_calls=turn.get("tool_calls"),
|
||||
)
|
||||
)
|
||||
elif role == "tool":
|
||||
messages.append(
|
||||
ToolMessage(
|
||||
content=turn.get("content"),
|
||||
tool_call_id=turn.get("tool_call_id"),
|
||||
name=turn.get("name"),
|
||||
)
|
||||
)
|
||||
elif role == "system":
|
||||
messages.append(SystemMessage(content=turn["content"]))
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Unknown role for use with mistral-common tokenizer: {turn['role']}"
|
||||
)
|
||||
|
||||
tool_calls: list[Tool] = []
|
||||
if tools:
|
||||
# convert to Tool
|
||||
for tool in tools:
|
||||
if tool["type"] != "function":
|
||||
continue
|
||||
|
||||
function = tool["function"]
|
||||
|
||||
tool_calls.append(
|
||||
Tool(
|
||||
function=Function(
|
||||
name=function["name"],
|
||||
description=function["description"],
|
||||
# set parameters to empty dict if not provided
|
||||
parameters=function.get("parameters", {}),
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
chat_completion: ChatCompletionRequest = ChatCompletionRequest(
|
||||
messages=messages,
|
||||
tools=tool_calls,
|
||||
)
|
||||
|
||||
return chat_completion
|
||||
|
||||
def apply_chat_template(
|
||||
self,
|
||||
messages: list[dict],
|
||||
tokenize: bool = True,
|
||||
tools: list[dict] | None = None,
|
||||
chat_template: str | None = None, # pylint: disable=unused-argument
|
||||
add_generation_prompt: bool = False, # pylint: disable=unused-argument
|
||||
) -> list[int] | str:
|
||||
if chat_template:
|
||||
raise NotImplementedError("chat_template not supported yet")
|
||||
|
||||
if add_generation_prompt:
|
||||
raise NotImplementedError("add_generation_prompt not supported yet")
|
||||
|
||||
chat_completion: ChatCompletionRequest = (
|
||||
self._create_mistral_chat_completion_request(messages, tools)
|
||||
)
|
||||
|
||||
tokens: list[int] = self._mistral.encode_chat_completion(chat_completion).tokens
|
||||
|
||||
if tokenize:
|
||||
return tokens
|
||||
|
||||
return self.decode(tokens)
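A conversation-level sketch using the roles accepted by `_create_mistral_chat_completion_request` above; `tok` is an `HFMistralTokenizer` instantiated as in the earlier sketch.

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name the largest planet."},
    {"role": "assistant", "content": "Jupiter."},
]
token_ids = tok.apply_chat_template(messages, tokenize=True)    # list[int]
rendered = tok.apply_chat_template(messages, tokenize=False)    # decoded string, special tokens kept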
|
||||
|
||||
def pad(
|
||||
self,
|
||||
features: list[dict[str, list[int] | np.ndarray]],
|
||||
*,
|
||||
padding: bool | str | PaddingStrategy = True,
|
||||
max_length: int | None = None,
|
||||
pad_to_multiple_of: int | None = None,
|
||||
return_tensors: str | None = None, # "np", "pt", or "tf"
|
||||
) -> dict[str, np.ndarray | Tensor]:
|
||||
"""
|
||||
HF-style pad method that properly handles all sequence-related features:
|
||||
- pad 'input_ids' & 'labels' to the longest (or to max_length)
|
||||
"""
|
||||
import torch
|
||||
from torch.nn import functional as F
|
||||
|
||||
# Check for unsupported fields
|
||||
if any("token_type_ids" in f for f in features):
|
||||
raise ValueError("token_type_ids is not supported by this tokenizer")
|
||||
|
||||
# Determine desired sequence length
|
||||
lengths = [len(f["input_ids"]) for f in features]
|
||||
if padding in (True, "longest", PaddingStrategy.LONGEST):
|
||||
target_length = max(lengths)
|
||||
elif padding in ("max_length", PaddingStrategy.MAX_LENGTH):
|
||||
if max_length is None:
|
||||
raise ValueError("max_length must be set for 'max_length' padding")
|
||||
target_length = max_length
|
||||
elif padding in (False, "do_not_pad", PaddingStrategy.DO_NOT_PAD):
|
||||
target_length = None
|
||||
else:
|
||||
raise ValueError(f"Unknown padding strategy: {padding}")
|
||||
|
||||
# Apply pad_to_multiple_of
|
||||
if target_length is not None and pad_to_multiple_of is not None:
|
||||
target_length = (
|
||||
math.ceil(target_length / pad_to_multiple_of) * pad_to_multiple_of
|
||||
)
|
||||
|
||||
# If no padding requested, just stack tensors
|
||||
do_pad = target_length is not None
|
||||
|
||||
# Pad sequences using torch.nn.utils.rnn.pad_sequence
|
||||
input_ids = torch.nn.utils.rnn.pad_sequence(
|
||||
[torch.tensor(x["input_ids"], dtype=torch.long) for x in features],
|
||||
batch_first=True,
|
||||
padding_value=self.pad_token_id if self.pad_token_id is not None else 0,
|
||||
)
|
||||
|
||||
labels = torch.nn.utils.rnn.pad_sequence(
|
||||
[torch.tensor(x["labels"], dtype=torch.long) for x in features],
|
||||
batch_first=True,
|
||||
padding_value=IGNORE_INDEX,
|
||||
)
|
||||
|
||||
attention_mask = torch.nn.utils.rnn.pad_sequence(
|
||||
[torch.tensor(x["attention_mask"], dtype=torch.long) for x in features],
|
||||
batch_first=True,
|
||||
padding_value=0,
|
||||
)
|
||||
|
||||
# Handle position_ids - pad with sequential values for right padding, 0s for left padding
|
||||
if "position_ids" in features[0]:
|
||||
if self.padding_side == "left":
|
||||
# Likely not needed, but keeping for now
|
||||
# For left padding, we'll pad with 0s using pad_sequence, then handle manually
|
||||
position_ids = torch.nn.utils.rnn.pad_sequence(
|
||||
[
|
||||
torch.tensor(x["position_ids"], dtype=torch.long)
|
||||
for x in features
|
||||
],
|
||||
batch_first=True,
|
||||
padding_value=0,
|
||||
)
|
||||
else:
|
||||
# For right padding, continue the sequence
|
||||
max_pos_len = max(len(f["position_ids"]) for f in features)
|
||||
position_ids_list = []
|
||||
for f in features:
|
||||
pos_seq = torch.tensor(f["position_ids"], dtype=torch.long)
|
||||
if len(pos_seq) < max_pos_len:
|
||||
# Continue the sequence
|
||||
last_pos = pos_seq[-1].item() if len(pos_seq) > 0 else -1
|
||||
pad_len = max_pos_len - len(pos_seq)
|
||||
pad_positions = torch.arange(
|
||||
last_pos + 1, last_pos + 1 + pad_len, dtype=torch.long
|
||||
)
|
||||
pos_seq = torch.cat([pos_seq, pad_positions])
|
||||
position_ids_list.append(pos_seq)
|
||||
position_ids = torch.stack(position_ids_list)
|
||||
else:
|
||||
# Create position_ids if not present
|
||||
seq_len = input_ids.size(1)
|
||||
position_ids = (
|
||||
torch.arange(seq_len, dtype=torch.long)
|
||||
.unsqueeze(0)
|
||||
.expand(input_ids.size(0), -1)
|
||||
)
|
||||
|
||||
# Ensure all tensors have the same sequence length
|
||||
max_seq_len = max(
|
||||
input_ids.size(1),
|
||||
labels.size(1),
|
||||
attention_mask.size(1),
|
||||
position_ids.size(1),
|
||||
)
|
||||
|
||||
# TODO: check whether trimming is needed and correct.
|
||||
|
||||
if do_pad and target_length is not None:
|
||||
max_seq_len = target_length
|
||||
|
||||
# Pad all tensors to the same length
|
||||
if input_ids.size(1) < max_seq_len:
|
||||
pad_len = max_seq_len - input_ids.size(1)
|
||||
if self.padding_side == "right":
|
||||
input_ids = F.pad(
|
||||
input_ids,
|
||||
(0, pad_len),
|
||||
value=self.pad_token_id if self.pad_token_id is not None else 0,
|
||||
)
|
||||
else:
|
||||
input_ids = F.pad(
|
||||
input_ids,
|
||||
(pad_len, 0),
|
||||
value=self.pad_token_id if self.pad_token_id is not None else 0,
|
||||
)
|
||||
elif input_ids.size(1) > max_seq_len:
|
||||
input_ids = input_ids[:, :max_seq_len]
|
||||
|
||||
if labels.size(1) < max_seq_len:
|
||||
pad_len = max_seq_len - labels.size(1)
|
||||
if self.padding_side == "right":
|
||||
labels = F.pad(labels, (0, pad_len), value=IGNORE_INDEX)
|
||||
else:
|
||||
labels = F.pad(labels, (pad_len, 0), value=IGNORE_INDEX)
|
||||
elif labels.size(1) > max_seq_len:
|
||||
labels = labels[:, :max_seq_len]
|
||||
|
||||
if attention_mask.size(1) < max_seq_len:
|
||||
pad_len = max_seq_len - attention_mask.size(1)
|
||||
if self.padding_side == "right":
|
||||
attention_mask = F.pad(attention_mask, (0, pad_len), value=0)
|
||||
else:
|
||||
attention_mask = F.pad(attention_mask, (pad_len, 0), value=0)
|
||||
elif attention_mask.size(1) > max_seq_len:
|
||||
attention_mask = attention_mask[:, :max_seq_len]
|
||||
|
||||
if position_ids.size(1) < max_seq_len:
|
||||
pad_len = max_seq_len - position_ids.size(1)
|
||||
if self.padding_side == "right":
|
||||
batch_size = position_ids.size(0)
|
||||
new_position_ids = []
|
||||
for i in range(batch_size):
|
||||
seq = position_ids[i]
|
||||
if len(seq) > 0:
|
||||
# get last position and pad with sequential values
|
||||
last_pos = seq[-1].item()
|
||||
pad_positions = torch.arange(
|
||||
last_pos + 1, last_pos + 1 + pad_len, dtype=torch.long
|
||||
)
|
||||
new_seq = torch.cat([seq, pad_positions])
|
||||
else:
|
||||
new_seq = torch.arange(pad_len, dtype=torch.long)
|
||||
new_position_ids.append(new_seq)
|
||||
position_ids = torch.stack(new_position_ids)
|
||||
else:
|
||||
position_ids = F.pad(position_ids, (pad_len, 0), value=0)
|
||||
elif position_ids.size(1) > max_seq_len:
|
||||
position_ids = position_ids[:, :max_seq_len]
|
||||
|
||||
final_batch = {
|
||||
"input_ids": input_ids,
|
||||
"labels": labels,
|
||||
"attention_mask": attention_mask,
|
||||
"position_ids": position_ids,
|
||||
}
|
||||
|
||||
# Handle non-sequence fields (raise error)
|
||||
sequence_fields = {"input_ids", "labels", "attention_mask", "position_ids"}
|
||||
for f in features:
|
||||
for key in f.keys():
|
||||
if key not in sequence_fields:
|
||||
raise NotImplementedError(
|
||||
f"Non-sequence field {key} not handled yet"
|
||||
)
|
||||
|
||||
# Convert to requested tensor type
|
||||
if return_tensors is None or return_tensors == "np":
|
||||
result = {}
|
||||
for k, v in final_batch.items():
|
||||
if isinstance(v, torch.Tensor):
|
||||
result[k] = v.numpy().astype(np.int64)
|
||||
else:
|
||||
result[k] = v
|
||||
return result
|
||||
|
||||
if return_tensors == "pt":
|
||||
return final_batch
|
||||
|
||||
raise ValueError(f"Unsupported return_tensors='{return_tensors}'")
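A padding sketch for the collator path; the ids are illustrative, and `IGNORE_INDEX` is the label-masking value imported at the top of this file.

features = [
    {"input_ids": [1, 11, 12, 2], "labels": [-100, 11, 12, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [1, 21, 2], "labels": [-100, 21, 2], "attention_mask": [1, 1, 1]},
]
batch = tok.pad(features, padding=True, pad_to_multiple_of=8, return_tensors="pt")
# batch["input_ids"].shape == (2, 8); labels are right-padded with IGNORE_INDEX,
# attention_mask with 0, and position_ids continue sequentially past each sequence.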
|
||||
|
||||
def convert_ids_to_tokens(self, ids: list[int]) -> list[str]:
|
||||
"""
|
||||
Convert a list of token IDs to a list of tokens.
|
||||
|
||||
Args:
|
||||
ids: The list of token IDs to convert.
|
||||
|
||||
Returns:
|
||||
The list of tokens.
|
||||
"""
|
||||
return [
|
||||
self._mistral.instruct_tokenizer.tokenizer.id_to_piece(id) for id in ids
|
||||
]
|
||||
@@ -3,6 +3,7 @@ Multipack Batch Sampler - An efficient batch sampler for packing variable-length
|
||||
into fixed-capacity batches to optimize memory usage and training throughput.
|
||||
"""
|
||||
|
||||
import gc
|
||||
import math
|
||||
from concurrent.futures import ProcessPoolExecutor
|
||||
from multiprocessing import cpu_count, get_context
|
||||
@@ -145,7 +146,7 @@ def pack_parallel(
|
||||
"""
|
||||
num_items = len(sequence_lengths)
|
||||
if num_processes is None:
|
||||
num_processes = max(1, min(num_items // group_size, cpu_count()))
|
||||
num_processes = max(1, min(num_items // group_size, cpu_count(), 16))
|
||||
|
||||
# Create tasks for parallel processing
|
||||
tasks = []
|
||||
@@ -259,7 +260,7 @@ class MultipackBatchSampler(BatchSampler):
|
||||
lengths: np.ndarray, # Sequence lengths
|
||||
packing_efficiency_estimate: float = 1.0, # Initial efficiency estimate
|
||||
drop_last: bool = True, # Whether to drop final batches (might be incomplete)
|
||||
num_count_samples: int = 16, # Number of times to estimate batch count
|
||||
num_count_samples: int = 8, # Number of times to estimate batch count
|
||||
sequential: bool = False, # Whether to use sequential packing
|
||||
group_size: int = 100_000, # Size of groups for parallel packing
|
||||
bin_size: int = 200, # The max number of samples that can be packed in a single bin
|
||||
@@ -349,6 +350,7 @@ class MultipackBatchSampler(BatchSampler):
|
||||
# Calculate efficiency statistics
|
||||
total_used = lengths.sum()
|
||||
total_slots = len(all_bins) * self.batch_max_len
|
||||
del all_bins
|
||||
|
||||
# Group bins into batches (each batch contains batch_size bins)
|
||||
batches = [
|
||||
@@ -368,6 +370,7 @@ class MultipackBatchSampler(BatchSampler):
|
||||
self.total_token_slots += total_slots
|
||||
|
||||
self._batches = batches
|
||||
gc.collect()
|
||||
return batches
|
||||
|
||||
def __iter__(self) -> Iterator[list[list[int]]]:
|
||||
|
||||
File diff suppressed because it is too large
@@ -1,6 +1,8 @@
|
||||
"""Pydantic models for datasets-related configuration"""
|
||||
|
||||
from pydantic import BaseModel, model_validator
|
||||
from typing import Literal
|
||||
|
||||
from pydantic import BaseModel, Field, model_validator
|
||||
|
||||
from axolotl.utils.schemas.enums import ChatTemplate
|
||||
from axolotl.utils.schemas.utils import handle_legacy_message_fields_logic
|
||||
@@ -9,56 +11,178 @@ from axolotl.utils.schemas.utils import handle_legacy_message_fields_logic
|
||||
class UserDefinedPrompterType(BaseModel):
|
||||
"""Structure for user defined prompt types"""
|
||||
|
||||
system_prompt: str | None = None
|
||||
system_format: str | None = None
|
||||
system_prompt: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Custom user instruction prompt"},
|
||||
)
|
||||
system_format: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Use {system} as key to be replaced"},
|
||||
)
|
||||
field_system: str | None = None
|
||||
field_instruction: str | None = None
|
||||
field_input: str | None = None
|
||||
field_output: str | None = None
|
||||
|
||||
format: str | None = None
|
||||
no_input_format: str | None = None
|
||||
field: str | None = None
|
||||
format: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Customizable to be single line or multi-line. Use {instruction}/{input} as key to be replaced. 'format' can include {input}"
|
||||
},
|
||||
)
|
||||
no_input_format: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "'no_input_format' cannot include {input}"},
|
||||
)
|
||||
field: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
"description": "For `completion` datasets only, uses the provided field instead of `text` column"
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
class SFTDataset(BaseModel):
|
||||
"""SFT configuration subset"""
|
||||
|
||||
path: str | None = None
|
||||
split: str | None = None
|
||||
type: str | UserDefinedPrompterType | None = None
|
||||
path: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "HuggingFace dataset repo | s3:// | gs:// | path to local file or directory"
|
||||
},
|
||||
)
|
||||
split: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "name of dataset split to load from"},
|
||||
)
|
||||
type: str | UserDefinedPrompterType | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]"
|
||||
},
|
||||
)
|
||||
input_transform: str | None = None
|
||||
shards: int | None = None
|
||||
shards_idx: int | None = None
|
||||
preprocess_shards: int | None = None
|
||||
shards: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "split dataset into N pieces (use with shards_idx)"
|
||||
},
|
||||
)
|
||||
shards_idx: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "the index of sharded dataset to use"},
|
||||
)
|
||||
preprocess_shards: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "process dataset in N sequential chunks for memory efficiency (exclusive with `shards`)"
|
||||
},
|
||||
)
|
||||
conversation: str | None = None
|
||||
# Do not make this too strict or it will break the validator to choose different dataset class
|
||||
chat_template: ChatTemplate | str | None = None
|
||||
chat_template_jinja: str | None = None
|
||||
data_files: str | list[str] | None = None
|
||||
chat_template: ChatTemplate | str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "The name of the chat template to use for training, following values are supported: tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default. alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py. tokenizer_default_fallback_*: where * is the name of the chat template to fallback to if the tokenizer does not have a chat template else default to tokenizer. E.g. tokenizer_default_fallback_chatml. jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field."
|
||||
},
|
||||
)
|
||||
chat_template_jinja: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Custom jinja chat template. Used only if `chat_template: jinja` or empty."
|
||||
},
|
||||
)
|
||||
data_files: str | list[str] | None = Field(
|
||||
default=None, json_schema_extra={"description": "path to source data files"}
|
||||
)
|
||||
input_format: str | None = None
|
||||
name: str | None = None
|
||||
ds_type: str | None = None
|
||||
name: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "name of dataset configuration to load"},
|
||||
)
|
||||
ds_type: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "defines the datatype when path is a file"},
|
||||
)
|
||||
field: str | None = None
|
||||
field_human: str | None = None
|
||||
field_model: str | None = None
|
||||
field_messages: str | None = None
|
||||
field_messages: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": 'Key containing the messages (default: "messages")'
|
||||
},
|
||||
)
|
||||
field_tools: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": 'Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).'
|
||||
},
|
||||
)
|
||||
# deprecated, use message_property_mappings
|
||||
message_field_role: str | None = None
|
||||
# deprecated, use message_property_mappings
|
||||
message_field_content: str | None = None
|
||||
message_property_mappings: dict[str, str] | None = None
|
||||
message_field_training: str | None = None
|
||||
message_field_training_detail: str | None = None
|
||||
split_thinking: bool | None = None
|
||||
message_property_mappings: dict[str, str] | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Mapping of properties from the input dataset to the chat template. (default: message_property_mappings={'role':'role', 'content':'content'}) If a property exists in the template but not in this mapping, the system will attempt to load it directly from the message using the property name as the key. Example: In the mapping below, 'from' is loaded from input dataset and used as 'role', while 'value' is loaded and used as 'content' in the chat template."
|
||||
},
|
||||
)
|
||||
message_field_training: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`."
|
||||
},
|
||||
)
|
||||
message_field_training_detail: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "The key in the message turn that contains the training details. Useful to selectively train on certain tokens in a turn. The value of the key is a List[Dict] containing `begin_offset` (start character index in content), `end_offset` (end character index in content), and `train` (boolean whether to train)."
|
||||
},
|
||||
)
|
||||
split_thinking: bool | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "(for Qwen3 template only) Whether to split the assistant content based on a reasoning trace inside delimited tags"
|
||||
},
|
||||
)
|
||||
logprobs_field: str | None = None
|
||||
temperature: float | None = None
|
||||
roles_to_train: list[str] | None = None
|
||||
train_on_eos: str | None = None
|
||||
roles: dict[str, list[str]] | None = None
|
||||
drop_system_message: bool | None = None
|
||||
trust_remote_code: bool | None = False
|
||||
revision: str | None = None
|
||||
roles_to_train: list[str] | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Roles to train on. The tokens from these roles will be considered for the loss."
|
||||
},
|
||||
)
|
||||
train_on_eos: Literal["all", "turn", "last"] | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Which EOS tokens to train on in the conversation. Possible values are: all: train on all EOS tokens, turn (default): train on the EOS token at the end of each trainable turn, last: train on the last EOS token in the conversation"
|
||||
},
|
||||
)
|
||||
roles: dict[str, list[str]] | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": 'Roles mapping in the messages. The format is {target_role: [source_roles]}. All source roles will be mapped to the target role. The default is: user: ["human", "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]'
|
||||
},
|
||||
)
|
||||
drop_system_message: bool | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Whether to drop the system turn from the dataset. Only works with chat_template. This does not drop the default system message from chat_template if it exists. If you wish to, we recommend using a custom jinja template with the default system message removed or adding a system turn with empty content."
|
||||
},
|
||||
)
|
||||
trust_remote_code: bool | None = Field(
|
||||
default=False,
|
||||
json_schema_extra={"description": "Trust remote code for untrusted source"},
|
||||
)
|
||||
revision: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets."
|
||||
},
|
||||
)
|
||||
|
||||
@model_validator(mode="before")
|
||||
@classmethod
|
||||
|
||||
@@ -60,10 +60,30 @@ class RemappedParameters(BaseModel):
|
||||
"""Parameters that have been remapped to other names"""
|
||||
|
||||
overrides_of_model_config: dict[str, Any] | None = Field(
|
||||
default=None, alias="model_config"
|
||||
default=None,
|
||||
alias="model_config",
|
||||
json_schema_extra={
|
||||
"description": "optional overrides to the base model configuration"
|
||||
},
|
||||
)
|
||||
overrides_of_model_kwargs: dict[str, Any] | None = Field(
|
||||
default=None, alias="model_kwargs"
|
||||
default=None,
|
||||
alias="model_kwargs",
|
||||
json_schema_extra={
"description": "optional overrides to the base model loading via from_pretrained"
|
||||
},
|
||||
)
|
||||
type_of_model: str | None = Field(
|
||||
default=None,
|
||||
alias="model_type",
|
||||
json_schema_extra={
|
||||
"description": "If you want to specify the type of model to load, AutoModelForCausalLM is a good choice too"
|
||||
},
|
||||
)
|
||||
revision_of_model: str | None = Field(
|
||||
default=None,
|
||||
alias="model_revision",
|
||||
json_schema_extra={
|
||||
"description": "You can specify to choose a specific model revision from huggingface hub"
|
||||
},
|
||||
)
|
||||
type_of_model: str | None = Field(default=None, alias="model_type")
|
||||
revision_of_model: str | None = Field(default=None, alias="model_revision")
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
"""Enums for Axolotl input config"""
|
||||
|
||||
# pylint: disable=invalid-name
|
||||
|
||||
from enum import Enum
|
||||
|
||||
import torch
|
||||
@@ -8,81 +10,81 @@ import torch
|
||||
class TorchIntDType(Enum):
|
||||
"""Torch integer data types - `getattr` guards against torch < 2.6 which does not support int4"""
|
||||
|
||||
uint1 = getattr(torch, "uint1", None) # pylint: disable=invalid-name
|
||||
uint2 = getattr(torch, "uint2", None) # pylint: disable=invalid-name
|
||||
uint3 = getattr(torch, "uint3", None) # pylint: disable=invalid-name
|
||||
uint4 = getattr(torch, "uint4", None) # pylint: disable=invalid-name
|
||||
uint5 = getattr(torch, "uint5", None) # pylint: disable=invalid-name
|
||||
uint6 = getattr(torch, "uint6", None) # pylint: disable=invalid-name
|
||||
uint7 = getattr(torch, "uint7", None) # pylint: disable=invalid-name
|
||||
int4 = getattr(torch, "int4", None) # pylint: disable=invalid-name
|
||||
int8 = getattr(torch, "int8", None) # pylint: disable=invalid-name
|
||||
uint1 = getattr(torch, "uint1", None)
|
||||
uint2 = getattr(torch, "uint2", None)
|
||||
uint3 = getattr(torch, "uint3", None)
|
||||
uint4 = getattr(torch, "uint4", None)
|
||||
uint5 = getattr(torch, "uint5", None)
|
||||
uint6 = getattr(torch, "uint6", None)
|
||||
uint7 = getattr(torch, "uint7", None)
|
||||
int4 = getattr(torch, "int4", None)
|
||||
int8 = getattr(torch, "int8", None)
|
||||
|
||||
|
||||
class RLType(str, Enum):
|
||||
"""RL trainer type configuration subset"""
|
||||
|
||||
DPO = "dpo" # pylint: disable=invalid-name
|
||||
GRPO = "grpo" # pylint: disable=invalid-name
|
||||
IPO = "ipo" # pylint: disable=invalid-name
|
||||
ORPO = "orpo" # pylint: disable=invalid-name
|
||||
KTO = "kto" # pylint: disable=invalid-name
|
||||
SIMPO = "simpo" # pylint: disable=invalid-name
|
||||
DPO = "dpo"
|
||||
GRPO = "grpo"
|
||||
IPO = "ipo"
|
||||
ORPO = "orpo"
|
||||
KTO = "kto"
|
||||
SIMPO = "simpo"
|
||||
|
||||
|
||||
class ChatTemplate(str, Enum):
|
||||
"""Chat templates configuration subset"""
|
||||
|
||||
alpaca = "alpaca" # pylint: disable=invalid-name
|
||||
chatml = "chatml" # pylint: disable=invalid-name
|
||||
mistral_v1 = "mistral_v1" # pylint: disable=invalid-name
|
||||
mistral_v2v3 = "mistral_v2v3" # pylint: disable=invalid-name
|
||||
mistral_v3_tekken = "mistral_v3_tekken" # pylint: disable=invalid-name
|
||||
mistral_v7_tekken = "mistral_v7_tekken" # pylint: disable=invalid-name
|
||||
gemma = "gemma" # pylint: disable=invalid-name
|
||||
cohere = "cohere" # pylint: disable=invalid-name
|
||||
llama3 = "llama3" # pylint: disable=invalid-name
|
||||
llama3_2_vision = "llama3_2_vision" # pylint: disable=invalid-name
|
||||
llama4 = "llama4" # pylint: disable=invalid-name
|
||||
phi_3 = "phi_3" # pylint: disable=invalid-name
|
||||
phi_35 = "phi_35" # pylint: disable=invalid-name
|
||||
deepseek_v2 = "deepseek_v2" # pylint: disable=invalid-name
|
||||
deepseek_v3 = "deepseek_v3" # pylint: disable=invalid-name
|
||||
jamba = "jamba" # pylint: disable=invalid-name
|
||||
jinja = "jinja" # pylint: disable=invalid-name
|
||||
qwen_25 = "qwen_25" # pylint: disable=invalid-name
|
||||
qwen3 = "qwen3" # pylint: disable=invalid-name
|
||||
tokenizer_default = "tokenizer_default" # pylint: disable=invalid-name
|
||||
exaone = "exaone" # pylint: disable=invalid-name
|
||||
metharme = "metharme" # pylint: disable=invalid-name
|
||||
pixtral = "pixtral" # pylint: disable=invalid-name
|
||||
llava = "llava" # pylint: disable=invalid-name
|
||||
qwen2_vl = "qwen2_vl" # pylint: disable=invalid-name
|
||||
gemma3 = "gemma3" # pylint: disable=invalid-name
|
||||
command_a = "command_a" # pylint: disable=invalid-name
|
||||
command_a_tool_use = "command_a_tool_use" # pylint: disable=invalid-name
|
||||
command_a_rag = "command_a_rag" # pylint: disable=invalid-name
|
||||
aya = "aya" # pylint: disable=invalid-name
|
||||
alpaca = "alpaca"
|
||||
chatml = "chatml"
|
||||
mistral_v1 = "mistral_v1"
|
||||
mistral_v2v3 = "mistral_v2v3"
|
||||
mistral_v3_tekken = "mistral_v3_tekken"
|
||||
mistral_v7_tekken = "mistral_v7_tekken"
|
||||
gemma = "gemma"
|
||||
cohere = "cohere"
|
||||
llama3 = "llama3"
|
||||
llama3_2_vision = "llama3_2_vision"
|
||||
llama4 = "llama4"
|
||||
phi_3 = "phi_3"
|
||||
phi_35 = "phi_35"
|
||||
deepseek_v2 = "deepseek_v2"
|
||||
deepseek_v3 = "deepseek_v3"
|
||||
jamba = "jamba"
|
||||
jinja = "jinja"
|
||||
qwen_25 = "qwen_25"
|
||||
qwen3 = "qwen3"
|
||||
tokenizer_default = "tokenizer_default"
|
||||
exaone = "exaone"
|
||||
metharme = "metharme"
|
||||
pixtral = "pixtral"
|
||||
llava = "llava"
|
||||
qwen2_vl = "qwen2_vl"
|
||||
gemma3 = "gemma3"
|
||||
command_a = "command_a"
|
||||
command_a_tool_use = "command_a_tool_use"
|
||||
command_a_rag = "command_a_rag"
|
||||
aya = "aya"
|
||||
|
||||
|
||||
class CustomSupportedOptimizers(str, Enum):
|
||||
"""Custom supported optimizers"""
|
||||
|
||||
optimi_adamw = "optimi_adamw" # pylint: disable=invalid-name
|
||||
ao_adamw_4bit = "ao_adamw_4bit" # pylint: disable=invalid-name
|
||||
ao_adamw_8bit = "ao_adamw_8bit" # pylint: disable=invalid-name
|
||||
ao_adamw_fp8 = "ao_adamw_fp8" # pylint: disable=invalid-name
|
||||
adopt_adamw = "adopt_adamw" # pylint: disable=invalid-name
|
||||
came_pytorch = "came_pytorch" # pylint: disable=invalid-name
|
||||
muon = "muon" # pylint: disable=invalid-name
|
||||
optimi_adamw = "optimi_adamw"
|
||||
ao_adamw_4bit = "ao_adamw_4bit"
|
||||
ao_adamw_8bit = "ao_adamw_8bit"
|
||||
ao_adamw_fp8 = "ao_adamw_fp8"
|
||||
adopt_adamw = "adopt_adamw"
|
||||
came_pytorch = "came_pytorch"
|
||||
muon = "muon"
|
||||
|
||||
|
||||
class RingAttnFunc(str, Enum):
|
||||
"""Enum class for supported `ring-flash-attn` implementations"""
|
||||
|
||||
# VARLEN_RING = "varlen_ring"
|
||||
# VARLEN_ZIGZAG = "varlen_zigzag"
|
||||
VARLEN_LLAMA3 = "varlen_llama3"
|
||||
BATCH_RING = "batch_ring"
|
||||
# VARLEN_RING = "varlen_ring"
|
||||
# VARLEN_ZIGZAG = "varlen_zigzag"
|
||||
# BATCH_ZIGZAG = "batch_zigzag"
|
||||
# BATCH_STRIPE = "batch_stripe"
|
||||
|
||||
@@ -13,10 +13,21 @@ class MLFlowConfig(BaseModel):
|
||||
"""MLFlow configuration subset"""
|
||||
|
||||
use_mlflow: bool | None = None
|
||||
mlflow_tracking_uri: str | None = None
|
||||
mlflow_experiment_name: str | None = None
|
||||
mlflow_run_name: str | None = None
|
||||
hf_mlflow_log_artifacts: bool | None = None
|
||||
mlflow_tracking_uri: str | None = Field(
|
||||
default=None, json_schema_extra={"description": "URI to mlflow"}
|
||||
)
|
||||
mlflow_experiment_name: str | None = Field(
|
||||
default=None, json_schema_extra={"description": "Your experiment name"}
|
||||
)
|
||||
mlflow_run_name: str | None = Field(
|
||||
default=None, json_schema_extra={"description": "Your run name"}
|
||||
)
|
||||
hf_mlflow_log_artifacts: bool | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "set to true to copy each saved checkpoint on each save to mlflow artifact registry"
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
class LISAConfig(BaseModel):
|
||||
@@ -40,13 +51,33 @@ class WandbConfig(BaseModel):
    """Wandb configuration subset"""

    use_wandb: bool | None = None
    wandb_name: str | None = None
    wandb_run_id: str | None = None
    wandb_mode: str | None = None
    wandb_project: str | None = None
    wandb_entity: str | None = None
    wandb_name: str | None = Field(
        default=None,
        json_schema_extra={"description": "Set the name of your wandb run"},
    )
    wandb_run_id: str | None = Field(
        default=None, json_schema_extra={"description": "Set the ID of your wandb run"}
    )
    wandb_mode: str | None = Field(
        default=None,
        json_schema_extra={
            "description": '"offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb'
        },
    )
    wandb_project: str | None = Field(
        default=None, json_schema_extra={"description": "Your wandb project name"}
    )
    wandb_entity: str | None = Field(
        default=None,
        json_schema_extra={"description": "A wandb Team name if using a Team"},
    )
    wandb_watch: str | None = None
    wandb_log_model: str | None = None
    wandb_log_model: str | None = Field(
        default=None,
        json_schema_extra={
            "description": '"checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training'
        },
    )

    @model_validator(mode="before")
    @classmethod
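Runtime behavior should be unchanged by this refactor: `Field(default=None, ...)` validates and defaults exactly like a bare `= None`. A quick sketch under that assumption:

```python
from pydantic import BaseModel, Field


class WandbConfigExample(BaseModel):
    """Toy stand-in for the WandbConfig subset above."""

    wandb_project: str | None = Field(
        default=None, json_schema_extra={"description": "Your wandb project name"}
    )


print(WandbConfigExample(wandb_project="my-project").wandb_project)  # my-project
print(WandbConfigExample().wandb_project)  # None -- the default still applies
```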
@@ -64,14 +95,52 @@ class WandbConfig(BaseModel):
class CometConfig(BaseModel):
    """Comet configuration subset"""

    use_comet: bool | None = None
    comet_api_key: str | None = None
    comet_workspace: str | None = None
    comet_project_name: str | None = None
    comet_experiment_key: str | None = None
    comet_mode: str | None = None
    comet_online: bool | None = None
    comet_experiment_config: dict[str, Any] | None = None
    use_comet: bool | None = Field(
        default=None,
        json_schema_extra={"description": "Enable or disable Comet integration."},
    )
    comet_api_key: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "API key for Comet. Recommended to set via `comet login`."
        },
    )
    comet_workspace: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "Workspace name in Comet. Defaults to the user's default workspace."
        },
    )
    comet_project_name: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "Project name in Comet. Defaults to Uncategorized."
        },
    )
    comet_experiment_key: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "Identifier for the experiment. Used to append data to an existing experiment or control the key of new experiments. Default to a random key."
        },
    )
    comet_mode: str | None = Field(
        default=None,
        json_schema_extra={
            "description": 'Create a new experiment ("create") or log to an existing one ("get"). Default ("get_or_create") auto-selects based on configuration.'
        },
    )
    comet_online: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Set to True to log data to Comet server, or False for offline storage. Default is True."
        },
    )
    comet_experiment_config: dict[str, Any] | None = Field(
        default=None,
        json_schema_extra={
            "description": "Dictionary for additional configuration settings, see the doc for more details."
        },
    )


class GradioConfig(BaseModel):
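One practical use of the added descriptions is dumping a fully documented schema for these options, e.g. for reference docs. Illustrative sketch (class name and fields trimmed):

```python
import json

from pydantic import BaseModel, Field


class CometConfigExample(BaseModel):
    """Toy stand-in for the CometConfig subset above."""

    use_comet: bool | None = Field(
        default=None,
        json_schema_extra={"description": "Enable or disable Comet integration."},
    )


print(json.dumps(CometConfigExample.model_json_schema(), indent=2))
```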
@@ -4,7 +4,7 @@ from pydantic import BaseModel, Field, field_validator

from axolotl.utils.logging import get_logger

LOG = get_logger(__name__, use_environ=True)
LOG = get_logger(__name__)


class ModelInputConfig(BaseModel):

@@ -12,19 +12,55 @@ class ModelInputConfig(BaseModel):

    model_config = {"protected_namespaces": ()}

    base_model: str
    base_model_config: str | None = None
    base_model: str = Field(
        json_schema_extra={
            "description": "This is the huggingface model that contains *.pt, *.safetensors, or *.bin files. This can also be a relative path to a model on disk"
        }
    )
    base_model_config: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "If the base_model repo on hf hub doesn't include configuration .json files, You can set that here, or leave this empty to default to base_model"
        },
    )
    cls_model_config: str | None = None
    tokenizer_config: str | None = None
    tokenizer_use_fast: bool | None = None
    tokenizer_legacy: bool | None = None
    tokenizer_config: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "Optional tokenizer configuration path in case you want to use a different tokenizer than the one defined in the base model"
        },
    )
    tokenizer_use_fast: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "use_fast option for tokenizer loading from_pretrained, default to True"
        },
    )
    tokenizer_legacy: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Whether to use the legacy tokenizer setting, defaults to True"
        },
    )
    tokenizer_use_mistral_common: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Whether to use mistral-common tokenizer. If set to True, it will use the mistral-common tokenizer."
        },
    )
    tokenizer_type: str | None = Field(
        default=None, json_schema_extra={"description": "transformers tokenizer class"}
        default=None,
        json_schema_extra={
            "description": "Corresponding tokenizer for the model AutoTokenizer is a good choice"
        },
    )
    processor_type: str | None = Field(
        default=None, json_schema_extra={"description": "transformers processor class"}
    )
    trust_remote_code: bool | None = None
    trust_remote_code: bool | None = Field(
        default=None,
        json_schema_extra={"description": "Trust remote code for untrusted source"},
    )

    @field_validator("trust_remote_code")
    @classmethod
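Note that `base_model` gains a description but no default, so it remains a required field and validation still fails fast when it is omitted. A sketch assuming pydantic v2 (the model id below is purely an example):

```python
from pydantic import BaseModel, Field, ValidationError


class ModelInputConfigExample(BaseModel):
    """Toy stand-in for the ModelInputConfig subset above."""

    base_model: str = Field(
        json_schema_extra={"description": "HF hub id or local path of the base model"}
    )


try:
    ModelInputConfigExample()  # base_model missing
except ValidationError as err:
    print(err.errors()[0]["type"])  # missing

print(ModelInputConfigExample(base_model="org/example-7b").base_model)
```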
@@ -39,10 +75,23 @@ class ModelInputConfig(BaseModel):
class ModelOutputConfig(BaseModel):
    """model save configuration subset"""

    output_dir: str = Field(default="./model-out")
    hub_model_id: str | None = None
    hub_strategy: str | None = None
    save_safetensors: bool | None = True
    output_dir: str = Field(
        default="./model-out",
        json_schema_extra={"description": "Where to save the full-finetuned model to"},
    )
    hub_model_id: str | None = Field(
        default=None, json_schema_extra={"description": "push checkpoints to hub"}
    )
    hub_strategy: str | None = Field(
        default=None,
        json_schema_extra={"description": "how to push checkpoints to hub"},
    )
    save_safetensors: bool | None = Field(
        default=True,
        json_schema_extra={
            "description": "Save model as safetensors (require safetensors package). Default True"
        },
    )


class SpecialTokensConfig(BaseModel):
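The existing defaults (`output_dir="./model-out"`, `save_safetensors=True`) are carried into the `Field(...)` calls unchanged. A small sketch to confirm that expectation (pydantic v2 assumed):

```python
from pydantic import BaseModel, Field


class ModelOutputConfigExample(BaseModel):
    """Toy stand-in for the ModelOutputConfig subset above."""

    output_dir: str = Field(
        default="./model-out",
        json_schema_extra={"description": "Where to save the full-finetuned model to"},
    )
    save_safetensors: bool | None = Field(default=True)


cfg = ModelOutputConfigExample()
print(cfg.output_dir, cfg.save_safetensors)  # ./model-out True
```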
@@ -9,7 +9,7 @@ class LoftQConfig(BaseModel):
    """LoftQ configuration subset"""

    loftq_bits: int = Field(
        default=4, json_schema_extra={"description": "Quantization bits for LoftQ"}
        default=4, json_schema_extra={"description": "typically 4 bits"}
    )
    # loftq_iter: int = Field(default=1, json_schema_extra={"description": "Alternating iterations for LoftQ"})

@@ -17,31 +17,78 @@ class LoftQConfig(BaseModel):
class PeftConfig(BaseModel):
    """peftq configuration subset"""

    loftq_config: LoftQConfig | None = None
    loftq_config: LoftQConfig | None = Field(
        default=None,
        json_schema_extra={
            "description": "Configuration options for loftq initialization for LoRA"
        },
    )


class LoraConfig(BaseModel):
    """Peft / LoRA configuration subset"""

    load_in_8bit: bool | None = Field(default=False)
    load_in_4bit: bool | None = Field(default=False)
    load_in_8bit: bool | None = Field(
        default=False,
        json_schema_extra={
            "description": "This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer"
        },
    )
    load_in_4bit: bool | None = Field(
        default=False, json_schema_extra={"description": "Use bitsandbytes 4 bit"}
    )

    adapter: str | None = None
    lora_model_dir: str | None = None
    adapter: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "If you want to use 'lora' or 'qlora' or leave blank to train all parameters in original model"
        },
    )
    lora_model_dir: str | None = Field(
        default=None,
        json_schema_extra={
            "description": "If you already have a lora model trained that you want to load, put that here. This means after training, if you want to test the model, you should set this to the value of `output_dir`. Note that if you merge an adapter to the base model, a new subdirectory `merged` will be created under the `output_dir`."
        },
    )
    lora_r: int | None = None
    lora_alpha: int | None = None
    lora_fan_in_fan_out: bool | None = None
    lora_target_modules: str | list[str] | None = None
    lora_target_linear: bool | None = None
    lora_modules_to_save: list[str] | None = None
    lora_target_linear: bool | None = Field(
        default=None,
        json_schema_extra={"description": "If true, will target all linear modules"},
    )
    lora_modules_to_save: list[str] | None = Field(
        default=None,
        json_schema_extra={
            "description": "If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens. For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models. `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities."
        },
    )
    lora_dropout: float | None = 0.0
    peft_layers_to_transform: list[int] | None = None
    peft_layers_to_transform: list[int] | None = Field(
        default=None,
        json_schema_extra={
            "description": "The layer indices to transform, otherwise, apply to all layers"
        },
    )
    peft_layers_pattern: list[str] | None = None
    peft: PeftConfig | None = None
    peft_use_dora: bool | None = None
    peft_use_rslora: bool | None = None
    peft_layer_replication: list[tuple[int, int]] | None = None
    peft_init_lora_weights: bool | str | None = None
    peft_use_dora: bool | None = Field(
        default=None, json_schema_extra={"description": "Whether to use DoRA."}
    )
    peft_use_rslora: bool | None = Field(
        default=None, json_schema_extra={"description": "Whether to use RSLoRA."}
    )
    peft_layer_replication: list[tuple[int, int]] | None = Field(
        default=None,
        json_schema_extra={"description": "List of layer indices to replicate."},
    )
    peft_init_lora_weights: bool | str | None = Field(
        default=None,
        json_schema_extra={
            "description": "How to initialize LoRA weights. Default to True which is MS original implementation."
        },
    )

    qlora_sharded_model_loading: bool | None = Field(
        default=False,
@@ -49,9 +96,24 @@ class LoraConfig(BaseModel):
            "description": "load qlora model in sharded format for FSDP using answer.ai technique."
        },
    )
    lora_on_cpu: bool | None = None
    gptq: bool | None = None
    bnb_config_kwargs: dict[str, Any] | None = None
    lora_on_cpu: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge"
        },
    )
    gptq: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Whether you are training a 4-bit GPTQ quantized model"
        },
    )
    bnb_config_kwargs: dict[str, Any] | None = Field(
        default=None,
        json_schema_extra={
            "description": "optional overrides to the bnb 4bit quantization configuration"
        },
    )

    loraplus_lr_ratio: float | None = Field(
        default=None,
@@ -62,7 +124,7 @@ class LoraConfig(BaseModel):
    loraplus_lr_embedding: float | None = Field(
        default=1e-6,
        json_schema_extra={
            "description": "loraplus learning rate for lora embedding layers."
            "description": "loraplus learning rate for lora embedding layers. Default value is 1e-6."
        },
    )
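For the adapter-related fields, the descriptions mostly document what a user would set in their training config. An illustrative sketch with made-up values, reproducing only a few of the fields above (pydantic v2 assumed):

```python
from pydantic import BaseModel, Field


class LoraConfigExample(BaseModel):
    """Toy stand-in for a few LoraConfig fields above."""

    adapter: str | None = Field(
        default=None,
        json_schema_extra={"description": "'lora', 'qlora', or empty for full finetune"},
    )
    lora_r: int | None = None
    lora_alpha: int | None = None


cfg = LoraConfigExample(adapter="qlora", lora_r=16, lora_alpha=32)
print(cfg.model_dump(exclude_none=True))
# {'adapter': 'qlora', 'lora_r': 16, 'lora_alpha': 32}
```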
@@ -125,8 +187,29 @@ class LoraConfig(BaseModel):
class ReLoRAConfig(BaseModel):
    """ReLoRA configuration subset"""

    relora_steps: int | None = None
    relora_warmup_steps: int | None = None
    relora_anneal_steps: int | None = None
    relora_prune_ratio: float | None = None
    relora_cpu_offload: bool | None = None
    relora_steps: int | None = Field(
        default=None,
        json_schema_extra={"description": "Number of steps per ReLoRA restart"},
    )
    relora_warmup_steps: int | None = Field(
        default=None,
        json_schema_extra={"description": "Number of per-restart warmup steps"},
    )
    relora_anneal_steps: int | None = Field(
        default=None,
        json_schema_extra={
            "description": "Number of anneal steps for each relora cycle"
        },
    )
    relora_prune_ratio: float | None = Field(
        default=None,
        json_schema_extra={
            "description": "threshold for optimizer magnitude when pruning"
        },
    )
    relora_cpu_offload: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "True to perform lora weight merges on cpu during restarts, for modest gpu memory savings"
        },
    )

@@ -15,17 +15,22 @@ class QATConfig(BaseModel):
    """

    activation_dtype: TorchIntDType | None = Field(
        default=None, description="Activation dtype"
        default=None,
        description='Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"',
    )
    weight_dtype: TorchIntDType = Field(
        default=TorchIntDType.int8, description="Weight dtype"
        default=TorchIntDType.int8,
        description='Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"',
    )
    quantize_embedding: bool | None = Field(
        default=False, description="Quantize embedding"
    )
    group_size: int | None = Field(default=32, description="Group size")
    group_size: int | None = Field(
        default=32,
        description="The number of elements in each group for per-group fake quantization",
    )
    fake_quant_after_n_steps: int | None = Field(
        default=None, description="Fake quant after n steps"
        default=None, description="The number of steps to apply fake quantization after"
    )

    @field_validator("activation_dtype", "weight_dtype", mode="before")
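Unlike the classes above, `QATConfig` and `PTQConfig` pass `description=` straight to `Field`, which pydantic emits as the standard JSON-schema `description` key without needing `json_schema_extra`. A sketch assuming pydantic v2 (the real classes use the `TorchIntDType` enum, omitted here):

```python
from pydantic import BaseModel, Field


class QATConfigExample(BaseModel):
    """Toy stand-in for QATConfig; dtype fields omitted."""

    group_size: int | None = Field(
        default=32,
        description="The number of elements in each group for per-group fake quantization",
    )


schema = QATConfigExample.model_json_schema()
print(schema["properties"]["group_size"]["description"])
```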
@@ -44,15 +49,20 @@ class PTQConfig(BaseModel):
    """

    weight_dtype: TorchIntDType = Field(
        default=TorchIntDType.int8, description="Weight dtype"
        default=TorchIntDType.int8,
        description="Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8",
    )
    activation_dtype: TorchIntDType | None = Field(
        default=None, description="Activation dtype"
        default=None,
        description='Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"',
    )
    quantize_embedding: bool | None = Field(
        default=None, description="Quantize embedding"
        default=None, description="Whether to quantize the embedding layer."
    )
    group_size: int | None = Field(
        default=32,
        description="The number of elements in each group for per-group fake quantization",
    )
    group_size: int | None = Field(default=32, description="Group size")

    @field_validator("activation_dtype", "weight_dtype", mode="before")
    @classmethod
Some files were not shown because too many files have changed in this diff.