Compare commits: streaming-... → lora_bf16 (36 commits)

Commits:
a7676af44d, 52e37077fc, 850489405b, 6874d32e0c, 58d67bf98d, 0401a15888,
fcfc13d710, 9406c0c488, 1b53c49e1a, b71482cec5, 79103b01ca, 6daed7d060,
9640338d37, b5d4c7ff54, 8fd9221f13, bf00f29f3a, 1d32278755, c6ae5c43cb,
efa1da52d5, 48db520d92, 53a0c1f39c, 4cc6038d52, e48aa8a5b1, 24aba5caca,
06bebcb65f, 231a67e70b, 0094a2d744, 7ed40f1d70, 5b6ec2820f, 6afba3871d,
dc338c3b0e, d0d2fc5606, e1131e9619, c4c4b90638, 0e9945e3b9, 0de254a0d0
@@ -12,6 +12,6 @@ reviews:
   auto_review:
     enabled: true
     drafts: false
-  auto_incremental_review: true
+  auto_incremental_review: false
   chat:
     auto_reply: true
16  .github/workflows/main.yml (vendored)
@@ -36,6 +36,11 @@ jobs:
   python_version: "3.11"
   pytorch: 2.7.1
   axolotl_extras:
+- cuda: 128
+  cuda_version: 12.8.1
+  python_version: "3.11"
+  pytorch: 2.8.0
+  axolotl_extras:
 runs-on: axolotl-gpu-runner
 steps:
   - name: Checkout
@@ -110,6 +115,11 @@ jobs:
   python_version: "3.11"
   pytorch: 2.7.1
   axolotl_extras:
+- cuda: 128
+  cuda_version: 12.8.1
+  python_version: "3.11"
+  pytorch: 2.8.0
+  axolotl_extras:
 runs-on: axolotl-gpu-runner
 steps:
   - name: Checkout
@@ -169,6 +179,12 @@ jobs:
   pytorch: 2.7.1
   axolotl_extras: vllm
   is_latest: true
+- cuda: 128
+  cuda_version: 12.8.1
+  python_version: "3.11"
+  pytorch: 2.8.0
+  axolotl_extras:
+  is_latest:
 runs-on: axolotl-gpu-runner
 steps:
   - name: Checkout
14  .github/workflows/multi-gpu-e2e.yml (vendored)
@@ -33,13 +33,6 @@ jobs:
   axolotl_extras:
   num_gpus: 2
   nightly_build: "true"
-- cuda: 126
-  cuda_version: 12.6.3
-  python_version: "3.11"
-  pytorch: 2.7.0
-  axolotl_extras:
-  num_gpus: 2
-  nightly_build: "true"
 - cuda: 126
   cuda_version: 12.6.3
   python_version: "3.11"
@@ -47,6 +40,13 @@ jobs:
   axolotl_extras: vllm
   num_gpus: 2
   nightly_build: "true"
+- cuda: 128
+  cuda_version: 12.8.1
+  python_version: "3.11"
+  pytorch: 2.8.0
+  axolotl_extras: fbgemm-gpu
+  num_gpus: 2
+  nightly_build: "true"
 runs-on: [self-hosted, modal]
 timeout-minutes: 120
 steps:
20  .github/workflows/tests.yml (vendored)
@@ -55,7 +55,7 @@ jobs:
   fail-fast: false
   matrix:
     python_version: ["3.11"]
-    pytorch_version: ["2.6.0", "2.7.0", "2.7.1"]
+    pytorch_version: ["2.6.0", "2.7.1", "2.8.0"]
 timeout-minutes: 20

 steps:
@@ -130,7 +130,7 @@ jobs:
   fail-fast: false
   matrix:
     python_version: ["3.11"]
-    pytorch_version: ["2.6.0", "2.7.0", "2.7.1"]
+    pytorch_version: ["2.6.0", "2.7.1", "2.8.0"]
 timeout-minutes: 20

 steps:
@@ -240,7 +240,7 @@ jobs:
 - cuda: 126
   cuda_version: 12.6.3
   python_version: "3.11"
-  pytorch: 2.6.0
+  pytorch: 2.7.1
   num_gpus: 1
   axolotl_extras:
   dockerfile: "Dockerfile-uv.jinja"
@@ -298,6 +298,13 @@ jobs:
   pytorch: 2.7.1
   num_gpus: 1
   axolotl_extras:
+- cuda: 128
+  cuda_version: 12.8.1
+  python_version: "3.11"
+  pytorch: 2.8.0
+  num_gpus: 1
+  gpu_type: "B200"
+  axolotl_extras: fbgemm-gpu
 steps:
   - name: Checkout
     uses: actions/checkout@v4
@@ -318,6 +325,7 @@ jobs:
 echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
 echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
 echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
+echo "GPU_TYPE=${{ matrix.gpu_type || 'L40S'}}" >> $GITHUB_ENV
 echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
 echo "E2E_DOCKERFILE=${{ matrix.dockerfile || 'Dockerfile.jinja'}}" >> $GITHUB_ENV
 - name: Run tests job on Modal
@@ -334,10 +342,10 @@ jobs:
   fail-fast: false
   matrix:
     include:
-    - cuda: 124
-      cuda_version: 12.4.1
+    - cuda: 126
+      cuda_version: 12.6.3
       python_version: "3.11"
-      pytorch: 2.6.0
+      pytorch: 2.7.1
       num_gpus: 1
       axolotl_extras:
 steps:
@@ -11,10 +11,10 @@ repos:
   - id: no-commit-to-branch
     args: ['--branch', 'main']
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.12.9
+  rev: v0.12.12
   hooks:
     - id: ruff
-      args: [--fix]
+      args: [--fix, --select, I]
     - id: ruff-format
 - repo: https://github.com/pre-commit/mirrors-mypy
   rev: v1.17.1
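The `--select I` addition enables Ruff's import-sorting (isort-style) rules on top of the existing `--fix` behavior. A quick, hedged way to exercise the updated hooks locally (assumes `pre-commit` is already installed):

```bash
# Re-run the bumped Ruff hooks across the whole repo
pre-commit run ruff --all-files
pre-commit run ruff-format --all-files
```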
@@ -1,6 +1,6 @@
 cff-version: 1.2.0
 type: software
-title: "Axolotl: Post-Training for AI Models"
+title: "Axolotl: Open Source LLM Post-Training"
 message: "If you use this software, please cite it as below."
 authors:
   - name: "Axolotl maintainers and contributors"
21  README.md
@@ -5,6 +5,9 @@
     <img alt="Axolotl" src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/887513285d98132142bf5db2a74eb5e0928787f1/image/axolotl_logo_digital_black.svg" width="400" height="104" style="max-width: 100%;">
   </picture>
 </p>
+<p align="center">
+    <strong>A Free and Open Source LLM Fine-tuning Framework</strong><br>
+</p>

 <p align="center">
     <img src="https://img.shields.io/github/license/axolotl-ai-cloud/axolotl.svg?color=blue" alt="GitHub License">
@@ -17,6 +20,7 @@
     <br/>
     <a href="https://discord.com/invite/HhrNrHJPRb"><img src="https://img.shields.io/badge/discord-7289da.svg?style=flat-square&logo=discord" alt="discord" style="height: 20px;"></a>
     <a href="https://twitter.com/axolotl_ai"><img src="https://img.shields.io/twitter/follow/axolotl_ai?style=social" alt="twitter" style="height: 20px;"></a>
+    <a href="https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="google-colab" style="height: 20px;"></a>
     <br/>
     <img src="https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/tests-nightly.yml/badge.svg" alt="tests-nightly">
     <img src="https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/multi-gpu-e2e.yml/badge.svg" alt="multigpu-semi-weekly tests">
@@ -49,20 +53,21 @@

 ## ✨ Overview

-Axolotl is a tool designed to streamline post-training for various AI models.
+Axolotl is a free and open-source tool designed to streamline post-training and fine-tuning for the latest large language models (LLMs).

 Features:

-- **Multiple Model Support**: Train various models like LLaMA, Mistral, Mixtral, Pythia, and more. We are compatible with HuggingFace transformers causal language models.
-- **Training Methods**: Full fine-tuning, LoRA, QLoRA, GPTQ, QAT, Preference Tuning (DPO, IPO, KTO, ORPO), RL (GRPO), Multimodal, and Reward Modelling (RM) / Process Reward Modelling (PRM).
-- **Easy Configuration**: Re-use a single YAML file between dataset preprocess, training, evaluation, quantization, and inference.
+- **Multiple Model Support**: Train various models like GPT-OSS, LLaMA, Mistral, Mixtral, Pythia, and many more models available on the Hugging Face Hub.
+- **Multimodal Training**: Fine-tune vision-language models (VLMs) including LLaMA-Vision, Qwen2-VL, Pixtral, LLaVA, SmolVLM2, and audio models like Voxtral with image, video, and audio support.
+- **Training Methods**: Full fine-tuning, LoRA, QLoRA, GPTQ, QAT, Preference Tuning (DPO, IPO, KTO, ORPO), RL (GRPO), and Reward Modelling (RM) / Process Reward Modelling (PRM).
+- **Easy Configuration**: Re-use a single YAML configuration file across the full fine-tuning pipeline: dataset preprocessing, training, evaluation, quantization, and inference.
 - **Performance Optimizations**: [Multipacking](https://docs.axolotl.ai/docs/multipack.html), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Xformers](https://github.com/facebookresearch/xformers), [Flex Attention](https://pytorch.org/blog/flexattention/), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), [Cut Cross Entropy](https://github.com/apple/ml-cross-entropy/tree/main), [Sequence Parallelism (SP)](https://docs.axolotl.ai/docs/sequence_parallelism.html), [LoRA optimizations](https://docs.axolotl.ai/docs/lora_optims.html), [Multi-GPU training (FSDP1, FSDP2, DeepSpeed)](https://docs.axolotl.ai/docs/multi-gpu.html), [Multi-node training (Torchrun, Ray)](https://docs.axolotl.ai/docs/multi-node.html), and many more!
 - **Flexible Dataset Handling**: Load from local, HuggingFace, and cloud (S3, Azure, GCP, OCI) datasets.
 - **Cloud Ready**: We ship [Docker images](https://hub.docker.com/u/axolotlai) and also [PyPI packages](https://pypi.org/project/axolotl/) for use on cloud platforms and local hardware.

-## 🚀 Quick Start
+## 🚀 Quick Start - LLM Fine-tuning in Minutes

 **Requirements**:

@@ -70,6 +75,10 @@ Features:
 - Python 3.11
 - PyTorch ≥2.6.0

+### Google Colab
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb#scrollTo=msOCO4NRmRLa)
+
 ### Installation

 #### Using pip
@@ -155,7 +164,7 @@ If you use Axolotl in your research or projects, please cite it as follows:

 ```bibtex
 @software{axolotl,
-  title = {Axolotl: Post-Training for AI Models},
+  title = {Axolotl: Open Source LLM Post-Training},
   author = {{Axolotl maintainers and contributors}},
   url = {https://github.com/axolotl-ai-cloud/axolotl},
   license = {Apache-2.0},
@@ -153,7 +153,7 @@ quartodoc:
   - utils.distributed
   - utils.dict
   - utils.optimizers.adopt
-  - utils.data.pretraining
+  - utils.data.streaming
   - utils.data.sft
   - utils.quantization
 - title: Schemas
@@ -272,6 +272,7 @@ website:
 contents:
   - docs/batch_vs_grad.qmd
   - docs/dataset_preprocessing.qmd
+  - docs/streaming.qmd
   - docs/multipack.qmd
   - docs/mixed_precision.qmd
   - docs/optimizers.qmd
@@ -57,7 +57,8 @@ VOLUME_CONFIG = {
 }

 N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = f"L40S:{N_GPUS}"
+GPU_TYPE = os.environ.get("GPU_TYPE", "L40S")
+GPU_CONFIG = f"{GPU_TYPE}:{N_GPUS}"


 def run_cmd(cmd: str, run_folder: str):
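A minimal sketch of the new GPU selection logic in the Modal CI script: the GPU type now comes from the `GPU_TYPE` env var set by the workflow matrix (falling back to L40S) instead of being hard-coded.

```python
import os

# Mirrors the diff above: GPU type from env, count from env, joined into one spec.
N_GPUS = int(os.environ.get("N_GPUS", 1))
GPU_TYPE = os.environ.get("GPU_TYPE", "L40S")
GPU_CONFIG = f"{GPU_TYPE}:{N_GPUS}"  # e.g. "B200:1" for the new B200 matrix entry

print(GPU_CONFIG)
```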
@@ -12,7 +12,7 @@ coverage:
   default:
     # basic
     target: auto
-    threshold: 0%
+    threshold: 1%
     base: auto
     # advanced
     branches: null
@@ -27,7 +27,7 @@ coverage:
   default:
     # basic
     target: auto
-    threshold: 0%
+    threshold: 1%
     base: auto
     # advanced
     branches: null
@@ -134,7 +134,7 @@ For providers supporting Docker:

 ### Google Colab {#sec-colab}

-Use our [example notebook](../examples/colab-notebooks/colab-axolotl-example.ipynb).
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb#scrollTo=msOCO4NRmRLa)

 ## Platform-Specific Instructions {#sec-platform-specific}
@@ -63,15 +63,6 @@ Start from Stage 1 -> Stage 2 -> Stage 3.

 :::

-::: {.callout-tip}
-
-Using ZeRO Stage 3 with Single-GPU training
-
-ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
-`WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500`
-
-:::
-
 ## Fully Sharded Data Parallel (FSDP) {#sec-fsdp}

 ::: {.callout-note}
@@ -51,3 +51,11 @@ axolotl quantize qat.yml
 ```

 This ensures that an identical quantization configuration is used to quantize the model as was used to train it.
+
+
+::: {.callout-note}
+
+If you have configured pushing to hub with `hub_model_id`, your model hub name will have the quantization schema appended to it,
+e.g. `axolotl-ai-cloud/qat-nvfp4-llama3B` will become `axolotl-ai-cloud/qat-nvfp4-llama3B-nvfp4w`
+
+:::
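A hypothetical illustration of the naming behaviour described in that note; the `qat` values below mirror the new `examples/llama-3/3b-qat-fsdp2-nvfp4.yaml` added later in this compare, while the `hub_model_id` is the example name from the note itself:

```yaml
qat:
  activation_dtype: nvfp4
  weight_dtype: nvfp4
  group_size: 16
hub_model_id: axolotl-ai-cloud/qat-nvfp4-llama3B
# After `axolotl quantize`, the quantized weights would be pushed under
# axolotl-ai-cloud/qat-nvfp4-llama3B-nvfp4w (schema suffix appended automatically).
```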
@@ -11,6 +11,7 @@ We support the reward modelling techniques supported by `trl`.
 ### (Outcome) Reward Models

 Outcome reward models are trained using data which contains preference annotations for an entire interaction between the user and model (e.g. rather than per-turn or per-step).
+For improved training stability, you can use the `center_rewards_coefficient` parameter to encourage mean-zero reward outputs ([see TRL docs](https://huggingface.co/docs/trl/v0.10.1/en/reward_trainer#centering-rewards)).

 ```yaml
 base_model: google/gemma-2-2b
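For context, a minimal sketch of how the new option sits in a reward-model config; the values mirror the `examples/qwen3/reward-model.yaml` added later in this compare:

```yaml
reward_model: true
center_rewards_coefficient: 0.01  # incentivize mean-zero rewards for improved stability
chat_template: qwen3
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template
```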
120  docs/streaming.qmd  (new file)
@@ -0,0 +1,120 @@
---
title: Streaming Datasets
description: How to use streaming mode for large-scale datasets and memory-efficient training
order: 10
---

Streaming enables memory-efficient training with large datasets by loading data
incrementally rather than loading the entire dataset into memory at once.

Use streaming when:

- Your dataset is too large to fit in memory (e.g. when you're doing pretraining with massive text corpora)
- You want to start training immediately without preprocessing the entire dataset

Streaming works with both remote and locally stored datasets!

::: {.callout-note}
Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
:::

## Configuration

### Basic Streaming

Enable streaming mode by setting the `streaming` flag:

```yaml
streaming: true
```

### Pretraining with Streaming

For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:

```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

### SFT with Streaming

For supervised fine-tuning with streaming:

```yaml
streaming: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

## Configuration Options

### `streaming_multipack_buffer_size`

Controls the buffer size for multipack streaming (default: 10,000). This determines how
many samples are buffered before packing. Larger buffers can improve packing efficiency
but use more memory.

### `shuffle_merged_datasets`

When enabled, shuffles the streaming dataset using the buffer. This requires additional
memory for the shuffle buffer.

## Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are
packed into a single sequence to maximize GPU utilization:

```yaml
sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true # prevent cross-attention between packed samples
```

For more information, see our [documentation](multipack.qmd) on multipacking.

## Important Considerations

### Memory Usage

While streaming reduces memory usage compared to loading entire datasets, you still need
to consider:

- You can control the memory usage by adjusting `streaming_multipack_buffer_size`
- Sample packing requires buffering multiple samples
- Shuffling requires additional memory for the shuffle buffer

### Performance

- Streaming may have slightly higher latency compared to preprocessed datasets, as samples are processed on-the-fly
- Network speed and disk read speed are important when streaming from remote sources or a local dataset, respectively
- Consider using `axolotl preprocess` for smaller or more frequently used datasets

### Evaluation Datasets

Evaluation datasets are not streamed to ensure consistent evaluation metrics. They're
loaded normally even when training uses streaming.

## Examples

See the `examples/streaming/` directory for complete configuration examples:

- `pretrain.yaml`: Pretraining with streaming dataset
- `sft.yaml`: Supervised fine-tuning with streaming
10  examples/cloud/baseten.yaml  (new file)
@@ -0,0 +1,10 @@
provider: baseten
project_name:

secrets:
  - HF_TOKEN
  - WANDB_API_KEY

gpu: h100
gpu_count: 8
node_count: 1
@@ -40,7 +40,7 @@
     "%%capture\n",
     "# This step can take ~5-10 minutes to install dependencies\n",
     "!pip install --no-build-isolation axolotl[flash-attn]>=0.9.1\n",
-    "!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8\""
+    "!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5\""
    ]
   },
   {
@@ -176,8 +176,8 @@
    }
   ],
   "source": [
-    "from axolotl.utils.dict import DictDefault\n",
     "from axolotl.cli.config import load_cfg\n",
+    "from axolotl.utils.dict import DictDefault\n",
     "\n",
     "# Axolotl provides full control and transparency over model and training configuration\n",
     "config = DictDefault(\n",
@@ -20,7 +20,13 @@ pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
 ```

-2. Run the finetuning example:
+2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage
+
+```bash
+python scripts/cutcrossentropy_install.py | sh
+```
+
+3. Run the finetuning example:

 ```bash
 axolotl train examples/devstral/devstral-small-qlora.yml
68  examples/gemma3/270m-qlora.yml  (new file)
@@ -0,0 +1,68 @@
base_model: google/gemma-3-270m-it
# optionally might have model_type or tokenizer_type
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true

load_in_8bit: false
load_in_4bit: true

# huggingface repo
chat_template: gemma3
eot_tokens:
  - <end_of_turn>
datasets:
  - path: cgato/SlimOrcaDedupCleaned
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

val_set_size: 0.0
output_dir: ./outputs/out

adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

sequence_len: 2048
sample_packing: true
eval_sample_packing: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch:
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
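A typical way to launch this new example, following the same CLI pattern used by the other example READMEs in this compare (assumes Axolotl is already installed):

```bash
axolotl train examples/gemma3/270m-qlora.yml
```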
@@ -106,6 +106,16 @@ See [Nanobit/text-tools-2k-test](https://huggingface.co/datasets/Nanobit/text-tools-2k-test)

 Refer to [our docs](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#using-tool-use) for more info.

+### Thinking and chat_template masking conflict
+
+OpenAI’s Harmony template hides `thinking` in all non-final turns, which conflicts with Axolotl’s `chat_template` masking.
+
+If your dataset has `thinking` content mid-turn, there are two paths we recommend:
+
+- Train only on the last turn. This can be accomplished via chat_template's [train on last doc](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#training-on-last-message).
+
+- Adjust your dataset to only have `thinking` content in the last turn.
+
 ### TIPS

 - Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
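As a hypothetical illustration of the second option (the exact message schema depends on your dataset; a separate `thinking` field is an assumption here), keeping reasoning content only in the final assistant turn might look like:

```python
# Hypothetical example: `thinking` appears only in the last assistant turn, so the
# Harmony template's hiding of non-final thinking and Axolotl's chat_template
# masking no longer disagree.
messages = [
    {"role": "user", "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "4"},  # earlier turn: no thinking field
    {"role": "user", "content": "And times 10?"},
    {
        "role": "assistant",
        "thinking": "4 multiplied by 10 is 40.",  # reasoning only in the final turn
        "content": "40",
    },
]
```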
85  examples/hunyuan/README.md  (new file)
@@ -0,0 +1,85 @@
# Finetune HunYuan with Axolotl

Tencent released a family of opensource models called HunYuan with varying parameter scales of 0.5B, 1.8B, 4B, and 7B scale for both Pre-trained and Instruct variants. The models can be found at [HuggingFace](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7). This guide shows how to fine-tune it with Axolotl with multi-turn conversations and proper masking.

## Getting started

1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html). You need to install from main as HunYuan is only on nightly or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).

Here is an example of how to install from main for pip:

```bash
# Ensure you have Pytorch installed (Pytorch 2.6.0 min)
git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl

pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation -e '.[flash-attn]'

# Install CCE https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
python scripts/cutcrossentropy_install.py | sh
```

2. Run the finetuning example:

```bash
axolotl train examples/hunyuan/hunyuan-v1-dense-qlora.yaml
```

This config uses about 4.7 GB VRAM.

Let us know how it goes. Happy finetuning! 🚀

### Dataset

HunYuan Instruct models can choose to enter a slow think or fast think pattern. For best performance on fine-tuning their Instruct models, your dataset should be adjusted to match their pattern.

```python
# fast think pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/no_think What color is the sun?"},
    {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\nThe sun is yellow.\n</answer>"}
]

# slow think pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/no_think What color is the sun?"},
    {"role": "assistant", "content": "<think>\nThe user is asking about the color of the sun. I need to ...\n</think>\n<answer>\nThe sun is yellow.\n</answer>"}
]
```

### TIPS

- For inference, the official Tencent team recommends

```json
{
  "do_sample": true,
  "top_k": 20,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "temperature": 0.7
}
```

- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the config.
- Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
- The dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).

## Optimization Guides

- [Multi-GPU Training](https://docs.axolotl.ai/docs/multi-gpu.html)
- [Multi-Node Training](https://docs.axolotl.ai/docs/multi-node.html)
- [LoRA Optimizations](https://docs.axolotl.ai/docs/lora_optims.html)

## Related Resources

- [Tencent HunYuan Blog](https://hunyuan.tencent.com/)
- [Axolotl Docs](https://docs.axolotl.ai)
- [Axolotl Website](https://axolotl.ai)
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
64  examples/hunyuan/hunyuan-v1-dense-qlora.yaml  (new file)
@@ -0,0 +1,64 @@
base_model: tencent/Hunyuan-0.5B-Instruct

# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

load_in_8bit: false
load_in_4bit: true

datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template

dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/lora-out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
64  examples/llama-3/3b-qat-fsdp2-nvfp4.yaml  (new file)
@@ -0,0 +1,64 @@
base_model: meta-llama/Llama-3.2-3B
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: false
load_in_4bit: false
strict: false

plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca
    split: train[:95%]

output_dir: ./outputs/qat_out/
dataset_prepared_path: ./outputs/dataset_prepared

sequence_len: 8192
flash_attention: true

qat:
  activation_dtype: nvfp4
  weight_dtype: nvfp4
  group_size: 16 # only group_size of 16 is supported with nvfp4

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_checkpointing: true
gradient_accumulation_steps: 1
micro_batch_size: 64
num_epochs: 1
optimizer: adamw_torch_fused

cosine_constant_lr_ratio: 0
cosine_min_lr_ratio: 1.0
learning_rate: 2e-5
save_only_model: true
bf16: true

resume_from_checkpoint:
logging_steps: 1

evals_per_epoch: 1
saves_per_epoch: 1

warmup_ratio: 0.1
weight_decay: 0.0

special_tokens:
  pad_token: <|finetune_right_pad_id|>

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
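A plausible end-to-end use of this new example, combining the train command pattern from the example READMEs with the `axolotl quantize` step from the QAT docs above; treat the exact sequence as a sketch rather than a prescribed workflow:

```bash
axolotl train examples/llama-3/3b-qat-fsdp2-nvfp4.yaml
# then produce the nvfp4-quantized weights using the same config
axolotl quantize examples/llama-3/3b-qat-fsdp2-nvfp4.yaml
```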
@@ -15,20 +15,18 @@ liger_glu_activation: true
 liger_layer_norm: true
 liger_fused_linear_cross_entropy: true


 datasets:
   - path: yahma/alpaca-cleaned
     type: alpaca
+    split: train[:95%]

 output_dir: ./outputs/qat_out/
+dataset_prepared_path: ./outputs/qat_out/dataset_prepared

-sample_packing: true
-sequence_len: 512
-
-flex_attention: true
-flex_attn_compile_kwargs:
-  dynamic: false
-  mode: max-autotune-no-cudagraphs
+sample_packing: false
+sequence_len: 8192
+flash_attention: true

 qat:
   activation_dtype: int8
@@ -67,7 +65,7 @@ fsdp:
 fsdp_config:
   fsdp_version: 2
   fsdp_offload_params: false
-  fsdp_cpu_ram_efficient_loading: true
+  fsdp_cpu_ram_efficient_loading: false
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
   fsdp_state_dict_type: FULL_STATE_DICT
@@ -76,6 +74,6 @@ fsdp_config:
   fsdp_activation_checkpointing: true

 special_tokens:
-  pad_token: <|end_of_text|>
+  pad_token: <|finetune_right_pad_id|>

 # save_first_step: true # uncomment this to validate checkpoint saving works with your config
56  examples/llama-3/diffusion/pretrain-1b.yaml  (new file)
@@ -0,0 +1,56 @@
base_model: meta-llama/Llama-3.2-1B
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

pretraining_dataset:
  - path: wikitext
    name: wikitext-103-raw-v1
    type: completion
    field: text

plugins:
  - axolotl.integrations.diffusion.DiffusionPlugin

diffusion:
  noise_schedule: cosine
  min_mask_ratio: 0.15
  max_mask_ratio: 0.85
  num_diffusion_steps: 128
  eps: 5e-4
  importance_weighting: true
  mask_token_id: 128002
  generate_samples: true
  generation_interval: 250

output_dir: ./outputs/model-out

sequence_len: 512
sample_packing: true

gradient_accumulation_steps: 8
micro_batch_size: 4
max_steps: 10000
warmup_ratio: 0.1

optimizer: adamw_8bit
lr_scheduler: cosine
learning_rate: 3e-4
sdp_attention: true

bf16: auto
tf32: true

logging_steps: 1
save_strategy: steps
save_steps: 1000

special_tokens:
  pad_token: "<|end_of_text|>"

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
59  examples/llama-3/diffusion/sft-1b.yaml  (new file)
@@ -0,0 +1,59 @@
base_model: meta-llama/Llama-3.2-1B
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
val_set_size: 0.05

plugins:
  - axolotl.integrations.diffusion.DiffusionPlugin

diffusion:
  noise_schedule: cosine
  min_mask_ratio: 0.1
  max_mask_ratio: 0.9
  num_diffusion_steps: 128
  eps: 1e-3
  importance_weighting: true
  mask_token_id: 128002
  generate_samples: true
  generation_interval: 250

output_dir: ./outputs/model-out

sequence_len: 512
sample_packing: true
eval_sample_packing: true

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 1
warmup_steps: 0.1

optimizer: adamw_8bit
lr_scheduler: cosine
learning_rate: 1e-5

bf16: auto
tf32: true

gradient_checkpointing: true
resume_from_checkpoint:
sdp_attention: true

logging_steps: 1
save_strategy: best
eval_strategy: epoch

special_tokens:
  pad_token: "<|end_of_text|>"

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
@@ -18,7 +18,13 @@ pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
 ```

-2. Run the finetuning example:
+2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage
+
+```bash
+python scripts/cutcrossentropy_install.py | sh
+```
+
+3. Run the finetuning example:

 ```bash
 axolotl train examples/magistral/magistral-small-qlora.yaml
44  examples/qwen3/reward-model.yaml  (new file)
@@ -0,0 +1,44 @@
base_model: Skywork/Skywork-Reward-V2-Qwen3-8B
model_type: AutoModelForSequenceClassification
num_labels: 1

reward_model: true
center_rewards_coefficient: 0.01 # Incentivize mean-zero rewards for improved stability
chat_template: qwen3
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template

val_set_size: 0.0
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: true

deepspeed: deepspeed_configs/zero1.json

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: linear
learning_rate: 0.00002

bf16: true
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
warmup_ratio: 0.1
logging_steps: 1
weight_decay: 0.01
54  examples/seed-oss/README.md  (new file)
@@ -0,0 +1,54 @@
# Finetune ByteDance's Seed-OSS with Axolotl

[Seed-OSS](https://huggingface.co/collections/ByteDance-Seed/seed-oss-68a609f4201e788db05b5dcd) are a series of 36B parameter open source models trained by ByteDance's Seed Team.

This guide shows how to fine-tune it with Axolotl with multi-turn conversations and proper masking.

## Getting started

1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html). You need to install from main as Seed-OSS is only on nightly or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).

Here is an example of how to install from main for pip:

```bash
# Ensure you have Pytorch installed (Pytorch 2.6.0 min)
git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl

pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation -e '.[flash-attn]'

# Install Cut Cross Entropy
python scripts/cutcrossentropy_install.py | sh
```

2. Run the finetuning example:

```bash
axolotl train examples/seed-oss/seed-oss-36b-qlora.yaml
```

This config uses about 27.7 GiB VRAM.

Let us know how it goes. Happy finetuning! 🚀

### TIPS

- For inference, the official Seed Team recommends `top_p=0.95` and `temperature=1.1`.
- You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the config.
- Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
- The dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).

## Optimization Guides

- [Multi-GPU Training](https://docs.axolotl.ai/docs/multi-gpu.html)
- [Multi-Node Training](https://docs.axolotl.ai/docs/multi-node.html)
- [LoRA Optimizations](https://docs.axolotl.ai/docs/lora_optims.html)

## Related Resources

- [ByteDance Seed Website](https://seed.bytedance.com/)
- [Axolotl Docs](https://docs.axolotl.ai)
- [Axolotl Website](https://axolotl.ai)
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
56  examples/seed-oss/seed-oss-36b-qlora.yaml  (new file)
@@ -0,0 +1,56 @@
base_model: ByteDance-Seed/Seed-OSS-36B-Instruct

# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

load_in_8bit: false
load_in_4bit: true

datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template

dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/lora-out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
50  examples/streaming/README.md  (new file)
@@ -0,0 +1,50 @@
# Streaming Dataset Examples

This directory contains example configurations for using Axolotl's streaming dataset
functionality, which enables memory-efficient training with large datasets.

## Examples

Run the following examples with e.g. `axolotl train examples/streaming/sft.yaml`; no
`axolotl preprocess` required!

### Pretraining (`pretrain.yaml`)

Demonstrates streaming configuration for pretraining tasks using the fineweb-edu dataset
with SmolLM2-135M.

- Uses `pretraining_dataset` configuration for automatic streaming
- Multipack attention control to prevent cross-attention between packed sequences
- Buffer size configuration for memory management

### SFT (`sft.yaml`)

Shows how to use streaming for supervised fine-tuning with the Alpaca dataset.

- Explicit `streaming: true` flag for SFT datasets
- Memory-efficient training on instruction datasets
- Evaluation datasets are currently not streamed

## Key Configuration Options

### `streaming`
- Enables streaming mode for standard datasets
- Automatically enabled for `pretraining_dataset`

### `streaming_multipack_buffer_size`
- Controls buffer size for sample packing (default: 10,000)
- Larger values improve packing efficiency but use more memory
- Adjust based on available memory

### `shuffle_merged_datasets`
- Enables shuffling of streaming datasets
- Requires additional memory for shuffle buffer

### `sample_packing`
- Packs multiple samples into single sequences
- Minimizes per-step padding tokens

## Performance Tips

- Download small / frequently-used datasets locally for better performance
- Larger buffer sizes improve packing efficiency
57  examples/streaming/pretrain.yaml  (new file)
@@ -0,0 +1,57 @@
base_model: HuggingFaceTB/SmolLM2-135M

# Streaming pretraining configuration
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    name: sample-10BT
    type: pretrain
    text_column: text
    split: train

# Streaming-specific settings
streaming_multipack_buffer_size: 10000
shuffle_merged_datasets: true

# Training configuration
max_steps: 1000
output_dir: ./outputs/smollm2-135m-pretrain-streaming

# Sequence and packing settings
sequence_len: 1024
sample_packing: true
pretrain_multipack_attn: true # Prevent cross-attention between packed sequences
flash_attention: true

# Batch size settings
gradient_accumulation_steps: 8
micro_batch_size: 1

# Optimizer and scheduler
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 5e-4
warmup_ratio: 0.1
weight_decay: 0.01

# Precision and performance
bf16: auto
tf32: true

# Logging and checkpointing
logging_steps: 10
save_strategy: steps
save_steps: 250
save_total_limit: 3

# Weights & Biases (optional)
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

# Special tokens
special_tokens:
  pad_token: "<|endoftext|>"

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
55
examples/streaming/sft.yaml
Normal file
@@ -0,0 +1,55 @@
base_model: HuggingFaceTB/SmolLM2-135M

# Dataset configuration
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Streaming-specific settings
streaming: true
streaming_multipack_buffer_size: 10000
shuffle_merged_datasets: true

# Training configuration
max_steps: 1000
output_dir: ./outputs/smollm2-135m-sft-streaming

# Sequence and packing settings
sequence_len: 1024
sample_packing: true
flash_attention: true

# Batch size settings
gradient_accumulation_steps: 4
micro_batch_size: 1

# Optimizer and scheduler
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-4
warmup_ratio: 0.1
weight_decay: 0.0

# Precision and performance
bf16: auto
tf32: true

# Logging and checkpointing
logging_steps: 10
save_strategy: steps
save_steps: 100
save_total_limit: 3

# Weights & Biases (optional)
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

# Special tokens
special_tokens:
  pad_token: "<|endoftext|>"

# save_first_step: true # uncomment this to validate checkpoint saving works with your config
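As a quick sanity check of the two examples above, the sketch below (assuming it runs from the repository root with PyYAML installed) loads both files and prints their streaming-related keys, which makes the difference explicit: the pretraining config streams through `pretraining_dataset`, while the SFT config uses regular `datasets` plus `streaming: true`.

```python
# Sketch: inspect the streaming-related settings of the two example configs.
import yaml

for path in ("examples/streaming/pretrain.yaml", "examples/streaming/sft.yaml"):
    with open(path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    print(path)
    for key in ("streaming", "pretraining_dataset", "datasets",
                "streaming_multipack_buffer_size", "sample_packing"):
        if key in cfg:
            print(f"  {key}: {cfg[key]}")
```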
@@ -22,6 +22,9 @@ pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
# audio
pip3 install librosa==0.11.0
pip3 install 'mistral_common[audio]==1.8.3'

# Install CCE https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
python scripts/cutcrossentropy_install.py | sh
```

3. Run the finetuning example:
@@ -2,8 +2,7 @@

# START section of dependencies that don't install on Darwin/MacOS
bitsandbytes==0.47.0
# triton 3.4.0 is not compatible with CCE
triton>=3.0.0,<3.4.0
triton>=3.0.0
mamba-ssm==1.2.0.post1
xformers>=0.0.23.post1
autoawq==0.2.7.post3
@@ -14,7 +13,7 @@ packaging==23.2

huggingface_hub>=0.33.0
peft>=0.17.0
transformers==4.55.3
transformers==4.56.1
tokenizers>=0.21.1
accelerate==1.10.0
datasets==4.0.0
@@ -65,7 +64,7 @@ langdetect==1.0.9
immutabledict==4.2.0
antlr4-python3-runtime==4.13.2

torchao==0.12.0
torchao==0.13.0
schedulefree==1.4.1

axolotl-contribs-lgpl==0.0.6
@@ -29,5 +29,5 @@ UV_PREFIX = "uv " if USE_UV else ""

print(
    UNINSTALL_PREFIX
    + f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"'
    + f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"'
)
7
setup.py
@@ -64,7 +64,9 @@ def parse_requirements(extras_require_map):
    else:
        raise ValueError("Invalid version format")

    if (major, minor) >= (2, 7):
    if (major, minor) >= (2, 8):
        pass
    elif (major, minor) >= (2, 7):
        _install_requires.pop(_install_requires.index(xformers_version))
        if patch == 0:
            _install_requires.append("xformers==0.0.30")
@@ -125,7 +127,7 @@ extras_require = {
        "yunchang==0.6.0",
    ],
    "deepspeed": [
        "deepspeed==0.17.2",
        "deepspeed==0.17.5",
        "deepspeed-kernels",
    ],
    "mamba-ssm": [
@@ -160,6 +162,7 @@ extras_require = {
    "llmcompressor": [
        "llmcompressor==0.5.1",
    ],
    "fbgemm-gpu": ["fbgemm-gpu-genai>=1.2.0"],
}
install_requires, dependency_links, extras_require_build = parse_requirements(
    extras_require
@@ -14,9 +14,13 @@ class PreprocessCliArgs:
    prompter: Optional[str] = field(default=None)
    download: Optional[bool] = field(default=True)
    iterable: Optional[bool] = field(
        default=None,
        default=False,
        metadata={
            "help": "Use IterableDataset for streaming processing of large datasets"
            "help": (
                "Deprecated in v0.13.0, will be removed in v0.14.0. For streaming "
                "datasets, use 'axolotl train' and set 'streaming: true' in your YAML "
                "config, or pass --streaming instead in the CLI."
            )
        },
    )

@@ -111,6 +115,7 @@ class QuantizeCliArgs:
    quantize_embedding: Optional[bool] = field(default=None)
    group_size: Optional[int] = field(default=None)
    output_dir: Optional[str] = field(default=None)
    hub_model_id: Optional[str] = field(default=None)


@dataclass
@@ -7,6 +7,8 @@ from typing import Literal

import yaml

from axolotl.cli.cloud.base import Cloud
from axolotl.cli.cloud.baseten import BasetenCloud
from axolotl.cli.cloud.modal_ import ModalCloud
from axolotl.utils.dict import DictDefault

@@ -38,8 +40,15 @@ def do_cli_train(
    cwd=None,
    **kwargs,
) -> None:
    cloud_cfg = load_cloud_cfg(cloud_config)
    cloud_cfg: DictDefault = load_cloud_cfg(cloud_config)
    cloud = ModalCloud(cloud_cfg)
    provider = cloud_cfg.provider or "modal"
    cloud: Cloud | None
    if provider == "modal":
        cloud = ModalCloud(cloud_cfg)
    elif provider == "baseten":
        cloud = BasetenCloud(cloud_cfg.to_dict())
    else:
        raise ValueError(f"Unsupported cloud provider: {provider}")
    with open(config, "r", encoding="utf-8") as file:
        config_yaml = file.read()
    local_dirs = {}
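The dispatch above keys off a `provider` field in the loaded cloud config. As a rough sketch (only `provider` matters to the dispatch itself; the other keys mirror what the Baseten template added below reads, and the secret name is hypothetical):

```python
# Sketch of a cloud config that the new dispatch would route to BasetenCloud.
from axolotl.utils.dict import DictDefault

cloud_cfg = DictDefault(
    {
        "provider": "baseten",
        "project_name": "axolotl-project",
        "gpu": "h100",
        "gpu_count": 8,
        "node_count": 1,
        "secrets": ["HF_TOKEN"],  # hypothetical secret name in the workspace
    }
)

provider = cloud_cfg.provider or "modal"
print(provider)  # -> "baseten"
```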
48
src/axolotl/cli/cloud/baseten/__init__.py
Normal file
@@ -0,0 +1,48 @@
"""Baseten Cloud CLI"""

import shutil
import subprocess  # nosec B404
import tempfile
from os.path import dirname
from typing import Literal

import yaml

from axolotl.cli.cloud.base import Cloud


class BasetenCloud(Cloud):
    """Baseten Cloud Axolotl CLI"""

    def __init__(self, config: dict):
        self.config = config

    def preprocess(self, config_yaml: str, *args, **kwargs) -> None:
        raise NotImplementedError(
            "Separate preprocess function for Baseten is not "
            "implemented and will happen during the train step."
        )

    def train(
        self,
        config_yaml: str,
        launcher: Literal["accelerate", "torchrun", "python"] = "accelerate",
        launcher_args: list[str] | None = None,
        local_dirs: dict[str, str] | None = None,  # pylint: disable=unused-argument
        **kwargs,
    ):
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = self.config.copy()
            config["launcher"] = launcher
            config["launcher_args"] = launcher_args
            with open(tmp_dir + "/cloud.yaml", "w", encoding="utf-8") as cloud_fout:
                yaml.dump(config, cloud_fout)
            with open(tmp_dir + "/train.yaml", "w", encoding="utf-8") as config_fout:
                config_fout.write(config_yaml)
            shutil.copyfile(dirname(__file__) + "/template/run.sh", tmp_dir + "/run.sh")
            shutil.copyfile(
                dirname(__file__) + "/template/train_sft.py", tmp_dir + "/train_sft.py"
            )
            subprocess.run(  # nosec B603 B607
                ["truss", "train", "push", "train_sft.py"], cwd=tmp_dir, check=False
            )
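A minimal usage sketch of the new provider, assuming axolotl and the Baseten `truss` CLI are installed and authenticated, and using a config dict like the one in the provider sketch above:

```python
# Sketch: push a training job through BasetenCloud. This writes cloud.yaml,
# train.yaml and the run.sh / train_sft.py templates into a temp dir, then
# shells out to `truss train push train_sft.py`.
from axolotl.cli.cloud.baseten import BasetenCloud

cloud = BasetenCloud(
    {"project_name": "axolotl-project", "gpu_count": 8, "node_count": 1}
)

with open("examples/streaming/sft.yaml", encoding="utf-8") as f:
    config_yaml = f.read()

cloud.train(config_yaml=config_yaml, launcher="accelerate")
```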
9
src/axolotl/cli/cloud/baseten/template/run.sh
Normal file
@@ -0,0 +1,9 @@
#!/bin/bash
set -eux

export NCCL_SOCKET_IFNAME="^docker0,lo"
export NCCL_IB_DISABLE=0
export NCCL_TIMEOUT=1800000

axolotl preprocess train.yaml
axolotl train train.yaml --launcher ${AXOLOTL_LAUNCHER} ${AXOLOTL_LAUNCHER_ARGS}
71
src/axolotl/cli/cloud/baseten/template/train_sft.py
Normal file
@@ -0,0 +1,71 @@
"""
Baseten Training Script for Axolotl
"""

# pylint: skip-file
import yaml
from truss.base import truss_config

# Import necessary classes from the Baseten Training SDK
from truss_train import definitions

cloud_config = yaml.safe_load(open("cloud.yaml", "r"))
gpu = cloud_config.get("gpu", "h100")
gpu_count = int(cloud_config.get("gpu_count", 1))
node_count = int(cloud_config.get("node_count", 1))
project_name = cloud_config.get("project_name", "axolotl-project") or "axolotl-project"
secrets = cloud_config.get("secrets", [])
launcher = cloud_config.get("launcher", "accelerate")
launcher_args = cloud_config.get("launcher_args", [])
script_name = "run.sh"

launcher_args_str = ""
if launcher_args:
    launcher_args_str = "-- " + " ".join(launcher_args)

# 1. Define a base image for your training job
# must use torch 2.7.0 for vllm
BASE_IMAGE = "axolotlai/axolotl:main-py3.11-cu126-2.7.1"

# 2. Define the Runtime Environment for the Training Job
# This includes start commands and environment variables.
# Secrets from the baseten workspace like API keys are referenced using
# `SecretReference`.

env_vars = {
    "AXOLOTL_LAUNCHER": launcher,
    "AXOLOTL_LAUNCHER_ARGS": launcher_args_str,
}
for secret_name in secrets:
    env_vars[secret_name] = definitions.SecretReference(name=secret_name)

training_runtime = definitions.Runtime(
    start_commands=[  # Example: list of commands to run your training script
        f"/bin/sh -c 'chmod +x ./{script_name} && ./{script_name}'"
    ],
    environment_variables=env_vars,
)

# 3. Define the Compute Resources for the Training Job
training_compute = definitions.Compute(
    node_count=node_count,
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,
        count=gpu_count,
    ),
)

# 4. Define the Training Job
# This brings together the image, compute, and runtime configurations.
my_training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
)


# This config will be pushed using the Truss CLI.
# The association of the job to the project happens at the time of push.
first_project_with_job = definitions.TrainingProject(
    name=project_name, job=my_training_job
)
@@ -14,10 +14,14 @@ from transformers import GenerationConfig, TextIteratorStreamer, TextStreamer
from axolotl.cli.args import InferenceCliArgs
from axolotl.cli.config import load_cfg
from axolotl.cli.utils import load_model_and_tokenizer
from axolotl.utils.chat_templates import (
from axolotl.cli.utils.diffusion import (
    get_chat_template,
    diffusion_inference,
    get_chat_template_from_config,
    launch_diffusion_gradio_ui,
    render_html,
    run_diffusion,
)
from axolotl.integrations.base import PluginManager
from axolotl.utils.chat_templates import get_chat_template_from_config
from axolotl.utils.dict import DictDefault
from axolotl.utils.logging import get_logger

@@ -32,6 +36,7 @@ def get_multi_line_input() -> str:
    Possibly multi-line, possibly empty stdin input as a string.
    """
    print("Give me an instruction (Ctrl + D to submit): ")
    print("=" * 80)

    instruction = ""
    for line in sys.stdin:
@@ -46,9 +51,9 @@ def do_inference(
    cli_args: InferenceCliArgs,
):
    """
    Runs inference on the command line in a loop. User input is accepted, a chat template
    Runs inference on the command line in a loop. User input is accepted, a chat
    is (optionally) applied, and the model specified in the `axolotl` config is used to
    template is (optionally) applied, and the model specified in the `axolotl` config is
    generate completions according to a default generation config.
    used to generate completions according to a default generation config.

    Args:
        cfg: Dictionary mapping `axolotl` config keys to values.
@@ -64,17 +69,31 @@ def do_inference(
            importlib.import_module("axolotl.prompters"), prompter
        )
    elif cfg.chat_template:
        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
        chat_template_str = get_chat_template_from_config(
    elif cfg.datasets[0].type == "chat_template":
            cfg, ds_cfg=None, tokenizer=tokenizer
        )
    elif cfg.datasets and cfg.datasets[0].type == "chat_template":
        chat_template_str = get_chat_template_from_config(
            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
        )

    model = model.to(cfg.device, dtype=cfg.torch_dtype)

    # Detect diffusion mode
    plugin_manager = PluginManager.get_instance()
    is_diffusion = any(
        plugin.__class__.__name__ == "DiffusionPlugin"
        for plugin in plugin_manager.plugins.values()
    )

    if is_diffusion:
        print("=" * 80)
        print("Commands:")
        print(":complete N -> completion mode with N tokens (default 64)")
        print(":mask R -> random masking with ratio R (0.0–1.0)")

    while True:
        print("=" * 80)
        # support for multiline inputs
        instruction = get_multi_line_input()
        if not instruction:
            return
@@ -104,9 +123,19 @@ def do_inference(
    else:
        batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    print("=" * 40)
    print("=" * 80)
    model.eval()
    with torch.no_grad():
        if is_diffusion:
            diffusion_inference(
                model=model,
                tokenizer=tokenizer,
                cfg=cfg,
                prompt=prompt,
                chat_template_str=chat_template_str,
            )
            continue

        generation_config = GenerationConfig(
            repetition_penalty=1.1,
            max_new_tokens=1024,
@@ -129,7 +158,7 @@ def do_inference(
            generation_config=generation_config,
            streamer=streamer,
        )
        print("=" * 40)
        print("=" * 80)
        print(tokenizer.decode(generated["sequences"].cpu().tolist()[0]))


@@ -159,10 +188,33 @@ def do_inference_gradio(
            importlib.import_module("axolotl.prompters"), prompter
        )
    elif cfg.chat_template:
        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
        chat_template_str = get_chat_template_from_config(
            cfg, ds_cfg=None, tokenizer=tokenizer
        )
    elif cfg.datasets and cfg.datasets[0].type == "chat_template":
        chat_template_str = get_chat_template_from_config(
            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
        )

    model = model.to(cfg.device, dtype=cfg.torch_dtype)

    # Detect diffusion mode
    plugin_manager = PluginManager.get_instance()
    is_diffusion = any(
        plugin.__class__.__name__ == "DiffusionPlugin"
        for plugin in plugin_manager.plugins.values()
    )

    if is_diffusion:
        launch_diffusion_gradio_ui(
            model=model,
            tokenizer=tokenizer,
            cfg=cfg,
            prompter_module=prompter_module,
            chat_template_str=chat_template_str,
        )
        return

    def generate(instruction):
        if not instruction:
            return
@@ -43,7 +43,10 @@ def do_merge_lora(*, cfg: DictDefault) -> None:
        safe_serialization=safe_serialization,
        progressbar=True,
    )
    tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
    tokenizer.save_pretrained(
        str(Path(cfg.output_dir) / "merged"),
        save_jinja_files=cfg.tokenizer_save_jinja_files,
    )

    if processor:
        processor.save_pretrained(str(Path(cfg.output_dir) / "merged"))
@@ -35,10 +35,20 @@ def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
    check_accelerate_default_config()
    check_user_token()

    if cli_args.iterable:
        LOG.error(
            "The --iterable CLI argument for 'axolotl preprocess' is no longer "
            "supported. For training, set 'streaming: true' in your YAML config or "
            "pass '--streaming' in your 'axolotl train' command for on-the-fly "
            "preprocessing."
        )
        return

    for key in ["skip_prepare_dataset", "pretraining_dataset"]:
        if cfg.get(key):
            LOG.error(
                f"You have set `{key}:`. `preprocess` is not needed. Run the `axolotl train` CLI directly instead."
                f"You have set `{key}:`. `preprocess` is not needed. Run the 'axolotl "
                "train' CLI directly instead."
            )
            return
@@ -5,12 +5,17 @@ CLI to post-training quantize a model using torchao
from pathlib import Path
from typing import Union

from transformers import AutoModelForCausalLM
from transformers import AutoConfig, AutoModelForCausalLM, TorchAoConfig

from axolotl.cli.config import load_cfg
from axolotl.loaders import load_tokenizer
from axolotl.utils.logging import get_logger
from axolotl.utils.quantization import TorchIntDType, quantize_model_for_ptq
from axolotl.utils.quantization import (
    TorchAOQuantDType,
    get_quantization_config,
    quantization_config_to_str,
    quantize_model,
)

LOG = get_logger(__name__)

@@ -43,13 +48,13 @@ def do_quantize(
            "No quantization configuration found. Please specify either qat or quantization in your config file."
        )

    model_path = cli_args.get("model_path") or cfg.output_dir
    model_path = cli_args.get("base_model") or cfg.output_dir
    if weight_dtype := cli_args.get("weight_dtype"):
        weight_dtype = TorchIntDType[weight_dtype]
        weight_dtype = TorchAOQuantDType.from_string(weight_dtype)
    else:
        weight_dtype = quantize_cfg.weight_dtype
    if activation_dtype := cli_args.get("activation_dtype"):
        activation_dtype = TorchIntDType[activation_dtype]
        activation_dtype = TorchAOQuantDType.from_string(activation_dtype)
    else:
        activation_dtype = quantize_cfg.activation_dtype
    group_size = cli_args.get("group_size") or quantize_cfg.group_size
@@ -57,10 +62,15 @@ def do_quantize(
        cli_args.get("quantize_embedding") or quantize_cfg.quantize_embedding
    )
    output_dir = cli_args.get("output_dir") or cfg.output_dir
    hub_model_id = cli_args.get("hub_model_id") or cfg.hub_model_id

    LOG.info(f"Loading model from {model_path}...")
    LOG.info(f"Loading model from {model_path}.")
    tokenizer = load_tokenizer(cfg)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    config = AutoConfig.from_pretrained(model_path)
    torch_dtype = config.torch_dtype if hasattr(config, "torch_dtype") else None
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="auto", torch_dtype=torch_dtype
    )

    LOG.info(
        f"Quantizing model with configuration: \n"
@@ -70,11 +80,21 @@ def do_quantize(
        f"\tquantize_embedding: {quantize_embedding}"
    )

    quantize_model_for_ptq(
    quantize_model(
        model, weight_dtype, group_size, activation_dtype, quantize_embedding
    )

    LOG.info(f"Saving quantized model to: {str(Path(output_dir) / 'quantized')}...")
    quantization_config = get_quantization_config(
        weight_dtype, activation_dtype, group_size
    )

    ao_config = TorchAoConfig(
        quant_type=quantization_config,
        include_input_output_embeddings=quantize_embedding,
    )
    model.config.quantization_config = ao_config

    LOG.info(f"Saving quantized model to: {str(Path(output_dir) / 'quantized')}.")
    model.save_pretrained(
        str(Path(output_dir) / "quantized"),
        safe_serialization=False,
@@ -84,5 +104,16 @@ def do_quantize(
        str(Path(output_dir) / "quantized"),
        safe_serialization=False,
        progressbar=True,
        save_jinja_files=cfg.tokenizer_save_jinja_files,
    )
    LOG.info(f"Quantized model saved to: {str(Path(output_dir) / 'quantized')}...")

    if hub_model_id:
        hub_model_id = (
            hub_model_id.rstrip("-")
            + f"-{quantization_config_to_str[type(quantization_config)]}"
        )
        model.push_to_hub(hub_model_id, safe_serialization=False)
        tokenizer.push_to_hub(hub_model_id)
        LOG.info(f"Quantized model pushed to: {hub_model_id}.")

    LOG.info(f"Quantized model saved to: {str(Path(output_dir) / 'quantized')}.")
375
src/axolotl/cli/utils/diffusion.py
Normal file
@@ -0,0 +1,375 @@
"""Helpers for diffusion-mode inference in CLI and Gradio."""

from __future__ import annotations

import gradio as gr
import torch
from colorama import Fore, Style

from axolotl.integrations.diffusion import generate, resolve_mask_token_id
from axolotl.utils.dict import DictDefault


def diffusion_inference(
    model,
    tokenizer,
    cfg,
    prompt: str,
    chat_template_str: str | None = None,
):
    """Diffusion inference helper method."""
    mode = "random"
    completion_tokens = 0
    target_mask_ratio = None
    mode, completion_tokens, target_mask_ratio, cleaned = _parse_commands(prompt)

    if cleaned:
        prompt = cleaned

    info = run_diffusion(
        model=model,
        tokenizer=tokenizer,
        cfg=cfg,
        prompt=prompt,
        chat_template_str=chat_template_str,
        mode=mode,
        target_mask_ratio=target_mask_ratio,
        completion_tokens=completion_tokens,
    )
    masked_text = info["masked_text"]
    mask_ratio = info["mask_ratio"]
    generated_ids = info["generated_ids"]
    masked_positions = info["masked_positions"]
    orig_ids = info["orig_ids"]

    # Display with masked preview and colored diff
    if masked_text is not None and mask_ratio is not None:
        print(f"Masked ({mask_ratio:.1%}):\n{masked_text}\n")
    if generated_ids is not None:
        # Compute per-token style
        styles: list[str] = []
        for i, tid in enumerate(generated_ids):
            if i in masked_positions:
                if i < len(orig_ids) and tid == orig_ids[i]:
                    styles.append("green")  # correct fill
                elif i < len(orig_ids):
                    styles.append("red")  # incorrect fill
                else:
                    styles.append("normal")  # appended
            else:
                same = i < len(orig_ids) and tid == orig_ids[i]
                styles.append("dim" if same else "normal")

        # Group contiguous spans by style
        styled_spans: list[tuple[str, int, int]] = []
        if generated_ids:
            current_style = styles[0]
            start = 0
            for i in range(1, len(generated_ids)):
                s = styles[i]
                if s != current_style:
                    styled_spans.append((current_style, start, i))
                    current_style, start = s, i
            styled_spans.append((current_style, start, len(generated_ids)))

        out_parts = []
        for style_name, a, b in styled_spans:
            chunk_text = tokenizer.decode(generated_ids[a:b], skip_special_tokens=False)
            if style_name == "green":
                out_parts.append(Fore.GREEN + chunk_text + Style.RESET_ALL)
            elif style_name == "red":
                out_parts.append(Fore.RED + chunk_text + Style.RESET_ALL)
            else:
                if style_name == "dim":
                    out_parts.append(Style.DIM + chunk_text + Style.RESET_ALL)
                else:
                    out_parts.append(chunk_text)
        print("Generated:\n" + "".join(out_parts))
    else:
        print("Generated:\n(no output)")


def _parse_commands(text: str):
    """
    Parse leading diffusion commands.

    Supported at start of input (can be chained):
        :complete N -> completion mode with N tokens (default 64)
        :mask R -> random masking with ratio R in [0, 1]
    """
    tokens = text.strip().split()
    i = 0
    mode = "random"
    completion_tokens = 0
    target_mask_ratio = None
    consumed = 0
    while i < len(tokens) and tokens[i].startswith(":"):
        cmd = tokens[i]
        i += 1
        consumed = i
        if cmd == ":complete":
            mode = "completion"
            if i < len(tokens):
                try:
                    completion_tokens = int(tokens[i])
                    i += 1
                    consumed = i
                except Exception:
                    completion_tokens = 64
            else:
                completion_tokens = 64
        elif cmd == ":mask":
            mode = "random"
            if i < len(tokens):
                try:
                    target_mask_ratio = float(tokens[i])
                    i += 1
                    consumed = i
                except Exception:
                    target_mask_ratio = None
        else:
            i -= 1
            consumed = i
            break

    cleaned = " ".join(tokens[consumed:])

    return mode, completion_tokens, target_mask_ratio, cleaned


def run_diffusion(
    *,
    model,
    tokenizer,
    cfg: DictDefault,
    prompt: str,
    chat_template_str: str | None,
    mode: str = "random",
    target_mask_ratio: float | None = None,
    completion_tokens: int = 0,
):
    """Run a single diffusion generation and return a structured result dict."""
    if chat_template_str:
        batch = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            return_tensors="pt",
            add_special_tokens=True,
            add_generation_prompt=True,
            chat_template=chat_template_str,
            tokenize=True,
            return_dict=True,
        )
    else:
        batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    mask_token_id = resolve_mask_token_id(tokenizer, cfg, allow_add=False)

    seq = batch["input_ids"].to(cfg.device)
    gen_mode = "completion" if mode == "completion" else "random"
    comp_tokens = int(completion_tokens) if gen_mode == "completion" else 0

    result = generate(
        model,
        tokenizer,
        original_sequence=seq[:1],
        num_diffusion_steps=cfg.diffusion.num_diffusion_steps,
        temperature=cfg.diffusion.generation_temperature,
        mask_token_id=int(mask_token_id),
        mode=gen_mode,  # type: ignore[arg-type]
        completion_tokens=comp_tokens,
        target_mask_ratio=target_mask_ratio,
    )

    masked_text = result.get("masked") if isinstance(result, dict) else None
    mask_ratio = result.get("mask_ratio") if isinstance(result, dict) else None
    generated_ids = result.get("generated_ids") if isinstance(result, dict) else None
    masked_positions = (
        set(result.get("masked_positions") or []) if isinstance(result, dict) else set()
    )
    orig_ids = seq[0].detach().cpu().tolist()

    return {
        "masked_text": masked_text,
        "mask_ratio": mask_ratio,
        "generated_ids": generated_ids,
        "masked_positions": masked_positions,
        "orig_ids": orig_ids,
    }


def render_html(
    *,
    generated_ids: list[int] | None,
    orig_ids: list[int],
    masked_positions: set[int],
    tokenizer,
) -> str:
    """Render HTML visualizing diffusion outputs."""
    if not generated_ids:
        return "<pre>Generated:\n(no output)</pre>"

    def _style_for(i: int, tid: int) -> str:
        if i in masked_positions:
            if i < len(orig_ids) and tid == orig_ids[i]:
                return "green"
            if i < len(orig_ids):
                return "red"
            return "normal"
        same = i < len(orig_ids) and tid == orig_ids[i]
        return "dim" if same else "normal"

    # Group contiguous spans by style to reduce HTML size
    spans: list[tuple[str, int, int]] = []
    if generated_ids:
        cur = _style_for(0, generated_ids[0])
        start = 0
        for i in range(1, len(generated_ids)):
            s = _style_for(i, generated_ids[i])
            if s != cur:
                spans.append((cur, start, i))
                cur, start = s, i
        spans.append((cur, start, len(generated_ids)))

    html_parts = []
    for style_name, a, b in spans:
        txt = tokenizer.decode(generated_ids[a:b], skip_special_tokens=False)
        if style_name == "green":
            html_parts.append(f'<span style="color:#2e7d32">{txt}</span>')
        elif style_name == "red":
            html_parts.append(f'<span style="color:#c62828">{txt}</span>')
        elif style_name == "dim":
            html_parts.append(f'<span style="opacity:0.6">{txt}</span>')
        else:
            html_parts.append(txt)

    legend = (
        '<div style="font-size:0.9em;margin-bottom:4px">'
        '<span style="color:#2e7d32">correct</span>, '
        '<span style="color:#c62828">incorrect</span>, '
        '<span style="opacity:0.6">unchanged</span>'
        "</div>"
    )

    return (
        legend
        + '<pre style="white-space:pre-wrap">Generated:\n'
        + "".join(html_parts)
        + "</pre>"
    )


def launch_diffusion_gradio_ui(
    *,
    model,
    tokenizer,
    cfg: DictDefault,
    prompter_module=None,
    chat_template_str: str | None = None,
):
    """Build and launch a simple Gradio UI for diffusion inference."""
    with gr.Blocks(
        title=cfg.get("gradio_title", "Axolotl Diffusion Interface")
    ) as demo:
        gr.Markdown(
            """
            ## Axolotl Diffusion Inference
            - Mode "Random" masks tokens at a target ratio and fills them.
            - Mode "Completion" appends N masked tokens at the end and fills them.
            """
        )

        with gr.Row():
            mode = gr.Radio(
                choices=["random", "completion"],
                value="random",
                label="Mode",
            )
            mask_ratio = gr.Slider(
                minimum=0.0,
                maximum=1.0,
                step=0.05,
                value=0.4,
                label="Mask ratio (random mode)",
                interactive=True,
            )
            completion_tokens = gr.Number(
                value=64,
                precision=0,
                label="Completion tokens (completion mode)",
                interactive=True,
                visible=False,
            )

        instruction = gr.Textbox(label="Instruction", lines=6)
        run_btn = gr.Button("Generate")

        masked_preview = gr.Textbox(label="Masked preview", lines=6)
        html_out = gr.HTML(label="Generated")

        def _toggle_controls(selected_mode: str):
            return (
                gr.update(visible=(selected_mode == "random")),
                gr.update(visible=(selected_mode == "completion")),
            )

        mode.change(
            _toggle_controls,
            inputs=[mode],
            outputs=[mask_ratio, completion_tokens],
        )

        def _gen(instruction_text: str, selected_mode: str, mratio: float, ctoks: int):
            if not instruction_text:
                return "", "<pre>Generated:\n(no output)</pre>"

            if prompter_module:
                prompt: str = next(
                    prompter_module().build_prompt(
                        instruction=instruction_text.strip("\n")
                    )
                )
            else:
                prompt = instruction_text.strip()

            info = run_diffusion(
                model=model,
                tokenizer=tokenizer,
                cfg=cfg,
                prompt=prompt,
                chat_template_str=chat_template_str,
                mode=selected_mode,
                target_mask_ratio=mratio if selected_mode == "random" else None,
                completion_tokens=int(ctoks) if selected_mode == "completion" else 0,
            )

            masked_text = info.get("masked_text")
            mask_ratio_val = info.get("mask_ratio")
            generated_ids = info.get("generated_ids")
            masked_positions = info.get("masked_positions") or set()
            orig_ids = info.get("orig_ids") or []

            preview = (
                f"Masked ({mask_ratio_val:.1%}):\n{masked_text}"
                if masked_text is not None and mask_ratio_val is not None
                else ""
            )
            html = render_html(
                generated_ids=generated_ids,
                orig_ids=orig_ids,
                masked_positions=masked_positions,
                tokenizer=tokenizer,
            )
            return preview, html

        run_btn.click(
            _gen,
            inputs=[instruction, mode, mask_ratio, completion_tokens],
            outputs=[masked_preview, html_out],
        )

    demo.queue().launch(
        show_api=False,
        share=cfg.get("gradio_share", True),
        server_name=cfg.get("gradio_server_name", "127.0.0.1"),
        server_port=cfg.get("gradio_server_port", None),
    )
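The interactive commands understood by `diffusion_inference` are parsed up front by `_parse_commands`. A quick illustration of what the parse returns (importing the private helper purely for demonstration):

```python
# Sketch: what the leading ":complete" / ":mask" commands parse to.
from axolotl.cli.utils.diffusion import _parse_commands

print(_parse_commands(":complete 32 Write a haiku about GPUs"))
# ("completion", 32, None, "Write a haiku about GPUs")

print(_parse_commands(":mask 0.3 The quick brown fox"))
# ("random", 0, 0.3, "The quick brown fox")

print(_parse_commands("No commands, just a prompt"))
# ("random", 0, None, "No commands, just a prompt")
```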
@@ -55,13 +55,11 @@ def load_datasets(
    """
    tokenizer = load_tokenizer(cfg)
    processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
    preprocess_iterable = getattr(cli_args, "iterable", False)

    train_dataset, eval_dataset, total_num_steps, prompters = prepare_datasets(
        cfg,
        tokenizer,
        processor=processor,
        preprocess_iterable=preprocess_iterable,
    )

    if (
@@ -24,9 +24,7 @@ from pathlib import Path
from typing import Any

import torch
from transformers import (
from transformers import TrainerCallback
    TrainerCallback,
)
from transformers.trainer_pt_utils import AcceleratorConfig

from axolotl.integrations.base import PluginManager
@@ -512,6 +510,7 @@ class TrainerBuilderBase(abc.ABC):
            self.cfg.eval_batch_size
        )

        training_args_kwargs["include_tkps"] = self.cfg.include_tkps
        training_args_kwargs["max_steps"] = self.cfg.max_steps or total_num_steps or -1
        training_args_kwargs["num_train_epochs"] = self.cfg.num_epochs
@@ -10,6 +10,7 @@ import transformers
from transformers import (
    DataCollatorWithFlattening,
    EarlyStoppingCallback,
    Trainer,
)
from trl.trainer.utils import RewardDataCollatorWithPadding

@@ -35,6 +36,7 @@ from axolotl.utils.callbacks import (
)
from axolotl.utils.callbacks.lisa import lisa_callback_factory
from axolotl.utils.callbacks.qat import QATCallback
from axolotl.utils.callbacks.tokens_per_second import TokensPerSecondCallback
from axolotl.utils.chat_templates import get_chat_template_from_config
from axolotl.utils.collators import (
    BatchSamplerDataCollatorForSeq2Seq,
@@ -74,6 +76,12 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.qat:
            callbacks.append(QATCallback(self.cfg.qat))

        if self.cfg.include_tkps:
            callbacks.append(
                TokensPerSecondCallback(
                    self.cfg.tensor_parallel_size, self.cfg.context_parallel_size
                )
            )
        return callbacks

    def get_post_trainer_create_callbacks(self, trainer):
@@ -340,6 +348,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):

        if self.cfg.reward_model:
            training_args_cls = AxolotlRewardConfig
            if self.cfg.center_rewards_coefficient is not None:
                training_arguments_kwargs["center_rewards_coefficient"] = (
                    self.cfg.center_rewards_coefficient
                )
        elif self.cfg.process_reward_model:
            training_args_cls = AxolotlPRMConfig
        else:
@@ -383,10 +395,11 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
            **data_collator_kwargs,
        )
        sig = inspect.signature(trainer_cls)
        if "processing_class" in sig.parameters:
        if "processing_class" in sig.parameters or issubclass(trainer_cls, Trainer):
            trainer_kwargs["processing_class"] = self.tokenizer
        elif "tokenizer" in sig.parameters:
            trainer_kwargs["tokenizer"] = self.tokenizer

        if (
            trainer_cls not in [AxolotlRewardTrainer, AxolotlPRMTrainer]
            and self.cfg.datasets is not None
@@ -404,6 +417,9 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
            **trainer_kwargs,
        )
        trainer = self.hook_post_create_trainer(trainer)
        # if the trainer has the `axolotl_cfg` property, set it
        if hasattr(trainer, "axolotl_cfg"):
            trainer.axolotl_cfg = self.cfg
        for callback in self.get_post_trainer_create_callbacks(trainer):
            trainer.add_callback(callback)
@@ -42,12 +42,20 @@ from axolotl.core.trainers.utils import (
)
from axolotl.utils import get_not_null
from axolotl.utils.bench import get_gpu_memory_usage
from axolotl.utils.dict import DictDefault
from axolotl.utils.distributed import is_main_process
from axolotl.utils.logging import get_logger
from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths

LOG = get_logger(__name__)

REDUCTION_FNS = {
    "mean": torch.mean,
    "min": torch.min,
    "max": torch.max,
    "sum": torch.sum,
}


class AxolotlTrainer(
    PackingMixin,
@@ -63,6 +71,15 @@ class AxolotlTrainer(

    args = None  # type: "AxolotlTrainingArguments"  # type: ignore[name-defined]
    tag_names = ["axolotl"]
    _axolotl_cfg: DictDefault | None = None

    @property
    def axolotl_cfg(self):
        return self._axolotl_cfg

    @axolotl_cfg.setter
    def axolotl_cfg(self, cfg):
        self._axolotl_cfg = cfg

    def __init__(
        self,
@@ -78,9 +95,10 @@ class AxolotlTrainer(
        self._signature_columns = None  # workaround for pylint

        super().__init__(*_args, **kwargs)

        self.train_data_collator = self.data_collator
        self._stored_metrics = defaultdict(lambda: defaultdict(list))
        self._stored_metrics = defaultdict(
            lambda: defaultdict(lambda: {"values": [], "reduction": "mean"})
        )
        if self.args.orpo_alpha:
            self.loss_fct = torch.nn.CrossEntropyLoss(reduction="none")

@@ -327,6 +345,17 @@ class AxolotlTrainer(
        # outputs = model(**inputs)
        # loss = trainer_weighted_loss(outputs, labels, shift_labels=True)
        # return (loss, outputs) if return_outputs else loss

        # track number of tokens for tokens per second calculation
        if self.args.include_tkps:
            inputs_key = "labels" if "labels" in inputs else "input_ids"
            if hasattr(self.state, "num_tokens"):
                self.state.num_tokens = (
                    self.state.num_tokens + (inputs[inputs_key] != -100).sum().cpu()
                )
            else:
                self.state.num_tokens = (inputs[inputs_key] != -100).sum().cpu()

        if self.args.orpo_alpha:
            return self.orpo_compute_loss(
                model,
@@ -342,6 +371,11 @@ class AxolotlTrainer(
            num_items_in_batch=num_items_in_batch,
        )

    @override
    def evaluate(self, *args, **kwargs):
        LOG.info("Running evaluation step...")
        return super().evaluate(*args, **kwargs)

    @staticmethod
    def orpo_concatenate_inputs(inputs, label_pad_token=-100, pad_token=0, device=None):
        concatenated_batch = {}
@@ -526,9 +560,6 @@ class AxolotlTrainer(

        super().create_accelerator_and_postprocess()

        # now we need to put parallelism_config back on the PartialState since we rely on that info in other places
        # PartialState().parallelism_config = self.accelerator.state.parallelism_config

        if self.is_fsdp_enabled:
            if (
                "limit_all_gathers" in self.args.fsdp_config
@@ -568,29 +599,61 @@ class AxolotlTrainer(
        """
        # logs either has 'loss' or 'eval_loss'
        train_eval = "train" if "loss" in logs else "eval"
        # Add averaged stored metrics to logs
        for key, metrics in self._stored_metrics[train_eval].items():
        for key, metric_data in self._stored_metrics[train_eval].items():
            logs[key] = torch.tensor(metrics).mean().item()
            values = torch.tensor(metric_data["values"])  # type: ignore[arg-type]
            reduction_type = metric_data["reduction"]

            fn = REDUCTION_FNS.get(reduction_type)
            if fn is None:
                raise NotImplementedError(
                    "Metric reduction must be one of [mean, min, max, sum]"
                )
            logs[key] = round(fn(values).item(), 4)

        if is_main_process():
            # Add memory usage
            try:
                active, allocated, reserved = get_gpu_memory_usage()
                logs["memory/max_mem_active(gib)"] = round(active, 2)
                logs["memory/max_active (GiB)"] = round(active, 2)
                logs["memory/max_mem_allocated(gib)"] = round(allocated, 2)
                logs["memory/max_allocated (GiB)"] = round(allocated, 2)
                logs["memory/device_mem_reserved(gib)"] = round(reserved, 2)
                logs["memory/device_reserved (GiB)"] = round(reserved, 2)
            except (ValueError, TypeError, FileNotFoundError):
                pass

        if self.args.include_tkps and train_eval == "train":
            # each rank will log its own tokens per second
            # for logging_steps > 1 we obtain a moving average of this metric
            logs["tokens_per_second_per_gpu"] = round(
                self.state.last_tokens_per_second.item() / self.args.logging_steps, 2
            )

        del self._stored_metrics[train_eval]

        return super().log(logs, start_time)

    def store_metrics(
        self, metrics: dict[str, float], train_eval: Literal["train", "eval"] = "train"
        self,
        metrics: dict[str, float] | dict[str, tuple[int | float, str]],
        train_eval: Literal["train", "eval"] = "train",
        reduction: Literal["mean", "min", "max", "sum"] = "mean",
    ) -> None:
        """
        Store metrics with specified reduction type.

        Args:
            metrics: Dictionary of metric names to values, or metric names to (value,
                reduction_type) tuples.
            train_eval: Whether this is for training or evaluation.
        """
        for key, value in metrics.items():
            self._stored_metrics[train_eval][key].append(value)
            if isinstance(value, tuple):
                value, _reduction = value  # type: ignore[assignment]
            else:
                value, _reduction = value, reduction

            self._stored_metrics[train_eval][key]["values"].append(value)
            self._stored_metrics[train_eval][key]["reduction"] = _reduction

    def _save_checkpoint(self, model, trial, **kwargs):
        # make sure the checkpoint dir exists, since trainer is flakey
@@ -657,6 +720,11 @@ class AxolotlTrainer(
            LOG.info(
                "Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`"
            )
            self.data_collator.tokenizer.save_pretrained(output_dir)
            save_jinja_files = True
            if self.axolotl_cfg:
                save_jinja_files = self.axolotl_cfg.tokenizer_save_jinja_files
            self.data_collator.tokenizer.save_pretrained(
                output_dir, save_jinja_files=save_jinja_files
            )
        # Good practice: save your training arguments together with the trained model
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
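The reduction-aware metric storage above lets callers pick how each stored metric is aggregated when logs are flushed: plain floats keep the default (or the `reduction` argument), while `(value, reduction)` tuples override it per key. A self-contained sketch of the same semantics (illustrative only, with made-up metric names, not the trainer itself):

```python
# Sketch of the store_metrics/log reduction semantics (metric names made up).
from collections import defaultdict

import torch

REDUCTION_FNS = {"mean": torch.mean, "min": torch.min, "max": torch.max, "sum": torch.sum}
stored = defaultdict(lambda: {"values": [], "reduction": "mean"})


def store(metrics, reduction="mean"):
    for key, value in metrics.items():
        value, _reduction = value if isinstance(value, tuple) else (value, reduction)
        stored[key]["values"].append(value)
        stored[key]["reduction"] = _reduction


store({"custom/loss_component": 0.4})
store({"custom/loss_component": 0.2})
store({"custom/peak_activation": (11.2, "max")})
store({"custom/peak_activation": (7.5, "max")})

logs = {
    key: round(REDUCTION_FNS[data["reduction"]](torch.tensor(data["values"])).item(), 4)
    for key, data in stored.items()
}
print(logs)  # {'custom/loss_component': 0.3, 'custom/peak_activation': 11.2}
```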
@@ -49,6 +49,12 @@ class AxolotlTrainingMixins:
        default=False,
        metadata={"help": "Use real batches for efficient training."},
    )
    include_tkps: bool = field(
        default=True,
        metadata={
            "help": "Whether to include tokens per second in the training metrics."
        },
    )
    eval_sample_packing: Optional[bool] = field(
        default=None,
        metadata={"help": "Use sample packing for efficient evals."},
@@ -1,18 +1,17 @@
-"""Module containing Dataset functionality"""
+"""
+Module containing dataset functionality.
+
+We want this to be a wrapper for an existing dataset that we have loaded. Lets use the
+concept of middlewares to wrap each dataset. We'll use the collators later on to pad the
+datasets.
+"""

-import torch
 from datasets import Dataset, IterableDataset

 from axolotl.utils.logging import get_logger

 from .prompt_tokenizers import PromptTokenizingStrategy

-# We want this to be a wrapper for an existing dataset that we have loaded
-# lets use the concept of middlewares to wrap each dataset, for example
-# ConstantLengthDataset(ShuffledDataset([TokenizedPromptDataset(alpaca_dataset)]))
-# let's check to ensure we don't truncate an item in the middle, we'll use
-# the collators later on to pad the datasets
-
 LOG = get_logger(__name__)

@@ -86,133 +85,3 @@ def wrap_dataset_for_tokenized_prompt(
         **map_kwargs,
     )
     return TokenizedPromptDataset(prompt_tokenizer, dataset, **kwargs)
-
-
-# TODO this isn't the best since it can't interleave datasets
-class ConstantLengthDataset(IterableDataset):
-    """Iterable dataset that returns constant length chunks of tokens from stream of
-    text files.
-
-    Args:
-        tokenizer: The processor used for processing the data.
-        dataset: Dataset with text files.
-        seq_length: Length of token sequences to return.
-    """
-
-    def __init__(
-        self,
-        tokenizer,
-        datasets,
-        seq_length=2048,
-    ):
-        self.tokenizer = tokenizer
-        self.concat_token_id = tokenizer.eos_token_id
-        self.datasets: list[IterableDataset] = datasets
-        self.seq_length = seq_length
-
-        vocab_size = len(tokenizer.get_vocab())
-
-        if vocab_size <= torch.iinfo(torch.int16).max:
-            self.tokens_dtype = torch.int16
-        elif vocab_size <= torch.iinfo(torch.int32).max:
-            self.tokens_dtype = torch.int32
-        else:
-            self.tokens_dtype = torch.int64
-
-    def __iter__(self):
-        buffer = {
-            "input_ids": [],
-            "attention_mask": [],
-            "labels": [],
-            "position_ids": [],
-        }
-        buffer_len = 0
-        for dataset in self.datasets:
-            idx = 0
-            iterator = iter(dataset)
-            more_examples = True
-            while more_examples:
-                try:
-                    example = next(iterator)
-                    idx += 1
-                except StopIteration:
-                    more_examples = False
-                    example = None
-
-                add_concat_token = False
-                if example:
-                    example_len = len(example["input_ids"])
-                    add_concat_token = example["input_ids"][-1] != self.concat_token_id
-                else:
-                    example_len = 0
-
-                if not example_len or (
-                    buffer_len + int(add_concat_token) + example_len > self.seq_length
-                ):
-                    if buffer["input_ids"]:
-                        input_ids = torch.cat(buffer["input_ids"], dim=-1)[
-                            : self.seq_length
-                        ]
-                        attention_mask = torch.cat(buffer["attention_mask"], dim=-1)[
-                            : self.seq_length
-                        ]
-                        position_ids = torch.cat(buffer["position_ids"], dim=-1)[
-                            : self.seq_length
-                        ]
-                        labels = torch.cat(buffer["labels"], dim=-1)[: self.seq_length]
-                        if labels.size() == input_ids.size() and (
-                            attention_mask.size() == input_ids.size()
-                        ):
-                            yield {
-                                "input_ids": input_ids,
-                                "labels": labels,
-                                "attention_mask": attention_mask,
-                                "position_ids": position_ids,
-                            }
-                        else:
-                            LOG.warning(
-                                "Dropping batch due to tensor size mismatch "
-                                f"input_ids: {input_ids.size()}, "
-                                f"labels: {labels.size()}, "
-                                f"attention_mask: {attention_mask.size()}"
-                            )
-                    buffer = {
-                        "input_ids": [],
-                        "attention_mask": [],
-                        "labels": [],
-                        "position_ids": [],
-                    }
-                    buffer_len = 0
-                    idx = 1
-
-                if example:
-                    # FIXME
-                    # just going to drop data points that are too long
-                    if len(example["input_ids"]) <= self.seq_length:
-                        input_ids = example["input_ids"]
-                        attention_mask = example["attention_mask"]
-                        labels = example["labels"]
-
-                        if add_concat_token:
-                            input_ids.append(self.concat_token_id)
-                            attention_mask.append(1)
-                            labels.append(self.concat_token_id)
-
-                        input_ids_with_concat = torch.tensor(
-                            input_ids, dtype=self.tokens_dtype
-                        )
-                        attention_mask_with_concat = torch.tensor(
-                            [idx * m for m in attention_mask], dtype=torch.int16
-                        )
-                        labels_with_concat = torch.tensor(
-                            labels, dtype=self.tokens_dtype
-                        )
-                        position_ids = torch.arange(
-                            len(input_ids), dtype=self.tokens_dtype
-                        )
-
-                        buffer["input_ids"].append(input_ids_with_concat)
-                        buffer["attention_mask"].append(attention_mask_with_concat)
-                        buffer["labels"].append(labels_with_concat)
-                        buffer["position_ids"].append(position_ids)
-                        buffer_len += len(input_ids)
@@ -142,7 +142,7 @@ class BasePlugin:
             model: The loaded model.
         """

-    def get_trainer_cls(self, cfg: DictDefault) -> Trainer | None:
+    def get_trainer_cls(self, cfg: DictDefault) -> type[Trainer] | None:
         """Returns a custom class for the trainer.

         Args:
@@ -20,8 +20,8 @@ from typing import Any, Dict, List, Type

 from axolotl.utils.schemas.config import (
     AxolotlConfigWCapabilities as AxolotlConfigWCapabilitiesBase,
+    AxolotlInputConfig as AxolotlInputConfigBase,
 )
-from axolotl.utils.schemas.config import AxolotlInputConfig as AxolotlInputConfigBase


 def merge_input_args():
@@ -19,7 +19,7 @@ python scripts/cutcrossentropy_install.py | sh

 - If you are installing from pip
 ```bash
-pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"
+pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"
 ```

 ## Usage
@@ -34,6 +34,7 @@ plugins:
 - arcee
 - cohere
 - cohere2
+- deepseek_v3
 - gemma
 - gemma2
 - gemma3
@@ -42,6 +43,7 @@ plugins:
 - gemma3n_text
 - glm
 - glm4
+- glm4_moe
 - gpt_oss
 - granite
 - granitemoe
@@ -64,6 +66,7 @@ plugins:
 - qwen3
 - qwen3_moe
 - smollm3
+- seed_oss
 - voxtral

 ## Citation
@@ -35,7 +35,7 @@ LOG = get_logger(__name__)

 _CCE_INSTALL_MESSAGE = (
     "Please install Axolotl's fork of cut_cross_entropy with transformers support using "
-    '`pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"`'
+    '`pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"`'
 )

154 src/axolotl/integrations/diffusion/README.md (new file)
@@ -0,0 +1,154 @@
# Diffusion LM Training Plugin for Axolotl

This plugin enables diffusion language model training using an approach inspired by
LLaDA (Large Language Diffusion Models) within Axolotl.

## Overview

LLaDA is a diffusion-based approach to language model training that uses:
- **Random token masking** during training instead of next-token prediction
- **Bidirectional attention** to allow the model to attend to the full context
- **Importance weighting** based on masking probabilities for stable training

This approach can lead to more robust language models with better understanding of
bidirectional context.
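As a rough illustration of the bidirectional-attention point above, a full (non-causal) 4D attention mask can be built from an ordinary padding mask as in the sketch below; this is only an illustration, not the plugin's `create_bidirectional_attention_mask` implementation:

```python
import torch

def full_attention_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    """padding_mask: [batch, seq_len] with 1 for real tokens, 0 for padding.

    Returns a [batch, 1, seq_len, seq_len] boolean mask where every real token
    may attend to every other real token (no causal triangle).
    """
    return (padding_mask[:, None, :, None] * padding_mask[:, None, None, :]).bool()

# toy usage: batch of 2 sequences, second one padded
mask = full_attention_mask(torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]]))
```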
## Installation

The plugin is included with Axolotl. See our
[installation docs](https://docs.axolotl.ai/docs/installation.html).

## Quickstart

Train with an example config (Llama-3.2 1B):
- Pretrain: `axolotl train examples/llama-3/diffusion-3.2-1b-pretrain.yaml`
- SFT: `axolotl train examples/llama-3/diffusion-3.2-1b-sft.yaml`

### Basic Configuration

You can also modify your existing configs to enable and customize diffusion training.

Add the following to your Axolotl config:

```yaml
# Enable diffusion LM training plugin
plugins:
  - axolotl.integrations.diffusion.DiffusionPlugin
```

Then configure the nested `diffusion` block (defaults shown):

```yaml
diffusion:
  noise_schedule: linear  # or "cosine"
  min_mask_ratio: 0.1
  max_mask_ratio: 0.9
  num_diffusion_steps: 128
  eps: 1e-3
  importance_weighting: true

  # Mask token (training auto-adds if missing, avoid pad/eos)
  mask_token_str: "<|diffusion_mask|>"
  # Or use an existing special token id (e.g., 128002 for Llama-3.x)
  # mask_token_id: 128002

  # Sample generation during training (optional)
  generate_samples: true
  generation_interval: 100
  num_generation_samples: 3
  generation_steps: 128
  generation_temperature: 0.0
  generation_max_length: 100
```

## Supported Models

Any models that support 4D attention masks should work out of the box. If not, please
create an [issue](https://github.com/axolotl-ai-cloud/axolotl/issues) or open a
[PR](https://github.com/axolotl-ai-cloud/axolotl/compare)!

## How It Works

### Random Masking

During training, tokens are randomly masked:
- Sample timestep `t` uniformly from [0, 1]
- Calculate masking probability: `p = (1 - eps) * t + eps`
- Randomly mask tokens with probability `p` (see the sketch below)
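A minimal sketch of that forward-masking step (illustrative names only, using the default `eps`; this is not the trainer's exact code):

```python
import torch

def forward_mask(input_ids: torch.Tensor, mask_token_id: int, eps: float = 1e-3):
    """Mask each token with probability p = (1 - eps) * t + eps, t ~ U[0, 1] per sample."""
    batch_size, seq_len = input_ids.shape
    t = torch.rand(batch_size)                          # one timestep per sample
    p_mask = ((1 - eps) * t + eps).unsqueeze(1)         # [batch_size, 1]
    masked = torch.rand(batch_size, seq_len) < p_mask   # broadcast to [batch, seq]
    noisy = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    return noisy, masked, p_mask.expand(batch_size, seq_len)

# toy usage with a fake vocab and mask id
noisy, masked, p = forward_mask(torch.randint(0, 100, (2, 8)), mask_token_id=99)
```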
### Diffusion Loss

Loss is computed only on masked tokens with (optional) importance weighting:

```python
loss = sum(cross_entropy(pred, target) / p_mask) / total_tokens
```
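Expanded slightly, the same estimator in PyTorch might look like the following sketch (tensor names are made up for illustration; see `trainer.py` in this plugin for the actual implementation):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(masked_logits, masked_targets, p_mask, total_tokens):
    # per-token cross-entropy on masked positions only, importance-weighted by 1 / p_mask
    token_loss = F.cross_entropy(masked_logits.float(), masked_targets, reduction="none")
    return (token_loss / p_mask).sum() / total_tokens

# toy shapes: 5 masked positions, vocabulary of 11, all masked with p = 0.5
loss = diffusion_loss(
    torch.randn(5, 11), torch.randint(0, 11, (5,)), torch.full((5,), 0.5), total_tokens=16
)
```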
## Sample Generation

When `diffusion.generate_samples: true`, the plugin generates samples during training:

```
Sample 1:
Original (45 tokens): The quick brown fox jumps over the lazy dog...
Masked (18/45 tokens, 40.0%): The [MASK] [MASK] fox [MASK] over [MASK] lazy [MASK]...
Generated: The quick brown fox jumps over the lazy dog...
```

Samples are logged to console and wandb (if enabled).

## Inference

Diffusion inference is integrated into the standard Axolotl CLI. Use the same config
you trained with and run:

```
axolotl inference path/to/your-config.yaml
```

Optionally, pass `--gradio` to use a simple web interface.

Interactive controls (prefix the prompt with commands):
- `:complete N` → completion mode with N new masked tokens appended (default 64)
- `:mask R` → random masking mode with target mask ratio R in [0.0, 1.0]

Example session:

```
================================================================================
Commands:
  :complete N -> completion mode with N tokens (default 64)
  :mask R -> random masking with ratio R (0.0–1.0)
================================================================================
Give me an instruction (Ctrl + D to submit):

:mask 0.4 The quick brown fox jumps over the lazy dog

Masked (40.0%):
The [MASK] brown [MASK] jumps over the [MASK] dog

Generated:
The quick brown fox jumps over the loud dog
```

## Metrics and Monitoring

The plugin adds (or modifies) several metrics to track diffusion training:

- `train/loss`: Weighted diffusion loss
- `train/accuracy`: Accuracy on masked tokens
- `train/mask_ratio`: Average fraction of tokens masked
- `train/num_masked_tokens`: Number of tokens masked
- `train/avg_p_mask`: Average masking probability
- `train/ce_loss`: Unweighted cross-entropy loss
- `train/importance_weight_avg`: Average importance weight

## Limitations

- No flash attention support
- No RL training support

## References

- [LLaDA Paper](https://arxiv.org/abs/2404.10406)
- [Axolotl Documentation](https://docs.axolotl.ai/)
- [API reference for plugin](https://docs.axolotl.ai/docs/api/integrations.diffusion.args.html#axolotl.integrations.diffusion.args)
19 src/axolotl/integrations/diffusion/__init__.py (new file)
@@ -0,0 +1,19 @@
"""Diffusion LM training plugin init."""

from .args import DiffusionArgs, DiffusionConfig
from .callbacks import DiffusionGenerationCallback
from .generation import generate
from .plugin import DiffusionPlugin
from .trainer import DiffusionTrainer
from .utils import create_bidirectional_attention_mask, resolve_mask_token_id

__all__ = [
    "DiffusionArgs",
    "DiffusionPlugin",
    "DiffusionTrainer",
    "generate",
    "resolve_mask_token_id",
    "create_bidirectional_attention_mask",
    "DiffusionGenerationCallback",
    "DiffusionConfig",
]
95 src/axolotl/integrations/diffusion/args.py (new file)
@@ -0,0 +1,95 @@
"""Config args for diffusion LM training (nested under `diffusion:`)."""

from __future__ import annotations

from typing import Literal

from pydantic import BaseModel, Field, model_validator


class DiffusionConfig(BaseModel):
    """Nested diffusion configuration available under the `diffusion` key."""

    # Noise schedule config
    noise_schedule: Literal["linear", "cosine"] = Field(
        default="linear", description="Type of noise schedule for diffusion training"
    )
    min_mask_ratio: float = Field(
        default=0.1,
        ge=0.0,
        le=1.0,
        description="Minimum masking ratio for diffusion noise schedule",
    )
    max_mask_ratio: float = Field(
        default=0.9,
        ge=0.0,
        le=1.0,
        description="Maximum masking ratio for diffusion noise schedule",
    )
    num_diffusion_steps: int = Field(
        default=128, ge=1, description="Number of diffusion timesteps"
    )
    eps: float = Field(
        default=1e-3,
        ge=0.0,
        le=1.0,
        description="Epsilon value for minimum masking probability in forward process",
    )

    # Training config
    importance_weighting: bool = Field(
        default=True,
        description="Apply importance weighting to loss based on masking probability",
    )
    mask_token_id: int | None = Field(
        default=None,
        description=(
            "Token ID to use for masking. Unset by default; can use one of the "
            "tokenizer's special tokens here."
        ),
    )
    mask_token_str: str | None = Field(
        default=None,
        description=(
            "Token string to use as a mask. If `mask_token_id` is invalid or unset, "
            "this token will be ensured to exist as an additional special token and "
            "used. If absent, a default '<|diffusion_mask|>' will be added."
        ),
    )

    # Sample generation config
    generate_samples: bool = Field(
        default=True, description="Enable sample generation during training"
    )
    generation_interval: int = Field(
        default=100, ge=1, description="Generate samples every N steps"
    )
    num_generation_samples: int = Field(
        default=3, ge=1, description="Number of samples to generate each time"
    )
    generation_steps: int = Field(
        default=128, ge=1, description="Number of diffusion steps for generation"
    )
    generation_temperature: float = Field(
        default=0.0,
        ge=0.0,
        description="Temperature for generation sampling (0.0 = deterministic)",
    )
    generation_max_length: int = Field(
        default=100, ge=1, description="Maximum sequence length for generation"
    )

    @model_validator(mode="after")
    def _validate_mask_ratios(self) -> "DiffusionConfig":
        if self.min_mask_ratio > self.max_mask_ratio:
            raise ValueError("min_mask_ratio must be ≤ max_mask_ratio")
        return self


class DiffusionArgs(BaseModel):
    """Plugin entry that exposes the nested `diffusion` block to the core config."""

    diffusion: DiffusionConfig = Field(
        default_factory=DiffusionConfig,
        description="Diffusion training configuration. Only nested block is supported.",
    )
174 src/axolotl/integrations/diffusion/callbacks.py (new file)
@@ -0,0 +1,174 @@
"""Callbacks for diffusion training."""

import logging
import sys

import wandb
from colorama import Fore, Style
from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
from transformers.training_args import TrainingArguments

from .generation import generate_samples

# Simpler logger for more readable sample generation
logger = logging.getLogger(__name__)
if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    logger.setLevel(logging.INFO)


class DiffusionGenerationCallback(TrainerCallback):
    """Callback for generating samples during diffusion training."""

    def __init__(self, trainer):
        self.trainer = trainer

    def on_step_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        """Generate samples at specified intervals."""
        if (
            state.global_step > 0
            and state.global_step % self.trainer.cfg.diffusion.generation_interval == 0
        ):
            if not self.trainer.state.is_world_process_zero:
                return

            # Use eval dataloader if available, otherwise use train dataloader
            dataloader = None
            try:
                if getattr(self.trainer, "eval_dataset", None) is not None:
                    dataloader = self.trainer.get_eval_dataloader()
            except Exception:
                dataloader = None
            if dataloader is None:
                dataloader = self.trainer.get_train_dataloader()

            # Generate samples
            diffusion_cfg = self.trainer.cfg.diffusion
            samples = generate_samples(
                model=self.trainer.model,
                tokenizer=self.trainer.processing_class,
                dataloader=dataloader,
                num_generation_samples=diffusion_cfg.num_generation_samples,
                max_length=diffusion_cfg.generation_max_length,
                num_diffusion_steps=diffusion_cfg.generation_steps,
                temperature=diffusion_cfg.generation_temperature,
                mask_token_id=diffusion_cfg.mask_token_id,
            )

            # Log samples
            self._log_samples(samples, state.global_step)

    def _log_samples(self, samples: list, step: int):
        """Log generated samples."""
        if not samples:
            return

        logger.info("=" * 60)
        logger.info("GENERATED SAMPLES")
        logger.info("=" * 60)

        for i, sample_data in enumerate(samples, 1):
            original = sample_data["original"]
            masked = sample_data["masked"]
            generated = sample_data["generated"]
            mask_ratio = sample_data["mask_ratio"]
            masked_tokens = sample_data["masked_tokens"]
            total_tokens = sample_data["total_tokens"]

            logger.info(f"\nSample {i}:")
            logger.info(f"\tOriginal ({total_tokens} tokens): {original}")
            logger.info(
                f"\tMasked ({masked_tokens}/{total_tokens} tokens, "
                f"{mask_ratio:.1%}): {masked}"
            )

            try:
                gen_ids = sample_data.get("generated_ids")
                orig_ids = sample_data.get("orig_ids")
                masked_positions = set(sample_data.get("masked_positions") or [])
                if isinstance(gen_ids, list) and isinstance(orig_ids, list):
                    styles: list[str] = []
                    for i, tid in enumerate(gen_ids):
                        if i in masked_positions:
                            if i < len(orig_ids) and tid == orig_ids[i]:
                                styles.append("green")
                            elif i < len(orig_ids):
                                styles.append("red")
                            else:
                                styles.append("normal")
                        else:
                            same = i < len(orig_ids) and tid == orig_ids[i]
                            styles.append("dim" if same else "normal")

                    spans: list[tuple[str, int, int]] = []
                    if gen_ids:
                        cur = styles[0]
                        start = 0
                        for i in range(1, len(gen_ids)):
                            s = styles[i]
                            if s != cur:
                                spans.append((cur, start, i))
                                cur, start = s, i
                        spans.append((cur, start, len(gen_ids)))

                    parts = []
                    for style_name, a, b in spans:
                        chunk_text = self.trainer.processing_class.decode(
                            gen_ids[a:b], skip_special_tokens=False
                        )
                        if style_name == "green":
                            parts.append(Fore.GREEN + chunk_text + Style.RESET_ALL)
                        elif style_name == "red":
                            parts.append(Fore.RED + chunk_text + Style.RESET_ALL)
                        else:
                            if style_name == "dim":
                                parts.append(Style.DIM + chunk_text + Style.RESET_ALL)
                            else:
                                parts.append(chunk_text)
                    logger.info("\tGenerated:\n%s", "".join(parts))
                else:
                    logger.info(f"\tGenerated: {generated}")
            except Exception:
                logger.info(f"\tGenerated: {generated}")

        logger.info("=" * 60)

        if self.trainer.cfg.use_wandb:
            if wandb.run is not None:
                wandb.log(
                    {
                        "generated_samples": wandb.Table(
                            columns=[
                                "step",
                                "original",
                                "masked",
                                "generated",
                                "mask_ratio",
                                "masked_tokens",
                                "total_tokens",
                            ],
                            data=[
                                [
                                    step,
                                    sample["original"],
                                    sample["masked"],
                                    sample["generated"],
                                    f"{sample['mask_ratio']:.1%}",
                                    sample["masked_tokens"],
                                    sample["total_tokens"],
                                ]
                                for sample in samples
                            ],
                        )
                    },
                    step=step,
                )
409 src/axolotl/integrations/diffusion/generation.py (new file)
@@ -0,0 +1,409 @@
"""Sample generation utilities for diffusion training."""

import re
from typing import Any, List, Literal, Optional

import torch

from axolotl.utils.logging import get_logger

from .utils import create_bidirectional_attention_mask

LOG = get_logger(__name__)


def generate_samples(
    model: torch.nn.Module,
    tokenizer: Any,
    dataloader: Optional[Any] = None,
    num_generation_samples: int = 3,
    max_length: int = 100,
    num_diffusion_steps: int = 128,
    temperature: float = 0.0,
    mask_token_id: int = 32000,
    mode: Literal["random", "completion"] = "random",
    completion_tokens: int = 0,
    target_mask_ratio: Optional[float] = None,
) -> List[dict]:
    """
    Generate text samples using the diffusion model by randomly masking sequences from
    the given dataset and running the reverse diffusion process.

    Args:
        model: The wrapped or unwrapped model
        tokenizer: Tokenizer for encoding/decoding
        dataloader: Validation dataloader (for sampling sequences)
        num_generation_samples: Number of samples to generate
        max_length: Maximum length of sequences to use
        num_diffusion_steps: Number of diffusion steps for generation
        temperature: Temperature for sampling (0.0 = deterministic)
        mask_token_id: Token ID used for masking

    Returns:
        List of dictionaries with original text, masked text, and generated text
    """
    if dataloader is None:
        LOG.warning("No validation dataloader provided, cannot generate samples")
        return []

    unwrapped_model = model.module if hasattr(model, "module") else model
    training = unwrapped_model.training
    unwrapped_model.eval()

    # Resolve device robustly (some modules don't expose `.device`)
    device = getattr(unwrapped_model, "device", None)
    if device is None:
        try:
            device = next(unwrapped_model.parameters()).device
        except StopIteration:
            device = torch.device("cpu")
    generations = []

    # Sample sequences from validation dataset
    sampled_sequences = _sample_sequences_from_dataloader(
        dataloader, num_generation_samples, max_length, device
    )
    LOG.info(f"Sampled {len(sampled_sequences)} sequences from validation dataset")

    # Generate samples using reverse diffusion process
    with torch.no_grad():
        for sample in sampled_sequences:
            if isinstance(sample, dict):
                original_sequence = sample.get("input_ids")
                labels_seq = sample.get("labels")
                attn_seq = sample.get("attention_mask")
            else:
                original_sequence = sample
                labels_seq = None
                attn_seq = None
            generation_result = generate(
                unwrapped_model,
                tokenizer,
                original_sequence,
                num_diffusion_steps,
                temperature,
                mask_token_id,
                mode=mode,
                completion_tokens=completion_tokens,
                target_mask_ratio=target_mask_ratio,
                labels=labels_seq,
                attention_mask=attn_seq,
            )
            generations.append(generation_result)

    # Restore prior training state
    if training:
        unwrapped_model.train()
    else:
        unwrapped_model.eval()

    return generations


def _sample_sequences_from_dataloader(
    dataloader: Any, num_samples: int, max_length: int, device: torch.device
) -> List[Any]:
    """Sample sequences from validation dataloader."""
    sampled_sequences: list[dict[str, torch.Tensor] | torch.Tensor] = []
    sample_count = 0

    # Skip a random number of batches (we could be more clever about this)
    skip_batches = torch.randint(0, 10, (1,)).item()
    batch_count = 0

    for batch in dataloader:
        # Skip some batches for variety
        if batch_count < skip_batches:
            batch_count += 1
            continue

        if sample_count >= num_samples:
            break

        batch_count += 1
        input_ids = batch["input_ids"]
        attention_mask = batch.get("attention_mask")
        labels = batch.get("labels")

        # Randomly sample from sequences in this batch
        batch_indices = torch.randperm(input_ids.size(0)).tolist()

        for i in batch_indices:
            if sample_count >= num_samples:
                break

            # Get actual sequence length (non-padded)
            if attention_mask is not None:
                seq_len = attention_mask[i].sum().item()
            else:
                seq_len = input_ids.size(1)

            if seq_len < 10:
                continue

            # Determine truncation length
            max_total = min(seq_len, max_length)
            if labels is not None:
                labels_i = labels[i][:seq_len]
                answer_mask = labels_i != -100
                if not answer_mask.any():
                    # No answer tokens; skip for SFT masking
                    continue
                first_ans_idx = int(
                    torch.nonzero(answer_mask, as_tuple=False)[0].item()
                )
                prompt_len = first_ans_idx
                if prompt_len >= max_total:
                    # Prompt alone reaches cap; cannot include any answer
                    continue
                remaining_answer = int(answer_mask[prompt_len:].sum().item())
                allowed_answer = max_total - prompt_len
                take_answer = min(remaining_answer, allowed_answer)
                if take_answer <= 0:
                    continue
                actual_length = prompt_len + take_answer
            else:
                actual_length = max_total

            # Extract the (possibly truncated) sequence
            sequence = input_ids[i][:actual_length].unsqueeze(0).to(device)
            attn_seq = (
                attention_mask[i][:actual_length].unsqueeze(0).to(device)
                if attention_mask is not None
                else None
            )
            if labels is not None:
                labels_seq = labels[i][:actual_length].unsqueeze(0).to(device)
                sampled_sequences.append(
                    {
                        "input_ids": sequence,
                        "labels": labels_seq,
                        "attention_mask": attn_seq,
                    }
                )
            else:
                if attn_seq is not None:
                    sampled_sequences.append(
                        {"input_ids": sequence, "attention_mask": attn_seq}
                    )
                else:
                    sampled_sequences.append(sequence)
            sample_count += 1

    return sampled_sequences


def generate(
    model: torch.nn.Module,
    tokenizer: Any,
    original_sequence: torch.Tensor,
    num_diffusion_steps: int,
    temperature: float,
    mask_token_id: int,
    *,
    mode: Literal["random", "completion"] = "random",
    completion_tokens: int = 0,
    target_mask_ratio: Optional[float] = None,
    labels: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
) -> dict:
    """Generate a single sample using reverse diffusion."""
    # Get original text for comparison
    original_text = tokenizer.decode(
        original_sequence[0].cpu(), skip_special_tokens=True
    )

    # Build masked sequence
    if (
        labels is not None
        and labels.numel() > 0
        and (labels == -100).any()
        and (labels != -100).any()
    ):
        # SFT case: completely mask all answer tokens (labels != -100)
        total_tokens = original_sequence.size(1)
        masked_indices = (labels != -100).to(dtype=torch.bool)
        masked_sequence = original_sequence.clone()
        masked_sequence[masked_indices] = mask_token_id
        masked_tokens = int(masked_indices.sum().item())
        mask_ratio = masked_tokens / max(int(total_tokens), 1)
    elif mode == "completion" and completion_tokens > 0:
        # Append mask tokens to the right for completion
        total_tokens = original_sequence.size(1) + int(completion_tokens)
        masked_indices = torch.zeros(
            1, total_tokens, dtype=torch.bool, device=original_sequence.device
        )
        masked_indices[0, -int(completion_tokens) :] = True

        append = torch.full(
            (1, int(completion_tokens)), mask_token_id, device=original_sequence.device
        )
        masked_sequence = torch.cat([original_sequence, append], dim=1)
        masked_tokens = int(completion_tokens)
        mask_ratio = masked_tokens / total_tokens
    else:
        # Apply random masking with optional fixed ratio
        total_tokens = original_sequence.size(1)
        if target_mask_ratio is None:
            min_ratio, max_ratio = 0.1, 0.7
            target_mask_ratio = (
                torch.rand(1).item() * (max_ratio - min_ratio) + min_ratio
            )
        target_masked_tokens = max(1, int(total_tokens * float(target_mask_ratio)))

        # Create random mask indices
        mask_positions = torch.randperm(total_tokens)[:target_masked_tokens]
        masked_indices = torch.zeros(
            1, total_tokens, dtype=torch.bool, device=original_sequence.device
        )
        masked_indices[0, mask_positions] = True

        # Create masked sequence
        masked_sequence = original_sequence.clone()
        masked_sequence[masked_indices] = mask_token_id

        # Calculate actual mask ratio
        masked_tokens = masked_indices.sum().item()
        mask_ratio = masked_tokens / total_tokens

    # Get masked text for comparison
    masked_text = tokenizer.decode(masked_sequence[0].cpu(), skip_special_tokens=False)
    masked_text = _clean_masked_text(masked_text, tokenizer, mask_token_id)

    # Run reverse diffusion process
    sequence = masked_sequence.clone()
    attention_mask = create_bidirectional_attention_mask(
        sequence, attention_mask, sample_packing=attention_mask is not None
    )
    for step in range(num_diffusion_steps):
        sequence = _diffusion_step(
            model,
            sequence,
            step,
            num_diffusion_steps,
            temperature,
            mask_token_id,
            attention_mask,
        )
    generated_text = tokenizer.decode(sequence[0].cpu(), skip_special_tokens=True)

    # Collect diagnostic info
    final_ids = sequence[0].detach().cpu().tolist()
    orig_ids_for_render = original_sequence[0].detach().cpu().tolist()
    if masked_indices is not None:
        masked_positions = (
            torch.where(masked_indices[0])[0].detach().cpu().tolist()
            if masked_indices.ndim == 2
            else []
        )
    else:
        masked_positions = []

    result = {
        "original": original_text,
        "masked": masked_text,
        "generated": generated_text,
        "mask_ratio": mask_ratio,
        "masked_tokens": masked_tokens,
        "total_tokens": total_tokens,
        "generated_ids": final_ids,
        "masked_positions": masked_positions,
        "orig_ids": orig_ids_for_render,
        "formatted": (
            f"Original: '{original_text}' → Masked: '{masked_text}' "
            f"({mask_ratio:.1%}) → Generated: '{generated_text}'"
        ),
    }

    return result


def _clean_masked_text(masked_text: str, tokenizer: Any, mask_token_id: int) -> str:
    """Clean up masked text for display."""
    mask_token_repr = tokenizer.decode([mask_token_id], skip_special_tokens=False)
    cleaned = masked_text.replace(mask_token_repr, "[MASK]")

    # Remove literal special token strings
    if hasattr(tokenizer, "special_tokens_map"):
        for token_value in tokenizer.special_tokens_map.values():
            if token_value and isinstance(token_value, str):
                cleaned = cleaned.replace(token_value, "")

    # Normalize whitespace but preserve newlines
    cleaned = cleaned.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = re.sub(r"[ \t]+", " ", cleaned)
    cleaned = "\n".join(line.rstrip() for line in cleaned.split("\n")).strip()
    return cleaned


def _diffusion_step(
    model: torch.nn.Module,
    sequence: torch.Tensor,
    step: int,
    num_diffusion_steps: int,
    temperature: float,
    mask_token_id: int,
    attention_mask: torch.Tensor | None = None,
) -> torch.Tensor:
    """Perform a single diffusion step with remasking."""
    # Only process if there are masked tokens remaining
    current_mask = sequence == mask_token_id
    if not current_mask.any():
        return sequence

    # Create or use provided attention mask
    if attention_mask is None:
        batch_size, seq_len = sequence.shape
        attention_mask = torch.ones(
            batch_size, 1, seq_len, seq_len, dtype=torch.bool, device=sequence.device
        )

    # Forward pass
    outputs = model(input_ids=sequence, attention_mask=attention_mask)
    logits = outputs.logits

    # Only sample at currently masked positions
    if current_mask.any():
        masked_logits = logits[current_mask]

        # Apply temperature scaling
        if temperature > 0:
            scaled_logits = masked_logits / temperature
        else:
            scaled_logits = masked_logits

        # Suppress mask token in outputs
        scaled_logits[:, mask_token_id] = -float("inf")

        if temperature > 0:
            # Add Gumbel noise for sampling
            gumbel_noise = -torch.log(
                -torch.log(torch.rand_like(scaled_logits, dtype=torch.float32))
            )
            gumbel_logits = scaled_logits + gumbel_noise
            predicted_tokens = torch.argmax(gumbel_logits, dim=-1)
        else:
            predicted_tokens = torch.argmax(scaled_logits, dim=-1)

        # Calculate probabilities for confidence scoring
        probs = torch.softmax(scaled_logits, dim=-1)
        predicted_token_probs = probs[range(len(predicted_tokens)), predicted_tokens]

        # Determine how many tokens to unmask this step
        remaining_masked = current_mask.sum().item()
        if step == num_diffusion_steps - 1:
            num_to_unmask = remaining_masked
        else:
            unmask_ratio = 1.0 / (num_diffusion_steps - step)
            num_to_unmask = max(1, int(remaining_masked * unmask_ratio))

        # Select highest confidence predictions to unmask
        if num_to_unmask >= remaining_masked:
            sequence[current_mask] = predicted_tokens
        else:
            _, top_indices = predicted_token_probs.topk(num_to_unmask)
            mask_positions = torch.where(current_mask)[1]
            positions_to_unmask = mask_positions[top_indices]
            sequence[0, positions_to_unmask] = predicted_tokens[top_indices]

    return sequence
41 src/axolotl/integrations/diffusion/plugin.py (new file)
@@ -0,0 +1,41 @@
"""Diffusion LM training plugin for Axolotl."""

from peft import PeftModel
from transformers import PreTrainedModel

from axolotl.integrations.base import BasePlugin
from axolotl.utils.dict import DictDefault
from axolotl.utils.logging import get_logger

from .trainer import DiffusionTrainer

LOG = get_logger(__name__)


class DiffusionPlugin(BasePlugin):
    """
    Plugin for diffusion language model training.

    This plugin enables diffusion-based training using the LLaDA approach, which uses
    random masking and bidirectional attention to train language models.
    """

    def __init__(self):
        super().__init__()
        self.cfg = None

    def get_input_args(self) -> str:
        """Returns the pydantic model for LLaDA plugin arguments."""
        return "axolotl.integrations.diffusion.DiffusionArgs"

    def post_model_load(self, cfg: DictDefault, model: PreTrainedModel | PeftModel):
        """Perform actions after model is loaded."""
        self.cfg = cfg

    def get_trainer_cls(self, cfg: DictDefault) -> type[DiffusionTrainer] | None:
        """Return custom trainer class for diffusion training."""
        return DiffusionTrainer

    def post_trainer_create(self, cfg: DictDefault, trainer: DiffusionTrainer):
        """Configure trainer after creation."""
        trainer.set_config(cfg)
301 src/axolotl/integrations/diffusion/trainer.py (new file)
@@ -0,0 +1,301 @@
"""Custom trainer for diffusion LM training."""

from typing import Any, Literal

import torch
import torch.nn.functional as F
from torch import nn

from axolotl.core.trainers.base import AxolotlTrainer
from axolotl.utils.dict import DictDefault
from axolotl.utils.logging import get_logger

from .callbacks import DiffusionGenerationCallback
from .utils import create_bidirectional_attention_mask

LOG = get_logger(__name__)


class DiffusionTrainer(AxolotlTrainer):
    """Custom trainer for diffusion LM training that overrides loss computation."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cfg = None
        self._special_token_ids = None

    def set_config(self, config: DictDefault):
        """Set config for diffusion training."""
        self.cfg = config
        self._cache_special_token_ids()
        self._resolve_mask_token_id()

        token_id = int(getattr(self.cfg.diffusion, "mask_token_id", 0))
        LOG.info(f"Diffusion: using mask_token_id={token_id}")

        if getattr(config.diffusion, "generate_samples", True):
            generation_callback = DiffusionGenerationCallback(self)
            self.add_callback(generation_callback)

    def _resolve_mask_token_id(self) -> None:
        """Ensure mask_token_id is valid for the current tokenizer."""
        from .utils import resolve_mask_token_id

        tokenizer = getattr(self, "processing_class", None)
        if tokenizer is None:
            return

        mid = resolve_mask_token_id(
            tokenizer,
            self.cfg,
            allow_add=True,
            model=getattr(self, "model", None),
        )
        try:
            self.cfg.diffusion.mask_token_id = int(mid)
        except Exception:
            pass

    def compute_loss(
        self,
        model: nn.Module,
        inputs: dict[str, torch.Tensor],
        return_outputs: bool = False,
        num_items_in_batch: torch.Tensor | None = None,
    ) -> torch.Tensor | tuple[torch.Tensor, dict[str, torch.Tensor]]:
        """Override compute_loss to use diffusion loss."""
        input_ids = inputs.get("input_ids")
        attention_mask = inputs.get("attention_mask")
        labels = inputs.get("labels")

        if input_ids is None:
            raise ValueError("input_ids is required for diffusion training")

        loss, outputs = self._compute_diffusion_loss(
            model, input_ids, attention_mask, labels
        )

        if return_outputs:
            return loss, outputs
        return loss

    def _cache_special_token_ids(self):
        """Cache special token IDs to avoid repeated tokenizer access."""
        if self.processing_class is None:
            self._special_token_ids = set()
            return

        tokenizer = self.processing_class
        special_tokens = set()

        if hasattr(tokenizer, "bos_token_id") and tokenizer.bos_token_id is not None:
            special_tokens.add(tokenizer.bos_token_id)
        if hasattr(tokenizer, "eos_token_id") and tokenizer.eos_token_id is not None:
            special_tokens.add(tokenizer.eos_token_id)
        if hasattr(tokenizer, "pad_token_id") and tokenizer.pad_token_id is not None:
            special_tokens.add(tokenizer.pad_token_id)

        self._special_token_ids = special_tokens

    def _forward_process(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor | None = None,
        labels: torch.Tensor | None = None,
        eps: float = 1e-3,
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Forward noising process. A timestep is sampled along the process, and tokens are
        masked with probability determined by the configured noise schedule.

        Args:
            input_ids: Input token ids [batch_size, seq_len].
            attention_mask: Attention mask [batch_size, seq_len].
            labels: Labels for SFT training [batch_size, seq_len].
            eps: Small epsilon value for minimum masking probability.

        Returns:
            noisy_batch: Input with some tokens masked.
            masked_indices: Boolean mask indicating which tokens were masked.
            p_mask: Masking probabilities for each token [batch_size, seq_len].
        """
        batch_size, seq_len = input_ids.shape
        device = input_ids.device

        # Sample random timesteps for each sample in batch
        t = torch.rand(batch_size, device=device)
        p_mask = (1 - eps) * t + eps  # [batch_size]
        p_mask = p_mask[:, None].repeat(1, seq_len)  # [batch_size, seq_len]

        # Don't mask padding tokens if attention_mask is provided
        if attention_mask is not None:
            valid_mask = attention_mask.bool()
            p_mask = p_mask * valid_mask.float()

        # Create mask to exclude special tokens
        special_token_mask = torch.zeros_like(input_ids, dtype=torch.bool)
        if self._special_token_ids:
            for token_id in self._special_token_ids:
                special_token_mask |= input_ids == token_id

        # Create random mask based on p_mask
        masked_indices = torch.rand((batch_size, seq_len), device=device) < p_mask
        masked_indices = masked_indices & ~special_token_mask
        if attention_mask is not None:
            masked_indices = masked_indices & attention_mask.bool()

        # For SFT data, only mask answer tokens
        if labels is not None:
            answer_mask = labels != -100
            masked_indices = masked_indices & answer_mask

        # Create masked input
        mask_token_id = int(self.cfg.diffusion.mask_token_id)
        mask_value = torch.full_like(input_ids, mask_token_id)
        noisy_batch = torch.where(masked_indices, mask_value, input_ids)

        return noisy_batch, masked_indices, p_mask

    def _compute_diffusion_loss(
        self,
        model: nn.Module,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor | None = None,
        labels: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | Any]:
        """
        Compute diffusion loss.

        Args:
            model: The model to compute loss for.
            input_ids: Ground truth token ids [batch_size, seq_len].
            attention_mask: Attention mask [batch_size, seq_len].
            labels: Labels for SFT training [batch_size, seq_len].

        Returns:
            loss: Cross-entropy loss.
            metrics: Dictionary of metrics.
        """
        # Short-circuit empty sequences
        if input_ids is None or input_ids.numel() == 0 or input_ids.shape[1] == 0:
            zero = torch.tensor(
                0.0,
                device=(input_ids.device if input_ids is not None else None),
                requires_grad=True,
            )
            return zero, {}

        # If an attention_mask is provided and all positions are padding for every
        # sample in this batch, skip the step.
        if attention_mask is not None:
            if attention_mask.dim() == 2 and (attention_mask.sum(dim=1) == 0).all():
                zero = torch.tensor(0.0, device=input_ids.device, requires_grad=True)
                return zero, {}

        # Apply forward process
        noisy_batch, masked_indices, p_mask = self._forward_process(
            input_ids, attention_mask, labels, self.cfg.diffusion.eps
        )

        # Create bidirectional attention mask
        bidirectional_mask = create_bidirectional_attention_mask(
            input_ids, attention_mask, sample_packing=self.cfg.sample_packing
        )

        # Forward pass
        outputs = model(
            input_ids=noisy_batch.long(),
            attention_mask=bidirectional_mask,
        )
        logits = outputs.logits

        if masked_indices.sum() > 0:
            valid_indices = torch.where(masked_indices)
            batch_indices, seq_indices = valid_indices

            masked_logits = logits[batch_indices, seq_indices]
            masked_targets = input_ids[batch_indices, seq_indices]
            masked_p_mask = p_mask[batch_indices, seq_indices]

            # Compute cross-entropy loss without reduction
            token_loss = F.cross_entropy(
                masked_logits.float(), masked_targets, reduction="none"
            )

            if self.cfg.diffusion.importance_weighting:
                masked_p_mask = masked_p_mask.float()
                weighted_loss = token_loss / masked_p_mask
            else:
                weighted_loss = token_loss

            if labels is not None:
                # For SFT data: normalize by answer token count per sample
                answer_mask = labels != -100
                answer_lengths = answer_mask.sum(dim=1).float()  # [batch_size]

                # Get batch indices for masked tokens
                masked_batch_indices = batch_indices

                # Sum losses per sample and divide by answer length
                batch_size = input_ids.shape[0]
                loss_per_sample = torch.zeros(batch_size, device=input_ids.device)
                for i in range(batch_size):
                    sample_mask = masked_batch_indices == i
                    if sample_mask.sum() > 0:
                        sample_loss = weighted_loss[sample_mask].sum()
                        denom = answer_lengths[i].clamp(min=1.0)
                        loss_per_sample[i] = sample_loss / denom

                loss = loss_per_sample.mean()
            else:
                # Non-SFT: when importance weighting is enabled, use unbiased estimator
                # (sum(loss/p) / total_tokens). Otherwise, average over masked tokens
                # for stable scaling across varying mask ratios.
                if self.cfg.diffusion.importance_weighting:
                    loss = weighted_loss.sum() / (
                        input_ids.shape[0] * input_ids.shape[1]
                    )
                else:
                    loss = weighted_loss.mean()

            ce_loss = token_loss.mean()

            # Compute accuracy on masked tokens
            with torch.no_grad():
                pred_tokens = masked_logits.argmax(dim=-1)
                accuracy = (pred_tokens == masked_targets).float().mean()
        else:
            loss = torch.tensor(0.0, device=input_ids.device, requires_grad=True)
            accuracy = torch.tensor(0.0, device=input_ids.device)
            ce_loss = torch.tensor(0.0, device=input_ids.device)
            masked_p_mask = torch.tensor(1.0, device=input_ids.device)

        avg_p_mask = (
            p_mask[masked_indices].mean().item() if masked_indices.any() else 0.0
        )
        metrics = {
            "loss": loss.item(),
            "accuracy": accuracy.item(),
            "mask_ratio": masked_indices.float().mean().item(),
            "num_masked_tokens": (masked_indices.sum().item(), "sum"),
            "avg_p_mask": avg_p_mask,
            "ce_loss": ce_loss.item(),
        }

        # If doing SFT training, log answer-specific metrics
        if self.cfg.datasets is not None:
            with torch.no_grad():
                answer_mask = labels != -100
                answer_lengths = answer_mask.sum(dim=1).float()  # type: ignore
                total_answer_tokens = answer_mask.sum().item()  # type: ignore
                total_tokens = labels.numel()  # type: ignore
                metrics["answer_ratio"] = total_answer_tokens / max(total_tokens, 1)
                metrics["avg_answer_length"] = answer_lengths.mean().item()

        if self.cfg.diffusion.importance_weighting:
            metrics["importance_weight_avg"] = (1.0 / masked_p_mask).mean().item()

        train_eval: Literal["train", "eval"] = "train" if model.training else "eval"
        self.store_metrics(metrics, train_eval=train_eval)

        return loss, outputs
159
src/axolotl/integrations/diffusion/utils.py
Normal file
@@ -0,0 +1,159 @@
"""Shared utilities for diffusion integration."""

from __future__ import annotations

from typing import Any, Optional

import torch

from axolotl.utils.dict import DictDefault


def resolve_mask_token_id(
tokenizer: Any,
cfg: DictDefault,
*,
allow_add: bool,
model: Any | None = None,
default_token: str = "<|diffusion_mask|>",
) -> int:
"""Resolve mask token id. Training may add a new special token; inference won't."""
# Determine vocab size if available
vocab_size = None
if tokenizer is not None:
if hasattr(tokenizer, "vocab_size") and tokenizer.vocab_size is not None:
try:
vocab_size = int(tokenizer.vocab_size)  # type: ignore[arg-type]
except Exception:
vocab_size = None
elif hasattr(tokenizer, "__len__"):
try:
vocab_size = int(len(tokenizer))
except Exception:
vocab_size = None

# Use explicit id from config if provided
diffusion_cfg = getattr(cfg, "diffusion", None)
# Fallback to top-level attr names only if nested missing (shouldn't happen)
cfg_id = (
getattr(diffusion_cfg, "mask_token_id", None)
if diffusion_cfg is not None
else getattr(cfg, "diffusion_mask_token_id", None)
)
if isinstance(cfg_id, int) and cfg_id >= 0:
if vocab_size is None or cfg_id < vocab_size:
return int(cfg_id)

def _existing_special_token_id(token_str: str | None) -> int | None:
"""Attempt to resolve an existing special token string to a real ID."""
if not token_str or not hasattr(tokenizer, "convert_tokens_to_ids"):
return None
try:
token_id = tokenizer.convert_tokens_to_ids(token_str)
except Exception:
return None

if not isinstance(token_id, int) or token_id < 0:
return None

# Ensure it's registered as special and not UNK, and within vocab
unk_id = getattr(tokenizer, "unk_token_id", None)
specials = set(getattr(tokenizer, "all_special_tokens", []) or [])
addl = set(getattr(tokenizer, "additional_special_tokens", []) or [])
is_special = token_str in specials or token_str in addl
in_vocab = vocab_size is None or token_id < vocab_size
if (
(unk_id is not None and token_id == unk_id)
or not is_special
or not in_vocab
):
return None
return token_id

# Try mask token string if provided
token_str = (
getattr(diffusion_cfg, "mask_token_str", None)
if diffusion_cfg is not None
else getattr(cfg, "diffusion_mask_token_str", None)
)
for candidate in (token_str, default_token):
token_id = _existing_special_token_id(candidate)
if isinstance(token_id, int):
try:
if diffusion_cfg is None:
cfg.diffusion_mask_token_id = int(token_id)  # legacy fallback
else:
diffusion_cfg.mask_token_id = int(token_id)
except Exception:
pass
return int(token_id)

# Optionally add and return a dedicated special token during training
if allow_add and hasattr(tokenizer, "add_special_tokens"):
token_to_add = token_str or default_token
try:
tokenizer.add_special_tokens({"additional_special_tokens": [token_to_add]})

# Resize embeddings if possible
if (
model is not None
and hasattr(tokenizer, "__len__")
and hasattr(model, "resize_token_embeddings")
):
try:
model.resize_token_embeddings(len(tokenizer))
except Exception:
pass
new_id = tokenizer.convert_tokens_to_ids(token_to_add)
if isinstance(new_id, int) and new_id >= 0:
try:
if diffusion_cfg is None:
cfg.diffusion_mask_token_id = int(new_id)  # legacy fallback
else:
diffusion_cfg.mask_token_id = int(new_id)
except Exception:
pass
return int(new_id)
except Exception:
pass

# Fallback to unk or 0 (do not update cfg)
fallback = getattr(tokenizer, "unk_token_id", 0) or 0
return int(fallback)


def create_bidirectional_attention_mask(
input_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
sample_packing: bool = False,
) -> torch.Tensor:
"""
Create bidirectional attention mask to override default causal masking.
Handles sample-packed sequences where different samples are identified
by different attention mask values.

Args:
input_ids: Input token ids [batch_size, seq_len]
attention_mask: Attention mask [batch_size, seq_len]
sample_packing: Whether sample packing is enabled

Returns:
bidirectional_mask: 4D attention mask [batch_size, 1, seq_len, seq_len]
"""
batch_size, seq_len = input_ids.shape
device = input_ids.device

if attention_mask is None or not sample_packing:
return torch.ones(
batch_size, 1, seq_len, seq_len, dtype=torch.bool, device=device
)

# Handle sample packing: tokens can only attend within their sample
mask_i = attention_mask.unsqueeze(2)  # [batch_size, seq_len, 1]
mask_j = attention_mask.unsqueeze(1)  # [batch_size, 1, seq_len]

# Tokens can attend to each other if they have the same non-zero sample ID
bidirectional_mask = (mask_i == mask_j) & (mask_i > 0)

# Add head dimension: [batch_size, 1, seq_len, seq_len]
return bidirectional_mask.unsqueeze(1)
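Usage sketch for the new helper above (not part of the compare view; the call shape is taken from the utils.py added in this diff, while the example tensors are hypothetical). With sample packing enabled, a position may only attend to positions carrying the same non-zero sample id:

import torch
from axolotl.integrations.diffusion.utils import create_bidirectional_attention_mask

# two packed samples (ids 1 and 2) followed by one padding position (0)
attention_mask = torch.tensor([[1, 1, 1, 2, 2, 0]])
input_ids = torch.zeros_like(attention_mask)
mask = create_bidirectional_attention_mask(
    input_ids, attention_mask=attention_mask, sample_packing=True
)
# mask.shape == (1, 1, 6, 6); block-diagonal per sample, padding attends to nothing
print(mask[0, 0].int())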
@@ -14,6 +14,7 @@ from peft import (
PeftConfig,
PeftMixedModel,
PeftModel,
TaskType,
get_peft_model,
)
from transformers import PreTrainedModel
@@ -98,6 +99,17 @@ def load_lora(
lora_config_kwargs["use_rslora"] = cfg.peft_use_rslora
if cfg.peft_layer_replication:
lora_config_kwargs["layer_replication"] = cfg.peft_layer_replication
if cfg.peft_trainable_token_indices:
lora_config_kwargs["trainable_token_indices"] = cfg.peft_trainable_token_indices

# Determine the correct PEFT task type
model_cls = type(model).__name__
if "SequenceClassification" in model_cls:
task_type = TaskType.SEQ_CLS
elif "TokenClassification" in model_cls:
task_type = TaskType.TOKEN_CLS
else:
task_type = TaskType.CAUSAL_LM

lora_config = LoraConfig(
r=cfg.lora_r,
@@ -110,7 +122,7 @@ def load_lora(
fan_in_fan_out=cfg.lora_fan_in_fan_out,
modules_to_save=cfg.lora_modules_to_save if cfg.lora_modules_to_save else None,
bias="none",
task_type="CAUSAL_LM",
task_type=task_type,
**lora_config_kwargs,
)
@@ -673,6 +673,33 @@ class ModelLoader:

return hf_ds_cfg

def _load_model_from_config(self, model_loader_class=None) -> PreTrainedModel:
"""
Load model with random initialization using from_config.

Uses the selected loader when provided; otherwise falls back to the auto loader.
"""
loader = model_loader_class or self.auto_model_loader
if loader in [AutoModelForCausalLM, AutoModelForVision2Seq]:
model = loader.from_config(
config=self.model_config,
trust_remote_code=self.cfg.trust_remote_code or False,
)
else:
model = loader(config=self.model_config)

return model

def _load_model_from_pretrained(self, model_loader_class=None) -> PreTrainedModel:
"""Load model from pretrained weights."""
loader = model_loader_class or self.auto_model_loader
kwargs = {
"config": self.model_config,
"trust_remote_code": self.cfg.trust_remote_code or False,
**self.model_kwargs,
}
return loader.from_pretrained(self.base_model, **kwargs)

def _build_model(self) -> bool:
"""Load model, with load strategy depending on config."""
skip_move_to_device = False
@@ -687,7 +714,8 @@ class ModelLoader:
if self.is_fsdp_enabled:
if self.cfg.fsdp_config.cpu_ram_efficient_loading:
skip_move_to_device = True
# Don't delete device_map for QLoRA + FSDP - it was set correctly in _set_device_map
# Don't delete device_map for QLoRA + FSDP - it was set correctly in
# _set_device_map
if (
"device_map" in self.model_kwargs
and not self.is_qlora_and_fsdp_enabled
@@ -716,6 +744,11 @@ class ModelLoader:
or self.cfg.qlora_sharded_model_loading
)
):
if self.cfg.reinit_weights:
LOG.warning(
"reinit_weights is not supported with sharded quantized loading. "
"Loading from pretrained weights instead."
)
quant_storage = self.cfg.torch_dtype
quantization_config = getattr(
self.model_config, "quantization_config", None
@@ -731,33 +764,12 @@ class ModelLoader:
quantization_config=quantization_config,
)
skip_move_to_device = True
elif (
self.model_config.model_type in ["llama", "llama4"]
and not self.cfg.trust_remote_code
and not self.cfg.gptq
):
# Please don't remove underscore binding without reading the fn docstring.
_ = self._configure_zero3_memory_efficient_loading()

# Load model with random initialization if specified
if self.cfg.random_init_weights:
# AutoModel classes support the from_config method
if self.auto_model_loader in [
AutoModelForCausalLM,
AutoModelForVision2Seq,
]:
self.model = self.auto_model_loader.from_config(
config=self.model_config,
)
else:
self.model = self.auto_model_loader(config=self.model_config)
else:
self.model = self.auto_model_loader.from_pretrained(
self.base_model,
config=self.model_config,
**self.model_kwargs,
)
elif self.model_type == "MambaLMHeadModel":
if self.cfg.reinit_weights:
LOG.warning(
"reinit_weights is not supported with MambaLMHeadModel. "
"Loading from pretrained weights instead."
)
# FIXME this is janky at best and hacked together to make it work
MambaLMHeadModel = fix_mamba_attn_for_loss()

@@ -770,41 +782,27 @@ class ModelLoader:
self.base_model,
**self.model_kwargs,
)
elif (
self.model_type
and self.model_type != "AutoModelForCausalLM"
and not self.cfg.trust_remote_code
):
if self.cfg.gptq:
self.model = self.auto_model_loader.from_pretrained(
self.base_model,
config=self.model_config,
trust_remote_code=self.cfg.trust_remote_code or False,
**self.model_kwargs,
)
else:
self.model = getattr(transformers, self.model_type).from_pretrained(
self.base_model,
config=self.model_config,
trust_remote_code=self.cfg.trust_remote_code or False,
**self.model_kwargs,
)
elif self.cfg.gptq:
self.model = self.auto_model_loader.from_pretrained(
self.base_model,
config=self.model_config,
trust_remote_code=self.cfg.trust_remote_code or False,
**self.model_kwargs,
)
else:
# Please don't remove underscore binding without reading the fn docstring.
# Please don't remove underscore binding without reading the fn docstring
_ = self._configure_zero3_memory_efficient_loading()
self.model = self.auto_model_loader.from_pretrained(
self.base_model,
config=self.model_config,
trust_remote_code=self.cfg.trust_remote_code or False,
**self.model_kwargs,
)
if (
self.model_type
and self.model_type != "AutoModelForCausalLM"
and not self.cfg.trust_remote_code
and not self.cfg.gptq
):
# Use model type from transformers
model_loader_class = getattr(transformers, self.model_type)
else:
# Use auto model loader (handles gptq and default cases)
model_loader_class = self.auto_model_loader

if self.cfg.reinit_weights:
self.model = self._load_model_from_config(model_loader_class)
else:
self.model = self._load_model_from_pretrained(model_loader_class)

if is_deepspeed_zero3_enabled():
skip_move_to_device = True

@@ -4,6 +4,7 @@ Applies pre- and post-model load patches for various fixes and optimizations.
"""

import importlib.util
import os
from functools import cached_property

import addict
@@ -66,6 +67,7 @@ class PatchManager:
self._apply_mistral_cross_entropy_patch()
self._apply_self_attention_lora_patch()
self._apply_fsdp2_bnb_patches()
self._apply_patch_deepspeed_zero3()

def apply_post_plugin_pre_model_load_patches(self):
"""Apply post plugin-pre_model_load load patches based on config."""
@@ -78,13 +80,7 @@ class PatchManager:
patch_maybe_log_save_evaluate,
)

patch_fsdp2 = (
self.cfg.torch_compile
and self.cfg.fsdp_config
and self.cfg.fsdp_version == 2
)

patch_evaluation_loop(patch_fsdp2)
patch_evaluation_loop()
patch_maybe_log_save_evaluate()

def apply_post_model_load_patches(self, model: PreTrainedModel):
@@ -147,14 +143,12 @@ class PatchManager:
def _apply_flex_attention_patches(self):
"""Apply patches for flexible attention."""
if self.cfg.flex_attention:
# from axolotl.monkeypatch.attention.flex_attn import (
# patch_flex_make_mask,
# patch_flex_wrapper,
# )
#
# flex_attn_compile_kwargs = self.cfg.flex_attn_compile_kwargs or {}
# patch_flex_wrapper(**flex_attn_compile_kwargs)
# patch_flex_make_mask()
from axolotl.monkeypatch.attention.flex_attn import (
patch_flex_wrapper,
)

flex_attn_compile_kwargs = self.cfg.flex_attn_compile_kwargs or {}
patch_flex_wrapper(**flex_attn_compile_kwargs)

if self.cfg.sample_packing:
from axolotl.core.attention.flex_block_mask import (
patch_create_causal_mask,
@@ -471,3 +465,17 @@ class PatchManager:
from axolotl.monkeypatch.lora_kernels import apply_lora_kernel_patches

apply_lora_kernel_patches(model=model, cfg=self.cfg)

def _apply_patch_deepspeed_zero3(self):
try:
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

from axolotl.monkeypatch.deepspeed_utils import apply_deepspeed_patches

if self.cfg.activation_offloading is True and (
is_deepspeed_zero3_enabled()
or os.getenv("ACCELERATE_DEEPSPEED_ZERO_STAGE") == "3"
):
apply_deepspeed_patches()
except ImportError as e:
LOG.warning(f"DeepSpeed patches not applied: {e}")
@@ -296,7 +296,7 @@ def load_tokenizer(cfg: DictDefault) -> PreTrainedTokenizer:
)

tokenizer.chat_template = chat_template_string
else:
elif getattr(tokenizer, "chat_template", None) is None:
LOG.info(
"No Chat template selected. Consider adding a chat template for easier inference."
)
@@ -160,9 +160,11 @@ def get_state_dict(self, model, unwrap=True):
state_dict[param_name] = param.cpu()
torch.distributed.barrier()
elif self.distributed_type == DistributedType.FSDP:
from torch.distributed.fsdp import FullStateDictConfig
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType
from torch.distributed.fsdp import (
FullStateDictConfig,
FullyShardedDataParallel as FSDP,
StateDictType,
)

full_state_dict_config = FullStateDictConfig(
offload_to_cpu=True, rank0_only=True
@@ -178,6 +180,38 @@ def get_state_dict(self, model, unwrap=True):

return state_dict

def cast_lora_module(module):
base_layer_dtype = module.base_layer.weight.dtype
# Linear4Bit will keep it's bias term in fp32. If the weight dtype is in bf16 we are not able to
# wrap this. Therefore we must ensure the bias has the same dtype as the weight
if hasattr(module.base_layer, "bias") and module.base_layer.bias is not None:
if module.base_layer.weight.dtype != module.base_layer.bias.dtype:
log_bias_dtype_mismatch = True
module.base_layer.bias.data = module.base_layer.bias.data.to(
module.base_layer.weight.dtype
)

for active_adapter in module.active_adapters:
if module.lora_A:
module.lora_A[active_adapter] = module.lora_A[active_adapter].to(base_layer_dtype)
if hasattr(module.lora_A[active_adapter], 'bias') and module.lora_A[active_adapter].bias is not None:
module.lora_A[active_adapter].bias.data = module.lora_A[active_adapter].bias.data.to(base_layer_dtype)
if module.lora_B:
module.lora_B[active_adapter] = module.lora_B[active_adapter].to(base_layer_dtype)
if hasattr(module.lora_B[active_adapter], 'bias') and module.lora_B[active_adapter].bias is not None:
module.lora_B[active_adapter].bias.data = module.lora_B[active_adapter].bias.data.to(base_layer_dtype)
if module.lora_embedding_A:
module.lora_embedding_A[active_adapter] = module.lora_embedding_A[active_adapter].to(base_layer_dtype)
if hasattr(module.lora_embedding_A[active_adapter], 'bias') and module.lora_embedding_A[active_adapter].bias is not None:
module.lora_embedding_A[active_adapter].bias.data = module.lora_embedding_A[active_adapter].bias.data.to(base_layer_dtype)
if module.lora_embedding_B:
module.lora_embedding_B[active_adapter] = module.lora_embedding_B[active_adapter].to(base_layer_dtype)
if hasattr(module.lora_embedding_B[active_adapter], 'bias') and module.lora_embedding_B[active_adapter].bias is not None:
module.lora_embedding_B[active_adapter].bias.data = module.lora_embedding_B[active_adapter].bias.data.to(base_layer_dtype)
if module.lora_magnitude_vector:
module.lora_magnitude_vector[active_adapter] = module.lora_magnitude_vector[active_adapter].to(base_layer_dtype)
if hasattr(module.lora_magnitude_vector[active_adapter], 'bias') and module.lora_magnitude_vector[active_adapter].bias is not None:
module.lora_magnitude_vector[active_adapter].bias.data = module.lora_magnitude_vector[active_adapter].bias.data.to(base_layer_dtype)

def _process_lora_module_for_fsdp(module, fsdp2_kwargs):
"""Helper function to process LoRA modules for FSDP2."""
@@ -193,18 +227,37 @@ def _process_lora_module_for_fsdp(module, fsdp2_kwargs):
module.base_layer.bias.data = module.base_layer.bias.data.to(
module.base_layer.weight.dtype
)
for active_adapter in module.active_adapters:
if module.lora_A:
fully_shard(module.lora_A[active_adapter], **fsdp2_kwargs)
if module.lora_B:
fully_shard(module.lora_B[active_adapter], **fsdp2_kwargs)
if module.lora_embedding_A:
fully_shard(module.lora_embedding_A[active_adapter], **fsdp2_kwargs)
if module.lora_embedding_B:
fully_shard(module.lora_embedding_B[active_adapter], **fsdp2_kwargs)
if module.lora_magnitude_vector:
fully_shard(module.lora_magnitude_vector[active_adapter], **fsdp2_kwargs)
fully_shard(module, **fsdp2_kwargs)
module.set_reshard_after_forward(False)
module.set_reshard_after_backward(False)
# for active_adapter in module.active_adapters:
# for adapter_name in [
# "lora_A",
# "lora_B",
# "lora_embedding_A",
# "lora_embedding_B",
# "lora_magnitude_vector",
# ]:
# adapter_module = getattr(module, adapter_name, None)
# # print(adapter_module, adapter_name)
# # torch.distributed.breakpoint()
# if not adapter_module:
# continue
# fsdp_adapter_module = fully_shard(adapter_module[active_adapter], **fsdp2_kwargs)
# # fsdp_adapter_module.unshard()
# fsdp_adapter_module.set_reshard_after_backward(False)
# fsdp_adapter_module.set_reshard_after_forward(False)
# torch.distributed.breakpoint()
# if module.lora_A:
# fully_shard(module.lora_A[active_adapter], **fsdp2_kwargs)
# if module.lora_B:
# fully_shard(module.lora_B[active_adapter], **fsdp2_kwargs)
# if module.lora_embedding_A:
# fully_shard(module.lora_embedding_A[active_adapter], **fsdp2_kwargs)
# if module.lora_embedding_B:
# fully_shard(module.lora_embedding_B[active_adapter], **fsdp2_kwargs)
# if module.lora_magnitude_vector:
# fully_shard(module.lora_magnitude_vector[active_adapter], **fsdp2_kwargs)
return log_bias_dtype_mismatch


@@ -318,16 +371,26 @@ def fsdp2_prepare_model(accelerator, model: torch.nn.Module) -> torch.nn.Module:
model.tie_weights()

is_peft_model = isinstance(model, PeftModel)
# TODO - this doesn't actually do anything
for name, module in model.named_children():
if name == "experts":
# torch.distributed.breakpoint()
for expert in module.children():
# torch.distributed.breakpoint()
print(f"expert: {expert}")
for lora_module in expert.children():
print(f"lora {lora_module}")
# torch.distributed.breakpoint()
cast_lora_module(lora_module)
_process_lora_module_for_fsdp(lora_module, fsdp2_kwargs)
auto_wrap_policy = fsdp2_prepare_auto_wrap_policy(fsdp2_plugin, model)
log_bias_dtype_mismatch = False
if auto_wrap_policy is not None:
for module in get_module_children_bottom_up(model)[:-1]:
if is_peft_model and isinstance(module, LoraLayer):
module_log_bias_mismatch = _process_lora_module_for_fsdp(
module, fsdp2_kwargs
)
log_bias_dtype_mismatch |= module_log_bias_mismatch
if is_peft_model and isinstance(module, LoraLayer) and not isinstance(module, FSDPModule):
# torch.distributed.breakpoint()
cast_lora_module(module)
# torch.distributed.breakpoint()

if auto_wrap_policy(module) and not isinstance(module, FSDPModule):
fully_shard(module, **fsdp2_kwargs)

@@ -344,6 +407,9 @@ def fsdp2_prepare_model(accelerator, model: torch.nn.Module) -> torch.nn.Module:
accelerator, model, original_sd, offload_to_cpu=offload_to_cpu
)

# for module in model.named_modules():
# if "Lora" in

if fsdp2_plugin.cpu_ram_efficient_loading and not model_has_params4bit:
# We re-register the buffers, as they may not be in the state_dict
for fqn, buffer_tensor in original_non_persistent_buffers.items():
@@ -1,10 +1,11 @@
"""Flex attention monkey patch"""

import sys
from typing import Optional, Tuple, Union

import torch
import transformers
from packaging import version
from transformers.utils.import_utils import _torch_version, is_torch_less_or_equal

from axolotl.utils.logging import get_logger

@@ -46,19 +47,33 @@ def patch_flex_wrapper(**flex_attn_compile_kwargs):
"""
self.training = None
if not self._is_flex_compiled or training != self.training:
self.training = training
if is_torch_less_or_equal("2.5.1"):
self._compiled_flex_attention = torch.compile(
flex_attention, dynamic=False
)
# In PyTorch 2.6.0, there's a known issue with flex attention compilation which may
# cause errors. The suggested fix is to compile with "max-autotune-no-cudagraphs"
# see https://github.com/pytorch/pytorch/issues/146260 for training
self.training = training
LOG.info(
"Compiling flex attention with kwargs: %s. This may take a while...",
flex_attn_compile_kwargs,
)
self._compiled_flex_attention = torch.compile(
flex_attention,
**flex_attn_compile_kwargs,
)
LOG.info("Flex attention compiled successfully.")
elif version.parse(_torch_version).base_version == "2.6.0" and training:
self._compiled_flex_attention = torch.compile(
flex_attention, dynamic=False, mode="max-autotune-no-cudagraphs"
)
# Fallback, usually the most recent torch 2.7.x+ versions
else:
LOG.info(
"Compiling flex attention with kwargs: %s. This may take a while...",
flex_attn_compile_kwargs,
main_process_only=True,
)
self._compiled_flex_attention = torch.compile(
flex_attention,
**flex_attn_compile_kwargs,
)
LOG.info(
"Flex attention compiled successfully.", main_process_only=True
)

self._is_flex_compiled = True

def __call__(self):
@@ -68,139 +83,3 @@ def patch_flex_wrapper(**flex_attn_compile_kwargs):
sys.modules[
"transformers.integrations.flex_attention"
].WrappedFlexAttention = WrappedFlexAttention


def patch_flex_make_mask():
is_torch_2_6 = torch.__version__.startswith("2.6")

if not is_torch_2_6:
return

from torch.nn.attention.flex_attention import (
_DEFAULT_SPARSE_BLOCK_SIZE as flex_default_block_size,
)
from torch.nn.attention.flex_attention import (
BlockMask,
)
from torch.nn.attention.flex_attention import (
create_block_mask as create_block_causal_mask_flex,
)

Offset = Union[torch.Tensor, int]

def patched_make_flex_block_causal_mask(
attention_mask_2d: torch.Tensor,
attention_chunk_size: Optional[int] = None,
query_length=None,
key_length=None,
offsets: Optional[Tuple[Offset, Offset]] = None,
) -> "BlockMask":
"""
Create a block causal document mask for a batch of sequences, both packed and unpacked.
Create Block causal logic and passing it into :func:`torch.nn.attention.flex_attention.create_block_mask`.
The resultant BlockMask is a compressed representation of the full block causal
mask. BlockMask is essential for performant computation of flex attention.
See: https://pytorch.org/blog/flexattention/

Args:
attention_mask_2d (torch.Tensor): Attention mask for packed and padded sequences
of shape (batch_size, total_seq_len). e.g.

For unpacked sequence:
[[1, 1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0]]

For packed sequence:
[[1, 1, 1, 2, 2, 2, 0],
[1, 1, 2, 2, 2, 3, 3]]

Returns:
BlockMask
"""

batch_size, total_seq_len = attention_mask_2d.shape
if not key_length:
key_length = total_seq_len
if not query_length:
query_length = total_seq_len
attention_mask_2d = torch.nn.functional.pad(
attention_mask_2d,
value=0,
pad=(0, abs(total_seq_len - max(key_length, flex_default_block_size))),
)
device = attention_mask_2d.device
document_ids = attention_mask_2d.clone()

if attention_chunk_size is not None:
# we create an arange, then we just // by chunk size to get [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
chunk_idxs = (document_ids.clone().fill_(1).cumsum(-1) - 1) // (
attention_chunk_size
)

# Instead of passing a tensor mask, flex attention requires a mask_mod function
# that determines which elements of QK^T should be included in the attention
# computation prior to the softmax. For sample packing, we need both the
# logic for both causal mask and document mask. See PyTorch's official
# blog post for more details: https://pytorch.org/blog/flexattention/#mask-mods
def causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
"""
Defines the logic of a block causal mask by combining both a standard causal mask
and a block diagonal document mask.

See :func:`~torchtune.modules.attention_utils.create_block_causal_mask`
for an illustration.
"""
causal_mask = q_idx >= kv_idx  # not valid when decoding
document_mask = (
document_ids[batch_idx, q_idx] == document_ids[batch_idx, kv_idx]
)
padding_mask = attention_mask_2d[batch_idx, q_idx] > 0
final_mask = causal_mask & padding_mask & document_mask
return final_mask

def chunk_causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
"""
Combines the chunk mask with the causal mask for chunked attention.
"""
chunk_mask = chunk_idxs[batch_idx, q_idx] == chunk_idxs[batch_idx, kv_idx]
causal_doc_mask = causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx)
return chunk_mask & causal_doc_mask

mask_mod_maybe_combined = (
causal_mask_mod if attention_chunk_size is None else chunk_causal_mask_mod
)

if offsets is not None:
q_offset = offsets[0]
kv_offset = offsets[1]

def mask_mod(batch_idx, head_idx, q_idx, kv_idx):
offset_q = q_idx + q_offset
offset_kv = kv_idx + kv_offset
return mask_mod_maybe_combined(batch_idx, head_idx, offset_q, offset_kv)

else:
mask_mod = mask_mod_maybe_combined
return create_block_causal_mask_flex(
mask_mod=mask_mod,
B=batch_size,
H=None,  # attention head
Q_LEN=query_length,
KV_LEN=key_length,
device=device,
_compile=True,
)

for n in tuple(sys.modules):
if ".modeling_" in n:
if hasattr(sys.modules[n], "make_flex_block_causal_mask"):
sys.modules[
n
].make_flex_block_causal_mask = patched_make_flex_block_causal_mask
sys.modules[
n
].make_flex_block_causal_mask = patched_make_flex_block_causal_mask

transformers.integrations.flex_attention.make_flex_block_causal_mask = (
patched_make_flex_block_causal_mask
)
67
src/axolotl/monkeypatch/deepspeed_utils.py
Normal file
@@ -0,0 +1,67 @@
import importlib
import importlib.util

from axolotl.utils.logging import get_logger

LOG = get_logger(__name__)


def patch_checkpoint_wrapper_setattr():
"""
Patch CheckpointWrapper to properly forward DeepSpeed attributes to wrapped modules.

This fixes the issue where CheckpointWrapper doesn't forward ds_* attributes
(like ds_grads_remaining) to the actual wrapped module, causing DeepSpeed
ZeRO-3 to fail when gradient checkpointing is enabled.

This issue occurs specifically with:
- QLoRA + DeepSpeed ZeRO-3
- gradient_checkpointing: true
- activation_offloading: true

References:
- https://github.com/deepspeedai/DeepSpeed/issues/7203
- https://github.com/deepspeedai/DeepSpeed/blob/38d1a9eb64c9e01e32eccc50b25ba18925287441/deepspeed/runtime/zero/parameter_offload.py#L424-L458
- https://github.com/axolotl-ai-cloud/axolotl/pull/3102
"""

try:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
CheckpointWrapper,
)

# Check if already patched
if hasattr(CheckpointWrapper, "_axolotl_setattr_patched"):
LOG.debug("CheckpointWrapper already patched")
return

original_setattr = CheckpointWrapper.__setattr__

def new_setattr(self, name: str, value) -> None:
if name.startswith("ds_") and hasattr(self, "_checkpoint_wrapped_module"):
setattr(self._checkpoint_wrapped_module, name, value)
LOG.debug(
f"Forwarded {name} to wrapped module {type(self._checkpoint_wrapped_module).__name__}"
)
else:
original_setattr(self, name, value)

CheckpointWrapper.__setattr__ = new_setattr
CheckpointWrapper._axolotl_setattr_patched = True

LOG.info("CheckpointWrapper patched to forward DeepSpeed attributes")

except ImportError as e:
LOG.debug(f"CheckpointWrapper not available: {e}")
except Exception as e:
LOG.warning(f"Failed to patch CheckpointWrapper: {e}")


def apply_deepspeed_patches():
"""
Apply DeepSpeed-related patches
"""
if importlib.util.find_spec("deepspeed") is not None:
patch_checkpoint_wrapper_setattr()
else:
LOG.debug("DeepSpeed not available, skipping patches")
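Illustrative sketch of the behaviour this new module enables (not part of the diff; the wrapped Linear layer below is a hypothetical example). Once CheckpointWrapper.__setattr__ is patched, DeepSpeed-style ds_* attributes assigned on the wrapper land on the wrapped module instead of staying on the wrapper:

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

from axolotl.monkeypatch.deepspeed_utils import patch_checkpoint_wrapper_setattr

patch_checkpoint_wrapper_setattr()
wrapped = checkpoint_wrapper(nn.Linear(8, 8))
wrapped.ds_grads_remaining = 0  # forwarded to the inner Linear by the patch
assert wrapped._checkpoint_wrapped_module.ds_grads_remaining == 0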
@@ -149,6 +149,11 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:

return MistralAttention

if model_type == "gemma3_text":
from transformers.models.gemma3.modeling_gemma3 import Gemma3Attention

return Gemma3Attention

try:
# Dynamically import the module and attention class
module_path = f"transformers.models.{model_type}.modeling_{model_type}"
@@ -36,8 +36,13 @@ SUPPORTED_MULTIPACK_MODEL_TYPES = [
"glm",
"glm4",
"smollm3",
"granite",
"granitemoe",
"hunyuan_v1_dense",
"hunyuan_v1_moe",
"gpt_oss",
"arcee",
"seed_oss",
]

@@ -8,6 +8,94 @@ from typing import List
import torch


class DeepSpeedTiledMLPMoE(torch.autograd.Function):
@staticmethod
def forward(
ctx,
fn,
self,
x,
shards,
compute_params,
) -> torch.Tensor:
ctx.fn = fn
ctx.self = self
ctx.shards = shards
ctx.compute_params = [p for p in compute_params if p.requires_grad]
ctx.save_for_backward(x)

x_shards = list(torch.chunk(x, chunks=shards, dim=1))
with torch.no_grad():
output_shards = [fn(self, x_shard) for x_shard in x_shards]

ctx.is_tuple_output = isinstance(output_shards[0], tuple)
if isinstance(output_shards[0], tuple):
tuple_dim_idx = [1, 0]
output_unsharded = tuple(
torch.cat(
[output_shard[i] for output_shard in output_shards],
dim=tuple_dim_idx[i],
)
for i in range(len(output_shards[0]))
)
else:
output_unsharded = torch.cat(output_shards, dim=1)

return output_unsharded

@staticmethod
def backward(ctx, *grads) -> torch.Tensor:
fn = ctx.fn
(x,) = ctx.saved_tensors
self = ctx.self
shards = ctx.shards
compute_params = ctx.compute_params
is_tuple_output = ctx.is_tuple_output

x_requires_grad = x.requires_grad
x = x.detach()
# detach() unsets `x.requires_grad`, so restore it
x.requires_grad_(x_requires_grad)

incoming_grad = grads[0]
x_grad = torch.zeros_like(x)
x_shards = list(torch.chunk(x, chunks=shards, dim=1))

shard_step = x_shards[0].numel()
for i, x_shard in enumerate(x_shards):
# Tell deepspeed not to add a new grad to its ipg bucket until the last shard is run
if compute_params is not None:
if i + 1 < shards:
for param in compute_params:
param.ds_grad_is_ready = False
else:
# last shard, can add the grad
for param in compute_params:
param.ds_grad_is_ready = True

x_shard.requires_grad_(x_requires_grad)

shard_offset = i * shard_step
x_shard.grad = (
x_grad.view(-1)
.narrow(0, shard_offset, x_shard.numel())
.view_as(x_shard)
)
incoming_grad_shard = (
incoming_grad.view(-1)
.narrow(0, shard_offset, x_shard.numel())
.view_as(x_shard)
)
with torch.enable_grad():
output = fn(self, x_shard)
if is_tuple_output:
torch.autograd.backward(output[0], incoming_grad_shard)
else:
torch.autograd.backward(output, incoming_grad_shard)

return (None, None, x_grad, None, None)


class TiledMLP(torch.autograd.Function):
"""
TiledMLP implementation using gradient hooks
@@ -31,7 +119,18 @@ class TiledMLP(torch.autograd.Function):
x_shards = list(torch.chunk(x, chunks=shards, dim=1))
with torch.no_grad():
output_shards = [fn(self, x_shard) for x_shard in x_shards]
output_unsharded = torch.cat(output_shards, dim=1)
ctx.is_tuple_output = isinstance(output_shards[0], tuple)
if isinstance(output_shards[0], tuple):
tuple_dim_idx = [1, 0]
output_unsharded = tuple(
torch.cat(
[output_shard[i] for output_shard in output_shards],
dim=tuple_dim_idx[i],
)
for i in range(len(output_shards[0]))
)
else:
output_unsharded = torch.cat(output_shards, dim=1)

return output_unsharded

@@ -42,6 +141,7 @@ class TiledMLP(torch.autograd.Function):
self = ctx.self
shards = ctx.shards
compute_params = ctx.compute_params
is_tuple_output = ctx.is_tuple_output

x_requires_grad = x.requires_grad
x = x.detach()
@@ -76,7 +176,10 @@ class TiledMLP(torch.autograd.Function):

with torch.enable_grad():
output = fn(self, x_shard)
torch.autograd.backward(output, incoming_grad_shard)
if is_tuple_output:
torch.autograd.backward(output[0], incoming_grad_shard)
else:
torch.autograd.backward(output, incoming_grad_shard)

# Clean up hooks
grad_accumulator.cleanup()
@@ -17,7 +17,7 @@ def patch_tiled_mlp(model_type, use_original_mlp=True, cfg_num_shards=None):
TiledMLP as DeepSpeedTiledMLP,
)

from axolotl.monkeypatch.tiled_mlp.base import TiledMLP
from axolotl.monkeypatch.tiled_mlp.base import DeepSpeedTiledMLPMoE, TiledMLP

try:
# Dynamically import the module and MLP class
@@ -64,7 +64,10 @@ def patch_tiled_mlp(model_type, use_original_mlp=True, cfg_num_shards=None):
for p in self._compute_params
)
) or os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true":
self._tiled_mlp_dist_impl = DeepSpeedTiledMLP
if model_type == "gpt_oss":
self._tiled_mlp_dist_impl = DeepSpeedTiledMLPMoE
else:
self._tiled_mlp_dist_impl = DeepSpeedTiledMLP
else:
self._tiled_mlp_dist_impl = TiledMLP

@@ -28,15 +28,6 @@ PATCHED_EVAL_CODE = {
"array": 'metrics[f"{metric_key_prefix}_loss"] = np.nanmean(all_losses).item()',
}

ORIGINAL_FSDP2_CODE = """
model.eval()
"""

PATCHED_FSDP2_CODE = """
if hasattr(model, "eval") and callable(model.eval):
self.model.eval()
"""

ORIGINAL_MAYBE_CODE = "tr_loss_scalar = self._nested_gather(tr_loss).mean().item()"
PATCHED_MAYBE_CODE = "tr_loss_scalar = self._nested_gather(tr_loss).nanmean().item()"

@@ -46,13 +37,7 @@ def check_evaluation_loop_is_patchable() -> bool:
return all(value in evaluation_loop_source for value in ORIGINAL_EVAL_CODE.values())


def check_evaluation_loop_is_fsdp2_patchable() -> bool:
evaluation_loop_source = inspect.getsource(Trainer.evaluation_loop)
evaluation_loop_source, _ = detab_code(evaluation_loop_source)
return ORIGINAL_FSDP2_CODE in evaluation_loop_source


def patch_evaluation_loop(patch_fsdp2: bool):
def patch_evaluation_loop():
"""Patch the evaluation_loop method."""
# Check if already patched
if hasattr(Trainer, "_original_evaluation_loop"):
@@ -75,13 +60,6 @@ def patch_evaluation_loop(patch_fsdp2: bool):
ORIGINAL_EVAL_CODE["array"], PATCHED_EVAL_CODE["array"]
)

# Apply FSDP2 eval guard patch if needed
if patch_fsdp2 and ORIGINAL_FSDP2_CODE in evaluation_loop_source:
evaluation_loop_source = evaluation_loop_source.replace(
ORIGINAL_FSDP2_CODE, PATCHED_FSDP2_CODE
)
LOG.info("Applied FSDP2 eval guard patch to evaluation_loop")

# Rename the function to avoid conflicts
evaluation_loop_source = evaluation_loop_source.replace(
"def evaluation_loop(",
@@ -75,7 +75,7 @@ class PromptTokenizingStrategy(abc.ABC):
) -> BatchEncoding:
empty = BatchEncoding(data={"input_ids": [], "attention_mask": []})
if not prompt:
LOG.warning("Empty text requested for tokenization.")
LOG.warning_once("Empty text requested for tokenization.")
return empty

result = self.tokenizer(
@@ -30,11 +30,7 @@ from axolotl.contribs.lgpl import ( # pylint: disable = no-name-in-module
fix_untrained_tokens,
)
from axolotl.integrations.base import PluginManager
from axolotl.loaders import (
ModelLoader,
load_processor,
load_tokenizer,
)
from axolotl.loaders import ModelLoader, load_processor, load_tokenizer
from axolotl.utils.ctx_managers.sequence_parallel import SequenceParallelContextManager
from axolotl.utils.dict import DictDefault
from axolotl.utils.distributed import cleanup_distributed
@@ -234,16 +230,15 @@ def save_trained_model(

# handle QAT
if cfg.qat:
from axolotl.utils.quantization import convert_qat_model_for_ptq
from axolotl.utils.quantization import convert_qat_model

LOG.info("Processing QAT model for saving...")
convert_qat_model_for_ptq(
convert_qat_model(
model,
quantize_embedding=cfg.qat.quantize_embedding,
)
LOG.info(
"QAT modules have been converted for PTQ. Please ensure you quantize "
"your model weights with `axolotl quantize`."
"QAT usage note: please ensure you quantize your model fine-tuned using QAT by running `axolotl quantize`"
" with the same config which you used for training."
)
# Handle ReLoRA early return case
if cfg.relora:
@@ -337,9 +332,7 @@ def save_trained_model(

if hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
# TODO: add integration support so this can be implemented completely within the plugin
from axolotl.integrations.llm_compressor.utils import (
save_compressed_model,
)
from axolotl.integrations.llm_compressor.utils import save_compressed_model

save_compressed_model(
model=model,
@@ -416,7 +409,9 @@ def save_initial_configs(

# Pre-save the tokenizer and model configs
LOG.info(f"Pre-saving tokenizer to {cfg.output_dir}...")
tokenizer.save_pretrained(str(output_dir))
tokenizer.save_pretrained(
str(Path(cfg.output_dir)), save_jinja_files=cfg.tokenizer_save_jinja_files
)
if hasattr(model, "config"):
LOG.info(f"Pre-saving model config to {cfg.output_dir}...")
model.config.save_pretrained(str(output_dir))
@@ -592,6 +587,9 @@ def train(

# Save the trained model and cleanup
save_trained_model(cfg, trainer, model, safe_serialization)
tokenizer.save_pretrained(
str(Path(cfg.output_dir)), save_jinja_files=cfg.tokenizer_save_jinja_files
)
create_model_card(cfg, trainer)
if not cfg.use_ray:
cleanup_distributed()
@@ -60,13 +60,14 @@ def gpu_memory_usage_all(device=0):
active = torch.cuda.memory_stats().get("active_bytes.all.peak", 0) / 1024.0**3
allocated = torch.cuda.max_memory_allocated(device) / 1024.0**3
reserved = torch.cuda.max_memory_reserved(device) / 1024.0**3
torch.cuda.reset_peak_memory_stats(device)
return active, allocated, reserved


def mps_memory_usage_all():
usage = torch.mps.current_allocated_memory() / 1024.0**3
reserved = torch.mps.driver_allocated_memory() / 1024.0**3
return usage, reserved - usage, 0
active = torch.mps.current_allocated_memory() / 1024.0**3
allocated = torch.mps.driver_allocated_memory() / 1024.0**3
return active, allocated, 0


def npu_memory_usage_all(device=0):
 src/axolotl/utils/callbacks/tokens_per_second.py (new file, 64 lines)
@@ -0,0 +1,64 @@
+"""A callback for calculating tokens per second during training."""
+
+import time
+
+import torch
+from transformers import (
+    TrainerCallback,
+    TrainerControl,
+    TrainerState,
+    TrainingArguments,
+)
+
+
+class TokensPerSecondCallback(TrainerCallback):
+    """
+    A callback to measure and log tokens per second during training.
+    """
+
+    def __init__(self, tensor_parallel_size, context_parallel_size):
+        super().__init__()
+        self.step_time = 0.0
+        self.start_time = 0.0
+        self.non_data_parallel_size = 1
+        if tensor_parallel_size is not None:
+            self.non_data_parallel_size *= tensor_parallel_size
+        if context_parallel_size is not None:
+            self.non_data_parallel_size *= context_parallel_size
+
+    def on_step_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ):  # pylint: disable=unused-argument
+        self.start_time = time.perf_counter()
+        state.last_tokens_per_second = torch.zeros(1)
+
+    def on_step_end(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ):  # pylint: disable=unused-argument
+        if hasattr(state, "num_tokens"):
+            step_time = time.perf_counter() - self.start_time
+            num_tokens_per_device = state.num_tokens.clone()
+            # non data parallel groups have duplicated tokens, so we avoid double-counting
+            num_tokens_per_device = num_tokens_per_device / self.non_data_parallel_size
+            state.last_tokens_per_second = num_tokens_per_device / step_time
+
+    def on_log(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        logs=None,
+        **kwargs,
+    ):  # pylint: disable=unused-argument
+        # after logging, clear the running metrics
+        if hasattr(state, "last_tokens_per_second"):
+            state.last_tokens_per_second.zero_()
+            state.num_tokens = torch.zeros(1)
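Reviewer note: a minimal sketch of exercising the new callback in isolation, outside a Trainer loop; the token count written into state.num_tokens is made up for illustration, and in real training it is expected to be populated by the trainer each step:

import torch
from transformers import TrainerControl, TrainerState, TrainingArguments

from axolotl.utils.callbacks.tokens_per_second import TokensPerSecondCallback

cb = TokensPerSecondCallback(tensor_parallel_size=2, context_parallel_size=None)
args = TrainingArguments(output_dir="out")
state, control = TrainerState(), TrainerControl()

cb.on_step_begin(args, state, control)
state.num_tokens = torch.tensor([8192.0])  # tokens seen on this device during the step
cb.on_step_end(args, state, control)
print(state.last_tokens_per_second)        # per-device tokens/sec, de-duplicated by TP size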
@@ -1,11 +1,17 @@
-"""
-shared axolotl collators for multipack, mamba, multimodal
-"""
+"""Shared axolotl collators for multipacking, mamba, multimodal."""

-from .batching import (  # noqa: F401
+from .batching import (
     BatchSamplerDataCollatorForSeq2Seq,
     DataCollatorForSeq2Seq,
     PretrainingBatchSamplerDataCollatorForSeq2Seq,
     V2BatchSamplerDataCollatorForSeq2Seq,
 )
-from .mamba import MambaDataCollator  # noqa: F401
+from .mamba import MambaDataCollator
+
+__all__ = [
+    "DataCollatorForSeq2Seq",
+    "BatchSamplerDataCollatorForSeq2Seq",
+    "V2BatchSamplerDataCollatorForSeq2Seq",
+    "PretrainingBatchSamplerDataCollatorForSeq2Seq",
+    "MambaDataCollator",
+]
@@ -17,8 +17,8 @@ from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.config import (
     AxolotlConfigWCapabilities as AxolotlConfigWCapabilitiesBase,
+    AxolotlInputConfig as AxolotlInputConfigBase,
 )
-from axolotl.utils.schemas.config import AxolotlInputConfig as AxolotlInputConfigBase
 from axolotl.utils.schemas.datasets import DPODataset, KTODataset, SFTDataset

 LOG = get_logger(__name__)
@@ -77,7 +77,7 @@ def resolve_dtype(cfg):
     if cfg.device == "mps":
         cfg.load_in_8bit = False
         cfg.tf32 = False
-        if cfg.bf16:
+        if cfg.bf16 and cfg.fp16 is not False:
             cfg.fp16 = True
             cfg.bf16 = False
     else:
@@ -273,7 +273,9 @@ def validate_config(
     # Convert datasets to proper format if needed
     if cfg.get("datasets"):
         for idx, ds_cfg in enumerate(cfg["datasets"]):
-            if cfg.get("rl") in ["dpo", "simpo"] and not isinstance(ds_cfg, DPODataset):
+            if cfg.get("rl") in ["dpo", "ipo", "simpo"] and not isinstance(
+                ds_cfg, DPODataset
+            ):
                 cfg["datasets"][idx] = DPODataset(**ds_cfg)
             elif cfg.get("rl") == "kto" and not isinstance(ds_cfg, KTODataset):
                 cfg["datasets"][idx] = KTODataset(**dict(ds_cfg))
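Reviewer note: a minimal, self-contained sketch of the coercion this hunk extends to `ipo`, using a stand-in pydantic model in place of axolotl's DPODataset:

from pydantic import BaseModel

class DPODataset(BaseModel):  # stand-in for axolotl.utils.schemas.datasets.DPODataset
    path: str
    split: str | None = None

cfg = {"rl": "ipo", "datasets": [{"path": "my/preference-pairs", "split": "train"}]}
for idx, ds_cfg in enumerate(cfg["datasets"]):
    if cfg.get("rl") in ["dpo", "ipo", "simpo"] and not isinstance(ds_cfg, DPODataset):
        cfg["datasets"][idx] = DPODataset(**ds_cfg)
print(cfg["datasets"][0])  # path='my/preference-pairs' split='train'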
@@ -48,10 +48,10 @@ def apply_sequence_parallelism(
         - The original sequence length before padding.
         - The number of padding tokens added.
     """
-    original_seq_len = batch["input_ids"].size(1)
+    batch_size, original_seq_len = batch["input_ids"].shape

     # Update ring attention params if needed
-    if batch.get("position_ids") is not None:
+    if batch.get("position_ids") is not None and batch_size == 1:
         update_ring_attn_params(position_ids=batch["position_ids"])
     else:
         # If position_ids aren't already in the batch, create them
@@ -1,19 +1,19 @@
 """Init for `axolotl.utils.data` module."""

-from axolotl.utils.data.pretraining import (
-    encode_pretraining,
-    wrap_pretraining_dataset,
-)
 from axolotl.utils.data.rl import prepare_preference_datasets
 from axolotl.utils.data.sft import (
     get_dataset_wrapper,
     prepare_datasets,
 )
+from axolotl.utils.data.streaming import (
+    encode_streaming,
+    wrap_streaming_dataset,
+)
 from axolotl.utils.data.utils import md5

 __all__ = [
-    "encode_pretraining",
-    "wrap_pretraining_dataset",
+    "encode_streaming",
+    "wrap_streaming_dataset",
     "prepare_preference_datasets",
     "get_dataset_wrapper",
     "prepare_datasets",
@@ -9,13 +9,13 @@ from datasets import (
     Dataset,
     DatasetDict,
     IterableDataset,
+    IterableDatasetDict,
     load_dataset,
 )
 from transformers import PreTrainedTokenizer, ProcessorMixin

 from axolotl.prompters import Prompter
 from axolotl.utils.data.lock import FileLockLoader
-from axolotl.utils.data.pretraining import wrap_pretraining_dataset
 from axolotl.utils.data.shared import (
     create_train_validation_split,
     datasets_with_name_generator,
@@ -26,6 +26,7 @@ from axolotl.utils.data.shared import (
     save_preprocessed_dataset,
     try_load_from_hub,
 )
+from axolotl.utils.data.streaming import wrap_streaming_dataset
 from axolotl.utils.data.utils import (
     deduplicate_and_log_datasets,
     handle_long_seq_in_dataset,
@@ -48,7 +49,6 @@ def prepare_datasets(
     cfg: DictDefault,
     tokenizer: PreTrainedTokenizer,
     processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
 ) -> tuple[IterableDataset | Dataset, Dataset | None, int, list[Prompter | None]]:
     """Prepare training and evaluation datasets based on configuration.
@@ -56,23 +56,19 @@ def prepare_datasets(
         cfg: Dictionary mapping `axolotl` config keys to values.
         tokenizer: Tokenizer to use for processing text.
         processor: Optional processor for multimodal datasets.
-        preprocess_iterable: Whether to use iterable preprocessing.

     Returns:
         Tuple of (train_dataset, eval_dataset, total_steps, prompters).
     """
-    if cfg.pretraining_dataset:
-        return _prepare_pretraining_dataset(
-            cfg, tokenizer, processor, preprocess_iterable
-        )
-    return _prepare_standard_dataset(cfg, tokenizer, processor, preprocess_iterable)
+    if cfg.streaming or cfg.pretraining_dataset:
+        return _prepare_streaming_dataset(cfg, tokenizer, processor)
+    return _prepare_standard_dataset(cfg, tokenizer, processor)


 def _prepare_standard_dataset(
     cfg: DictDefault,
     tokenizer: PreTrainedTokenizer,
     processor: ProcessorMixin | None,
-    preprocess_iterable: bool,
 ) -> tuple[Dataset, Dataset | None, int, list[Prompter | None]]:
     """Prepare standard (non-pretraining) datasets."""

@@ -83,7 +79,6 @@ def _prepare_standard_dataset(
         cfg,
         split="train",
         processor=processor,
-        preprocess_iterable=preprocess_iterable,
     )

     # Overwrite eval_dataset if test data exists
@@ -93,7 +88,6 @@ def _prepare_standard_dataset(
         cfg,
         split="test",
         processor=processor,
-        preprocess_iterable=preprocess_iterable,
     )

     return train_dataset, eval_dataset, prompters
@@ -128,22 +122,40 @@ def _prepare_standard_dataset(
     return train_dataset, eval_dataset, total_num_steps, prompters


-def _prepare_pretraining_dataset(
+def _prepare_streaming_dataset(
     cfg: DictDefault,
     tokenizer: PreTrainedTokenizer,
     processor: ProcessorMixin | None,
-    preprocess_iterable: bool,
 ) -> tuple[IterableDataset, Dataset | None, int, list[Prompter | None]]:
     """
-    Prepare dataset for pretraining mode.
+    Prepare dataset for streaming mode.

-    Note: Pre-training datasets are streamed from the HuggingFace Hub.
+    Note: Streaming datasets are loaded incrementally from the source.
     """
-    # Extract pretraining dataset configuration
-    pretraining_config = _extract_pretraining_config(cfg)
-
-    # Load streaming dataset for training
-    train_dataset = _load_pretraining_dataset(pretraining_config, cfg, tokenizer)
+    if cfg.pretraining_dataset:
+        dataset_config = _extract_pretraining_config(cfg)
+        train_dataset = _load_streaming_dataset(dataset_config, cfg, tokenizer)
+    elif cfg.sample_packing:
+        # TODO(djsaunde): Implement for multiple datasets
+        dataset_config = DictDefault(cfg.datasets[0])
+
+        # Ensure we have a split set - default to 'train' if not specified
+        if not hasattr(dataset_config, "split") or not dataset_config.split:
+            dataset_config.split = "train"
+        train_dataset = _load_streaming_dataset(dataset_config, cfg, tokenizer)
+    else:
+        # Use legacy loading function for non-packed streaming datasets
+        train_dataset, eval_dataset, prompters = _load_and_prepare_datasets(
+            tokenizer,
+            cfg,
+            split="train",
+            processor=processor,
+            streaming=True,
+        )
+
+        # Return early for non-packed streaming datasets
+        total_num_steps = cfg.max_steps if cfg.max_steps else -1
+        return train_dataset, eval_dataset, total_num_steps, prompters

     # Load evaluation dataset if specified
     eval_dataset = None
@@ -153,14 +165,12 @@ def _prepare_pretraining_dataset(
             cfg,
             split="test",
             processor=processor,
-            preprocess_iterable=preprocess_iterable,
+            streaming=False,
         )

-    if cfg.dataset_exact_deduplication:
-        LOG.info("Deduplication not available for pretrained datasets")
-
-    # For pretraining, we return max_steps directly from config
-    return train_dataset, eval_dataset, cfg.max_steps, []
+    # For streaming, we return max_steps directly from config or -1 if not set
+    total_num_steps = cfg.max_steps if cfg.max_steps else -1
+    return train_dataset, eval_dataset, total_num_steps, []


 def _extract_pretraining_config(cfg: DictDefault) -> DictDefault:
@@ -192,7 +202,7 @@ def _extract_pretraining_config(cfg: DictDefault) -> DictDefault:
     )


-def _load_pretraining_dataset(
+def _load_streaming_dataset(
     pretraining_config: DictDefault, cfg: DictDefault, tokenizer: PreTrainedTokenizer
 ) -> IterableDataset:
     """Load and prepare a streaming dataset for pretraining."""
@@ -227,15 +237,11 @@ def _load_pretraining_dataset(
         iter_dataset = iter_dataset.skip(pretraining_config["skip"])

     # Wrap the dataset for pretraining
-    train_dataset = wrap_pretraining_dataset(
+    train_dataset = wrap_streaming_dataset(
         iter_dataset,
         tokenizer,
         cfg,
         dataset_wrapper_partial,
-        max_tokens=cfg.sequence_len,
-        batch_size=cfg.micro_batch_size,
-        seed=cfg.seed,
-        buffer_size=cfg.pretrain_multipack_buffer_size or 10_000,
     )

     # Format for PyTorch
@@ -256,7 +262,7 @@ def _load_tokenized_prepared_datasets(
     cfg: DictDefault,
     split: Literal["train", "test"] = "train",
     processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | DatasetDict, list[Prompter | None]]:
     """Load or create tokenized and prepared datasets for training or testing.

@@ -265,7 +271,7 @@ def _load_tokenized_prepared_datasets(
         cfg: Configuration object.
         split: Dataset split to load ('train' or 'test').
         processor: Optional processor for multimodal datasets.
-        preprocess_iterable: Whether to use iterable preprocessing.
+        streaming: Whether to use iterable preprocessing.

     Returns:
         Tuple of (dataset, prompters list).
@@ -296,7 +302,7 @@ def _load_tokenized_prepared_datasets(
         tokenizer,
         split,
         processor,
-        preprocess_iterable,
+        streaming,
     )

     return dataset, prompters
@@ -308,7 +314,7 @@ def _load_raw_datasets(
     tokenizer: PreTrainedTokenizer,
     split: str,
     processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset, list[Prompter | None]]:
     """Load, process, merge, and save raw datasets."""
     LOG.info("Loading raw datasets...", main_process_only=False)
@@ -329,7 +335,7 @@ def _load_raw_datasets(
             split=split,
             seed=cfg.seed,
             processor=processor,
-            preprocess_iterable=preprocess_iterable,
+            streaming=streaming,
         )
         datasets.append(dataset_wrapper)
         prompters.append(dataset_prompter)
@@ -337,7 +343,7 @@ def _load_raw_datasets(
     # Merge datasets
     dataset = merge_datasets(datasets, cfg)

-    if not cfg.skip_prepare_dataset:
+    if not cfg.skip_prepare_dataset and not streaming:
         if split == "test" and cfg.eval_sequence_len:
             dataset = handle_long_seq_in_dataset(dataset, cfg.eval_sequence_len, cfg)
         else:
@@ -361,19 +367,19 @@ def _load_and_process_single_dataset(
     split: str,
     seed: int,
     processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | IterableDataset, Prompter | None]:
     """Load and process a single dataset based on the passed config."""
     # Load the dataset
     dataset = load_dataset_with_config(
-        dataset_config, cfg.hf_use_auth_token, streaming=preprocess_iterable
+        dataset_config, cfg.hf_use_auth_token, streaming=streaming
     )

     # Parse dataset type
     d_base_type, d_prompt_style = _parse_dataset_type(dataset_config.type)

     # Select the appropriate split
-    if isinstance(dataset, DatasetDict):
+    if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
         if dataset_config.split and dataset_config.split in dataset:
             dataset = dataset[dataset_config.split]
         elif split in dataset:
@@ -479,7 +485,7 @@ def _load_and_prepare_datasets(
     cfg: DictDefault,
     split: Literal["train", "test"] = "train",
     processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | None, Dataset | None, list[Prompter | None]]:
     """Load and prepare datasets with optional validation split and sharding.

@@ -488,7 +494,7 @@ def _load_and_prepare_datasets(
         cfg: Configuration object.
         split: Dataset split to load ('train' or 'test').
         processor: Optional processor for multimodal datasets.
-        preprocess_iterable: Whether to use iterable preprocessing.
+        streaming: Whether to use iterable preprocessing.

     Returns:
         Tuple of (train_dataset, eval_dataset, prompters).
@@ -499,7 +505,7 @@ def _load_and_prepare_datasets(
         cfg,
         split=split,
         processor=processor,
-        preprocess_iterable=preprocess_iterable,
+        streaming=streaming,
     )

     # Apply dataset sharding if configured using shared function
@@ -236,11 +236,9 @@ def _load_from_local_path(
         try:
             return load_from_disk(dataset_config.path)
         except FileNotFoundError:
-            load_dataset_kwargs["streaming"] = False
             return load_dataset(dataset_config.path, **load_dataset_kwargs)
     elif local_path.is_file():
         dataset_type = get_dataset_type(dataset_config)
-        load_dataset_kwargs["streaming"] = False
         return load_dataset(
             dataset_type,
             data_files=dataset_config.path,
@@ -1,4 +1,4 @@
-"""data handling specific to pretraining"""
+"""Data handling specific to streaming datasets."""

 import functools
 from collections import defaultdict
@@ -17,10 +17,10 @@ from axolotl.utils.trainer import process_pretraining_datasets_for_packing
 LOG = get_logger(__name__)


-def encode_pretraining(
+def encode_streaming(
+    examples: Dict[str, List],
     tokenizer: PreTrainedTokenizerBase,
     max_tokens: int,
-    examples: Dict[str, List],
     text_column: str = "text",
     concatenate: bool = True,
 ) -> Dict[str, List]:
@@ -176,45 +176,57 @@ def encode_pretraining(
     return ret


-def wrap_pretraining_dataset(
+def wrap_streaming_dataset(
     dataset,
     tokenizer,
     cfg,
     ds_wrapper_fn,
-    max_tokens=2048,
-    batch_size=1,
-    seed=42,
-    buffer_size=10_000,
 ):
     if cfg.sample_packing:
+        # For SFT (non-pretraining) datasets, always use multipack_attn=True to ensure
+        # attention isolation between packed sequences
+        multipack_attn = (
+            True if not cfg.pretraining_dataset else cfg.pretrain_multipack_attn
+        )
+
         collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq(
             tokenizer,
             return_tensors="pt",
             padding=True,
-            pad_to_multiple_of=max_tokens,
-            multipack_attn=cfg.pretrain_multipack_attn,
+            pad_to_multiple_of=cfg.sequence_len,
+            multipack_attn=multipack_attn,
         )
         encode = functools.partial(
-            encode_packed_pretraining,
+            encode_packed_streaming,
             collate_fn,
             ds_wrapper_fn,
-            max_seq_length=max_tokens,
-            batch_size=batch_size,
-            multipack_attn=cfg.pretrain_multipack_attn,
+            max_seq_length=cfg.sequence_len,
+            batch_size=cfg.micro_batch_size,
+            multipack_attn=multipack_attn,
         )
-        # set this to 1 so downstream data_loader doesn't try to increase the batch again
+        # Set this to 1 so downstream data_loader doesn't try to increase the batch size
+        # again
         cfg.micro_batch_size = 1
     else:
+        # NOTE: This is not reachable for SFT datasets since we use the pre-existing
+        # loading function for non-packed streaming datasets. Refer to
+        # _prepare_streaming_datasets in sft.py for that code path.
+        text_column = (
+            getattr(cfg.pretraining_dataset[0], "text_column", "text") or "text"
+        )
         encode = functools.partial(
-            encode_pretraining,
-            tokenizer,
-            max_tokens,
-            text_column=cfg.pretraining_dataset[0].text_column or "text",
+            encode_streaming,
+            tokenizer=tokenizer,
+            max_tokens=cfg.sequence_len,
+            text_column=text_column,
             concatenate=cfg.pretraining_sample_concatenation is True,
         )

     if cfg.shuffle_merged_datasets:
-        dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
+        dataset = dataset.shuffle(
+            seed=cfg.seed, buffer_size=cfg.streaming_multipack_buffer_size
+        )
     else:
         LOG.debug("NOT shuffling merged pretraining datasets")
@@ -232,14 +244,13 @@ def wrap_pretraining_dataset(
     dataset = dataset.map(
         encode,
         batched=True,
-        batch_size=buffer_size,
-        # input_columns="text",
+        batch_size=cfg.streaming_multipack_buffer_size,
         remove_columns=remove_columns,
     )
     return dataset


-def encode_packed_pretraining(
+def encode_packed_streaming(
     collate_fn,
     ds_wrapper: Callable,
     examples: Dict[str, List],
@@ -274,8 +285,6 @@ def encode_packed_pretraining(
     for batch in sampler:
         for data in batch:
             features = train_dataset[data]
-            if "num_truncated_tokens" in features:
-                del features["num_truncated_tokens"]
             if "num_truncated_tokens" in features:
                 del features["num_truncated_tokens"]
             if "overflow_to_sample_mapping" in features:
@@ -190,12 +190,21 @@ def handle_long_seq_in_dataset(
     Returns:
         Filtered dataset with long sequences removed.
     """
-    if "input_ids" not in dataset.column_names:
+    if (
+        hasattr(dataset, "column_names")
+        and dataset.column_names
+        and "input_ids" not in dataset.column_names
+    ):
         LOG.warning(
             "Dataset does not contain 'input_ids' column. Skip drop long seq. This is "
             "expected for reward modeling."
         )
         return dataset
+    elif not hasattr(dataset, "column_names") or dataset.column_names is None:
+        LOG.info(
+            "Dataset is streaming (IterableDataset), skipping long sequence handling"
+        )
+        return dataset

     drop_long = functools.partial(
         drop_long_seq,
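Reviewer note: the new guard relies on the fact that an IterableDataset with unknown features reports column_names as None; a minimal sketch of that property using the datasets library:

from datasets import Dataset, IterableDataset

def gen():
    yield {"input_ids": [1, 2, 3]}

materialized = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5]]})
streamed = IterableDataset.from_generator(gen)
print(materialized.column_names)  # ['input_ids'] -> length filter runs
print(streamed.column_names)      # None -> long-sequence handling is skipped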
@@ -6,8 +6,6 @@ from importlib.metadata import version

 from accelerate.utils.environment import (
     check_cuda_p2p_ib_support as accelerate_check_cuda_p2p_ib_support,
-)
-from accelerate.utils.environment import (
     get_gpu_info,
 )
 from packaging.version import Version, parse
@@ -3,30 +3,47 @@ Utilities for quantization including QAT and PTQ using torchao.
 """

 import torch
-from torch import nn
+from packaging import version
 from torchao.core.config import AOBaseConfig
 from torchao.quantization import quantize_
 from torchao.quantization.qat import (
-    FakeQuantizeConfig,
-    FromIntXQuantizationAwareTrainingConfig,
-    IntXQuantizationAwareTrainingConfig,
+    QATConfig,
 )
 from torchao.quantization.quant_api import (
-    Int4DynamicActivationInt4WeightConfig,
-    Int4WeightOnlyConfig,
+    Float8DynamicActivationFloat8WeightConfig,
+    Float8DynamicActivationInt4WeightConfig,
     Int8DynamicActivationInt4WeightConfig,
-    Int8DynamicActivationInt8WeightConfig,
-    Int8WeightOnlyConfig,
-    UIntXWeightOnlyConfig,
-    _is_linear,
 )

-from axolotl.utils.schemas.enums import TorchIntDType
+from axolotl.utils.schemas.enums import TorchAOQuantDType
+
+quantization_config_to_str = {
+    Int8DynamicActivationInt4WeightConfig: "int8int4",
+    Float8DynamicActivationFloat8WeightConfig: "fp8fp8",
+    Float8DynamicActivationInt4WeightConfig: "fp8int4",
+}
+
+if version.parse(torch.__version__) >= version.parse("2.8.0"):
+    try:
+        from torchao.prototype.mx_formats import NVFP4InferenceConfig
+
+        quantization_config_to_str[NVFP4InferenceConfig] = "nvfp4"
+    except:
+        pass
+
+# int4 weight config imports will fail on machines with fbgemm-gpu installed
+# without a CUDA runtime available so we do this safely
+try:
+    from torchao.quantization.quant_api import Int4WeightOnlyConfig
+
+    quantization_config_to_str[Int4WeightOnlyConfig] = "int4"
+except:
+    pass


-def get_ptq_config(
-    weight_dtype: TorchIntDType,
-    activation_dtype: TorchIntDType | None = None,
+def get_quantization_config(
+    weight_dtype: TorchAOQuantDType,
+    activation_dtype: TorchAOQuantDType | None = None,
     group_size: int | None = None,
 ) -> AOBaseConfig:
     """
@@ -45,44 +62,101 @@ def get_ptq_config(
     or if the group size is not specified for int8 or int4 weight only quantization.
     """
     if activation_dtype is None:
-        if not weight_dtype.value.is_signed:  # type: ignore[attr-defined,union-attr]
-            return UIntXWeightOnlyConfig(
-                dtype=weight_dtype.value,
-                group_size=group_size,
-                set_inductor_config=False,
-            )
-        if weight_dtype == TorchIntDType.int8:
-            if group_size is None:
-                raise ValueError(
-                    "group_size must be specified for int8 weight only quantization"
-                )
-            return Int8WeightOnlyConfig(
-                group_size=group_size,
-            )
-        if weight_dtype == TorchIntDType.int4:
-            if group_size is None:
-                raise ValueError(
-                    "group_size must be specified for int4 weight only quantization"
-                )
-            return Int4WeightOnlyConfig(
-                group_size=group_size,
-            )
-    if activation_dtype == TorchIntDType.int4 and weight_dtype == TorchIntDType.int4:
-        return Int4DynamicActivationInt4WeightConfig()
-    if activation_dtype == TorchIntDType.int8 and weight_dtype == TorchIntDType.int8:
-        return Int8DynamicActivationInt8WeightConfig()
-    if activation_dtype == TorchIntDType.int8 and weight_dtype == TorchIntDType.int4:
-        return Int8DynamicActivationInt4WeightConfig()
+        if weight_dtype == TorchAOQuantDType.int8:
+            raise ValueError("Int8WeightOnlyConfig is not supported by torchao QAT.")
+        if weight_dtype == TorchAOQuantDType.int4:
+            from torchao.quantization.quant_api import Int4WeightOnlyConfig
+
+            if group_size is not None:
+                return Int4WeightOnlyConfig(group_size=group_size, version=2)
+            else:
+                return Int4WeightOnlyConfig(version=2)
+    if (
+        activation_dtype == TorchAOQuantDType.int4
+        and weight_dtype == TorchAOQuantDType.int4
+    ):
+        raise ValueError(
+            "Int4DynamicActivationInt4WeightConfig is not supported by torchao QAT."
+        )
+    if (
+        activation_dtype == TorchAOQuantDType.int8
+        and weight_dtype == TorchAOQuantDType.int8
+    ):
+        raise ValueError(
+            "Int8DynamicActivationInt8WeightConfig is not supported by torchao QAT."
+        )
+    if (
+        activation_dtype == TorchAOQuantDType.int8
+        and weight_dtype == TorchAOQuantDType.int4
+    ):
+        if group_size is not None:
+            return Int8DynamicActivationInt4WeightConfig(group_size=group_size)
+        else:
+            return Int8DynamicActivationInt4WeightConfig()
+    if (
+        activation_dtype == TorchAOQuantDType.float8_e4m3fn
+        and weight_dtype == TorchAOQuantDType.float8_e4m3fn
+    ):
+        return Float8DynamicActivationFloat8WeightConfig()
+    if (
+        activation_dtype == TorchAOQuantDType.float8_e4m3fn
+        and weight_dtype == TorchAOQuantDType.int4
+    ):
+        return Float8DynamicActivationInt4WeightConfig()
+    if weight_dtype == TorchAOQuantDType.nvfp4:
+        from torchao.prototype.mx_formats import NVFP4InferenceConfig
+
+        if group_size is not None and group_size != 16:
+            raise ValueError("NVFP4 quantization must use a group_size of 16")
+        return NVFP4InferenceConfig()
     raise ValueError(
         f"Invalid activation/weight dtype combination: {activation_dtype}/{weight_dtype}"
     )


+def quantize_model(
+    model,
+    weight_dtype: TorchAOQuantDType,
+    group_size: int | None = None,
+    activation_dtype: TorchAOQuantDType | None = None,
+    quantize_embedding: bool | None = None,
+):
+    """
+    This function is used to quantize a model.
+
+    Args:
+        model: The model to quantize.
+        weight_dtype: The dtype to use for weight quantization.
+        group_size: The group size to use for weight quantization.
+        activation_dtype: The dtype to use for activation quantization.
+        quantize_embedding: Whether to quantize the model's embedding weights.
+
+    """
+    linear_ptq_config = get_quantization_config(
+        weight_dtype=weight_dtype,
+        activation_dtype=activation_dtype,
+        group_size=group_size,
+    )
+    quantize_(model, linear_ptq_config)
+    if quantize_embedding:
+        # activation fake quantization is not supported for embedding layers
+        embedding_quantize_config = get_quantization_config(
+            weight_dtype=weight_dtype,
+            activation_dtype=None,
+            group_size=group_size,
+        )
+        quantize_(
+            model,
+            embedding_quantize_config,
+            filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
+        )
+
+
 def prepare_model_for_qat(
     model,
-    weight_dtype: TorchIntDType,
-    group_size: int,
-    activation_dtype: TorchIntDType | None = None,
+    weight_dtype: TorchAOQuantDType,
+    group_size: int | None = None,
+    activation_dtype: TorchAOQuantDType | None = None,
     quantize_embedding: bool = False,
 ):
     """
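Reviewer note: a minimal usage sketch of the new quantize_model helper, assuming the module path axolotl.utils.quantization and a small CPU-loadable model; the int8-activation/int4-weight combination is one of the mappings defined above, though a recent torchao (and possibly a GPU) may be needed for the underlying kernels:

from transformers import AutoModelForCausalLM

from axolotl.utils.quantization import quantize_model  # module path assumed
from axolotl.utils.schemas.enums import TorchAOQuantDType

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")  # illustrative
quantize_model(
    model,
    weight_dtype=TorchAOQuantDType.int4,
    group_size=32,
    activation_dtype=TorchAOQuantDType.int8,  # maps to Int8DynamicActivationInt4WeightConfig
    quantize_embedding=False,
)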
@@ -100,86 +174,40 @@ def prepare_model_for_qat(
     Raises:
         ValueError: If the activation/weight dtype combination is invalid.
     """
-    if activation_dtype:
-        activation_config = FakeQuantizeConfig(
-            dtype=activation_dtype.value, granularity="per_token", is_symmetric=False
-        )
-    weight_config = FakeQuantizeConfig(dtype=weight_dtype.value, group_size=group_size)
-    linear_quantize_config = IntXQuantizationAwareTrainingConfig(
-        activation_config=None if activation_dtype is None else activation_config,
-        weight_config=weight_config,
-    )
-    quantize_(model, linear_quantize_config)
-    if quantize_embedding:
-        # activation fake quantization is not supported for embedding layers
-        embedding_quantize_config = IntXQuantizationAwareTrainingConfig(
-            activation_config=None,
-            weight_config=weight_config,
-        )
-        quantize_(
-            model,
-            embedding_quantize_config,
-            filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
-        )
-
-
-def quantize_model_for_ptq(
-    model,
-    weight_dtype: TorchIntDType,
-    group_size: int | None = None,
-    activation_dtype: TorchIntDType | None = None,
-    quantize_embedding: bool | None = None,
-):
-    """
-    This function is used to quantize a model for post-training quantization.
-    It swaps the model's linear layers with fake quantized linear layers.
-    If `quantize_embedding` is True, it will also swap the model's embedding weights with fake quantized embedding weights.
-
-    Args:
-        model: The model to quantize.
-        weight_dtype: The dtype to use for weight quantization.
-        group_size: The group size to use for weight quantization.
-        activation_dtype: The dtype to use for activation quantization.
-        quantize_embedding: Whether to quantize the model's embedding weights.
-
-    """
-    linear_ptq_config = get_ptq_config(
+    base_config = get_quantization_config(
         weight_dtype=weight_dtype,
         activation_dtype=activation_dtype,
         group_size=group_size,
     )
-    quantize_(model, linear_ptq_config)
+    qat_config = QATConfig(base_config)
+    quantize_(model, qat_config)
     if quantize_embedding:
-        embedding_quantize_config = get_ptq_config(
+        # activation fake quantization is not supported for embedding layers
+        embedding_base_config = get_quantization_config(
             weight_dtype=weight_dtype,
             activation_dtype=None,
             group_size=group_size,
         )
+        embedding_qat_config = QATConfig(embedding_base_config)
         quantize_(
             model,
-            embedding_quantize_config,
+            embedding_qat_config,
             filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
         )


-def convert_qat_model_for_ptq(
+def convert_qat_model(
     model,
-    *,
-    quantize_embedding: bool | None = None,
+    quantize_embedding: bool = False,
 ):
     """
-    This function is used to convert a swap fake-quantized modules in a model
-    which has been trained with QAT back to the original modules, ready for PTQ.
-
-    Args:
-        model: The model to convert.
-        quantize_embedding: Whether to quantize the model's embedding weights.
+    This function converts a QAT model which has fake quantized layers back to the original model.
     """
+    config = QATConfig(step="convert")
+    quantize_(model, config)
     if quantize_embedding:
-        def filter_fn(m, _):
-            return isinstance(m, nn.Embedding) or _is_linear(m)
-    else:
-        filter_fn = _is_linear
-    quantize_(model, FromIntXQuantizationAwareTrainingConfig(), filter_fn=filter_fn)
+        quantize_(
+            model,
+            config,
+            filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
+        )
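Reviewer note: a minimal sketch of the intended QAT lifecycle with the renamed helpers — fake-quantize before training, then convert the fake-quantized modules back afterwards. The toy model, module path, and dtype choice are illustrative only:

import torch.nn as nn

from axolotl.utils.quantization import convert_qat_model, prepare_model_for_qat  # path assumed
from axolotl.utils.schemas.enums import TorchAOQuantDType

model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 128))  # toy stand-in for an LLM
prepare_model_for_qat(
    model,
    weight_dtype=TorchAOQuantDType.int4,
    group_size=32,
    activation_dtype=TorchAOQuantDType.int8,  # int8 activation / int4 weight QAT
)
# ... training loop runs here with fake-quantized linear layers ...
convert_qat_model(model, quantize_embedding=False)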
@@ -106,6 +106,12 @@ class AxolotlInputConfig(
             "description": "Don't upcast the embeddings to float32 when using PEFT. Useful for low-VRAM GPUs"
         },
     )
+    reinit_weights: bool | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "Reinitialize model weights randomly instead of loading pretrained weights"
+        },
+    )
     trainer_cls: str | None = Field(
         default=None,
@@ -138,6 +144,12 @@ class AxolotlInputConfig(
             "description": "Process reward modelling: `True` or `False`"
         },
    )
+    center_rewards_coefficient: float | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "Coefficient to incentivize the reward model to output mean-zero rewards (proposed by https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: `0.01`."
+        },
+    )
     num_labels: int | None = None
     # Whether to use weighting in DPO trainer.
     # If `None`, default is `False` in the trainer.
@@ -475,12 +487,6 @@ class AxolotlInputConfig(
         },
     )
     multipack_real_batches: bool | None = None
-    pretraining_sample_concatenation: bool | None = Field(
-        default=None,
-        json_schema_extra={
-            "description": "whether to concatenate samples during pretraining",
-        },
-    )

     batch_flattening: Literal["auto"] | bool | None = Field(
         default=None,
@@ -495,13 +501,34 @@ class AxolotlInputConfig(
     pose_max_context_len: int | None = None
     pose_num_chunks: int | None = None

-    pretrain_multipack_buffer_size: int | None = 10_000
+    # Deprecated: Use streaming_multipack_buffer_size instead
+    pretrain_multipack_buffer_size: int | None = Field(
+        default=None,
+        deprecated="Deprecated in v0.13.0, will be removed in v0.14.0. Use streaming_multipack_buffer_size instead",
+    )
     pretrain_multipack_attn: bool | None = Field(
         default=True,
         json_schema_extra={
             "description": "whether to prevent cross attention for packed sequences during pretraining",
         },
     )
+    pretraining_sample_concatenation: bool | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "whether to concatenate samples during pretraining",
+        },
+    )
+
+    streaming: bool | None = Field(
+        default=None,
+        json_schema_extra={"description": "Use streaming mode for loading datasets"},
+    )
+    streaming_multipack_buffer_size: int | None = Field(
+        default=10_000,
+        json_schema_extra={
+            "description": "Buffer size for multipack streaming datasets"
+        },
+    )

     xformers_attention: bool | None = Field(
         default=None,
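Reviewer note: a minimal sketch of how the new streaming knobs are meant to be combined, written as the dict form of an axolotl config; the field names come from this diff, while the dataset and model values are purely illustrative:

from axolotl.utils.dict import DictDefault

cfg = DictDefault(
    {
        "base_model": "HuggingFaceTB/SmolLM2-135M",  # illustrative
        "datasets": [{"path": "allenai/c4", "name": "en", "type": "completion"}],
        "streaming": True,                       # stream instead of pre-tokenizing to disk
        "sample_packing": True,                  # packed path shuffles with the buffer below
        "streaming_multipack_buffer_size": 10_000,
        "sequence_len": 2048,
        "micro_batch_size": 2,
        "max_steps": 1000,                       # streaming runs report max_steps as total steps
    }
)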
@@ -830,10 +857,15 @@ class AxolotlInputConfig(
     include_tokens_per_second: bool | None = Field(
         default=None,
         json_schema_extra={
-            "description": "bool of whether to include tokens trainer per second in the training metrics. This iterates over the entire dataset once, so it takes some time."
+            "description": "bool of whether to report tokens per second at the end of training. This is not supported with pre-training datasets."
+        },
+    )
+    include_tkps: bool | None = Field(
+        default=True,
+        json_schema_extra={
+            "description": "bool of whether to report tokens per second per-gpu during training by measuring throughput of non-padding tokens."
         },
     )

     neftune_noise_alpha: float | None = Field(
         default=None,
         json_schema_extra={
@@ -927,7 +959,15 @@ class AxolotlInputConfig(
         },
     )

-    fix_untrained_tokens: int | list[int] | None = None
+    fix_untrained_tokens: int | list[int] | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": (
+                "Token index or indices to adjust embedding weights to the mean of the other tokens. "
+                "This is useful when the model has untrained embeddings."
+            )
+        },
+    )

     # INTERNALS - document for now, generally not set externally
     is_preprocess: bool | None = None
@@ -986,6 +1026,26 @@ class AxolotlInputConfig(
             return [ds_config.model_dump(exclude_none=True) for ds_config in ds_configs]
         return None

+    @model_validator(mode="before")
+    @classmethod
+    def warn_peft_trainable_token_to_fix_untrained(cls, data):
+        if (
+            peft_trainable_token_indices := data.get("peft_trainable_token_indices")
+        ) and (fix_untrained_tokens := data.get("fix_untrained_tokens")):
+            if isinstance(fix_untrained_tokens, int):
+                fix_untrained_tokens = (fix_untrained_tokens,)
+
+            if isinstance(peft_trainable_token_indices, int):
+                peft_trainable_token_indices = (peft_trainable_token_indices,)
+
+            for untrained_token_id in fix_untrained_tokens:
+                if untrained_token_id not in peft_trainable_token_indices:
+                    LOG.warning_once(
+                        f"Token {untrained_token_id} is fixed via `fix_untrained_tokens`, yet not in `peft_trainable_token_indices: ` list. "
+                        "Please add it, otherwise the token won't be trained on."
+                    )
+        return data
+

 class AxolotlConfigWCapabilities(AxolotlInputConfig):
     """wrapper to valdiate GPU capabilities with the configured options"""
@@ -1259,3 +1319,14 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
             data["dataset_processes"] = get_default_process_count()

         return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_deduplication_with_streaming(cls, data):
+        if data.get("dataset_exact_deduplication") and (
+            data.get("streaming") or data.get("pretraining_dataset")
+        ):
+            raise NotImplementedError(
+                "dataset_exact_deduplication is not available for streaming datasets. "
+            )
+        return data
@@ -5,18 +5,21 @@ from enum import Enum
 import torch


-class TorchIntDType(Enum):
-    """Torch integer data types - `getattr` guards against torch < 2.6 which does not support int4"""
+class TorchAOQuantDType(Enum):
+    int4 = torch.int4
+    int8 = torch.int8
+    float8_e4m3fn = torch.float8_e4m3fn
+    nvfp4 = "nvfp4"

-    uint1 = getattr(torch, "uint1", None)
-    uint2 = getattr(torch, "uint2", None)
-    uint3 = getattr(torch, "uint3", None)
-    uint4 = getattr(torch, "uint4", None)
-    uint5 = getattr(torch, "uint5", None)
-    uint6 = getattr(torch, "uint6", None)
-    uint7 = getattr(torch, "uint7", None)
-    int4 = getattr(torch, "int4", None)
-    int8 = getattr(torch, "int8", None)
+    def from_string(str):
+        if str == "int4":
+            return TorchAOQuantDType.int4
+        if str == "int8":
+            return TorchAOQuantDType.int8
+        if str in ["float8_e4m3fn", "fp8", "float8"]:
+            return TorchAOQuantDType.float8_e4m3fn
+        if str == "nvfp4":
+            return TorchAOQuantDType.nvfp4


 class RLType(str, Enum):
@@ -59,16 +59,21 @@ class ModelInputConfig(BaseModel):
     processor_type: str | None = Field(
         default=None, json_schema_extra={"description": "transformers processor class"}
     )
+    tokenizer_save_jinja_files: bool | None = Field(
+        default=True,  # match the default behavior from transformers
+        json_schema_extra={
+            "description": "Whether to save jinja files for tokenizer, transformers default is True"
+        },
+    )
     trust_remote_code: bool | None = Field(
         default=None,
         json_schema_extra={"description": "Trust remote code for untrusted source"},
     )

     experimental_skip_move_to_device: bool | None = Field(
-        default=None,
+        default=True,
         json_schema_extra={
-            "description": "Don't move the model to the device before sharding. "
-            "This is an experimental feature that may be included in the future as the default."
+            "description": "Don't move the model to the device before sharding. Set to `false` to revert to legacy behavior."
         },
     )
@@ -90,6 +90,16 @@ class LoraConfig(BaseModel):
             "description": "How to initialize LoRA weights. Default to True which is MS original implementation."
         },
     )
+    peft_trainable_token_indices: list[int] | dict[str, list[int]] | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": (
+                "A list of token indices to fine-tune on the `embed_tokens` layer.\n"
+                "Otherwise, a dict mapping an embedding layer name to its trainable token indices.\n"
+                "See https://huggingface.co/docs/peft/v0.17.0/en/developer_guides/lora#efficiently-train-tokens-alongside-lora"
+            )
+        },
+    )
     qlora_sharded_model_loading: bool | None = Field(
         default=False,
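Reviewer note: a minimal sketch of the pairing that the new warn_peft_trainable_token_to_fix_untrained validator checks for — token ids fixed via fix_untrained_tokens should normally also be listed in peft_trainable_token_indices, otherwise those embeddings are never updated:

from axolotl.utils.dict import DictDefault

cfg = DictDefault(
    {
        "adapter": "lora",
        "lora_r": 16,
        "lora_alpha": 32,
        "peft_trainable_token_indices": [32000, 32001],  # newly added special tokens
        "fix_untrained_tokens": [32000, 32001],          # also re-initialize their embeddings
    }
)
# Dropping 32001 from peft_trainable_token_indices would trigger the warning_once above.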
@@ -6,7 +6,23 @@ from typing import Any
|
|||||||
|
|
||||||
from pydantic import BaseModel, Field, field_validator
|
from pydantic import BaseModel, Field, field_validator
|
||||||
|
|
||||||
from axolotl.utils.schemas.enums import TorchIntDType
|
from axolotl.utils.schemas.enums import TorchAOQuantDType
|
||||||
|
|
||||||
|
|
||||||
|
def validate_ao_dtype(v: Any) -> TorchAOQuantDType | None:
|
||||||
|
if v is None:
|
||||||
|
return None
|
||||||
|
if v == "int4":
|
||||||
|
return TorchAOQuantDType.int4
|
||||||
|
if v == "int8":
|
||||||
|
return TorchAOQuantDType.int8
|
||||||
|
if v in ["float8_e4m3fn", "fp8", "float8"]:
|
||||||
|
return TorchAOQuantDType.float8_e4m3fn
|
||||||
|
if v == "nvfp4":
|
||||||
|
return TorchAOQuantDType.nvfp4
|
||||||
|
raise ValueError(
|
||||||
|
f"Invalid dtype: '{v}'. Must be one of: {[e.name for e in TorchAOQuantDType] + ['fp8', 'float8']}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
class QATConfig(BaseModel):
|
class QATConfig(BaseModel):
|
||||||
@@ -14,13 +30,13 @@ class QATConfig(BaseModel):
|
|||||||
QAT Config Schema
|
QAT Config Schema
|
||||||
"""
|
"""
|
||||||
|
|
||||||
activation_dtype: TorchIntDType | None = Field(
|
activation_dtype: TorchAOQuantDType | None = Field(
|
||||||
default=None,
|
default=None,
|
||||||
description='Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"',
|
description="Fake quantization layout to use for activation quantization.",
|
||||||
)
|
)
|
||||||
weight_dtype: TorchIntDType = Field(
|
weight_dtype: TorchAOQuantDType = Field(
|
||||||
default=TorchIntDType.int8,
|
default=TorchAOQuantDType.int8,
|
||||||
description='Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"',
|
description="Fake quantization layout to use for weight quantization.",
|
||||||
)
|
)
|
||||||
quantize_embedding: bool | None = Field(
|
quantize_embedding: bool | None = Field(
|
||||||
default=False, description="Quantize embedding"
|
default=False, description="Quantize embedding"
|
||||||
@@ -35,12 +51,8 @@ class QATConfig(BaseModel):
|
|||||||
|
|
||||||
@field_validator("activation_dtype", "weight_dtype", mode="before")
|
@field_validator("activation_dtype", "weight_dtype", mode="before")
|
||||||
@classmethod
|
@classmethod
|
||||||
def validate_dtype(cls, v: Any) -> TorchIntDType | None:
|
def validate_dtype(cls, v: Any) -> TorchAOQuantDType | None:
|
||||||
if v == "int4":
|
return validate_ao_dtype(v)
|
||||||
return TorchIntDType.int4
|
|
||||||
if v == "int8":
|
|
||||||
return TorchIntDType.int8
|
|
||||||
raise ValueError(f"Invalid dtype: '{v}'. Must be one of: ['int4', 'int8']")
|
|
||||||
|
|
||||||
|
|
||||||
class PTQConfig(BaseModel):
|
class PTQConfig(BaseModel):
|
||||||
@@ -48,13 +60,13 @@ class PTQConfig(BaseModel):
     PTQ Config Schema
     """
 
-    weight_dtype: TorchIntDType = Field(
-        default=TorchIntDType.int8,
-        description="Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8",
+    weight_dtype: TorchAOQuantDType = Field(
+        default=TorchAOQuantDType.int8,
+        description="Fake quantization layout to use for weight quantization.",
     )
-    activation_dtype: TorchIntDType | None = Field(
+    activation_dtype: TorchAOQuantDType | None = Field(
         default=None,
-        description='Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"',
+        description="Fake quantization layout to use for activation quantization.",
     )
     quantize_embedding: bool | None = Field(
         default=None, description="Whether to quantize the embedding layer."
@@ -66,9 +78,5 @@ class PTQConfig(BaseModel):
 
     @field_validator("activation_dtype", "weight_dtype", mode="before")
     @classmethod
-    def validate_dtype(cls, v: Any) -> TorchIntDType | None:
-        if v == "int4":
-            return TorchIntDType.int4
-        if v == "int8":
-            return TorchIntDType.int8
-        raise ValueError(f"Invalid dtype: '{v}'. Must be one of: ['int4', 'int8']")
+    def validate_dtype(cls, v: Any) -> TorchAOQuantDType | None:
+        return validate_ao_dtype(v)
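With this change, both QAT and PTQ route string dtypes through the shared validate_ao_dtype helper, so aliases such as "fp8" and "float8" normalize to the same enum member as "float8_e4m3fn". A minimal sketch of how that plays out, assuming the schema module's import path (the path is not shown in this diff):

    # Assumed import path for illustration only; the diff does not show the filename.
    from axolotl.utils.schemas.quantization import QATConfig

    qat = QATConfig(weight_dtype="fp8", activation_dtype="int8")
    # The mode="before" validator calls validate_ao_dtype, normalizing the alias:
    assert qat.weight_dtype.name == "float8_e4m3fn"
    assert qat.activation_dtype.name == "int8"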
@@ -14,7 +14,6 @@ from transformers.utils.import_utils import is_torch_npu_available
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.enums import ChatTemplate, RingAttnFunc, RLType
 
-
 LOG = get_logger(__name__)
 
 SUPPORTED_METRICS = {"sacrebleu", "comet", "ter", "chrf", "perplexity"}
@@ -60,6 +59,20 @@ class DatasetValidationMixin:
             raise ValueError("either datasets or pretraining_dataset is required")
         return data
 
+    @model_validator(mode="before")
+    @classmethod
+    def check_pretraining_streaming_deprecation(cls, data):
+        # TODO(djsaunde): remove this check + implement change for 0.13.0 release
+        if data.get("pretraining_dataset") and not data.get("streaming"):
+            LOG.warning(
+                "Setting `pretraining_dataset` without explicitly setting `streaming: "
+                "true` is deprecated. In a future release, streaming will not be "
+                "automatically enabled when using pretraining_dataset. Please "
+                "explicitly set `streaming: true` in your configuration to maintain "
+                "current behavior."
+            )
+        return data
+
     @model_validator(mode="before")
     @classmethod
     def check_push_ds_auth(cls, data):
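The warning above only fires when pretraining_dataset is set and streaming is left unset; opting in explicitly keeps today's behavior once the default flips. A hypothetical config fragment (the dataset name is illustrative, not from this diff):

    cfg = {
        "pretraining_dataset": "allenai/c4",  # illustrative dataset, not from this diff
        "streaming": True,                    # explicit opt-in silences the deprecation warning
    }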
@@ -340,6 +353,30 @@ class TrainingValidationMixin:
             )
         return data
 
+    @model_validator(mode="before")
+    @classmethod
+    def check_multipack_buffer_size(cls, data):
+        if data.get("pretrain_multipack_buffer_size") and not data.get(
+            "streaming_multipack_buffer_size"
+        ):
+            LOG.warning(
+                "`pretrain_multipack_buffer_size` is deprecated in v0.13.0, will be "
+                "removed in v0.14.0. Use `streaming_multipack_buffer_size` instead."
+            )
+            data["streaming_multipack_buffer_size"] = data[
+                "pretrain_multipack_buffer_size"
+            ]
+            del data["pretrain_multipack_buffer_size"]
+        elif data.get("pretrain_multipack_buffer_size") and data.get(
+            "streaming_multipack_buffer_size"
+        ):
+            raise ValueError(
+                "pretrain_multipack_buffer_size is deprecated, use "
+                "streaming_multipack_buffer_size; both are set, please remove the "
+                "deprecated pretrain_multipack_buffer_size setting"
+            )
+        return data
+
     @model_validator(mode="after")
     def check_fft_possible_bad_config(self):
         if (
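The validator above migrates the old key in place: it warns, copies the value to streaming_multipack_buffer_size, deletes the deprecated entry, and raises only if both keys are supplied. A standalone rehearsal of that mapping (not the axolotl code path itself, value chosen arbitrarily):

    data = {"pretrain_multipack_buffer_size": 10_000}
    data["streaming_multipack_buffer_size"] = data.pop("pretrain_multipack_buffer_size")
    assert data == {"streaming_multipack_buffer_size": 10_000}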
@@ -1074,6 +1111,50 @@ class PretrainingValidationMixin:
             data["accelerator_config"]["dispatch_batches"] = False
         return data
 
+    @model_validator(mode="before")
+    @classmethod
+    def check_pretraining_w_val_set_size(cls, data):
+        if data.get("pretraining_dataset") and data.get("val_set_size"):
+            raise ValueError(
+                "val_set_size is not supported with pretraining_dataset. "
+                "Use test_datasets to specify evaluation datasets for pretraining."
+            )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_streaming_w_val_set_size(cls, data):
+        if data.get("streaming") and data.get("val_set_size"):
+            raise ValueError(
+                "val_set_size is not supported with streaming datasets. "
+                "Use test_datasets to specify evaluation datasets when streaming is enabled."
+            )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_streaming_w_max_steps(cls, data):
+        if data.get("streaming") and not data.get("max_steps"):
+            raise ValueError(
+                "max_steps must be set when using streaming datasets. "
+                "Trainer cannot infer dataset length for iterable datasets."
+            )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_streaming_w_multiple_datasets(cls, data):
+        if (
+            data.get("streaming")
+            and data.get("sample_packing")
+            and data.get("datasets")
+            and len(data.get("datasets")) > 1
+        ):
+            raise NotImplementedError(
+                "Sample packing with multiple streaming datasets is not yet supported"
+            )
+        return data
+
 
 class ModelCompatibilityValidationMixin:
     """Validation methods for specific model compatibility."""
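Taken together, the new checks mean a streaming run must set max_steps, must not use val_set_size (use test_datasets instead), and may not combine sample packing with more than one streaming dataset. A hypothetical config that passes all four validators (values and dataset path are illustrative):

    cfg = {
        "streaming": True,
        "max_steps": 500,        # required: iterable datasets have no inferable length
        "sample_packing": True,  # allowed because only one streaming dataset is listed
        "datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}],
        # no "val_set_size" here; configure "test_datasets" for evaluation instead
    }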
@@ -475,7 +475,9 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
                 train_dataset.remove_columns(["length"]),
                 batch_sampler=sampler,
             )
-            data_loader_len = len(data_loader) * cfg.micro_batch_size // cfg.batch_size
+            data_loader_len = max(
+                1, len(data_loader) * cfg.micro_batch_size // cfg.batch_size
+            )
             LOG.debug(f"data_loader_len: {data_loader_len}")
             # FIXME: is there a bug here somewhere? the total num steps depends
             # on the agreed on value for sample_packing_eff_est
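The clamp above guards against very small packed datasets, where the floor division used to collapse the estimated loader length to zero and poison the total-step math downstream. A worked example with assumed sizes:

    len_data_loader, micro_batch_size, batch_size = 3, 1, 4           # assumed tiny packed run
    old = len_data_loader * micro_batch_size // batch_size            # 3 * 1 // 4 == 0
    new = max(1, len_data_loader * micro_batch_size // batch_size)    # clamped to 1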
@@ -547,6 +549,13 @@ def setup_deepspeed_env(cfg, stage=None):
     if stage == 3:
         os.environ["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = "true"
 
+    device_count = torch.cuda.device_count()
+    if device_count == 1:
+        os.environ.setdefault("WORLD_SIZE", "1")
+        os.environ.setdefault("LOCAL_RANK", "0")
+        os.environ.setdefault("MASTER_ADDR", "0.0.0.0")  # nosec B104
+        os.environ.setdefault("MASTER_PORT", "29500")
+
     # NOTE(djsaunde): The distribued state cannot be initialized prior to the
     # ACCELERATE_USE_DEEPSPEED assignment, but it must be initialized some time prior
     # to model load.
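The setdefault calls above only fill in the rendezvous variables that a launcher such as torchrun or accelerate would normally export, so a plain single-GPU python invocation gets working defaults while launcher-provided values are left untouched. A small rehearsal of that precedence (port numbers are arbitrary):

    import os

    os.environ["MASTER_PORT"] = "6000"             # pretend a launcher already set this
    os.environ.setdefault("MASTER_PORT", "29500")  # no effect: the existing value wins
    os.environ.setdefault("WORLD_SIZE", "1")       # applied only if nothing set it
    print(os.environ["MASTER_PORT"])               # -> 6000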
@@ -25,7 +25,7 @@ def min_cfg(temp_dir):
         "liger_rms_norm": True,
         "liger_glu_activation": True,
         "torch_compile": True,
-        "chat_template": "llama3",
+        "chat_template": "qwen3",
         "kd_trainer": True,
         "kd_ce_alpha": 0.1,
         "kd_alpha": 0.9,
139 tests/e2e/test_diffusion.py Normal file
@@ -0,0 +1,139 @@
+"""E2E smoke test for diffusion training plugin."""
+
+from axolotl.common.datasets import load_datasets
+from axolotl.train import train
+from axolotl.utils.config import normalize_config, validate_config
+from axolotl.utils.dict import DictDefault
+
+from tests.e2e.utils import check_model_output_exists
+
+
+class TestDiffusion:
+    """Test case for diffusion training plugin."""
+
+    def test_diffusion_smoke_test(self, temp_dir):
+        """
+        Smoke test for diffusion training to ensure the plugin loads and trains without
+        error.
+        """
+        cfg = DictDefault(
+            {
+                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                "tokenizer_type": "AutoTokenizer",
+                "trust_remote_code": True,
+                "sequence_len": 256,
+                "val_set_size": 0.1,
+                "special_tokens": {
+                    "pad_token": "<|endoftext|>",
+                },
+                "datasets": [
+                    {
+                        "path": "mhenrichsen/alpaca_2k_test",
+                        "type": "alpaca",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 3,
+                "micro_batch_size": 1,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.0001,
+                "optimizer": "adamw_torch",
+                "lr_scheduler": "cosine",
+                "bf16": True,
+                "save_safetensors": True,
+                "save_first_step": False,
+                "logging_steps": 1,
+                "eval_steps": 3,
+                # Diffusion-specific config
+                "plugins": ["axolotl.integrations.diffusion.DiffusionPlugin"],
+                "diffusion": {
+                    # sample generation
+                    "generate_samples": True,
+                    "generation_interval": 1,
+                    "num_generation_samples": 1,
+                    "generation_steps": 2,
+                    "generation_max_length": 32,
+                    "generation_temperature": 0.0,
+                    # training-specific
+                    "mask_token_id": 16,
+                    "eps": 1e-3,
+                    "importance_weighting": False,
+                },
+            }
+        )
+
+        cfg = validate_config(cfg)
+        normalize_config(cfg)
+        dataset_meta = load_datasets(cfg=cfg)
+
+        train(cfg=cfg, dataset_meta=dataset_meta)
+        check_model_output_exists(temp_dir, cfg)
+
+    def test_diffusion_sft_labels(self, temp_dir):
+        """Test that diffusion training properly handles SFT data with labels."""
+        cfg = DictDefault(
+            {
+                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                "tokenizer_type": "AutoTokenizer",
+                "trust_remote_code": True,
+                "sequence_len": 256,
+                "val_set_size": 0.1,
+                "special_tokens": {
+                    "pad_token": "<|endoftext|>",
+                },
+                "datasets": [
+                    {
+                        "path": "mhenrichsen/alpaca_2k_test",
+                        "type": "alpaca",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 3,
+                "micro_batch_size": 1,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.0001,
+                "optimizer": "adamw_torch",
+                "lr_scheduler": "cosine",
+                "bf16": True,
+                "save_safetensors": True,
+                "save_first_step": False,
+                "logging_steps": 1,
+                "eval_steps": 2,
+                # Diffusion-specific config
+                "plugins": ["axolotl.integrations.diffusion.DiffusionPlugin"],
+                "diffusion": {
+                    # sample generation
+                    "generate_samples": True,
+                    "generation_interval": 1,
+                    "num_generation_samples": 1,
+                    "generation_steps": 2,
+                    "generation_max_length": 32,
+                    "generation_temperature": 0.0,
+                    # training-specific
+                    "mask_token_id": 16,
+                    "eps": 1e-3,
+                    "importance_weighting": True,
+                },
+                # Ensure we have proper SFT labels
+                "train_on_inputs": False,
+            }
+        )
+
+        cfg = validate_config(cfg)
+        normalize_config(cfg)
+        dataset_meta = load_datasets(cfg=cfg)
+
+        # Verify that the dataset has labels
+        sample = dataset_meta.train_dataset[0]
+        assert "labels" in sample, "SFT dataset should have labels"
+
+        # Check that some labels are -100 (prompt tokens)
+        labels = sample["labels"]
+        if hasattr(labels, "tolist"):
+            labels = labels.tolist()
+        assert -100 in labels, "SFT dataset should have -100 labels for prompt tokens"
+
+        train(cfg=cfg, dataset_meta=dataset_meta)
+        check_model_output_exists(temp_dir, cfg)
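One way to run just the new smoke test locally, assuming a CUDA-capable machine with the diffusion plugin's optional dependencies installed (the node ID matches the class and method defined above):

    import pytest

    pytest.main(["-q", "tests/e2e/test_diffusion.py::TestDiffusion::test_diffusion_smoke_test"])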
Some files were not shown because too many files have changed in this diff.