fix reentrant when using offloading

Default include_tkps to true (#3134)
* default true
* force e2e
* causal trainer only
* fix eval logging [skip-ci]
* revert setup.py
* force tests
* guarding
* guarding
* fix test case
* use evaluate [skip-e2e]
* use evaluate [skip-e2e]
* kick off ci
* fixing
* reverting
2025-09-14 10:42:15 -04:00 · 2025-09-09 10:50:21 -04:00 · 2025-09-07 11:01:03 -04:00 · 2025-09-07 10:49:10 -04:00 · 2025-09-07 10:33:20 -04:00 · 2025-09-05 11:00:54 -04:00
76 changed files with 1975 additions and 1094 deletions
--- a/.coderabbit.yaml
+++ b/.coderabbit.yaml
@@ -12,6 +12,6 @@ reviews:
  auto_review:
    enabled: true
    drafts: false
-    auto_incremental_review: true
+    auto_incremental_review: false
 chat:
  auto_reply: true
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -36,6 +36,11 @@ jobs:
            python_version: "3.11"
            pytorch: 2.7.1
            axolotl_extras:
          - cuda: 128
            cuda_version: 12.8.1
            python_version: "3.11"
            pytorch: 2.8.0
            axolotl_extras:
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
@@ -110,6 +115,11 @@ jobs:
            python_version: "3.11"
            pytorch: 2.7.1
            axolotl_extras:
          - cuda: 128
            cuda_version: 12.8.1
            python_version: "3.11"
            pytorch: 2.8.0
            axolotl_extras:
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
@@ -169,6 +179,12 @@ jobs:
            pytorch: 2.7.1
            axolotl_extras: vllm
            is_latest: true
          - cuda: 128
            cuda_version: 12.8.1
            python_version: "3.11"
            pytorch: 2.8.0
            axolotl_extras:
            is_latest:
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -33,13 +33,6 @@ jobs:
            axolotl_extras:
            num_gpus: 2
            nightly_build: "true"
-          - cuda: 126
-            cuda_version: 12.6.3
-            python_version: "3.11"
-            pytorch: 2.7.0
-            axolotl_extras:
-            num_gpus: 2
-            nightly_build: "true"
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
@@ -47,6 +40,13 @@ jobs:
            axolotl_extras: vllm
            num_gpus: 2
            nightly_build: "true"
          - cuda: 128
            cuda_version: 12.8.1
            python_version: "3.11"
            pytorch: 2.8.0
            axolotl_extras:
            num_gpus: 2
            nightly_build: "true"
    runs-on: [self-hosted, modal]
    timeout-minutes: 120
    steps:
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -55,7 +55,7 @@ jobs:
      fail-fast: false
      matrix:
        python_version: ["3.11"]
-        pytorch_version: ["2.6.0", "2.7.0", "2.7.1"]
+        pytorch_version: ["2.6.0", "2.7.1", "2.8.0"]
    timeout-minutes: 20
    steps:
@@ -130,7 +130,7 @@ jobs:
      fail-fast: false
      matrix:
        python_version: ["3.11"]
-        pytorch_version: ["2.6.0", "2.7.0", "2.7.1"]
+        pytorch_version: ["2.6.0", "2.7.1", "2.8.0"]
    timeout-minutes: 20
    steps:
@@ -240,7 +240,7 @@ jobs:
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
-            pytorch: 2.6.0
+            pytorch: 2.7.1
            num_gpus: 1
            axolotl_extras:
            dockerfile: "Dockerfile-uv.jinja"
@@ -298,6 +298,13 @@ jobs:
            pytorch: 2.7.1
            num_gpus: 1
            axolotl_extras:
          - cuda: 128
            cuda_version: 12.8.1
            python_version: "3.11"
            pytorch: 2.8.0
            num_gpus: 1
            gpu_type: "B200"
            axolotl_extras:
    steps:
      - name: Checkout
        uses: actions/checkout@v4
@@ -318,6 +325,7 @@ jobs:
          echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
          echo "GPU_TYPE=${{ matrix.gpu_type || 'L40S'}}" >> $GITHUB_ENV
          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
          echo "E2E_DOCKERFILE=${{ matrix.dockerfile || 'Dockerfile.jinja'}}" >> $GITHUB_ENV
      - name: Run tests job on Modal
@@ -334,10 +342,10 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
-            pytorch: 2.6.0
+            pytorch: 2.7.1
            num_gpus: 1
            axolotl_extras:
    steps:
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:
    -   id: no-commit-to-branch
        args: ['--branch', 'main']
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.12.9
+    rev: v0.12.12
    hooks:
    -   id: ruff
        args: [--fix]
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@
    <br/>
    <a href="https://discord.com/invite/HhrNrHJPRb"><img src="https://img.shields.io/badge/discord-7289da.svg?style=flat-square&logo=discord" alt="discord" style="height: 20px;"></a>
    <a href="https://twitter.com/axolotl_ai"><img src="https://img.shields.io/twitter/follow/axolotl_ai?style=social" alt="twitter" style="height: 20px;"></a>
    <a href="https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="google-colab" style="height: 20px;"></a>
    <br/>
    <img src="https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/tests-nightly.yml/badge.svg" alt="tests-nightly">
    <img src="https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/multi-gpu-e2e.yml/badge.svg" alt="multigpu-semi-weekly tests">
@@ -70,6 +71,10 @@ Features:
 - Python 3.11
 - PyTorch ≥2.6.0
 ### Google Colab
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb#scrollTo=msOCO4NRmRLa)
 ### Installation
 #### Using pip
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -153,7 +153,7 @@ quartodoc:
        - utils.distributed
        - utils.dict
        - utils.optimizers.adopt
-        - utils.data.pretraining
+        - utils.data.streaming
        - utils.data.sft
        - utils.quantization
    - title: Schemas
@@ -272,6 +272,7 @@ website:
          contents:
            - docs/batch_vs_grad.qmd
            - docs/dataset_preprocessing.qmd
            - docs/streaming.qmd
            - docs/multipack.qmd
            - docs/mixed_precision.qmd
            - docs/optimizers.qmd
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -57,7 +57,8 @@ VOLUME_CONFIG = {
 }
 N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = f"L40S:{N_GPUS}"
+GPU_TYPE = os.environ.get("GPU_TYPE", "L40S")
+GPU_CONFIG = f"{GPU_TYPE}:{N_GPUS}"
 def run_cmd(cmd: str, run_folder: str):
--- a/codecov.yml
+++ b/codecov.yml
@@ -12,7 +12,7 @@ coverage:
      default:
        # basic
        target: auto
-        threshold: 0%
+        threshold: 1%
        base: auto
        # advanced
        branches: null
@@ -27,7 +27,7 @@ coverage:
      default:
        # basic
        target: auto
-        threshold: 0%
+        threshold: 1%
        base: auto
        # advanced
        branches: null
--- a/docs/installation.qmd
+++ b/docs/installation.qmd
@@ -134,7 +134,7 @@ For providers supporting Docker:
 ### Google Colab {#sec-colab}
-Use our [example notebook](../examples/colab-notebooks/colab-axolotl-example.ipynb).
+[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/axolotl-ai-cloud/axolotl/blob/main/examples/colab-notebooks/colab-axolotl-example.ipynb#scrollTo=msOCO4NRmRLa)
 ## Platform-Specific Instructions {#sec-platform-specific}
--- a/docs/multi-gpu.qmd
+++ b/docs/multi-gpu.qmd
@@ -63,15 +63,6 @@ Start from Stage 1 -> Stage 2 -> Stage 3.
 :::
-::: {.callout-tip}
-Using ZeRO Stage 3 with Single-GPU training
-ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
-`WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500`
-:::
 ## Fully Sharded Data Parallel (FSDP) {#sec-fsdp}
 ::: {.callout-note}
--- a/docs/reward_modelling.qmd
+++ b/docs/reward_modelling.qmd
@@ -11,6 +11,7 @@ We support the reward modelling techniques supported by `trl`.
 ### (Outcome) Reward Models
 Outcome reward models are trained using data which contains preference annotations for an entire interaction between the user and model (e.g. rather than per-turn or per-step).
 For improved training stability, you can use the `center_rewards_coefficient` parameter to encourage mean-zero reward outputs ([see TRL docs](https://huggingface.co/docs/trl/v0.10.1/en/reward_trainer#centering-rewards)).
 ```yaml
 base_model: google/gemma-2-2b
--- a/docs/streaming.qmd
+++ b/docs/streaming.qmd
@@ -0,0 +1,120 @@
 ---
 title: Streaming Datasets
 description: How to use streaming mode for large-scale datasets and memory-efficient training
 order: 10
 ---
 Streaming enables memory-efficient training with large datasets by loading data
 incrementally rather than loading the entire dataset into memory at once.
 Use streaming when:
 - Your dataset is too large to fit in memory (e.g. when you're doing pretraining with massive text corpora)
 - You want to start training immediately without preprocessing the entire dataset
 Streaming works with both remote and locally stored datasets!
 ::: {.callout-note}
 Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
 :::
 ## Configuration
 ### Basic Streaming
 Enable streaming mode by setting the `streaming` flag:
 ```yaml
 streaming: true
 ```
 ### Pretraining with Streaming
 For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:
 ```yaml
 pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train
 # Optionally, enable sample packing
 streaming_multipack_buffer_size: 10000
 sample_packing: true
 ```
 ### SFT with Streaming
 For supervised fine-tuning with streaming:
 ```yaml
 streaming: true
 datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train
 # Optionally, enable sample packing
 streaming_multipack_buffer_size: 10000
 sample_packing: true
 ```
 ## Configuration Options
 ### `streaming_multipack_buffer_size`
 Controls the buffer size for multipack streaming (default: 10,000). This determines how
 many samples are buffered before packing. Larger buffers can improve packing efficiency
 but use more memory.
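
To illustrate why a larger buffer tends to pack better, here is a minimal sketch of buffered packing. It is not Axolotl's implementation: the function name `pack_from_buffer`, the first-fit-decreasing strategy, and the choice to skip oversized samples are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def _first_fit_decreasing(lengths: List[int], sequence_len: int) -> List[List[int]]:
    """Group sample lengths into packs whose totals do not exceed sequence_len."""
    packs: List[List[int]] = []
    for length in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + length <= sequence_len:
                pack.append(length)
                break
        else:
            # No existing pack has room, so start a new one.
            packs.append([length])
    return packs


def pack_from_buffer(
    lengths: Iterable[int], sequence_len: int, buffer_size: int
) -> Iterator[List[int]]:
    """Hypothetical helper: pack a stream of sample lengths one buffer at a time."""
    buffer: List[int] = []
    for length in lengths:
        if length > sequence_len:
            continue  # assumption: samples longer than sequence_len are skipped
        buffer.append(length)
        if len(buffer) >= buffer_size:
            yield from _first_fit_decreasing(buffer, sequence_len)
            buffer = []
    if buffer:
        # Pack whatever is left once the stream ends.
        yield from _first_fit_decreasing(buffer, sequence_len)


if __name__ == "__main__":
    sample_lengths = [600, 500, 400, 300, 200]
    for size in (1, 5):
        packs = list(pack_from_buffer(sample_lengths, sequence_len=1000, buffer_size=size))
        print(f"buffer_size={size}: {len(packs)} packs")
```

With a buffer of one sample, nothing can be combined and every sample becomes its own padded sequence. With all five samples buffered, the packer has enough choice to fill two sequences completely, at the cost of holding more samples in memory.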
 ### `shuffle_merged_datasets`
 When enabled, the streaming dataset is shuffled through a shuffle buffer, which
 requires additional memory.
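
The memory cost comes from how buffered shuffling works on a stream: since the full dataset is never in memory, only a window of items can be randomized at a time. The sketch below shows the general technique in plain Python. The name `buffered_shuffle` and its parameters are hypothetical, and this is not necessarily how Axolotl or the `datasets` library implements it.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(10), buffer_size=4)))
```

Every item is emitted exactly once, but an item can only move as far as the buffer allows, so a small buffer gives weaker shuffling and a large one costs more memory.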
 ## Sample Packing with Streaming
 Sample packing is supported for streaming datasets. When enabled, multiple samples are
 packed into a single sequence to maximize GPU utilization:
 ```yaml
 sample_packing: true
 streaming_multipack_buffer_size: 10000
 # For SFT: attention is automatically isolated between packed samples
 # For pretraining: control with pretrain_multipack_attn
 pretrain_multipack_attn: true  # prevent cross-attention between packed samples
 ```
 For more information, see our [documentation](multipack.qmd) on multipacking.
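
As a rough illustration of what "preventing cross-attention between packed samples" means, the sketch below builds a block-diagonal causal mask from the lengths of the samples in one packed sequence. The helper names `cumulative_seqlens` and `packed_causal_mask` are hypothetical, and real implementations typically pass sequence boundaries to the attention kernel rather than materializing a dense mask.

```python
from typing import List


def cumulative_seqlens(lengths: List[int]) -> List[int]:
    """Return sample boundaries within the packed sequence, e.g. [2, 3] -> [0, 2, 5]."""
    bounds = [0]
    for n in lengths:
        bounds.append(bounds[-1] + n)
    return bounds


def packed_causal_mask(lengths: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    total = cumulative_seqlens(lengths)[-1]
    # Record which packed sample each position belongs to.
    sample_of: List[int] = []
    for sample_id, n in enumerate(lengths):
        sample_of.extend([sample_id] * n)
    return [
        [sample_of[q] == sample_of[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    # Two samples of length 2 and 3 packed into one sequence of length 5.
    for row in packed_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Positions in the second sample never attend to the first, so each packed sample is processed as if it were alone in the sequence.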
 ## Important Considerations
 ### Memory Usage
 While streaming reduces memory usage compared to loading entire datasets, some memory
 costs remain:
 - Sample packing buffers multiple samples; lower `streaming_multipack_buffer_size` to reduce this cost
 - Shuffling requires additional memory for the shuffle buffer
 ### Performance
 - Streaming may have slightly higher latency compared to preprocessed datasets, as samples are processed on-the-fly
 - Network speed and disk read speed are important when streaming from remote sources or a local dataset, respectively
 - Consider using `axolotl preprocess` for smaller or more frequently used datasets
 ### Evaluation Datasets
 Evaluation datasets are not streamed to ensure consistent evaluation metrics. They're
 loaded normally even when training uses streaming.
 ## Examples
 See the `examples/streaming/` directory for complete configuration examples:
 - `pretrain.yaml`: Pretraining with streaming dataset
 - `sft.yaml`: Supervised fine-tuning with streaming
--- a/examples/cloud/baseten.yaml
+++ b/examples/cloud/baseten.yaml
@@ -0,0 +1,10 @@
 provider: baseten
 project_name:
 secrets:
  - HF_TOKEN
  - WANDB_API_KEY
 gpu: h100
 gpu_count: 8
 node_count: 1
--- a/examples/colab-notebooks/colab-axolotl-example.ipynb
+++ b/examples/colab-notebooks/colab-axolotl-example.ipynb
@@ -40,7 +40,7 @@
    "%%capture\n",
    "# This step can take ~5-10 minutes to install dependencies\n",
    "!pip install --no-build-isolation axolotl[flash-attn]>=0.9.1\n",
-    "!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8\""
+    "!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5\""
   ]
  },
  {
--- a/examples/gemma3/270m-qlora.yml
+++ b/examples/gemma3/270m-qlora.yml
@@ -0,0 +1,68 @@
 base_model: google/gemma-3-270m-it
 # optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 # gemma3 doesn't seem to play nice with ddp
 ddp_find_unused_parameters: true
 load_in_8bit: false
 load_in_4bit: true
 # huggingface repo
 chat_template: gemma3
 eot_tokens:
  - <end_of_turn>
 datasets:
  - path: cgato/SlimOrcaDedupCleaned
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
 val_set_size: 0.0
 output_dir: ./outputs/out
 adapter: qlora
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
 sequence_len: 2048
 sample_packing: true
 eval_sample_packing: false
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 4
 micro_batch_size: 1
 num_epochs: 1
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002
 bf16: auto
 tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
 flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:
 saves_per_epoch: 1
 weight_decay: 0.0
 special_tokens:
--- a/examples/qwen3/reward-model.yaml
+++ b/examples/qwen3/reward-model.yaml
@@ -0,0 +1,44 @@
 base_model: Skywork/Skywork-Reward-V2-Qwen3-8B
 model_type: AutoModelForSequenceClassification
 num_labels: 1
 reward_model: true
 center_rewards_coefficient: 0.01  # Incentivize mean-zero rewards for improved stability
 chat_template: qwen3
 datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template
 val_set_size: 0.0
 output_dir: ./outputs/out
 sequence_len: 8192
 sample_packing: false
 eval_sample_packing: false
 pad_to_sequence_len: true
 deepspeed: deepspeed_configs/zero1.json
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 4
 micro_batch_size: 1
 eval_batch_size: 1
 num_epochs: 3
 optimizer: adamw_bnb_8bit
 lr_scheduler: linear
 learning_rate: 0.00002
 bf16: true
 tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 warmup_ratio: 0.1
 logging_steps: 1
 weight_decay: 0.01
--- a/examples/streaming/README.md
+++ b/examples/streaming/README.md
@@ -0,0 +1,50 @@
 # Streaming Dataset Examples
 This directory contains example configurations for using Axolotl's streaming dataset
 functionality, which enables memory-efficient training with large datasets.
 ## Examples
 Run the following examples with e.g. `axolotl train examples/streaming/sft.yaml`; no
 `axolotl preprocess` required!
 ### Pretraining (`pretrain.yaml`)
 Demonstrates streaming configuration for pretraining tasks using the fineweb-edu dataset
 with SmolLM2-135M.
 - Uses `pretraining_dataset` configuration for automatic streaming
 - Multipack attention control to prevent cross-attention between packed sequences
 - Buffer size configuration for memory management
 ### SFT (`sft.yaml`)
 Shows how to use streaming for supervised fine-tuning with the Alpaca dataset.
 - Explicit `streaming: true` flag for SFT datasets
 - Memory-efficient training on instruction datasets
 - Evaluation datasets are currently not streamed
 ## Key Configuration Options
 ### `streaming`
 - Enables streaming mode for standard datasets
 - Automatically enabled for `pretraining_dataset`
 ### `streaming_multipack_buffer_size`
 - Controls buffer size for sample packing (default: 10,000)
 - Larger values improve packing efficiency but use more memory
 - Adjust based on available memory
 ### `shuffle_merged_datasets`
 - Enables shuffling of streaming datasets
 - Requires additional memory for shuffle buffer
 ### `sample_packing`
 - Packs multiple samples into single sequences
 - Minimizes per-step padding tokens
 ## Performance Tips
 - Download small / frequently-used datasets locally for better performance
 - Larger buffer sizes improve packing efficiency
--- a/examples/streaming/pretrain.yaml
+++ b/examples/streaming/pretrain.yaml
@@ -0,0 +1,57 @@
 base_model: HuggingFaceTB/SmolLM2-135M
 # Streaming pretraining configuration
 pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    name: sample-10BT
    type: pretrain
    text_column: text
    split: train
 # Streaming-specific settings
 streaming_multipack_buffer_size: 10000
 shuffle_merged_datasets: true
 # Training configuration
 max_steps: 1000
 output_dir: ./outputs/smollm2-135m-pretrain-streaming
 # Sequence and packing settings
 sequence_len: 1024
 sample_packing: true
 pretrain_multipack_attn: true  # Prevent cross-attention between packed sequences
 flash_attention: true
 # Batch size settings
 gradient_accumulation_steps: 8
 micro_batch_size: 1
 # Optimizer and scheduler
 optimizer: adamw_torch
 lr_scheduler: cosine
 learning_rate: 5e-4
 warmup_ratio: 0.1
 weight_decay: 0.01
 # Precision and performance
 bf16: auto
 tf32: true
 # Logging and checkpointing
 logging_steps: 10
 save_strategy: steps
 save_steps: 250
 save_total_limit: 3
 # Weights & Biases (optional)
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 # Special tokens
 special_tokens:
  pad_token: "<|endoftext|>"
 # save_first_step: true  # uncomment this to validate checkpoint saving works with your config
--- a/examples/streaming/sft.yaml
+++ b/examples/streaming/sft.yaml
@@ -0,0 +1,55 @@
 base_model: HuggingFaceTB/SmolLM2-135M
 # Dataset configuration
 datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train
 # Streaming-specific settings
 streaming: true
 streaming_multipack_buffer_size: 10000
 shuffle_merged_datasets: true
 # Training configuration
 max_steps: 1000
 output_dir: ./outputs/smollm2-135m-sft-streaming
 # Sequence and packing settings
 sequence_len: 1024
 sample_packing: true
 flash_attention: true
 # Batch size settings
 gradient_accumulation_steps: 4
 micro_batch_size: 1
 # Optimizer and scheduler
 optimizer: adamw_torch
 lr_scheduler: cosine
 learning_rate: 2e-4
 warmup_ratio: 0.1
 weight_decay: 0.0
 # Precision and performance
 bf16: auto
 tf32: true
 # Logging and checkpointing
 logging_steps: 10
 save_strategy: steps
 save_steps: 100
 save_total_limit: 3
 # Weights & Biases (optional)
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 # Special tokens
 special_tokens:
  pad_token: "<|endoftext|>"
 # save_first_step: true  # uncomment this to validate checkpoint saving works with your config
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,8 +2,7 @@
 # START section of dependencies that don't install on Darwin/MacOS
 bitsandbytes==0.47.0
-# triton 3.4.0 is not compatible with CCE
+triton>=3.0.0
-triton>=3.0.0,<3.4.0
 mamba-ssm==1.2.0.post1
 xformers>=0.0.23.post1
 autoawq==0.2.7.post3
@@ -14,7 +13,7 @@ packaging==23.2
 huggingface_hub>=0.33.0
 peft>=0.17.0
-transformers==4.55.3
+transformers==4.56.1
 tokenizers>=0.21.1
 accelerate==1.10.0
 datasets==4.0.0
--- a/scripts/cutcrossentropy_install.py
+++ b/scripts/cutcrossentropy_install.py
@@ -29,5 +29,5 @@ UV_PREFIX = "uv " if USE_UV else ""
 print(
    UNINSTALL_PREFIX
-    + f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"'
+    + f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"'
 )
--- a/setup.py
+++ b/setup.py
@@ -64,7 +64,9 @@ def parse_requirements(extras_require_map):
            else:
                raise ValueError("Invalid version format")
-            if (major, minor) >= (2, 7):
+            if (major, minor) >= (2, 8):
                pass
            elif (major, minor) >= (2, 7):
                _install_requires.pop(_install_requires.index(xformers_version))
                if patch == 0:
                    _install_requires.append("xformers==0.0.30")
@@ -125,7 +127,7 @@ extras_require = {
        "yunchang==0.6.0",
    ],
    "deepspeed": [
-        "deepspeed==0.17.2",
+        "deepspeed==0.17.5",
        "deepspeed-kernels",
    ],
    "mamba-ssm": [
--- a/src/axolotl/cli/args.py
+++ b/src/axolotl/cli/args.py
@@ -14,9 +14,13 @@ class PreprocessCliArgs:
    prompter: Optional[str] = field(default=None)
    download: Optional[bool] = field(default=True)
    iterable: Optional[bool] = field(
-        default=None,
+        default=False,
        metadata={
-            "help": "Use IterableDataset for streaming processing of large datasets"
+            "help": (
                "Deprecated in v0.13.0, will be removed in v0.14.0. For streaming "
                "datasets, use 'axolotl train' and set 'streaming: true' in your YAML "
                "config, or pass --streaming instead in the CLI."
            )
        },
    )
--- a/src/axolotl/cli/cloud/init.py
+++ b/src/axolotl/cli/cloud/init.py
@@ -7,6 +7,8 @@ from typing import Literal
 import yaml
 from axolotl.cli.cloud.base import Cloud
 from axolotl.cli.cloud.baseten import BasetenCloud
 from axolotl.cli.cloud.modal_ import ModalCloud
 from axolotl.utils.dict import DictDefault
@@ -38,8 +40,15 @@ def do_cli_train(
    cwd=None,
    **kwargs,
 ) -> None:
-    cloud_cfg = load_cloud_cfg(cloud_config)
+    cloud_cfg: DictDefault = load_cloud_cfg(cloud_config)
-    cloud = ModalCloud(cloud_cfg)
+    provider = cloud_cfg.provider or "modal"
    cloud: Cloud | None
    if provider == "modal":
        cloud = ModalCloud(cloud_cfg)
    elif provider == "baseten":
        cloud = BasetenCloud(cloud_cfg.to_dict())
    else:
        raise ValueError(f"Unsupported cloud provider: {provider}")
    with open(config, "r", encoding="utf-8") as file:
        config_yaml = file.read()
    local_dirs = {}
--- a/src/axolotl/cli/cloud/baseten/init.py
+++ b/src/axolotl/cli/cloud/baseten/init.py
@@ -0,0 +1,48 @@
 """Baseten Cloud CLI"""
 import shutil
 import subprocess  # nosec B404
 import tempfile
 from os.path import dirname
 from typing import Literal
 import yaml
 from axolotl.cli.cloud.base import Cloud
 class BasetenCloud(Cloud):
    """Baseten Cloud Axolotl CLI"""
    def __init__(self, config: dict):
        self.config = config
    def preprocess(self, config_yaml: str, *args, **kwargs) -> None:
        raise NotImplementedError(
            "Separate preprocess function for Baseten is not "
             "implemented and will happen during the train step."
        )
    def train(
        self,
        config_yaml: str,
        launcher: Literal["accelerate", "torchrun", "python"] = "accelerate",
        launcher_args: list[str] | None = None,
        local_dirs: dict[str, str] | None = None,  # pylint: disable=unused-argument
        **kwargs,
    ):
        with tempfile.TemporaryDirectory() as tmp_dir:
            config = self.config.copy()
            config["launcher"] = launcher
            config["launcher_args"] = launcher_args
            with open(tmp_dir + "/cloud.yaml", "w", encoding="utf-8") as cloud_fout:
                yaml.dump(config, cloud_fout)
            with open(tmp_dir + "/train.yaml", "w", encoding="utf-8") as config_fout:
                config_fout.write(config_yaml)
            shutil.copyfile(dirname(__file__) + "/template/run.sh", tmp_dir + "/run.sh")
            shutil.copyfile(
                dirname(__file__) + "/template/train_sft.py", tmp_dir + "/train_sft.py"
            )
            subprocess.run(  # nosec B603 B607
                ["truss", "train", "push", "train_sft.py"], cwd=tmp_dir, check=False
            )
--- a/src/axolotl/cli/cloud/baseten/template/run.sh
+++ b/src/axolotl/cli/cloud/baseten/template/run.sh
@@ -0,0 +1,9 @@
 #!/bin/bash
 set -eux
 export NCCL_SOCKET_IFNAME="^docker0,lo"
 export NCCL_IB_DISABLE=0
 export NCCL_TIMEOUT=1800000
 axolotl preprocess train.yaml
 axolotl train train.yaml --launcher ${AXOLOTL_LAUNCHER} ${AXOLOTL_LAUNCHER_ARGS}
--- a/src/axolotl/cli/cloud/baseten/template/train_sft.py
+++ b/src/axolotl/cli/cloud/baseten/template/train_sft.py
@@ -0,0 +1,71 @@
 """
 Baseten Training Script for Axolotl
 """
 # pylint: skip-file
 import yaml
 from truss.base import truss_config
 # Import necessary classes from the Baseten Training SDK
 from truss_train import definitions
 cloud_config = yaml.safe_load(open("cloud.yaml", "r"))
 gpu = cloud_config.get("gpu", "h100")
 gpu_count = int(cloud_config.get("gpu_count", 1))
 node_count = int(cloud_config.get("node_count", 1))
 project_name = cloud_config.get("project_name", "axolotl-project") or "axolotl-project"
 secrets = cloud_config.get("secrets", [])
 launcher = cloud_config.get("launcher", "accelerate")
 launcher_args = cloud_config.get("launcher_args", [])
 script_name = "run.sh"
 launcher_args_str = ""
 if launcher_args:
    launcher_args_str = "-- " + " ".join(launcher_args)
 # 1. Define a base image for your training job
 # must use torch 2.7.0 for vllm
 BASE_IMAGE = "axolotlai/axolotl:main-py3.11-cu126-2.7.1"
 # 2. Define the Runtime Environment for the Training Job
 # This includes start commands and environment variables.
 # Secrets from the baseten workspace like API keys are referenced using
 # `SecretReference`.
 env_vars = {
    "AXOLOTL_LAUNCHER": launcher,
    "AXOLOTL_LAUNCHER_ARGS": launcher_args_str,
 }
 for secret_name in secrets:
    env_vars[secret_name] = definitions.SecretReference(name=secret_name)
 training_runtime = definitions.Runtime(
    start_commands=[  # Example: list of commands to run your training script
        f"/bin/sh -c 'chmod +x ./{script_name} && ./{script_name}'"
    ],
    environment_variables=env_vars,
 )
 # 3. Define the Compute Resources for the Training Job
 training_compute = definitions.Compute(
    node_count=node_count,
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,
        count=gpu_count,
    ),
 )
 # 4. Define the Training Job
 # This brings together the image, compute, and runtime configurations.
 my_training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
 )
 # This config will be pushed using the Truss CLI.
 # The association of the job to the project happens at the time of push.
 first_project_with_job = definitions.TrainingProject(
    name=project_name, job=my_training_job
 )
--- a/src/axolotl/cli/inference.py
+++ b/src/axolotl/cli/inference.py
@@ -14,10 +14,7 @@ from transformers import GenerationConfig, TextIteratorStreamer, TextStreamer
 from axolotl.cli.args import InferenceCliArgs
 from axolotl.cli.config import load_cfg
 from axolotl.cli.utils import load_model_and_tokenizer
-from axolotl.utils.chat_templates import (
+from axolotl.utils.chat_templates import get_chat_template_from_config
-    get_chat_template,
-    get_chat_template_from_config,
-)
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
@@ -64,7 +61,9 @@ def do_inference(
            importlib.import_module("axolotl.prompters"), prompter
        )
    elif cfg.chat_template:
-        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
+        chat_template_str = get_chat_template_from_config(
            cfg, ds_cfg=None, tokenizer=tokenizer
        )
    elif cfg.datasets[0].type == "chat_template":
        chat_template_str = get_chat_template_from_config(
            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
@@ -159,7 +158,13 @@ def do_inference_gradio(
            importlib.import_module("axolotl.prompters"), prompter
        )
    elif cfg.chat_template:
-        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
+        chat_template_str = get_chat_template_from_config(
            cfg, ds_cfg=None, tokenizer=tokenizer
        )
    elif cfg.datasets[0].type == "chat_template":
        chat_template_str = get_chat_template_from_config(
            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
        )
    model = model.to(cfg.device, dtype=cfg.torch_dtype)
--- a/src/axolotl/cli/merge_lora.py
+++ b/src/axolotl/cli/merge_lora.py
@@ -43,7 +43,10 @@ def do_merge_lora(*, cfg: DictDefault) -> None:
            safe_serialization=safe_serialization,
            progressbar=True,
        )
-        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
+        tokenizer.save_pretrained(
            str(Path(cfg.output_dir) / "merged"),
            save_jinja_files=cfg.tokenizer_save_jinja_files,
        )
        if processor:
            processor.save_pretrained(str(Path(cfg.output_dir) / "merged"))
--- a/src/axolotl/cli/preprocess.py
+++ b/src/axolotl/cli/preprocess.py
@@ -35,10 +35,20 @@ def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
    check_accelerate_default_config()
    check_user_token()
    if cli_args.iterable:
        LOG.error(
            "The --iterable CLI argument for 'axolotl preprocess' is no longer "
            "supported. For training, set 'streaming: true' in your YAML config or "
            "pass '--streaming' in your 'axolotl train' command for on-the-fly "
            "preprocessing."
        )
        return
    for key in ["skip_prepare_dataset", "pretraining_dataset"]:
        if cfg.get(key):
            LOG.error(
-                f"You have set `{key}:`. `preprocess` is not needed. Run the `axolotl train` CLI directly instead."
+                f"You have set `{key}:`. `preprocess` is not needed. Run the 'axolotl "
                "train' CLI directly instead."
            )
            return
--- a/src/axolotl/cli/quantize.py
+++ b/src/axolotl/cli/quantize.py
@@ -84,5 +84,6 @@ def do_quantize(
        str(Path(output_dir) / "quantized"),
        safe_serialization=False,
        progressbar=True,
        save_jinja_files=cfg.tokenizer_save_jinja_files,
    )
    LOG.info(f"Quantized model saved to: {str(Path(output_dir) / 'quantized')}...")
--- a/src/axolotl/common/datasets.py
+++ b/src/axolotl/common/datasets.py
@@ -55,13 +55,11 @@ def load_datasets(
    """
    tokenizer = load_tokenizer(cfg)
    processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
-    preprocess_iterable = getattr(cli_args, "iterable", False)
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_datasets(
        cfg,
        tokenizer,
        processor=processor,
-        preprocess_iterable=preprocess_iterable,
    )
    if (
--- a/src/axolotl/core/builders/base.py
+++ b/src/axolotl/core/builders/base.py
@@ -24,9 +24,7 @@ from pathlib import Path
 from typing import Any
 import torch
-from transformers import (
-    TrainerCallback,
-)
+from transformers import TrainerCallback
 from transformers.trainer_pt_utils import AcceleratorConfig
 from axolotl.integrations.base import PluginManager
@@ -512,6 +510,7 @@ class TrainerBuilderBase(abc.ABC):
                self.cfg.eval_batch_size
            )
        training_args_kwargs["include_tkps"] = self.cfg.include_tkps
        training_args_kwargs["max_steps"] = self.cfg.max_steps or total_num_steps or -1
        training_args_kwargs["num_train_epochs"] = self.cfg.num_epochs
--- a/src/axolotl/core/builders/causal.py
+++ b/src/axolotl/core/builders/causal.py
@@ -7,10 +7,7 @@ from pathlib import Path
 from typing import Type, Union
 import transformers
-from transformers import (
-    DataCollatorWithFlattening,
-    EarlyStoppingCallback,
-)
+from transformers import DataCollatorWithFlattening, EarlyStoppingCallback
 from trl.trainer.utils import RewardDataCollatorWithPadding
 from axolotl.core.builders.base import TrainerBuilderBase
@@ -26,12 +23,12 @@ from axolotl.monkeypatch.relora import ReLoRACallback
 from axolotl.processing_strategies import get_processing_strategy
 from axolotl.utils import is_comet_available, is_mlflow_available
 from axolotl.utils.callbacks import (
    LossWatchDogCallback,
    SaveBetterTransformerModelCallback,
    bench_eval_callback_factory,
    causal_lm_bench_eval_callback_factory,
    colab_inference_post_train_callback,
    log_prediction_callback_factory,
    LossWatchDogCallback,
    SaveBetterTransformerModelCallback,
 )
 from axolotl.utils.callbacks.lisa import lisa_callback_factory
 from axolotl.utils.callbacks.qat import QATCallback
@@ -42,6 +39,7 @@ from axolotl.utils.collators import (
    MambaDataCollator,
    V2BatchSamplerDataCollatorForSeq2Seq,
 )
 from axolotl.utils.callbacks.tokens_per_second import TokensPerSecondCallback
 from axolotl.utils.collators.mm_chat import MultiModalChatDataCollator
 from axolotl.utils.import_helper import get_cls_from_module_str
 from axolotl.utils.logging import get_logger
@@ -74,6 +72,12 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.qat:
            callbacks.append(QATCallback(self.cfg.qat))
        if self.cfg.include_tkps:
            callbacks.append(
                TokensPerSecondCallback(
                    self.cfg.tensor_parallel_size, self.cfg.context_parallel_size
                )
            )
        return callbacks
    def get_post_trainer_create_callbacks(self, trainer):
@@ -340,6 +344,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.reward_model:
            training_args_cls = AxolotlRewardConfig
            if self.cfg.center_rewards_coefficient is not None:
                training_arguments_kwargs["center_rewards_coefficient"] = (
                    self.cfg.center_rewards_coefficient
                )
        elif self.cfg.process_reward_model:
            training_args_cls = AxolotlPRMConfig
        else:
@@ -404,6 +412,9 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
            **trainer_kwargs,
        )
        trainer = self.hook_post_create_trainer(trainer)
        # if the trainer has the `axolotl_cfg` property, set it
        if hasattr(trainer, "axolotl_cfg"):
            trainer.axolotl_cfg = self.cfg
        for callback in self.get_post_trainer_create_callbacks(trainer):
            trainer.add_callback(callback)
--- a/src/axolotl/core/trainers/base.py
+++ b/src/axolotl/core/trainers/base.py
@@ -42,6 +42,7 @@ from axolotl.core.trainers.utils import (
 )
 from axolotl.utils import get_not_null
 from axolotl.utils.bench import get_gpu_memory_usage
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import is_main_process
 from axolotl.utils.logging import get_logger
 from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
@@ -63,6 +64,15 @@ class AxolotlTrainer(
    args = None  # type: "AxolotlTrainingArguments"  # type: ignore[name-defined]
    tag_names = ["axolotl"]
    _axolotl_cfg: DictDefault | None = None
    @property
    def axolotl_cfg(self):
        return self._axolotl_cfg
    @axolotl_cfg.setter
    def axolotl_cfg(self, cfg):
        self._axolotl_cfg = cfg
    def __init__(
        self,
@@ -78,7 +88,6 @@ class AxolotlTrainer(
        self._signature_columns = None  # workaround for pylint
        super().__init__(*_args, **kwargs)
        self.train_data_collator = self.data_collator
        self._stored_metrics = defaultdict(lambda: defaultdict(list))
        if self.args.orpo_alpha:
@@ -327,6 +336,17 @@ class AxolotlTrainer(
        #     outputs = model(**inputs)
        #     loss = trainer_weighted_loss(outputs, labels, shift_labels=True)
        #     return (loss, outputs) if return_outputs else loss
        # track number of tokens for tokens per second calculation
        if self.args.include_tkps:
            inputs_key = "labels" if "labels" in inputs else "input_ids"
            if hasattr(self.state, "num_tokens"):
                self.state.num_tokens = (
                    self.state.num_tokens + (inputs[inputs_key] != -100).sum().cpu()
                )
            else:
                self.state.num_tokens = (inputs[inputs_key] != -100).sum().cpu()
        if self.args.orpo_alpha:
            return self.orpo_compute_loss(
                model,
@@ -526,9 +546,6 @@ class AxolotlTrainer(
        super().create_accelerator_and_postprocess()
        # now we need to put parallelism_config back on the PartialState since we rely on that info in other places
        # PartialState().parallelism_config = self.accelerator.state.parallelism_config
        if self.is_fsdp_enabled:
            if (
                "limit_all_gathers" in self.args.fsdp_config
@@ -576,12 +593,19 @@ class AxolotlTrainer(
            # Add memory usage
            try:
                active, allocated, reserved = get_gpu_memory_usage()
-                logs["memory/max_mem_active(gib)"] = round(active, 2)
+                logs["memory/max_active (GiB)"] = round(active, 2)
-                logs["memory/max_mem_allocated(gib)"] = round(allocated, 2)
+                logs["memory/max_allocated (GiB)"] = round(allocated, 2)
-                logs["memory/device_mem_reserved(gib)"] = round(reserved, 2)
+                logs["memory/device_reserved (GiB)"] = round(reserved, 2)
            except (ValueError, TypeError, FileNotFoundError):
                pass
        if self.args.include_tkps and train_eval == "train":
            # each rank will log its own tokens per second
            # for logging_steps > 1 we obtain a moving average of this metric
            logs["tokens_per_second_per_gpu"] = round(
                self.state.last_tokens_per_second.item() / self.args.logging_steps, 2
            )
        del self._stored_metrics[train_eval]
        return super().log(logs, start_time)
@@ -657,6 +681,11 @@ class AxolotlTrainer(
                LOG.info(
                    "Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`"
                )
-                self.data_collator.tokenizer.save_pretrained(output_dir)
+                save_jinja_files = True
+                if self.axolotl_cfg:
+                    save_jinja_files = self.axolotl_cfg.tokenizer_save_jinja_files
+                self.data_collator.tokenizer.save_pretrained(
+                    output_dir, save_jinja_files=save_jinja_files
+                )
            # Good practice: save your training arguments together with the trained model
            torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
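Note on the token accounting in `compute_loss` above: the hunk counts positions whose value is not `-100`, preferring `labels` and falling back to `input_ids`, and keeps a running total on the trainer state. A minimal pure-Python sketch of that bookkeeping follows. It assumes plain nested lists in place of tensors, and `TokenCounter` and `count_trainable_tokens` are hypothetical names used only for illustration, not part of axolotl.

```python
IGNORE_INDEX = -100  # value the diff treats as "not a counted token"


def count_trainable_tokens(batch: dict) -> int:
    """Count positions whose value differs from IGNORE_INDEX.

    Same key choice as the diff: "labels" when present, else "input_ids".
    """
    key = "labels" if "labels" in batch else "input_ids"
    return sum(tok != IGNORE_INDEX for row in batch[key] for tok in row)


class TokenCounter:
    """Hypothetical stand-in for the running total kept on trainer state."""

    def __init__(self) -> None:
        self.num_tokens = 0

    def update(self, batch: dict) -> int:
        # Accumulate across steps, as the diff does with state.num_tokens.
        self.num_tokens += count_trainable_tokens(batch)
        return self.num_tokens
```

With `labels` present, masked positions are excluded, so the count reflects supervised tokens rather than sequence length; without `labels`, every `input_ids` position is counted unless it happens to equal `-100`.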
--- a/src/axolotl/core/trainers/mixins/activation_checkpointing.py
+++ b/src/axolotl/core/trainers/mixins/activation_checkpointing.py
@@ -3,11 +3,14 @@ Trainer mixin for activation checkpointing w offloading
 """
 import contextlib
 from functools import partial
 from peft import PeftModel
 from torch import nn
 from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
    CheckpointImpl,
 )
 from torch.distributed.fsdp.wrap import ModuleWrapPolicy
 from transformers import GradientCheckpointingLayer, Trainer
@@ -46,9 +49,20 @@ class ActivationOffloadingMixin(Trainer):
            return super().training_step(*args, **kwargs)
-def ac_wrap_hf_model(model: nn.Module, **kwargs):
+def ac_wrap_hf_model(model: nn.Module, use_reentrant=None, **kwargs):
    auto_wrap_policy = ModuleWrapPolicy(set((GradientCheckpointingLayer,)))
-    apply_activation_checkpointing(model, auto_wrap_policy=auto_wrap_policy, **kwargs)
+    if use_reentrant:
+        checkpoint_wrapper_fn = partial(
+            checkpoint_wrapper, checkpoint_impl=CheckpointImpl.REENTRANT
+        )
+    else:
+        checkpoint_wrapper_fn = checkpoint_wrapper
+    apply_activation_checkpointing(
+        model,
+        checkpoint_wrapper_fn=checkpoint_wrapper_fn,
+        auto_wrap_policy=auto_wrap_policy,
+        **kwargs,
+    )
 def get_lora_act_offloading_ctx_manager(
--- a/src/axolotl/core/training_args_base.py
+++ b/src/axolotl/core/training_args_base.py
@@ -49,6 +49,12 @@ class AxolotlTrainingMixins:
        default=False,
        metadata={"help": "Use real batches for efficient training."},
    )
    include_tkps: bool = field(
        default=True,
        metadata={
            "help": "Whether to include tokens per second in the training metrics."
        },
    )
    eval_sample_packing: Optional[bool] = field(
        default=None,
        metadata={"help": "Use sample packing for efficient evals."},
--- a/src/axolotl/datasets.py
+++ b/src/axolotl/datasets.py
@@ -1,18 +1,17 @@
-"""Module containing Dataset functionality"""
+"""
 Module containing dataset functionality.
 We want this to be a wrapper for an existing dataset that we have loaded. Lets use the
 concept of middlewares to wrap each dataset. We'll use the collators later on to pad the
 datasets.
 """
 import torch
 from datasets import Dataset, IterableDataset
 from axolotl.utils.logging import get_logger
 from .prompt_tokenizers import PromptTokenizingStrategy
 # We want this to be a wrapper for an existing dataset that we have loaded
 # lets use the concept of middlewares to wrap each dataset, for example
 # ConstantLengthDataset(ShuffledDataset([TokenizedPromptDataset(alpaca_dataset)]))
 # let's check to ensure we don't truncate an item in the middle, we'll use
 # the collators later on to pad the datasets
 LOG = get_logger(__name__)
@@ -86,133 +85,3 @@ def wrap_dataset_for_tokenized_prompt(
            **map_kwargs,
        )
    return TokenizedPromptDataset(prompt_tokenizer, dataset, **kwargs)
 # TODO this isn't the best since it can't interleave datasets
 class ConstantLengthDataset(IterableDataset):
    """Iterable dataset that returns constant length chunks of tokens from stream of
    text files.
    Args:
        tokenizer: The processor used for processing the data.
        dataset: Dataset with text files.
        seq_length: Length of token sequences to return.
    """
    def __init__(
        self,
        tokenizer,
        datasets,
        seq_length=2048,
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.datasets: list[IterableDataset] = datasets
        self.seq_length = seq_length
        vocab_size = len(tokenizer.get_vocab())
        if vocab_size <= torch.iinfo(torch.int16).max:
            self.tokens_dtype = torch.int16
        elif vocab_size <= torch.iinfo(torch.int32).max:
            self.tokens_dtype = torch.int32
        else:
            self.tokens_dtype = torch.int64
    def __iter__(self):
        buffer = {
            "input_ids": [],
            "attention_mask": [],
            "labels": [],
            "position_ids": [],
        }
        buffer_len = 0
        for dataset in self.datasets:
            idx = 0
            iterator = iter(dataset)
            more_examples = True
            while more_examples:
                try:
                    example = next(iterator)
                    idx += 1
                except StopIteration:
                    more_examples = False
                    example = None
                add_concat_token = False
                if example:
                    example_len = len(example["input_ids"])
                    add_concat_token = example["input_ids"][-1] != self.concat_token_id
                else:
                    example_len = 0
                if not example_len or (
                    buffer_len + int(add_concat_token) + example_len > self.seq_length
                ):
                    if buffer["input_ids"]:
                        input_ids = torch.cat(buffer["input_ids"], dim=-1)[
                            : self.seq_length
                        ]
                        attention_mask = torch.cat(buffer["attention_mask"], dim=-1)[
                            : self.seq_length
                        ]
                        position_ids = torch.cat(buffer["position_ids"], dim=-1)[
                            : self.seq_length
                        ]
                        labels = torch.cat(buffer["labels"], dim=-1)[: self.seq_length]
                        if labels.size() == input_ids.size() and (
                            attention_mask.size() == input_ids.size()
                        ):
                            yield {
                                "input_ids": input_ids,
                                "labels": labels,
                                "attention_mask": attention_mask,
                                "position_ids": position_ids,
                            }
                        else:
                            LOG.warning(
                                "Dropping batch due to tensor size mismatch "
                                f"input_ids: {input_ids.size()}, "
                                f"labels: {labels.size()}, "
                                f"attention_mask: {attention_mask.size()}"
                            )
                    buffer = {
                        "input_ids": [],
                        "attention_mask": [],
                        "labels": [],
                        "position_ids": [],
                    }
                    buffer_len = 0
                    idx = 1
                if example:
                    # FIXME
                    # just going to drop data points that are too long
                    if len(example["input_ids"]) <= self.seq_length:
                        input_ids = example["input_ids"]
                        attention_mask = example["attention_mask"]
                        labels = example["labels"]
                        if add_concat_token:
                            input_ids.append(self.concat_token_id)
                            attention_mask.append(1)
                            labels.append(self.concat_token_id)
                        input_ids_with_concat = torch.tensor(
                            input_ids, dtype=self.tokens_dtype
                        )
                        attention_mask_with_concat = torch.tensor(
                            [idx * m for m in attention_mask], dtype=torch.int16
                        )
                        labels_with_concat = torch.tensor(
                            labels, dtype=self.tokens_dtype
                        )
                        position_ids = torch.arange(
                            len(input_ids), dtype=self.tokens_dtype
                        )
                        buffer["input_ids"].append(input_ids_with_concat)
                        buffer["attention_mask"].append(attention_mask_with_concat)
                        buffer["labels"].append(labels_with_concat)
                        buffer["position_ids"].append(position_ids)
                        buffer_len += len(input_ids)
--- a/src/axolotl/integrations/cut_cross_entropy/README.md
+++ b/src/axolotl/integrations/cut_cross_entropy/README.md
@@ -19,7 +19,7 @@ python scripts/cutcrossentropy_install.py | sh
 - If you are installing from pip
 ```bash
-pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"
+pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"
 ```
 ## Usage
--- a/src/axolotl/integrations/cut_cross_entropy/init.py
+++ b/src/axolotl/integrations/cut_cross_entropy/init.py
@@ -35,7 +35,7 @@ LOG = get_logger(__name__)
 _CCE_INSTALL_MESSAGE = (
    "Please install Axolotl's fork of cut_cross_entropy with transformers support using "
-    '`pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@0ee9ee8"`'
+    '`pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@c6a32c5"`'
 )
--- a/src/axolotl/loaders/adapter.py
+++ b/src/axolotl/loaders/adapter.py
@@ -98,6 +98,8 @@ def load_lora(
        lora_config_kwargs["use_rslora"] = cfg.peft_use_rslora
    if cfg.peft_layer_replication:
        lora_config_kwargs["layer_replication"] = cfg.peft_layer_replication
    if cfg.peft_trainable_token_indices:
        lora_config_kwargs["trainable_token_indices"] = cfg.peft_trainable_token_indices
    lora_config = LoraConfig(
        r=cfg.lora_r,
--- a/src/axolotl/loaders/model.py
+++ b/src/axolotl/loaders/model.py
@@ -224,21 +224,27 @@ class ModelLoader:
        ):
            self.model = self.model.merge_and_unload()
-        self._apply_activation_checkpointing()
+        use_reentrant = None
        if (
            self.cfg.gradient_checkpointing_kwargs
            and self.cfg.gradient_checkpointing_kwargs.get("use_reentrant", True)
        ):
            use_reentrant = True
        self._apply_activation_checkpointing(use_reentrant=use_reentrant)
        self._resize_token_embeddings()
        self._adjust_model_config()
        self._configure_embedding_dtypes()
        self._configure_qat()
        log_gpu_memory_usage(LOG, "Memory usage after model load", 0)
-    def _apply_activation_checkpointing(self):
+    def _apply_activation_checkpointing(self, use_reentrant: bool | None = None):
        if self.cfg.activation_offloading is True:
            from axolotl.core.trainers.mixins.activation_checkpointing import (
                ac_wrap_hf_model,
            )
            # ^^ importing this at the module level breaks plugins
-            ac_wrap_hf_model(self.model)
+            ac_wrap_hf_model(self.model, use_reentrant=use_reentrant)
    def _resize_token_embeddings(self):
        """Resize token embeddings if needed."""
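Note on how `use_reentrant` is resolved in this file and consumed by `ac_wrap_hf_model`: the value is tri-state in name only, since the code produces either `True` or `None`, and `None` selects the unmodified wrapper. The sketch below reproduces that decision with stand-ins so it runs without torch. `Impl`, `fake_checkpoint_wrapper`, `resolve_use_reentrant`, and `pick_wrapper_fn` are hypothetical names, and the stand-in's non-reentrant default is an assumption about `checkpoint_wrapper`, not something this diff shows.

```python
from enum import Enum
from functools import partial
from typing import Any, Callable, Optional


class Impl(Enum):
    """Stand-in for torch's CheckpointImpl."""

    REENTRANT = "reentrant"
    NO_REENTRANT = "no_reentrant"


def fake_checkpoint_wrapper(module: Any, checkpoint_impl: Impl = Impl.NO_REENTRANT):
    """Stand-in for checkpoint_wrapper; the default impl is an assumption."""
    return {"module": module, "impl": checkpoint_impl}


def resolve_use_reentrant(gc_kwargs: Optional[dict]) -> Optional[bool]:
    # Same condition as the diff: a non-empty kwargs dict whose
    # "use_reentrant" is truthy, or absent (defaulting to True).
    if gc_kwargs and gc_kwargs.get("use_reentrant", True):
        return True
    return None


def pick_wrapper_fn(use_reentrant: Optional[bool]) -> Callable:
    # Same branch as ac_wrap_hf_model: bind REENTRANT only when truthy.
    if use_reentrant:
        return partial(fake_checkpoint_wrapper, checkpoint_impl=Impl.REENTRANT)
    return fake_checkpoint_wrapper
```

One consequence worth checking during review: an unset or empty `gradient_checkpointing_kwargs` resolves to `None`, so the reentrant wrapper is only chosen when the kwargs dict is present and non-empty.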
--- a/src/axolotl/loaders/patch_manager.py
+++ b/src/axolotl/loaders/patch_manager.py
@@ -3,6 +3,7 @@
 Applies pre- and post-model load patches for various fixes and optimizations.
 """
 import os
 import importlib.util
 from functools import cached_property
@@ -66,6 +67,7 @@ class PatchManager:
        self._apply_mistral_cross_entropy_patch()
        self._apply_self_attention_lora_patch()
        self._apply_fsdp2_bnb_patches()
        self._apply_patch_deepspeed_zero3()
    def apply_post_plugin_pre_model_load_patches(self):
        """Apply post plugin-pre_model_load load patches based on config."""
@@ -78,13 +80,7 @@ class PatchManager:
            patch_maybe_log_save_evaluate,
        )
-        patch_fsdp2 = (
-            self.cfg.torch_compile
-            and self.cfg.fsdp_config
-            and self.cfg.fsdp_version == 2
-        )
-        patch_evaluation_loop(patch_fsdp2)
+        patch_evaluation_loop()
        patch_maybe_log_save_evaluate()
    def apply_post_model_load_patches(self, model: PreTrainedModel):
@@ -147,14 +143,12 @@ class PatchManager:
    def _apply_flex_attention_patches(self):
        """Apply patches for flexible attention."""
        if self.cfg.flex_attention:
-            # from axolotl.monkeypatch.attention.flex_attn import (
-            #     patch_flex_make_mask,
-            #     patch_flex_wrapper,
-            # )
-            #
-            # flex_attn_compile_kwargs = self.cfg.flex_attn_compile_kwargs or {}
-            # patch_flex_wrapper(**flex_attn_compile_kwargs)
-            # patch_flex_make_mask()
+            from axolotl.monkeypatch.attention.flex_attn import (
+                patch_flex_wrapper,
+            )
+
+            flex_attn_compile_kwargs = self.cfg.flex_attn_compile_kwargs or {}
+            patch_flex_wrapper(**flex_attn_compile_kwargs)
            if self.cfg.sample_packing:
                from axolotl.core.attention.flex_block_mask import (
                    patch_create_causal_mask,
@@ -471,3 +465,16 @@ class PatchManager:
            from axolotl.monkeypatch.lora_kernels import apply_lora_kernel_patches
            apply_lora_kernel_patches(model=model, cfg=self.cfg)
    def _apply_patch_deepspeed_zero3(self):
        try:
            from axolotl.monkeypatch.deepspeed_utils import apply_deepspeed_patches
            from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
            if self.cfg.activation_offloading is True and (
                is_deepspeed_zero3_enabled()
                or os.getenv("ACCELERATE_DEEPSPEED_ZERO_STAGE") == "3"
            ):
                apply_deepspeed_patches()
        except ImportError as e:
            LOG.warning(f"DeepSpeed patches not applied: {e}")
--- a/src/axolotl/monkeypatch/attention/flex_attn.py
+++ b/src/axolotl/monkeypatch/attention/flex_attn.py
@@ -1,11 +1,11 @@
 """Flex attention monkey patch"""
 import sys
-from typing import Optional, Tuple, Union
+from packaging import version
 import torch
 import transformers
-
+from transformers.utils.import_utils import _torch_version, is_torch_less_or_equal
 from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
@@ -46,19 +46,33 @@ def patch_flex_wrapper(**flex_attn_compile_kwargs):
            """
            self.training = None
            if not self._is_flex_compiled or training != self.training:
                self.training = training
                if is_torch_less_or_equal("2.5.1"):
                    self._compiled_flex_attention = torch.compile(
                        flex_attention, dynamic=False
                    )
                # In PyTorch 2.6.0, there's a known issue with flex attention compilation which may
                # cause errors. The suggested fix is to compile with "max-autotune-no-cudagraphs"
                # see https://github.com/pytorch/pytorch/issues/146260 for training
-                self.training = training
-                LOG.info(
-                    "Compiling flex attention with kwargs: %s. This may take a while...",
-                    flex_attn_compile_kwargs,
-                )
-                self._compiled_flex_attention = torch.compile(
-                    flex_attention,
-                    **flex_attn_compile_kwargs,
-                )
-                LOG.info("Flex attention compiled successfully.")
+                elif version.parse(_torch_version).base_version == "2.6.0" and training:
+                    self._compiled_flex_attention = torch.compile(
+                        flex_attention, dynamic=False, mode="max-autotune-no-cudagraphs"
+                    )
+                # Fallback, usually the most recent torch 2.7.x+ versions
+                else:
+                    LOG.info(
+                        "Compiling flex attention with kwargs: %s. This may take a while...",
+                        flex_attn_compile_kwargs,
+                        main_process_only=True,
                    )
                    self._compiled_flex_attention = torch.compile(
                        flex_attention,
                        **flex_attn_compile_kwargs,
                    )
                    LOG.info(
                        "Flex attention compiled successfully.", main_process_only=True
                    )
                self._is_flex_compiled = True
        def __call__(self):
@@ -68,139 +82,3 @@ def patch_flex_wrapper(**flex_attn_compile_kwargs):
    sys.modules[
        "transformers.integrations.flex_attention"
    ].WrappedFlexAttention = WrappedFlexAttention
 def patch_flex_make_mask():
    is_torch_2_6 = torch.__version__.startswith("2.6")
    if not is_torch_2_6:
        return
    from torch.nn.attention.flex_attention import (
        _DEFAULT_SPARSE_BLOCK_SIZE as flex_default_block_size,
    )
    from torch.nn.attention.flex_attention import (
        BlockMask,
    )
    from torch.nn.attention.flex_attention import (
        create_block_mask as create_block_causal_mask_flex,
    )
    Offset = Union[torch.Tensor, int]
    def patched_make_flex_block_causal_mask(
        attention_mask_2d: torch.Tensor,
        attention_chunk_size: Optional[int] = None,
        query_length=None,
        key_length=None,
        offsets: Optional[Tuple[Offset, Offset]] = None,
    ) -> "BlockMask":
        """
        Create a block causal document mask for a batch of sequences, both packed and unpacked.
        Create Block causal logic and passing it into :func:`torch.nn.attention.flex_attention.create_block_mask`.
        The resultant BlockMask is a compressed representation of the full block causal
        mask. BlockMask is essential for performant computation of flex attention.
        See: https://pytorch.org/blog/flexattention/
        Args:
            attention_mask_2d (torch.Tensor): Attention mask for packed and padded sequences
            of shape (batch_size, total_seq_len). e.g.
            For unpacked sequence:
            [[1, 1, 1, 1, 0, 0, 0],
             [1, 1, 1, 1, 1, 0, 0]]
            For packed sequence:
            [[1, 1, 1, 2, 2, 2, 0],
             [1, 1, 2, 2, 2, 3, 3]]
        Returns:
            BlockMask
        """
        batch_size, total_seq_len = attention_mask_2d.shape
        if not key_length:
            key_length = total_seq_len
        if not query_length:
            query_length = total_seq_len
        attention_mask_2d = torch.nn.functional.pad(
            attention_mask_2d,
            value=0,
            pad=(0, abs(total_seq_len - max(key_length, flex_default_block_size))),
        )
        device = attention_mask_2d.device
        document_ids = attention_mask_2d.clone()
        if attention_chunk_size is not None:
            # we create an arange, then we just // by chunk size to get [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
            chunk_idxs = (document_ids.clone().fill_(1).cumsum(-1) - 1) // (
                attention_chunk_size
            )
        # Instead of passing a tensor mask, flex attention requires a mask_mod function
        # that determines which elements of QK^T should be included in the attention
        # computation prior to the softmax. For sample packing, we need both the
        # logic for both causal mask and document mask. See PyTorch's official
        # blog post for more details: https://pytorch.org/blog/flexattention/#mask-mods
        def causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
            """
            Defines the logic of a block causal mask by combining both a standard causal mask
            and a block diagonal document mask.
            See :func:`~torchtune.modules.attention_utils.create_block_causal_mask`
            for an illustration.
            """
            causal_mask = q_idx >= kv_idx  # not valid when decoding
            document_mask = (
                document_ids[batch_idx, q_idx] == document_ids[batch_idx, kv_idx]
            )
            padding_mask = attention_mask_2d[batch_idx, q_idx] > 0
            final_mask = causal_mask & padding_mask & document_mask
            return final_mask
        def chunk_causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
            """
            Combines the chunk mask with the causal mask for chunked attention.
            """
            chunk_mask = chunk_idxs[batch_idx, q_idx] == chunk_idxs[batch_idx, kv_idx]
            causal_doc_mask = causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx)
            return chunk_mask & causal_doc_mask
        mask_mod_maybe_combined = (
            causal_mask_mod if attention_chunk_size is None else chunk_causal_mask_mod
        )
        if offsets is not None:
            q_offset = offsets[0]
            kv_offset = offsets[1]
            def mask_mod(batch_idx, head_idx, q_idx, kv_idx):
                offset_q = q_idx + q_offset
                offset_kv = kv_idx + kv_offset
                return mask_mod_maybe_combined(batch_idx, head_idx, offset_q, offset_kv)
        else:
            mask_mod = mask_mod_maybe_combined
        return create_block_causal_mask_flex(
            mask_mod=mask_mod,
            B=batch_size,
            H=None,  # attention head
            Q_LEN=query_length,
            KV_LEN=key_length,
            device=device,
            _compile=True,
        )
    for n in tuple(sys.modules):
        if ".modeling_" in n:
            if hasattr(sys.modules[n], "make_flex_block_causal_mask"):
                sys.modules[
                    n
                ].make_flex_block_causal_mask = patched_make_flex_block_causal_mask
                sys.modules[
                    n
                ].make_flex_block_causal_mask = patched_make_flex_block_causal_mask
    transformers.integrations.flex_attention.make_flex_block_causal_mask = (
        patched_make_flex_block_causal_mask
    )
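Note on the version dispatch added to `WrappedFlexAttention.__init__` in this file: the three branches reduce to choosing a set of `torch.compile` keyword arguments from the torch version and the `training` flag. The sketch below isolates that choice so it can be checked without torch installed. `base_version` and `select_compile_kwargs` are hypothetical helpers, and comparing only the leading `X.Y.Z` of the version string is a simplification that may differ from how `is_torch_less_or_equal` handles local or dev version suffixes.

```python
import re


def base_version(v: str) -> tuple:
    """Parse the leading 'X.Y[.Z]' of a version string into an int tuple."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", v)
    if m is None:
        raise ValueError(f"unrecognised version string: {v!r}")
    return tuple(int(part or 0) for part in m.groups())


def select_compile_kwargs(torch_version: str, training: bool, user_kwargs: dict) -> dict:
    """Return the kwargs the diff would pass to torch.compile."""
    v = base_version(torch_version)
    if v <= (2, 5, 1):
        # Oldest branch: fixed kwargs, user kwargs are ignored.
        return {"dynamic": False}
    if v == (2, 6, 0) and training:
        # Workaround branch for the 2.6.0 training issue cited in the diff.
        return {"dynamic": False, "mode": "max-autotune-no-cudagraphs"}
    # Fallback branch: pass the user's flex_attn_compile_kwargs through.
    return dict(user_kwargs)
```

Note that on 2.6.0 with `training=False` the fallback branch applies, so user-supplied kwargs are only overridden for training on that release and for 2.5.1 and earlier.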
--- a/src/axolotl/monkeypatch/deepspeed_utils.py
+++ b/src/axolotl/monkeypatch/deepspeed_utils.py
@@ -0,0 +1,66 @@
 import importlib
 import importlib.util
 from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 def patch_checkpoint_wrapper_setattr():
    """
    Patch CheckpointWrapper to properly forward DeepSpeed attributes to wrapped modules.
    This fixes the issue where CheckpointWrapper doesn't forward ds_* attributes
    (like ds_grads_remaining) to the actual wrapped module, causing DeepSpeed
    ZeRO-3 to fail when gradient checkpointing is enabled.
    This issue occurs specifically with:
    - QLoRA + DeepSpeed ZeRO-3
    - gradient_checkpointing: true
    - activation_offloading: true
    References:
    - https://github.com/deepspeedai/DeepSpeed/issues/7203
    - https://github.com/deepspeedai/DeepSpeed/blob/38d1a9eb64c9e01e32eccc50b25ba18925287441/deepspeed/runtime/zero/parameter_offload.py#L424-L458
    - https://github.com/axolotl-ai-cloud/axolotl/pull/3102
    """
    try:
        from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
            CheckpointWrapper,
        )
        # Check if already patched
        if hasattr(CheckpointWrapper, "_axolotl_setattr_patched"):
            LOG.debug("CheckpointWrapper already patched")
            return
        original_setattr = CheckpointWrapper.__setattr__
        def new_setattr(self, name: str, value) -> None:
            if name.startswith("ds_") and hasattr(self, "_checkpoint_wrapped_module"):
                setattr(self._checkpoint_wrapped_module, name, value)
                LOG.debug(
                    f"Forwarded {name} to wrapped module {type(self._checkpoint_wrapped_module).__name__}"
                )
            else:
                original_setattr(self, name, value)
        CheckpointWrapper.__setattr__ = new_setattr
        CheckpointWrapper._axolotl_setattr_patched = True
        LOG.info("CheckpointWrapper patched to forward DeepSpeed attributes")
    except ImportError as e:
        LOG.debug(f"CheckpointWrapper not available: {e}")
    except Exception as e:
        LOG.warning(f"Failed to patch CheckpointWrapper: {e}")
 def apply_deepspeed_patches():
    """
    Apply DeepSpeed-related patches
    """
    if importlib.util.find_spec("deepspeed") is not None:
        patch_checkpoint_wrapper_setattr()
    else:
        LOG.debug("DeepSpeed not available, skipping patches")
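Review note: the `__setattr__` forwarding above can be tried in isolation. The following is a minimal sketch, not the patch itself. `Inner`, `Wrapper`, and `patch_setattr` are hypothetical stand-ins (no torch or DeepSpeed involved); only the `ds_` prefix check and the `_checkpoint_wrapped_module` attribute name are taken from the diff.

```python
class Inner:
    """Hypothetical stand-in for the module being wrapped."""


class Wrapper:
    """Hypothetical stand-in for CheckpointWrapper."""

    def __init__(self, module):
        self._checkpoint_wrapped_module = module


def patch_setattr(wrapper_cls):
    """Forward ds_* attribute writes to the wrapped module (idempotent)."""
    if getattr(wrapper_cls, "_setattr_patched", False):
        return  # already patched, keep the existing hook
    original_setattr = wrapper_cls.__setattr__

    def new_setattr(self, name, value):
        if name.startswith("ds_") and hasattr(self, "_checkpoint_wrapped_module"):
            # DeepSpeed-style attributes land on the inner module
            setattr(self._checkpoint_wrapped_module, name, value)
        else:
            original_setattr(self, name, value)

    wrapper_cls.__setattr__ = new_setattr
    wrapper_cls._setattr_patched = True


patch_setattr(Wrapper)
inner = Inner()
wrapped = Wrapper(inner)
wrapped.ds_grads_remaining = 3      # forwarded to `inner`
wrapped.other = "kept on wrapper"   # stored on the wrapper as usual
```

The guard flag mirrors the `_axolotl_setattr_patched` check in the diff, so calling the patch twice does not stack hooks.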
--- a/src/axolotl/monkeypatch/lora_kernels.py
+++ b/src/axolotl/monkeypatch/lora_kernels.py
@@ -149,6 +149,11 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:
        return MistralAttention
+    if model_type == "gemma3_text":
+        from transformers.models.gemma3.modeling_gemma3 import Gemma3Attention
+        return Gemma3Attention
    try:
        # Dynamically import the module and attention class
        module_path = f"transformers.models.{model_type}.modeling_{model_type}"
--- a/src/axolotl/monkeypatch/tiled_mlp/base.py
+++ b/src/axolotl/monkeypatch/tiled_mlp/base.py
@@ -8,6 +8,94 @@ from typing import List
 import torch
 class DeepSpeedTiledMLPMoE(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        fn,
        self,
        x,
        shards,
        compute_params,
    ) -> torch.Tensor:
        ctx.fn = fn
        ctx.self = self
        ctx.shards = shards
        ctx.compute_params = [p for p in compute_params if p.requires_grad]
        ctx.save_for_backward(x)
        x_shards = list(torch.chunk(x, chunks=shards, dim=1))
        with torch.no_grad():
            output_shards = [fn(self, x_shard) for x_shard in x_shards]
        ctx.is_tuple_output = isinstance(output_shards[0], tuple)
        if isinstance(output_shards[0], tuple):
            tuple_dim_idx = [1, 0]
            output_unsharded = tuple(
                torch.cat(
                    [output_shard[i] for output_shard in output_shards],
                    dim=tuple_dim_idx[i],
                )
                for i in range(len(output_shards[0]))
            )
        else:
            output_unsharded = torch.cat(output_shards, dim=1)
        return output_unsharded
    @staticmethod
    def backward(ctx, *grads) -> torch.Tensor:
        fn = ctx.fn
        (x,) = ctx.saved_tensors
        self = ctx.self
        shards = ctx.shards
        compute_params = ctx.compute_params
        is_tuple_output = ctx.is_tuple_output
        x_requires_grad = x.requires_grad
        x = x.detach()
        # detach() unsets `x.requires_grad`, so restore it
        x.requires_grad_(x_requires_grad)
        incoming_grad = grads[0]
        x_grad = torch.zeros_like(x)
        x_shards = list(torch.chunk(x, chunks=shards, dim=1))
        shard_step = x_shards[0].numel()
        for i, x_shard in enumerate(x_shards):
            # Tell deepspeed not to add a new grad to its ipg bucket until the last shard is run
            if compute_params is not None:
                if i + 1 < shards:
                    for param in compute_params:
                        param.ds_grad_is_ready = False
                else:
                    # last shard, can add the grad
                    for param in compute_params:
                        param.ds_grad_is_ready = True
            x_shard.requires_grad_(x_requires_grad)
            shard_offset = i * shard_step
            x_shard.grad = (
                x_grad.view(-1)
                .narrow(0, shard_offset, x_shard.numel())
                .view_as(x_shard)
            )
            incoming_grad_shard = (
                incoming_grad.view(-1)
                .narrow(0, shard_offset, x_shard.numel())
                .view_as(x_shard)
            )
            with torch.enable_grad():
                output = fn(self, x_shard)
            if is_tuple_output:
                torch.autograd.backward(output[0], incoming_grad_shard)
            else:
                torch.autograd.backward(output, incoming_grad_shard)
        return (None, None, x_grad, None, None)
 class TiledMLP(torch.autograd.Function):
    """
    TiledMLP implementation using gradient hooks
@@ -31,7 +119,18 @@ class TiledMLP(torch.autograd.Function):
        x_shards = list(torch.chunk(x, chunks=shards, dim=1))
        with torch.no_grad():
            output_shards = [fn(self, x_shard) for x_shard in x_shards]
-        output_unsharded = torch.cat(output_shards, dim=1)
+        ctx.is_tuple_output = isinstance(output_shards[0], tuple)
+        if isinstance(output_shards[0], tuple):
+            tuple_dim_idx = [1, 0]
+            output_unsharded = tuple(
+                torch.cat(
+                    [output_shard[i] for output_shard in output_shards],
+                    dim=tuple_dim_idx[i],
+                )
+                for i in range(len(output_shards[0]))
+            )
+        else:
+            output_unsharded = torch.cat(output_shards, dim=1)
        return output_unsharded
@@ -42,6 +141,7 @@ class TiledMLP(torch.autograd.Function):
        self = ctx.self
        shards = ctx.shards
        compute_params = ctx.compute_params
+        is_tuple_output = ctx.is_tuple_output
        x_requires_grad = x.requires_grad
        x = x.detach()
@@ -76,7 +176,10 @@ class TiledMLP(torch.autograd.Function):
            with torch.enable_grad():
                output = fn(self, x_shard)
-            torch.autograd.backward(output, incoming_grad_shard)
+            if is_tuple_output:
+                torch.autograd.backward(output[0], incoming_grad_shard)
+            else:
+                torch.autograd.backward(output, incoming_grad_shard)
        # Clean up hooks
        grad_accumulator.cleanup()
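Review note: the shard-then-concatenate pattern in the forward passes above can be illustrated without torch. This is a simplified sketch under stated assumptions: inputs are non-empty 1-D Python lists rather than tensors, `chunk` and `tiled_apply` are hypothetical helpers, and the per-element concatenation dims (`tuple_dim_idx = [1, 0]`) from the diff are not modelled. `chunk` follows the ceil-division sizing that `torch.chunk` uses.

```python
def chunk(seq, chunks):
    """Split a list into contiguous pieces of ceil(len / chunks) items each."""
    size = max(1, -(-len(seq) // chunks))  # ceil division
    return [seq[i : i + size] for i in range(0, len(seq), size)]


def tiled_apply(fn, x, shards):
    """Apply fn shard by shard, then stitch the outputs back together.

    If fn returns a tuple, each tuple position is concatenated separately,
    mirroring the is_tuple_output branch in the diff.
    """
    outputs = [fn(shard) for shard in chunk(x, shards)]
    if isinstance(outputs[0], tuple):
        return tuple(
            [item for out in outputs for item in out[i]]
            for i in range(len(outputs[0]))
        )
    return [item for out in outputs for item in out]


def double(shard):
    return [v * 2 for v in shard]


def pair(shard):
    # Tuple-returning function, loosely analogous to an MoE block that
    # returns more than one tensor
    return ([v + 1 for v in shard], [v * v for v in shard])


doubled = tiled_apply(double, list(range(10)), 4)
paired = tiled_apply(pair, [1, 2, 3, 4, 5], 2)
```

For a position-wise `fn`, the tiled result matches applying `fn` to the whole input at once, which is why tiling can trade peak memory for extra passes without changing the output.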
--- a/src/axolotl/monkeypatch/tiled_mlp/patch.py
+++ b/src/axolotl/monkeypatch/tiled_mlp/patch.py
@@ -17,7 +17,7 @@ def patch_tiled_mlp(model_type, use_original_mlp=True, cfg_num_shards=None):
        TiledMLP as DeepSpeedTiledMLP,
    )
-    from axolotl.monkeypatch.tiled_mlp.base import TiledMLP
+    from axolotl.monkeypatch.tiled_mlp.base import DeepSpeedTiledMLPMoE, TiledMLP
    try:
        # Dynamically import the module and MLP class
@@ -64,7 +64,10 @@ def patch_tiled_mlp(model_type, use_original_mlp=True, cfg_num_shards=None):
                        for p in self._compute_params
                    )
                ) or os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true":
-                    self._tiled_mlp_dist_impl = DeepSpeedTiledMLP
+                    if model_type == "gpt_oss":
+                        self._tiled_mlp_dist_impl = DeepSpeedTiledMLPMoE
+                    else:
+                        self._tiled_mlp_dist_impl = DeepSpeedTiledMLP
                else:
                    self._tiled_mlp_dist_impl = TiledMLP
--- a/src/axolotl/monkeypatch/transformers/trainer_loss_calc.py
+++ b/src/axolotl/monkeypatch/transformers/trainer_loss_calc.py
@@ -28,15 +28,6 @@ PATCHED_EVAL_CODE = {
    "array": 'metrics[f"{metric_key_prefix}_loss"] = np.nanmean(all_losses).item()',
 }
-ORIGINAL_FSDP2_CODE = """
-    model.eval()
-"""
-PATCHED_FSDP2_CODE = """
-    if hasattr(model, "eval") and callable(model.eval):
-        self.model.eval()
-"""
 ORIGINAL_MAYBE_CODE = "tr_loss_scalar = self._nested_gather(tr_loss).mean().item()"
 PATCHED_MAYBE_CODE = "tr_loss_scalar = self._nested_gather(tr_loss).nanmean().item()"
@@ -46,13 +37,7 @@ def check_evaluation_loop_is_patchable() -> bool:
    return all(value in evaluation_loop_source for value in ORIGINAL_EVAL_CODE.values())
-def check_evaluation_loop_is_fsdp2_patchable() -> bool:
-    evaluation_loop_source = inspect.getsource(Trainer.evaluation_loop)
-    evaluation_loop_source, _ = detab_code(evaluation_loop_source)
-    return ORIGINAL_FSDP2_CODE in evaluation_loop_source
-def patch_evaluation_loop(patch_fsdp2: bool):
+def patch_evaluation_loop():
    """Patch the evaluation_loop method."""
    # Check if already patched
    if hasattr(Trainer, "_original_evaluation_loop"):
@@ -75,13 +60,6 @@ def patch_evaluation_loop(patch_fsdp2: bool):
        ORIGINAL_EVAL_CODE["array"], PATCHED_EVAL_CODE["array"]
    )
-    # Apply FSDP2 eval guard patch if needed
-    if patch_fsdp2 and ORIGINAL_FSDP2_CODE in evaluation_loop_source:
-        evaluation_loop_source = evaluation_loop_source.replace(
-            ORIGINAL_FSDP2_CODE, PATCHED_FSDP2_CODE
-        )
-        LOG.info("Applied FSDP2 eval guard patch to evaluation_loop")
    # Rename the function to avoid conflicts
    evaluation_loop_source = evaluation_loop_source.replace(
        "def evaluation_loop(",
--- a/src/axolotl/prompt_tokenizers.py
+++ b/src/axolotl/prompt_tokenizers.py
@@ -75,7 +75,7 @@ class PromptTokenizingStrategy(abc.ABC):
    ) -> BatchEncoding:
        empty = BatchEncoding(data={"input_ids": [], "attention_mask": []})
        if not prompt:
-            LOG.warning("Empty text requested for tokenization.")
+            LOG.warning_once("Empty text requested for tokenization.")
            return empty
        result = self.tokenizer(
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -416,7 +416,9 @@ def save_initial_configs(
    # Pre-save the tokenizer and model configs
    LOG.info(f"Pre-saving tokenizer to {cfg.output_dir}...")
-    tokenizer.save_pretrained(str(output_dir))
+    tokenizer.save_pretrained(
+        str(Path(cfg.output_dir)), save_jinja_files=cfg.tokenizer_save_jinja_files
+    )
    if hasattr(model, "config"):
        LOG.info(f"Pre-saving model config to {cfg.output_dir}...")
        model.config.save_pretrained(str(output_dir))
@@ -592,6 +594,9 @@ def train(
    # Save the trained model and cleanup
    save_trained_model(cfg, trainer, model, safe_serialization)
+    tokenizer.save_pretrained(
+        str(Path(cfg.output_dir)), save_jinja_files=cfg.tokenizer_save_jinja_files
+    )
    create_model_card(cfg, trainer)
    if not cfg.use_ray:
        cleanup_distributed()
--- a/src/axolotl/utils/bench.py
+++ b/src/axolotl/utils/bench.py
@@ -60,13 +60,14 @@ def gpu_memory_usage_all(device=0):
    active = torch.cuda.memory_stats().get("active_bytes.all.peak", 0) / 1024.0**3
    allocated = torch.cuda.max_memory_allocated(device) / 1024.0**3
    reserved = torch.cuda.max_memory_reserved(device) / 1024.0**3
    torch.cuda.reset_peak_memory_stats(device)
    return active, allocated, reserved
 def mps_memory_usage_all():
-    usage = torch.mps.current_allocated_memory() / 1024.0**3
+    active = torch.mps.current_allocated_memory() / 1024.0**3
-    reserved = torch.mps.driver_allocated_memory() / 1024.0**3
+    allocated = torch.mps.driver_allocated_memory() / 1024.0**3
-    return usage, reserved - usage, 0
+    return active, allocated, 0
 def npu_memory_usage_all(device=0):
--- a/src/axolotl/utils/callbacks/tokens_per_second.py
+++ b/src/axolotl/utils/callbacks/tokens_per_second.py
@@ -0,0 +1,64 @@
 """A callback for calculating tokens per second during training."""
 import time
 import torch
 from transformers import (
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
 )
 class TokensPerSecondCallback(TrainerCallback):
    """
    A callback to measure and log tokens per second during training.
    """
    def __init__(self, tensor_parallel_size, context_parallel_size):
        super().__init__()
        self.step_time = 0.0
        self.start_time = 0.0
        self.non_data_parallel_size = 1
        if tensor_parallel_size is not None:
            self.non_data_parallel_size *= tensor_parallel_size
        if context_parallel_size is not None:
            self.non_data_parallel_size *= context_parallel_size
    def on_step_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):  # pylint: disable=unused-argument
        self.start_time = time.perf_counter()
        state.last_tokens_per_second = torch.zeros(1)
    def on_step_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):  # pylint: disable=unused-argument
        if hasattr(state, "num_tokens"):
            step_time = time.perf_counter() - self.start_time
            num_tokens_per_device = state.num_tokens.clone()
            # non data parallel groups have duplicated tokens, so we avoid double-counting
            num_tokens_per_device = num_tokens_per_device / self.non_data_parallel_size
            state.last_tokens_per_second = num_tokens_per_device / step_time
    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs=None,
        **kwargs,
    ):  # pylint: disable=unused-argument
        # after logging, clear the running metrics
        if hasattr(state, "last_tokens_per_second"):
            state.last_tokens_per_second.zero_()
            state.num_tokens = torch.zeros(1)
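Review note: the throughput arithmetic in `on_step_end` reduces to a few lines. This sketch restates it with plain numbers instead of tensors. The function name `tokens_per_second` and the sample values are illustrative assumptions; the division by the product of tensor-parallel and context-parallel sizes follows the callback above.

```python
def tokens_per_second(
    num_tokens, step_time, tensor_parallel_size=None, context_parallel_size=None
):
    """Per-device tokens/sec, de-duplicated across non-data-parallel ranks."""
    non_data_parallel_size = 1
    if tensor_parallel_size is not None:
        non_data_parallel_size *= tensor_parallel_size
    if context_parallel_size is not None:
        non_data_parallel_size *= context_parallel_size
    # Ranks in the same TP/CP group see the same tokens, so divide them out
    return (num_tokens / non_data_parallel_size) / step_time


# 8192 tokens counted in a 2.0 s step with TP=2 and CP=2
rate = tokens_per_second(8192, 2.0, tensor_parallel_size=2, context_parallel_size=2)
```

With no tensor or context parallelism configured, both sizes are `None` and the result is simply `num_tokens / step_time`.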
--- a/src/axolotl/utils/collators/init.py
+++ b/src/axolotl/utils/collators/init.py
@@ -1,11 +1,17 @@
-"""
-shared axolotl collators for multipack, mamba, multimodal
-"""
+"""Shared axolotl collators for multipacking, mamba, multimodal."""
-from .batching import (  # noqa: F401
+from .batching import (
    BatchSamplerDataCollatorForSeq2Seq,
    DataCollatorForSeq2Seq,
    PretrainingBatchSamplerDataCollatorForSeq2Seq,
    V2BatchSamplerDataCollatorForSeq2Seq,
 )
-from .mamba import MambaDataCollator  # noqa: F401
+from .mamba import MambaDataCollator
+__all__ = [
+    "DataCollatorForSeq2Seq",
+    "BatchSamplerDataCollatorForSeq2Seq",
+    "V2BatchSamplerDataCollatorForSeq2Seq",
+    "PretrainingBatchSamplerDataCollatorForSeq2Seq",
+    "MambaDataCollator",
+]
--- a/src/axolotl/utils/config/init.py
+++ b/src/axolotl/utils/config/init.py
@@ -77,7 +77,7 @@ def resolve_dtype(cfg):
    if cfg.device == "mps":
        cfg.load_in_8bit = False
        cfg.tf32 = False
-        if cfg.bf16:
+        if cfg.bf16 and cfg.fp16 is not False:
            cfg.fp16 = True
        cfg.bf16 = False
    else:
@@ -273,7 +273,9 @@ def validate_config(
    # Convert datasets to proper format if needed
    if cfg.get("datasets"):
        for idx, ds_cfg in enumerate(cfg["datasets"]):
-            if cfg.get("rl") in ["dpo", "simpo"] and not isinstance(ds_cfg, DPODataset):
+            if cfg.get("rl") in ["dpo", "ipo", "simpo"] and not isinstance(
+                ds_cfg, DPODataset
+            ):
                cfg["datasets"][idx] = DPODataset(**ds_cfg)
            elif cfg.get("rl") == "kto" and not isinstance(ds_cfg, KTODataset):
                cfg["datasets"][idx] = KTODataset(**dict(ds_cfg))
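Review note: the `resolve_dtype` hunk above changes how `bf16` falls back on MPS. The sketch below reproduces only that branch so the effect of an explicit `fp16: false` is visible. `resolve_mps_dtype` is a hypothetical helper, and `SimpleNamespace` stands in for the real config object.

```python
from types import SimpleNamespace


def resolve_mps_dtype(cfg):
    """Mirror the MPS branch: bf16 falls back to fp16 unless fp16 is explicitly False."""
    cfg.load_in_8bit = False
    cfg.tf32 = False
    if cfg.bf16 and cfg.fp16 is not False:
        cfg.fp16 = True
    cfg.bf16 = False
    return cfg


# fp16 unset: bf16 request is converted to fp16
unset = resolve_mps_dtype(
    SimpleNamespace(bf16=True, fp16=None, load_in_8bit=True, tf32=True)
)
# fp16 explicitly disabled: the fallback is skipped and fp16 stays False
disabled = resolve_mps_dtype(
    SimpleNamespace(bf16=True, fp16=False, load_in_8bit=True, tf32=True)
)
```

Before this change, a config with `bf16: true` and `fp16: false` would still have had `fp16` switched on for MPS.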
--- a/src/axolotl/utils/ctx_managers/sequence_parallel.py
+++ b/src/axolotl/utils/ctx_managers/sequence_parallel.py
@@ -48,10 +48,10 @@ def apply_sequence_parallelism(
            - The original sequence length before padding.
            - The number of padding tokens added.
    """
-    original_seq_len = batch["input_ids"].size(1)
+    batch_size, original_seq_len = batch["input_ids"].shape
    # Update ring attention params if needed
-    if batch.get("position_ids") is not None:
+    if batch.get("position_ids") is not None and batch_size == 1:
        update_ring_attn_params(position_ids=batch["position_ids"])
    else:
        # If position_ids aren't already in the batch, create them
--- a/src/axolotl/utils/data/init.py
+++ b/src/axolotl/utils/data/init.py
@@ -1,8 +1,8 @@
 """Init for `axolotl.utils.data` module."""
-from axolotl.utils.data.pretraining import (
+from axolotl.utils.data.streaming import (
-    encode_pretraining,
+    encode_streaming,
-    wrap_pretraining_dataset,
+    wrap_streaming_dataset,
 )
 from axolotl.utils.data.rl import prepare_preference_datasets
 from axolotl.utils.data.sft import (
@@ -12,8 +12,8 @@ from axolotl.utils.data.sft import (
 from axolotl.utils.data.utils import md5
 __all__ = [
-    "encode_pretraining",
+    "encode_streaming",
-    "wrap_pretraining_dataset",
+    "wrap_streaming_dataset",
    "prepare_preference_datasets",
    "get_dataset_wrapper",
    "prepare_datasets",
--- a/src/axolotl/utils/data/pretraining.py
+++ b/src/axolotl/utils/data/pretraining.py
@@ -1,292 +0,0 @@
 """data handling specific to pretraining"""
 import functools
 from collections import defaultdict
 from typing import Callable, Dict, List, Optional
 import torch
 from datasets import Dataset
 from torch.utils.data import RandomSampler
 from transformers import PreTrainedTokenizerBase
 from axolotl.utils.collators import PretrainingBatchSamplerDataCollatorForSeq2Seq
 from axolotl.utils.logging import get_logger
 from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
 from axolotl.utils.trainer import process_pretraining_datasets_for_packing
 LOG = get_logger(__name__)
 def encode_pretraining(
    tokenizer: PreTrainedTokenizerBase,
    max_tokens: int,
    examples: Dict[str, List],
    text_column: str = "text",
    concatenate: bool = True,
 ) -> Dict[str, List]:
    res = tokenizer(
        examples[text_column],
        truncation=True,
        max_length=max_tokens - 2,
        add_special_tokens=True,
    )
    # Convert to PyTorch tensors
    input_ids = [torch.tensor(seq) for seq in res["input_ids"]]
    targets = [torch.tensor(seq) for seq in res["input_ids"]]
    attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]]
    if not concatenate:
        return {
            "input_ids": [seq.tolist() for seq in input_ids],
            "labels": [seq.tolist() for seq in targets],
            "attention_mask": [seq.tolist() for seq in attention_mask],
        }
    new_input_ids = []
    new_labels = []
    new_attention_mask = []
    # Append EOS and PAD tokens to input_ids, and correct attention_mask
    for i, _ in enumerate(input_ids):
        input_ids[i] = torch.cat(
            (
                input_ids[i],
                torch.tensor([tokenizer.eos_token_id, tokenizer.pad_token_id]),
            ),
            dim=0,
        )
        targets[i] = torch.cat(
            (
                targets[i],
                torch.tensor([tokenizer.eos_token_id, -100]),
            ),
            dim=0,
        )
        attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0)
    # Concatenate tokens so that their lengths are less than max_tokens
    buffer_input_ids = torch.tensor([], dtype=torch.long)
    buffer_labels = torch.tensor([], dtype=torch.long)
    buffer_attention_mask = torch.tensor([], dtype=torch.long)
    for ids, labels, mask in zip(input_ids, targets, attention_mask, strict=False):
        if buffer_input_ids.numel() == max_tokens:
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        elif buffer_input_ids.numel() + ids.numel() <= max_tokens:
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        else:
            buffer_input_ids = torch.cat(
                (
                    buffer_input_ids,
                    torch.full(
                        (max_tokens - buffer_input_ids.numel(),),
                        tokenizer.pad_token_id,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
                    torch.full(
                        (max_tokens - buffer_attention_mask.numel(),),
                        0,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
    if buffer_input_ids.numel() > 0:  # for any leftover tokens
        while buffer_input_ids.numel() < max_tokens:  # make all sequences equal in size
            buffer_input_ids = torch.cat(
                (
                    buffer_input_ids,
                    torch.full(
                        (max_tokens - buffer_input_ids.numel(),),
                        tokenizer.pad_token_id,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
                    torch.full(
                        (max_tokens - buffer_attention_mask.numel(),),
                        0,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
        new_input_ids.append(buffer_input_ids)
        new_labels.append(buffer_labels)
        new_attention_mask.append(buffer_attention_mask)
    ret = {
        "input_ids": [seq.tolist() for seq in new_input_ids],
        "labels": [seq.tolist() for seq in new_labels],
        "attention_mask": [seq.tolist() for seq in new_attention_mask],
    }
    LOG.debug(len(ret["input_ids"]))
    return ret
 def wrap_pretraining_dataset(
    dataset,
    tokenizer,
    cfg,
    ds_wrapper_fn,
    max_tokens=2048,
    batch_size=1,
    seed=42,
    buffer_size=10_000,
 ):
    if cfg.sample_packing:
        collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq(
            tokenizer,
            return_tensors="pt",
            padding=True,
            pad_to_multiple_of=max_tokens,
            multipack_attn=cfg.pretrain_multipack_attn,
        )
        encode = functools.partial(
            encode_packed_pretraining,
            collate_fn,
            ds_wrapper_fn,
            max_seq_length=max_tokens,
            batch_size=batch_size,
            multipack_attn=cfg.pretrain_multipack_attn,
        )
        # set this to 1 so downstream data_loader doesn't try to increase the batch again
        cfg.micro_batch_size = 1
    else:
        encode = functools.partial(
            encode_pretraining,
            tokenizer,
            max_tokens,
            text_column=cfg.pretraining_dataset[0].text_column or "text",
            concatenate=cfg.pretraining_sample_concatenation is True,
        )
    if cfg.shuffle_merged_datasets:
        dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
    else:
        LOG.debug("NOT shuffling merged pretraining datasets")
    # remove all the existing columns after mapping since they end up having
    # a different length than the encoded/tokenized column
    # this is empty during streaming/pretraining
    remove_columns = []
    if dataset.features is None:
        for first_row in dataset:
            remove_columns = list(first_row.keys())
            break
    else:
        remove_columns = list(dataset.features.keys())
    dataset = dataset.map(
        encode,
        batched=True,
        batch_size=buffer_size,
        # input_columns="text",
        remove_columns=remove_columns,
    )
    return dataset
 def encode_packed_pretraining(
    collate_fn,
    ds_wrapper: Callable,
    examples: Dict[str, List],
    max_seq_length: int = 2048,
    batch_size: int = 4,
    multipack_attn: Optional[bool] = True,
 ) -> Dict[str, List]:
    # tokenize all the examples
    # rows get split with stride (overlap)
    train_dataset = ds_wrapper(dataset=Dataset.from_dict(examples))[0]
    train_dataset = process_pretraining_datasets_for_packing(
        train_dataset,
        max_seq_length,
        skip_position_ids=not multipack_attn,
        # FIXME using attention mask unpad/pad with trainer and packed pretraining is broken atm
        # workaround by using the position id logic for now in trainer
        drop_attention_mask=multipack_attn,
    )
    sampler = MultipackBatchSampler(
        sampler=RandomSampler(train_dataset),
        lengths=get_dataset_lengths(train_dataset),
        batch_size=1,
        batch_max_len=batch_size * max_seq_length,
        drop_last=True,
        num_processes=1,
    )
    chunked_data = defaultdict(list)
    for batch in sampler:
        for data in batch:
            features = train_dataset[data]
            if "num_truncated_tokens" in features:
                del features["num_truncated_tokens"]
            if "num_truncated_tokens" in features:
                del features["num_truncated_tokens"]
            if "overflow_to_sample_mapping" in features:
                del features["overflow_to_sample_mapping"]
            if "labels" not in features:
                features["labels"] = features["input_ids"].copy()
            collated_features = collate_fn(features)
            for feature in features.keys():
                if feature == "length":
                    continue
                chunked_data[feature].append(collated_features[feature].squeeze(0))
    return chunked_data
--- a/src/axolotl/utils/data/sft.py
+++ b/src/axolotl/utils/data/sft.py
@@ -9,13 +9,14 @@ from datasets import (
    Dataset,
    DatasetDict,
    IterableDataset,
    IterableDatasetDict,
    load_dataset,
 )
 from transformers import PreTrainedTokenizer, ProcessorMixin
 from axolotl.prompters import Prompter
 from axolotl.utils.data.lock import FileLockLoader
-from axolotl.utils.data.pretraining import wrap_pretraining_dataset
+from axolotl.utils.data.streaming import wrap_streaming_dataset
 from axolotl.utils.data.shared import (
    create_train_validation_split,
    datasets_with_name_generator,
@@ -26,7 +27,6 @@ from axolotl.utils.data.shared import (
    save_preprocessed_dataset,
    try_load_from_hub,
 )
 from axolotl.utils.data.streaming import wrap_streaming_sft_dataset
 from axolotl.utils.data.utils import (
    deduplicate_and_log_datasets,
    handle_long_seq_in_dataset,
@@ -49,7 +49,6 @@ def prepare_datasets(
    cfg: DictDefault,
    tokenizer: PreTrainedTokenizer,
    processor: ProcessorMixin | None = None,
    preprocess_iterable: bool = False,
 ) -> tuple[IterableDataset | Dataset, Dataset | None, int, list[Prompter | None]]:
    """Prepare training and evaluation datasets based on configuration.
@@ -57,24 +56,20 @@ def prepare_datasets(
        cfg: Dictionary mapping `axolotl` config keys to values.
        tokenizer: Tokenizer to use for processing text.
        processor: Optional processor for multimodal datasets.
        preprocess_iterable: Whether to use iterable preprocessing.
    Returns:
        Tuple of (train_dataset, eval_dataset, total_steps, prompters).
    """
-    if cfg.pretraining_dataset:
-        return _prepare_pretraining_dataset(
-            cfg, tokenizer, processor, preprocess_iterable
-        )
-    return _prepare_standard_dataset(cfg, tokenizer, processor, preprocess_iterable)
+    if cfg.streaming or cfg.pretraining_dataset:
+        return _prepare_streaming_dataset(cfg, tokenizer, processor)
+    return _prepare_standard_dataset(cfg, tokenizer, processor)
 def _prepare_standard_dataset(
    cfg: DictDefault,
    tokenizer: PreTrainedTokenizer,
    processor: ProcessorMixin | None,
-    preprocess_iterable: bool,
-) -> tuple[Dataset | IterableDataset, Dataset | None, int, list[Prompter | None]]:
+) -> tuple[Dataset, Dataset | None, int, list[Prompter | None]]:
    """Prepare standard (non-pretraining) datasets."""
    def _load_datasets():
@@ -84,7 +79,6 @@ def _prepare_standard_dataset(
            cfg,
            split="train",
            processor=processor,
            preprocess_iterable=preprocess_iterable,
        )
        # Overwrite eval_dataset if test data exists
@@ -94,7 +88,6 @@ def _prepare_standard_dataset(
                cfg,
                split="test",
                processor=processor,
                preprocess_iterable=preprocess_iterable,
            )
        return train_dataset, eval_dataset, prompters
@@ -119,14 +112,7 @@ def _prepare_standard_dataset(
            )
    # Calculate total number of training steps
-    # For streaming datasets, we must use max_steps
-    if isinstance(train_dataset, IterableDataset):
-        if not cfg.max_steps:
-            raise ValueError(
-                "When using streaming datasets, you must set max_steps in your config"
-            )
-        total_num_steps = cfg.max_steps
-    elif cfg.max_steps:
+    if cfg.max_steps:
        total_num_steps = min(
            calculate_total_num_steps(cfg, train_dataset), cfg.max_steps
        )
@@ -136,22 +122,40 @@ def _prepare_standard_dataset(
    return train_dataset, eval_dataset, total_num_steps, prompters
-def _prepare_pretraining_dataset(
+def _prepare_streaming_dataset(
    cfg: DictDefault,
    tokenizer: PreTrainedTokenizer,
    processor: ProcessorMixin | None,
-    preprocess_iterable: bool,
 ) -> tuple[IterableDataset, Dataset | None, int, list[Prompter | None]]:
    """
-    Prepare dataset for pretraining mode.
+    Prepare dataset for streaming mode.
-    Note: Pre-training datasets are streamed from the HuggingFace Hub.
+    Note: Streaming datasets are loaded incrementally from the source.
    """
-    # Extract pretraining dataset configuration
-    pretraining_config = _extract_pretraining_config(cfg)
-    # Load streaming dataset for training
-    train_dataset = _load_pretraining_dataset(pretraining_config, cfg, tokenizer)
+    if cfg.pretraining_dataset:
+        dataset_config = _extract_pretraining_config(cfg)
+        train_dataset = _load_streaming_dataset(dataset_config, cfg, tokenizer)
+    elif cfg.sample_packing:
+        # TODO(djsaunde): Implement for multiple datasets
+        dataset_config = DictDefault(cfg.datasets[0])
+        # Ensure we have a split set - default to 'train' if not specified
+        if not hasattr(dataset_config, "split") or not dataset_config.split:
+            dataset_config.split = "train"
+        train_dataset = _load_streaming_dataset(dataset_config, cfg, tokenizer)
+    else:
+        # Use legacy loading function for non-packed streaming datasets
+        train_dataset, eval_dataset, prompters = _load_and_prepare_datasets(
+            tokenizer,
+            cfg,
+            split="train",
+            processor=processor,
+            streaming=True,
+        )
+        # Return early for non-packed streaming datasets
+        total_num_steps = cfg.max_steps if cfg.max_steps else -1
+        return train_dataset, eval_dataset, total_num_steps, prompters
    # Load evaluation dataset if specified
    eval_dataset = None
@@ -161,14 +165,12 @@ def _prepare_pretraining_dataset(
            cfg,
            split="test",
            processor=processor,
-            preprocess_iterable=preprocess_iterable,
+            streaming=False,
        )
-    if cfg.dataset_exact_deduplication:
+    # For streaming, we return max_steps directly from config or -1 if not set
-        LOG.info("Deduplication not available for pretrained datasets")
+    total_num_steps = cfg.max_steps if cfg.max_steps else -1
-
+    return train_dataset, eval_dataset, total_num_steps, []
-    # For pretraining, we return max_steps directly from config
-    return train_dataset, eval_dataset, cfg.max_steps, []
 def _extract_pretraining_config(cfg: DictDefault) -> DictDefault:
@@ -200,7 +202,7 @@ def _extract_pretraining_config(cfg: DictDefault) -> DictDefault:
    )
-def _load_pretraining_dataset(
+def _load_streaming_dataset(
    pretraining_config: DictDefault, cfg: DictDefault, tokenizer: PreTrainedTokenizer
 ) -> IterableDataset:
    """Load and prepare a streaming dataset for pretraining."""
@@ -235,15 +237,11 @@ def _load_pretraining_dataset(
        iter_dataset = iter_dataset.skip(pretraining_config["skip"])
    # Wrap the dataset for pretraining
-    train_dataset = wrap_pretraining_dataset(
+    train_dataset = wrap_streaming_dataset(
        iter_dataset,
        tokenizer,
        cfg,
        dataset_wrapper_partial,
        max_tokens=cfg.sequence_len,
        batch_size=cfg.micro_batch_size,
        seed=cfg.seed,
        buffer_size=cfg.pretrain_multipack_buffer_size or 10_000,
    )
    # Format for PyTorch
@@ -264,7 +262,7 @@ def _load_tokenized_prepared_datasets(
    cfg: DictDefault,
    split: Literal["train", "test"] = "train",
    processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | DatasetDict, list[Prompter | None]]:
    """Load or create tokenized and prepared datasets for training or testing.
@@ -273,7 +271,7 @@ def _load_tokenized_prepared_datasets(
        cfg: Configuration object.
        split: Dataset split to load ('train' or 'test').
        processor: Optional processor for multimodal datasets.
-        preprocess_iterable: Whether to use iterable preprocessing.
+        streaming: Whether to load the dataset in streaming mode.
    Returns:
        Tuple of (dataset, prompters list).
@@ -304,7 +302,7 @@ def _load_tokenized_prepared_datasets(
            tokenizer,
            split,
            processor,
-            preprocess_iterable,
+            streaming,
        )
    return dataset, prompters
@@ -316,7 +314,7 @@ def _load_raw_datasets(
    tokenizer: PreTrainedTokenizer,
    split: str,
    processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset, list[Prompter | None]]:
    """Load, process, merge, and save raw datasets."""
    LOG.info("Loading raw datasets...", main_process_only=False)
@@ -337,7 +335,7 @@ def _load_raw_datasets(
            split=split,
            seed=cfg.seed,
            processor=processor,
-            preprocess_iterable=preprocess_iterable,
+            streaming=streaming,
        )
        datasets.append(dataset_wrapper)
        prompters.append(dataset_prompter)
@@ -345,23 +343,19 @@ def _load_raw_datasets(
    # Merge datasets
    dataset = merge_datasets(datasets, cfg)
-    if not cfg.skip_prepare_dataset:
+    if not cfg.skip_prepare_dataset and not streaming:
        if split == "test" and cfg.eval_sequence_len:
            dataset = handle_long_seq_in_dataset(dataset, cfg.eval_sequence_len, cfg)
        else:
            dataset = handle_long_seq_in_dataset(dataset, cfg.sequence_len, cfg)
-
-        # Skip packing processing for streaming datasets - they handle it differently
-        if cfg.sample_packing and not isinstance(dataset, IterableDataset):
+        if cfg.sample_packing:
            dataset, _ = process_datasets_for_packing(cfg, dataset, None)
-        # Skip saving for streaming datasets as they can't be cached
-        if not isinstance(dataset, IterableDataset):
-            # Save the prepared dataset
-            dataset_hash = generate_dataset_hash_from_config(
-                cfg, datasets_configs, tokenizer.name_or_path
-            )
-            save_preprocessed_dataset(cfg, dataset, dataset_hash, split)
+        # Save the prepared dataset
+        dataset_hash = generate_dataset_hash_from_config(
+            cfg, datasets_configs, tokenizer.name_or_path
+        )
+        save_preprocessed_dataset(cfg, dataset, dataset_hash, split)
    return dataset, prompters
@@ -373,21 +367,19 @@ def _load_and_process_single_dataset(
    split: str,
    seed: int,
    processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | IterableDataset, Prompter | None]:
    """Load and process a single dataset based on the passed config."""
    # Load the dataset
-    # Use streaming if enabled in config or if using iterable preprocessing
-    use_streaming = cfg.streaming or preprocess_iterable
    dataset = load_dataset_with_config(
-        dataset_config, cfg.hf_use_auth_token, streaming=use_streaming
+        dataset_config, cfg.hf_use_auth_token, streaming=streaming
    )
    # Parse dataset type
    d_base_type, d_prompt_style = _parse_dataset_type(dataset_config.type)
    # Select the appropriate split
-    if isinstance(dataset, DatasetDict):
+    if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
        if dataset_config.split and dataset_config.split in dataset:
            dataset = dataset[dataset_config.split]
        elif split in dataset:
@@ -405,63 +397,16 @@ def _load_and_process_single_dataset(
            num_shards=dataset_config.shards, index=shards_idx
        )
-    # For streaming datasets, we need to handle tokenization differently
-    if isinstance(dataset, IterableDataset):
-        # Use pretraining's approach for multipack streaming
-        if cfg.sample_packing:
-            # Create the dataset wrapper function once
-            def ds_wrapper_fn(dataset=None):
-                wrapped_dataset, prompter = get_dataset_wrapper(
-                    dataset_config=dataset_config,
-                    tokenizer=tokenizer,
-                    cfg=cfg,
-                    dataset_base_type=d_base_type,
-                    dataset=dataset,
-                    dataset_prompt_style=d_prompt_style,
-                    processor=processor,
-                )
-                return wrapped_dataset, prompter
-            # Use pretraining wrapper for efficient streaming SFT with packing
-            from axolotl.utils.data.pretraining import wrap_pretraining_dataset
-            dataset_wrapper = wrap_pretraining_dataset(
-                dataset,
-                tokenizer,
-                cfg,
-                ds_wrapper_fn,
-                max_tokens=cfg.sequence_len,
-                batch_size=cfg.micro_batch_size,
-                seed=cfg.seed,
-                buffer_size=cfg.pretrain_multipack_buffer_size,
-            )
-        else:
-            # Use regular streaming wrapper
-            dataset_wrapper = wrap_streaming_sft_dataset(
-                dataset,
-                tokenizer,
-                cfg,
-                dataset_config,
-                d_base_type,
-                d_prompt_style,
-                processor,
-                max_tokens=cfg.sequence_len,
-                buffer_size=10_000,
-            )
-        # For streaming, we don't have a specific prompter
-        dataset_prompter = None
-    else:
-        # Apply dataset wrapper for regular datasets
-        dataset_wrapper, dataset_prompter = get_dataset_wrapper(
-            dataset_config=dataset_config,
-            tokenizer=tokenizer,
-            cfg=cfg,
-            dataset_base_type=d_base_type,
-            dataset=dataset,
-            dataset_prompt_style=d_prompt_style,
-            processor=processor,
-        )
+    # Apply dataset wrapper
+    dataset_wrapper, dataset_prompter = get_dataset_wrapper(
+        dataset_config=dataset_config,
+        tokenizer=tokenizer,
+        cfg=cfg,
+        dataset_base_type=d_base_type,
+        dataset=dataset,
+        dataset_prompt_style=d_prompt_style,
+        processor=processor,
+    )
    return dataset_wrapper, dataset_prompter
@@ -540,7 +485,7 @@ def _load_and_prepare_datasets(
    cfg: DictDefault,
    split: Literal["train", "test"] = "train",
    processor: ProcessorMixin | None = None,
-    preprocess_iterable: bool = False,
+    streaming: bool = False,
 ) -> tuple[Dataset | None, Dataset | None, list[Prompter | None]]:
    """Load and prepare datasets with optional validation split and sharding.
@@ -549,7 +494,7 @@ def _load_and_prepare_datasets(
        cfg: Configuration object.
        split: Dataset split to load ('train' or 'test').
        processor: Optional processor for multimodal datasets.
-        preprocess_iterable: Whether to use iterable preprocessing.
+        streaming: Whether to load the dataset in streaming mode.
    Returns:
        Tuple of (train_dataset, eval_dataset, prompters).
@@ -560,7 +505,7 @@ def _load_and_prepare_datasets(
        cfg,
        split=split,
        processor=processor,
-        preprocess_iterable=preprocess_iterable,
+        streaming=streaming,
    )
    # Apply dataset sharding if configured using shared function
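The branching that the sft.py hunks above introduce can be summarized in a small standalone sketch. This is not the axolotl implementation: `pick_streaming_path`, `resolve_total_steps` and `default_split` are hypothetical names, a `SimpleNamespace` stands in for the real config object, and a plain dict stands in for the dataset config.

```python
from types import SimpleNamespace


def pick_streaming_path(cfg) -> str:
    """Name the branch the diff's if/elif/else would take (hypothetical helper)."""
    if getattr(cfg, "pretraining_dataset", None):
        return "pretraining"  # streaming loader with the pretraining config
    if getattr(cfg, "sample_packing", False):
        return "packed_sft"  # streaming loader with the first SFT dataset
    return "legacy"  # pre-existing loader called with streaming=True


def resolve_total_steps(cfg) -> int:
    """Mirror `cfg.max_steps if cfg.max_steps else -1` from the diff."""
    max_steps = getattr(cfg, "max_steps", None)
    return max_steps if max_steps else -1


def default_split(dataset_config: dict) -> dict:
    """Return a copy whose `split` falls back to "train" when missing or empty."""
    resolved = dict(dataset_config)
    if not resolved.get("split"):
        resolved["split"] = "train"
    return resolved


if __name__ == "__main__":
    cfg = SimpleNamespace(pretraining_dataset=None, sample_packing=True, max_steps=None)
    print(pick_streaming_path(cfg), resolve_total_steps(cfg))
```

Returning `-1` when `max_steps` is unset reflects that a streamed dataset has no known length, so the step count cannot be derived from it.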
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -236,11 +236,9 @@ def _load_from_local_path(
        try:
            return load_from_disk(dataset_config.path)
        except FileNotFoundError:
-            load_dataset_kwargs["streaming"] = False
            return load_dataset(dataset_config.path, **load_dataset_kwargs)
    elif local_path.is_file():
        dataset_type = get_dataset_type(dataset_config)
-        load_dataset_kwargs["streaming"] = False
        return load_dataset(
            dataset_type,
            data_files=dataset_config.path,
@@ -524,9 +522,7 @@ def generate_dataset_hash_from_config(
    return str(md5(config_str))
-def merge_datasets(
-    datasets: list[Dataset | IterableDataset], cfg: DictDefault
-) -> Dataset | IterableDataset:
+def merge_datasets(datasets: list[Dataset], cfg: DictDefault) -> Dataset:
    """Merge multiple datasets into one with optional shuffling.
    Args:
@@ -536,41 +532,6 @@ def merge_datasets(
    Returns:
        Merged dataset.
    """
-    # Check if we're dealing with streaming datasets
-    if any(isinstance(ds, IterableDataset) for ds in datasets):
-        # All datasets must be streaming for merging
-        if not all(isinstance(ds, IterableDataset) for ds in datasets):
-            raise ValueError(
-                "Cannot mix streaming and non-streaming datasets. "
-                "Either all datasets must be streaming or none."
-            )
-        if len(datasets) == 1:
-            ds = datasets[0]
-            # Streaming datasets handle shuffling differently
-            if cfg.shuffle_merged_datasets and not cfg.curriculum_sampling:
-                return ds.shuffle(seed=cfg.seed, buffer_size=10_000)
-            return ds
-        # Merge streaming datasets
-        LOG.info("Merging streaming datasets...")
-        from datasets import interleave_datasets
-        # For streaming, we interleave datasets instead of concatenating
-        merged_dataset = interleave_datasets(datasets)
-        if cfg.shuffle_merged_datasets:
-            LOG.debug("Shuffling merged streaming datasets...")
-            if cfg.curriculum_sampling:
-                LOG.warning(
-                    "Shuffling merged datasets with curriculum sampling is not recommended. "
-                    "This will randomize the order of samples."
-                )
-            merged_dataset = merged_dataset.shuffle(seed=cfg.seed, buffer_size=10_000)
-        return merged_dataset
-    # Original logic for non-streaming datasets
    if len(datasets) == 1:
        ds = datasets[0]
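The streaming branch of `merge_datasets` shown in the hunk above shuffled with `buffer_size=10_000`. To illustrate why a buffer is involved when the data cannot be indexed, here is a minimal pure-Python sketch of buffer-based shuffling over an iterator. `buffered_shuffle` is a hypothetical name, and the code approximates the general technique rather than reproducing the `datasets` library's implementation.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(20), buffer_size=5, seed=0)))
```

A larger buffer gives a shuffle closer to uniform at the cost of memory, which is the trade-off a buffer-size setting exposes.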
--- a/src/axolotl/utils/data/streaming.py
+++ b/src/axolotl/utils/data/streaming.py
@@ -1,150 +1,301 @@
-"""Utilities for handling streaming datasets."""
+"""Data handling specific to streaming datasets."""
 import functools
 from collections import defaultdict
-from typing import Any, Dict, List
+from typing import Callable, Dict, List, Optional
-import numpy as np
+import torch
-from datasets import Dataset, IterableDataset
+from datasets import Dataset
 from torch.utils.data import RandomSampler
 from transformers import PreTrainedTokenizerBase
-from axolotl.utils.collators import DataCollatorForSeq2Seq
+from axolotl.utils.collators import PretrainingBatchSamplerDataCollatorForSeq2Seq
 from axolotl.utils.logging import get_logger
 from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
-from axolotl.utils.trainer import add_position_ids
+from axolotl.utils.trainer import process_pretraining_datasets_for_packing
 LOG = get_logger(__name__)
-def wrap_streaming_sft_dataset(
+def encode_streaming(
-    dataset: IterableDataset,
+    examples: Dict[str, List],
    tokenizer: PreTrainedTokenizerBase,
    max_tokens: int,
    text_column: str = "text",
    concatenate: bool = True,
 ) -> Dict[str, List]:
    res = tokenizer(
        examples[text_column],
        truncation=True,
        max_length=max_tokens - 2,
        add_special_tokens=True,
    )
    # Convert to PyTorch tensors
    input_ids = [torch.tensor(seq) for seq in res["input_ids"]]
    targets = [torch.tensor(seq) for seq in res["input_ids"]]
    attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]]
    if not concatenate:
        return {
            "input_ids": [seq.tolist() for seq in input_ids],
            "labels": [seq.tolist() for seq in targets],
            "attention_mask": [seq.tolist() for seq in attention_mask],
        }
    new_input_ids = []
    new_labels = []
    new_attention_mask = []
    # Append EOS and PAD tokens to input_ids, and correct attention_mask
    for i, _ in enumerate(input_ids):
        input_ids[i] = torch.cat(
            (
                input_ids[i],
                torch.tensor([tokenizer.eos_token_id, tokenizer.pad_token_id]),
            ),
            dim=0,
        )
        targets[i] = torch.cat(
            (
                targets[i],
                torch.tensor([tokenizer.eos_token_id, -100]),
            ),
            dim=0,
        )
        attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0)
    # Concatenate tokens so that their lengths are less than max_tokens
    buffer_input_ids = torch.tensor([], dtype=torch.long)
    buffer_labels = torch.tensor([], dtype=torch.long)
    buffer_attention_mask = torch.tensor([], dtype=torch.long)
    for ids, labels, mask in zip(input_ids, targets, attention_mask, strict=False):
        if buffer_input_ids.numel() == max_tokens:
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        elif buffer_input_ids.numel() + ids.numel() <= max_tokens:
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        else:
            buffer_input_ids = torch.cat(
                (
                    buffer_input_ids,
                    torch.full(
                        (max_tokens - buffer_input_ids.numel(),),
                        tokenizer.pad_token_id,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
                    torch.full(
                        (max_tokens - buffer_attention_mask.numel(),),
                        0,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
    if buffer_input_ids.numel() > 0:  # for any leftover tokens
        while buffer_input_ids.numel() < max_tokens:  # make all sequences equal in size
            buffer_input_ids = torch.cat(
                (
                    buffer_input_ids,
                    torch.full(
                        (max_tokens - buffer_input_ids.numel(),),
                        tokenizer.pad_token_id,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
                    torch.full(
                        (max_tokens - buffer_attention_mask.numel(),),
                        0,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
        new_input_ids.append(buffer_input_ids)
        new_labels.append(buffer_labels)
        new_attention_mask.append(buffer_attention_mask)
    ret = {
        "input_ids": [seq.tolist() for seq in new_input_ids],
        "labels": [seq.tolist() for seq in new_labels],
        "attention_mask": [seq.tolist() for seq in new_attention_mask],
    }
    LOG.debug(len(ret["input_ids"]))
    return ret
 def wrap_streaming_dataset(
    dataset,
    tokenizer,
    cfg,
-    dataset_config,
+    ds_wrapper_fn,
-    d_base_type: str,
+):
    d_prompt_style: str | None,
    processor: Any | None,
    max_tokens: int = 2048,
    buffer_size: int = 10_000,
 ) -> IterableDataset:
    """
    Wrap a streaming SFT dataset with tokenization and optional packing.
    This is similar to wrap_pretraining_dataset but for SFT datasets.
    Args:
        dataset: The streaming dataset to wrap
        tokenizer: Tokenizer to use
        cfg: Configuration object
        dataset_config: Dataset configuration
        d_base_type: Base dataset type
        d_prompt_style: Prompt style
        processor: Optional processor for multimodal
        max_tokens: Maximum sequence length
        buffer_size: Buffer size for shuffling
    Returns:
        Wrapped streaming dataset ready for training
    """
    # Import here to avoid circular imports
    from axolotl.utils.data.wrappers import get_dataset_wrapper
    # Apply shuffling if configured
    if cfg.shuffle_merged_datasets:
        LOG.info(f"Shuffling streaming dataset with buffer_size={buffer_size}")
        dataset = dataset.shuffle(seed=cfg.seed, buffer_size=buffer_size)
    # For streaming datasets, we need to get column names from the first sample
    remove_columns = []
    for first_row in dataset:
        remove_columns = list(first_row.keys())
        break
    # Reset dataset after peeking
    if cfg.shuffle_merged_datasets:
        dataset = dataset.shuffle(seed=cfg.seed, buffer_size=buffer_size)
    # Define the encoding function - always add position_ids for compatibility
    if cfg.sample_packing:
-        # For sample packing, we need to handle position_ids
+        # For SFT (non-pretraining) datasets, always use multipack_attn=True to ensure
-        def encode_streaming_packed(examples: Dict[str, List]) -> Dict[str, List]:
+        # attention isolation between packed sequences
-            """Encode examples for streaming with sample packing."""
+        multipack_attn = (
-            # Convert the batch dict to a temporary Dataset for processing
+            True if not cfg.pretraining_dataset else cfg.pretrain_multipack_attn
-            temp_dataset = Dataset.from_dict(examples)
+        )
-            # Apply the dataset wrapper to tokenize
+        collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq(
-            wrapped_dataset, _ = get_dataset_wrapper(
+            tokenizer,
-                dataset_config=dataset_config,
+            return_tensors="pt",
-                tokenizer=tokenizer,
+            padding=True,
-                cfg=cfg,
+            pad_to_multiple_of=cfg.sequence_len,
-                dataset_base_type=d_base_type,
+            multipack_attn=multipack_attn,
-                dataset=temp_dataset,
+        )
-                dataset_prompt_style=d_prompt_style,
+        encode = functools.partial(
-                processor=processor,
+            encode_packed_streaming,
-            )
+            collate_fn,
            ds_wrapper_fn,
            max_seq_length=cfg.sequence_len,
            batch_size=cfg.micro_batch_size,
            multipack_attn=multipack_attn,
        )
-            # Convert to dict for processing
+        # Set this to 1 so downstream data_loader doesn't try to increase the batch size
-            result = {}
+        # again
-            if hasattr(wrapped_dataset, "to_dict"):
+        cfg.micro_batch_size = 1
                result = wrapped_dataset.to_dict()
            else:
                for key in wrapped_dataset.column_names:
                    result[key] = wrapped_dataset[key]
            # Add position_ids using the existing function
            result = add_position_ids(result)
            # For multipack attention, we may need to drop attention_mask
            if cfg.pretrain_multipack_attn and "attention_mask" in result:
                del result["attention_mask"]
            return result
        encode_fn = encode_streaming_packed
    else:
-        # Regular encoding without packing - still add position_ids for compatibility
+        # NOTE: This is not reachable for SFT datasets since we use the pre-existing
-        def encode_streaming(examples: Dict[str, List]) -> Dict[str, List]:
+        # loading function for non-packed streaming datasets. Refer to
-            """Encode examples for streaming."""
+        # _prepare_streaming_datasets in sft.py for that code path.
-            # Convert the batch dict to a temporary Dataset for processing
+        text_column = (
-            temp_dataset = Dataset.from_dict(examples)
+            getattr(cfg.pretraining_dataset[0], "text_column", "text") or "text"
        )
        encode = functools.partial(
            encode_streaming,
            tokenizer=tokenizer,
            max_tokens=cfg.sequence_len,
            text_column=text_column,
            concatenate=cfg.pretraining_sample_concatenation is True,
        )
-            # Apply the dataset wrapper to tokenize
+    if cfg.shuffle_merged_datasets:
-            wrapped_dataset, _ = get_dataset_wrapper(
+        dataset = dataset.shuffle(
-                dataset_config=dataset_config,
+            seed=cfg.seed, buffer_size=cfg.streaming_multipack_buffer_size
-                tokenizer=tokenizer,
+        )
-                cfg=cfg,
+    else:
-                dataset_base_type=d_base_type,
+        LOG.debug("NOT shuffling merged pretraining datasets")
                dataset=temp_dataset,
                dataset_prompt_style=d_prompt_style,
                processor=processor,
            )
-            # Convert to dict format
+    # remove all the existing columns after mapping since they end up having
-            result = {}
+    # a different length than the encoded/tokenized column
-            if hasattr(wrapped_dataset, "to_dict"):
+    # this is empty during streaming/pretraining
-                result = wrapped_dataset.to_dict()
+    remove_columns = []
-            else:
+    if dataset.features is None:
-                for key in wrapped_dataset.column_names:
+        for first_row in dataset:
-                    result[key] = wrapped_dataset[key]
+            remove_columns = list(first_row.keys())
            break
    else:
        remove_columns = list(dataset.features.keys())
            # Add position_ids even without packing for compatibility
            result = add_position_ids(result)
            return result
        encode_fn = encode_streaming
    # Map the encoding function over the streaming dataset
    dataset = dataset.map(
-        encode_fn,
+        encode,
        batched=True,
-        batch_size=buffer_size,
+        batch_size=cfg.streaming_multipack_buffer_size,
        remove_columns=remove_columns,
    )
    # Set format for PyTorch
    dataset = dataset.with_format("torch")
    return dataset
 def encode_packed_streaming(
    collate_fn,
    ds_wrapper: Callable,
    examples: Dict[str, List],
    max_seq_length: int = 2048,
    batch_size: int = 4,
    multipack_attn: Optional[bool] = True,
 ) -> Dict[str, List]:
    # tokenize all the examples
    # rows get split with stride (overlap)
    train_dataset = ds_wrapper(dataset=Dataset.from_dict(examples))[0]
    train_dataset = process_pretraining_datasets_for_packing(
        train_dataset,
        max_seq_length,
        skip_position_ids=not multipack_attn,
        # FIXME using attention mask unpad/pad with trainer and packed pretraining is broken atm
        # workaround by using the position id logic for now in trainer
        drop_attention_mask=multipack_attn,
    )
    sampler = MultipackBatchSampler(
        sampler=RandomSampler(train_dataset),
        lengths=get_dataset_lengths(train_dataset),
        batch_size=1,
        batch_max_len=batch_size * max_seq_length,
        drop_last=True,
        num_processes=1,
    )
    chunked_data = defaultdict(list)
    for batch in sampler:
        for data in batch:
            features = train_dataset[data]
            if "num_truncated_tokens" in features:
                del features["num_truncated_tokens"]
            if "overflow_to_sample_mapping" in features:
                del features["overflow_to_sample_mapping"]
            if "labels" not in features:
                features["labels"] = features["input_ids"].copy()
            collated_features = collate_fn(features)
            for feature in features.keys():
                if feature == "length":
                    continue
                chunked_data[feature].append(collated_features[feature].squeeze(0))
    return chunked_data
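The concatenation loop in `encode_streaming` (earlier in this file's hunk) greedily appends tokenized sequences to a buffer and pads the buffer to `max_tokens` whenever the next sequence would not fit. A simplified pure-Python sketch of that idea follows. It handles only token ids (the diff also pads labels with `-100` and the attention mask with `0`), and `pack_greedy` is a hypothetical name, not a function in the repository.

```python
from typing import List


def pack_greedy(sequences: List[List[int]], max_tokens: int, pad_id: int) -> List[List[int]]:
    """Concatenate sequences in order into rows of exactly `max_tokens` ids."""
    rows: List[List[int]] = []
    buffer: List[int] = []
    for seq in sequences:
        if len(seq) > max_tokens:
            # The diff truncates at tokenization time; this sketch just rejects.
            raise ValueError("sequence longer than max_tokens")
        if len(buffer) + len(seq) > max_tokens:
            # The next sequence does not fit: pad the current row and flush it.
            rows.append(buffer + [pad_id] * (max_tokens - len(buffer)))
            buffer = []
        buffer = buffer + seq
    if buffer:
        # Pad and flush any leftover tokens so every row has equal length.
        rows.append(buffer + [pad_id] * (max_tokens - len(buffer)))
    return rows


if __name__ == "__main__":
    print(pack_greedy([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_tokens=6, pad_id=0))
```

Because sequences are never split across rows, the amount of padding depends on input order; the `MultipackBatchSampler` path in `encode_packed_streaming` is the diff's alternative for packed training.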
--- a/src/axolotl/utils/data/utils.py
+++ b/src/axolotl/utils/data/utils.py
@@ -178,8 +178,8 @@ def truncate_long_seq(sample, sequence_len=2048, min_sequence_len=2):
 def handle_long_seq_in_dataset(
-    dataset: Dataset | IterableDataset, sequence_len: int, cfg: DictDefault
+    dataset: Dataset, sequence_len: int, cfg: DictDefault
-) -> Dataset | IterableDataset:
+) -> Dataset:
    """Remove sequences longer than configured maximum from dataset.
    Args:
@@ -190,19 +190,21 @@ def handle_long_seq_in_dataset(
    Returns:
        Filtered dataset with long sequences removed.
    """
-    # Streaming datasets don't support filtering the same way
-    if isinstance(dataset, IterableDataset):
-        LOG.info(
-            "Streaming dataset detected - long sequence filtering will be done on-the-fly"
-        )
-        return dataset
-    if not hasattr(dataset, "column_names") or "input_ids" not in dataset.column_names:
+    if (
+        hasattr(dataset, "column_names")
+        and dataset.column_names
+        and "input_ids" not in dataset.column_names
+    ):
        LOG.warning(
            "Dataset does not contain 'input_ids' column. Skip drop long seq. This is "
            "expected for reward modeling."
        )
        return dataset
+    elif not hasattr(dataset, "column_names") or dataset.column_names is None:
+        LOG.info(
+            "Dataset is streaming (IterableDataset), skipping long sequence handling"
+        )
+        return dataset
    drop_long = functools.partial(
        drop_long_seq,
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -138,6 +138,12 @@ class AxolotlInputConfig(
            "description": "Process reward modelling: `True` or `False`"
        },
    )
+    center_rewards_coefficient: float | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "Coefficient to incentivize the reward model to output mean-zero rewards (proposed by https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: `0.01`."
+        },
+    )
    num_labels: int | None = None
    # Whether to use weighting in DPO trainer.
    # If `None`, default is `False` in the trainer.
@@ -244,12 +250,6 @@ class AxolotlInputConfig(
    dataloader_num_workers: int | None = None
    dataloader_prefetch_factor: int | None = None
    dataloader_drop_last: bool | None = None
-    streaming: bool | None = Field(
-        default=None,
-        json_schema_extra={
-            "description": "Enable streaming mode for training datasets to reduce memory usage and enable training on datasets larger than memory"
-        },
-    )
    accelerator_config: dict[str, Any] | None = None
@@ -481,12 +481,6 @@ class AxolotlInputConfig(
        },
    )
    multipack_real_batches: bool | None = None
-    pretraining_sample_concatenation: bool | None = Field(
-        default=None,
-        json_schema_extra={
-            "description": "whether to concatenate samples during pretraining",
-        },
-    )
    batch_flattening: Literal["auto"] | bool | None = Field(
        default=None,
@@ -501,13 +495,34 @@ class AxolotlInputConfig(
    pose_max_context_len: int | None = None
    pose_num_chunks: int | None = None
-    pretrain_multipack_buffer_size: int | None = 10_000
+    # Deprecated: Use streaming_multipack_buffer_size instead
+    pretrain_multipack_buffer_size: int | None = Field(
+        default=None,
+        deprecated="Deprecated in v0.13.0, will be removed in v0.14.0. Use streaming_multipack_buffer_size instead",
+    )
    pretrain_multipack_attn: bool | None = Field(
        default=True,
        json_schema_extra={
            "description": "whether to prevent cross attention for packed sequences during pretraining",
        },
    )
+    pretraining_sample_concatenation: bool | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "whether to concatenate samples during pretraining",
+        },
+    )
+    streaming: bool | None = Field(
+        default=None,
+        json_schema_extra={"description": "Use streaming mode for loading datasets"},
+    )
+    streaming_multipack_buffer_size: int | None = Field(
+        default=10_000,
+        json_schema_extra={
+            "description": "Buffer size for multipack streaming datasets"
+        },
+    )
    xformers_attention: bool | None = Field(
        default=None,
@@ -836,10 +851,15 @@ class AxolotlInputConfig(
    include_tokens_per_second: bool | None = Field(
        default=None,
        json_schema_extra={
-            "description": "bool of whether to include tokens trainer per second in the training metrics. This iterates over the entire dataset once, so it takes some time."
+            "description": "bool of whether to report tokens per second at the end of training. This is not supported with pre-training datasets."
        },
    )
    include_tkps: bool | None = Field(
        default=True,
        json_schema_extra={
            "description": "bool of whether to report tokens per second per-gpu during training by measuring throughput of non-padding tokens."
        },
    )
    neftune_noise_alpha: float | None = Field(
        default=None,
        json_schema_extra={
@@ -933,7 +953,15 @@ class AxolotlInputConfig(
        },
    )
-    fix_untrained_tokens: int | list[int] | None = None
+    fix_untrained_tokens: int | list[int] | None = Field(
        default=None,
        json_schema_extra={
            "description": (
                "Token index or indices to adjust embedding weights to the mean of the other tokens. "
                "This is useful when the model has untrained embeddings."
            )
        },
    )
    # INTERNALS - document for now, generally not set externally
    is_preprocess: bool | None = None
@@ -992,6 +1020,26 @@ class AxolotlInputConfig(
            return [ds_config.model_dump(exclude_none=True) for ds_config in ds_configs]
        return None
    @model_validator(mode="before")
    @classmethod
    def warn_peft_trainable_token_to_fix_untrained(cls, data):
        if (
            peft_trainable_token_indices := data.get("peft_trainable_token_indices")
        ) and (fix_untrained_tokens := data.get("fix_untrained_tokens")):
            if isinstance(fix_untrained_tokens, int):
                fix_untrained_tokens = (fix_untrained_tokens,)
            if isinstance(peft_trainable_token_indices, int):
                peft_trainable_token_indices = (peft_trainable_token_indices,)
            for untrained_token_id in fix_untrained_tokens:
                if untrained_token_id not in peft_trainable_token_indices:
                    LOG.warning_once(
                        f"Token {untrained_token_id} is fixed via `fix_untrained_tokens`, yet not in `peft_trainable_token_indices: ` list. "
                        "Please add it, otherwise the token won't be trained on."
                    )
        return data
 class AxolotlConfigWCapabilities(AxolotlInputConfig):
    """wrapper to validate GPU capabilities with the configured options"""
@@ -1265,3 +1313,14 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            data["dataset_processes"] = get_default_process_count()
        return data
    @model_validator(mode="before")
    @classmethod
    def check_deduplication_with_streaming(cls, data):
        if data.get("dataset_exact_deduplication") and (
            data.get("streaming") or data.get("pretraining_dataset")
        ):
            raise NotImplementedError(
                "dataset_exact_deduplication is not available for streaming datasets. "
            )
        return data
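Review note (a sketch, not part of the PR): the `warn_peft_trainable_token_to_fix_untrained` validator above wraps bare ints in tuples and warns for every `fix_untrained_tokens` id missing from `peft_trainable_token_indices`. The `peft.py` schema in this diff also allows a dict mapping an embedding layer name to its indices, and a plain `in` test on a dict compares against the layer names rather than the token ids. The standalone helper below, with the hypothetical name `missing_trainable_tokens`, shows one way to cover both shapes. It assumes every dict value is a list of ints.

```python
def missing_trainable_tokens(fix_untrained_tokens, peft_trainable_token_indices):
    """Return the fix_untrained_tokens ids that would not be trained.

    Hypothetical helper for illustration; it mirrors the validator's
    normalization but is not the PR's code.
    """
    # Normalize a bare int to a one-element tuple, as the validator does.
    if isinstance(fix_untrained_tokens, int):
        fix_untrained_tokens = (fix_untrained_tokens,)
    if isinstance(peft_trainable_token_indices, int):
        peft_trainable_token_indices = (peft_trainable_token_indices,)

    # Assumption: for the dict form, collect indices across all layers.
    if isinstance(peft_trainable_token_indices, dict):
        trainable = {
            idx
            for indices in peft_trainable_token_indices.values()
            for idx in indices
        }
    else:
        trainable = set(peft_trainable_token_indices)

    return [tok for tok in fix_untrained_tokens if tok not in trainable]


if __name__ == "__main__":
    print(missing_trainable_tokens([128001, 128002], [128001]))
    print(missing_trainable_tokens(7, {"embed_tokens": [7, 8]}))
```

An empty result means no warning would be needed under this reading.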
--- a/src/axolotl/utils/schemas/model.py
+++ b/src/axolotl/utils/schemas/model.py
@@ -59,16 +59,21 @@ class ModelInputConfig(BaseModel):
    processor_type: str | None = Field(
        default=None, json_schema_extra={"description": "transformers processor class"}
    )
    tokenizer_save_jinja_files: bool | None = Field(
        default=True,  # match the default behavior from transformers
        json_schema_extra={
            "description": "Whether to save jinja files for tokenizer, transformers default is True"
        },
    )
    trust_remote_code: bool | None = Field(
        default=None,
        json_schema_extra={"description": "Trust remote code for untrusted source"},
    )
    experimental_skip_move_to_device: bool | None = Field(
-        default=None,
+        default=True,
        json_schema_extra={
-            "description": "Don't move the model to the device before sharding. "
+            "description": "Don't move the model to the device before sharding. Set to `false` to revert to legacy behavior. "
            "This is an experimental feature that may be included in the future as the default."
        },
    )
--- a/src/axolotl/utils/schemas/peft.py
+++ b/src/axolotl/utils/schemas/peft.py
@@ -90,6 +90,16 @@ class LoraConfig(BaseModel):
            "description": "How to initialize LoRA weights. Default to True which is MS original implementation."
        },
    )
    peft_trainable_token_indices: list[int] | dict[str, list[int]] | None = Field(
        default=None,
        json_schema_extra={
            "description": (
                "A list of token indices to fine-tune on the `embed_tokens` layer.\n"
                "Otherwise, a dict mapping an embedding layer name to its trainable token indices.\n"
                "See https://huggingface.co/docs/peft/v0.17.0/en/developer_guides/lora#efficiently-train-tokens-alongside-lora"
            )
        },
    )
    qlora_sharded_model_loading: bool | None = Field(
        default=False,
--- a/src/axolotl/utils/schemas/validation.py
+++ b/src/axolotl/utils/schemas/validation.py
@@ -60,6 +60,20 @@ class DatasetValidationMixin:
            raise ValueError("either datasets or pretraining_dataset is required")
        return data
    @model_validator(mode="before")
    @classmethod
    def check_pretraining_streaming_deprecation(cls, data):
        # TODO(djsaunde): remove this check + implement change for 0.13.0 release
        if data.get("pretraining_dataset") and not data.get("streaming"):
            LOG.warning(
                "Setting `pretraining_dataset` without explicitly setting `streaming: "
                "true` is deprecated. In a future release, streaming will not be "
                "automatically enabled when using pretraining_dataset. Please "
                "explicitly set `streaming: true` in your configuration to maintain "
                "current behavior."
            )
        return data
    @model_validator(mode="before")
    @classmethod
    def check_push_ds_auth(cls, data):
@@ -340,6 +354,30 @@ class TrainingValidationMixin:
            )
        return data
    @model_validator(mode="before")
    @classmethod
    def check_multipack_buffer_size(cls, data):
        if data.get("pretrain_multipack_buffer_size") and not data.get(
            "streaming_multipack_buffer_size"
        ):
            LOG.warning(
                "`pretrain_multipack_buffer_size` is deprecated in v0.13.0, will be "
                "removed in v0.14.0. Use `streaming_multipack_buffer_size` instead."
            )
            data["streaming_multipack_buffer_size"] = data[
                "pretrain_multipack_buffer_size"
            ]
            del data["pretrain_multipack_buffer_size"]
        elif data.get("pretrain_multipack_buffer_size") and data.get(
            "streaming_multipack_buffer_size"
        ):
            raise ValueError(
                "pretrain_multipack_buffer_size is deprecated, use "
                "streaming_multipack_buffer_size; both are set, please remove the "
                "deprecated pretrain_multipack_buffer_size setting"
            )
        return data
    @model_validator(mode="after")
    def check_fft_possible_bad_config(self):
        if (
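Review note (a sketch, not part of the PR): the `check_multipack_buffer_size` hunk above renames the deprecated key when only the old one is set and rejects configs that set both. The same rule can be tried in isolation with the function below. `migrate_buffer_size` is a hypothetical name, it works on a plain dict, and it uses `warnings.warn` where the PR uses `LOG.warning`.

```python
import warnings

OLD_KEY = "pretrain_multipack_buffer_size"
NEW_KEY = "streaming_multipack_buffer_size"


def migrate_buffer_size(data: dict) -> dict:
    """Return a copy of `data` with the deprecated key renamed.

    Illustrative stand-in for the validator; not the PR's code.
    """
    old = data.get(OLD_KEY)
    new = data.get(NEW_KEY)

    # Both keys set: the intent is ambiguous, so refuse.
    if old and new:
        raise ValueError(
            f"{OLD_KEY} is deprecated, use {NEW_KEY}; both are set, "
            f"please remove the deprecated {OLD_KEY} setting"
        )

    # Nothing to migrate.
    if not old:
        return dict(data)

    warnings.warn(
        f"`{OLD_KEY}` is deprecated, use `{NEW_KEY}` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    migrated = {key: value for key, value in data.items() if key != OLD_KEY}
    migrated[NEW_KEY] = old
    return migrated


if __name__ == "__main__":
    print(migrate_buffer_size({OLD_KEY: 5000, "sequence_len": 256}))
```

Returning a copy instead of mutating the input is a choice made for this sketch; the validator in the diff edits `data` in place.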
@@ -1076,20 +1114,46 @@ class PretrainingValidationMixin:
    @model_validator(mode="before")
    @classmethod
-    def check_pretraining_split_batches_accelerate(cls, data):
-        # alternatively set ACCELERATE_SPLIT_BATCHES=False
-        if data.get("streaming"):
-            accelerator_config = data.get("accelerator_config", {})
-            if not accelerator_config:
-                data["accelerator_config"] = {
-                    "split_batches": False,
-                    "dispatch_batches": False,
-                }
-            else:
-                if accelerator_config.get("split_batches") is None:
-                    data["accelerator_config"]["split_batches"] = False
-                if accelerator_config.get("dispatch_batches") is None:
-                    data["accelerator_config"]["dispatch_batches"] = False
+    def check_pretraining_w_val_set_size(cls, data):
+        if data.get("pretraining_dataset") and data.get("val_set_size"):
+            raise ValueError(
+                "val_set_size is not supported with pretraining_dataset. "
+                "Use test_datasets to specify evaluation datasets for pretraining."
+            )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_streaming_w_val_set_size(cls, data):
+        if data.get("streaming") and data.get("val_set_size"):
+            raise ValueError(
+                "val_set_size is not supported with streaming datasets. "
                "Use test_datasets to specify evaluation datasets when streaming is enabled."
            )
        return data
    @model_validator(mode="before")
    @classmethod
    def check_streaming_w_max_steps(cls, data):
        if data.get("streaming") and not data.get("max_steps"):
            raise ValueError(
                "max_steps must be set when using streaming datasets. "
                "Trainer cannot infer dataset length for iterable datasets."
            )
        return data
    @model_validator(mode="before")
    @classmethod
    def check_streaming_w_multiple_datasets(cls, data):
        if (
            data.get("streaming")
            and data.get("sample_packing")
            and data.get("datasets")
            and len(data.get("datasets")) > 1
        ):
            raise NotImplementedError(
                "Sample packing with multiple streaming datasets is not yet supported"
            )
        return data
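Review note (a sketch, not part of the PR): the streaming-related validators in this file and in `config.py` each raise on the first problem they see. To preview every constraint a config would trip, the checks can be gathered into one function that returns messages instead of raising. `streaming_config_errors` is a hypothetical helper operating on a plain dict; the conditions are copied from the validators shown in this diff.

```python
def streaming_config_errors(data: dict) -> list[str]:
    """Collect the streaming constraint violations found in `data`.

    Hypothetical helper for illustration; the PR raises from separate
    pydantic validators instead of returning a list.
    """
    errors = []
    streaming = bool(data.get("streaming"))
    pretraining = bool(data.get("pretraining_dataset"))

    if pretraining and data.get("val_set_size"):
        errors.append("val_set_size is not supported with pretraining_dataset")
    if streaming and data.get("val_set_size"):
        errors.append("val_set_size is not supported with streaming datasets")
    if streaming and not data.get("max_steps"):
        # The trainer cannot infer a length for an iterable dataset.
        errors.append("max_steps must be set when using streaming datasets")
    if (
        streaming
        and data.get("sample_packing")
        and len(data.get("datasets") or []) > 1
    ):
        errors.append(
            "sample packing with multiple streaming datasets is not yet supported"
        )
    if data.get("dataset_exact_deduplication") and (streaming or pretraining):
        errors.append(
            "dataset_exact_deduplication is not available for streaming datasets"
        )
    return errors


if __name__ == "__main__":
    cfg = {"streaming": True, "val_set_size": 0.05, "datasets": [{"path": "a"}]}
    for message in streaming_config_errors(cfg):
        print(message)
```

A `val_set_size` of `0.0` is falsy, so a config like the one in `tests/e2e/test_streaming.py` passes these checks.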
--- a/src/axolotl/utils/trainer.py
+++ b/src/axolotl/utils/trainer.py
@@ -475,7 +475,9 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
                train_dataset.remove_columns(["length"]),
                batch_sampler=sampler,
            )
-            data_loader_len = len(data_loader) * cfg.micro_batch_size // cfg.batch_size
+            data_loader_len = max(
                1, len(data_loader) * cfg.micro_batch_size // cfg.batch_size
            )
            LOG.debug(f"data_loader_len: {data_loader_len}")
            # FIXME: is there a bug here somewhere? the total num steps depends
            # on the agreed on value for sample_packing_eff_est
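Review note (a worked example, not part of the PR): the clamp in the hunk above matters because the step count uses integer division. With few batches and a large `batch_size`, the quotient floors to 0. The function below reproduces only that arithmetic; its name and plain integer arguments are illustrative stand-ins for `len(data_loader)`, `cfg.micro_batch_size`, and `cfg.batch_size`.

```python
def data_loader_len(num_batches: int, micro_batch_size: int, batch_size: int) -> int:
    """Step count with the same max(1, ...) clamp as the hunk above."""
    return max(1, num_batches * micro_batch_size // batch_size)


if __name__ == "__main__":
    # 3 batches of size 1 against a batch size of 8:
    # 3 * 1 // 8 floors to 0 without the clamp.
    print(3 * 1 // 8)
    print(data_loader_len(3, 1, 8))
```

Larger inputs are unaffected: 100 batches with the same sizes still give `100 * 1 // 8 = 12`.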
@@ -547,6 +549,13 @@ def setup_deepspeed_env(cfg, stage=None):
        if stage == 3:
            os.environ["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = "true"
    device_count = torch.cuda.device_count()
    if device_count == 1:
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("LOCAL_RANK", "0")
        os.environ.setdefault("MASTER_ADDR", "0.0.0.0")  # nosec B104
        os.environ.setdefault("MASTER_PORT", "29500")
    # NOTE(djsaunde): The distributed state cannot be initialized prior to the
    # ACCELERATE_USE_DEEPSPEED assignment, but it must be initialized some time prior
    # to model load.
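Review note (a sketch, not part of the PR): the single-GPU branch in the `setup_deepspeed_env` hunk uses `setdefault`, so values already present in the environment take precedence over the fallbacks. The snippet below shows that behavior on a plain dict. `apply_single_gpu_defaults` is a hypothetical name, and a dict is passed in so the example does not touch the real `os.environ`.

```python
SINGLE_GPU_DEFAULTS = {
    "WORLD_SIZE": "1",
    "LOCAL_RANK": "0",
    "MASTER_ADDR": "0.0.0.0",
    "MASTER_PORT": "29500",
}


def apply_single_gpu_defaults(env: dict) -> dict:
    """Fill in missing distributed variables without overriding existing ones."""
    for key, value in SINGLE_GPU_DEFAULTS.items():
        # setdefault only writes when the key is absent.
        env.setdefault(key, value)
    return env


if __name__ == "__main__":
    env = {"MASTER_PORT": "12345"}  # a user-chosen port is kept
    print(apply_single_gpu_defaults(env))
```

Calling `apply_single_gpu_defaults(os.environ)` would have the same effect on the live process environment, since `os.environ` supports `setdefault`.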
--- a/tests/e2e/integrations/test_kd.py
+++ b/tests/e2e/integrations/test_kd.py
@@ -25,7 +25,7 @@ def min_cfg(temp_dir):
        "liger_rms_norm": True,
        "liger_glu_activation": True,
        "torch_compile": True,
-        "chat_template": "llama3",
+        "chat_template": "qwen3",
        "kd_trainer": True,
        "kd_ce_alpha": 0.1,
        "kd_alpha": 0.9,
--- a/tests/e2e/test_streaming.py
+++ b/tests/e2e/test_streaming.py
@@ -0,0 +1,73 @@
 """E2E tests for streaming dataset functionality"""
 # pylint: disable=duplicate-code
 import pytest
 from axolotl.common.datasets import load_datasets
 from axolotl.train import train
 from axolotl.utils.config import normalize_config, validate_config
 from axolotl.utils.dict import DictDefault
 from .utils import check_model_output_exists, check_tensorboard
 class TestStreamingDatasets:
    """Test case for streaming datasets"""
    @pytest.mark.parametrize(
        "sample_packing",
        [True, False],
    )
    def test_streaming_dataset(self, temp_dir, sample_packing):
        """Test streaming datasets"""
        cfg = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "flash_attention": True,
                "sequence_len": 1024,
                "sample_packing": sample_packing,
                "pretrain_multipack_attn": sample_packing,
                "streaming_multipack_buffer_size": 10000,
                "dataset_processes": 1,
                "special_tokens": {
                    "pad_token": "<|endoftext|>",
                },
                "datasets": [
                    {
                        "path": "mhenrichsen/alpaca_2k_test",
                        "type": "alpaca",
                    },
                ],
                # Streaming config
                "streaming": True,
                "max_steps": 3,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
                "save_safetensors": True,
                "bf16": "auto",
                "use_tensorboard": True,
                "save_first_step": False,
            }
        )
        cfg = validate_config(cfg)
        normalize_config(cfg)
        dataset_meta = load_datasets(cfg=cfg)
        train(cfg=cfg, dataset_meta=dataset_meta)
        check_model_output_exists(temp_dir, cfg)
        # Verify training actually happened by checking the train loss is below the threshold
        check_tensorboard(
            temp_dir + "/runs",
            "train/train_loss",
            3.0,
            "Train Loss (%s) is too high",
        )
--- a/tests/e2e/test_tokenizer.py
+++ b/tests/e2e/test_tokenizer.py
@@ -0,0 +1,63 @@
 """
 e2e test for saving the tokenizer
 """
 from unittest.mock import patch
 from axolotl.common.datasets import load_datasets
 from axolotl.train import train
 from axolotl.utils.config import normalize_config, validate_config
 from axolotl.utils.dict import DictDefault
 from tests.e2e.utils import check_model_output_exists
 def test_tokenizer_no_save_jinja_files(temp_dir):
    # pylint: disable=duplicate-code
    cfg = DictDefault(
        {
            "base_model": "HuggingFaceTB/SmolLM2-135M",
            "tokenizer_type": "AutoTokenizer",
            "sequence_len": 1024,
            "load_in_8bit": True,
            "adapter": "lora",
            "lora_r": 8,
            "lora_alpha": 16,
            "lora_dropout": 0.05,
            "lora_target_linear": True,
            "val_set_size": 0.02,
            "special_tokens": {
                "pad_token": "<|endoftext|>",
            },
            "chat_template": "chatml",
            "datasets": [
                {
                    "path": "mhenrichsen/alpaca_2k_test",
                    "type": "alpaca",
                },
            ],
            "num_epochs": 1,
            "micro_batch_size": 2,
            "gradient_accumulation_steps": 1,
            "output_dir": temp_dir,
            "learning_rate": 0.00001,
            "optimizer": "adamw_torch_fused",
            "lr_scheduler": "cosine",
            "max_steps": 5,
            "save_first_step": False,
            "fp16": False,
            "tokenizer_save_jinja_files": False,
        }
    )
    cfg = validate_config(cfg)
    normalize_config(cfg)
    dataset_meta = load_datasets(cfg=cfg)
    with patch("axolotl.train.execute_training"):
        train(cfg=cfg, dataset_meta=dataset_meta)
    check_model_output_exists(temp_dir, cfg)
    with open(f"{temp_dir}/tokenizer_config.json", "r", encoding="utf-8") as f:
        tokenizer_config = f.read()
        assert "chat_template" in tokenizer_config
--- a/tests/monkeypatch/test_trainer_loss_calc.py
+++ b/tests/monkeypatch/test_trainer_loss_calc.py
@@ -3,7 +3,6 @@
 import unittest
 from axolotl.monkeypatch.transformers.trainer_loss_calc import (
    check_evaluation_loop_is_fsdp2_patchable,
    check_evaluation_loop_is_patchable,
    check_maybe_log_save_evaluate_is_patchable,
 )
@@ -20,7 +19,6 @@ class TestTrainerLossCalc(unittest.TestCase):
        the patched code changes upstream.
        """
        assert check_evaluation_loop_is_patchable()
        assert check_evaluation_loop_is_fsdp2_patchable()
        assert check_maybe_log_save_evaluate_is_patchable()
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -6,7 +6,7 @@ import unittest
 from transformers import LlamaTokenizer
-from axolotl.utils.data import encode_pretraining, md5
+from axolotl.utils.data import encode_streaming, md5
 from tests.hf_offline_utils import enable_hf_offline
@@ -39,7 +39,7 @@ class TestEncodePretraining(unittest.TestCase):
                "hello, hello",
            ]
        }
-        result = encode_pretraining(self.tokenizer, self.max_tokens, examples)
+        result = encode_streaming(examples, self.tokenizer, self.max_tokens)
        self.assertEqual(len(result["input_ids"]), 3)
--- a/tests/test_packed_dataset.py
+++ b/tests/test_packed_dataset.py
@@ -1,16 +1,11 @@
 """Module for testing dataset sequence packing"""
 import unittest
 from pathlib import Path
 from datasets import Dataset, load_dataset
 from transformers import AutoTokenizer
 from axolotl.cli.args import TrainerCliArgs
 from axolotl.common.datasets import load_datasets
 from axolotl.datasets import ConstantLengthDataset, TokenizedPromptDataset
 from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
 from axolotl.prompters import AlpacaPrompter
 from axolotl.train import setup_model_and_trainer
 from axolotl.utils.config import normalize_config, validate_config
 from axolotl.utils.dict import DictDefault
@@ -35,43 +30,6 @@ class TestPacking(unittest.TestCase):
            }
        )
    def test_increments_attention(self):
        prompter = AlpacaPrompter("chat")
        strat = AlpacaPromptTokenizingStrategy(
            prompter,
            self.tokenizer,
            False,
            2048,
        )
        dateset = load_dataset(
            "json",
            data_files=str(Path(__file__).parent / "fixtures/alpaca/alpaca.json"),
        )["train"]
        dataset = Dataset.from_list(list(TokenizedPromptDataset(strat, dateset)))
        constant_len_dataset = ConstantLengthDataset(
            self.tokenizer,
            [dataset],
            seq_length=2048,
        )
        packed_dataset = Dataset.from_list(list(constant_len_dataset))
        example = packed_dataset[0]
        next_bos_index = (
            example["input_ids"][1:].index(self.tokenizer.bos_token_id) + 1
        )  # add one since we sliced
        # first example doesn't have mask reset
        assert example["input_ids"][0] == self.tokenizer.bos_token_id
        assert example["attention_mask"][0] == 1
        assert example["position_ids"][0] == 0
        assert example["position_ids"][1] == 1
        # but subsequent one does
        assert example["input_ids"][next_bos_index] == self.tokenizer.bos_token_id
        assert example["attention_mask"][next_bos_index] == 2
        assert example["position_ids"][next_bos_index] == 0
        assert example["position_ids"][next_bos_index + 1] == 1
    @with_temp_dir
    def test_lora_packing(self, temp_dir):
        cfg = DictDefault(
--- a/tests/test_packed_pretraining.py
+++ b/tests/test_packed_pretraining.py
@@ -9,7 +9,7 @@ import torch
 from datasets import IterableDataset
 from torch.utils.data import DataLoader
-from axolotl.utils.data import get_dataset_wrapper, wrap_pretraining_dataset
+from axolotl.utils.data import get_dataset_wrapper, wrap_streaming_dataset
 from axolotl.utils.dict import DictDefault
@@ -77,14 +77,11 @@ class TestPretrainingPacking:
        )
        original_bsz = cfg.micro_batch_size
-        train_dataset = wrap_pretraining_dataset(
+        train_dataset = wrap_streaming_dataset(
            dataset,
            tokenizer_huggyllama,
            cfg,
            ds_wrapper_partial,
            max_tokens=cfg.sequence_len,
            batch_size=cfg.micro_batch_size,
            seed=cfg.seed or 42,
        )
        trainer_loader = DataLoader(
--- a/tests/test_streaming.py
+++ b/tests/test_streaming.py
@@ -0,0 +1,238 @@
 """Test streaming configuration and data loading functionality."""
 import unittest
 from unittest.mock import Mock, patch
 from datasets import IterableDataset
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.data.sft import (
    _prepare_streaming_dataset,
    prepare_datasets,
 )
 from axolotl.utils.config import validate_config
 class TestStreamingConfig(unittest.TestCase):
    """Test streaming configuration and deprecation handling."""
    def test_streaming_multipack_buffer_size_deprecation(self):
        """Test that pretrain_multipack_buffer_size is properly deprecated."""
        # Test with old config name
        cfg_old = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "pretrain_multipack_buffer_size": 5000,
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
                "sequence_len": 256,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "learning_rate": 0.0001,
            }
        )
        with self.assertLogs("axolotl.utils.schemas.validation", level="WARNING") as cm:
            validated_cfg = validate_config(cfg_old)
            self.assertIn("pretrain_multipack_buffer_size` is deprecated", cm.output[0])
        self.assertEqual(validated_cfg.streaming_multipack_buffer_size, 5000)
        self.assertIsNone(
            getattr(validated_cfg, "pretrain_multipack_buffer_size", None)
        )
    def test_streaming_multipack_buffer_size_new(self):
        """Test that new streaming_multipack_buffer_size works correctly."""
        cfg_new = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "streaming_multipack_buffer_size": 7000,
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
                "sequence_len": 256,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "learning_rate": 0.0001,
            }
        )
        validated_cfg = validate_config(cfg_new)
        self.assertEqual(validated_cfg.streaming_multipack_buffer_size, 7000)
    def test_both_buffer_sizes_raises_error(self):
        """Test that having both old and new buffer size configs raises an error."""
        cfg_both = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "pretrain_multipack_buffer_size": 5000,
                "streaming_multipack_buffer_size": 7000,
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
                "sequence_len": 256,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "learning_rate": 0.0001,
            }
        )
        with self.assertRaises(ValueError) as cm:
            validate_config(cfg_both)
        self.assertIn("both are set", str(cm.exception))
 class TestStreamingDatasetPreparation(unittest.TestCase):
    """Test dataset preparation with streaming configuration."""
    def setUp(self):
        self.tokenizer = Mock()
        self.tokenizer.pad_token_id = 0
        self.tokenizer.eos_token_id = 1
    @patch("axolotl.utils.data.sft._prepare_streaming_dataset")
    def test_prepare_datasets_with_streaming_true(self, mock_prepare_streaming):
        """Test that streaming=True triggers streaming dataset preparation."""
        cfg = DictDefault(
            {
                "streaming": True,
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
            }
        )
        mock_prepare_streaming.return_value = (Mock(), None, 100, [])
        prepare_datasets(cfg, self.tokenizer)
        mock_prepare_streaming.assert_called_once_with(cfg, self.tokenizer, None)
    @patch("axolotl.utils.data.sft._prepare_streaming_dataset")
    def test_prepare_datasets_with_pretraining_dataset(self, mock_prepare_streaming):
        """Test that pretraining_dataset triggers streaming dataset preparation."""
        cfg = DictDefault(
            {
                "pretraining_dataset": "test/dataset",
            }
        )
        mock_prepare_streaming.return_value = (Mock(), None, 100, [])
        prepare_datasets(cfg, self.tokenizer)
        mock_prepare_streaming.assert_called_once_with(cfg, self.tokenizer, None)
    @patch("axolotl.utils.data.sft._prepare_standard_dataset")
    def test_prepare_datasets_without_streaming(self, mock_prepare_standard):
        """Test that without streaming, standard dataset preparation is used."""
        cfg = DictDefault(
            {
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
            }
        )
        mock_prepare_standard.return_value = (Mock(), None, 100, [])
        prepare_datasets(cfg, self.tokenizer)
        mock_prepare_standard.assert_called_once_with(cfg, self.tokenizer, None)
 class TestStreamingWithSamplePacking(unittest.TestCase):
    """Test streaming dataset preparation with sample packing."""
    def setUp(self):
        self.tokenizer = Mock()
        self.tokenizer.pad_token_id = 0
        self.tokenizer.eos_token_id = 1
    @patch("axolotl.utils.data.sft._load_streaming_dataset")
    def test_streaming_sft_with_sample_packing_sets_split(self, mock_load_streaming):
        """Test that streaming SFT with sample_packing sets default split."""
        cfg = DictDefault(
            {
                "streaming": True,
                "sample_packing": True,
                "datasets": [{"path": "test/dataset", "type": "alpaca"}],
                "sequence_len": 256,
                "micro_batch_size": 1,
            }
        )
        mock_load_streaming.return_value = Mock(spec=IterableDataset)
        with patch("axolotl.utils.data.sft._load_and_prepare_datasets"):
            _prepare_streaming_dataset(cfg, self.tokenizer, None)
            # Check that the dataset config has split set to 'train'
            call_args = mock_load_streaming.call_args
            dataset_config = call_args[0][0]
            self.assertEqual(dataset_config.split, "train")
    def test_multipack_attn_forced_true_for_sft(self):
        """Test that multipack_attn is forced to True for SFT with sample packing."""
        from axolotl.utils.data.streaming import wrap_streaming_dataset
        cfg = DictDefault(
            {
                "sample_packing": True,
                "pretrain_multipack_attn": False,  # Should be overridden for SFT
                "pretraining_dataset": None,  # This makes it SFT
                "sequence_len": 256,
                "micro_batch_size": 1,
                "streaming_multipack_buffer_size": 1000,
                "seed": 42,
            }
        )
        mock_dataset = Mock()
        mock_dataset.features = None  # For streaming datasets
        mock_dataset.__iter__ = Mock(return_value=iter([]))  # Empty iterator
        mock_dataset.map = Mock(return_value=mock_dataset)
        mock_ds_wrapper = Mock()
        with patch(
            "axolotl.utils.data.streaming.PretrainingBatchSamplerDataCollatorForSeq2Seq"
        ) as mock_collator:
            with patch("axolotl.utils.data.streaming.encode_packed_streaming"):
                wrap_streaming_dataset(
                    mock_dataset, self.tokenizer, cfg, mock_ds_wrapper
                )
                # Check that multipack_attn=True was used in the collator
                mock_collator.assert_called_once()
                call_kwargs = mock_collator.call_args[1]
                self.assertTrue(call_kwargs["multipack_attn"])
    def test_multipack_attn_respects_config_for_pretraining(self):
        """Test that multipack_attn respects config for pretraining datasets."""
        from axolotl.utils.data.streaming import wrap_streaming_dataset
        cfg = DictDefault(
            {
                "sample_packing": True,
                "pretrain_multipack_attn": False,  # Should be respected for pretraining
                "pretraining_dataset": "test/dataset",  # This makes it pretraining
                "sequence_len": 256,
                "micro_batch_size": 1,
                "streaming_multipack_buffer_size": 1000,
                "seed": 42,
            }
        )
        mock_dataset = Mock()
        mock_dataset.features = None  # For streaming datasets
        mock_dataset.__iter__ = Mock(return_value=iter([]))  # Empty iterator
        mock_dataset.map = Mock(return_value=mock_dataset)
        mock_ds_wrapper = Mock()
        with patch(
            "axolotl.utils.data.streaming.PretrainingBatchSamplerDataCollatorForSeq2Seq"
        ) as mock_collator:
            with patch("axolotl.utils.data.streaming.encode_packed_streaming"):
                wrap_streaming_dataset(
                    mock_dataset, self.tokenizer, cfg, mock_ds_wrapper
                )
                # Check that multipack_attn=False was used (respecting config)
                mock_collator.assert_called_once()
                call_kwargs = mock_collator.call_args[1]
                self.assertFalse(call_kwargs["multipack_attn"])
 if __name__ == "__main__":
    unittest.main()
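Review note (a sketch inferred from the tests, not the PR's implementation): the last two tests above pin down how `multipack_attn` is chosen for the collator. SFT with sample packing always gets `True`, while a config with `pretraining_dataset` keeps its own `pretrain_multipack_attn` value. The body of `wrap_streaming_dataset` is not shown in this section, so the function below, with the hypothetical name `resolve_multipack_attn`, only encodes the behavior the tests assert. Treating an unset value as `True` is an assumption based on the schema default.

```python
def resolve_multipack_attn(cfg: dict) -> bool:
    """Pick the multipack_attn flag the way the two tests above expect.

    Hypothetical helper; the real selection lives in wrap_streaming_dataset.
    """
    if cfg.get("pretraining_dataset"):
        value = cfg.get("pretrain_multipack_attn")
        # Assumption: an unset value falls back to the schema default (True).
        return True if value is None else bool(value)
    # SFT path: the configured value is overridden.
    return True


if __name__ == "__main__":
    sft = {"sample_packing": True, "pretrain_multipack_attn": False}
    pretrain = {**sft, "pretraining_dataset": "test/dataset"}
    print(resolve_multipack_attn(sft), resolve_multipack_attn(pretrain))
```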
Author	SHA1	Message	Date
Wing Lian	e1c7a61243	fix reentrant when using offloading	2025-09-14 10:42:15 -04:00
salman	9640338d37	Default `include_tkps` to true (#3134 ) * default true * force e2e * causal trainer only * fix eval loggin [skip-ci] * revert setup.py * force tests * guarding * guarding * fix test case * use evaluate [skip-e2e] * use evaluate [skip-e2e] * kick off ci * fixing * reverting	2025-09-09 10:50:21 -04:00
Wing Lian	b5d4c7ff54	allow 1% deviation for codecov (#3138 ) [skip ci]	2025-09-07 11:01:03 -04:00
Seungduk Kim	8fd9221f13	Add `ipo` as an `rl` type that shares DPODataset config (#3128 ) * Add `ipo` as an `rl` type that shares DPODataset config * chore: lint --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-09-07 10:49:10 -04:00
github-actions[bot]	bf00f29f3a	chore: update pre-commit hooks (#3137 ) [skip ci] Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>	2025-09-07 10:33:20 -04:00
NanoCode012	1d32278755	feat: upgrade transformers to v4.56.1 (#3127 ) * feat: upgrade transformers to v4.56 * fix handling of CP/SP now that position_ids are default even for unpacked sequences * feat: monkeypatch list_repo_templates * fix: apply patch for tests only * see if updated main works at least * fix: update to patch release and remove monkeypatch * remove fsdp2 eval patch --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-09-05 11:00:54 -04:00
NanoCode012	c6ae5c43cb	fix: chat template jinja file not being loaded during inference (#3112 ) * fix: chat template jinja file not being loaded during inference * fix: bot comment	2025-09-03 16:25:09 -04:00
yardenhoch	efa1da52d5	Center rewards coefficient (#3124 ) * feat: add center_rewards_coefficient for reward modeling - Add center_rewards_coefficient parameter to Pydantic schema with paper reference - Pass parameter through base builder and causal builder to training args - Add documentation section with usage examples and theoretical background - Enable parameter in reward modeling example configs with recommended value - Enables reward centering for improved training stability in RLHF workflows Implements auxiliary loss from Eisenstein et al. 2023 (https://huggingface.co/papers/2312.09244) to incentivize mean-zero reward outputs without post-training normalization. * Update description * test: add unit tests for center_rewards_coefficient integration * Update src/axolotl/core/builders/base.py Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * Update docs/reward_modelling.qmd Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * Update docs/reward_modelling.qmd Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * reference to TRL documentation. * add new reward model configuration for qwen3 with comprehensive parameters * Verified center_rewards_coefficient is correctly passed through the trainer builder to training arguments. * Refactor reward modeling documentation to consolidate information on center_rewards_coefficient * Remove unit tests for center_rewards_coefficient integration as part of codebase cleanup. * linting * nit * Apply suggestions from code review Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * lint --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>	2025-09-03 16:22:37 -04:00
mhenrichsen	48db520d92	Create 270m-qlora.yml (#3075) [skip ci] Adds 270m gemma3 qlora	2025-09-03 16:20:32 -04:00
NanoCode012	53a0c1f39c	feat: add peft_trainable_token_indices (#3062) * feat: add peft_trainable_token_indices * feat: add warning compat with fix_untrained_tokens	2025-09-03 01:48:01 -04:00
github-actions[bot]	4cc6038d52	chore: update pre-commit hooks (#3122) [skip ci] Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>	2025-09-03 01:41:34 -04:00
NanoCode012	e48aa8a5b1	feat(doc): improve visibility for colab notebooks (#3110) [skip ci] * feat: improve visibility for colab notebooks * fix: link to GH colab * feat: change to badge and move higher	2025-09-03 01:40:53 -04:00
xuyifann	24aba5caca	Clamping the len of dataloader to minimum of 1 (#3100) [skip ci] * Clamping the len of dataloader to minimum of 1 * linter reformat	2025-09-03 01:40:27 -04:00
Wing Lian	06bebcb65f	run cu128-2.8.0 e2e tests on B200 (#3126) * run cu128-2.8.0 e2e tests on B200 * not an int 🤦 * fix yaml	2025-09-02 13:13:23 -04:00
Dan Saunders	231a67e70b	Streaming SFT support (#3101) * working * fixes * deprecate --iterable; cleanup * pretrain_multipack_buffer_size -> streaming_multipack_buffer_size * improvements * tests * remove unused * docs, examples * nit * nit * add val_set_size validation * val * nit * min * coderabbito * cleanup * nit * add depr warning, cleanup * nit * fix test, fix quarto * fix * review comments * review comments * fix	2025-09-02 12:08:44 -04:00
Wing Lian	0094a2d744	support for tiledmlp for GPT-OSS (#3116) * fix use of flex attn kwargs and add support for tiledmlp for GPT-OSS * add logging back * update deps	2025-08-29 13:52:49 -04:00
Wing Lian	7ed40f1d70	automatically set env vars for single gpu deepspeed zero3 (#3118) [skip ci] * automatically set env vars for single gpu deepspeed zero3 * use setdefault	2025-08-29 13:36:47 -04:00
VED	5b6ec2820f	patch for ds_grads_remaining in deepspeed (#3102) [skip ci] * patch deepspeed * deepspeed patch for ds_grads_remaining * patch in Patchmanager * chore: lint * deepseed utils * chore2 * patch ds_grads_remaining chore * chore lint * chore lint * remove torch.nn patch * lint * Update src/axolotl/monkeypatch/utils.py Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * patched with checkpointwarapper * lint * only apply deepspeed patch when using activation offloading --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-08-29 12:12:09 -04:00
Wing Lian	6afba3871d	Add support for PyTorch 2.8.0 (#3106) * Add support for PyTorch 2.8.0 * loosen triton requirements * handle torch 2.8.0 in setup.py * fix versions * no vllm for torch 2.8.0 * remove comment Co-authored-by: NanoCode012 <nano@axolotl.ai> --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-08-28 09:10:40 -04:00
Dan Saunders	dc338c3b0e	Update .coderabbit.yaml (#3109) [skip ci] Oops, should be false.	2025-08-27 09:50:52 -04:00
salman	d0d2fc5606	Tokens per second logging [skip-e2e] (#3072)	2025-08-27 09:10:14 +01:00
Wing Lian	e1131e9619	make always skip_move_to_device default as true (#3084)	2025-08-26 09:30:22 -04:00
Wing Lian	c4c4b90638	add tokenizer_save_jinja_files to keep legacy behavior of including chat template in tokenizer_config.json (#3093) * add tokenizer_save_jinja_files to keep legacy behavior of including chat template in tokenizer_config.json * fix test import	2025-08-26 09:30:04 -04:00
Wing Lian	0e9945e3b9	deploy training jobs to baseten w truss in axolotl cli (#3086) [skip ci] * deploy training jobs to baseten w truss in axolotl cli * cleanup	2025-08-26 09:29:50 -04:00
NanoCode012	0de254a0d0	feat: add gemma3_text attention handling for lora kernels (#3103)	2025-08-26 16:47:26 +07:00