Compare commits
30 Commits
mm_mc_chat ... lora-kerne

| SHA1 |
|---|
| 700409be6f |
| 64d8035f50 |
| 5249e98058 |
| 3877c5c69d |
| adb593abac |
| a0117c9bce |
| e6cfb093d2 |
| 7abc71dc0b |
| 45bf634d17 |
| 80ba4b69f1 |
| 0bfa180f7d |
| 9e22c4ca6a |
| 990b5896bc |
| 7d0eb66b54 |
| df119e3724 |
| f4ae8816bb |
| 9b95e06cbb |
| e0aba74dd0 |
| 328d598114 |
| 4d36ecc724 |
| 7acf93b59f |
| b6fc46ada8 |
| b35992262e |
| ef6eb77cc8 |
| 5410195e0b |
| cf0c79d52e |
| 4ba80a0e5a |
| c49682132b |
| e46239f8d3 |
| 05f03b541a |
.github/workflows/base.yml (14 changes, vendored)
@@ -40,12 +40,24 @@ jobs:
python_version: "3.11"
pytorch: 2.6.0
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
- cuda: "126"
cuda_version: 12.6.3
cudnn_version: ""
python_version: "3.11"
pytorch: 2.6.0
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
- cuda: "128"
cuda_version: 12.8.1
cudnn_version: ""
python_version: "3.11"
pytorch: nightly
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
- cuda: "128"
cuda_version: 12.8.1
cudnn_version: ""
python_version: "3.11"
pytorch: next
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -67,7 +79,7 @@ jobs:
uses: docker/build-push-action@v4
with:
context: .
file: ${{ matrix.pytorch == 'nightly' && './docker/Dockerfile-base-nightly' || './docker/Dockerfile-base' }}
file: ${{ matrix.pytorch == 'nightly' && './docker/Dockerfile-base-nightly' || matrix.pytorch == 'next' && './docker/Dockerfile-base-next' || './docker/Dockerfile-base' }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.metadata.outputs.tags }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
labels: ${{ steps.metadata.outputs.labels }}

.github/workflows/main.yml (4 changes, vendored)
@@ -25,12 +25,12 @@ jobs:
python_version: "3.11"
pytorch: 2.5.1
axolotl_extras: vllm
is_latest: true
- cuda: 124
cuda_version: 12.4.1
python_version: "3.11"
pytorch: 2.6.0
axolotl_extras:
is_latest: true
runs-on: axolotl-gpu-runner
steps:
- name: Checkout
@@ -87,12 +87,12 @@ jobs:
python_version: "3.11"
pytorch: 2.5.1
axolotl_extras:
is_latest: true
- cuda: 124
cuda_version: 12.4.1
python_version: "3.11"
pytorch: 2.6.0
axolotl_extras:
is_latest: true
runs-on: axolotl-gpu-runner
steps:
- name: Checkout

.github/workflows/multi-gpu-e2e.yml (3 changes, vendored)
@@ -42,8 +42,7 @@ jobs:
cuda_version: 12.4.1
python_version: "3.11"
pytorch: 2.6.0
# awaiting vllm#12721
axolotl_extras:
axolotl_extras: vllm
num_gpus: 2
nightly_build: "true"
runs-on: [self-hosted, modal]

.github/workflows/tests-nightly.yml (25 changes, vendored)
@@ -33,6 +33,15 @@ jobs:
- name: Check out repository code
uses: actions/checkout@v4
- name: Restore HF cache
id: hf-cache-restore
uses: actions/cache/restore@v4
with:
path: |
/home/runner/.cache/huggingface/hub/datasets--*
/home/runner/.cache/huggingface/hub/models--*
key: ${{ runner.os }}-hf-hub-cache-v2
- name: Setup Python
uses: actions/setup-python@v5
with:
@@ -46,7 +55,7 @@ jobs:
- name: Install PyTorch
run: |
pip3 install torch==${{ matrix.pytorch_version }} --index-url https://download.pytorch.org/whl/cpu
pip3 install torch==${{ matrix.pytorch_version }}
- name: Update requirements.txt
run: |
@@ -58,8 +67,7 @@ jobs:
- name: Install dependencies
run: |
pip3 install --upgrade pip
pip3 install --upgrade packaging==23.2
pip3 show torch
pip3 install --no-build-isolation -U -e .
python scripts/unsloth_install.py | sh
python scripts/cutcrossentropy_install.py | sh
@@ -73,10 +81,15 @@ jobs:
run: |
axolotl --help
- name: Pre-Download dataset fixture
run: |
huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
- name: Run tests
run: |
pytest -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ tests/
pytest tests/patched/
pytest -v -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
pytest -v tests/patched/
pytest -v tests/cli/
- name: cleanup pip cache
run: |
@@ -136,4 +149,4 @@ jobs:
echo "NIGHTLY_BUILD=${{ matrix.nightly_build }}" >> $GITHUB_ENV
- name: Run tests job on Modal
run: |
modal run cicd.tests
modal run cicd.e2e_tests

.github/workflows/tests.yml (17 changes, vendored)
@@ -63,7 +63,7 @@ jobs:
path: |
/home/runner/.cache/huggingface/hub/datasets--*
/home/runner/.cache/huggingface/hub/models--*
key: ${{ runner.os }}-hf-hub-cache-${{ hashFiles('**/conftest.py') }}
key: ${{ runner.os }}-hf-hub-cache-v2
- name: Setup Python
uses: actions/setup-python@v5
@@ -96,6 +96,10 @@ jobs:
run: |
axolotl --help
- name: Pre-Download dataset fixture
run: |
huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
- name: Run tests
run: |
pytest -v -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
@@ -137,7 +141,7 @@ jobs:
path: |
/home/runner/.cache/huggingface/hub/datasets--*
/home/runner/.cache/huggingface/hub/models--*
key: ${{ runner.os }}-hf-hub-cache-${{ hashFiles('**/conftest.py') }}
key: ${{ runner.os }}-hf-hub-cache-v2
- name: Setup Python
uses: actions/setup-python@v5
@@ -171,6 +175,9 @@ jobs:
run: |
axolotl --help
- name: Show HF cache
run: huggingface-cli scan-cache
- name: Run tests
run: |
pytest -v -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
@@ -229,7 +236,7 @@ jobs:
echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
- name: Run tests job on Modal
run: |
modal run cicd.tests
modal run cicd.e2e_tests
docker-e2e-tests:
if: github.repository_owner == 'axolotl-ai-cloud'
@@ -253,7 +260,7 @@ jobs:
python_version: "3.11"
pytorch: 2.6.0
num_gpus: 1
axolotl_extras:
axolotl_extras: vllm
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -276,4 +283,4 @@ jobs:
echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
- name: Run tests job on Modal
run: |
modal run cicd.tests
modal run cicd.e2e_tests

@@ -1,3 +1,4 @@
[settings]
profile=black
known_third_party=wandb,comet_ml
known_local_folder=src,tests

@@ -40,6 +40,7 @@ quartodoc:
- cli.preprocess
- cli.sweeps
- cli.utils
- cli.vllm_serve
- cli.cloud.base
- cli.cloud.modal_
- title: Trainers
@@ -243,6 +244,7 @@ website:
- docs/unsloth.qmd
- docs/torchao.qmd
- docs/custom_integrations.qmd
- docs/sequence_parallelism.qmd

- section: "Troubleshooting"
contents:

@@ -2,4 +2,5 @@
set -e

# only run one test at a time so as not to OOM the GPU
pytest -v -n2 /workspace/axolotl/tests/e2e/multigpu/
pytest -v -n2 /workspace/axolotl/tests/e2e/multigpu/ --ignore=/workspace/axolotl/tests/e2e/multigpu/solo/
pytest -v -n1 /workspace/axolotl/tests/e2e/multigpu/solo/

@@ -20,9 +20,9 @@ WORKDIR /workspace/axolotl

# If AXOLOTL_EXTRAS is set, append it in brackets
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
pip install --no-build-isolation -e .[deepspeed,flash-attn,optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
pip install --no-build-isolation -e .[deepspeed,flash-attn,ring-flash-attn,optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
else \
pip install --no-build-isolation -e .[deepspeed,flash-attn,optimizers,ray] $AXOLOTL_ARGS; \
pip install --no-build-isolation -e .[deepspeed,flash-attn,ring-flash-attn,optimizers,ray] $AXOLOTL_ARGS; \
fi

RUN python scripts/unsloth_install.py | sh

docker/Dockerfile-base-next (new file, 38 lines)
@@ -0,0 +1,38 @@
ARG CUDA_VERSION="12.8.1"
ARG CUDNN_VERSION="8"
ARG UBUNTU_VERSION="22.04"
ARG MAX_JOBS=4

FROM nvidia/cuda:$CUDA_VERSION-cudnn$CUDNN_VERSION-devel-ubuntu$UBUNTU_VERSION AS base-builder

ENV PATH="/root/miniconda3/bin:${PATH}"

ARG PYTHON_VERSION="3.11"
ARG PYTORCH_VERSION="next"
ARG CUDA="128"
ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"

ENV PYTHON_VERSION=$PYTHON_VERSION
ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST

RUN apt-get update \
&& apt-get install -y wget git build-essential ninja-build git-lfs libaio-dev pkg-config && rm -rf /var/lib/apt/lists/* \
&& wget \
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& mkdir /root/.conda \
&& bash Miniconda3-latest-Linux-x86_64.sh -b \
&& rm -f Miniconda3-latest-Linux-x86_64.sh \
&& conda create -n "py${PYTHON_VERSION}" python="${PYTHON_VERSION}"

ENV PATH="/root/miniconda3/envs/py${PYTHON_VERSION}/bin:${PATH}"

WORKDIR /workspace

RUN python3 -m pip install --upgrade pip && pip3 install packaging && \
python3 -m pip install --no-cache-dir -U torch==2.7.0 --extra-index-url https://download.pytorch.org/whl/test/cu$CUDA && \
python3 -m pip install --no-cache-dir "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" && \
python3 -m pip install --no-cache-dir "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main"

RUN git lfs install --skip-repo && \
pip3 install awscli && \
pip3 install -U --no-cache-dir pydantic==2.10.6

docs/cli.qmd (40 changes)
@@ -170,7 +170,7 @@ axolotl merge-sharded-fsdp-weights config.yml

### evaluate

Evaluates a model's performance using metrics specified in the config.
Evaluates a model's performance (loss etc) on the train and eval datasets.

```bash
# Basic evaluation
@@ -197,6 +197,8 @@ lm_eval_batch_size: # Batch size for evaluation
output_dir: # Directory to save evaluation results
```

See [LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) for more details.

## Legacy CLI Usage

While the new Click-based CLI is preferred, Axolotl still supports the legacy module-based CLI:
@@ -235,7 +237,7 @@ Create a cloud config YAML with your Modal settings:
```yaml
# cloud_config.yml
provider: modal
gpu: a100 # Supported: l40s, a100-40gb, a100-80gb, a10g, h100, t4, l4
gpu: a100 # Supported: l40s, a100-40gb, a100-80gb, a10g, h100, t4, l4
gpu_count: 1 # Number of GPUs to use
timeout: 86400 # Maximum runtime in seconds (24 hours)
branch: main # Git branch to use (optional)
@@ -248,7 +250,7 @@ volumes: # Persistent storage volumes
- name: axolotl-artifacts
mount: /workspace/artifacts

env: # Environment variables
secrets: # Secrets to inject
- WANDB_API_KEY
- HF_TOKEN
```
@@ -274,15 +276,27 @@ axolotl lm-eval config.yml --cloud cloud_config.yml
### Cloud Configuration Options

```yaml
provider: # compute provider, currently only `modal` is supported
gpu: # GPU type to use
gpu_count: # Number of GPUs (default: 1)
memory: # RAM in GB (default: 128)
timeout: # Maximum runtime in seconds
provider: # compute provider, currently only `modal` is supported
gpu: # GPU type to use
gpu_count: # Number of GPUs (default: 1)
memory: # RAM in GB (default: 128)
timeout: # Maximum runtime in seconds
timeout_preprocess: # Preprocessing timeout
branch: # Git branch to use
docker_tag: # Custom Docker image tag
volumes: # List of persistent storage volumes
env: # Environment variables to pass
secrets: # Secrets to inject
branch: # Git branch to use
docker_tag: # Custom Docker image tag
volumes: # List of persistent storage volumes

# Environment variables to pass. Can be specified in two ways:
# 1. As a string: Will load the value from the host computer's environment variables
# 2. As a key-value pair: Will use the specified value directly
# Example:
# env:
# - CUSTOM_VAR # Loads from host's $CUSTOM_VAR
# - {CUSTOM_VAR: "value"} # Uses "value" directly
env:

# Secrets to inject. Same input format as `env` but for sensitive data.
secrets:
# - HF_TOKEN
# - WANDB_API_KEY
```

@@ -238,10 +238,10 @@ simpo_gamma: 0.5 # Target reward margin for the SimPO loss
# grpo
trl:
use_vllm: # Optional[bool]. Whether to use VLLM for RL training.
vllm_device: # Optional[str]. Device to use for VLLM.
vllm_gpu_memory_utilization: # Optional[float]. GPU memory utilization for VLLM.
vllm_max_model_len: # Optional[int]. Maximum length of the model for VLLM.
vllm_dtype: # Optional[str]. Data type for VLLM.
vllm_server_host: # Optional[str]. Host of the vLLM server to connect to.
vllm_server_port: # Optional[int]. Port of the vLLM server to connect to.
vllm_server_timeout: # Optional[int]. Total timeout (in seconds) to wait for the vLLM server to respond.
vllm_guided_decoding_regex: # Optional[str]. Regex for vLLM guided decoding.

beta: # Optional[float]. Beta parameter for the RL training. Same as `rl_beta`. Use
max_completion_length: # Optional[int]. Maximum length of the completion for RL training.
@@ -320,9 +320,13 @@ total_num_tokens:
sample_packing_group_size: 100000
# The number of samples which can be packed into one sequence. Increase if using a large sequence_len with many short samples.
sample_packing_bin_size: 200
sample_pack_sequentially: # Optional[bool]. Whether to pack samples sequentially.

# whether to concatenate samples during pretraining
pretraining_sample_concatenation:

curriculum_sampling: # Optional[bool]. Whether to use sequential sampling for curriculum learning

# Use batch flattening for speedups when not using sample_packing
batch_flattening:
@@ -354,7 +358,27 @@ lora_target_modules:
# - down_proj
# - up_proj
lora_target_linear: # If true, will target all linear modules
peft_layers_to_transform: # The layer indices to transform, otherwise, apply to all layers

# List[int] | int. # The layer indices to transform, otherwise, apply to all layers
# https://huggingface.co/docs/peft/v0.15.0/en/package_reference/lora#peft.LoraConfig.layers_to_transform
peft_layers_to_transform:

# Optional[bool]. Whether to use DoRA.
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#weight-decomposed-low-rank-adaptation-dora
peft_use_dora:

# Optional[bool]. Whether to use RSLoRA.
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#rank-stabilized-lora
peft_use_rslora:

# Optional[list[tuple[int, int]]]. List of layer indices to replicate.
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#memory-efficient-layer-replication-with-lora
peft_layer_replication:

# bool | Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq"]
# How to initialize LoRA weights. Default to True which is MS original implementation.
# https://huggingface.co/docs/peft/v0.15.0/en/developer_guides/lora#initialization
peft_init_lora_weights:

# If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
# For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
@@ -486,7 +510,8 @@ train_on_inputs: false
# Note that training loss may have an oscillating pattern with this enabled.
group_by_length: false

# Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
# Whether to use gradient checkpointing. Available options are: true, false, "offload".
# https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
gradient_checkpointing: false
# additional kwargs to pass to the trainer for gradient checkpointing
# gradient_checkpointing_kwargs:
@@ -587,26 +612,31 @@ max_grad_norm:
# currently only supported on Llama and Mistral
neftune_noise_alpha:

# Whether to bettertransformers
# Optional[bool]. Whether to bettertransformers
flash_optimum:
# Whether to use xformers attention patch https://github.com/facebookresearch/xformers:

# Note: Only one of the following attention patches can be used at a time.
# For example, if you set `xformers_attention` to `true`, do not set `flash_attention` to `true`.

# Optional[bool]. Whether to use xformers attention patch https://github.com/facebookresearch/xformers:
xformers_attention:
# Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
# Optional[bool]. Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention:
flash_attention:
flash_attn_cross_entropy: # Whether to use flash-attention cross entropy implementation - advanced use only
flash_attn_rms_norm: # Whether to use flash-attention rms norm implementation - advanced use only
flash_attn_fuse_qkv: # Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: # Whether to fuse part of the MLP into a single operation
# Whether to use scaled-dot-product attention
flash_attn_cross_entropy: # Optional[bool]. Whether to use flash-attention cross entropy implementation - advanced use only
flash_attn_rms_norm: # Optional[bool]. Whether to use flash-attention rms norm implementation - advanced use only
flash_attn_fuse_qkv: # Optional[bool]. Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: # Optional[bool]. Whether to fuse part of the MLP into a single operation
# Optional[bool]. Whether to use scaled-dot-product attention
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
sdp_attention:
# Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
# Optional[bool]. Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
s2_attention:

# Optional[bool]. Whether to use low_cpu_mem_usage
low_cpu_mem_usage:
# Resume from a specific checkpoint dir
# Optional[str]. Resume from a specific checkpoint dir
resume_from_checkpoint:
# If resume_from_checkpoint isn't set and you simply want it to start where it left off.
# Optional[bool]. If resume_from_checkpoint isn't set and you simply want it to start where it left off.
# Be careful with this being turned on between different models.
auto_resume_from_checkpoints: false

@@ -658,6 +688,9 @@ ddp_broadcast_buffers:
# subsequences, or set to 4 to split into four equal-sized subsequences.
# See https://axolotl-ai-cloud.github.io/axolotl/docs/sequence_parallelism.html for more details.
sequence_parallel_degree:
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
# Must evenly divide the number of KV heads in your model.
heads_k_stride: 1

# Path to torch distx for optim 'adamw_anyprecision'
torchdistx_path:

docs/faq.qmd (12 changes)
@@ -35,12 +35,22 @@ description: Frequently asked questions

**Q: How to call Axolotl via custom python scripts?**

> A: Yes, since Axolotl is just Python, please see `src/axolotl/cli/main.py` on how each command is called.
> A: Since Axolotl is just Python, please see `src/axolotl/cli/main.py` on how each command is called.

**Q: How to know the value to use for `fsdp_transformer_layer_cls_to_wrap`?**

> A: This is the class name of the transformer layer to wrap with FSDP. For example, for `LlamaForCausalLM`, the value is `LlamaDecoderLayer`. To find this for a specific model, check the model's `PreTrainedModel` definition and look for `_no_split_modules` variable in the `modeling_<model_name>.py` file within `transformers` library.

**Q: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token**

> A: This is because the tokenizer does not have a padding token. Please add a padding token to the tokenizer via:

> ```yaml
> special_tokens:
>   # str. If you're not sure, set to same as `eos_token`.
>   pad_token: "..."
> ```

### Chat templates

**Q: `jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____`**

@@ -17,6 +17,7 @@ We currently support several common model architectures, including (but not limi
- `qwen2`
- `gemma`
- `gemma2`
- `gemma3`

<details>

@@ -18,6 +18,7 @@ Axolotl supports several methods for multi-GPU training:

- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- Sequence parallelism
- FSDP + QLoRA

## DeepSpeed {#sec-deepspeed}
@@ -66,6 +67,28 @@ fsdp_config:
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

## Sequence parallelism {#sec-sequence-parallelism}

We support sequence parallelism (SP) via the
[ring-flash-attention](https://github.com/zhuzilin/ring-flash-attention) project. This
allows one to split up sequences across GPUs, which is useful in the event that a
single sequence causes OOM errors during model training.

First, install `ring-flash-attn`, recommended via `pip install axolotl[ring-flash-attn]`,
or from source with `pip install .[ring-flash-attn]`.

Your Axolotl YAML config should contain the following lines:

```{.yaml}
sequence_parallel_degree: 4 # Split each sequence into 4 parts, one per GPU
flash_attention: true # Required with sequence parallelism

# Optional; strides across the key dimension. Larger values use more memory but will make training faster.
heads_k_stride: 1
```

See our [dedicated guide](sequence_parallelism.qmd) for more details.

### FSDP + QLoRA {#sec-fsdp-qlora}

For combining FSDP with QLoRA, see our [dedicated guide](fsdp_qlora.qmd).

@@ -502,9 +502,48 @@ The input format is a simple JSON input with customizable fields based on the ab
Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
:::

If you have multiple GPUs available, we recommend using `vLLM` with the `GRPOTrainer` to significantly speed up trajectory generation during training.
First, launch a `vLLM` server using `trl vllm-serve` - you may use a config file or CLI overrides to configure your vLLM server. In this example, we're
using 4 GPUs - 2 for training, and 2 for vLLM:

::: {.callout-important}
Make sure you've installed the correct version of vLLM by including it as an extra when installing axolotl, e.g. `pip install axolotl[vllm]`.
:::

```yaml
base_model: Qwen/Qwen2.5-1.5B-Instruct

vllm:
host: 0.0.0.0
port: 8000
tensor_parallel_size: 2
gpu_memory_utilization: 0.85
dtype: auto
# max_model_len: # you may find it useful to set the vLLM model context length if you know this beforehand

rl: grpo
trl:
use_vllm: true
vllm_server_host: 0.0.0.0
vllm_server_port: 8000
vllm_server_timeout: 300
```

```bash
CUDA_VISIBLE_DEVICES=2,3 axolotl vllm_serve grpo.yaml
```

Your `vLLM` instance will now attempt to spin up, and it's time to kick off training utilizing our remaining two GPUs. In another terminal, execute:

```bash
CUDA_VISIBLE_DEVICES=0,1 axolotl train grpo.yaml --num-processes 2
```

#### Reward functions

GRPO uses custom reward functions and transformations. Please have them ready locally.

For ex, to load OpenAI's GSM8K and use a random reward for completions:
For example, to load OpenAI's GSM8K and use a random reward for completions:

```python
# rewards.py
@@ -530,8 +569,6 @@ trl:
beta: 0.001
max_completion_length: 256
use_vllm: True
vllm_device: auto
vllm_gpu_memory_utilization: 0.15
num_generations: 4
reward_funcs: ["rewards.rand_reward_func"] # format: '{file_name}.{fn_name}'
reward_weights: [1.0]

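The body of the `rewards.py` example is elided by the hunk above. As a rough orientation only, a minimal sketch of a reward function matching the `reward_funcs: ["rewards.rand_reward_func"]` entry might look like the following; it assumes TRL's GRPO reward-function convention, where each function receives the batch of completions (plus any dataset columns as keyword arguments) and returns one float per completion:

```python
# rewards.py -- illustrative sketch only; the actual file contents are not shown in this diff.
import random


def rand_reward_func(completions, **kwargs) -> list[float]:
    # Assign a random score to every completion, matching the "random reward"
    # example referenced in the docs change above.
    return [random.random() for _ in completions]
```
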
@@ -25,6 +25,8 @@ To enable sequence parallelism, add the following to your configuration file:
```yaml
# Set to a divisor (> 1) of the number of GPUs available
sequence_parallel_degree: 4 # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
```

The `sequence_parallel_degree` should be a divisor of the total number of GPUs. For example:
@@ -58,11 +60,16 @@ To use sequence parallelism, you need:
## Example

```yaml
# Example config with sequence parallelism
base_model: meta-llama/Llama-3-8B-Instruct
sequence_len: 8192
sequence_parallel_degree: 2 # Split each sequence into 4 parts

...

sequence_parallel_degree: 4 # Split each sequence into 4 parts, one per GPU
flash_attention: true # Required with sequence parallelism
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1

...
```

@@ -5,12 +5,15 @@ tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true

load_in_8bit: false
load_in_4bit: true
strict: false

# huggingface repo
chat_template: gemma3_text
chat_template: gemma3
datasets:
- path: cgato/SlimOrcaDedupCleaned
type: chat_template
@@ -54,6 +57,8 @@ fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:

examples/gemma3/gemma-3-4b-qlora.yml (new file, 68 lines)
@@ -0,0 +1,68 @@
base_model: google/gemma-3-4b-it
strict: false

load_in_4bit: true

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true

chat_template: gemma3
datasets:
- path: cgato/SlimOrcaDedupCleaned
type: chat_template
field_messages: conversations
message_property_mappings:
role: from
content: value

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
local_rank:
logging_steps: 1
flash_attention: true
eager_attention:

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:

@@ -2,11 +2,16 @@ base_model: google/gemma-3-4b-it
processor_type: AutoProcessor
strict: false

load_in_4bit: true

# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true

chat_template: gemma3
datasets:
- path: HuggingFaceH4/llava-instruct-mix-vsft
@@ -17,7 +22,7 @@ dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/out

adapter: lora
adapter: qlora
lora_model_dir:

sequence_len: 2048
@@ -48,6 +53,8 @@ fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
local_rank:
logging_steps: 1
flash_attention: true

@@ -19,7 +19,6 @@ val_set_size: 0.0
output_dir: ./outputs/lora-out

dataset_exact_deduplication: true
test_value: true

sequence_len: 4096
sample_packing: true

examples/llama-3/lora-1b-sample-packing-sequentially.yml (new file, 80 lines)
@@ -0,0 +1,80 @@
base_model: meta-llama/Llama-3.2-1B
# optionally might have model_type or tokenizer_type
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
- path: mhenrichsen/alpaca_2k_test
type: alpaca
- path: mhenrichsen/alpaca_2k_test
type: alpaca
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./outputs/lora-out

test_value: true

sequence_len: 4096
sample_packing: true
sample_packing_sequentially: true
curriculum_sampling: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
- embed_tokens
- lm_head

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
pad_token: <|end_of_text|>

@@ -1,23 +1,23 @@
--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# START section of dependencies that don't install on Darwin/MacOS
bitsandbytes==0.45.3
bitsandbytes==0.45.4
triton>=3.0.0
mamba-ssm==1.2.0.post1
xformers>=0.0.23.post1
autoawq==0.2.7.post3
liger-kernel==0.5.3
liger-kernel==0.5.5
# END section

packaging==23.2

peft==0.15.0
transformers==4.50.0
transformers==4.50.3
tokenizers>=0.21.1
accelerate==1.5.2
datasets==3.4.1
deepspeed==0.16.4
trl==0.15.1
datasets==3.5.0
deepspeed==0.15.4
trl==0.16.0

optimum==1.16.2
hf_transfer

setup.py (87 changes)
@@ -10,7 +10,7 @@ from pathlib import Path
from setuptools import find_packages, setup

def parse_requirements():
def parse_requirements(extras_require_map):
_install_requires = []
_dependency_links = []
with open("./requirements.txt", encoding="utf-8") as requirements_file:
@@ -67,6 +67,7 @@ def parse_requirements():
if (major, minor) >= (2, 6):
_install_requires.pop(_install_requires.index(xformers_version))
_install_requires.append("xformers==0.0.29.post2")
extras_require_map["vllm"] = ["vllm==0.8.1"]
elif (major, minor) >= (2, 5):
_install_requires.pop(_install_requires.index(xformers_version))
if patch == 0:
@@ -86,7 +87,7 @@ def parse_requirements():

except PackageNotFoundError:
pass
return _install_requires, _dependency_links
return _install_requires, _dependency_links, extras_require_map

def get_package_version():
@@ -103,7 +104,50 @@ def get_package_version():
return version_

install_requires, dependency_links = parse_requirements()
extras_require = {
"flash-attn": ["flash-attn==2.7.4.post1"],
"ring-flash-attn": [
"flash-attn==2.7.4.post1",
"ring-flash-attn>=0.1.4",
"yunchang==0.6.0",
],
"deepspeed": [
"deepspeed==0.15.4",
"deepspeed-kernels",
],
"mamba-ssm": [
"mamba-ssm==1.2.0.post1",
"causal_conv1d",
],
"auto-gptq": [
"auto-gptq==0.5.1",
],
"mlflow": [
"mlflow",
],
"galore": [
"galore_torch",
],
"apollo": [
"apollo-torch",
],
"optimizers": [
"galore_torch",
"apollo-torch",
"lomo-optim==0.1.1",
"torch-optimi==0.2.1",
],
"ray": [
"ray[train]",
],
"vllm": [
"vllm==0.7.2",
],
}

install_requires, dependency_links, extras_require_build = parse_requirements(
extras_require
)

setup(
version=get_package_version(),
@@ -116,40 +160,5 @@ setup(
"axolotl=axolotl.cli.main:main",
],
},
extras_require={
"flash-attn": ["flash-attn==2.7.4.post1"],
"ring-flash-attn": ["ring-flash-attn>=0.1.4", "yunchang==0.6.0"],
"deepspeed": [
"deepspeed==0.16.4",
"deepspeed-kernels",
],
"mamba-ssm": [
"mamba-ssm==1.2.0.post1",
"causal_conv1d",
],
"auto-gptq": [
"auto-gptq==0.5.1",
],
"mlflow": [
"mlflow",
],
"galore": [
"galore_torch",
],
"apollo": [
"apollo-torch",
],
"optimizers": [
"galore_torch",
"apollo-torch",
"lomo-optim==0.1.1",
"torch-optimi==0.2.1",
],
"ray": [
"ray[train]",
],
"vllm": [
"vllm==0.7.2",
],
},
extras_require=extras_require_build,
)

@@ -4,4 +4,4 @@ import pkgutil

__path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package

__version__ = "0.8.0.dev0"
__version__ = "0.8.0"

@@ -35,6 +35,55 @@ class TrainerCliArgs:
num_processes: Optional[int] = field(default=None)

@dataclass
class VllmServeCliArgs:
"""Dataclass with CLI arguments for `axolotl vllm-serve` command."""

tensor_parallel_size: int = field(
default=1,
metadata={"help": "Number of tensor parallel workers to use."},
)
host: str = field(
default="0.0.0.0",  # nosec B104
metadata={"help": "Host address to run the server on."},
)
port: int = field(
default=8000,
metadata={"help": "Port to run the server on."},
)
gpu_memory_utilization: Optional[float] = field(
default=None,
metadata={
"help": "Ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV "
"cache on the device dedicated to generation powered by vLLM. Higher values will increase the KV cache "
"size and thus improve the model's throughput. However, if the value is too high, it may cause "
"out-of-memory (OOM) errors during initialization."
},
)
dtype: Optional[str] = field(
default=None,
metadata={
"help": "Data type to use for vLLM generation. If set to 'auto', the data type will be automatically "
"determined based on the model configuration. Find the supported values in the vLLM documentation."
},
)
max_model_len: Optional[int] = field(
default=None,
metadata={
"help": "If set, the `max_model_len` to use for vLLM. This can be useful when running with reduced "
"`vllm_gpu_memory_utilization`, leading to a reduced KV cache size. If not set, vLLM will use the model "
"context size, which might be much larger than the KV cache, leading to inefficiencies."
},
)
enable_prefix_caching: Optional[bool] = field(
default=None,
metadata={
"help": "Whether to enable prefix caching in vLLM. If set to `True`, ensure that the model and the "
"hardware support this feature."
},
)

@dataclass
class EvaluateCliArgs:
"""Dataclass with CLI arguments for `axolotl evaluate` command."""

@@ -256,7 +256,7 @@ def do_cli(
"""
# pylint: disable=duplicate-code
print_axolotl_text_art()
parsed_cfg = load_cfg(config, inference=True, **kwargs)
parsed_cfg = load_cfg(config, inference=True, rl=None, **kwargs)
parsed_cfg.sample_packing = False
parser = transformers.HfArgumentParser(InferenceCliArgs)
parsed_cli_args, _ = parser.parse_args_into_dataclasses(

@@ -14,7 +14,12 @@ import yaml
from dotenv import load_dotenv

import axolotl
from axolotl.cli.args import EvaluateCliArgs, PreprocessCliArgs, TrainerCliArgs
from axolotl.cli.args import (
EvaluateCliArgs,
PreprocessCliArgs,
TrainerCliArgs,
VllmServeCliArgs,
)
from axolotl.cli.sweeps import generate_sweep_configs
from axolotl.cli.utils import (
add_options_from_config,
@@ -23,6 +28,7 @@ from axolotl.cli.utils import (
fetch_from_github,
filter_none_kwargs,
)
from axolotl.cli.vllm_serve import do_vllm_serve
from axolotl.integrations.lm_eval.cli import lm_eval
from axolotl.utils import set_pytorch_cuda_alloc_conf
from axolotl.utils.schemas.config import AxolotlInputConfig
@@ -316,6 +322,14 @@ def fetch(directory: str, dest: Optional[str]) -> None:
fetch_from_github(f"{directory}/", dest)

@cli.command()
@click.argument("config", type=click.Path(exists=True, path_type=str))
@add_options_from_dataclass(VllmServeCliArgs)
@filter_none_kwargs
def vllm_serve(config: str, **cli_args: VllmServeCliArgs):
do_vllm_serve(config, cli_args)

cli.add_command(lm_eval)

@@ -74,8 +74,10 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
|
||||
load_in_8bit=False,
|
||||
load_in_4bit=False,
|
||||
flash_attention=False,
|
||||
sequence_parallel_degree=None,
|
||||
deepspeed=None,
|
||||
fsdp=None,
|
||||
fsdp_config=None,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@@ -86,13 +88,6 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
|
||||
f"Target directory for merge: `{parsed_cfg.lora_model_dir}` does not exist."
|
||||
)
|
||||
|
||||
parsed_cfg.load_in_4bit = False
|
||||
parsed_cfg.load_in_8bit = False
|
||||
parsed_cfg.flash_attention = False
|
||||
parsed_cfg.deepspeed = None
|
||||
parsed_cfg.fsdp = None
|
||||
parsed_cfg.fsdp_config = None
|
||||
|
||||
do_merge_lora(cfg=parsed_cfg)
|
||||
|
||||
|
||||
|
||||
src/axolotl/cli/vllm_serve.py (new file, 55 lines)
@@ -0,0 +1,55 @@
"""
CLI to start the vllm server for online RL
"""

from pathlib import Path
from typing import Union

from trl.scripts.vllm_serve import ScriptArguments
from trl.scripts.vllm_serve import main as vllm_serve_main

from axolotl.cli.config import load_cfg


def do_vllm_serve(
config: Union[Path, str],
cli_args: dict,
):
"""
Starts the VLLM server for serving LLM models used for online RL

Args
:param config: Parsed dict of the YAML config
:param cli_args: dict of additional command-line arguments of type VllmServeCliArgs

Returns:
process_id: the process id of the started VLLM server
"""
cfg = load_cfg(config)
model = cfg.base_model

tensor_parallel_size = (
cli_args.get("tensor_parallel_size") or cfg.vllm.tensor_parallel_size
)
host = cli_args.get("host") or cfg.vllm.host
port = cli_args.get("port") or cfg.vllm.port
gpu_memory_utilization = (
cli_args.get("gpu_memory_utilization") or cfg.vllm.gpu_memory_utilization
)
dtype = cli_args.get("dtype") or cfg.vllm.dtype
max_model_len = cli_args.get("max_model_len") or cfg.vllm.max_model_len
enable_prefix_caching = (
cli_args.get("enable_prefix_caching") or cfg.vllm.enable_prefix_caching
)

vllm_script_args = ScriptArguments(
model,
tensor_parallel_size=tensor_parallel_size,
host=host,
port=port,
gpu_memory_utilization=gpu_memory_utilization,
dtype=dtype,
max_model_len=max_model_len,
enable_prefix_caching=enable_prefix_caching,
)
vllm_serve_main(vllm_script_args)

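For orientation, the entry point added above can also be driven directly from Python rather than through the new `axolotl vllm_serve` CLI command. A minimal sketch, assuming a local `grpo.yaml` config whose `vllm:` section supplies defaults (as in the docs change earlier in this diff):

```python
# Illustrative sketch only -- not part of the diff.
from axolotl.cli.vllm_serve import do_vllm_serve

# Override a couple of settings the same way the CLI would after parsing
# VllmServeCliArgs; keys not supplied here fall back to the cfg.vllm values.
do_vllm_serve("grpo.yaml", {"tensor_parallel_size": 2, "port": 8000})
```
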
@@ -69,7 +69,6 @@ from axolotl.utils.callbacks import (
|
||||
LossWatchDogCallback,
|
||||
SaveAxolotlConfigtoWandBCallback,
|
||||
SaveBetterTransformerModelCallback,
|
||||
SaveModelCallback,
|
||||
bench_eval_callback_factory,
|
||||
causal_lm_bench_eval_callback_factory,
|
||||
log_prediction_callback_factory,
|
||||
@@ -249,7 +248,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
|
||||
if self.cfg.gc_steps:
|
||||
callbacks.append(GCCallback(gc_steps=self.cfg.gc_steps))
|
||||
callbacks.append(SaveModelCallback())
|
||||
|
||||
return callbacks
|
||||
|
||||
@@ -526,9 +524,15 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
and self.cfg.eval_steps
|
||||
and self.cfg.save_steps % self.cfg.eval_steps == 0
|
||||
) or False
|
||||
|
||||
# handle ddp
|
||||
ddp_find_unused_parameters = None
|
||||
if self.cfg.ddp:
|
||||
ddp_find_unused_parameters = bool(self.cfg.ddp_find_unused_parameters)
|
||||
training_arguments_kwargs["ddp_find_unused_parameters"] = (
|
||||
False if self.cfg.ddp else None
|
||||
ddp_find_unused_parameters
|
||||
)
|
||||
|
||||
training_arguments_kwargs["group_by_length"] = self.cfg.group_by_length
|
||||
training_arguments_kwargs["curriculum_sampling"] = self.cfg.curriculum_sampling
|
||||
report_to = []
|
||||
@@ -937,7 +941,6 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
|
||||
|
||||
def get_callbacks(self):
|
||||
callbacks = super().get_callbacks()
|
||||
callbacks.append(SaveModelCallback())
|
||||
|
||||
return callbacks
|
||||
|
||||
|
||||
@@ -8,12 +8,11 @@ import logging
import os
from collections import defaultdict
from functools import wraps
from typing import Any, Literal
from typing import Literal

import datasets
import torch
from datasets import Dataset
from torch import nn
from torch.utils.data import (
BatchSampler,
DataLoader,
@@ -28,6 +27,7 @@ from typing_extensions import override

from axolotl.core.trainers.mixins import (
OptimizerMixin,
RngLoaderMixin,
SchedulerMixin,
SequenceParallelMixin,
)
@@ -40,7 +40,9 @@ from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
LOG = logging.getLogger(__name__)

class AxolotlTrainer(SchedulerMixin, OptimizerMixin, SequenceParallelMixin, Trainer):
class AxolotlTrainer(
SchedulerMixin, OptimizerMixin, RngLoaderMixin, SequenceParallelMixin, Trainer
):
"""Extend the base Trainer for axolotl helpers"""

args = None  # type: "AxolotlTrainingArguments"  # type: ignore[name-defined]
@@ -112,6 +114,7 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, SequenceParallelMixin, Trai
packing_efficiency_estimate=self.args.sample_packing_efficiency,
batch_max_len=batch_max_len,
batch_size=batch_size,
sequential=self.args.sample_packing_sequentially,
drop_last=True,
)

@@ -589,27 +592,3 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, SequenceParallelMixin, Trai
output_dir = os.path.join(run_dir, checkpoint_folder)
os.makedirs(output_dir, exist_ok=True)
return super()._save_checkpoint(model, trial, **kwargs)

def training_step(
self,
model: nn.Module,
inputs: dict[str, torch.Tensor | Any],
num_items_in_batch: int | None = None,
) -> torch.Tensor:
"""
Perform a training step on a batch of inputs. Overrides the
`transformers.trainer.Trainer` method to handle sequence parallelism if
enabled.

Args:
model: Model to perform training step for.
inputs: Dictionary mapping.
"""
# Set up sequence parallelism for this step if enabled
if self.args.sequence_parallel_degree > 1:
self._update_ring_flash_attn_params(inputs)

# Proceed with normal training step
loss = super().training_step(model, inputs, num_items_in_batch)

return loss

@@ -13,7 +13,7 @@ from transformers import Trainer
from transformers.utils import is_sagemaker_mp_enabled
from trl import DPOTrainer

from axolotl.core.trainers.mixins import SchedulerMixin
from axolotl.core.trainers.mixins import RngLoaderMixin, SchedulerMixin
from axolotl.core.trainers.utils import (
sanitize_kwargs_for_ds_tagging,
sanitize_kwargs_for_tagging,
@@ -23,7 +23,7 @@ if is_sagemaker_mp_enabled():
import smdistributed.modelparallel.torch as smp

class AxolotlDPOTrainer(SchedulerMixin, DPOTrainer):
class AxolotlDPOTrainer(RngLoaderMixin, SchedulerMixin, DPOTrainer):
"""
Extend the base DPOTrainer for axolotl helpers
"""

@@ -40,18 +40,15 @@ class GRPOStrategy:

if trl.use_vllm:
grpo_args_kwargs["use_vllm"] = trl.use_vllm
grpo_args_kwargs["vllm_device"] = (
trl.vllm_device if trl.vllm_device else "auto"
)

if trl.vllm_gpu_memory_utilization:
grpo_args_kwargs["vllm_gpu_memory_utilization"] = (
trl.vllm_gpu_memory_utilization
grpo_args_kwargs["vllm_server_host"] = trl.vllm_server_host
grpo_args_kwargs["vllm_server_port"] = trl.vllm_server_port
if trl.vllm_server_timeout:
grpo_args_kwargs["vllm_server_timeout"] = trl.vllm_server_timeout
if trl.vllm_guided_decoding_regex:
grpo_args_kwargs["vllm_guided_decoding_regex"] = (
trl.vllm_guided_decoding_regex
)

if trl.vllm_max_model_len:
grpo_args_kwargs["vllm_max_model_len"] = trl.vllm_max_model_len

if trl.num_generations:
grpo_args_kwargs["num_generations"] = trl.num_generations

@@ -70,6 +67,25 @@ class GRPOStrategy:
if trl.reward_weights:
grpo_args_kwargs["reward_weights"] = trl.reward_weights

if trl.scale_rewards is not None:
grpo_args_kwargs["scale_rewards"] = trl.scale_rewards

if trl.temperature is not None:
grpo_args_kwargs["temperature"] = trl.temperature
if trl.top_p is not None:
grpo_args_kwargs["top_p"] = trl.top_p
if trl.top_k is not None:
grpo_args_kwargs["top_k"] = trl.top_k
if trl.min_p is not None:
grpo_args_kwargs["min_p"] = trl.min_p
if trl.repetition_penalty is not None:
grpo_args_kwargs["repetition_penalty"] = trl.repetition_penalty

if trl.num_iterations is not None:
grpo_args_kwargs["num_iterations"] = trl.num_iterations
if trl.epsilon is not None:
grpo_args_kwargs["epsilon"] = trl.epsilon

return grpo_args_kwargs

@classmethod

@@ -2,108 +2,68 @@
Axolotl GRPO trainer
"""

from accelerate.utils import is_peft_model
from accelerate.utils.other import is_compiled_module
from transformers import PreTrainedModel
from trl import GRPOConfig, GRPOTrainer
from trl.models import unwrap_model_for_generation
from contextlib import nullcontext

from axolotl.core.trainers.base import SchedulerMixin
from accelerate.utils import is_deepspeed_available, is_peft_model
from trl import GRPOTrainer
from trl.extras.profiling import profiling_decorator

from axolotl.core.trainers.mixins import RngLoaderMixin, SchedulerMixin

if is_deepspeed_available():
import deepspeed

# mypy: ignore-errors
class AxolotlGRPOTrainer(SchedulerMixin, GRPOTrainer):
class AxolotlGRPOTrainer(RngLoaderMixin, SchedulerMixin, GRPOTrainer):
"""
Extend the base GRPOTrainer for axolotl helpers
"""

_tag_names = ["trl", "grpo", "axolotl"]

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

# pylint: disable=access-member-before-definition
# Enable gradient checkpointing if requested
if kwargs["args"].gradient_checkpointing:
# Ensure use_cache is disabled
if hasattr(self.model, "config"):
self.model.config.use_cache = False

# Enable gradient checkpointing on the base model for PEFT
if is_peft_model(self.model) and hasattr(
self.model.base_model, "gradient_checkpointing_enable"
):
self.model.base_model.gradient_checkpointing_enable()
# Enable gradient checkpointing for non-PEFT models
elif hasattr(self.model, "gradient_checkpointing_enable"):
self.model.gradient_checkpointing_enable()
self.model = self._enable_gradient_checkpointing(self.model, kwargs["args"])
# pylint: enable=access-member-before-definition

def _enable_gradient_checkpointing(
self, model: PreTrainedModel, args: GRPOConfig
) -> PreTrainedModel:
"""Enables gradient checkpointing for the model."""
# pylint: disable=unused-argument,redefined-builtin
gradient_checkpointing_kwargs = args.gradient_checkpointing_kwargs or {}
use_reentrant = (
"use_reentrant" not in gradient_checkpointing_kwargs
or gradient_checkpointing_kwargs["use_reentrant"]
@profiling_decorator
def _move_model_to_vllm(self):
# For DeepSpeed ZeRO-3, we need to gather all parameters before operations
deepspeed_plugin = self.accelerator.state.deepspeed_plugin
zero_stage_3 = deepspeed_plugin is not None and deepspeed_plugin.zero_stage == 3
gather_if_zero3 = (
deepspeed.zero.GatheredParameters if zero_stage_3 else nullcontext
)

if use_reentrant:
if hasattr(model, "enable_input_require_grads"):
model.enable_input_require_grads()
else:
if is_peft_model(self.model):
# With PEFT and DeepSpeed ZeRO Stage 3, we must gather the full model at once before merging, as merging
# adapters in a sharded manner is not supported.
with gather_if_zero3(list(self.model.parameters())):
self.model.merge_adapter()

def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
# Update vLLM weights while parameters are gathered
for name, param in self.model.named_parameters():
# When using PEFT, we need to recover the original parameter name and discard some parameters
name = (
name.removeprefix("base_model.model.")
.removeprefix("base_model.model.")
.replace(".base_layer", "")
)
if self.model.prefix in name:
continue
# When module to save, remove its prefix and discard the original module
if "original_module" in name:
continue
name = name.replace("modules_to_save.default.", "")

model.get_input_embeddings().register_forward_hook(
make_inputs_require_grad
)
if self.accelerator.is_main_process:
self.vllm_client.update_named_param(name, param.data)

return model
# pylint: enable=unused-argument,redefined-builtin
# Unmerge adapters while parameters are still gathered
self.model.unmerge_adapter()
# Parameters will automatically be repartitioned when exiting the context
else:
# For non-PEFT models, simply gather and update each parameter individually.
for name, param in self.model.named_parameters():
with gather_if_zero3([param]):
if self.accelerator.is_main_process:
self.vllm_client.update_named_param(name, param.data)

def _move_model_to_vllm(self):
with unwrap_model_for_generation(
self.model,
self.accelerator,
gather_deepspeed3_params=self.args.ds3_gather_for_generation,
) as unwrapped_model:
if is_compiled_module(unwrapped_model):
unwrapped_model = (
unwrapped_model._orig_mod  # pylint: disable=protected-access
)
if is_peft_model(unwrapped_model):
unwrapped_model.merge_adapter()
state_dict = unwrapped_model.state_dict()
# Remove base_model and base_layer prefixes
state_dict = {
k.removeprefix("base_model.model.")
.removeprefix("base_model.model.")
.replace(".base_layer", ""): v
for k, v in state_dict.items()
}
# Remove values with adapter prefix (example: "_lora")
state_dict = {
k: v
for k, v in state_dict.items()
if unwrapped_model.prefix not in k
}
# When module to save, remove its prefix and discard the original module
state_dict = {
k.replace("modules_to_save.default.", ""): v
for k, v in state_dict.items()
if "original_module" not in k
}
else:
state_dict = unwrapped_model.state_dict()
if self.accelerator.is_main_process:
llm_model = (
self.llm.llm_engine.model_executor.driver_worker.model_runner.model
)
llm_model.load_weights(state_dict.items())
if is_peft_model(unwrapped_model):
unwrapped_model.unmerge_adapter()
# Reset cache on main process
if self.accelerator.is_main_process:
self.vllm_client.reset_prefix_cache()

@@ -4,5 +4,6 @@
# flake8: noqa

from .optimizer import OptimizerMixin
from .rng_state_loader import RngLoaderMixin
from .scheduler import SchedulerMixin
from .sequence_parallel import SequenceParallelMixin
src/axolotl/core/trainers/mixins/rng_state_loader.py (new file, 67 lines)
@@ -0,0 +1,67 @@
"""
Temporary fix/override for bug in resume from checkpoint

See https://github.com/huggingface/transformers/pull/37162

TODO: Remove when upstream added PR to release
"""

import logging
import os
import random

import numpy as np
import torch
from transformers import Trainer, is_torch_npu_available
from transformers.trainer import safe_globals
from transformers.trainer_pt_utils import set_rng_state_for_device
from transformers.training_args import ParallelMode

LOG = logging.getLogger(__name__)


class RngLoaderMixin(Trainer):
    """
    mixin for method override to load RNG states from a checkpoint
    """

    def _load_rng_state(self, checkpoint):
        # Load RNG states from `checkpoint`
        if checkpoint is None:
            return

        if self.args.world_size > 1:
            process_index = self.args.process_index
            rng_file = os.path.join(checkpoint, f"rng_state_{process_index}.pth")
            if not os.path.isfile(rng_file):
                LOG.info(
                    f"Didn't find an RNG file for process {process_index}, if you are resuming a training that "
                    "wasn't launched in a distributed fashion, reproducibility is not guaranteed."
                )
                return
        else:
            rng_file = os.path.join(checkpoint, "rng_state.pth")
            if not os.path.isfile(rng_file):
                LOG.info(
                    "Didn't find an RNG file, if you are resuming a training that was launched in a distributed "
                    "fashion, reproducibility is not guaranteed."
                )
                return

        # Use safe_globals to ensure numpy RNG states can be deserialized safely under PyTorch 2.6+,
        # which requires allowlisted classes when loading with weights_only=True.
        with safe_globals():
            checkpoint_rng_state = torch.load(rng_file)  # nosec B614
        random.setstate(checkpoint_rng_state["python"])
        np.random.set_state(checkpoint_rng_state["numpy"])
        torch.random.set_rng_state(checkpoint_rng_state["cpu"])

        is_distributed = self.args.parallel_mode == ParallelMode.DISTRIBUTED
        if torch.cuda.is_available():
            set_rng_state_for_device(
                "CUDA", torch.cuda, checkpoint_rng_state, is_distributed
            )
        if is_torch_npu_available():
            set_rng_state_for_device(
                "NPU", torch.npu, checkpoint_rng_state, is_distributed
            )
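The override above is only exercised when a run is resumed from a checkpoint. As a hedged illustration, a resumed run could be configured roughly like this (`resume_from_checkpoint` is an existing axolotl option; the path is hypothetical):

```yaml
resume_from_checkpoint: ./outputs/my-run/checkpoint-500   # hypothetical checkpoint directory
```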
@@ -7,6 +7,7 @@ import torch
|
||||
import torch.distributed as dist
|
||||
import torch.nn.functional as F
|
||||
from datasets import Dataset
|
||||
from torch import nn
|
||||
from torch.utils.data import DistributedSampler, Sampler
|
||||
|
||||
from axolotl.monkeypatch.attention.ring_attn import get_ring_attn_group
|
||||
@@ -129,3 +130,53 @@ class SequenceParallelMixin:
|
||||
)
|
||||
|
||||
update_ring_flash_attn_params(cu_seqlens, self.ring_attn_group)
|
||||
|
||||
def training_step(
|
||||
self,
|
||||
model: nn.Module,
|
||||
inputs: dict[str, torch.Tensor | Any],
|
||||
num_items_in_batch: int | None = None,
|
||||
) -> torch.Tensor:
|
||||
"""
|
||||
Perform a training step on a batch of inputs. Overrides the
|
||||
`transformers.trainer.Trainer` method to handle sequence parallelism if
|
||||
enabled.
|
||||
|
||||
Args:
|
||||
model: Model to perform training step for.
|
||||
inputs: Dictionary mapping.
|
||||
"""
|
||||
# Set up sequence parallelism for this step if enabled
|
||||
if self.args.sequence_parallel_degree > 1:
|
||||
self._update_ring_flash_attn_params(inputs)
|
||||
|
||||
# Proceed with normal training step
|
||||
return super().training_step(model, inputs, num_items_in_batch) # type: ignore
|
||||
|
||||
def prediction_step(
|
||||
self,
|
||||
model: nn.Module,
|
||||
inputs: dict[str, torch.Tensor | Any],
|
||||
prediction_loss_only: bool,
|
||||
ignore_keys: list[str] | None = None,
|
||||
) -> tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]:
|
||||
"""
|
||||
Perform a prediction step on a batch of inputs. Overrides the
|
||||
`transformers.trainer.Trainer` method to handle sequence parallelism if
|
||||
enabled.
|
||||
|
||||
Args:
|
||||
model: Model to perform prediction step for.
|
||||
inputs: Dictionary mapping of inputs.
|
||||
prediction_loss_only: Whether to return only the loss.
|
||||
ignore_keys: Keys to ignore in the inputs.
|
||||
|
||||
Returns:
|
||||
Tuple of (loss, logits, labels).
|
||||
"""
|
||||
# Set up sequence parallelism for this prediction step if enabled
|
||||
if self.args.sequence_parallel_degree > 1:
|
||||
self._update_ring_flash_attn_params(inputs)
|
||||
|
||||
# Proceed with normal prediction step
|
||||
return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) # type: ignore
|
||||
|
||||
@@ -13,6 +13,7 @@ from trl import (
|
||||
RewardTrainer,
|
||||
)
|
||||
|
||||
from axolotl.core.trainers.mixins import RngLoaderMixin
|
||||
from axolotl.core.trainers.mixins.scheduler import SchedulerMixin
|
||||
|
||||
|
||||
@@ -74,7 +75,7 @@ class TRLPPOTrainer(PPOTrainer):
|
||||
)
|
||||
|
||||
|
||||
class AxolotlORPOTrainer(SchedulerMixin, ORPOTrainer):
|
||||
class AxolotlORPOTrainer(RngLoaderMixin, SchedulerMixin, ORPOTrainer):
|
||||
"""
|
||||
Extend the base ORPOTrainer for axolotl helpers
|
||||
"""
|
||||
@@ -154,7 +155,7 @@ class AxolotlORPOTrainer(SchedulerMixin, ORPOTrainer):
|
||||
return loss, metrics
|
||||
|
||||
|
||||
class AxolotlKTOTrainer(SchedulerMixin, KTOTrainer):
|
||||
class AxolotlKTOTrainer(RngLoaderMixin, SchedulerMixin, KTOTrainer):
|
||||
"""
|
||||
Extend the base KTOTrainer for axolotl helpers
|
||||
"""
|
||||
@@ -162,7 +163,7 @@ class AxolotlKTOTrainer(SchedulerMixin, KTOTrainer):
|
||||
tag_names = ["axolotl", "kto"]
|
||||
|
||||
|
||||
class AxolotlCPOTrainer(SchedulerMixin, CPOTrainer):
|
||||
class AxolotlCPOTrainer(RngLoaderMixin, SchedulerMixin, CPOTrainer):
|
||||
"""
|
||||
Extend the base CPOTrainer for axolotl helpers
|
||||
"""
|
||||
@@ -244,7 +245,7 @@ class AxolotlCPOTrainer(SchedulerMixin, CPOTrainer):
|
||||
return loss, metrics
|
||||
|
||||
|
||||
class AxolotlRewardTrainer(SchedulerMixin, RewardTrainer):
|
||||
class AxolotlRewardTrainer(RngLoaderMixin, SchedulerMixin, RewardTrainer):
|
||||
"""
|
||||
Extend the base RewardTrainer for axolotl helpers
|
||||
"""
|
||||
@@ -252,7 +253,7 @@ class AxolotlRewardTrainer(SchedulerMixin, RewardTrainer):
|
||||
tag_names = ["axolotl", "reward"]
|
||||
|
||||
|
||||
class AxolotlPRMTrainer(SchedulerMixin, PRMTrainer):
|
||||
class AxolotlPRMTrainer(RngLoaderMixin, SchedulerMixin, PRMTrainer):
|
||||
"""
|
||||
Extend the base trl.PRMTrainer for axolotl helpers
|
||||
"""
|
||||
|
||||
@@ -34,6 +34,12 @@ class AxolotlTrainingMixins:
        default=False,
        metadata={"help": "Use sample packing for efficient training."},
    )
    sample_packing_sequentially: bool = field(
        default=False,
        metadata={
            "help": "Use next-fit sample packing that preserves the order of samples coming from the sampler. Use in combination with curriculum_sampling for fully sequential packing."
        },
    )
    multipack_real_batches: bool = field(
        default=False,
        metadata={"help": "Use real batches for efficient training."},
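The new packing flag is driven from the YAML config. A minimal sketch of how these options might be combined; the key names match the config schema added later in this change set, and the values are illustrative:

```yaml
sample_packing: true
sample_packing_sequentially: true   # next-fit packing that preserves sampler order
curriculum_sampling: true           # pair with a sequential sampler for fully sequential packing
```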
@@ -15,6 +15,7 @@ from axolotl.logging_config import configure_logging
|
||||
from axolotl.train import TrainDatasetMeta
|
||||
from axolotl.utils import set_pytorch_cuda_alloc_conf
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.distributed import cleanup_distributed
|
||||
from axolotl.utils.models import load_model, load_processor, load_tokenizer
|
||||
from axolotl.utils.trainer import setup_trainer
|
||||
|
||||
@@ -159,4 +160,6 @@ def evaluate(*, cfg: DictDefault, dataset_meta: TrainDatasetMeta) -> Dict[str, f
|
||||
del model
|
||||
del tokenizer
|
||||
|
||||
cleanup_distributed()
|
||||
|
||||
return all_metrics
|
||||
|
||||
@@ -25,8 +25,8 @@ import torch
|
||||
|
||||
from axolotl.integrations.base import BasePlugin
|
||||
from axolotl.utils import get_pytorch_version
|
||||
from axolotl.utils.distributed import zero_only
|
||||
|
||||
from ...utils.distributed import zero_only
|
||||
from .args import CutCrossEntropyArgs # pylint: disable=unused-import. # noqa: F401
|
||||
|
||||
LOG = logging.getLogger("axolotl.integrations.cut_cross_entropy")
|
||||
|
||||
@@ -15,7 +15,6 @@ import transformers
|
||||
from cut_cross_entropy.transformers.utils import (
|
||||
PatchOptions,
|
||||
TransformersModelT,
|
||||
apply_lce,
|
||||
)
|
||||
from torch import nn
|
||||
from transformers.cache_utils import Cache, HybridCache
|
||||
@@ -33,6 +32,8 @@ from transformers.utils import (
|
||||
)
|
||||
from transformers.utils.deprecation import deprecate_kwarg
|
||||
|
||||
from axolotl.integrations.cut_cross_entropy.monkeypatch.utils import apply_lce
|
||||
|
||||
_PATCH_OPTS: PatchOptions | None = None
|
||||
|
||||
|
||||
@@ -134,25 +135,17 @@ def cce_forward(
|
||||
|
||||
if _PATCH_OPTS is not None and _PATCH_OPTS.use_lce(labels, self.training):
|
||||
assert labels is not None
|
||||
if self.config.final_logit_softcapping is not None:
|
||||
logger.warning_once(
|
||||
"final_logit_softcapping is not supported for gemma3_text with CCE. Disabling."
|
||||
)
|
||||
loss = apply_lce(
|
||||
hidden_states[:, slice_indices, :],
|
||||
self.lm_head.weight,
|
||||
labels,
|
||||
_PATCH_OPTS,
|
||||
softcap=getattr(self.config, "final_logit_softcapping", None),
|
||||
**loss_kwargs,
|
||||
)
|
||||
elif _PATCH_OPTS is not None and defer_logits_calculation:
|
||||
# defer logits calculation to the ConditionalGeneration forward
|
||||
logits = hidden_states[:, slice_indices, :]
|
||||
|
||||
if self.config.final_logit_softcapping is not None:
|
||||
logger.warning_once(
|
||||
"final_logit_softcapping is not supported for gemma3 with CCE. Disabling."
|
||||
)
|
||||
else:
|
||||
logits = self.lm_head(hidden_states[:, slice_indices, :])
|
||||
if self.config.final_logit_softcapping is not None:
|
||||
@@ -353,6 +346,7 @@ def cce_forward_multimodal(
|
||||
self.language_model.lm_head.weight,
|
||||
labels,
|
||||
_PATCH_OPTS,
|
||||
softcap=getattr(self.config, "final_logit_softcapping", None),
|
||||
**lm_kwargs,
|
||||
)
|
||||
else:
|
||||
|
||||
@@ -0,0 +1,40 @@
|
||||
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
|
||||
|
||||
"""Monkeypatch for apply_lce to add softcap."""
|
||||
|
||||
import torch
|
||||
from cut_cross_entropy import linear_cross_entropy
|
||||
from cut_cross_entropy.transformers.utils import PatchOptions
|
||||
|
||||
|
||||
def apply_lce(
|
||||
e: torch.Tensor,
|
||||
c: torch.Tensor,
|
||||
labels: torch.Tensor,
|
||||
opts: PatchOptions,
|
||||
bias: torch.Tensor | None = None,
|
||||
softcap: float | None = None,
|
||||
**loss_kwargs,
|
||||
) -> torch.Tensor:
|
||||
"""Monkey patch for apply_lce to support softcap kwarg."""
|
||||
num_items_in_batch = loss_kwargs.get("num_items_in_batch", None)
|
||||
cce_kwargs = opts.to_kwargs()
|
||||
if num_items_in_batch is not None and cce_kwargs["reduction"] == "mean":
|
||||
cce_kwargs["reduction"] = "sum"
|
||||
else:
|
||||
num_items_in_batch = None
|
||||
|
||||
loss = linear_cross_entropy(
|
||||
e,
|
||||
c,
|
||||
labels.to(e.device),
|
||||
bias=bias,
|
||||
shift=True,
|
||||
softcap=softcap,
|
||||
**cce_kwargs,
|
||||
)
|
||||
|
||||
if num_items_in_batch is not None:
|
||||
loss = loss / num_items_in_batch
|
||||
|
||||
return loss
|
||||
@@ -20,6 +20,26 @@ liger_layer_norm: true
liger_fused_linear_cross_entropy: true
```

## Supported Models

- deepseek_v2
- gemma
- gemma2
- gemma3 (partial support, no support for FLCE yet)
- granite
- jamba
- llama
- mistral
- mixtral
- mllama
- mllama_text_model
- olmo2
- paligemma
- phi3
- qwen2
- qwen2_5_vl
- qwen2_vl

## Citation

```bib
@@ -21,6 +21,7 @@ It is designed to be performant, correct, and light-weight.
|
||||
import inspect
|
||||
import logging
|
||||
import sys
|
||||
from functools import partial
|
||||
|
||||
from axolotl.integrations.base import BasePlugin
|
||||
|
||||
@@ -41,11 +42,18 @@ class LigerPlugin(BasePlugin):
|
||||
def pre_model_load(self, cfg):
|
||||
from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
|
||||
from liger_kernel.transformers.functional import liger_cross_entropy
|
||||
from liger_kernel.transformers.geglu import LigerGEGLUMLP
|
||||
from liger_kernel.transformers.layer_norm import LigerLayerNorm
|
||||
from liger_kernel.transformers.monkey_patch import MODEL_TYPE_TO_APPLY_LIGER_FN
|
||||
from liger_kernel.transformers.rms_norm import LigerRMSNorm
|
||||
from liger_kernel.transformers.rope import liger_rotary_pos_emb
|
||||
from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
|
||||
|
||||
if cfg.liger_cross_entropy and cfg.liger_fused_linear_cross_entropy:
|
||||
raise ValueError(
|
||||
"Cannot have both `liger_cross_entropy` and `liger_fused_linear_cross_entropy` set."
|
||||
)
|
||||
|
||||
if cfg.model_config_type in MODEL_TYPE_TO_APPLY_LIGER_FN:
|
||||
apply_liger_fn = MODEL_TYPE_TO_APPLY_LIGER_FN[cfg.model_config_type]
|
||||
liger_fn_sig = inspect.signature(apply_liger_fn)
|
||||
@@ -82,6 +90,8 @@ class LigerPlugin(BasePlugin):
|
||||
modeling_jamba.JambaRMSNorm = LigerRMSNorm
|
||||
if cfg.liger_glu_activation:
|
||||
modeling_jamba.JambaMLP = LigerSwiGLUMLP
|
||||
if cfg.liger_layer_norm:
|
||||
modeling_jamba.nn.LayerNorm = LigerLayerNorm
|
||||
if cfg.liger_cross_entropy:
|
||||
from transformers.loss.loss_utils import nn
|
||||
|
||||
@@ -104,15 +114,51 @@ class LigerPlugin(BasePlugin):
|
||||
# The DeepseekV2 version of RoPE is different than upstream LLaMA.
|
||||
# See https://github.com/linkedin/Liger-Kernel/issues/129#issuecomment-2313763528
|
||||
logging.warning("Fused liger_rope is not supported for DeepseekV2.")
|
||||
if cfg.liger_glu_activation:
|
||||
logging.warning("liger_glu_activation is not supported for DeepseekV2.")
|
||||
if cfg.liger_rms_norm:
|
||||
modeling_mod.DeepseekV2RMSNorm = LigerRMSNorm
|
||||
if cfg.liger_glu_activation:
|
||||
modeling_mod.DeepseekV2MLP.forward = LigerSwiGLUMLP.forward
|
||||
if cfg.liger_layer_norm:
|
||||
modeling_mod.DeepseekV2MLP.forward = LigerLayerNorm.forward
|
||||
if cfg.liger_cross_entropy:
|
||||
# We do not patch `nn.functional.cross_entropy` for DeepseekV2 as it still uses
|
||||
# nn.CrossEntropyLoss in the forward method.
|
||||
modeling_mod.CrossEntropyLoss = LigerCrossEntropyLoss
|
||||
if cfg.liger_fused_linear_cross_entropy:
|
||||
modeling_mod.DeepseekV2ForCausalLM.forward = deepseekv2_lce_forward
|
||||
elif cfg.model_config_type in ["gemma3_text", "deepseek_v3"]:
|
||||
elif cfg.model_config_type in ["gemma3", "gemma3_text"]:
|
||||
from transformers.models.gemma3 import modeling_gemma3
|
||||
|
||||
if cfg.liger_rope:
|
||||
modeling_gemma3.apply_rotary_pos_emb = liger_rotary_pos_emb
|
||||
if cfg.liger_rms_norm:
|
||||
|
||||
def _liger_rms_norm_wrapper(dim, **kwargs):
|
||||
"Convert 'dim' keyword to 'hidden_size' to pass to LigerRMSNorm"
|
||||
return LigerRMSNorm(hidden_size=dim, **kwargs)
|
||||
|
||||
modeling_gemma3.Gemma3RMSNorm = partial(
|
||||
_liger_rms_norm_wrapper,
|
||||
offset=1.0,
|
||||
casting_mode="gemma",
|
||||
init_fn="zeros",
|
||||
in_place=False,
|
||||
)
|
||||
if cfg.liger_glu_activation:
|
||||
modeling_gemma3.Gemma3MLP = LigerGEGLUMLP
|
||||
if cfg.liger_layer_norm:
|
||||
modeling_gemma3.nn.LayerNorm = LigerLayerNorm
|
||||
|
||||
if cfg.liger_cross_entropy:
|
||||
from transformers.loss.loss_utils import nn
|
||||
|
||||
nn.functional.cross_entropy = liger_cross_entropy
|
||||
|
||||
if cfg.liger_fused_linear_cross_entropy:
|
||||
raise NotImplementedError(
|
||||
"Fused linear cross entropy is not yet supported for Gemma3."
|
||||
)
|
||||
elif cfg.model_config_type in ["deepseek_v3"]:
|
||||
raise ValueError(f"Unsupported model config type: {cfg.model_config_type}")
|
||||
|
||||
@@ -38,13 +38,19 @@ def set_ring_attn_group(ring_attn_group: dist.ProcessGroup | None):
    RING_ATTN_GROUP = ring_attn_group


def register_ring_attn(sequence_parallel_degree: int):
def register_ring_attn(sequence_parallel_degree: int, heads_k_stride: int | None):
    """
    Create ring attention group and substitute flash attn with ring flash attn.

    Args:
        sequence_parallel_degree: Sequence parallelism factor.
        heads_k_stride: Sequence parallelism K head stride size. Passed
            through to `ring_flash_attn.substitute_hf_flash_attn`.
    """
    if get_ring_attn_group() is not None:
        LOG.info("Ring attention already registered, exiting early...")
        return

    LOG.info(
        "Enabling ring attention sequence parallelism: "
        f"each sequence will be processed across {sequence_parallel_degree} GPUs"
@@ -84,6 +90,11 @@ def register_ring_attn(sequence_parallel_degree: int):
    if rank == 0:
        LOG.info(f"Sequence parallel group assignments: {group_assignments}")

    if heads_k_stride is None:
        heads_k_stride = 1

    from ring_flash_attn import substitute_hf_flash_attn

    substitute_hf_flash_attn(get_ring_attn_group(), sequence_parallel_degree)
    substitute_hf_flash_attn(
        process_group=get_ring_attn_group(), heads_k_stride=heads_k_stride
    )
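For reference, a minimal sketch of how these sequence-parallelism options might appear in an axolotl YAML config. `sequence_parallel_degree` and `heads_k_stride` are defined in the config schema later in this change set; `flash_attention` is assumed here as the usual prerequisite, since ring attention is installed by patching flash attention:

```yaml
sequence_parallel_degree: 4   # each sequence is processed across 4 GPUs
heads_k_stride: 1             # passed through to ring_flash_attn.substitute_hf_flash_attn
flash_attention: true         # assumed prerequisite for the ring-attention patch
```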
238
src/axolotl/monkeypatch/gemma3.py
Normal file
238
src/axolotl/monkeypatch/gemma3.py
Normal file
@@ -0,0 +1,238 @@
|
||||
"""Monkeypatch for gemma3 conditional generation forward to fix loss exploding"""
|
||||
|
||||
# pylint: disable=duplicate-code
|
||||
|
||||
from typing import Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
from transformers.cache_utils import Cache
|
||||
from transformers.models.gemma3.modeling_gemma3 import (
|
||||
_CONFIG_FOR_DOC,
|
||||
GEMMA3_INPUTS_DOCSTRING,
|
||||
Gemma3CausalLMOutputWithPast,
|
||||
logger,
|
||||
)
|
||||
from transformers.utils import (
|
||||
add_start_docstrings_to_model_forward,
|
||||
is_torchdynamo_compiling,
|
||||
replace_return_docstrings,
|
||||
)
|
||||
from transformers.utils.deprecation import deprecate_kwarg
|
||||
|
||||
|
||||
@deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
|
||||
@add_start_docstrings_to_model_forward(GEMMA3_INPUTS_DOCSTRING)
|
||||
@replace_return_docstrings(
|
||||
output_type=Gemma3CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC
|
||||
)
|
||||
def new_forward(
|
||||
self,
|
||||
input_ids: torch.LongTensor = None,
|
||||
pixel_values: torch.FloatTensor = None,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_ids: Optional[torch.LongTensor] = None,
|
||||
past_key_values: Optional[Union[list[torch.FloatTensor], Cache]] = None,
|
||||
token_type_ids: Optional[torch.LongTensor] = None,
|
||||
cache_position: Optional[torch.LongTensor] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
labels: Optional[torch.LongTensor] = None,
|
||||
use_cache: Optional[bool] = None,
|
||||
output_attentions: Optional[bool] = None,
|
||||
output_hidden_states: Optional[bool] = None,
|
||||
return_dict: Optional[bool] = None,
|
||||
logits_to_keep: Union[int, torch.Tensor] = 0,
|
||||
**lm_kwargs,
|
||||
) -> Union[Tuple, Gemma3CausalLMOutputWithPast]:
|
||||
r"""
|
||||
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
|
||||
config.text_config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
|
||||
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.text_config.vocab_size]`.
|
||||
|
||||
logits_to_keep (`int` or `torch.Tensor`, *optional*):
|
||||
If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
|
||||
`input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
|
||||
token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
|
||||
If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
|
||||
This is useful when using packed tensor format (single dimension for batch and sequence length).
|
||||
|
||||
Returns:
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
>>> from transformers import AutoProcessor, Gemma3ForConditionalGeneration
|
||||
|
||||
>>> model = Gemma3ForConditionalGeneration.from_pretrained("google/Gemma3-test-224px-hf")
|
||||
>>> processor = AutoProcessor.from_pretrained("google/Gemma3-test-224px-hf")
|
||||
|
||||
>>> prompt = "answer en Where is the cow standing?"
|
||||
>>> url = "https://huggingface.co/gv-hf/Gemma3-test-224px-hf/resolve/main/cow_beach_1.png"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
>>> inputs = processor(images=image, text=prompt, return_tensors="pt")
|
||||
|
||||
>>> # Generate
|
||||
>>> generate_ids = model.generate(**inputs, max_length=30)
|
||||
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
||||
"answer en Where is the cow standing?\nbeach"
|
||||
```"""
|
||||
|
||||
if (input_ids is None) ^ (inputs_embeds is not None):
|
||||
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
|
||||
|
||||
output_attentions = (
|
||||
output_attentions
|
||||
if output_attentions is not None
|
||||
else self.config.output_attentions
|
||||
)
|
||||
output_hidden_states = (
|
||||
output_hidden_states
|
||||
if output_hidden_states is not None
|
||||
else self.config.output_hidden_states
|
||||
)
|
||||
return_dict = (
|
||||
return_dict if return_dict is not None else self.config.use_return_dict
|
||||
)
|
||||
|
||||
is_training = token_type_ids is not None and labels is not None
|
||||
|
||||
# Replace image id with PAD if the image token is OOV, to avoid index-errors
|
||||
if input_ids is not None and self.config.image_token_index >= self.vocab_size:
|
||||
special_image_mask = input_ids == self.config.image_token_index
|
||||
llm_input_ids = input_ids.clone()
|
||||
llm_input_ids[special_image_mask] = 0
|
||||
else:
|
||||
llm_input_ids = input_ids
|
||||
|
||||
if inputs_embeds is None:
|
||||
inputs_embeds = self.get_input_embeddings()(llm_input_ids)
|
||||
|
||||
if cache_position is None:
|
||||
past_seen_tokens = (
|
||||
past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||
)
|
||||
cache_position = torch.arange(
|
||||
past_seen_tokens,
|
||||
past_seen_tokens + inputs_embeds.shape[1],
|
||||
device=inputs_embeds.device,
|
||||
)
|
||||
|
||||
# Merge text and images
|
||||
if pixel_values is not None:
|
||||
image_features = self.get_image_features(pixel_values)
|
||||
|
||||
if input_ids is None:
|
||||
special_image_mask = inputs_embeds == self.get_input_embeddings()(
|
||||
torch.tensor(
|
||||
self.config.image_token_index,
|
||||
dtype=torch.long,
|
||||
device=inputs_embeds.device,
|
||||
)
|
||||
)
|
||||
else:
|
||||
special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(
|
||||
-1
|
||||
)
|
||||
special_image_mask = special_image_mask.expand_as(inputs_embeds).to(
|
||||
inputs_embeds.device
|
||||
)
|
||||
|
||||
if (
|
||||
not is_torchdynamo_compiling()
|
||||
and inputs_embeds[special_image_mask].numel() != image_features.numel()
|
||||
):
|
||||
image_tokens_in_text = (special_image_mask).sum(dim=1).sum(dim=0)[0]
|
||||
raise ValueError(
|
||||
f"Number of images does not match number of special image tokens in the input text. "
|
||||
f"Got {image_tokens_in_text} image tokens in the text but {image_features.shape[0] * image_features.shape[1]} "
|
||||
"tokens from image embeddings."
|
||||
)
|
||||
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
|
||||
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
|
||||
|
||||
# mask out pad-token-ids in labels for BC
|
||||
if labels is not None and self.pad_token_id in labels:
|
||||
logger.warning_once(
|
||||
"`labels` contains `pad_token_id` which will be masked with `config.ignore_index`. "
|
||||
"You have to mask out `pad_token_id` when preparing `labels`, this behavior will be removed in v.4.46.",
|
||||
)
|
||||
labels = torch.where(
|
||||
input_ids == self.pad_token_id, self.config.ignore_index, labels
|
||||
)
|
||||
|
||||
causal_mask = self._update_causal_mask( # pylint: disable=protected-access
|
||||
attention_mask,
|
||||
token_type_ids,
|
||||
past_key_values,
|
||||
cache_position,
|
||||
inputs_embeds,
|
||||
is_training,
|
||||
)
|
||||
outputs = self.language_model(
|
||||
attention_mask=causal_mask,
|
||||
position_ids=position_ids,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
cache_position=cache_position,
|
||||
logits_to_keep=logits_to_keep,
|
||||
**lm_kwargs,
|
||||
)
|
||||
|
||||
logits = outputs[0]
|
||||
loss = None
|
||||
if labels is not None:
|
||||
if attention_mask is not None:
|
||||
# Get the shifted attention mask
|
||||
shift_attention_mask = attention_mask[:, -logits.shape[1] + 1 :].to(
|
||||
logits.device
|
||||
) # +1 for shift
|
||||
|
||||
# Filter logits and labels based on attention mask
|
||||
valid_indices = shift_attention_mask != 0
|
||||
filtered_logits = logits[..., :-1, :][valid_indices]
|
||||
filtered_labels = labels[..., 1:][valid_indices.to(labels.device)]
|
||||
|
||||
# TODO: do we need to handle num_items_in_batch given we filter the logits and labels?
|
||||
|
||||
loss = self.loss_function(
|
||||
logits=filtered_logits,
|
||||
labels=None, # we pass shift_labels
|
||||
shift_labels=filtered_labels,
|
||||
vocab_size=self.config.text_config.vocab_size,
|
||||
**lm_kwargs,
|
||||
)
|
||||
else:
|
||||
# Standard case without filtering
|
||||
loss = self.loss_function(
|
||||
logits=logits,
|
||||
labels=labels,
|
||||
vocab_size=self.config.text_config.vocab_size,
|
||||
**lm_kwargs,
|
||||
)
|
||||
if not return_dict:
|
||||
output = (logits,) + outputs[1:]
|
||||
return (loss,) + output if loss is not None else output
|
||||
|
||||
return Gemma3CausalLMOutputWithPast(
|
||||
loss=loss,
|
||||
logits=logits,
|
||||
past_key_values=outputs.past_key_values,
|
||||
hidden_states=outputs.hidden_states,
|
||||
attentions=outputs.attentions,
|
||||
image_hidden_states=image_features if pixel_values is not None else None,
|
||||
)
|
||||
|
||||
|
||||
def patch_gemma3conditionalgeneration_forward():
|
||||
from transformers.models.gemma3.modeling_gemma3 import (
|
||||
Gemma3ForConditionalGeneration,
|
||||
)
|
||||
|
||||
Gemma3ForConditionalGeneration.forward = new_forward
|
||||
@@ -252,12 +252,38 @@ def apply_lora_kernel_patches(
|
||||
LOG.setLevel(logging.INFO)
|
||||
|
||||
# Choose activation based on model type
|
||||
activation = model.config.hidden_act
|
||||
activation = None
|
||||
text_config = (
|
||||
model.config.get_text_config()
|
||||
if hasattr(model.config, "get_text_config")
|
||||
else model.config
|
||||
)
|
||||
if hasattr(text_config, "hidden_act"):
|
||||
activation = text_config.hidden_act
|
||||
elif hasattr(text_config, "hidden_activation"):
|
||||
activation = text_config.hidden_activation
|
||||
|
||||
# map activation to supported activation
|
||||
if "gelu" in activation:
|
||||
# gemma3 uses gelu_pytorch_tanh
|
||||
activation = "gelu"
|
||||
|
||||
if activation not in SUPPORTED_ACTIVATIONS:
|
||||
raise NotImplementedError(f"Activation {activation} is not supported")
|
||||
|
||||
layers = []
|
||||
# check for multimodal models first
|
||||
if hasattr(model, "language_model"):
|
||||
layers = model.language_model.model.layers
|
||||
elif hasattr(model, "model"):
|
||||
layers = model.model.model.layers
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Model type {model.config.model_type} is not supported yet. Please create an Issue."
|
||||
)
|
||||
|
||||
# Patch each layer
|
||||
for layer in model.model.model.layers:
|
||||
for layer in layers:
|
||||
# Add QKV, O fallback implementations to start
|
||||
# These will be overwritten later (if some conditions apply)
|
||||
layer.self_attn.apply_qkv = types.MethodType(
|
||||
|
||||
@@ -22,6 +22,7 @@ SUPPORTED_MULTIPACK_MODEL_TYPES = [
|
||||
"phi3",
|
||||
"gemma",
|
||||
"gemma2",
|
||||
"gemma3",
|
||||
"gemma3_text",
|
||||
"cohere",
|
||||
"cohere2",
|
||||
|
||||
@@ -411,11 +411,15 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
        if turn_idx >= len(turns):
            raise ValueError(f"Turn index {turn_idx} out of range")

        # mistral does not output message if it contains only system message
        # mistral/gemma3 does not output message if it contains only system message
        if (
            turn_idx == 0
            and turns[0].get("role") == "system"
            and "mistral" in self.tokenizer.name_or_path.lower()
            and (
                "mistral" in self.tokenizer.name_or_path.lower()
                # gemma3 uses gemma tokenizer
                or "gemma" in self.tokenizer.name_or_path.lower()
            )
        ):
            return -1, -1

@@ -14,6 +14,7 @@ import transformers.modelcard
|
||||
from accelerate.logging import get_logger
|
||||
from accelerate.utils import save_fsdp_model
|
||||
from datasets import Dataset
|
||||
from huggingface_hub.errors import OfflineModeIsEnabled
|
||||
from peft import PeftConfig, PeftModel
|
||||
from transformers import PreTrainedModel, PreTrainedTokenizer, ProcessorMixin
|
||||
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
|
||||
@@ -26,6 +27,7 @@ from axolotl.contribs.lgpl import ( # pylint: disable = no-name-in-module
|
||||
from axolotl.core.trainer_builder import HFCausalTrainerBuilder, HFRLTrainerBuilder
|
||||
from axolotl.logging_config import configure_logging
|
||||
from axolotl.utils.dict import DictDefault
|
||||
from axolotl.utils.distributed import cleanup_distributed
|
||||
from axolotl.utils.freeze import freeze_layers_except
|
||||
from axolotl.utils.models import load_model, load_processor, load_tokenizer
|
||||
from axolotl.utils.trainer import setup_trainer
|
||||
@@ -156,6 +158,8 @@ def setup_signal_handler(
|
||||
_model.save_pretrained(
|
||||
cfg.output_dir, safe_serialization=safe_serialization
|
||||
)
|
||||
|
||||
cleanup_distributed()
|
||||
sys.exit(0)
|
||||
|
||||
_model_weakref = weakref.ref(model)
|
||||
@@ -302,7 +306,7 @@ def create_model_card(cfg: DictDefault, trainer: Trainer):
|
||||
model_card_kwarg["dataset_tags"] = dataset_tags
|
||||
|
||||
trainer.create_model_card(**model_card_kwarg)
|
||||
except (AttributeError, UnicodeDecodeError):
|
||||
except (AttributeError, UnicodeDecodeError, OfflineModeIsEnabled):
|
||||
pass
|
||||
elif cfg.hub_model_id:
|
||||
# Defensively push to the hub to ensure the model card is updated
|
||||
@@ -477,7 +481,7 @@ def train(
|
||||
Returns:
|
||||
Tuple of (model, tokenizer) after training
|
||||
"""
|
||||
# Setup model, tokenizer, (causal or RLHF) trainer etc.
|
||||
# Setup model, tokenizer, (causal or RLHF) trainer, etc.
|
||||
(
|
||||
trainer,
|
||||
model,
|
||||
@@ -486,34 +490,26 @@ def train(
|
||||
processor,
|
||||
) = setup_model_and_trainer(cfg, dataset_meta)
|
||||
|
||||
# Determine if we need to resume from a checkpoint
|
||||
resume_from_checkpoint = determine_resume_checkpoint(cfg)
|
||||
|
||||
# Configuration for saving
|
||||
safe_serialization = cfg.save_safetensors is True
|
||||
|
||||
# Handle untrained tokens if configured
|
||||
safe_serialization = cfg.save_safetensors is True
|
||||
train_dataset = dataset_meta.train_dataset
|
||||
handle_untrained_tokens_fix(
|
||||
cfg, model, tokenizer, train_dataset, safe_serialization
|
||||
)
|
||||
|
||||
# Save initial configs
|
||||
# Additional setup
|
||||
save_initial_configs(cfg, tokenizer, model, peft_config, processor)
|
||||
|
||||
# Set up signal handler for graceful termination
|
||||
setup_signal_handler(cfg, model, safe_serialization)
|
||||
|
||||
# Set up badges and config info for model card
|
||||
setup_model_card(cfg)
|
||||
|
||||
# Execute the training
|
||||
resume_from_checkpoint = determine_resume_checkpoint(cfg)
|
||||
execute_training(cfg, trainer, resume_from_checkpoint)
|
||||
|
||||
# Save the trained model
|
||||
# Save the trained model and cleanup
|
||||
save_trained_model(cfg, trainer, model, safe_serialization)
|
||||
|
||||
# Create model card
|
||||
create_model_card(cfg, trainer)
|
||||
if not cfg.use_ray:
|
||||
cleanup_distributed()
|
||||
|
||||
return model, tokenizer, trainer
|
||||
|
||||
@@ -816,27 +816,6 @@ class SaveAxolotlConfigtoWandBCallback(TrainerCallback):
|
||||
return control
|
||||
|
||||
|
||||
class SaveModelCallback(TrainerCallback):
|
||||
"""Callback to save model on train end"""
|
||||
|
||||
def on_step_end( # pylint: disable=unused-argument
|
||||
self,
|
||||
args: TrainingArguments,
|
||||
state: TrainerState,
|
||||
control: TrainerControl,
|
||||
**kwargs,
|
||||
):
|
||||
# Save
|
||||
if state.global_step >= state.max_steps:
|
||||
control.should_save = True
|
||||
|
||||
def on_train_end( # pylint: disable=unused-argument
|
||||
self, args, state, control, **kwargs
|
||||
):
|
||||
control.should_save = True
|
||||
return control
|
||||
|
||||
|
||||
class GCCallback(TrainerCallback):
|
||||
"""Callback to garbage collect torch cache"""
|
||||
|
||||
|
||||
@@ -112,6 +112,7 @@ class DataCollatorForSeq2Seq:
|
||||
self.local_world_size = dist.get_world_size(group=sp_group)
|
||||
|
||||
def __call__(self, features, return_tensors=None):
|
||||
has_attn_mask = "attention_mask" in features[0].keys()
|
||||
labels = None
|
||||
if return_tensors is None:
|
||||
return_tensors = self.return_tensors
|
||||
@@ -164,6 +165,8 @@ class DataCollatorForSeq2Seq:
|
||||
pad_to_multiple_of=self.pad_to_multiple_of,
|
||||
return_tensors=return_tensors,
|
||||
)
|
||||
if not has_attn_mask:
|
||||
del features["attention_mask"]
|
||||
|
||||
# prepare decoder_input_ids
|
||||
if (
|
||||
|
||||
@@ -78,6 +78,7 @@ def resolve_dtype(cfg):
|
||||
cfg.bf16 = False
|
||||
else:
|
||||
torch.backends.cuda.matmul.allow_tf32 = cfg.tf32 or False
|
||||
torch.backends.cudnn.allow_tf32 = cfg.tf32 or False
|
||||
if cfg.bf16:
|
||||
cfg.fp16 = False
|
||||
|
||||
|
||||
@@ -6,8 +6,12 @@ from pathlib import Path
|
||||
from typing import Optional, Union
|
||||
|
||||
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
|
||||
from huggingface_hub import hf_hub_download
|
||||
from huggingface_hub.errors import HFValidationError
|
||||
from huggingface_hub import hf_hub_download, snapshot_download
|
||||
from huggingface_hub.errors import (
|
||||
HFValidationError,
|
||||
RepositoryNotFoundError,
|
||||
RevisionNotFoundError,
|
||||
)
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
@@ -70,20 +74,25 @@ def load_dataset_w_config(
|
||||
# pylint: disable=invalid-name
|
||||
ds: Optional[Union[Dataset, DatasetDict]] = None # pylint: disable=invalid-name
|
||||
ds_from_hub = False
|
||||
ds_trust_remote_code = config_dataset.trust_remote_code
|
||||
try:
|
||||
# this is just a basic check to see if the path is a
|
||||
# valid HF dataset that's loadable
|
||||
load_dataset(
|
||||
config_dataset.path,
|
||||
name=config_dataset.name,
|
||||
streaming=True,
|
||||
snapshot_download(
|
||||
repo_id=config_dataset.path,
|
||||
repo_type="dataset",
|
||||
token=use_auth_token,
|
||||
revision=config_dataset.revision,
|
||||
trust_remote_code=ds_trust_remote_code,
|
||||
ignore_patterns=["*"],
|
||||
)
|
||||
ds_from_hub = True
|
||||
except (FileNotFoundError, ConnectionError, HFValidationError, ValueError):
|
||||
except (
|
||||
RepositoryNotFoundError,
|
||||
RevisionNotFoundError,
|
||||
FileNotFoundError,
|
||||
ConnectionError,
|
||||
HFValidationError,
|
||||
ValueError,
|
||||
):
|
||||
pass
|
||||
|
||||
ds_from_cloud = False
|
||||
@@ -229,7 +238,8 @@ def load_dataset_w_config(
|
||||
trust_remote_code=config_dataset.trust_remote_code,
|
||||
**load_ds_kwargs,
|
||||
)
|
||||
else:
|
||||
elif config_dataset.data_files:
|
||||
fp: str | list[str] | None = None
|
||||
if isinstance(config_dataset.data_files, str):
|
||||
fp = hf_hub_download(
|
||||
repo_id=config_dataset.path,
|
||||
|
||||
@@ -71,8 +71,8 @@ def barrier():


def is_main_process():
    """
    Check if the current process is the main process.
    If not in distributed mode, always return True.
    Check if the current process is the main process. If not in distributed mode,
    always return `True`.
    """
    if not is_distributed():
        return True
@@ -87,6 +87,18 @@ def get_world_size():
    return int(os.getenv("WORLD_SIZE", "1"))


def cleanup_distributed():
    """
    Destroy process group if torch distributed is initialized. Called in training early
    termination or when training successfully completes.
    """
    # Ensure that all operations are completed before destroying the process group
    torch.cuda.synchronize()
    # Destroy the process group
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()


@contextmanager
def zero_only():
    """
@@ -8,7 +8,7 @@ import math
|
||||
import os
|
||||
import types
|
||||
from functools import cached_property
|
||||
from typing import Any, Dict, Optional, Tuple, Union # noqa: F401
|
||||
from typing import Any, Dict, Optional, Tuple
|
||||
|
||||
import addict
|
||||
import bitsandbytes as bnb
|
||||
@@ -25,7 +25,7 @@ from peft import (
|
||||
prepare_model_for_kbit_training,
|
||||
)
|
||||
from torch import nn
|
||||
from transformers import ( # noqa: F401
|
||||
from transformers import (
|
||||
AddedToken,
|
||||
AutoConfig,
|
||||
AutoModelForCausalLM,
|
||||
@@ -39,6 +39,7 @@ from transformers import ( # noqa: F401
|
||||
LlavaForConditionalGeneration,
|
||||
Mistral3ForConditionalGeneration,
|
||||
MllamaForConditionalGeneration,
|
||||
PretrainedConfig,
|
||||
PreTrainedModel,
|
||||
PreTrainedTokenizerBase,
|
||||
ProcessorMixin,
|
||||
@@ -107,14 +108,21 @@ def get_module_class_from_name(module, name):
|
||||
return None
|
||||
|
||||
|
||||
def check_model_config(cfg: DictDefault, model_config: Union[AutoConfig, DictDefault]):
|
||||
def check_model_config(cfg: DictDefault, model_config: PretrainedConfig):
|
||||
# Set use_cache to False
|
||||
if hasattr(model_config, "use_cache"):
|
||||
model_config.use_cache = False
|
||||
|
||||
if cfg.is_multimodal:
|
||||
if hasattr(model_config, "text_config"):
|
||||
model_config = model_config.text_config
|
||||
model_config.use_cache = False
|
||||
elif hasattr(model_config, "get_text_config"):
|
||||
model_config = model_config.get_text_config()
|
||||
model_config.use_cache = False
|
||||
# For multimodal configs, use_cache is set in the text_config
|
||||
if hasattr(model_config, "get_text_config"):
|
||||
text_config = model_config.get_text_config()
|
||||
if hasattr(text_config, "use_cache"):
|
||||
text_config.use_cache = False
|
||||
else:
|
||||
raise ValueError(
|
||||
"No text config found for multimodal model. Please raise an Issue with model details."
|
||||
)
|
||||
|
||||
# check if image_size is not set and load image size from model config if available
|
||||
if (
|
||||
@@ -523,18 +531,19 @@ class ModelLoader:
|
||||
|
||||
# init model config
|
||||
self.model_config = load_model_config(cfg)
|
||||
if cfg.is_multimodal:
|
||||
if hasattr(self.model_config, "text_config"):
|
||||
self.text_model_config = self.model_config.text_config
|
||||
else:
|
||||
# for qwen2_vl
|
||||
self.text_model_config = self.model_config.get_text_config()
|
||||
else:
|
||||
self.text_model_config = self.model_config
|
||||
|
||||
self.auto_model_loader = AutoModelForCausalLM # pylint: disable=invalid-name
|
||||
|
||||
def apply_patches(self) -> None:
|
||||
# patch gemma3 conditional generation forward before loading plugins
|
||||
# as it could be overridden by plugins
|
||||
if self.cfg.model_config_type == "gemma3":
|
||||
from axolotl.monkeypatch.gemma3 import (
|
||||
patch_gemma3conditionalgeneration_forward,
|
||||
)
|
||||
|
||||
patch_gemma3conditionalgeneration_forward()
|
||||
|
||||
# load any patches from plugins
|
||||
from axolotl.integrations.base import PluginManager
|
||||
|
||||
@@ -609,7 +618,10 @@ class ModelLoader:
|
||||
# Initialize ring attn for sequence parallelism. This must be done after
|
||||
# model init but before the first forward pass, since it modifies flash
|
||||
# attn to use ring comm for SP training across multiple GPUs.
|
||||
register_ring_attn(self.cfg.sequence_parallel_degree)
|
||||
register_ring_attn(
|
||||
sequence_parallel_degree=self.cfg.sequence_parallel_degree,
|
||||
heads_k_stride=self.cfg.heads_k_stride,
|
||||
)
|
||||
|
||||
def patch_attention(self) -> None:
|
||||
if hasattr(self.model_config, "model_type"):
|
||||
@@ -947,8 +959,6 @@ class ModelLoader:
|
||||
quantization_config = (
|
||||
quantization_config or self.model_kwargs["quantization_config"]
|
||||
)
|
||||
if self.cfg.is_multimodal:
|
||||
self.model_config.text_config = self.text_model_config
|
||||
self.model = load_sharded_model_quant(
|
||||
self.base_model,
|
||||
self.model_config,
|
||||
@@ -969,9 +979,6 @@ class ModelLoader:
|
||||
|
||||
_ = _configure_zero3_memory_efficient_loading()
|
||||
|
||||
if self.cfg.is_multimodal:
|
||||
self.model_config.text_config = self.text_model_config
|
||||
|
||||
# Load model with random initialization if specified
|
||||
if self.cfg.random_init_weights:
|
||||
# AutoModel classes support the from_config method
|
||||
@@ -1026,8 +1033,6 @@ class ModelLoader:
|
||||
and self.model_type != "AutoModelForCausalLM"
|
||||
and not self.cfg.trust_remote_code
|
||||
):
|
||||
if self.cfg.is_multimodal:
|
||||
self.model_config.text_config = self.text_model_config
|
||||
if self.cfg.gptq:
|
||||
self.model = self.auto_model_loader.from_pretrained(
|
||||
self.base_model,
|
||||
@@ -1043,25 +1048,7 @@ class ModelLoader:
|
||||
**self.model_kwargs,
|
||||
)
|
||||
else:
|
||||
# Shouldn't be a problem most of the time. will obviously error if the model doesn't support this
|
||||
# when training starts
|
||||
if (
|
||||
hasattr(self.text_model_config, "max_seq_len")
|
||||
and self.text_model_config.max_seq_len
|
||||
and self.cfg.sequence_len > self.text_model_config.max_seq_len
|
||||
):
|
||||
self.text_model_config.max_seq_len = self.cfg.sequence_len
|
||||
LOG.warning(f"increasing context length to {self.cfg.sequence_len}")
|
||||
elif (
|
||||
hasattr(self.text_model_config, "max_sequence_length")
|
||||
and self.text_model_config.max_sequence_length
|
||||
and self.cfg.sequence_len > self.text_model_config.max_sequence_length
|
||||
):
|
||||
self.text_model_config.max_sequence_length = self.cfg.sequence_len
|
||||
LOG.warning(f"increasing context length to {self.cfg.sequence_len}")
|
||||
if self.cfg.gptq:
|
||||
if self.cfg.is_multimodal:
|
||||
self.model_config.text_config = self.text_model_config
|
||||
self.model = self.auto_model_loader.from_pretrained(
|
||||
self.base_model,
|
||||
config=self.model_config,
|
||||
@@ -1080,8 +1067,6 @@ class ModelLoader:
|
||||
|
||||
_ = _configure_zero3_memory_efficient_loading()
|
||||
|
||||
if self.cfg.is_multimodal:
|
||||
self.model_config.text_config = self.text_model_config
|
||||
self.model = self.auto_model_loader.from_pretrained(
|
||||
self.base_model,
|
||||
config=self.model_config,
|
||||
@@ -1346,8 +1331,6 @@ class ModelLoader:
|
||||
requires_grad.append(f"{name}: {param.requires_grad}")
|
||||
if len(requires_grad) == 0:
|
||||
LOG.warning("there are no parameters that require gradient updates")
|
||||
if hasattr(self.model, "config"):
|
||||
self.model.config.use_cache = False
|
||||
|
||||
if self.cfg.flash_optimum:
|
||||
from optimum.bettertransformer import BetterTransformer
|
||||
|
||||
@@ -8,11 +8,13 @@ from typing import Any, Iterable, List, Union
|
||||
|
||||
import numba
|
||||
import numpy as np
|
||||
from torch.utils.data import BatchSampler, Sampler
|
||||
from torch.utils.data import BatchSampler, Sampler, SequentialSampler
|
||||
|
||||
from axolotl.utils.distributed import reduce_and_broadcast
|
||||
|
||||
LOG = logging.getLogger("axolotl.utils.samplers.multipack")
|
||||
LOG = logging.getLogger(__name__)
|
||||
|
||||
LOG.setLevel(logging.INFO)
|
||||
|
||||
|
||||
@numba.njit
|
||||
@@ -103,6 +105,55 @@ def allocate(
    return result, s, len(result) * c * n


@numba.njit
def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
    """
    Sequential allocator that preserves example order

    Parameters:
    - lengths: The lengths of all examples
    - rank: The current rank (for distributed training)
    - c: The capacity of each bin (maximum sequence length)
    - n: Number of ranks

    Returns:
    - result: List of batches for the current rank
    - total_used: Number of actual example tokens
    - total_slots: Maximum theoretical number of example tokens (number of bins * bin capacity)
    """
    result = []
    total_used = 0

    # First, do sequential packing into bins
    all_bins = []
    current_bin = [0 for i in range(0)]  # numba hint
    remaining_capacity = c

    for idx, size in enumerate(lengths):
        if size <= remaining_capacity:
            # Example fits in current bin
            current_bin.append(idx)
            remaining_capacity -= size
            total_used += size
        else:
            # Example doesn't fit, start a new bin
            if current_bin:  # Add non-empty bin to all_bins
                all_bins.append(current_bin)
            current_bin = [idx]
            remaining_capacity = c - size
            total_used += size

    # Add the last bin if not empty
    if current_bin:
        all_bins.append(current_bin)

    # Assign bins to ranks - each rank gets every n-th bin
    for bin_idx in range(rank, len(all_bins), n):
        result.append(all_bins[bin_idx])

    return result, total_used, len(all_bins) * c


class MultipackBatchSampler(BatchSampler):
    """Batch sampler class for multipack"""

@@ -115,6 +166,7 @@ class MultipackBatchSampler(BatchSampler):
|
||||
packing_efficiency_estimate: float = 1.0,
|
||||
drop_last: bool = False,
|
||||
num_count_samples: int = 16,
|
||||
sequential: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
super().__init__(sampler, batch_size, drop_last)
|
||||
@@ -122,6 +174,7 @@ class MultipackBatchSampler(BatchSampler):
|
||||
self.batch_max_len = batch_max_len
|
||||
self.lengths: np.ndarray = lengths
|
||||
self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
|
||||
self.sequential = sequential
|
||||
|
||||
assert isinstance(self.lengths, np.ndarray)
|
||||
|
||||
@@ -136,6 +189,11 @@ class MultipackBatchSampler(BatchSampler):
|
||||
# the minimum packed dataset length across all ranks determined by a gather/broadcast
|
||||
self.len_across_ranks = None
|
||||
|
||||
if self.sequential and not isinstance(sampler, SequentialSampler):
|
||||
LOG.warn(
|
||||
"using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
|
||||
)
|
||||
|
||||
def set_epoch(self, epoch: int):
|
||||
self.epoch = epoch
|
||||
|
||||
@@ -145,13 +203,21 @@ class MultipackBatchSampler(BatchSampler):
|
||||
lengths = self.lengths[indices]
|
||||
lengths_cumsum = np.cumsum(lengths)
|
||||
|
||||
batches, total_used, total_slots = allocate(
|
||||
lengths=lengths,
|
||||
lengths_cumsum=lengths_cumsum,
|
||||
rank=0,
|
||||
c=self.batch_max_len,
|
||||
n=1,
|
||||
)
|
||||
if self.sequential:
|
||||
batches, total_used, total_slots = allocate_sequentially(
|
||||
lengths=lengths,
|
||||
rank=0,
|
||||
c=self.batch_max_len,
|
||||
n=1,
|
||||
)
|
||||
else:
|
||||
batches, total_used, total_slots = allocate(
|
||||
lengths=lengths,
|
||||
lengths_cumsum=lengths_cumsum,
|
||||
rank=0,
|
||||
c=self.batch_max_len,
|
||||
n=1,
|
||||
)
|
||||
|
||||
batches = [
|
||||
[
|
||||
|
||||
@@ -46,6 +46,7 @@ from axolotl.utils.schemas.multimodal import MultiModalConfig
|
||||
from axolotl.utils.schemas.peft import LoraConfig, ReLoRAConfig
|
||||
from axolotl.utils.schemas.training import HyperparametersConfig
|
||||
from axolotl.utils.schemas.trl import TRLConfig
|
||||
from axolotl.utils.schemas.vllm import VllmConfig
|
||||
|
||||
LOG = logging.getLogger(__name__)
|
||||
|
||||
@@ -86,6 +87,9 @@ class AxolotlInputConfig(
|
||||
trl: TRLConfig | None = Field(
|
||||
default_factory=lambda: TRLConfig(), # pylint: disable=unnecessary-lambda
|
||||
)
|
||||
vllm: VllmConfig | None = Field(
|
||||
default_factory=lambda: VllmConfig(), # pylint: disable=unnecessary-lambda
|
||||
)
|
||||
reward_model: bool | None = None
|
||||
process_reward_model: bool | None = None
|
||||
num_labels: int | None = None
|
||||
@@ -188,6 +192,7 @@ class AxolotlInputConfig(
|
||||
sample_packing: bool | None = None
|
||||
sample_packing_group_size: int | None = 100_000
|
||||
sample_packing_bin_size: int | None = 200
|
||||
sample_packing_sequentially: bool | None = None
|
||||
eval_sample_packing: bool | None = None
|
||||
pad_to_sequence_len: bool | None = None
|
||||
curriculum_sampling: bool | None = None
|
||||
@@ -248,6 +253,7 @@ class AxolotlInputConfig(
|
||||
val_set_size: float | None = Field(default=0.0)
|
||||
|
||||
sequence_parallel_degree: int | None = None
|
||||
heads_k_stride: int | None = None
|
||||
|
||||
special_tokens: SpecialTokensConfig | None = None
|
||||
tokens: list[str] | None = None
|
||||
@@ -1108,7 +1114,7 @@ class AxolotlInputConfig(
|
||||
|
||||
@field_validator("sequence_parallel_degree", mode="before")
|
||||
@classmethod
|
||||
def check_sequence_parallel_config(cls, value, info):
|
||||
def check_sequence_parallel_degree(cls, value, info):
|
||||
if not value:
|
||||
value = 1
|
||||
|
||||
@@ -1129,6 +1135,17 @@

        return value

    @model_validator(mode="before")
    @classmethod
    def check_muon_deepspeed_fsdp(cls, data):
        if data.get("optimizer") == "muon" and (
            data.get("deepspeed") or data.get("fsdp") or data.get("fsdp_config")
        ):
            raise ValueError(
                "Muon optimizer is currently incompatible with DeepSpeed and FSDP"
            )
        return data


class AxolotlConfigWCapabilities(AxolotlInputConfig):
|
||||
"""wrapper to valdiate gpu capabilities with the configured options"""
|
||||
@@ -1207,17 +1224,12 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
|
||||
):
|
||||
capabilities = data.get("capabilities")
|
||||
is_fsdp = data.get("fsdp") is not None
|
||||
is_deepspeed = data.get("deepspeed") is not None
|
||||
|
||||
if capabilities and capabilities.get("n_gpu", 0) > 1:
|
||||
if is_fsdp:
|
||||
raise ValueError(
|
||||
"lora_mlp_kernel, lora_qkv_kernel, and lora_o_kernel are not compatible with FSDP."
|
||||
)
|
||||
if is_deepspeed:
|
||||
raise ValueError(
|
||||
"lora_mlp_kernel, lora_qkv_kernel, and lora_o_kernel are not compatible with DeepSpeed."
|
||||
)
|
||||
return data
|
||||
|
||||
@model_validator(mode="before")
|
||||
@@ -1264,3 +1276,14 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            if data["beta"] != data["trl"]["beta"]:
                raise ValueError("beta and trl.beta must match or one must be removed")
        return data

    @model_validator(mode="after")
    def check_min_torch_version(self):
        if self.env_capabilities and self.env_capabilities.torch_version:
            torch_version = self.env_capabilities.torch_version
            if version.parse(torch_version) < version.parse("2.5.1"):
                LOG.warning(
                    f"torch=={torch_version} may not be supported in future versions. Please consider upgrading to torch>=2.5.1."
                )

        return self

@@ -20,27 +20,30 @@ class TRLConfig(BaseModel):
|
||||
)
|
||||
|
||||
# GRPO specific args
|
||||
# Ref: https://github.com/huggingface/trl/blob/e3244d2d096ff1e2e248c931d06d39e165e20623/trl/trainer/grpo_config.py#L22
|
||||
use_vllm: bool | None = Field(
|
||||
# Ref: https://github.com/huggingface/trl/blob/26d86757a7c7e24e397ea44f57ecce6031dfac01/trl/trainer/grpo_config.py#L23
|
||||
use_vllm: bool = Field(
|
||||
default=False,
|
||||
json_schema_extra={"description": "Whether to use VLLM for RL training"},
|
||||
)
|
||||
vllm_device: str | None = Field(
|
||||
default="auto",
|
||||
json_schema_extra={"description": "Device to use for VLLM"},
|
||||
vllm_server_host: str | None = Field(
|
||||
default="0.0.0.0", # nosec B104
|
||||
json_schema_extra={"description": "Host of the vLLM server to connect to"},
|
||||
)
|
||||
vllm_gpu_memory_utilization: float | None = Field(
|
||||
default=0.9,
|
||||
json_schema_extra={"description": "GPU memory utilization for VLLM"},
|
||||
vllm_server_port: int | None = Field(
|
||||
default=8000,
|
||||
json_schema_extra={"description": "Port of the vLLM server to connect to"},
|
||||
)
|
||||
vllm_dtype: str | None = Field(
|
||||
default="auto",
|
||||
json_schema_extra={"description": "Data type for VLLM"},
|
||||
)
|
||||
vllm_max_model_len: int | None = Field(
|
||||
vllm_server_timeout: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Maximum length of the model context for VLLM"
|
||||
"description": "Total timeout duration in seconds to wait for the vLLM server to be up. If the server is not up "
|
||||
"after the timeout, a `ConnectionError` is raised."
|
||||
},
|
||||
)
|
||||
vllm_guided_decoding_regex: str | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Regex for vLLM guided decoding. If `None` (default), guided decoding is disabled."
|
||||
},
|
||||
)
|
||||
|
||||
@@ -85,3 +88,48 @@ class TRLConfig(BaseModel):
|
||||
"description": "Sync steps for the reference model. Requires `sync_ref_model=True`."
|
||||
},
|
||||
)
|
||||
scale_rewards: bool = Field(
|
||||
default=True,
|
||||
json_schema_extra={
|
||||
"description": "Whether to scale the rewards for GRPO by dividing them by their standard deviation."
|
||||
},
|
||||
)
|
||||
|
||||
temperature: float | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Sampling temperature for the GRPO policy."},
|
||||
)
|
||||
top_p: float | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Top-p sampling probability for the generation policy."
|
||||
},
|
||||
)
|
||||
top_k: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Top-k sampling for the generation policy."},
|
||||
)
|
||||
min_p: float | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Minimum probability for the generation policy."
|
||||
},
|
||||
)
|
||||
repetition_penalty: float | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far."
|
||||
},
|
||||
)
|
||||
num_iterations: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Number of iterations per batch (denoted as μ in the algorithm) for GRPO."
|
||||
},
|
||||
)
|
||||
epsilon: float | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Epsilon value for clipping in the GRPO algorithm."
|
||||
},
|
||||
)
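A hedged sketch of the server-mode settings introduced above; host and port mirror the schema defaults, the timeout value is only an example:

# Example trl section for GRPO with an external vLLM server (not a full config).
trl_section = {
    "use_vllm": True,               # plain bool now, defaults to False
    "vllm_server_host": "0.0.0.0",  # schema default
    "vllm_server_port": 8000,       # schema default
    "vllm_server_timeout": 300,     # example value; ConnectionError raised if exceeded
}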
|
||||
|
||||
38
src/axolotl/utils/schemas/vllm.py
Normal file
@@ -0,0 +1,38 @@
|
||||
"""
|
||||
Pydantic models for VLLM configuration, used primarily for RL training with TRL + GRPO
|
||||
"""
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
|
||||
class VllmConfig(BaseModel):
|
||||
"""
|
||||
Configuration for VLLM server
|
||||
"""
|
||||
|
||||
device: str | None = Field(
|
||||
default="auto",
|
||||
json_schema_extra={"description": "Device to use for VLLM"},
|
||||
)
|
||||
tensor_parallel_size: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Tensor parallel size for VLLM"},
|
||||
)
|
||||
gpu_memory_utilization: float | None = Field(
|
||||
default=0.9,
|
||||
json_schema_extra={"description": "GPU memory utilization for VLLM"},
|
||||
)
|
||||
dtype: str | None = Field(
|
||||
default="auto",
|
||||
json_schema_extra={"description": "Data type for VLLM"},
|
||||
)
|
||||
max_model_len: int | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={
|
||||
"description": "Maximum length of the model context for VLLM"
|
||||
},
|
||||
)
|
||||
enable_prefix_caching: bool | None = Field(
|
||||
default=None,
|
||||
json_schema_extra={"description": "Enable prefix caching for VLLM"},
|
||||
)
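A minimal sketch of instantiating the new schema directly (assumes axolotl is installed with this change applied):

# Standalone usage of the VllmConfig model defined above.
from axolotl.utils.schemas.vllm import VllmConfig

vllm_cfg = VllmConfig(max_model_len=800, enable_prefix_caching=True)
print(vllm_cfg.gpu_memory_utilization)        # 0.9, the declared default
print(vllm_cfg.model_dump(exclude_none=True))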
|
||||
@@ -13,7 +13,7 @@ import torch
|
||||
import torch.cuda
|
||||
from accelerate.logging import get_logger
|
||||
from datasets import IterableDataset, disable_caching, enable_caching
|
||||
from torch.utils.data import DataLoader, RandomSampler
|
||||
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
|
||||
from transformers.utils import is_torch_bf16_gpu_available
|
||||
|
||||
from axolotl.core.trainer_builder import HFCausalTrainerBuilder, HFRLTrainerBuilder
|
||||
@@ -235,7 +235,7 @@ def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2):
|
||||
|
||||
|
||||
def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
|
||||
if cfg.model_config_type == "mamba":
|
||||
if cfg.model_config_type in ["mamba", "gemma3"]:
|
||||
LOG.info("dropping attention_mask column")
|
||||
train_dataset = train_dataset.remove_columns("attention_mask")
|
||||
if eval_dataset:
|
||||
@@ -456,13 +456,18 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
|
||||
else:
|
||||
sampler_batch_size = cfg.micro_batch_size
|
||||
batch_max_len = cfg.sequence_len
|
||||
if cfg.curriculum_sampling:
|
||||
sampler = SequentialSampler(train_dataset)
|
||||
else:
|
||||
sampler = RandomSampler(train_dataset)
|
||||
sampler = MultipackBatchSampler(
|
||||
sampler=RandomSampler(train_dataset),
|
||||
sampler=sampler,
|
||||
lengths=get_dataset_lengths(train_dataset),
|
||||
batch_size=sampler_batch_size,
|
||||
batch_max_len=batch_max_len,
|
||||
group_size=cfg.sample_packing_group_size,
|
||||
bin_size=cfg.sample_packing_bin_size,
|
||||
sequential=cfg.sample_packing_sequentially,
|
||||
drop_last=True,
|
||||
)
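For reference, a tiny standalone sketch of the two samplers being switched above; curriculum_sampling relies on SequentialSampler preserving dataset order:

# SequentialSampler yields indices in dataset order, RandomSampler shuffles them;
# MultipackBatchSampler packs whichever sampler it is given.
from torch.utils.data import RandomSampler, SequentialSampler

data = list(range(5))
print(list(SequentialSampler(data)))  # [0, 1, 2, 3, 4] -- dataset order
print(list(RandomSampler(data)))      # a random permutation, e.g. [3, 0, 4, 1, 2]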
|
||||
|
||||
|
||||
0
tests/__init__.py
Normal file
@@ -8,10 +8,16 @@ import shutil
|
||||
import sys
|
||||
import tempfile
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import datasets
|
||||
import pytest
|
||||
import requests
|
||||
from huggingface_hub import snapshot_download
|
||||
from tokenizers import AddedToken
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
from tests.hf_offline_utils import disable_hf_offline, enable_hf_offline
|
||||
|
||||
|
||||
def retry_on_request_exceptions(max_retries=3, delay=1):
|
||||
@@ -25,9 +31,11 @@ def retry_on_request_exceptions(max_retries=3, delay=1):
|
||||
except (
|
||||
requests.exceptions.ReadTimeout,
|
||||
requests.exceptions.ConnectionError,
|
||||
requests.exceptions.HTTPError,
|
||||
) as exc:
|
||||
if attempt < max_retries - 1:
|
||||
time.sleep(delay)
|
||||
wait = 2**attempt * delay # in seconds
|
||||
time.sleep(wait)
|
||||
else:
|
||||
raise exc
|
||||
|
||||
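The retry helper above now backs off exponentially instead of sleeping a fixed delay; a minimal standalone sketch of the same pattern:

import time

def call_with_backoff(fn, max_retries: int = 3, delay: float = 1.0):
    """Retry fn, sleeping delay, 2*delay, 4*delay, ... between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # the code above only retries requests exceptions
            if attempt == max_retries - 1:
                raise
            time.sleep(2**attempt * delay)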
@@ -37,26 +45,35 @@ def retry_on_request_exceptions(max_retries=3, delay=1):
|
||||
|
||||
|
||||
@retry_on_request_exceptions(max_retries=3, delay=5)
|
||||
@disable_hf_offline
|
||||
def snapshot_download_w_retry(*args, **kwargs):
|
||||
return snapshot_download(*args, **kwargs)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_ds_fixture_bundle():
|
||||
ds_dir = snapshot_download_w_retry(
|
||||
"axolotl-ai-internal/axolotl-oss-dataset-fixtures", repo_type="dataset"
|
||||
)
|
||||
return Path(ds_dir)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_smollm2_135m_model():
|
||||
# download the model
|
||||
snapshot_download_w_retry("HuggingFaceTB/SmolLM2-135M")
|
||||
snapshot_download_w_retry("HuggingFaceTB/SmolLM2-135M", repo_type="model")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_llama_68m_random_model():
|
||||
# download the model
|
||||
snapshot_download_w_retry("JackFram/llama-68m")
|
||||
snapshot_download_w_retry("JackFram/llama-68m", repo_type="model")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_qwen_2_5_half_billion_model():
|
||||
# download the model
|
||||
snapshot_download_w_retry("Qwen/Qwen2.5-0.5B")
|
||||
snapshot_download_w_retry("Qwen/Qwen2.5-0.5B", repo_type="model")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
@@ -94,13 +111,52 @@ def download_argilla_distilabel_capybara_dpo_7k_binarized_dataset():
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_argilla_ultrafeedback_binarized_preferences_cleaned_dataset():
|
||||
def download_argilla_distilabel_intel_orca_dpo_dataset():
|
||||
# download the dataset
|
||||
snapshot_download_w_retry(
|
||||
"argilla/ultrafeedback-binarized-preferences-cleaned", repo_type="dataset"
|
||||
"argilla/distilabel-intel-orca-dpo-pairs", repo_type="dataset"
|
||||
)
|
||||
|
||||
|
||||
# @pytest.fixture(scope="session", autouse=True)
|
||||
# def download_argilla_ultrafeedback_binarized_preferences_cleaned_dataset():
|
||||
# # download the dataset
|
||||
# snapshot_download_w_retry(
|
||||
# "argilla/ultrafeedback-binarized-preferences-cleaned", repo_type="dataset"
|
||||
# )
|
||||
|
||||
|
||||
# @pytest.fixture(scope="session", autouse=True)
|
||||
# def download_fozzie_alpaca_dpo_dataset():
|
||||
# # download the dataset
|
||||
# snapshot_download_w_retry(
|
||||
# "fozziethebeat/alpaca_messages_2k_dpo_test", repo_type="dataset"
|
||||
# )
|
||||
# snapshot_download_w_retry(
|
||||
# "fozziethebeat/alpaca_messages_2k_dpo_test",
|
||||
# repo_type="dataset",
|
||||
# revision="ea82cff",
|
||||
# )
|
||||
|
||||
|
||||
# @pytest.fixture(scope="session")
|
||||
# @disable_hf_offline
|
||||
# def dataset_fozzie_alpaca_dpo_dataset(
|
||||
# download_fozzie_alpaca_dpo_dataset,
|
||||
# ): # pylint: disable=unused-argument,redefined-outer-name
|
||||
# return load_dataset("fozziethebeat/alpaca_messages_2k_dpo_test", split="train")
|
||||
#
|
||||
#
|
||||
# @pytest.fixture(scope="session")
|
||||
# @disable_hf_offline
|
||||
# def dataset_fozzie_alpaca_dpo_dataset_rev_ea82cff(
|
||||
# download_fozzie_alpaca_dpo_dataset,
|
||||
# ): # pylint: disable=unused-argument,redefined-outer-name
|
||||
# return load_dataset(
|
||||
# "fozziethebeat/alpaca_messages_2k_dpo_test", split="train", revision="ea82cff"
|
||||
# )
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_arcee_ai_distilabel_intel_orca_dpo_pairs_dataset():
|
||||
# download the dataset
|
||||
@@ -109,10 +165,192 @@ def download_arcee_ai_distilabel_intel_orca_dpo_pairs_dataset():
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_argilla_dpo_pairs_dataset():
|
||||
# download the dataset
|
||||
snapshot_download_w_retry(
|
||||
"argilla/distilabel-intel-orca-dpo-pairs", repo_type="dataset"
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_tiny_shakespeare_dataset():
|
||||
# download the dataset
|
||||
snapshot_download_w_retry("Trelis/tiny-shakespeare", repo_type="dataset")
|
||||
snapshot_download_w_retry("winglian/tiny-shakespeare", repo_type="dataset")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_deepseek_model_fixture():
|
||||
snapshot_download_w_retry("axolotl-ai-co/DeepSeek-V3-11M", repo_type="model")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_huggyllama_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"huggyllama/llama-7b",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_llama_1b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"NousResearch/Llama-3.2-1B",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_llama3_8b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"NousResearch/Meta-Llama-3-8B", repo_type="model", allow_patterns=["*token*"]
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_llama3_8b_instruct_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_phi_35_mini_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"microsoft/Phi-3.5-mini-instruct", repo_type="model", allow_patterns=["*token*"]
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_phi_3_medium_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"microsoft/Phi-3-medium-128k-instruct",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_mistral_7b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"casperhansen/mistral-7b-instruct-v0.1-awq",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_gemma_2b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"unsloth/gemma-2b-it",
|
||||
revision="703fb4a",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_gemma2_9b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"mlx-community/gemma-2-9b-it-4bit",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_mlx_mistral_7b_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"mlx-community/Mistral-7B-Instruct-v0.3-4bit",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def download_llama2_model_fixture():
|
||||
# download the tokenizer only
|
||||
snapshot_download_w_retry(
|
||||
"NousResearch/Llama-2-7b-hf",
|
||||
repo_type="model",
|
||||
allow_patterns=["*token*", "config.json"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@enable_hf_offline
|
||||
def tokenizer_huggyllama(
|
||||
download_huggyllama_model_fixture,
|
||||
): # pylint: disable=unused-argument,redefined-outer-name
|
||||
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
|
||||
tokenizer.pad_token = "</s>"
|
||||
|
||||
return tokenizer
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@enable_hf_offline
|
||||
def tokenizer_huggyllama_w_special_tokens(
|
||||
tokenizer_huggyllama,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
tokenizer_huggyllama.add_special_tokens(
|
||||
{
|
||||
"bos_token": "<s>",
|
||||
"eos_token": "</s>",
|
||||
"unk_token": "<unk>",
|
||||
}
|
||||
)
|
||||
|
||||
return tokenizer_huggyllama
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@enable_hf_offline
|
||||
def tokenizer_llama2_7b(
|
||||
download_llama2_model_fixture,
|
||||
): # pylint: disable=unused-argument,redefined-outer-name
|
||||
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
|
||||
|
||||
return tokenizer
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@enable_hf_offline
|
||||
def tokenizer_mistral_7b_instruct(
|
||||
download_mlx_mistral_7b_model_fixture,
|
||||
): # pylint: disable=unused-argument,redefined-outer-name
|
||||
return AutoTokenizer.from_pretrained("casperhansen/mistral-7b-instruct-v0.1-awq")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def tokenizer_mistral_7b_instruct_chatml(tokenizer_mistral_7b_instruct):
|
||||
tokenizer_mistral_7b_instruct.add_special_tokens(
|
||||
{
|
||||
"eos_token": AddedToken(
|
||||
"<|im_end|>", rstrip=False, lstrip=False, normalized=False
|
||||
)
|
||||
}
|
||||
)
|
||||
tokenizer_mistral_7b_instruct.add_tokens(
|
||||
[
|
||||
AddedToken("<|im_start|>", rstrip=False, lstrip=False, normalized=False),
|
||||
]
|
||||
)
|
||||
return tokenizer_mistral_7b_instruct
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@@ -178,3 +416,88 @@ def cleanup_monkeypatches():
|
||||
module_globals = module_name_tuple[1]
|
||||
for module_global in module_globals:
|
||||
globals().pop(module_global, None)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_winglian_tiny_shakespeare(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = download_ds_fixture_bundle / "winglian__tiny-shakespeare"
|
||||
return datasets.load_from_disk(ds_path)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_tatsu_lab_alpaca(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = download_ds_fixture_bundle / "tatsu-lab__alpaca"
|
||||
return datasets.load_from_disk(ds_path)["train"]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_mhenrichsen_alpaca_2k_test(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = download_ds_fixture_bundle / "mhenrichsen__alpaca_2k_test"
|
||||
return datasets.load_from_disk(ds_path)["train"]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_argilla_ultrafeedback_binarized_preferences_cleaned(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = (
|
||||
download_ds_fixture_bundle
|
||||
/ "argilla__ultrafeedback-binarized-preferences-cleaned"
|
||||
)
|
||||
return datasets.load_from_disk(ds_path)["train"]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_fozziethebeat_alpaca_messages_2k_dpo_test(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = download_ds_fixture_bundle / "fozziethebeat__alpaca_messages_2k_dpo_test"
|
||||
return datasets.load_from_disk(ds_path)["train"]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff(
|
||||
download_ds_fixture_bundle: Path,
|
||||
): # pylint: disable=redefined-outer-name
|
||||
ds_path = (
|
||||
download_ds_fixture_bundle
|
||||
/ "fozziethebeat__alpaca_messages_2k_dpo_test__rev_ea82cff"
|
||||
)
|
||||
return datasets.load_from_disk(ds_path)["train"]
|
||||
|
||||
|
||||
# # pylint: disable=redefined-outer-name,unused-argument
|
||||
# def test_load_fixtures(
|
||||
# download_smollm2_135m_model,
|
||||
# download_llama_68m_random_model,
|
||||
# download_qwen_2_5_half_billion_model,
|
||||
# download_tatsu_lab_alpaca_dataset,
|
||||
# download_mhenrichsen_alpaca_2k_dataset,
|
||||
# download_mhenrichsen_alpaca_2k_w_revision_dataset,
|
||||
# download_mlabonne_finetome_100k_dataset,
|
||||
# download_argilla_distilabel_capybara_dpo_7k_binarized_dataset,
|
||||
# download_argilla_ultrafeedback_binarized_preferences_cleaned_dataset,
|
||||
# download_fozzie_alpaca_dpo_dataset,
|
||||
# download_arcee_ai_distilabel_intel_orca_dpo_pairs_dataset,
|
||||
# download_argilla_dpo_pairs_dataset,
|
||||
# download_tiny_shakespeare_dataset,
|
||||
# download_deepseek_model_fixture,
|
||||
# download_huggyllama_model_fixture,
|
||||
# download_llama_1b_model_fixture,
|
||||
# download_llama3_8b_model_fixture,
|
||||
# download_llama3_8b_instruct_model_fixture,
|
||||
# download_phi_35_mini_model_fixture,
|
||||
# download_phi_3_medium_model_fixture,
|
||||
# download_mistral_7b_model_fixture,
|
||||
# download_gemma_2b_model_fixture,
|
||||
# download_gemma2_9b_model_fixture,
|
||||
# download_mlx_mistral_7b_model_fixture,
|
||||
# download_llama2_model_fixture,
|
||||
# ):
|
||||
# pass
|
||||
|
||||
@@ -10,10 +10,13 @@ from transformers import AddedToken, AutoTokenizer
|
||||
from axolotl.core.chat.format.chatml import format_message
|
||||
from axolotl.core.chat.messages import ChatFormattedChats, Chats
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline # noqa
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", name="llama_tokenizer")
|
||||
@enable_hf_offline
|
||||
def llama_tokenizer_fixture():
|
||||
return AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B")
|
||||
return AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", name="chatml_tokenizer")
|
||||
|
||||
@@ -5,7 +5,6 @@ e2e tests for kd trainer support in Axolotl
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
from e2e.utils import check_tensorboard, require_torch_2_5_1
|
||||
|
||||
from axolotl.cli.args import TrainerCliArgs
|
||||
from axolotl.common.datasets import load_datasets
|
||||
@@ -13,6 +12,8 @@ from axolotl.train import train
|
||||
from axolotl.utils.config import normalize_config, prepare_plugins, validate_config
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import check_tensorboard, require_torch_2_5_1
|
||||
|
||||
|
||||
@pytest.fixture(name="kd_min_cfg")
|
||||
def min_cfg(temp_dir):
|
||||
|
||||
@@ -2,15 +2,13 @@
|
||||
Simple end-to-end test for Liger integration
|
||||
"""
|
||||
|
||||
from e2e.utils import require_torch_2_4_1
|
||||
|
||||
from axolotl.cli.args import TrainerCliArgs
|
||||
from axolotl.common.datasets import load_datasets
|
||||
from axolotl.train import train
|
||||
from axolotl.utils.config import normalize_config, prepare_plugins
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from ..utils import check_model_output_exists
|
||||
from tests.e2e.utils import check_model_output_exists, require_torch_2_4_1
|
||||
|
||||
|
||||
class LigerIntegrationTestCase:
|
||||
|
||||
0
tests/e2e/multigpu/solo/__init__.py
Normal file
294
tests/e2e/multigpu/solo/test_grpo.py
Normal file
@@ -0,0 +1,294 @@
|
||||
"""
|
||||
GRPO test suite
|
||||
"""
|
||||
|
||||
import os
|
||||
import random
|
||||
import subprocess # nosec B404
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
import requests
|
||||
import yaml
|
||||
from accelerate.test_utils import execute_subprocess_async
|
||||
from transformers.testing_utils import get_torch_dist_unique_port
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import require_vllm
|
||||
|
||||
|
||||
def start_vllm(
|
||||
model: str, env: dict | None = None, wait: int | None = None, quiet=False, **kwargs
|
||||
) -> int:
|
||||
"""
|
||||
helper function to start the VLLM server in the background, mostly for testing purposes
|
||||
"""
|
||||
cmd = [sys.executable, "-m", "trl.scripts.vllm_serve", "--model", model]
|
||||
|
||||
if tensor_parallel_size := kwargs.get("tensor_parallel_size"):
|
||||
cmd.extend(["--tensor-parallel-size", str(tensor_parallel_size)])
|
||||
if host := kwargs.get("host"):
|
||||
cmd.extend(["--host", host])
|
||||
if port := kwargs.get("port"):
|
||||
cmd.extend(["--port", str(port)])
|
||||
if gpu_memory_utilization := kwargs.get("gpu_memory_utilization"):
|
||||
cmd.extend(["--gpu-memory-utilization", str(gpu_memory_utilization)])
|
||||
if dtype := kwargs.get("dtype"):
|
||||
cmd.extend(["--dtype", dtype])
|
||||
if max_model_len := kwargs.get("max_model_len"):
|
||||
cmd.extend(["--max-model-len", str(max_model_len)])
|
||||
if kwargs.get("enable_prefix_caching"):
|
||||
cmd.extend(["--enable-prefix-caching", "True"])
|
||||
|
||||
# print out the command to be executed
|
||||
print(" ".join(cmd))
|
||||
|
||||
# start `trl vllm-serve` command in the background and capture the process id
|
||||
process = subprocess.Popen( # pylint: disable=consider-using-with
|
||||
cmd,
|
||||
env=env,
|
||||
stdout=subprocess.DEVNULL if quiet else subprocess.PIPE,
|
||||
stderr=subprocess.DEVNULL if quiet else subprocess.PIPE,
|
||||
) # nosec B603
|
||||
|
||||
# print out the process id so the user can easily kill it later
|
||||
print(f"VLLM server process started (PID: {process.pid})")
|
||||
|
||||
# wait until the http server is ready (a 404 response also counts as up), timing out after `wait` seconds
|
||||
started = False
|
||||
if wait and host and port:
|
||||
for _ in range(int(wait)):
|
||||
try:
|
||||
response = requests.get(f"http://{host}:{port}", timeout=1)
|
||||
if int(response.status_code) in [200, 404]:
|
||||
started = True
|
||||
break
|
||||
except requests.exceptions.RequestException:
|
||||
pass
|
||||
|
||||
# stop waiting if the server process has already exited
if process.poll() is not None:
|
||||
break
|
||||
|
||||
time.sleep(1)
|
||||
|
||||
if wait and not started:
|
||||
print(
|
||||
f"VLLM server process did not start within {wait} seconds. Please check your server logs."
|
||||
)
|
||||
process.kill()
|
||||
raise RuntimeError(f"VLLM server process did not start within {wait} seconds.")
|
||||
|
||||
# return the process id
|
||||
return process.pid
|
||||
|
||||
|
||||
class TestGRPO:
|
||||
"""
|
||||
Test case for GRPO training using multiple GPUs
|
||||
"""
|
||||
|
||||
def _utils_write_yaml_and_rewards(self, cfg, temp_dir, suffix=""):
|
||||
# write cfg to yaml file
|
||||
Path(temp_dir).mkdir(parents=True, exist_ok=True)
|
||||
with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
|
||||
fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
|
||||
with open(f"rewards_{suffix}.py", "w", encoding="utf-8") as fout:
|
||||
fout.write(
|
||||
"""import random
|
||||
def rand_reward_func(completions, **kwargs) -> list[float]:
|
||||
return [random.uniform(0, 1) for _ in completions]
|
||||
|
||||
def oai_gsm8k_transform(cfg, *args, **kwargs):
|
||||
def transform_fn(example, tokenizer=None):
|
||||
label = example["answer"].split("####")[-1].strip().replace(",", "")
|
||||
return {
|
||||
"prompt": [{"role": "user", "content": example["question"]},],
|
||||
"answer": label,
|
||||
}
|
||||
return transform_fn, {"remove_columns": ["question"]}
|
||||
"""
|
||||
)
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"num_gpus",
|
||||
[1, 2],
|
||||
)
|
||||
@require_vllm
|
||||
def test_llama_dora(self, temp_dir, num_gpus):
|
||||
rnd_reward_suffix = str(random.randint(1000, 9999))
|
||||
cfg = DictDefault(
|
||||
{
|
||||
"base_model": "HuggingFaceTB/SmolLM2-135M",
|
||||
"chat_template": "llama3",
|
||||
"rl": "grpo",
|
||||
"trl": {
|
||||
"beta": 0.001,
|
||||
"max_completion_length": 256,
|
||||
"use_vllm": True,
|
||||
"num_generations": 4,
|
||||
"reward_funcs": [f"rewards_{rnd_reward_suffix}.rand_reward_func"],
|
||||
},
|
||||
"vllm": {
|
||||
"max_model_len": 800,
|
||||
"enable_prefix_caching": True,
|
||||
},
|
||||
"datasets": [
|
||||
{
|
||||
"path": "openai/gsm8k",
|
||||
"name": "main",
|
||||
"type": f"rewards_{rnd_reward_suffix}.oai_gsm8k_transform",
|
||||
},
|
||||
],
|
||||
"adapter": "lora",
|
||||
"lora_r": 8,
|
||||
"lora_alpha": 16,
|
||||
"lora_dropout": 0.05,
|
||||
"lora_target_linear": True,
|
||||
"peft_use_dora": True,
|
||||
"flash_attention": True,
|
||||
"sequence_len": 1024,
|
||||
"special_tokens": {
|
||||
"pad_token": "<|endoftext|>",
|
||||
},
|
||||
"max_steps": 3,
|
||||
"num_epochs": 1,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"warmup_steps": 10,
|
||||
"val_set_size": 0.0,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.0001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
"lr_scheduler": "cosine",
|
||||
"save_safetensors": True,
|
||||
"bf16": "auto",
|
||||
"use_tensorboard": True,
|
||||
}
|
||||
)
|
||||
|
||||
self._utils_write_yaml_and_rewards(cfg, temp_dir, suffix=rnd_reward_suffix)
|
||||
|
||||
current_env = os.environ.copy()
|
||||
env = {
|
||||
"NCCL_P2P_LEVEL": "LOC",
|
||||
**current_env,
|
||||
"CUDA_VISIBLE_DEVICES": "1",
|
||||
}
|
||||
vllm_process_id = start_vllm(
|
||||
cfg.base_model,
|
||||
env=env,
|
||||
quiet=True,
|
||||
wait=120,
|
||||
gpu_memory_utilization=0.15,
|
||||
max_model_len=cfg.vllm.max_model_len,
|
||||
enable_prefix_caching=cfg.vllm.enable_prefix_caching,
|
||||
host="0.0.0.0",
|
||||
port=8000,
|
||||
)
|
||||
|
||||
try:
|
||||
execute_subprocess_async(
|
||||
[
|
||||
"axolotl",
|
||||
"train",
|
||||
str(Path(temp_dir) / "config.yaml"),
|
||||
"--num-processes",
|
||||
str(num_gpus),
|
||||
"--main-process-port",
|
||||
f"{get_torch_dist_unique_port()}",
|
||||
],
|
||||
env={"NCCL_P2P_LEVEL": "LOC", "NCCL_DEBUG": "INFO", **current_env},
|
||||
)
|
||||
finally:
|
||||
os.kill(vllm_process_id, 9)
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"num_gpus",
|
||||
[1, 2],
|
||||
)
|
||||
@require_vllm
|
||||
def test_llama_fft(self, temp_dir, num_gpus):
|
||||
rnd_reward_suffix = str(random.randint(1000, 9999))
|
||||
cfg = DictDefault(
|
||||
{
|
||||
"base_model": "HuggingFaceTB/SmolLM2-135M",
|
||||
"chat_template": "llama3",
|
||||
"rl": "grpo",
|
||||
"trl": {
|
||||
"beta": 0.001,
|
||||
"max_completion_length": 256,
|
||||
"use_vllm": True,
|
||||
"num_generations": 4,
|
||||
"reward_funcs": [f"rewards_{rnd_reward_suffix}.rand_reward_func"],
|
||||
},
|
||||
"vllm": {
|
||||
"max_model_len": 800,
|
||||
"enable_prefix_caching": True,
|
||||
},
|
||||
"datasets": [
|
||||
{
|
||||
"path": "openai/gsm8k",
|
||||
"name": "main",
|
||||
"type": f"rewards_{rnd_reward_suffix}.oai_gsm8k_transform",
|
||||
},
|
||||
],
|
||||
"flash_attention": True,
|
||||
"sequence_len": 1024,
|
||||
"special_tokens": {
|
||||
"pad_token": "<|endoftext|>",
|
||||
},
|
||||
"max_steps": 3,
|
||||
"num_epochs": 1,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"warmup_steps": 10,
|
||||
"val_set_size": 0.0,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.0001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
"lr_scheduler": "cosine",
|
||||
"save_safetensors": True,
|
||||
"bf16": "auto",
|
||||
"use_tensorboard": True,
|
||||
}
|
||||
)
|
||||
|
||||
self._utils_write_yaml_and_rewards(cfg, temp_dir, suffix=rnd_reward_suffix)
|
||||
|
||||
current_env = os.environ.copy()
|
||||
env = {
|
||||
"NCCL_P2P_LEVEL": "LOC", # nccl can be brittle, assume P2P isn't reliable
|
||||
**current_env,
|
||||
"CUDA_VISIBLE_DEVICES": "1",
|
||||
}
|
||||
vllm_process_id = start_vllm(
|
||||
cfg.base_model,
|
||||
env=env,
|
||||
quiet=True,
|
||||
wait=120,
|
||||
gpu_memory_utilization=0.15,
|
||||
max_model_len=cfg.vllm.max_model_len,
|
||||
enable_prefix_caching=cfg.vllm.enable_prefix_caching,
|
||||
host="0.0.0.0",
|
||||
port=8000,
|
||||
)
|
||||
|
||||
try:
|
||||
execute_subprocess_async(
|
||||
[
|
||||
"axolotl",
|
||||
"train",
|
||||
str(Path(temp_dir) / "config.yaml"),
|
||||
"--num-processes",
|
||||
str(num_gpus),
|
||||
"--main-process-port",
|
||||
f"{get_torch_dist_unique_port()}",
|
||||
],
|
||||
env={"NCCL_P2P_LEVEL": "LOC", "NCCL_DEBUG": "INFO", **current_env},
|
||||
)
|
||||
finally:
|
||||
os.kill(vllm_process_id, 9)
|
||||
@@ -52,9 +52,9 @@ class TestMultiGPUEval:
|
||||
},
|
||||
],
|
||||
"num_epochs": 1,
|
||||
"max_steps": 5,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 2,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_8bit",
|
||||
@@ -121,9 +121,9 @@ class TestMultiGPUEval:
|
||||
},
|
||||
],
|
||||
"num_epochs": 1,
|
||||
"max_steps": 5,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 2,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_8bit",
|
||||
|
||||
100
tests/e2e/multigpu/test_gemma3.py
Normal file
@@ -0,0 +1,100 @@
|
||||
"""
|
||||
E2E tests for multigpu lora tinyllama
|
||||
"""
|
||||
|
||||
import logging
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
import yaml
|
||||
from accelerate.test_utils import execute_subprocess_async
|
||||
from huggingface_hub import snapshot_download
|
||||
from transformers.testing_utils import get_torch_dist_unique_port
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import check_tensorboard
|
||||
|
||||
LOG = logging.getLogger("axolotl.tests.e2e.multigpu")
|
||||
os.environ["WANDB_DISABLED"] = "true"
|
||||
|
||||
AXOLOTL_ROOT = Path(__file__).parent.parent.parent.parent
|
||||
|
||||
|
||||
@pytest.fixture(scope="session", autouse=True)
|
||||
def download_model():
|
||||
# download the model
|
||||
snapshot_download("axolotl-mirrors/gemma-3-4b-pt", repo_type="model")
|
||||
|
||||
|
||||
class TestMultiGPUGemma3:
|
||||
"""
|
||||
Test case for Gemma3 models using LoRA
|
||||
"""
|
||||
|
||||
def test_lora_ddp_packed(self, temp_dir):
|
||||
# pylint: disable=duplicate-code
|
||||
cfg = DictDefault(
|
||||
{
|
||||
"base_model": "axolotl-mirrors/gemma-3-4b-pt",
|
||||
"sequence_len": 2048,
|
||||
"ddp_find_unused_parameters": True,
|
||||
"sample_packing": True,
|
||||
"eval_sample_packing": False,
|
||||
"pad_to_sequence_len": True,
|
||||
"adapter": "lora",
|
||||
"lora_r": 8,
|
||||
"lora_alpha": 16,
|
||||
"lora_dropout": 0.05,
|
||||
"lora_target_linear": True,
|
||||
"val_set_size": 0.0,
|
||||
"chat_template": "gemma3",
|
||||
"datasets": [
|
||||
{
|
||||
"path": "mlabonne/FineTome-100k",
|
||||
"type": "chat_template",
|
||||
"split": "train[:10%]",
|
||||
"field_messages": "conversations",
|
||||
"message_field_role": "from",
|
||||
"message_field_content": "value",
|
||||
},
|
||||
],
|
||||
"num_epochs": 1,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_checkpointing": True,
|
||||
"gradient_checkpointing_kwargs": {
|
||||
"use_reentrant": False,
|
||||
},
|
||||
"gradient_accumulation_steps": 2,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.0001,
|
||||
"optimizer": "adamw_8bit",
|
||||
"lr_scheduler": "cosine",
|
||||
"flash_attention": True,
|
||||
"use_tensorboard": True,
|
||||
"bf16": True,
|
||||
}
|
||||
)
|
||||
|
||||
# write cfg to yaml file
|
||||
Path(temp_dir).mkdir(parents=True, exist_ok=True)
|
||||
with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
|
||||
fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
|
||||
|
||||
execute_subprocess_async(
|
||||
[
|
||||
"axolotl",
|
||||
"train",
|
||||
str(Path(temp_dir) / "config.yaml"),
|
||||
"--num-processes",
|
||||
"2",
|
||||
"--main-process-port",
|
||||
f"{get_torch_dist_unique_port()}",
|
||||
]
|
||||
)
|
||||
|
||||
check_tensorboard(
|
||||
temp_dir + "/runs", "train/train_loss", 1.8, "Train Loss is too high"
|
||||
)
|
||||
@@ -1,174 +0,0 @@
|
||||
"""
|
||||
GRPO test suite
|
||||
"""
|
||||
|
||||
import random
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
import yaml
|
||||
from accelerate.test_utils import execute_subprocess_async
|
||||
from e2e.utils import require_vllm
|
||||
from transformers.testing_utils import get_torch_dist_unique_port
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
|
||||
class TestGRPO:
|
||||
"""
|
||||
Test case for GRPO training using multiple GPUs
|
||||
"""
|
||||
|
||||
def _utils_write_yaml_and_rewards(self, cfg, temp_dir, suffix=""):
|
||||
# write cfg to yaml file
|
||||
Path(temp_dir).mkdir(parents=True, exist_ok=True)
|
||||
with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
|
||||
fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
|
||||
with open(f"rewards_{suffix}.py", "w", encoding="utf-8") as fout:
|
||||
fout.write(
|
||||
"""import random
|
||||
def rand_reward_func(completions, **kwargs) -> list[float]:
|
||||
return [random.uniform(0, 1) for _ in completions]
|
||||
|
||||
def oai_gsm8k_transform(cfg, *args, **kwargs):
|
||||
def transform_fn(example, tokenizer=None):
|
||||
label = example["answer"].split("####")[-1].strip().replace(",", "")
|
||||
return {
|
||||
"prompt": [{"role": "user", "content": example["question"]},],
|
||||
"answer": label,
|
||||
}
|
||||
return transform_fn, {"remove_columns": ["question"]}
|
||||
"""
|
||||
)
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"num_gpus",
|
||||
[1, 2],
|
||||
)
|
||||
@require_vllm
|
||||
def test_llama_dora(self, temp_dir, num_gpus):
|
||||
rnd_reward_suffix = str(random.randint(1000, 9999))
|
||||
cfg = DictDefault(
|
||||
{
|
||||
"base_model": "HuggingFaceTB/SmolLM2-135M",
|
||||
"chat_template": "llama3",
|
||||
"rl": "grpo",
|
||||
"trl": {
|
||||
"beta": 0.001,
|
||||
"max_completion_length": 256,
|
||||
"use_vllm": True,
|
||||
"vllm_device": "auto" if num_gpus == 1 else "cuda:1",
|
||||
"vllm_gpu_memory_utilization": 0.15,
|
||||
"num_generations": 4,
|
||||
"reward_funcs": [f"rewards_{rnd_reward_suffix}.rand_reward_func"],
|
||||
},
|
||||
"datasets": [
|
||||
{
|
||||
"path": "openai/gsm8k",
|
||||
"name": "main",
|
||||
"type": f"rewards_{rnd_reward_suffix}.oai_gsm8k_transform",
|
||||
},
|
||||
],
|
||||
"adapter": "lora",
|
||||
"lora_r": 8,
|
||||
"lora_alpha": 16,
|
||||
"lora_dropout": 0.05,
|
||||
"lora_target_linear": True,
|
||||
"peft_use_dora": True,
|
||||
"flash_attention": True,
|
||||
"sequence_len": 1024,
|
||||
"special_tokens": {
|
||||
"pad_token": "<|endoftext|>",
|
||||
},
|
||||
"max_steps": 5,
|
||||
"num_epochs": 1,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"warmup_steps": 10,
|
||||
"val_set_size": 0.0,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.0001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
"lr_scheduler": "cosine",
|
||||
"save_safetensors": True,
|
||||
"bf16": "auto",
|
||||
"use_tensorboard": True,
|
||||
}
|
||||
)
|
||||
|
||||
self._utils_write_yaml_and_rewards(cfg, temp_dir, suffix=rnd_reward_suffix)
|
||||
|
||||
execute_subprocess_async(
|
||||
[
|
||||
"axolotl",
|
||||
"train",
|
||||
str(Path(temp_dir) / "config.yaml"),
|
||||
"--num-processes",
|
||||
str(num_gpus),
|
||||
"--main-process-port",
|
||||
f"{get_torch_dist_unique_port()}",
|
||||
]
|
||||
)
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"num_gpus",
|
||||
[1, 2],
|
||||
)
|
||||
@require_vllm
|
||||
def test_llama_fft(self, temp_dir, num_gpus):
|
||||
rnd_reward_suffix = str(random.randint(1000, 9999))
|
||||
cfg = DictDefault(
|
||||
{
|
||||
"base_model": "HuggingFaceTB/SmolLM2-135M",
|
||||
"chat_template": "llama3",
|
||||
"rl": "grpo",
|
||||
"trl": {
|
||||
"beta": 0.001,
|
||||
"max_completion_length": 256,
|
||||
"use_vllm": True,
|
||||
"vllm_device": "auto" if num_gpus == 1 else "cuda:1",
|
||||
"vllm_gpu_memory_utilization": 0.15,
|
||||
"num_generations": 4,
|
||||
"reward_funcs": [f"rewards_{rnd_reward_suffix}.rand_reward_func"],
|
||||
},
|
||||
"datasets": [
|
||||
{
|
||||
"path": "openai/gsm8k",
|
||||
"name": "main",
|
||||
"type": f"rewards_{rnd_reward_suffix}.oai_gsm8k_transform",
|
||||
},
|
||||
],
|
||||
"flash_attention": True,
|
||||
"sequence_len": 1024,
|
||||
"special_tokens": {
|
||||
"pad_token": "<|endoftext|>",
|
||||
},
|
||||
"max_steps": 5,
|
||||
"num_epochs": 1,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"warmup_steps": 10,
|
||||
"val_set_size": 0.0,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.0001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
"lr_scheduler": "cosine",
|
||||
"save_safetensors": True,
|
||||
"bf16": "auto",
|
||||
"use_tensorboard": True,
|
||||
}
|
||||
)
|
||||
|
||||
self._utils_write_yaml_and_rewards(cfg, temp_dir, suffix=rnd_reward_suffix)
|
||||
|
||||
execute_subprocess_async(
|
||||
[
|
||||
"axolotl",
|
||||
"train",
|
||||
str(Path(temp_dir) / "config.yaml"),
|
||||
"--num-processes",
|
||||
str(num_gpus),
|
||||
"--main-process-port",
|
||||
f"{get_torch_dist_unique_port()}",
|
||||
]
|
||||
)
|
||||
@@ -9,12 +9,13 @@ from pathlib import Path
|
||||
import pytest
|
||||
import yaml
|
||||
from accelerate.test_utils import execute_subprocess_async
|
||||
from e2e.utils import check_tensorboard
|
||||
from huggingface_hub import snapshot_download
|
||||
from transformers.testing_utils import get_torch_dist_unique_port
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import check_tensorboard
|
||||
|
||||
LOG = logging.getLogger("axolotl.tests.e2e.multigpu")
|
||||
os.environ["WANDB_DISABLED"] = "true"
|
||||
|
||||
@@ -57,6 +58,7 @@ class TestMultiGPULlama:
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_8bit",
|
||||
@@ -120,6 +122,7 @@ class TestMultiGPULlama:
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 1,
|
||||
"gradient_accumulation_steps": gradient_accumulation_steps,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_8bit",
|
||||
@@ -192,6 +195,7 @@ class TestMultiGPULlama:
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"warmup_steps": 0,
|
||||
"learning_rate": 0.00001,
|
||||
@@ -269,6 +273,7 @@ class TestMultiGPULlama:
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 2,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"warmup_steps": 0,
|
||||
"learning_rate": 0.00001,
|
||||
@@ -329,6 +334,7 @@ class TestMultiGPULlama:
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 2,
|
||||
"gradient_accumulation_steps": gradient_accumulation_steps,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
@@ -398,7 +404,8 @@ class TestMultiGPULlama:
|
||||
"num_epochs": 1,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
@@ -477,7 +484,8 @@ class TestMultiGPULlama:
|
||||
"num_epochs": 1,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
@@ -777,9 +785,10 @@ class TestMultiGPULlama:
|
||||
},
|
||||
],
|
||||
"num_epochs": 1,
|
||||
"max_steps": 5,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 1,
|
||||
"gradient_accumulation_steps": 1,
|
||||
"gradient_checkpointing": True,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_torch_fused",
|
||||
|
||||
@@ -46,7 +46,7 @@ class TestMultiGPUQwen2:
|
||||
},
|
||||
],
|
||||
"num_epochs": 1,
|
||||
"max_steps": 5,
|
||||
"max_steps": 2,
|
||||
"warmup_steps": 20,
|
||||
"micro_batch_size": 2,
|
||||
"gradient_accumulation_steps": 2,
|
||||
|
||||
@@ -9,10 +9,11 @@ from pathlib import Path
|
||||
import pytest
|
||||
import yaml
|
||||
from accelerate.test_utils import execute_subprocess_async
|
||||
from e2e.utils import check_tensorboard, require_torch_lt_2_6_0
|
||||
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import check_tensorboard, require_torch_lt_2_6_0
|
||||
|
||||
LOG = logging.getLogger(__name__)
|
||||
os.environ["WANDB_DISABLED"] = "true"
|
||||
|
||||
@@ -49,7 +50,7 @@ class TestMultiGPURay:
|
||||
"num_epochs": 1,
|
||||
"max_steps": 2,
|
||||
"micro_batch_size": 4,
|
||||
"gradient_accumulation_steps": 4,
|
||||
"gradient_accumulation_steps": 2,
|
||||
"output_dir": temp_dir,
|
||||
"learning_rate": 0.00001,
|
||||
"optimizer": "adamw_8bit",
|
||||
|
||||
@@ -110,7 +110,7 @@ class TestRingAttention:
|
||||
mock_new_group.return_value = mock_group
|
||||
|
||||
# Call register_ring_attn with size 4
|
||||
register_ring_attn(sequence_parallel_degree=4)
|
||||
register_ring_attn(sequence_parallel_degree=4, heads_k_stride=1)
|
||||
|
||||
# Verify the number of calls without examining the arguments
|
||||
assert mock_new_group.call_count == 2
|
||||
|
||||
@@ -14,6 +14,8 @@ from axolotl.train import train
|
||||
from axolotl.utils.config import normalize_config, validate_config
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
LOG = logging.getLogger("axolotl.tests.e2e")
|
||||
os.environ["WANDB_DISABLED"] = "true"
|
||||
|
||||
@@ -23,6 +25,7 @@ class TestDeepseekV3:
|
||||
Test case for DeepseekV3 models
|
||||
"""
|
||||
|
||||
@enable_hf_offline
|
||||
@pytest.mark.parametrize(
|
||||
"sample_packing",
|
||||
[True, False],
|
||||
@@ -80,6 +83,7 @@ class TestDeepseekV3:
|
||||
train(cfg=cfg, dataset_meta=dataset_meta)
|
||||
assert (Path(temp_dir) / "adapter_model.safetensors").exists()
|
||||
|
||||
@enable_hf_offline
|
||||
@pytest.mark.parametrize(
|
||||
"sample_packing",
|
||||
[True, False],
|
||||
|
||||
@@ -5,14 +5,14 @@ E2E tests for llama
|
||||
import logging
|
||||
import os
|
||||
|
||||
from e2e.utils import check_model_output_exists
|
||||
|
||||
from axolotl.cli.args import TrainerCliArgs
|
||||
from axolotl.common.datasets import load_datasets
|
||||
from axolotl.train import train
|
||||
from axolotl.utils.config import normalize_config, validate_config
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.e2e.utils import check_model_output_exists
|
||||
|
||||
LOG = logging.getLogger("axolotl.tests.e2e")
|
||||
os.environ["WANDB_DISABLED"] = "true"
|
||||
|
||||
|
||||
85
tests/hf_offline_utils.py
Normal file
@@ -0,0 +1,85 @@
|
||||
"""
|
||||
test utils for helpers and decorators
|
||||
"""
|
||||
|
||||
import os
|
||||
from functools import wraps
|
||||
|
||||
from huggingface_hub.utils import reset_sessions
|
||||
|
||||
|
||||
def reload_modules(hf_hub_offline):
|
||||
# Force reload of the modules that check this variable
|
||||
import importlib
|
||||
|
||||
import datasets
|
||||
import huggingface_hub.constants
|
||||
|
||||
# Reload the constants module first, as others depend on it
|
||||
importlib.reload(huggingface_hub.constants)
|
||||
huggingface_hub.constants.HF_HUB_OFFLINE = hf_hub_offline
|
||||
importlib.reload(datasets.config)
|
||||
setattr(datasets.config, "HF_HUB_OFFLINE", hf_hub_offline)
|
||||
reset_sessions()
|
||||
|
||||
|
||||
def enable_hf_offline(test_func):
|
||||
"""
|
||||
test decorator that sets HF_HUB_OFFLINE environment variable to True and restores it after the test even if the test fails.
|
||||
:param test_func:
|
||||
:return:
|
||||
"""
|
||||
|
||||
@wraps(test_func)
|
||||
def wrapper(*args, **kwargs):
|
||||
# Save the original value of HF_HUB_OFFLINE environment variable
|
||||
original_hf_offline = os.getenv("HF_HUB_OFFLINE")
|
||||
|
||||
# Set HF_HUB_OFFLINE environment variable to "1" (offline)
|
||||
os.environ["HF_HUB_OFFLINE"] = "1"
|
||||
|
||||
reload_modules(True)
|
||||
try:
|
||||
# Run the test function
|
||||
return test_func(*args, **kwargs)
|
||||
finally:
|
||||
# Restore the original value of HF_HUB_OFFLINE environment variable
|
||||
if original_hf_offline is not None:
|
||||
os.environ["HF_HUB_OFFLINE"] = original_hf_offline
|
||||
# the env var is a string, so compare against "1" rather than bool()-casting it
reload_modules(original_hf_offline == "1")
|
||||
else:
|
||||
del os.environ["HF_HUB_OFFLINE"]
|
||||
reload_modules(False)
|
||||
|
||||
return wrapper
|
||||
|
||||
|
||||
def disable_hf_offline(test_func):
|
||||
"""
|
||||
test decorator that sets HF_HUB_OFFLINE environment variable to False and restores it after the wrapped func
|
||||
:param test_func:
|
||||
:return:
|
||||
"""
|
||||
|
||||
@wraps(test_func)
|
||||
def wrapper(*args, **kwargs):
|
||||
# Save the original value of HF_HUB_OFFLINE environment variable
|
||||
original_hf_offline = os.getenv("HF_HUB_OFFLINE")
|
||||
|
||||
# Set HF_HUB_OFFLINE environment variable to "0" (online)
|
||||
os.environ["HF_HUB_OFFLINE"] = "0"
|
||||
|
||||
reload_modules(False)
|
||||
try:
|
||||
# Run the test function
|
||||
return test_func(*args, **kwargs)
|
||||
finally:
|
||||
# Restore the original value of HF_HUB_OFFLINE environment variable
|
||||
if original_hf_offline is not None:
|
||||
os.environ["HF_HUB_OFFLINE"] = original_hf_offline
|
||||
# the env var is a string, so compare against "1" rather than bool()-casting it
reload_modules(original_hf_offline == "1")
|
||||
else:
|
||||
del os.environ["HF_HUB_OFFLINE"]
|
||||
reload_modules(False)
|
||||
|
||||
return wrapper
|
||||
@@ -4,12 +4,13 @@ shared fixtures for prompt strategies tests
|
||||
|
||||
import pytest
|
||||
from datasets import Dataset
|
||||
from huggingface_hub import hf_hub_download
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
from axolotl.prompt_strategies.jinja_template_analyzer import JinjaTemplateAnalyzer
|
||||
from axolotl.utils.chat_templates import _CHAT_TEMPLATES
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
@pytest.fixture(name="assistant_dataset")
|
||||
def fixture_assistant_dataset():
|
||||
@@ -108,31 +109,27 @@ def fixture_toolcalling_dataset():
|
||||
|
||||
|
||||
@pytest.fixture(name="llama3_tokenizer", scope="session", autouse=True)
|
||||
def fixture_llama3_tokenizer():
|
||||
hf_hub_download(
|
||||
repo_id="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
filename="special_tokens_map.json",
|
||||
)
|
||||
hf_hub_download(
|
||||
repo_id="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
filename="tokenizer_config.json",
|
||||
)
|
||||
hf_hub_download(
|
||||
repo_id="NousResearch/Meta-Llama-3-8B-Instruct", filename="tokenizer.json"
|
||||
)
|
||||
@enable_hf_offline
|
||||
def fixture_llama3_tokenizer(
|
||||
download_llama3_8b_instruct_model_fixture,
|
||||
): # pylint: disable=unused-argument,redefined-outer-name
|
||||
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
|
||||
|
||||
return tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(name="smollm2_tokenizer", scope="session", autouse=True)
|
||||
@enable_hf_offline
|
||||
def fixture_smollm2_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
|
||||
return tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(name="mistralv03_tokenizer", scope="session", autouse=True)
|
||||
def fixture_mistralv03_tokenizer():
|
||||
@enable_hf_offline
|
||||
def fixture_mistralv03_tokenizer(
|
||||
download_mlx_mistral_7b_model_fixture,
|
||||
): # pylint: disable=unused-argument,redefined-outer-name
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"mlx-community/Mistral-7B-Instruct-v0.3-4bit"
|
||||
)
|
||||
@@ -140,6 +137,7 @@ def fixture_mistralv03_tokenizer():
|
||||
|
||||
|
||||
@pytest.fixture(name="phi35_tokenizer", scope="session", autouse=True)
|
||||
@enable_hf_offline
|
||||
def fixture_phi35_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
|
||||
return tokenizer
|
||||
|
||||
@@ -11,6 +11,8 @@ from axolotl.datasets import TokenizedPromptDataset
|
||||
from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
|
||||
from axolotl.prompters import AlpacaPrompter, PromptStyle
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
@pytest.fixture(name="alpaca_dataset")
|
||||
def fixture_alpaca_dataset():
|
||||
@@ -26,6 +28,7 @@ def fixture_alpaca_dataset():
|
||||
|
||||
|
||||
@pytest.fixture(name="tokenizer")
|
||||
@enable_hf_offline
|
||||
def fixture_tokenizer():
|
||||
# pylint: disable=all
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
|
||||
@@ -13,8 +13,11 @@ from axolotl.utils.chat_templates import (
|
||||
get_chat_template,
|
||||
)
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
@pytest.fixture(name="llama3_tokenizer")
|
||||
@enable_hf_offline
|
||||
def fixture_llama3_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")
|
||||
|
||||
|
||||
@@ -17,6 +17,8 @@ from axolotl.prompt_strategies.chat_template import (
|
||||
from axolotl.prompters import IGNORE_TOKEN_ID
|
||||
from axolotl.utils.chat_templates import get_chat_template
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
logging.basicConfig(level=logging.DEBUG)
|
||||
LOG = logging.getLogger("axolotl")
|
||||
|
||||
@@ -30,12 +32,14 @@ PARAMETRIZE_PARAMS = [
|
||||
"mistralv03_tokenizer_chat_template_jinja",
|
||||
"[/INST]",
|
||||
),
|
||||
(
|
||||
"gemma2_tokenizer",
|
||||
"jinja",
|
||||
"gemma2_tokenizer_chat_template_jinja",
|
||||
"<end_of_turn>",
|
||||
),
|
||||
# TODO: temporarily skip gemma due to gemma3 template
|
||||
# Re-enable on new chat_template implementation for perf
|
||||
# (
|
||||
# "gemma2_tokenizer",
|
||||
# "jinja",
|
||||
# "gemma2_tokenizer_chat_template_jinja",
|
||||
# "<end_of_turn>",
|
||||
# ),
|
||||
("phi35_tokenizer", "phi_35", None, "<|end|>"),
|
||||
]
|
||||
|
||||
@@ -93,7 +97,11 @@ class TestChatTemplateConfigurations:
|
||||
if (
|
||||
turn_idx == 0
|
||||
and turn.get("from") in ["system", "context"]
|
||||
and "mistral" in tokenizer.name_or_path.lower()
|
||||
and (
|
||||
"mistral" in tokenizer.name_or_path.lower()
|
||||
or "gemma"
|
||||
in tokenizer.name_or_path.lower() # temporarily skip gemma due to gemma3 template
|
||||
)
|
||||
):
|
||||
assert (
|
||||
start_idx == -1 and end_idx == -1
|
||||
@@ -101,6 +109,7 @@ class TestChatTemplateConfigurations:
|
||||
return True
|
||||
return False
|
||||
|
||||
@enable_hf_offline
|
||||
def test_train_on_inputs_true(
|
||||
self,
|
||||
tokenizer,
|
||||
|
||||
@@ -11,6 +11,8 @@ from transformers import AutoTokenizer
|
||||
from axolotl.prompt_strategies.dpo.chat_template import default
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
@pytest.fixture(name="assistant_dataset")
|
||||
def fixture_assistant_dataset():
|
||||
@@ -78,15 +80,8 @@ def fixture_custom_assistant_dataset():
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(name="llama3_tokenizer")
|
||||
def fixture_llama3_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")
|
||||
tokenizer.eos_token = "<|eot_id|>"
|
||||
|
||||
return tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(name="phi3_tokenizer")
|
||||
@enable_hf_offline
|
||||
def fixture_phi3_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-128k-instruct")
|
||||
|
||||
@@ -94,6 +89,7 @@ def fixture_phi3_tokenizer():
|
||||
|
||||
|
||||
@pytest.fixture(name="gemma_tokenizer")
|
||||
@enable_hf_offline
|
||||
def fixture_gemma_tokenizer():
|
||||
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2b-it", revision="703fb4a")
|
||||
|
||||
|
||||
@@ -10,6 +10,8 @@ from axolotl.prompt_strategies.dpo import load as load_dpo
|
||||
from axolotl.utils.data.rl import load_prepare_preference_datasets
|
||||
from axolotl.utils.dict import DictDefault
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
@pytest.fixture(name="minimal_dpo_cfg")
|
||||
def fixture_cfg():
|
||||
@@ -34,6 +36,8 @@ class TestDPOChatml:
|
||||
Test loading DPO preference datasets with chatml formatting
|
||||
"""
|
||||
|
||||
@pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
|
||||
@enable_hf_offline
|
||||
def test_default(self, minimal_dpo_cfg):
|
||||
cfg = DictDefault(
|
||||
{
|
||||
|
||||
@@ -8,12 +8,15 @@ from transformers import LlamaTokenizer
|
||||
|
||||
from axolotl.utils.data import encode_pretraining, md5
|
||||
|
||||
from tests.hf_offline_utils import enable_hf_offline
|
||||
|
||||
|
||||
class TestEncodePretraining(unittest.TestCase):
|
||||
"""
|
||||
test class for encode pretraining and md5 helper
|
||||
"""
|
||||
|
||||
@enable_hf_offline
|
||||
def setUp(self):
|
||||
self.tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
|
||||
self.tokenizer.add_special_tokens(
|
||||
|
||||
@@ -4,31 +4,37 @@ Test dataset loading under various conditions.

import shutil
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch

from conftest import snapshot_download_w_retry
from constants import (
    ALPACA_MESSAGES_CONFIG_OG,
    ALPACA_MESSAGES_CONFIG_REVISION,
    SPECIAL_TOKENS,
)
import pytest
from datasets import Dataset
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizer

from axolotl.utils.data import load_tokenized_prepared_datasets
from axolotl.utils.data.rl import load_prepare_preference_datasets
from axolotl.utils.dict import DictDefault

from tests.constants import (
    ALPACA_MESSAGES_CONFIG_OG,
    ALPACA_MESSAGES_CONFIG_REVISION,
    SPECIAL_TOKENS,
)
from tests.hf_offline_utils import enable_hf_offline

class TestDatasetPreparation(unittest.TestCase):

class TestDatasetPreparation:
    """Test a configured dataloader."""

    def setUp(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.add_special_tokens(SPECIAL_TOKENS)
        # Alpaca dataset.
        self.dataset = Dataset.from_list(
    @pytest.fixture
    def tokenizer(self, tokenizer_huggyllama) -> PreTrainedTokenizer:
        tokenizer_huggyllama.add_special_tokens(SPECIAL_TOKENS)
        yield tokenizer_huggyllama

    @pytest.fixture
    def dataset_fixture(self):
        yield Dataset.from_list(
            [
                {
                    "instruction": "Evaluate this sentence for spelling and grammar mistakes",
@@ -38,7 +44,9 @@ class TestDatasetPreparation(unittest.TestCase):
            ]
        )

    def test_load_hub(self):
    @pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
    @enable_hf_offline
    def test_load_hub(self, tokenizer):
        """Core use case. Verify that processing data from the hub works"""
        with tempfile.TemporaryDirectory() as tmp_dir:
            prepared_path = Path(tmp_dir) / "prepared"
@@ -55,25 +63,28 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 2000
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

    def test_load_local_hub(self):
    @enable_hf_offline
    @pytest.mark.skip("datasets bug with local datasets when offline")
    def test_load_local_hub(self, tokenizer):
        """Niche use case. Verify that a local copy of a hub dataset can be loaded"""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_path = Path(tmp_dir) / "mhenrichsen/alpaca_2k_test"
            tmp_ds_path.mkdir(parents=True, exist_ok=True)
            snapshot_download_w_retry(
            snapshot_path = snapshot_download(
                repo_id="mhenrichsen/alpaca_2k_test",
                repo_type="dataset",
                local_dir=tmp_ds_path,
            )
            # offline mode doesn't actually copy it to local_dir, so we
            # have to copy all the contents in the dir manually from the returned snapshot_path
            shutil.copytree(snapshot_path, tmp_ds_path, dirs_exist_ok=True)

            prepared_path = Path(tmp_dir) / "prepared"
            # Right now a local copy that doesn't fully conform to a dataset
@@ -96,9 +107,7 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 2000
            assert "input_ids" in dataset.features
@@ -106,11 +115,12 @@ class TestDatasetPreparation(unittest.TestCase):
            assert "labels" in dataset.features
            shutil.rmtree(tmp_ds_path)

    def test_load_from_save_to_disk(self):
    @enable_hf_offline
    def test_load_from_save_to_disk(self, tokenizer, dataset_fixture):
        """Usual use case. Verify datasets saved via `save_to_disk` can be loaded."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_name = Path(tmp_dir) / "tmp_dataset"
            self.dataset.save_to_disk(str(tmp_ds_name))
            dataset_fixture.save_to_disk(str(tmp_ds_name))

            prepared_path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -126,22 +136,21 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 1
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

    def test_load_from_dir_of_parquet(self):
    @enable_hf_offline
    def test_load_from_dir_of_parquet(self, tokenizer, dataset_fixture):
        """Usual use case. Verify a directory of parquet files can be loaded."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_dir = Path(tmp_dir) / "tmp_dataset"
            tmp_ds_dir.mkdir()
            tmp_ds_path = tmp_ds_dir / "shard1.parquet"
            self.dataset.to_parquet(tmp_ds_path)
            dataset_fixture.to_parquet(tmp_ds_path)

            prepared_path: Path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -162,22 +171,21 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 1
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

    def test_load_from_dir_of_json(self):
    @enable_hf_offline
    def test_load_from_dir_of_json(self, tokenizer, dataset_fixture):
        """Standard use case. Verify a directory of json files can be loaded."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_dir = Path(tmp_dir) / "tmp_dataset"
            tmp_ds_dir.mkdir()
            tmp_ds_path = tmp_ds_dir / "shard1.json"
            self.dataset.to_json(tmp_ds_path)
            dataset_fixture.to_json(tmp_ds_path)

            prepared_path: Path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -198,20 +206,19 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 1
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

    def test_load_from_single_parquet(self):
    @enable_hf_offline
    def test_load_from_single_parquet(self, tokenizer, dataset_fixture):
        """Standard use case. Verify a single parquet file can be loaded."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_path = Path(tmp_dir) / "tmp_dataset.parquet"
            self.dataset.to_parquet(tmp_ds_path)
            dataset_fixture.to_parquet(tmp_ds_path)

            prepared_path: Path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -228,20 +235,19 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 1
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

    def test_load_from_single_json(self):
    @enable_hf_offline
    def test_load_from_single_json(self, tokenizer, dataset_fixture):
        """Standard use case. Verify a single json file can be loaded."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_path = Path(tmp_dir) / "tmp_dataset.json"
            self.dataset.to_json(tmp_ds_path)
            dataset_fixture.to_json(tmp_ds_path)

            prepared_path: Path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -258,15 +264,15 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 1
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features

@pytest.mark.skip(reason="TODO: fix hf offline mode for CI rate limits")
|
||||
@enable_hf_offline
|
||||
def test_load_hub_with_dpo(self):
|
||||
"""Verify that processing dpo data from the hub works"""
|
||||
|
||||
@@ -285,7 +291,9 @@ class TestDatasetPreparation(unittest.TestCase):
|
||||
assert len(train_dataset) == 1800
|
||||
assert "conversation" in train_dataset.features
|
||||
|
||||
def test_load_hub_with_revision(self):
|
||||
@pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
|
||||
@enable_hf_offline
|
||||
def test_load_hub_with_revision(self, tokenizer):
|
||||
"""Verify that processing data from the hub works with a specific revision"""
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
prepared_path = Path(tmp_dir) / "prepared"
|
||||
@@ -307,16 +315,17 @@ class TestDatasetPreparation(unittest.TestCase):
|
||||
}
|
||||
)
|
||||
|
||||
dataset, _ = load_tokenized_prepared_datasets(
|
||||
self.tokenizer, cfg, prepared_path
|
||||
)
|
||||
dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)
|
||||
|
||||
assert len(dataset) == 2000
|
||||
assert "input_ids" in dataset.features
|
||||
assert "attention_mask" in dataset.features
|
||||
assert "labels" in dataset.features
|
||||
|
||||
def test_load_hub_with_revision_with_dpo(self):
|
||||
@enable_hf_offline
|
||||
def test_load_hub_with_revision_with_dpo(
|
||||
self, dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff
|
||||
):
|
||||
"""Verify that processing dpo data from the hub works with a specific revision"""
|
||||
|
||||
cfg = DictDefault(
|
||||
@@ -329,22 +338,34 @@ class TestDatasetPreparation(unittest.TestCase):
|
||||
}
|
||||
)
|
||||
|
||||
train_dataset, _ = load_prepare_preference_datasets(cfg)
|
||||
# pylint: disable=duplicate-code
|
||||
with patch("axolotl.utils.data.rl.load_dataset_w_config") as mock_load_dataset:
|
||||
# Set up the mock to return different values on successive calls
|
||||
mock_load_dataset.return_value = (
|
||||
dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff
|
||||
)
|
||||
|
||||
assert len(train_dataset) == 1800
|
||||
assert "conversation" in train_dataset.features
|
||||
train_dataset, _ = load_prepare_preference_datasets(cfg)
|
||||
|
||||
    def test_load_local_hub_with_revision(self):
        assert len(train_dataset) == 1800
        assert "conversation" in train_dataset.features

    @enable_hf_offline
    @pytest.mark.skip("datasets bug with local datasets when offline")
    def test_load_local_hub_with_revision(
        self, dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff, tokenizer
    ):
        """Verify that a local copy of a hub dataset can be loaded with a specific revision"""
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_path = Path(tmp_dir) / "mhenrichsen/alpaca_2k_test"
            tmp_ds_path.mkdir(parents=True, exist_ok=True)
            snapshot_download_w_retry(
            snapshot_path = snapshot_download(
                repo_id="mhenrichsen/alpaca_2k_test",
                repo_type="dataset",
                local_dir=tmp_ds_path,
                revision="d05c1cb",
            )
            shutil.copytree(snapshot_path, tmp_ds_path, dirs_exist_ok=True)

            prepared_path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -365,27 +386,37 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            with patch(
                "axolotl.utils.data.shared.load_dataset_w_config"
            ) as mock_load_dataset:
                # Set up the mock to return different values on successive calls
                mock_load_dataset.return_value = (
                    dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff
                )

            assert len(dataset) == 2000
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features
            shutil.rmtree(tmp_ds_path)
                dataset, _ = load_tokenized_prepared_datasets(
                    tokenizer, cfg, prepared_path
                )

    def test_loading_local_dataset_folder(self):
                assert len(dataset) == 2000
                assert "input_ids" in dataset.features
                assert "attention_mask" in dataset.features
                assert "labels" in dataset.features
                shutil.rmtree(tmp_ds_path)

    @enable_hf_offline
    def test_loading_local_dataset_folder(self, tokenizer):
        """Verify that a dataset downloaded to a local folder can be loaded"""

        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_ds_path = Path(tmp_dir) / "mhenrichsen/alpaca_2k_test"
            tmp_ds_path.mkdir(parents=True, exist_ok=True)
            snapshot_download_w_retry(
            snapshot_path = snapshot_download(
                repo_id="mhenrichsen/alpaca_2k_test",
                repo_type="dataset",
                local_dir=tmp_ds_path,
            )
            shutil.copytree(snapshot_path, tmp_ds_path, dirs_exist_ok=True)

            prepared_path = Path(tmp_dir) / "prepared"
            cfg = DictDefault(
@@ -401,16 +432,10 @@ class TestDatasetPreparation(unittest.TestCase):
                }
            )

            dataset, _ = load_tokenized_prepared_datasets(
                self.tokenizer, cfg, prepared_path
            )
            dataset, _ = load_tokenized_prepared_datasets(tokenizer, cfg, prepared_path)

            assert len(dataset) == 2000
            assert "input_ids" in dataset.features
            assert "attention_mask" in dataset.features
            assert "labels" in dataset.features
            shutil.rmtree(tmp_ds_path)


if __name__ == "__main__":
    unittest.main()

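The local-dataset tests above replace direct `huggingface_hub.snapshot_download` calls with a `snapshot_download_w_retry` helper imported from the test `conftest`; its body is not part of this diff. A minimal sketch of such a wrapper, assuming it only retries on transient Hub errors (e.g. rate limits) with a simple backoff:

```python
import time

from huggingface_hub import snapshot_download


def snapshot_download_w_retry(*args, max_attempts: int = 3, wait_seconds: float = 5.0, **kwargs):
    """Hypothetical sketch: retry snapshot_download on transient Hub errors."""
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return snapshot_download(*args, **kwargs)
        except Exception as err:  # the real helper likely narrows this to Hub/HTTP errors
            last_err = err
            if attempt == max_attempts:
                break
            time.sleep(wait_seconds * attempt)  # simple linear backoff
    raise last_err
```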
@@ -8,9 +8,8 @@ import hashlib
import unittest
from unittest.mock import patch

from constants import ALPACA_MESSAGES_CONFIG_REVISION, SPECIAL_TOKENS
import pytest
from datasets import Dataset
from transformers import AutoTokenizer

from axolotl.utils.config import normalize_config
from axolotl.utils.data import prepare_dataset
@@ -19,6 +18,9 @@ from axolotl.utils.data.utils import deduplicate_and_log_datasets
from axolotl.utils.dict import DictDefault
from axolotl.utils.models import load_processor, load_tokenizer

from tests.constants import ALPACA_MESSAGES_CONFIG_REVISION
from tests.hf_offline_utils import enable_hf_offline


def verify_deduplication(actual_dataset, expected_dataset, dataset_name):
    """
@@ -214,13 +216,12 @@ class TestDeduplicateIndividualFunctions(unittest.TestCase):
        verify_deduplication(eval_dataset, expected_dataset_eval, "eval_dataset")


class TestDeduplicateRLDataset(unittest.TestCase):
class TestDeduplicateRLDataset:
    """Test a configured dataloader with deduplication."""

    def setUp(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.add_special_tokens(SPECIAL_TOKENS)
        self.cfg = DictDefault(
    @pytest.fixture
    def cfg(self):
        fixture = DictDefault(
            {
                "tokenizer_config": "huggyllama/llama-7b",
                "sequence_len": 1024,
@@ -233,34 +234,68 @@ class TestDeduplicateRLDataset(unittest.TestCase):
                ],
            }
        )
        yield fixture

    def test_load_with_deduplication(self):
    @enable_hf_offline
    def test_load_with_deduplication(
        self,
        cfg,
        dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
        tokenizer_huggyllama,
    ):
        """Verify that loading with deduplication removes duplicates."""

        # Load the dataset using the deduplication setting
        train_dataset, _ = load_prepare_preference_datasets(self.cfg)
        # pylint: disable=duplicate-code
        with (
            patch("axolotl.utils.data.rl.load_dataset_w_config") as mock_load_dataset,
            patch("axolotl.utils.models.load_tokenizer") as mock_load_tokenizer,
        ):
            # Set up the mock to return different values on successive calls
            mock_load_dataset.side_effect = [
                dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
                dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
            ]
            mock_load_tokenizer.return_value = tokenizer_huggyllama

        # Verify that the dataset has been deduplicated
        assert len(train_dataset) == 1800, "Dataset was not properly deduplicated"
            train_dataset, _ = load_prepare_preference_datasets(cfg)

    def test_load_without_deduplication(self):
        """Verify that loading without deduplication retains duplicates."""
        self.cfg.dataset_exact_deduplication = False
        # Load the dataset without deduplication
        train_dataset, _ = load_prepare_preference_datasets(self.cfg)
        # Verify that the dataset has been deduplicated
        assert len(train_dataset) == 1800, "Dataset was not properly deduplicated"

        # Verify that the dataset retains duplicates
        assert (
            len(train_dataset) == 1800 * 2
        ), "Dataset deduplication occurred when it should not have"
    @enable_hf_offline
    def test_load_without_deduplication(
        self,
        cfg,
        dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
        tokenizer_huggyllama,
    ):
        # pylint: disable=duplicate-code
        with (
            patch("axolotl.utils.data.rl.load_dataset_w_config") as mock_load_dataset,
            patch("axolotl.utils.models.load_tokenizer") as mock_load_tokenizer,
        ):
            # Set up the mock to return different values on successive calls
            mock_load_dataset.side_effect = [
                dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
                dataset_fozziethebeat_alpaca_messages_2k_dpo_test_rev_ea82cff,
            ]
            mock_load_tokenizer.return_value = tokenizer_huggyllama

            cfg.dataset_exact_deduplication = False
            # Load the dataset without deduplication
            train_dataset, _ = load_prepare_preference_datasets(cfg)

            # Verify that the dataset retains duplicates
            assert (
                len(train_dataset) == 1800 * 2
            ), "Dataset deduplication occurred when it should not have"


class TestDeduplicateNonRL(unittest.TestCase):
    """Test prepare_dataset function with different configurations."""

    @enable_hf_offline
    def setUp(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.add_special_tokens(SPECIAL_TOKENS)
        self.cfg_1 = DictDefault(
            {
                "base_model": "huggyllama/llama-7b",
@@ -286,6 +321,8 @@ class TestDeduplicateNonRL(unittest.TestCase):
        )
        normalize_config(self.cfg_1)

    @pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
    @enable_hf_offline
    def test_prepare_dataset_with_deduplication_train(self):
        """Verify that prepare_dataset function processes the dataset correctly with deduplication."""
        self.cfg_1.dataset_exact_deduplication = True
@@ -311,6 +348,8 @@ class TestDeduplicateNonRL(unittest.TestCase):
            "Train dataset should have 2000 samples after deduplication.",
        )

    @pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
    @enable_hf_offline
    def test_prepare_dataset_with_deduplication_eval(self):
        """Verify that prepare_dataset function processes the dataset correctly with deduplication."""
        self.cfg_1.dataset_exact_deduplication = True
@@ -336,6 +375,8 @@ class TestDeduplicateNonRL(unittest.TestCase):
            "Eval dataset should have 2000 samples after deduplication.",
        )

    @pytest.mark.skip(reason="TODO: fix hf hub offline to work with HF rate limits")
    @enable_hf_offline
    def test_prepare_dataset_without_deduplication(self):
        """Verify that prepare_dataset function processes the dataset correctly without deduplication."""
        self.cfg_1.dataset_exact_deduplication = False

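For context, the exact-deduplication behaviour these tests assert (1800 unique rows left after loading the same 1800-row split twice) amounts to keeping the first occurrence of each identical row. A rough sketch of that idea, not the actual `deduplicate_and_log_datasets` implementation:

```python
import hashlib

from datasets import Dataset


def exact_dedup(dataset: Dataset) -> Dataset:
    """Rough sketch: keep the first occurrence of each identical row (not the real axolotl helper)."""
    seen = set()
    keep_indices = []
    for idx, row in enumerate(dataset):
        row_hash = hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        if row_hash not in seen:
            seen.add(row_hash)
            keep_indices.append(idx)
    return dataset.select(keep_indices)
```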
@@ -1,7 +1,7 @@
"""Module for testing streaming dataset sequence packing"""

import pytest
from datasets import concatenate_datasets, load_dataset
from datasets import concatenate_datasets
from torch.utils.data import DataLoader, RandomSampler
from transformers import AutoTokenizer

@@ -12,6 +12,8 @@ from axolotl.utils.data.utils import drop_long_seq_in_dataset
from axolotl.utils.dict import DictDefault
from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths

from tests.hf_offline_utils import enable_hf_offline


@pytest.fixture(name="tokenizer")
def fixture_tokenizer():
@@ -35,13 +37,20 @@ class TestBatchedSamplerPacking:
        ],
    )
    @pytest.mark.parametrize("max_seq_length", [4096, 512])
    def test_packing(self, batch_size, num_workers, tokenizer, max_seq_length):
    @pytest.mark.parametrize("sequential", [True, False])
    @enable_hf_offline
    def test_packing(
        self,
        dataset_winglian_tiny_shakespeare,
        batch_size,
        num_workers,
        tokenizer,
        max_seq_length,
        sequential,
    ):
        import axolotl.monkeypatch.data.batch_dataset_fetcher  # pylint: disable=unused-import  # noqa: F401

        dataset = load_dataset(
            "Trelis/tiny-shakespeare",
            split="train",
        )
        dataset = dataset_winglian_tiny_shakespeare["train"]

        cfg = DictDefault(
            {
@@ -51,7 +60,7 @@ class TestBatchedSamplerPacking:
        )
        ds_cfg = DictDefault(
            {
                "field": "Text",
                "field": "text",
            }
        )
        completion_strategy = load(tokenizer, cfg, ds_cfg)
@@ -71,6 +80,7 @@ class TestBatchedSamplerPacking:
            batch_max_len=max_seq_length,
            group_size=100000,
            bin_size=200,
            sequential=sequential,
        )

        loader = DataLoader(

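The new `sequential` parameter exercised above controls whether the multipack sampler fills its token bins in dataset order rather than from the shuffled sampler. As a toy illustration of the underlying idea of greedy length-based packing under a per-batch token budget (plain Python, not axolotl's actual `MultipackBatchSampler`):

```python
def pack_greedily(lengths, batch_max_len):
    """Toy illustration: greedily pack sample lengths into bins of at most batch_max_len tokens."""
    bins, current, current_len = [], [], 0
    for idx, length in enumerate(lengths):
        if current and current_len + length > batch_max_len:
            bins.append(current)
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        bins.append(current)
    return bins


print(pack_greedily([512, 1024, 2048, 256, 4096, 128], batch_max_len=4096))
# [[0, 1, 2, 3], [4], [5]]
```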
@@ -10,12 +10,15 @@ from axolotl.datasets import ConstantLengthDataset, TokenizedPromptDataset
from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
from axolotl.prompters import AlpacaPrompter

from tests.hf_offline_utils import enable_hf_offline


class TestPacking(unittest.TestCase):
    """
    Test class for packing dataset sequences
    """

    @enable_hf_offline
    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

@@ -1,43 +1,60 @@
"""Module for testing streaming dataset sequence packing"""

import functools
import unittest
import random
import string

import pytest
import torch
from datasets import load_dataset
from datasets import IterableDataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

from axolotl.utils.data import get_dataset_wrapper, wrap_pretraining_dataset
from axolotl.utils.dict import DictDefault


class TestPretrainingPacking(unittest.TestCase):
class TestPretrainingPacking:
    """
    Test class for packing streaming dataset sequences
    """

    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.pad_token = "</s>"
    @pytest.fixture
    def random_text(self):
        # seed with random.seed(0) for reproducibility
        random.seed(0)

    @pytest.mark.flaky(retries=3, delay=5)
    def test_packing_stream_dataset(self):
        # pylint: disable=duplicate-code
        dataset = load_dataset(
            "allenai/c4",
            "en",
            streaming=True,
        )["train"]
        # generate row of random text with "words" of between 2 and 10 characters and
        # between 400 to 1200 characters per line
        def rand_txt():
            return " ".join(
                [
                    "".join(
                        random.choices(string.ascii_lowercase, k=random.randint(2, 10))
                    )
                    for _ in range(random.randint(50, 200))
                ]
            )

        # Create a list of 2000 random texts rather than just using it within the
        # generator so the test runs faster
        data = [rand_txt() for _ in range(500)]

        # Create an IterableDataset
        def generator():
            for row in data:
                yield {"text": row}

        return IterableDataset.from_generator(generator)

    @pytest.mark.flaky(retries=1, delay=5)
    def test_packing_stream_dataset(self, tokenizer_huggyllama, random_text):
        dataset = random_text

        cfg = DictDefault(
            {
                "pretraining_dataset": [
                    {
                        "path": "allenai/c4",
                        "name": "en",
                        "path": "winglian/tiny-shakespeare",
                        "type": "pretrain",
                    }
                ],
@@ -54,15 +71,16 @@ class TestPretrainingPacking(unittest.TestCase):
        ds_wrapper_partial = functools.partial(
            get_dataset_wrapper,
            cfg.pretraining_dataset[0],
            self.tokenizer,
            tokenizer_huggyllama,
            cfg,
            cfg.pretraining_dataset[0]["type"] or "pretrain",
        )

        # pylint: disable=duplicate-code
        original_bsz = cfg.micro_batch_size
        train_dataset = wrap_pretraining_dataset(
            dataset,
            self.tokenizer,
            tokenizer_huggyllama,
            cfg,
            ds_wrapper_partial,
            max_tokens=cfg.sequence_len,
@@ -78,7 +96,7 @@ class TestPretrainingPacking(unittest.TestCase):
        )
        idx = 0
        for data in trainer_loader:
            if idx > 10:
            if idx > 3:
                break
            assert data["input_ids"].shape == torch.Size(
                [1, original_bsz * cfg.sequence_len]
@@ -95,7 +113,3 @@ class TestPretrainingPacking(unittest.TestCase):
            #     [1, original_bsz * cfg.sequence_len]
            # )
            idx += 1


if __name__ == "__main__":
    unittest.main()

@@ -2,12 +2,8 @@

import json
import logging
import unittest
from pathlib import Path

from datasets import load_dataset
from transformers import AddedToken, AutoTokenizer, LlamaTokenizer

from axolotl.prompt_strategies.alpaca_chat import NoSystemPrompter
from axolotl.prompt_strategies.alpaca_w_system import (
    InstructionWSystemPromptTokenizingStrategy,
@@ -22,6 +18,8 @@ from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
from axolotl.prompters import AlpacaPrompter, PromptStyle
from axolotl.utils.dict import DictDefault

from tests.hf_offline_utils import enable_hf_offline

LOG = logging.getLogger("axolotl")

test_data = {
@@ -58,23 +56,13 @@ test_data = {
}


class TestPromptTokenizationStrategies(unittest.TestCase):
class TestPromptTokenizationStrategies:
    """
    Test class for prompt tokenization strategies.
    """

    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.add_special_tokens(
            {
                "bos_token": "<s>",
                "eos_token": "</s>",
                "unk_token": "<unk>",
            }
        )

    def test_no_sys_prompt(self):
    @enable_hf_offline
    def test_no_sys_prompt(self, tokenizer_huggyllama_w_special_tokens):
        """
        tests the interface between the user and assistant parts
        """
@@ -82,7 +70,7 @@ class TestPromptTokenizationStrategies(unittest.TestCase):
        # pylint: disable=duplicate-code
        strat = AlpacaPromptTokenizingStrategy(
            prompter,
            self.tokenizer,
            tokenizer_huggyllama_w_special_tokens,
            False,
            2048,
        )
@@ -95,7 +83,8 @@ class TestPromptTokenizationStrategies(unittest.TestCase):
        assert example["labels"][world_idx] == 3186
        assert example["labels"][world_idx - 1] == -100

    def test_alpaca(self):
    @enable_hf_offline
    def test_alpaca(self, tokenizer_huggyllama_w_special_tokens):
        """
        tests the interface between the user and assistant parts
        """
@@ -103,7 +92,7 @@ class TestPromptTokenizationStrategies(unittest.TestCase):
        prompter = AlpacaPrompter()
        strat = AlpacaPromptTokenizingStrategy(
            prompter,
            self.tokenizer,
            tokenizer_huggyllama_w_special_tokens,
            False,
            2048,
        )
@@ -114,27 +103,17 @@ class TestPromptTokenizationStrategies(unittest.TestCase):
        assert example["labels"][world_idx - 1] == -100


class InstructionWSystemPromptTokenizingStrategyTest(unittest.TestCase):
class TestInstructionWSystemPromptTokenizingStrategy:
    """
    Test class for prompt tokenization strategies with sys prompt from the dataset
    """

    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.add_special_tokens(
            {
                "bos_token": "<s>",
                "eos_token": "</s>",
                "unk_token": "<unk>",
            }
        )

    def test_system_alpaca(self):
    @enable_hf_offline
    def test_system_alpaca(self, tokenizer_huggyllama_w_special_tokens):
        prompter = SystemDataPrompter(PromptStyle.CHAT.value)
        strat = InstructionWSystemPromptTokenizingStrategy(
            prompter,
            self.tokenizer,
            tokenizer_huggyllama_w_special_tokens,
            False,
            2048,
        )
@@ -155,17 +134,13 @@ class InstructionWSystemPromptTokenizingStrategyTest(unittest.TestCase):
        assert example["input_ids"][8] == 11889  # USER


class Llama2ChatTokenizationTest(unittest.TestCase):
class Llama2ChatTokenizationTest:
    """
    Test class for prompt tokenization strategies with sys prompt from the dataset
    """

    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = LlamaTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
        # woraround because official Meta repos are not open

    def test_llama2_chat_integration(self):
    @enable_hf_offline
    def test_llama2_chat_integration(self, tokenizer_llama2_7b):
        with open(
            Path(__file__).parent / "fixtures/conversation.json", encoding="utf-8"
        ) as fin:
@@ -180,16 +155,18 @@ class Llama2ChatTokenizationTest(unittest.TestCase):
        prompter = Llama2ChatPrompter()
        strat = LLama2ChatTokenizingStrategy(
            prompter,
            self.tokenizer,
            tokenizer_llama2_7b,
            False,
            4096,
        )
        example = strat.tokenize_prompt(conversation)
        for fields in ["input_ids", "attention_mask", "labels"]:
            self.assertEqual(len(example[fields]), len(tokenized_conversation[fields]))
            self.assertEqual(example[fields], tokenized_conversation[fields])
            # pytest assert equals

    def compare_with_transformers_integration(self):
            assert len(example[fields]) == len(tokenized_conversation[fields])
            assert example[fields] == tokenized_conversation[fields]

    def compare_with_transformers_integration(self, tokenizer_llama2_7b):
        # this needs transformers >= v4.31.0
        from transformers.models.llama.tokenization_llama import B_SYS, E_SYS
        from transformers.pipelines.conversational import Conversation
@@ -228,47 +205,27 @@ If a question does not make any sense, or is not factually coherent, explain why
            generated_responses=answers,
        )
        # pylint: disable=W0212
        hf_tokens = self.tokenizer._build_conversation_input_ids(hf_conf)
        hf_tokens = tokenizer_llama2_7b._build_conversation_input_ids(hf_conf)

        self.assertEqual(
            hf_tokens, tokenized_conversation["input_ids"][: len(hf_tokens)]
        )
        assert hf_tokens == tokenized_conversation["input_ids"][: len(hf_tokens)]


class OrpoTokenizationTest(unittest.TestCase):
class OrpoTokenizationTest:
    """test case for the ORPO tokenization"""

    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        tokenizer = LlamaTokenizer.from_pretrained(
            "casperhansen/mistral-7b-instruct-v0.1-awq"
        )
        tokenizer.add_special_tokens(
            {
                "eos_token": AddedToken(
                    "<|im_end|>", rstrip=False, lstrip=False, normalized=False
                )
            }
        )
        tokenizer.add_tokens(
            [
                AddedToken(
                    "<|im_start|>", rstrip=False, lstrip=False, normalized=False
                ),
            ]
        )
        self.tokenizer = tokenizer
        self.dataset = load_dataset(
            "argilla/ultrafeedback-binarized-preferences-cleaned", split="train"
        ).select([0])

    def test_orpo_integration(self):
    @enable_hf_offline
    def test_orpo_integration(
        self,
        tokenizer_mistral_7b_instruct_chatml,
        dataset_argilla_ultrafeedback_binarized_preferences_cleaned,
    ):
        ds = dataset_argilla_ultrafeedback_binarized_preferences_cleaned.select([0])
        strat = load(
            self.tokenizer,
            tokenizer_mistral_7b_instruct_chatml,
            DictDefault({"train_on_inputs": False}),
            DictDefault({"chat_template": "chatml"}),
        )
        res = strat.tokenize_prompt(self.dataset[0])
        res = strat.tokenize_prompt(ds[0])
        assert "rejected_input_ids" in res
        assert "rejected_labels" in res
        assert "input_ids" in res
@@ -287,7 +244,3 @@ class OrpoTokenizationTest(unittest.TestCase):

        assert res["prompt_attention_mask"][0] == 1
        assert res["prompt_attention_mask"][-1] == 0


if __name__ == "__main__":
    unittest.main()

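The `-100` values asserted throughout these tokenization tests follow the usual Hugging Face convention for masking prompt tokens out of the loss when `train_on_inputs` is false. A tiny standalone illustration of that convention (plain Python with illustrative token ids, not the axolotl strategies above):

```python
IGNORE_INDEX = -100  # tokens with this label are ignored by the cross-entropy loss

prompt_ids = [1, 13866, 338]      # e.g. tokenized instruction (values illustrative)
response_ids = [15043, 3186, 2]   # e.g. tokenized response + eos (values illustrative)

input_ids = prompt_ids + response_ids
# train_on_inputs=False: mask the prompt, learn only on the response tokens
labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids

assert labels[len(prompt_ids) - 1] == IGNORE_INDEX
assert labels[len(prompt_ids)] == response_ids[0]
```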
@@ -9,12 +9,15 @@ import pytest
from axolotl.utils.dict import DictDefault
from axolotl.utils.models import load_tokenizer

from tests.hf_offline_utils import enable_hf_offline


class TestTokenizers:
    """
    test class for the load_tokenizer fn
    """

    @enable_hf_offline
    def test_default_use_fast(self):
        cfg = DictDefault(
            {
@@ -24,6 +27,7 @@ class TestTokenizers:
        tokenizer = load_tokenizer(cfg)
        assert "Fast" in tokenizer.__class__.__name__

    @enable_hf_offline
    def test_dont_use_fast(self):
        cfg = DictDefault(
            {
@@ -34,6 +38,7 @@ class TestTokenizers:
        tokenizer = load_tokenizer(cfg)
        assert "Fast" not in tokenizer.__class__.__name__

    @enable_hf_offline
    def test_special_tokens_modules_to_save(self):
        # setting special_tokens to new token
        cfg = DictDefault(
@@ -68,6 +73,7 @@ class TestTokenizers:
        )
        load_tokenizer(cfg)

    @enable_hf_offline
    def test_add_additional_special_tokens(self):
        cfg = DictDefault(
            {
@@ -83,6 +89,7 @@ class TestTokenizers:
        tokenizer = load_tokenizer(cfg)
        assert len(tokenizer) == 32001

    @enable_hf_offline
    def test_added_tokens_overrides(self, temp_dir):
        cfg = DictDefault(
            {
@@ -104,11 +111,12 @@ class TestTokenizers:
            128042
        ]

    @enable_hf_offline
    def test_added_tokens_overrides_with_toolargeid(self, temp_dir):
        cfg = DictDefault(
            {
                # use with tokenizer that has reserved_tokens in added_tokens
                "tokenizer_config": "NousResearch/Llama-3.2-1B",
                "tokenizer_config": "HuggingFaceTB/SmolLM2-135M",
                "added_tokens_overrides": {1000000: "BROKEN_RANDOM_OVERRIDE_1"},
                "output_dir": temp_dir,
            }

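The fast/slow behaviour asserted above ("Fast" in the tokenizer class name) mirrors the underlying transformers `use_fast` flag. A standalone illustration independent of axolotl's `load_tokenizer`:

```python
from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer is the default; class name ends in "...TokenizerFast"
fast_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
assert "Fast" in fast_tok.__class__.__name__

# Opting out falls back to the slow, pure-Python implementation
slow_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
assert "Fast" not in slow_tok.__class__.__name__
```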
@@ -321,3 +321,48 @@ class TestValidationCheckDatasetConfig(BaseValidation):
        )

        validate_config(cfg)


class TestOptimizerValidation(BaseValidation):
    """
    Test muon optimizer validation
    """

    def test_muon_deepspeed(self, minimal_cfg):
        cfg = DictDefault(
            minimal_cfg
            | {
                "datasets": [
                    {
                        "path": "mhenrichsen/alpaca_2k_test",
                        "type": "alpaca",
                    }
                ],
                "optimizer": "muon",
                "deepspeed": "deepspeed_configs/zero3.json",
            }
        )

        with pytest.raises(ValueError, match=r".*is currently incompatible with*"):
            validate_config(cfg)

    def test_muon_fsdp(self, minimal_cfg):
        cfg = DictDefault(
            minimal_cfg
            | {
                "datasets": [
                    {
                        "path": "mhenrichsen/alpaca_2k_test",
                        "type": "alpaca",
                    }
                ],
                "optimizer": "muon",
                "fsdp": ["full_shard"],
                "fsdp_config": {
                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
                },
            }
        )

        with pytest.raises(ValueError, match=r".*is currently incompatible with*"):
            validate_config(cfg)
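These two tests only assert that `validate_config` raises when the muon optimizer is combined with DeepSpeed or FSDP. A rough sketch of the kind of guard implied by those tests (not axolotl's actual validator code; the message simply echoes the asserted match pattern):

```python
def check_optimizer_compatibility(cfg: dict) -> None:
    """Hypothetical sketch of the check implied by the muon tests above."""
    if cfg.get("optimizer") == "muon" and (cfg.get("deepspeed") or cfg.get("fsdp")):
        raise ValueError(
            "muon optimizer is currently incompatible with DeepSpeed/FSDP configurations"
        )
```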