Compare commits

..

5 Commits

Author SHA1 Message Date
Dan Saunders
09725be990 add support for CP + torch SDPA 2025-09-25 12:03:43 -04:00
Dan Saunders
f9bd6936c1 Merge branch 'main' into cp-fix 2025-09-24 14:01:23 -04:00
Dan Saunders
b9a3bfee5a only patch in CP > 1 case 2025-09-24 13:36:14 -04:00
Dan Saunders
08124a7c92 nits 2025-09-24 13:25:46 -04:00
Dan Saunders
56e0a77e0d patch transformers to allow CP + FA2 2025-09-24 13:08:38 -04:00
25 changed files with 346 additions and 543 deletions

View File

@@ -267,7 +267,6 @@ website:
- docs/dataset_loading.qmd
- docs/qat.qmd
- docs/quantize.qmd
- docs/optimizations.qmd
- section: "Core Concepts"
contents:

View File

@@ -212,14 +212,6 @@ Instead of passing `tools` via the system prompt, an alternative method would be
Tools need to follow [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
:::
::: {.callout-warning}
If you have tool arguments with same name but different dtypes (like `"time": string` and `"time": number`), please save `arguments: ` as JSON string to prevent `datasets` from having casting issues.
```
"arguments": "{\"...\": \"...\"}"
```
:::
Example config for Llama4:
```yaml
chat_template: llama4

View File

@@ -61,7 +61,7 @@ While we recommend `.jsonl`, you can also use the other formats (`csv`, `parquet
### Pre-training without streaming
In the case that the dataset is small and can be loaded entirely into memory, another approach to running pre-training is to use the `completion` format. This would mean that the entire dataset is pre-tokenized instead of on-demand in streaming.
On the rare case that the dataset is small and can be loaded entirely into memory, another approach to running pre-training is to use the `completion` format. This would mean that the entire dataset is pre-tokenized instead of on-demand in streaming.
One benefit of this is that the tokenization can be performed separately on a CPU-only machine, and then transferred to a GPU machine for training to save costs.

View File

@@ -140,7 +140,3 @@ description: Frequently asked questions
**Q: `ValueError("Backward pass should have cleared tracker of all tensors")`
> A: This may happen due to edge cases in using the modern OffloadActivations context manager for CUDA streams. If you encounter this error, you may have success using the naive implementation with `offload_activations: legacy` in your YAML.
**Q: `Error parsing tool_calls arguments as JSON.`
> A: There is an error parsing string arguments to a dict. Please check your dataset and the error message for more details.

View File

@@ -5,11 +5,10 @@ description: "Custom autograd functions and Triton kernels in Axolotl for optimi
Inspired by [Unsloth](https://github.com/unslothai/unsloth), we've implemented two
optimizations for LoRA and QLoRA fine-tuning, supporting both single GPU and multi-GPU
(including DDP, DeepSpeed, and FSDP2) training. These include (1) SwiGLU and GEGLU
activation function Triton kernels, and (2) LoRA MLP and attention custom autograd
functions. Our goal was to leverage operator fusion and tensor re-use in order to
improve speed and reduce memory usage during the forward and backward passes of these
calculations.
(in the DDP and DeepSpeed settings) training. These include (1) SwiGLU and GEGLU activation function
Triton kernels, and (2) LoRA MLP and attention custom autograd functions. Our goal was
to leverage operator fusion and tensor re-use in order to improve speed and reduce
memory usage during the forward and backward passes of these calculations.
We currently support several common model architectures, including (but not limited to):
@@ -93,12 +92,13 @@ Currently, LoRA kernels are not supported for RLHF training, only SFT.
- One or more NVIDIA or AMD GPUs (in order to use the Triton kernels)
- Note: Set `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` to enable [memory-efficient attention on AMD GPUs](https://github.com/ROCm/aotriton/issues/16#issuecomment-2346675491)
- Targeted LoRA adapters must disable dropout (`lora_dropout: 0`)
- Targeted LoRA adapters cannot use Dropout
- This may limit model expressivity / cause overfitting
- Targeted LoRA adapters cannot have bias terms
- This may limit model expressivity
- Adapters that already include bias terms are supported.
Models with pre-existing LoRA adapters that use Dropout may need to be re-finetuned
without it in order to be as performant.
Models with pre-existing LoRA adapters that use Dropout or have bias terms may need to
be re-finetuned without these features in order to be useful.
## Implementation details
@@ -131,5 +131,6 @@ computation path.
## Future Work
- Support for additional model architectures
- Support for dropout
- Support for the FSDP setting
- Support for dropout and bias
- Additional operator fusions

View File

@@ -1,133 +0,0 @@
---
title: Optimizations Guide
description: A guide to the performance and memory optimizations available in Axolotl.
---
Axolotl includes numerous optimizations to speed up training, reduce memory usage, and handle large models.
This guide provides a high-level overview and directs you to the detailed documentation for each feature.
## Speed Optimizations
These optimizations focus on increasing training throughput and reducing total training time.
### Sample Packing
Improves GPU utilization by combining multiple short sequences into a single packed sequence for training. This requires enabling one of the [attention](#attention-implementations) implementations below.
- **Config:** `sample_packing: true`
- **Learn more:** [Sample Packing](multipack.qmd)
### Attention Implementations
Using an optimized attention implementation is critical for training speed.
- **[Flash Attention 2](https://github.com/Dao-AILab/flash-attention)**: `flash_attention: true`. **(Recommended)** The industry standard for fast attention on modern GPUs. Requires Ampere or higher. For AMD, check [AMD Support](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#amd-rocm-support).
- **[Flex Attention](https://pytorch.org/blog/flexattention/)**: `flex_attention: true`.
- **[SDP Attention](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)**: `sdp_attention: true`. PyTorch's native implementation.
- **[Xformers](https://github.com/facebookresearch/xformers)**: `xformers_attention: true`. Works with FP16.
*Note: You should only enable one attention backend.*
### LoRA Optimizations
Leverages optimized kernels to accelerate LoRA training and reduce memory usage.
- **Learn more:** [LoRA Optimizations Documentation](lora_optims.qmd)
## Memory Optimizations
These techniques help you fit larger models or use bigger batch sizes on your existing hardware.
### Parameter Efficient Finetuning (LoRA & QLoRA)
Drastically reduces memory by training a small set of "adapter" parameters instead of the full model. This is the most common and effective memory-saving technique.
- Examples: Find configs with `lora` or `qlora` in the [examples directory](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/llama-3).
- Config Reference: See `adapter`, `load_in_4bit`, and `load_in_8bit` in the [Configuration Reference](config-reference.qmd).
### Gradient Checkpointing & Activation Offloading
These techniques save VRAM by changing how activations are handled.
- Gradient Checkpointing: re-computes activations during the backward pass, trading compute time for VRAM.
- Activation Offloading: moves activations to CPU RAM or disk, trading I/O overhead for VRAM.
- Learn more: [Gradient Checkpointing and Offloading Docs](gradient_checkpointing.qmd)
### Cut Cross Entropy (CCE)
Reduces VRAM usage by using an optimized cross-entropy loss calculation.
- **Learn more:** [Custom Integrations - CCE](custom_integrations.qmd#cut-cross-entropy)
### Liger Kernels
Provides efficient Triton kernels to improve training speed and reduce memory usage.
- **Learn more:** [Custom Integrations - Liger Kernels](custom_integrations.qmd#liger-kernels)
## Long Context Models
Techniques to train models on sequences longer than their original context window.
### RoPE Scaling
Extends a model's context window by interpolating its Rotary Position Embeddings.
- **Config:** Pass the `rope_scaling` config under the `overrides_of_model_config: `. To learn how to set RoPE, check the respective model config.
### Sequence Parallelism
Splits long sequences across multiple GPUs, enabling training with sequence lengths that would not fit on a single device.
- **Learn more:** [Sequence Parallelism Documentation](sequence_parallelism.qmd)
### Artic Long Sequence Training (ALST)
ALST is a recipe that combines several techniques to train long-context models efficiently. It typically involves:
- TiledMLP to reduce memory usage in MLP layers.
- Tiled Loss functions (like [CCE](#cut-cross-entropy-(cce) or [Liger](#liger-kernels)).
- Activation Offloading to CPU.
- Example: [ALST Example Configuration](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/alst)
## Large Models (Distributed Training)
To train models that don't fit on a single GPU, you'll need to use a distributed training strategy like FSDP or DeepSpeed. These frameworks shard the model weights, gradients, and optimizer states across multiple GPUs and nodes.
- **Learn more:** [Multi-GPU Guide](multi-gpu.qmd)
- **Learn more:** [Multi-Node Guide](multi-node.qmd)
### N-D Parallelism (Beta)
For advanced scaling, Axolotl allows you to compose different parallelism techniques (e.g., Data, Tensor, Sequence Parallelism). This is a powerful approach to train an extremely large model by overcoming multiple bottlenecks at once.
- **Learn more:** [N-D Parallelism Guide](nd_parallelism.qmd)
## Quantization
Techniques to reduce the precision of model weights for memory savings.
### 4-bit Training (QLoRA)
The recommended approach for quantization-based training. It loads the base model in 4-bit using `bitsandbytes` and then trains QLoRA adapters. See [Adapter Finetuning](#adapter-finetuning-lora-qlora) for details.
### FP8 Training
Enables training with 8-bit floating point precision on supported hardware (e.g., NVIDIA Hopper series GPUs) for significant speed and memory gains.
- **Example:** [Llama 3 FP8 FSDP Example](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-3/3b-fp8-fsdp2.yaml)
### Quantization Aware Training (QAT)
Simulates quantization effects during training, helping the model adapt and potentially improving the final accuracy of the quantized model.
- **Learn more:** [QAT Documentation](qat.qmd)
### GPTQ
Allows you to finetune LoRA adapters on top of a model that has already been quantized using the GPTQ method.
- **Example:** [GPTQ LoRA Example](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-2/gptq-lora.yml)

View File

@@ -30,7 +30,6 @@ qat:
```
We support the following quantization schemas:
- `Int4WeightOnly` (requires the `fbgemm-gpu` extra when installing Axolotl)
- `Int8DynamicActivationInt4Weight`
- `Float8DynamicActivationFloat8Weight`

View File

@@ -7,24 +7,3 @@ techniques. It is a combination of:
- Activation Offloading: Offload activations to CPU RAM to reduce memory usage
For more information, you can check out the ALST paper [here](https://www.arxiv.org/abs/2506.13996).
## Usage
```yaml
tiled_mlp: true
# See Sequence Parallelism docs
# https://docs.axolotl.ai/docs/sequence_parallelism.html
context_parallel_size: int
plugins:
# See Cut Cross Entropy docs
# https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
# or Liger Kernel docs
# https://docs.axolotl.ai/docs/custom_integrations.html#liger-kernels
- axolotl.integrations.liger.LigerPlugin
# ...
```

View File

@@ -38,7 +38,7 @@ pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.3.2
axolotl train examples/qwen3-next/qwen3-next-80b-a3b-qlora.yaml
```
This config uses about 45.62 GiB VRAM.
This config uses about 41.7 GiB VRAM.
Let us know how it goes. Happy finetuning! 🚀

View File

@@ -27,14 +27,6 @@ lora_r: 16
lora_alpha: 8
lora_dropout: 0.05
lora_target_modules:
- linear_attn.in_proj_ba
- linear_attn.in_proj_qkvz
- linear_attn.out_proj
- shared_expert.up_proj
- shared_expert.down_proj
- shared_expert.gate_proj
- shared_expert_gate
- mlp.gate
- q_proj
- v_proj
- k_proj

View File

@@ -84,7 +84,9 @@ class PatchManager:
patch_evaluation_loop()
patch_maybe_log_save_evaluate()
if self.cfg.context_parallel_size > 1:
if self.cfg.context_parallel_size > 1 and getattr(
self.cfg, "flash_attention", False
):
from axolotl.monkeypatch.transformers.trainer_context_parallel import (
patch_prepare_context_parallel_inputs,
)

View File

@@ -323,8 +323,8 @@ def apply_lora_kernel_patches(
AssertionError: If multiple adapters are active (currently unsupported).
Note:
The optimizations require LoRA adapters with no dropout. The function will skip
patching if that condition isn't met.
The optimizations require LoRA adapters with no dropout and no bias terms. The
function will skip patching if these conditions aren't met.
"""
if not isinstance(model, PeftModelForCausalLM):
raise TypeError("Model must be a PeftModelForCausalLM")
@@ -340,10 +340,10 @@ def apply_lora_kernel_patches(
lora_config = model.model.peft_config[active_adapter]
# Only patch if conditions are met
can_patch = lora_config.lora_dropout == 0
can_patch = lora_config.lora_dropout == 0 and lora_config.bias == "none"
if not can_patch:
LOG.warning("Cannot patch layers - requires `lora_dropout: 0`")
LOG.warning("Cannot patch layers - requires no dropout and no bias")
LOG.warning("Please specify `lora_dropout: 0` in your axolotl config file")
return model

View File

@@ -13,21 +13,10 @@ from typing import Callable
import torch
import torch.distributed as dist
import transformers
import transformers.modeling_flash_attention_utils
import transformers.modeling_flash_attention_utils as flash_utils
from ring_flash_attn import ring_flash_attn_func
from ring_flash_attn.adapters.hf_adapter import check_params
from transformers.modeling_flash_attention_utils import is_flash_attn_greater_or_equal
try:
from transformers.modeling_flash_attention_utils import _flash_supports_window
except ImportError:
try:
from transformers.modeling_flash_attention_utils import (
_flash_supports_window_size as _flash_supports_window,
)
except ImportError:
_flash_supports_window = True
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
from axolotl.utils.schemas.enums import RingAttnFunc
@@ -118,7 +107,7 @@ def create_flash_attn_forward_varlen_llama3(
# Handle sliding window
use_sliding_windows = (
_flash_supports_window
_flash_windows_supported()
and sliding_window is not None
and key_states.shape[1] > sliding_window
)
@@ -194,3 +183,18 @@ def substitute_hf_flash_attn(
from ring_flash_attn.adapters.hf_adapter import flash_attention_forward
ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = flash_attention_forward
def _flash_windows_supported() -> bool:
"""Return whether current transformers build advertises sliding-window support."""
support = getattr(flash_utils, "_flash_supports_window", None)
if support is None:
support = getattr(flash_utils, "_flash_supports_window_size", None)
if support is None:
return True
if callable(support):
return True
return bool(support)

View File

@@ -13,18 +13,9 @@ from typing import Optional
import torch
import torch.distributed as dist
import transformers.modeling_flash_attention_utils as flash_utils
from torch.distributed import DeviceMesh
try:
from transformers.modeling_flash_attention_utils import _flash_supports_window
except ImportError:
try:
from transformers.modeling_flash_attention_utils import (
_flash_supports_window_size as _flash_supports_window,
)
except ImportError:
_flash_supports_window = True
from axolotl.monkeypatch.utils import get_cu_seqlens_from_pos_ids
from axolotl.utils.logging import get_logger
from axolotl.utils.schemas.enums import RingAttnFunc
@@ -83,7 +74,7 @@ def create_ring_flash_attention_forward(
# Assuming 4D tensors, key_states.shape[1] is the key/value sequence length (source length).
use_sliding_windows = (
_flash_supports_window
_flash_windows_supported()
and sliding_window is not None
and key_states.shape[1] > sliding_window
)
@@ -225,3 +216,19 @@ def update_ring_attn_params(position_ids: torch.Tensor | None):
cu_seqlens, _ = get_cu_seqlens_from_pos_ids(position_ids)
cu_seqlens = cu_seqlens.squeeze().to(device=torch.cuda.current_device())
update_ring_flash_attn_params(cu_seqlens, get_ring_attn_group())
def _flash_windows_supported() -> bool:
"""Best-effort check for FlashAttention sliding-window support."""
support = getattr(flash_utils, "_flash_supports_window", None)
if support is None:
support = getattr(flash_utils, "_flash_supports_window_size", None)
if support is None:
return True
if callable(support):
# Signature differs across versions; assume support when callable.
return True
return bool(support)

View File

@@ -2,7 +2,6 @@
HF Chat Templates prompt strategy
"""
import json
from collections import defaultdict
from typing import TYPE_CHECKING, Any, Dict, List, Set, Union
@@ -795,22 +794,6 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
if val is not None:
transformed_message[key] = val
if "tool_calls" in transformed_message and transformed_message["tool_calls"]:
for tool_call in transformed_message["tool_calls"]:
if "function" in tool_call and "arguments" in tool_call["function"]:
args = tool_call["function"]["arguments"]
if isinstance(args, str):
try:
tool_call["function"]["arguments"] = json.loads(args)
except json.JSONDecodeError as e:
LOG.error(
f"Error parsing tool_calls arguments as JSON. "
f"Function: {tool_call.get('function', {}).get('name', 'unknown')}, "
f"Arguments string: {args!r}, "
f"Error: {e}"
)
raise
return transformed_message
def _get_images(self, prompt):

View File

@@ -179,7 +179,11 @@ def execute_training(
)
)
if cfg.context_parallel_size > 1:
use_flash_cp = cfg.context_parallel_size > 1 and bool(
getattr(cfg, "flash_attention", False)
)
if use_flash_cp:
models = [trainer.model]
if hasattr(trainer, "ref_model") and trainer.ref_model:
models.append(trainer.ref_model)

View File

@@ -1,7 +1,6 @@
"""Module with validation methods for config pydantic model."""
import json
import sys
import tempfile
from pathlib import Path
@@ -1314,50 +1313,40 @@ class ComplexValidationMixin:
if not self.context_parallel_size:
self.context_parallel_size = 1
elif self.context_parallel_size > 1:
if not self.flash_attention:
use_flash_attention = getattr(self, "flash_attention", False)
use_sdp_attention = getattr(self, "sdp_attention", False)
if not (use_flash_attention or use_sdp_attention):
raise ValueError(
"flash_attention: true must be set with context_parallel_size > 1"
"context_parallel_size > 1 requires either flash_attention: true "
"or sdp_attention: true"
)
if self.sample_packing and self.micro_batch_size > 1:
raise ValueError(
"micro_batch_size must be set to 1 when sample_packing is enabled "
"due to a `ring-flash-attn` requirement"
if use_flash_attention:
if self.sample_packing and self.micro_batch_size > 1:
raise ValueError(
"micro_batch_size must be set to 1 when sample_packing is enabled "
"due to a `ring-flash-attn` requirement"
)
try:
import ring_flash_attn # noqa: F401 # Required after monkey-patching
except ImportError as exception:
raise ImportError(
"context_parallel_size > 1 but ring_flash_attn is not installed. "
"Please install it with `pip install axolotl[ring-flash-attn] "
"or `pip install ring-flash-attn>=0.1.4`."
) from exception
LOG.warning(
"Sequence parallelism (SP) is enabled with "
f"context_parallel_size={self.context_parallel_size}. "
"Please note that logged losses may differ slightly to the non-SP "
"losses due to transformers Trainer implementation details. "
"Please see https://github.com/axolotl-ai-cloud/axolotl/pull/2495#issuecomment-2784022042 "
"for more details."
)
try:
import transformers.modeling_flash_attention_utils
from transformers.utils import is_flash_attn_greater_or_equal
transformers.modeling_flash_attention_utils._flash_supports_window = (
True
)
sys.modules[
"transformers.modeling_flash_attention_utils"
]._flash_supports_window = True
sys.modules[
"transformers.modeling_flash_attention_utils"
]._flash_supports_window_size = True
sys.modules[
"transformers.modeling_flash_attention_utils"
].is_flash_attn_greater_or_equal = is_flash_attn_greater_or_equal
import ring_flash_attn # noqa: F401 # Required after monkey-patching
except ImportError as exception:
raise ImportError(
"context_parallel_size > 1 but ring_flash_attn is not installed. "
"Please install it with `pip install axolotl[ring-flash-attn] "
"or `pip install ring-flash-attn>=0.1.4`."
) from exception
LOG.warning(
"Sequence parallelism (SP) is enabled with "
f"context_parallel_size={self.context_parallel_size}. "
"Please note that logged losses may differ slightly to the non-SP "
"losses due to transformers Trainer implementation details. "
"Please see https://github.com/axolotl-ai-cloud/axolotl/pull/2495#issuecomment-2784022042 "
"for more details."
)
return self
@model_validator(mode="after")

View File

@@ -23,6 +23,8 @@ class TestSequenceParallelism:
pad_to_sequence_len=True,
ring_attn_func=None,
threshold=2.0,
flash_attention=True,
sdp_attention=False,
):
"""Helper method to run sequence parallel tests with different configurations"""
cfg = DictDefault(
@@ -58,7 +60,8 @@ class TestSequenceParallelism:
"learning_rate": 0.00001,
"optimizer": "adamw_8bit",
"lr_scheduler": "cosine",
"flash_attention": True,
"flash_attention": flash_attention,
"sdp_attention": sdp_attention,
"loss_watchdog_threshold": 5.0,
"loss_watchdog_patience": 3,
"bf16": "auto",
@@ -132,3 +135,16 @@ class TestSequenceParallelism:
ring_attn_func=ring_attn_func,
threshold=threshold,
)
def test_sequence_parallel_training_sdpa(self, temp_dir):
"""Smoke test for SDPA-based context parallelism."""
self._run_sequence_parallel_test(
temp_dir,
sample_packing=False,
micro_batch_size=1,
pad_to_sequence_len=True,
ring_attn_func=None,
threshold=3.0,
flash_attention=False,
sdp_attention=True,
)

View File

@@ -221,53 +221,44 @@ def test_model_specific_activation(model_name, expected_activation):
assert layer.mlp.forward.__func__ is expected_activation
def test_kernel_patch_requires_zero_dropout():
"""Kernel patching should be skipped when dropout is enabled."""
config = {
"peft_type": "LORA",
"task_type": "CAUSAL_LM",
"r": 8,
"lora_alpha": 16,
"target_modules": ["gate_proj", "up_proj", "down_proj"],
"lora_dropout": 0.1,
"bias": "none",
}
def test_kernel_patch_conditions():
"""Test various conditions that should prevent kernel patching."""
test_configs = [
# Dropout prevents patching
{
"peft_type": "LORA",
"task_type": "CAUSAL_LM",
"r": 8,
"lora_alpha": 16,
"target_modules": ["gate_proj", "up_proj", "down_proj"],
"lora_dropout": 0.1,
"bias": "none",
},
# Bias prevents patching
{
"peft_type": "LORA",
"task_type": "CAUSAL_LM",
"r": 8,
"lora_alpha": 16,
"target_modules": ["gate_proj", "up_proj", "down_proj"],
"lora_dropout": 0,
"bias": "lora_only",
},
]
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
peft_config = get_peft_config(config)
model = PeftModelForCausalLM(model, peft_config)
cfg = DictDefault({"lora_mlp_kernel": True})
for config in test_configs:
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
peft_config = get_peft_config(config)
model = PeftModelForCausalLM(model, peft_config)
cfg = DictDefault({"lora_mlp_kernel": True})
patched_model = apply_lora_kernel_patches(model, cfg)
layer = patched_model.model.model.layers[0].mlp
# Should not patch
patched_model = apply_lora_kernel_patches(model, cfg)
layer = patched_model.model.model.layers[0].mlp
# Verify no patches applied when dropout is non-zero
assert layer.forward.__func__ is not apply_lora_mlp_swiglu
assert layer.forward.__func__ is not apply_lora_mlp_geglu
def test_kernel_patch_with_bias_enabled():
"""Kernel patching should succeed when LoRA bias is enabled."""
config = {
"peft_type": "LORA",
"task_type": "CAUSAL_LM",
"r": 8,
"lora_alpha": 16,
"target_modules": ["gate_proj", "up_proj", "down_proj"],
"lora_dropout": 0,
"bias": "lora_only",
}
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
peft_config = get_peft_config(config)
model = PeftModelForCausalLM(model, peft_config)
cfg = DictDefault({"lora_mlp_kernel": True})
patched_model = apply_lora_kernel_patches(model, cfg)
layer = patched_model.model.model.layers[0].mlp
# Verify patches applied when bias support is enabled
assert layer.forward.__func__ is apply_lora_mlp_swiglu
# Verify no patches applied
assert layer.forward.__func__ is not apply_lora_mlp_swiglu
assert layer.forward.__func__ is not apply_lora_mlp_geglu
def test_kernel_config_options():

View File

@@ -0,0 +1,74 @@
"""Tests for PatchManager context parallel patch selection."""
import addict
from axolotl.loaders.patch_manager import PatchManager
from axolotl.utils.dict import DictDefault
def _stub_transformers_patches(monkeypatch):
"""Replace trainer loss patchers with no-ops for isolation."""
monkeypatch.setattr(
"axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop",
lambda: None,
)
monkeypatch.setattr(
"axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate",
lambda: None,
)
def test_patch_manager_applies_flash_cp_patch(monkeypatch):
"""When flash attention is enabled, we patch Trainer for CP."""
_stub_transformers_patches(monkeypatch)
patch_calls = {"count": 0}
def stub_patch():
patch_calls["count"] += 1
monkeypatch.setattr(
"axolotl.monkeypatch.transformers.trainer_context_parallel.patch_prepare_context_parallel_inputs",
stub_patch,
)
cfg = DictDefault(
{
"context_parallel_size": 2,
"flash_attention": True,
"sdp_attention": False,
}
)
manager = PatchManager(cfg, addict.Dict())
manager._apply_transformers_patches()
assert patch_calls["count"] == 1
def test_patch_manager_skips_flash_patch_for_sdpa(monkeypatch):
"""When only SDPA is requested, we should not patch Trainer."""
_stub_transformers_patches(monkeypatch)
patch_calls = {"count": 0}
def stub_patch():
patch_calls["count"] += 1
monkeypatch.setattr(
"axolotl.monkeypatch.transformers.trainer_context_parallel.patch_prepare_context_parallel_inputs",
stub_patch,
)
cfg = DictDefault(
{
"context_parallel_size": 2,
"flash_attention": False,
"sdp_attention": True,
}
)
manager = PatchManager(cfg, addict.Dict())
manager._apply_transformers_patches()
assert patch_calls["count"] == 0

View File

@@ -177,15 +177,6 @@ def fixture_devstral_1_1_tokenizer():
return tokenizer
@pytest.fixture(name="qwen3_tokenizer")
def qwen3_tokenizer_fixture(
download_qwen3_half_billion_model,
): # pylint: disable=unused-argument,redefined-outer-name
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
return tokenizer
@pytest.fixture(name="mistralv03_tokenizer_chat_template_jinja")
def fixture_mistralv03_chat_template_jinja_w_system() -> str:
return '{%- if messages[0]["role"] == "system" %}\n {%- set system_message = messages[0]["content"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}\n {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message["role"] == "user" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- "[AVAILABLE_TOOLS] [" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- \'{"type": "function", "function": {\' }}\n {%- for key, val in tool.items() if key != "return" %}\n {%- if val is string %}\n {{- \'"\' + key + \'": "\' + val + \'"\' }}\n {%- else %}\n {{- \'"\' + key + \'": \' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- ", " }}\n {%- endif %}\n {%- endfor %}\n {{- "}}" }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" }}\n {%- endif %}\n {%- endfor %}\n {{- "[/AVAILABLE_TOOLS]" }}\n {%- endif %}\n {%- if loop.first and system_message is defined %}\n {{- "[INST] " + system_message + "\\n\\n" + message["content"] + "[/INST]" }}\n {%- else %}\n {{- "[INST] " + message["content"] + "[/INST]" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- "[TOOL_CALLS] [" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \', "id": "\' + tool_call.id + \'"}\' }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- else %}\n {{- "]" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message["role"] == "assistant" %}\n {{- " " + message["content"]|trim + eos_token}}\n {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- \'[TOOL_RESULTS] {"content": \' + content|string + ", " }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n {%- endif %}\n {{- \'"call_id": "\' + message.tool_call_id + \'"}[/TOOL_RESULTS]\' }}\n {%- else %}\n {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}\n {%- endif %}\n{%- endfor %}\n'

View File

@@ -6,6 +6,7 @@ import json
import pytest
from datasets import Dataset
from transformers import AutoTokenizer
from axolotl.prompt_strategies.chat_template import StrategyLoader
from axolotl.utils.dict import DictDefault
@@ -22,6 +23,15 @@ def fixture_messages_w_tools():
return Dataset.from_list(rows)
@pytest.fixture(name="qwen3_tokenizer")
def qwen3_tokenizer_fixture(
download_qwen3_half_billion_model,
):
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
return tokenizer
@pytest.fixture(name="qwen3_prompt_strategy")
def qwen3_chat_template_strategy(qwen3_tokenizer):
cfg = DictDefault(

View File

@@ -4,6 +4,7 @@ Tests for splitting reasoning/thinking from content into separate field
import pytest
from datasets import Dataset
from transformers import AutoTokenizer
from axolotl.prompt_strategies.chat_template import (
load,
@@ -55,6 +56,15 @@ def messages_w_reasoning_fixture():
)
@pytest.fixture(name="qwen3_tokenizer")
def qwen3_tokenizer_fixture(
download_qwen3_half_billion_model,
):
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
return tokenizer
class TestSplitThinking:
"""
test class to make sure datasets with reasoning content conforms to the chat_template strategy

View File

@@ -1,214 +0,0 @@
"""
Tests for handling json tool content
"""
import json
import pytest
from datasets import Dataset
from axolotl.prompt_strategies.chat_template import (
load,
)
from axolotl.utils.dict import DictDefault
@pytest.fixture(name="qwen3_instruct_prompt_strategy")
def qwen3_instruct_chat_template_strategy(qwen3_tokenizer):
strategy = load(
qwen3_tokenizer,
DictDefault(
{
"train_on_inputs": False,
"sequence_len": 512,
}
),
DictDefault(
{
"chat_template": "qwen3",
"message_field_role": "role",
"message_field_content": "content",
"message_property_mappings": {
"role": "role",
"content": "content",
},
"roles": {
"user": ["user"],
"assistant": ["assistant"],
"system": ["system"],
},
"field_messages": "messages",
}
),
)
return strategy
class TestQwen3IdenticalConversationArgs:
"""
Test Qwen3 tools is identical between JSON and dict
"""
@pytest.fixture(name="conversation_dict_args_dataset")
def fixture_conversation_dict_args_dataset(self):
"""
Provides a dataset with conversation where arguments is a dict.
"""
user_content = "What is the weather in Boston?"
function_name = "get_current_weather"
arguments_dict = {"location": "Boston, MA", "unit": "celsius"}
data = [
{
"messages": [
{"role": "user", "content": user_content},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": function_name,
"arguments": arguments_dict, # dict格式
}
}
],
},
],
}
]
return Dataset.from_list(data)
@pytest.fixture(name="conversation_str_args_dataset")
def fixture_conversation_str_args_dataset(self):
"""
Provides a dataset with conversation where arguments is a JSON string.
"""
user_content = "What is the weather in Boston?"
function_name = "get_current_weather"
arguments_dict = {"location": "Boston, MA", "unit": "celsius"}
arguments_str = json.dumps(arguments_dict)
data = [
{
"messages": [
{"role": "user", "content": user_content},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": function_name,
"arguments": arguments_str, # str格式
}
}
],
},
],
}
]
return Dataset.from_list(data)
@pytest.fixture(name="conversation_mixed_time_types_dataset")
def fixture_conversation_mixed_time_types_dataset(self):
"""
Provides a dataset where 'time' field has different types in different tool calls.
"""
data = [
{
"messages": [
{
"role": "user",
"content": "Get weather information at different times",
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "func1",
"arguments": json.dumps(
{"time": "2025-08-01"}
), # string type
}
},
{
"function": {
"name": "func2",
"arguments": json.dumps(
{"time": 1690876800}
), # number type
}
},
],
},
],
}
]
return Dataset.from_list(data)
def test_dict_and_str_args_produce_identical_output(
self,
conversation_dict_args_dataset,
conversation_str_args_dataset,
qwen3_instruct_prompt_strategy,
qwen3_tokenizer,
):
"""
Tests that after tokenization and decoding, the outputs for both
dict and string `arguments` are exactly the same.
"""
processed_dict_args = conversation_dict_args_dataset.map(
qwen3_instruct_prompt_strategy.tokenize_prompt,
batched=True,
remove_columns=["messages"],
)
processed_str_args = conversation_str_args_dataset.map(
qwen3_instruct_prompt_strategy.tokenize_prompt,
batched=True,
remove_columns=["messages"],
)
decoded_prompt_from_dict = qwen3_tokenizer.decode(
processed_dict_args[0]["input_ids"]
)
decoded_prompt_from_str = qwen3_tokenizer.decode(
processed_str_args[0]["input_ids"]
)
assert decoded_prompt_from_dict == decoded_prompt_from_str, (
f"Dict format output:\n{decoded_prompt_from_dict}\n"
f"String format output:\n{decoded_prompt_from_str}"
)
assert (
processed_dict_args[0]["input_ids"] == processed_str_args[0]["input_ids"]
), "The tokenized input_ids should be identical for dict and str arguments"
def test_str_args_with_mixed_time_types_no_error(
self,
conversation_mixed_time_types_dataset,
qwen3_instruct_prompt_strategy,
qwen3_tokenizer,
):
"""
Tests that when 'time' field has different types (string vs number)
in different tool calls, str format arguments don't cause errors.
"""
processed = conversation_mixed_time_types_dataset.map(
qwen3_instruct_prompt_strategy.tokenize_prompt,
batched=True,
remove_columns=["messages"],
)
assert len(processed) == 1
assert "input_ids" in processed[0]
assert len(processed[0]["input_ids"]) > 0
decoded = qwen3_tokenizer.decode(processed[0]["input_ids"])
assert "2025-08-01" in decoded, "String time value should be present"
assert "1690876800" in decoded, "Number time value should be present"

View File

@@ -0,0 +1,111 @@
"""Unit tests for choosing the correct context parallel implementation."""
from types import SimpleNamespace
from axolotl.train import execute_training
from axolotl.utils.dict import DictDefault
class DummyTrainer:
"""Minimal trainer stub to exercise execute_training."""
def __init__(self):
self.model = object()
self.ref_model = None
self.accelerator = SimpleNamespace(torch_device_mesh=None)
self.train_called = False
def train(self, resume_from_checkpoint=None): # pylint: disable=unused-argument
self.train_called = True
class DummyPluginManager:
"""Minimal plugin manager stub."""
@staticmethod
def post_train(cfg, model): # pylint: disable=unused-argument
return None
class DummyContext:
"""Test context manager that records entries/exits."""
def __init__(self, recorder, **kwargs):
recorder.append({"kwargs": kwargs})
self.recorder = recorder
def __enter__(self):
self.recorder[-1]["entered"] = True
return self
def __exit__(self, exc_type, exc, tb): # pylint: disable=unused-argument
self.recorder[-1]["exited"] = True
return False
def _base_cfg(**overrides):
base = {
"context_parallel_size": 2,
"gradient_accumulation_steps": 1,
"ring_attn_func": None,
"heads_k_stride": None,
"rl": None,
"flash_optimum": False,
}
base.update(overrides)
return DictDefault(base)
def test_execute_training_uses_ring_when_flash(monkeypatch):
"""FlashAttention CP should engage the custom ring context manager."""
recorder: list[dict] = []
monkeypatch.setattr(
"axolotl.train.SequenceParallelContextManager",
lambda **kwargs: DummyContext(recorder, **kwargs),
)
monkeypatch.setattr(
"axolotl.train.PluginManager.get_instance",
lambda: DummyPluginManager(),
)
cfg = _base_cfg(flash_attention=True, sdp_attention=False)
trainer = DummyTrainer()
execute_training(cfg, trainer, resume_from_checkpoint=None)
assert trainer.train_called
assert len(recorder) == 1
assert recorder[0]["kwargs"]["context_parallel_size"] == 2
assert recorder[0].get("entered") is True
assert recorder[0].get("exited") is True
def test_execute_training_uses_transformers_cp_for_sdpa(monkeypatch):
"""SDPA CP should bypass the ring context manager."""
invoked = {"count": 0}
class NoOpContext:
def __enter__(self):
return self
def __exit__(self, exc_type, exc, tb): # pylint: disable=unused-argument
return False
monkeypatch.setattr(
"axolotl.train.SequenceParallelContextManager",
lambda **kwargs: invoked.__setitem__("count", invoked["count"] + 1)
or NoOpContext(),
)
monkeypatch.setattr(
"axolotl.train.PluginManager.get_instance",
lambda: DummyPluginManager(),
)
cfg = _base_cfg(flash_attention=False, sdp_attention=True)
trainer = DummyTrainer()
execute_training(cfg, trainer, resume_from_checkpoint=None)
assert trainer.train_called
assert invoked["count"] == 0