* Fix Axolotl ReLoRA optimizer reset scope
* fix: make relora reset method honor relora_prune_ratio
When relora_prune_method='reset' and relora_prune_ratio is explicitly
set, the ratio was silently ignored and replaced with the hardcoded
_FULL_RESET_RATIO (0.999). Fix by moving the default-ratio logic to
ReLoRACallback.on_step_begin: None maps to _FULL_RESET_RATIO for reset
and 0.9 for other methods. reset_optimizer now uses the same random
pruning path for both 'random' and 'reset'.
Also consolidate three-layer default mismatch: schema default for
relora_prune_method is now 'magnitude' (single canonical source);
dataclass defaults for both fields changed to None to eliminate the
conflicting fallback layer.
Tests updated: removed the test case that verified the old broken
behavior (reset ignoring ratio), added two cases proving reset honors
the passed ratio. E2E reset fixture now uses ratio=0.5 to make it
unambiguous that the ratio is honored.
* Fix ReLoRA uint8 pruning regression
---------
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* feat: systemic multimodal assistant-only loss masking + cfg.role_boundaries
Fixes silent ignoring of `cfg.train_on_inputs` / `cfg.roles_to_train` /
`cfg.train_on_eos` in the multimodal training path. Before this branch,
only Gemma 3n honored these knobs; every other VLM trained on the full
sequence regardless of config. Also adds `cfg.role_boundaries` YAML
override so users can declare per-role markers without subclassing.
What changed
------------
- `ProcessingStrategy` gains a declarative boundary scanner. Each
strategy declares per-role start/end markers via
`_build_role_boundaries`; the shared scanner honors
`train_on_inputs` / `roles_to_train` / `train_on_eos` (incl. "last").
- New per-template strategies: Gemma 4, Llama 3.2 Vision, Llama 4,
Pixtral, Mistral V7 Tekken.
- Refactored: Gemma 3 (previously no role masking), Gemma 3n
(previously ad-hoc scanner, now shared).
- Strategies whose boundary tokens couldn't be verified offline
(Voxtral, SmolVLM2, Mistral3, InternVL, GLM4V, llava/lfm2vl
fallback) retain legacy behavior and emit a one-shot warning. Users
can enable masking on them via `cfg.role_boundaries`.
- Pixtral / Mistral V7 Tekken correctly handle the shared `[/INST]`
token between user-end and assistant-start via `include_end=False`
+ scanner rewind.
See `docs/multimodal_assistant_mask.md` for the full audit table,
root-cause analysis, and design rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: systemic multimodal assistant-only loss masking + cfg.role_boundaries
Fixes silent ignoring of `cfg.train_on_inputs` / `cfg.roles_to_train` /
`cfg.train_on_eos` in the multimodal training path. Before this branch,
only Gemma 3n honored these knobs; every other VLM trained on the full
sequence regardless of config. Also adds `cfg.role_boundaries` YAML
override so users can declare per-role markers without subclassing.
What changed
------------
- `ProcessingStrategy` gains a declarative boundary scanner. Each
strategy declares per-role start/end markers via
`_build_role_boundaries`; the shared scanner honors
`train_on_inputs` / `roles_to_train` / `train_on_eos` (incl. "last").
- New per-template strategies: Gemma 4, Llama 3.2 Vision, Llama 4,
Pixtral, Mistral V7 Tekken.
- Refactored: Gemma 3 (previously no role masking), Gemma 3n
(previously ad-hoc scanner, now shared).
- Strategies whose boundary tokens couldn't be verified offline
(Voxtral, SmolVLM2, Mistral3, InternVL, GLM4V, llava/lfm2vl
fallback) retain legacy behavior and emit a one-shot warning. Users
can enable masking on them via `cfg.role_boundaries`.
- Pixtral / Mistral V7 Tekken correctly handle the shared `[/INST]`
token between user-end and assistant-start via `include_end=False`
+ scanner rewind.
See `docs/multimodal_assistant_mask.md` for the full audit table,
root-cause analysis, and design rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+types: address CodeRabbit nitpicks on PR #7
- builders/causal.py: add inline NOTE that multi-dataset configs reuse
the first dataset's masking knobs (roles_to_train / train_on_eos) for
all datasets — heterogeneous per-dataset overrides are not supported
in the MM path today.
- processing_strategies.py: annotate inner scanner helpers
_match_prefix and _find_end with explicit types (Tensor, int,
list[int] → bool / tuple[int, bool]) for readability.
- docs/multimodal_assistant_mask.md: renumber the "Commits on this
branch" list to 1-7 consecutive (previously skipped 3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mm-mask): address two CodeRabbit findings on PR #7
1. Schema rejected `train_on_eos: "none"` despite the scanner honoring it.
`_VALID_TRAIN_ON_EOS` accepts "none" and the design doc lists it, but
`SFTDataset.train_on_eos` was `Literal["all", "turn", "last"]`, so YAML
users hit a pydantic ValidationError at config load. Added "none" to
the Literal and updated the description.
2. `cfg.role_boundaries: []` had split-personality semantics: the strategy
ctor treated it as "replace built-ins with empty" while the collator
plumbing treated it as "unset", and both the design doc and the
MultiModalConfig schema help text promised wholesale replacement for
any set value. Aligned on opt-in semantics across all four surfaces —
a non-empty list replaces built-ins wholesale; unset or `[]` falls back
to built-ins. Rationale: honoring `[]` literally yields all-masked
labels and zero gradient, which is almost always a typo or leftover
rather than a deliberate user action. Users who want to disable role
masking should unset the field or use `train_on_inputs: true`.
Also sharpened the fallback one-shot warning for strategies without
built-in boundaries: names the consequence ("only pad and media tokens
are masked, every other token contributes to loss") and points users
at `cfg.role_boundaries` + docs/multimodal_assistant_mask.md instead
of "see axolotl/processing_strategies.py for how to declare
boundaries."
Files:
- src/axolotl/utils/schemas/datasets.py: Literal adds "none"
- src/axolotl/processing_strategies.py: ctor truthiness check on
role_boundaries_override; sharpened fallback warning
- src/axolotl/utils/schemas/multimodal.py: role_boundaries description
now calls out opt-in + empty-list fallback semantics
- docs/multimodal_assistant_mask.md: same clarification in the Semantics
block; updated the fallback-path detection paragraph to quote the new
warning text
- tests/test_processing_strategies.py: +2 regressions
(test_sft_dataset_schema_accepts_all_supported_train_on_eos_values,
test_empty_role_boundaries_override_falls_back_to_builtin); 63/63 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* doc cleanup
* fix(mm-mask): CodeRabbit findings + lint fix on PR #3625
Pre-commit failure: trailing newline missing on
docs/multimodal_assistant_mask.md (end-of-file-fixer hook).
Six CodeRabbit findings addressed:
1. Scanner: non-trainable role's end marker ignored ``include_end``.
Under ``train_on_eos="all"``, the shared ``[/INST]`` token (user-end
with ``include_end=False``, intentionally re-matched as assistant-start)
leaked into loss via the user branch on Pixtral / Mistral V7 Tekken.
Fix: gate the non-trainable branch on ``best_match.include_end`` to
mirror the trainable branch.
2. Gemma3 ``boi_token`` lookup used ``tokenizer.special_tokens_map.get("boi_token")``,
which never fires on real checkpoints (``special_tokens_map`` only
holds HF's standard slots — bos/eos/pad/unk/...). Swap to direct
attribute read ``getattr(tokenizer, "boi_token", None)``, matching
what ``transformers.models.gemma3.processing_gemma3`` itself does.
Updated the ``_gemma_tokenizer`` test fixture to mirror real-model
shape so the test exercises the production code path.
3. GLM dispatcher only registered ``Glm46VProcessor`` (GLM-4.6V /
GLM-4.7V). Real ``Glm4vProcessor`` (GLM-4V / GLM-4.1V) users fell
through to the base fallback. Both processors ship identical
media-token markers, so register both under the shared
``Glm4vProcessingStrategy`` with independent try/except import blocks.
Updated class docstring. +2 dispatcher regressions.
4. Gemma3 ``process_labels`` hardcoded 262144 for the soft image token.
Resolve dynamically via ``tokenizer.convert_tokens_to_ids("<image_soft_token>")``
with unk-id guard; fall back to 262144 only if the string isn't in
vocab. Mirrors ``Gemma4ProcessingStrategy.process_labels`` pattern.
5. ``build_collator`` was called twice per ``build()`` (eval + train
passes), producing two identical ``MM collator: ...`` INFO banners on
startup. Gate the log on ``is_eval=False`` so only the training pass
emits it.
6. Removed unused ``_mistral_common_stub`` pytest fixture (13 refs → 0,
always returned ``None``; the dispatcher already handles missing
``mistral_common`` via lazy import + ``try/except``). Added
``test_scanner_train_on_eos_all_with_non_trainable_include_end_false``
— a focused scanner-level lock-in for finding #1, independent of any
specific VLM strategy.
Test count: 63 → 68 passing. Local ``pre-commit run --all-files`` green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(mm-mask): hoist .tolist() out of scanner; shorten comments/docstrings
- Scanner perf: convert labels[i] to a Python list once per row so
_match_prefix / _find_end operate on list slices instead of
re-materializing Tensor slices via .tolist() on every probe. Cuts
O(n*boundaries) CPython↔C boundary crossings per batch.
- Markdown lint (MD001, MD040): promote two h3 section headings to h2
under the h1; add `text` language to the verify-at-runtime fenced block.
- Shorten verbose comments/docstrings added in recent commits to
bare-minimum "why" notes matching the repo's existing style.
68/68 tests, 8/8 pre-commit hooks still pass.
* upgrade to torchao 0.17.0
* chore: lint
* refactor attention handling
* replace legacy attention boolean flags with capability properties
Replace checks with capability-based properties derived from attn_implementation
This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property
* compute attn capability flags in normalizer instead of properties
* make attn_implementation the single source of truth
* move attention-dependent validators to mode=after
* migrate remaining consumers to canonical attn_implementation
* expand attention tests + rewrite docs
* migrate example configs to canonical attn_implementation
* update doc snippets + reject gemma4-hybrid with non-FA2 backend
* remove dead gemma4 branch in _set_attention_config
* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests
* drop "Phase 2" naming from attn-implementation tests
* regroup attn_implementation tests by feature concern
* clean up verbose comments and remove MD
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x
In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False). When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.
The old code placed `return_dict=True` inside processor_kwargs. In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning). The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:
IndexError: too many indices for tensor of dimension 2
The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped). No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.
Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): process_rows returns dict (BatchFeature) shape
Two related changes for the multimodal chat collator under transformers 5.x:
1. Wrap apply_chat_template result in dict(...) so process_rows returns
a plain dict rather than a BatchFeature instance. BatchFeature is a
Mapping but not a dict; downstream code that did
batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
would index on a tensor when the result wasn't dict-shaped, raising
IndexError: too many indices for tensor of dimension 2
2. Soften the regression test's contract from `dict` to `Mapping` so it
exercises the actual semantic guarantee (key/value access) rather
than the implementation detail (dict vs BatchFeature). Test guards
against the original transformers 5.x breakage where apply_chat_template's
return_dict default went from True to False.
Includes regression test under tests/test_mm_chat_collator.py.
Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.
Signed-off-by: Wing Lian <wing@axolotl.ai>
---------
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* Support loss_type/loss_weights DPO
* Validate dpo loss type/weights only set for dpo
* Tests: Update ipo tests to use new path
* Docs: Update docs for new ipo path
* PR fixes - typo/validation
* PR nit - warning
* chore: fix warnings arg
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* [gemma4] fix VRAM leak in hybrid FA2+SDPA path under activation checkpointing
Route shared_kv_states through a thread-local side channel instead of the
decoder-layer kwargs so the checkpoint partial never references the dict.
HF's Gemma4TextModel.forward passes shared_kv_states (a mutable dict used
for cross-layer K/V sharing) as a kwarg to every decoder_layer call.
GradientCheckpointingLayer.__call__ then forms
partial(super().__call__, **kwargs), and whichever checkpoint runs
(axolotl's CPU_Offloaded_Gradient_Checkpointer or torch's stock
checkpoint) captures that partial. The partial holds a reference to the
dict, which holds the K/V tensors produced by store_full_length_kv
layers. Those tensors stay pinned for the full duration of backward, and
delayed ref-cycle cleanup in torch's caching allocator under FSDP2 +
activation checkpointing bleeds the residual across steps.
Observed symptom: VRAM climbs ~0.47 GiB/step from a 42 GiB baseline,
OOMs around step 73 (~94 GiB peak) on Gemma-4 31B multimodal with
gemma4_hybrid_attn_impl: true. Independent of seq len / image size.
All-flex-attention path is flat but ~22x slower.
Violated invariant: anything crossing an activation-checkpoint boundary
must be a tensor (refcounted by autograd) or plain Python data -- never
a mutable container holding tensor references.
Fix (all in src/axolotl/monkeypatch/models/gemma4/fused_attn.py):
* threading.local() store with _get/_set_shared_kv_states helpers
* _patch_decoder_layer_call(): monkeypatches
Gemma4TextDecoderLayer.__call__ to pop shared_kv_states from kwargs
and stash it in TLS before delegating to GradientCheckpointingLayer.
The partial formed downstream no longer references the dict.
* fused_forward reads TLS first, falls back to kwarg for callers that
bypass the patched __call__ (e.g. direct attention invocation).
* wired into patch_gemma4_fused_attn; idempotent via a sentinel.
TLS is overwritten on each new step's first decoder-layer call, so the
previous step's dict is released promptly. No changes to hybrid dispatch,
FSDP wrap policy, or any config behaviour. Works for hybrid, flex, and
eager paths.
Introduced by PR #3598 (commit b8358aa5).
* Coderabbit comment: gemma4: clear TLS unconditionally in decoder-layer patched __call__
Overwrite the thread-local shared_kv_states store on every invocation
(including with None) instead of only when the kwarg is present.
The previous conditional write left stale dicts in TLS on any path that
reaches Gemma4TextDecoderLayer.__call__ without a shared_kv_states
kwarg — e.g. generation, eval hooks, or future HF refactors that make
the kwarg optional. fused_forward would then silently consume a prior
step's K/V dict instead of falling back to its own kwarg path.
Unconditional write makes the invariant in the surrounding comment
("TLS is overwritten on each new step's first decoder-layer call, so
the previous step's dict is released promptly") actually hold.
No behavior change for the training happy path, which always passes
the kwarg. Addresses CodeRabbit review on PR #3611
* fix: swap threading.local() for module-level store so autograd worker threads see shared_kv_states during backward recompute
Previous commits fixed memory leak on 31B but caused type error with MOE Gemma4 variants - this fixes that:
PR 3611's TLS variant only works when recompute runs on the same thread
that set TLS during forward. PyTorch's C++ autograd engine
(_engine_run_backward) spawns per-device worker threads to dispatch
backward, and HF-Trainer gradient_checkpointing (stock
torch.utils.checkpoint, non-reentrant / saved-tensor-hooks) fires
unpack_hook -> recompute_fn on those worker threads. TLS set on the main
thread during forward is invisible there, so _get_shared_kv_states()
returns None and the consumer-layer lookup crashes with
"'NoneType' object is not subscriptable" at
fused_attn.py:97 (shared_kv_states[self.kv_shared_layer_index]).
A plain module-level dict is visible to all threads in the process.
Lifecycle is identical: the slot is overwritten each forward, releasing
the previous step's dict and allowing its K/V tensors to be GC'd, so
the original VRAM-leak fix still holds under FSDP2 AC too.
* scope gemma4 shared_kv_states side channel to checkpointed training
Update PR #3611 with gate for checkpointed training to avoid regressions across async flows.
Added unit tests for kwargs pop, store-clear regression, and flag gating. Condensed verbose comments
* add gemma4 cross-thread visibility test for shared_kv_states store
Additional regression test for MoE gemma4 variants - asserts the module-level store is readable from threads other than the one that set it in response to previously observed 'NoneType' error
* fix logger
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* feat: move to uv first
* fix: update doc to uv first
* fix: merge dev/tests into uv pyproject
* fix: update docker docs to match current config
* fix: migrate examples to readme
* fix: add llmcompressor to conflict
* feat: rec uv sync with lockfile for dev/ci
* fix: update docker docs to clarify how to use uv images
* chore: docs
* fix: use system python, no venv
* fix: set backend cpu
* fix: only set for installing pytorch step
* fix: remove unsloth kernel and installs
* fix: remove U in tests
* fix: set backend in deps too
* chore: test
* chore: comments
* fix: attempt to lock torch
* fix: workaround torch cuda and not upgraded
* fix: forgot to push
* fix: missed source
* fix: nightly upstream loralinear config
* fix: nightly phi3 long rope not work
* fix: forgot commit
* fix: test phi3 template change
* fix: no more requirements
* fix: carry over changes from new requirements to pyproject
* chore: remove lockfile per discussion
* fix: set match-runtime
* fix: remove unneeded hf hub buildtime
* fix: duplicate cache delete on nightly
* fix: torchvision being overridden
* fix: migrate to uv images
* fix: leftover from merge
* fix: simplify base readme
* fix: update assertion message to be clearer
* chore: docs
* fix: change fallback for cicd script
* fix: match against main exactly
* fix: peft 0.19.1 change
* fix: e2e test
* fix: ci
* fix: e2e test
* feat: support excess_length_strategy for RL trainers
Previously, RL data loading always dropped sequences exceeding
sequence_len. This adds support for the existing `excess_length_strategy`
config option (`drop`, `truncate`, `raise`) in RL training pipelines,
matching the behavior already available for SFT.
- `drop` (default): unchanged behavior, filters out long samples
- `truncate`: tokenizes text components, truncates responses to fit
within sequence_len while preserving the full prompt, then decodes
back to text. Handles DPO/IPO/ORPO/SIMPO and KTO datasets.
- `raise`: raises ValueError if any sample exceeds sequence_len
Closes#3547
* improve RL truncation strategy robustness and performance
---------
Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* Skip redundant evaluation when resuming from checkpoint
* add condition check for adding callback
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* better handling of dora merge on Conv layers in Qwen 3.5
* address issues from code review
* stricter efficient merges for dora since we now have meta model to reference
* upgrade to torchao 0.17.0
* upgrade mistral-common too
* chore: lint
* patch fix for torchao low bit optimizers
* fix up
* propagate dtype
* fix test for ao change
* address PR comments
* feat: add sonicmoe fused lora support
* fix: forgot to add file
* feat: add test
* feat: add lora support for other routes
* fix: add int8 lora support
* fix: add qwen35_moe interleave support
* fix: qwen3_5_moe loss
* chore: lint
* address some pr comments
* fix test imports
* add support matrix for moe kernels [skip ci]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* fix: DPO tool role KeyError, dataset hash output_dir, config validators [skip-e2e]
- Add 'tool' to default role_map_inv in dpo/chat_template.py default() and
argilla_chat() so datasets with tool-call messages no longer raise
KeyError: 'tool' (closes#3217)
- Fix generate_dataset_hash_from_config to use canonical tokenizer config +
overrides content instead of tokenizer.name_or_path when added_tokens_overrides
is set, preventing cache busting when only output_dir changes (closes#3303)
- Add three Pydantic config validators to AxolotlConfigWCapabilities:
* save_strategy: 'best' requires metric_for_best_model
* streaming=True is incompatible with val_set_size > 0
* lora_target_modules list entries must be valid Python regex patterns
- Tests for all three changes
* review: condense comment in shared.py, swap Mistral model for SmolLM2-135M in test_hash
* chore: lint
* move the validators out of the w/ capabilities schema
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* qat patch
* tests fixes
* fixup per PR code review
* use state dict hooks to handle dequant for saving safetensors from transformers
* use transformers torch ao quantizer hooks to save mx quantized model
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Add precompute_ref_log_probs to config schema
* chore: add description for config
* Add test for precompute_ref_log_probs and move to training args
* useing precompute logprobs as the default slows down CI as it has to precompute
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* allow bf16 flag but warn
Reason: when doing e.g. LoRA merges with CUDA_VISIBLE_DEVICES=, this will unnecessarily crash, even though the LoRA merge operation would have finished successfully. This seems to warrant changing it to a warning instead, as the code will most likely crash later if bf16 is unavailable and training begins anyway.
* don't use deprecated LOG.warn
* update tests to reflect validation change
* Deperecate dpo_norm_loss
* Rename chosen/rejected_input_ids to chosen/rejected_ids to match TRL https://github.com/huggingface/trl/pull/5179
* Remove deprecated rpo_alpha
* Remove dead_code tokenize_row
* Add _tokenize override to prevent double bos token on Llama DPO
* Fix DPO loss type now list not string
* Linting fix
* PR fixes
* update _tokenize override for DPO for multimodal
* support flattening/packing for GRPO
* more flattening
* fix tests
* improve dead vllm handling
* refactor out process handling for vllm serve and move bench flattening tests to gpu tests
* add validation for flattening with liger
* isolate batch flattening test
* flaky test
* nemo gym integration with grpo wip
* mostly working
* cleanup
* simplify
* update docs
* nemo gym support wip
* cleanup
* chore: lint
* address PR review and add more tests
* chore: lint
* post merge lora fixes for CI (#3536) [skip ci]
* post merge lora fixes for CI
* handle lora kernel auto-enable for moe without grouped_mm
* prefer not to import torch in schema validation
* address pr comments, add timeout, add tests
* roundup_power2_divisions not needed with newer pytorch versions (#3540)
* roundup_power2_divisions not needed with newer pytorch versions
* remove typo
* update qwen3.5 moe 35b-a3b yaml for 5090
* more bug fixes
* fix tests to match updated trainer
* don't use fa2 for hooks test
* reset plugins on the instance
* retry download
* fix references to renamed axolotl_cfg property on trainer
* Fix ref to trainer cfg
* fix: robust handling of race condition on patching check (#3543) [skip ci]
* EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]
* EBFT wip
* fixes
* more fixeS
* add missing strided module
* ebft fixes for multi-turn
* make ebft work with async
* add example for ebft w qwen3.5
* fix for split thinking and update yaml for lora over linear attention only
* enforce_eager for vllm arg in schema
* fix sync weights
* fix multi-gpu
* handle updated sig for mm
* ddp fixes
* improve multi-gpu handling, don't calculate logits, adaptive completion length
* chore: lint
* chore: lint
* support completion_mean
* Address corereview feedback
* clamp min IS ratio
* Address PR code review
* more fixes identified
* address code review
* Fix property from rebase conflict
* fix for ebft sync and update docs
* make trainer loss patch check a solo test
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* EBFT wip
* fixes
* more fixeS
* add missing strided module
* ebft fixes for multi-turn
* make ebft work with async
* add example for ebft w qwen3.5
* fix for split thinking and update yaml for lora over linear attention only
* enforce_eager for vllm arg in schema
* fix sync weights
* fix multi-gpu
* handle updated sig for mm
* ddp fixes
* improve multi-gpu handling, don't calculate logits, adaptive completion length
* chore: lint
* chore: lint
* support completion_mean
* Address corereview feedback
* clamp min IS ratio
* Address PR code review
* more fixes identified
* address code review
* Fix property from rebase conflict
* roundup_power2_divisions not needed with newer pytorch versions
* remove typo
* update qwen3.5 moe 35b-a3b yaml for 5090
* more bug fixes
* fix tests to match updated trainer
* don't use fa2 for hooks test
* reset plugins on the instance
* retry download
* fix references to renamed axolotl_cfg property on trainer
* Fix ref to trainer cfg
* feat: LoRA kernel support for bias, dropout, dora, embeddings
* chore: lint
* chore: lint
* address PR feedback, add regression tests, add fsdp2 tests for lora kernels
* update tests for new sigs
* update tests now that bias and dropout are supported
* fix token state json and mistral tokenizer issue
* centralize constants
* forgot to commit constants file
* Fix weakref in pickling relora state dict
* make curl a bit quieter so it doesn't log 2K lines
* fix path traversal for olmoe test
* more test fixes that weren't flagged previously
* chore: lint
* skip tests that fail b/c of OutOfResources
* scattermoe as slow tests
* update fbgemm-genai for torch 2.10
Transformers 5.x routes attention through sdpa_attention.py and no longer
calls the _prepare_4d_causal_attention_mask* or _expand_mask functions that
these patches targeted. This makes the following patches dead code:
- llama_patch_multipack.py (patched _prepare_4d_causal_attention_mask*)
- llama_expand_mask.py (patched _expand_mask, never called)
- Related utility functions in monkeypatch/utils.py
Closesaxolotl-ai-cloud/axolotl#3331
* optimize moe + lora
* more scattermoe optims
* selective dequant
* add correctness unit tests and benchmarks for scattermoe + lora
* handle base+lora split kernel for older moe models
* chore: lint
* fix casting for H200 and B200
* register pressure estimation and pruning for h200/b200
* use soft limit for pruning
* qkv patch for qwen3.5moe
* support text_model for qwen3.5 moe
* nesting of qwen3
* use udpated cce with zero3 support
* Fix decomposed backward for QKV and O projections
eliminates B @ A materialization in LoRA attention backward, replacing full [out, in] matmuls with two small [T, R] matmuls.
* use custom triton kernels for entropy from logits and selective softmax
* PR comments fixes
* fix out of bounds, include tests, include benchmarks
* chore: lint
* async grpo support
* implement data producer
* use fast async
* handle call to create data producer
* fix liger kernel setup
* fix replay buffer
* chore: lint
* make gpus go brrr
* chore: lint
* inplace div_, unwrap model for logits in bf16
* fuse selective softmax and empty cuda cache on each scoring step
* remove waiting for synch time and fix race
* make fp8 work and allow lora kernels w rl
* grpo with lora vllm sync and fixes for sharded distributed
* update docs
* more patches so it works against trl main
* address PR feedback for corerabbit
* consolidate behavioud of routing in scattermoe kernels
* collect telemetry on best chosen autotuned kernel
* properly collect data
* Fix property name and get smem too
* handle issues raised by coderabbit
* add tests for parity before refactoring
* docs: fix codestyle placeholders in CONTRIBUTING.md
Replace unresolved {codestyle} and {URLofCodestyle} template
variables with Ruff, the project's actual linter/formatter
as configured in .pre-commit-config.yaml.
* fix: replace bare except clauses with specific exception types
- quantization.py: use except ImportError for optional torchao imports
(consistent with line 48 which already uses ImportError correctly)
- cli/config.py: use except (RuntimeError, AssertionError) for CUDA
device property query
Prevents masking unrelated errors like KeyboardInterrupt or SystemExit.
* test: add unit tests for convert.py JSON/JSONL utilities
Cover FileReader, FileWriter, StdoutWriter, JsonParser,
JsonlSerializer, and JsonToJsonlConverter with 8 test cases
including roundtrip and edge case (empty list) scenarios.
Previously this module had zero test coverage.
* fix: address CodeRabbit review feedback
- quantization.py: catch (ImportError, RuntimeError) for optional
torchao imports; CUDA wheel/GPU mismatches raise RuntimeError,
not ImportError
- convert.py: remove unused output_file_path parameter from
JsonToJsonlConverter.convert() — FileWriter already holds the
output path from construction
- tests/test_convert.py: update call site to match new signature