* Fix Axolotl ReLoRA optimizer reset scope
* fix: make relora reset method honor relora_prune_ratio
When relora_prune_method='reset' and relora_prune_ratio is explicitly
set, the ratio was silently ignored and replaced with the hardcoded
_FULL_RESET_RATIO (0.999). Fix by moving the default-ratio logic to
ReLoRACallback.on_step_begin: None maps to _FULL_RESET_RATIO for reset
and 0.9 for other methods. reset_optimizer now uses the same random
pruning path for both 'random' and 'reset'.
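A minimal sketch of the default-ratio mapping now living in `ReLoRACallback.on_step_begin` (constant and fallback values are from this message; the surrounding structure is assumed):
```python
_FULL_RESET_RATIO = 0.999  # hardcoded full-reset ratio named above

def resolve_prune_ratio(prune_method: str, prune_ratio: float | None) -> float:
    # an explicitly set ratio always wins; None maps to a per-method default
    if prune_ratio is not None:
        return prune_ratio
    return _FULL_RESET_RATIO if prune_method == "reset" else 0.9
```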
Also consolidate three-layer default mismatch: schema default for
relora_prune_method is now 'magnitude' (single canonical source);
dataclass defaults for both fields changed to None to eliminate the
conflicting fallback layer.
Tests updated: removed the test case that verified the old broken
behavior (reset ignoring ratio), added two cases proving reset honors
the passed ratio. E2E reset fixture now uses ratio=0.5 to make it
unambiguous that the ratio is honored.
* Fix ReLoRA uint8 pruning regression
---------
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* feat: systemic multimodal assistant-only loss masking + cfg.role_boundaries
Fixes silent ignoring of `cfg.train_on_inputs` / `cfg.roles_to_train` /
`cfg.train_on_eos` in the multimodal training path. Before this branch,
only Gemma 3n honored these knobs; every other VLM trained on the full
sequence regardless of config. Also adds `cfg.role_boundaries` YAML
override so users can declare per-role markers without subclassing.
What changed
------------
- `ProcessingStrategy` gains a declarative boundary scanner. Each
strategy declares per-role start/end markers via
`_build_role_boundaries`; the shared scanner honors
`train_on_inputs` / `roles_to_train` / `train_on_eos` (incl. "last").
- New per-template strategies: Gemma 4, Llama 3.2 Vision, Llama 4,
Pixtral, Mistral V7 Tekken.
- Refactored: Gemma 3 (previously no role masking), Gemma 3n
(previously ad-hoc scanner, now shared).
- Strategies whose boundary tokens couldn't be verified offline
(Voxtral, SmolVLM2, Mistral3, InternVL, GLM4V, llava/lfm2vl
fallback) retain legacy behavior and emit a one-shot warning. Users
can enable masking on them via `cfg.role_boundaries`.
- Pixtral / Mistral V7 Tekken correctly handle the shared `[/INST]`
token between user-end and assistant-start via `include_end=False`
+ scanner rewind.
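A hedged sketch of what a strategy's boundary declaration might look like; the `RoleBoundary` container and the marker strings are illustrative stand-ins, not the shipped values:
```python
from dataclasses import dataclass

@dataclass
class RoleBoundary:  # hypothetical container, for illustration only
    role: str
    start: str
    end: str
    include_end: bool = True

def build_role_boundaries() -> list[RoleBoundary]:
    return [
        # the user turn ends at [/INST], but that token doubles as the
        # assistant start, so it is excluded and the scanner rewinds
        RoleBoundary("user", "[INST]", "[/INST]", include_end=False),
        RoleBoundary("assistant", "[/INST]", "</s>"),
    ]
```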
See `docs/multimodal_assistant_mask.md` for the full audit table,
root-cause analysis, and design rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+types: address CodeRabbit nitpicks on PR #7
- builders/causal.py: add inline NOTE that multi-dataset configs reuse
the first dataset's masking knobs (roles_to_train / train_on_eos) for
all datasets — heterogeneous per-dataset overrides are not supported
in the MM path today.
- processing_strategies.py: annotate inner scanner helpers
_match_prefix and _find_end with explicit types (Tensor, int,
list[int] → bool / tuple[int, bool]) for readability.
- docs/multimodal_assistant_mask.md: renumber the "Commits on this
branch" list to 1-7 consecutive (previously skipped 3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mm-mask): address two CodeRabbit findings on PR #7
1. Schema rejected `train_on_eos: "none"` despite the scanner honoring it.
`_VALID_TRAIN_ON_EOS` accepts "none" and the design doc lists it, but
`SFTDataset.train_on_eos` was `Literal["all", "turn", "last"]`, so YAML
users hit a pydantic ValidationError at config load. Added "none" to
the Literal and updated the description.
2. `cfg.role_boundaries: []` had split-personality semantics: the strategy
ctor treated it as "replace built-ins with empty" while the collator
plumbing treated it as "unset", and both the design doc and the
MultiModalConfig schema help text promised wholesale replacement for
any set value. Aligned on opt-in semantics across all four surfaces —
a non-empty list replaces built-ins wholesale; unset or `[]` falls back
to built-ins. Rationale: honoring `[]` literally yields all-masked
labels and zero gradient, which is almost always a typo or leftover
rather than a deliberate user action. Users who want to disable role
masking should unset the field or use `train_on_inputs: true`.
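Hedged sketches of the two fixes (the Literal values are from this message; the helper is a hypothetical distillation of the ctor's truthiness check):
```python
from typing import Literal, Optional

# finding 1: the schema field now accepts "none"
train_on_eos: Literal["all", "turn", "last", "none"]

# finding 2: truthiness, not an is-None check, so `[]` falls back to
# built-ins instead of yielding all-masked labels and zero gradient
def resolve_boundaries(override: Optional[list], builtin: list) -> list:
    return list(override) if override else builtin
```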
Also sharpened the fallback one-shot warning for strategies without
built-in boundaries: names the consequence ("only pad and media tokens
are masked, every other token contributes to loss") and points users
at `cfg.role_boundaries` + docs/multimodal_assistant_mask.md instead
of "see axolotl/processing_strategies.py for how to declare
boundaries."
Files:
- src/axolotl/utils/schemas/datasets.py: Literal adds "none"
- src/axolotl/processing_strategies.py: ctor truthiness check on
role_boundaries_override; sharpened fallback warning
- src/axolotl/utils/schemas/multimodal.py: role_boundaries description
now calls out opt-in + empty-list fallback semantics
- docs/multimodal_assistant_mask.md: same clarification in the Semantics
block; updated the fallback-path detection paragraph to quote the new
warning text
- tests/test_processing_strategies.py: +2 regressions
(test_sft_dataset_schema_accepts_all_supported_train_on_eos_values,
test_empty_role_boundaries_override_falls_back_to_builtin); 63/63 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* doc cleanup
* fix(mm-mask): CodeRabbit findings + lint fix on PR #3625
Pre-commit failure: trailing newline missing on
docs/multimodal_assistant_mask.md (end-of-file-fixer hook).
Six CodeRabbit findings addressed:
1. Scanner: non-trainable role's end marker ignored ``include_end``.
Under ``train_on_eos="all"``, the shared ``[/INST]`` token (user-end
with ``include_end=False``, intentionally re-matched as assistant-start)
leaked into loss via the user branch on Pixtral / Mistral V7 Tekken.
Fix: gate the non-trainable branch on ``best_match.include_end`` to
mirror the trainable branch.
2. Gemma3 ``boi_token`` lookup used ``tokenizer.special_tokens_map.get("boi_token")``,
which never fires on real checkpoints (``special_tokens_map`` only
holds HF's standard slots — bos/eos/pad/unk/...). Swap to direct
attribute read ``getattr(tokenizer, "boi_token", None)``, matching
what ``transformers.models.gemma3.processing_gemma3`` itself does.
Updated the ``_gemma_tokenizer`` test fixture to mirror real-model
shape so the test exercises the production code path.
3. GLM dispatcher only registered ``Glm46VProcessor`` (GLM-4.6V /
GLM-4.7V). Real ``Glm4vProcessor`` (GLM-4V / GLM-4.1V) users fell
through to the base fallback. Both processors ship identical
media-token markers, so register both under the shared
``Glm4vProcessingStrategy`` with independent try/except import blocks.
Updated class docstring. +2 dispatcher regressions.
4. Gemma3 ``process_labels`` hardcoded 262144 for the soft image token.
Resolve dynamically via ``tokenizer.convert_tokens_to_ids("<image_soft_token>")``
with unk-id guard; fall back to 262144 only if the string isn't in
vocab. Mirrors ``Gemma4ProcessingStrategy.process_labels`` pattern.
5. ``build_collator`` was called twice per ``build()`` (eval + train
passes), producing two identical ``MM collator: ...`` INFO banners on
startup. Gate the log on ``is_eval=False`` so only the training pass
emits it.
6. Removed unused ``_mistral_common_stub`` pytest fixture (13 refs → 0,
always returned ``None``; the dispatcher already handles missing
``mistral_common`` via lazy import + ``try/except``). Added
``test_scanner_train_on_eos_all_with_non_trainable_include_end_false``
— a focused scanner-level lock-in for finding #1, independent of any
specific VLM strategy.
Test count: 63 → 68 passing. Local ``pre-commit run --all-files`` green.
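Findings 2 and 4 sketched together (`getattr` and `convert_tokens_to_ids` are real transformers-facing calls; the wrapper and guard shape are assumptions):
```python
def resolve_gemma3_special_ids(tokenizer):
    # finding 2: special_tokens_map only holds HF's standard slots, so a
    # direct attribute read is the reliable way to reach boi_token
    boi_token = getattr(tokenizer, "boi_token", None)
    # finding 4: resolve the soft image token dynamically; fall back to
    # the historical hardcoded id only if the string isn't in the vocab
    soft_id = tokenizer.convert_tokens_to_ids("<image_soft_token>")
    if soft_id is None or soft_id == tokenizer.unk_token_id:
        soft_id = 262144
    return boi_token, soft_id
```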
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(mm-mask): hoist .tolist() out of scanner; shorten comments/docstrings
- Scanner perf: convert labels[i] to a Python list once per row so
_match_prefix / _find_end operate on list slices instead of
re-materializing Tensor slices via .tolist() on every probe. Cuts
O(n*boundaries) CPython↔C boundary crossings per batch.
- Markdown lint (MD001, MD040): promote two h3 section headings to h2
under the h1; add `text` language to the verify-at-runtime fenced block.
- Shorten verbose comments/docstrings added in recent commits to
bare-minimum "why" notes matching the repo's existing style.
68/68 tests, 8/8 pre-commit hooks still pass.
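A sketch of the hoist, with the loop shape assumed:
```python
import torch

def scan_row(labels: torch.Tensor, row_idx: int, markers: list[list[int]]) -> list[int]:
    row = labels[row_idx].tolist()  # one conversion per row (the hoist)
    hits = []
    for pos in range(len(row)):
        for marker in markers:
            # plain list-slice compare; no per-probe .tolist() round trip
            if row[pos : pos + len(marker)] == marker:
                hits.append(pos)
    return hits
```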
* memory clean up for fsdp full state dict
* Update src/axolotl/monkeypatch/accelerate/fsdp2.py
Co-authored-by: Wing Lian <wing.lian@gmail.com>
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* upgrade to torchao 0.17.0
* chore: lint
* refactor attention handling
* replace legacy attention boolean flags with capability properties
Replace the legacy boolean-flag checks with capability-based properties derived from attn_implementation.
This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property
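A hedged sketch of the property-based shape introduced here (backend names assumed; the next commit moves this computation into the normalizer):
```python
class AttnConfig:  # assumed shape, for illustration
    def __init__(self, attn_implementation: str):
        self.attn_implementation = attn_implementation

    @property
    def attn_supports_packing(self) -> bool:
        # packing needs a backend that honors per-sequence attention masks
        return self.attn_implementation in {"flash_attention_2", "flex_attention"}

    @property
    def attn_uses_flash_lib(self) -> bool:
        # only this backend pulls in the flash-attn library dependency
        return self.attn_implementation == "flash_attention_2"
```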
* compute attn capability flags in normalizer instead of properties
* make attn_implementation the single source of truth
* move attention-dependent validators to mode=after
* migrate remaining consumers to canonical attn_implementation
* expand attention tests + rewrite docs
* migrate example configs to canonical attn_implementation
* update doc snippets + reject gemma4-hybrid with non-FA2 backend
* remove dead gemma4 branch in _set_attention_config
* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests
* drop "Phase 2" naming from attn-implementation tests
* regroup attn_implementation tests by feature concern
* clean up verbose comments and remove MD
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x
In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False). When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.
The old code placed `return_dict=True` inside processor_kwargs. In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning). The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:
IndexError: too many indices for tensor of dimension 2
The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped). No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.
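The corrected call shape, sketched; `messages` and `processor_kwargs` stand in for the collator's real inputs:
```python
def process_rows(processor, messages, processor_kwargs):
    batch = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,    # consumed by apply_chat_template itself in 5.x
        **processor_kwargs,  # must no longer carry return_dict
    )
    return batch["input_ids"]  # a BatchFeature key, not a raw 2-D tensor index
```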
Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): process_rows returns dict (BatchFeature) shape
Two related changes for the multimodal chat collator under transformers 5.x:
1. Wrap apply_chat_template result in dict(...) so process_rows returns
a plain dict rather than a BatchFeature instance. BatchFeature is a
Mapping but not a dict; downstream code that did
batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
would index on a tensor when the result wasn't dict-shaped, raising
IndexError: too many indices for tensor of dimension 2
2. Soften the regression test's contract from `dict` to `Mapping` so it
exercises the actual semantic guarantee (key/value access) rather
than the implementation detail (dict vs BatchFeature). Test guards
against the original transformers 5.x breakage where apply_chat_template's
return_dict default went from True to False.
Includes regression test under tests/test_mm_chat_collator.py.
Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.
Signed-off-by: Wing Lian <wing@axolotl.ai>
---------
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix dpo collation/padding
* fix DPO collator encoder-decoder pixel_values dtype and is_encoder_decoder detection
- Use float32 instead of LongTensor for _pixel_values in encoder-decoder branch
- Add missing padding_value case for _pixel_values in encoder-decoder branch
- Derive is_encoder_decoder from model config instead of hardcoding False
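A sketch of the two corrections, with the surrounding collator structure assumed:
```python
import torch

def collate_extras(pixel_list: list[torch.Tensor], model):
    # pixel values are float image features, never LongTensor token ids
    pixel_values = torch.stack(pixel_list).to(torch.float32)
    # derive encoder-decoder-ness from the model config, not a hardcoded False
    is_encoder_decoder = getattr(model.config, "is_encoder_decoder", False)
    return pixel_values, is_encoder_decoder
```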
* Support loss_type/loss_weights DPO
* Validate dpo loss type/weights only set for dpo
* Tests: Update ipo tests to use new path
* Docs: Update docs for new ipo path
* PR fixes - typo/validation
* PR nit - warning
* chore: fix warnings arg
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* [gemma4] fix VRAM leak in hybrid FA2+SDPA path under activation checkpointing
Route shared_kv_states through a thread-local side channel instead of the
decoder-layer kwargs so the checkpoint partial never references the dict.
HF's Gemma4TextModel.forward passes shared_kv_states (a mutable dict used
for cross-layer K/V sharing) as a kwarg to every decoder_layer call.
GradientCheckpointingLayer.__call__ then forms
partial(super().__call__, **kwargs), and whichever checkpoint runs
(axolotl's CPU_Offloaded_Gradient_Checkpointer or torch's stock
checkpoint) captures that partial. The partial holds a reference to the
dict, which holds the K/V tensors produced by store_full_length_kv
layers. Those tensors stay pinned for the full duration of backward, and
delayed ref-cycle cleanup in torch's caching allocator under FSDP2 +
activation checkpointing bleeds the residual across steps.
Observed symptom: VRAM climbs ~0.47 GiB/step from a 42 GiB baseline,
OOMs around step 73 (~94 GiB peak) on Gemma-4 31B multimodal with
gemma4_hybrid_attn_impl: true. Independent of seq len / image size.
All-flex-attention path is flat but ~22x slower.
Violated invariant: anything crossing an activation-checkpoint boundary
must be a tensor (refcounted by autograd) or plain Python data -- never
a mutable container holding tensor references.
Fix (all in src/axolotl/monkeypatch/models/gemma4/fused_attn.py):
* threading.local() store with _get/_set_shared_kv_states helpers
* _patch_decoder_layer_call(): monkeypatches
Gemma4TextDecoderLayer.__call__ to pop shared_kv_states from kwargs
and stash it in TLS before delegating to GradientCheckpointingLayer.
The partial formed downstream no longer references the dict.
* fused_forward reads TLS first, falls back to kwarg for callers that
bypass the patched __call__ (e.g. direct attention invocation).
* wired into patch_gemma4_fused_attn; idempotent via a sentinel.
TLS is overwritten on each new step's first decoder-layer call, so the
previous step's dict is released promptly. No changes to hybrid dispatch,
FSDP wrap policy, or any config behaviour. Works for hybrid, flex, and
eager paths.
Introduced by PR #3598 (commit b8358aa5).
* CodeRabbit comment: gemma4: clear TLS unconditionally in decoder-layer patched __call__
Overwrite the thread-local shared_kv_states store on every invocation
(including with None) instead of only when the kwarg is present.
The previous conditional write left stale dicts in TLS on any path that
reaches Gemma4TextDecoderLayer.__call__ without a shared_kv_states
kwarg — e.g. generation, eval hooks, or future HF refactors that make
the kwarg optional. fused_forward would then silently consume a prior
step's K/V dict instead of falling back to its own kwarg path.
Unconditional write makes the invariant in the surrounding comment
("TLS is overwritten on each new step's first decoder-layer call, so
the previous step's dict is released promptly") actually hold.
No behavior change for the training happy path, which always passes
the kwarg. Addresses CodeRabbit review on PR #3611
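A hedged sketch of the TLS side channel with the unconditional clear from the follow-up above (class and helper names come from these messages; exact signatures are assumed):
```python
import threading

_tls = threading.local()

def _set_shared_kv_states(states):
    _tls.shared_kv_states = states

def _get_shared_kv_states():
    return getattr(_tls, "shared_kv_states", None)

def patch_decoder_layer_call(decoder_layer_cls):
    """Stash shared_kv_states in TLS before checkpoint forms its partial."""
    if getattr(decoder_layer_cls.__call__, "_kv_patched", False):
        return  # idempotent via sentinel
    orig_call = decoder_layer_cls.__call__

    def patched(self, *args, **kwargs):
        # pop() with a None default writes unconditionally, clearing any
        # stale dict even on paths that omit the kwarg
        _set_shared_kv_states(kwargs.pop("shared_kv_states", None))
        return orig_call(self, *args, **kwargs)

    patched._kv_patched = True
    decoder_layer_cls.__call__ = patched
```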
* fix: swap threading.local() for module-level store so autograd worker threads see shared_kv_states during backward recompute
Previous commits fixed the memory leak on the 31B model but caused a type error with MoE Gemma4 variants; this fixes that:
PR #3611's TLS variant only works when recompute runs on the same thread
that set TLS during forward. PyTorch's C++ autograd engine
(_engine_run_backward) spawns per-device worker threads to dispatch
backward, and HF-Trainer gradient_checkpointing (stock
torch.utils.checkpoint, non-reentrant / saved-tensor-hooks) fires
unpack_hook -> recompute_fn on those worker threads. TLS set on the main
thread during forward is invisible there, so _get_shared_kv_states()
returns None and the consumer-layer lookup crashes with
"'NoneType' object is not subscriptable" at
fused_attn.py:97 (shared_kv_states[self.kv_shared_layer_index]).
A plain module-level dict is visible to all threads in the process.
Lifecycle is identical: the slot is overwritten each forward, releasing
the previous step's dict and allowing its K/V tensors to be GC'd, so
the original VRAM-leak fix still holds under FSDP2 AC too.
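The module-level replacement, sketched under the same assumptions:
```python
# a module-level slot is visible to autograd's per-device worker threads,
# unlike threading.local(); it is overwritten on each forward, so the
# previous step's K/V dict is still released promptly
_SHARED_KV_STATES = {"current": None}

def _set_shared_kv_states(states):
    _SHARED_KV_STATES["current"] = states

def _get_shared_kv_states():
    return _SHARED_KV_STATES["current"]
```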
* scope gemma4 shared_kv_states side channel to checkpointed training
Update PR #3611 with a gate for checkpointed training to avoid regressions across async flows.
Added unit tests for the kwargs pop, the store-clear regression, and flag gating. Condensed verbose comments.
* add gemma4 cross-thread visibility test for shared_kv_states store
Additional regression test for MoE Gemma4 variants: asserts the module-level store is readable from threads other than the one that set it, in response to the previously observed 'NoneType' error.
* fix logger
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* feat: move to uv first
* fix: update doc to uv first
* fix: merge dev/tests into uv pyproject
* fix: update docker docs to match current config
* fix: migrate examples to readme
* fix: add llmcompressor to conflict
* feat: recommend uv sync with lockfile for dev/ci
* fix: update docker docs to clarify how to use uv images
* chore: docs
* fix: use system python, no venv
* fix: set backend cpu
* fix: only set for installing pytorch step
* fix: remove unsloth kernel and installs
* fix: remove U in tests
* fix: set backend in deps too
* chore: test
* chore: comments
* fix: attempt to lock torch
* fix: workaround torch cuda and not upgraded
* fix: forgot to push
* fix: missed source
* fix: nightly upstream loralinear config
* fix: nightly phi3 long rope not work
* fix: forgot commit
* fix: test phi3 template change
* fix: no more requirements
* fix: carry over changes from new requirements to pyproject
* chore: remove lockfile per discussion
* fix: set match-runtime
* fix: remove unneeded hf hub buildtime
* fix: duplicate cache delete on nightly
* fix: torchvision being overridden
* fix: migrate to uv images
* fix: leftover from merge
* fix: simplify base readme
* fix: update assertion message to be clearer
* chore: docs
* fix: change fallback for cicd script
* fix: match against main exactly
* fix: peft 0.19.1 change
* fix: e2e test
* fix: ci
* fix: e2e test
* bump transformers to 5.5.4 and trl to latest 1.1.0
* more upgrades
* update peft too
* adapt lora_merge to peft 0.19 layer config API
PEFT 0.19 requires a LoraConfig object on Linear/ParamWrapper/Conv
layer constructors and moved use_rslora, use_dora, fan_in_fan_out,
lora_dropout, and lora_bias into that config. Build the config
per branch in _build_peft_layer_and_get_delta so the merge utility
works with the upgraded peft.
* allow lora_dropout on mixed attention+MoE configs under peft 0.19
PEFT 0.19's convert_peft_config_for_transformers auto-remaps old MoE
target_modules (w1/w2/w3 on Mixtral, etc.) into target_parameters for
transformers v5's fused 3D expert Parameters. Those targets get wrapped
with ParamWrapper, which rejects lora_dropout != 0 because the 3D
einsum can't factor dropout out of lora_B(lora_A(dropout(x))).
Monkeypatch ParamWrapper.__init__ to internally use a copy of the
LoraConfig with lora_dropout=0, so its dropout slot becomes nn.Identity
while the shared config still delivers real dropout to sibling Linear
LoRA layers (attention q/k/v/o). A probe runs the same conversion on a
deep copy to detect the situation and emit a warning before patching.
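A hedged sketch of the patch; the `lora_config` kwarg name is an assumption for illustration:
```python
import copy

def patch_paramwrapper_dropout(param_wrapper_cls):
    orig_init = param_wrapper_cls.__init__

    def patched_init(self, *args, lora_config=None, **kwargs):
        if lora_config is not None and getattr(lora_config, "lora_dropout", 0):
            # ParamWrapper alone gets a dropout-free copy; sibling Linear
            # LoRA layers keep the shared config and its real dropout
            lora_config = copy.copy(lora_config)
            lora_config.lora_dropout = 0.0
        orig_init(self, *args, lora_config=lora_config, **kwargs)

    param_wrapper_cls.__init__ = patched_init
```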
* fix: rename model to adapter_model for fsdp sharded final model
* fix: follow upstream transformer shard size
* fix: handle multiple model files
* fix redundant condition, tighten to safetensors, keep shard size small
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* feat: support excess_length_strategy for RL trainers
Previously, RL data loading always dropped sequences exceeding
sequence_len. This adds support for the existing `excess_length_strategy`
config option (`drop`, `truncate`, `raise`) in RL training pipelines,
matching the behavior already available for SFT.
- `drop` (default): unchanged behavior, filters out long samples
- `truncate`: tokenizes text components, truncates responses to fit
within sequence_len while preserving the full prompt, then decodes
back to text. Handles DPO/IPO/ORPO/SIMPO and KTO datasets.
- `raise`: raises ValueError if any sample exceeds sequence_len
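A minimal sketch of the dispatch (names per this message):
```python
def handle_excess_length(sample_len: int, sequence_len: int, strategy: str) -> str:
    if sample_len <= sequence_len:
        return "keep"
    if strategy == "raise":
        raise ValueError(f"sample length {sample_len} exceeds sequence_len {sequence_len}")
    if strategy == "drop":
        return "drop"
    return "truncate"  # trim response tokens to fit, preserve the full prompt
```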
Closes #3547
* improve RL truncation strategy robustness and performance
---------
Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* Allow loading FP8-quantized models (e.g. Mistral-Small-4-119B) with
FineGrainedFP8Config and optional dequantize kwarg for full fine-tuning.
Made-with: Cursor
* Skip redundant evaluation when resuming from checkpoint
* add condition check for adding callback
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* better handling of dora merge on Conv layers in Qwen 3.5
* address issues from code review
* stricter efficient merges for dora since we now have a meta model to reference
* qwen3_5.jinja: handle list content on system messages
The system message branch used string concatenation on
messages[0].content, which breaks when the first system message uses
the OpenAI-style list-of-parts format that multimodal datasets require.
User and assistant branches already handle both string and list content,
but the system branch did not.
Check whether content is a string and fall back to iterating over parts
when it is a list, matching the pattern used for user messages.
Fixes #3590
* Address pr for other content types
---------
Co-authored-by: Joaquin Hui Gomez <joaquinhuigomez@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* fix: remove unneeded debug log
* fix: cleanup
* feat: add dense gemma config and cleanup
* feat: add cce support
* update notes and set torch compile
* fix patch for new number of return vals
* fixes for gemma4
* fix packing bug
* use updated cce for mm
* fix: pass in kv cache func when avail for transformers 5.5
* feat: update examples with flex variant and readme
* gemma4 lora attention kernels
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* upgrade to torchao 0.17.0
* upgrade mistral-common too
* chore: lint
* patch fix for torchao low bit optimizers
* fix up
* propagate dtype
* fix test for ao change
* address PR comments
* feat: add sonicmoe fused lora support
* fix: forgot to add file
* feat: add test
* feat: add lora support for other routes
* fix: add int8 lora support
* fix: add qwen35_moe interleave support
* fix: qwen3_5_moe loss
* chore: lint
* address some pr comments
* fix test imports
* add support matrix for moe kernels [skip ci]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* docs: comprehensive documentation improvements for humans and agents
New human docs:
- grpo.qmd: GRPO deep dive (async, rewards, IS correction, scaling)
- ebft.qmd: EBFT guide (structured/strided modes, feature extraction)
- choosing_method.qmd: decision tree for SFT vs LoRA vs DPO vs GRPO
- vllm_serving.qmd: vLLM setup for GRPO (server/colocate, LoRA sync)
- training_stability.qmd: monitoring, NaN debugging, OOM, healthy metrics
New agent docs:
- AGENTS_SFT.md: agent reference for supervised fine-tuning
- AGENTS_DPO.md: agent reference for preference learning (DPO/KTO/ORPO)
Updated existing docs:
- rlhf.qmd: cross-references to new GRPO/EBFT/choosing-method guides
- getting-started.qmd: reorganized Next Steps with links to new guides
- debugging.qmd: link to training stability guide
- _quarto.yml: added new pages to sidebar navigation
Removed:
- bak.agents.md: stale backup that confused agents
* docs: trim duplicated generic config from AGENTS_DPO.md
Remove boilerplate training params (optimizer, gradient_checkpointing,
flash_attention, etc.) from each method template. These are not
preference-learning-specific and are already covered in AGENTS_SFT.md.
Config templates now show only method-specific fields with a reference
to AGENTS_SFT.md for the rest.
* docs: deduplicate across new doc pages
- grpo.qmd: collapse vLLM setup section to brief config + link to
vllm_serving.qmd; collapse IS correction to essentials + link;
replace full monitoring tables with summary + link to
training_stability.qmd
- vllm_serving.qmd: remove duplicated async/IS config reference tables
(already in grpo.qmd config reference); replace full example config
with link to grpo.qmd quick start
- ebft.qmd: trim generic training params in quick start config
* fix: train scripts
* feat: split files into cleaner parts
* fix: cleanup pretraining docs
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* fix: DPO tool role KeyError, dataset hash output_dir, config validators [skip-e2e]
- Add 'tool' to default role_map_inv in dpo/chat_template.py default() and
argilla_chat() so datasets with tool-call messages no longer raise
KeyError: 'tool' (closes #3217)
- Fix generate_dataset_hash_from_config to use canonical tokenizer config +
overrides content instead of tokenizer.name_or_path when added_tokens_overrides
is set, preventing cache busting when only output_dir changes (closes #3303)
- Add three Pydantic config validators to AxolotlConfigWCapabilities:
* save_strategy: 'best' requires metric_for_best_model
* streaming=True is incompatible with val_set_size > 0
* lora_target_modules list entries must be valid Python regex patterns
- Tests for all three changes
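One of the three validators, sketched with pydantic v2's real `model_validator` API (the class shape is a minimal stand-in for the real schema):
```python
from typing import Optional
from pydantic import BaseModel, model_validator

class SaveStrategyCfg(BaseModel):  # minimal stand-in for the real schema
    save_strategy: Optional[str] = None
    metric_for_best_model: Optional[str] = None

    @model_validator(mode="after")
    def _best_requires_metric(self):
        if self.save_strategy == "best" and not self.metric_for_best_model:
            raise ValueError("save_strategy='best' requires metric_for_best_model")
        return self
```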
* review: condense comment in shared.py, swap Mistral model for SmolLM2-135M in test_hash
* chore: lint
* move the validators out of the w/ capabilities schema
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>