axolotl

Author	SHA1	Message	Date
Wing Lian	936149380f	support nemotron for scattermoe-lora	2026-03-23 21:29:58 +00:00
Wing Lian	86be9f329e	post merge lora fixes for CI (#3536 ) [skip ci] * post merge lora fixes for CI * handle lora kernel auto-enable for moe without grouped_mm * prefer not to import torch in schema validation	2026-03-23 02:26:10 -04:00
Wing Lian	0e583efeaa	increase rtol, codecov informational only, don't silently fail errors w curl (#3534 ) [skip ci]	2026-03-22 13:54:03 -04:00
Wing Lian	b3289fd190	feat: LoRA kernel support for bias, dropout, dora, embeddings (#3528 ) [skip ci] * feat: LoRA kernel support for bias, dropout, dora, embeddings * chore: lint * chore: lint * address PR feedback, add regression tests, add fsdp2 tests for lora kernels * update tests for new sigs * update tests now that bias and dropout are supported	2026-03-22 13:53:19 -04:00
Wing Lian	a67392c427	liger support for qwen 3.5 and fused rmsnorm+gated (#3531 ) [skip ci] * liger support for qwen 3.5 and fused rmsnorm+gated * support for qwen 3.5 moe * fix version ref * fixups for PR code review	2026-03-22 13:19:21 -04:00
Wing Lian	5b2e3f00ce	fix: handle connection errors when checking user whoami (#3529 )	2026-03-22 09:11:17 -04:00
Wing Lian	fc3b3d1d4e	synthetic datasets for benchmarking and testing (#3518 ) [skip ci] * synthetic datasets for benchmarking and testing * fix synthetic dataset parse from config and add tests * use type=_synthetic	2026-03-21 22:47:26 -04:00
Wing Lian	c9df6efdc2	support offloading layers to CPU (#3512 ) [skip ci] * support offloading layers to CPU * chore: lint * revert change * update docs	2026-03-21 22:47:02 -04:00
Wing Lian	0ee98a0309	fix token state json and mistral tokenizer issue (#3522 ) [skip ci] * fix token state json and mistral tokenizer issue * centralize constants * forgot to commit constants file * Fix weakref in pickling relora state dict * make curl a bit quieter so it doesn't log 2K lines * fix path traversal for olmoe test * more test fixes that weren't flagged previously * chore: lint * skip tests that fail b/c of OutOfResources * scattermoe as slow tests * update fbgemm-genai for torch 2.10	2026-03-21 22:46:10 -04:00
Wing Lian	2c05847a5f	reduce autotune search space (#3525 ) [skip ci] * reduce autotune search space * consistent docstrings	2026-03-21 18:30:15 -04:00
Wing Lian	b0294b3427	handle qwen3.5 moe loading (#3523 ) [skip ci]	2026-03-20 09:25:16 -04:00
Avaya Aggarwal	1bcfc08c90	feat: add support and end-to-end tests for multiple custom optimizers… (#3457 ) [skip ci] * feat: add support and end-to-end tests for multiple custom optimizers including Optimi AdamW, ADOPT AdamW, Muon, Dion, Schedule-Free AdamW, CAME PyTorch, and Flash AdamW. * feat: Add standalone flashoptim integration test and E2E tests for various custom optimizers including FlashAdamW, FlashAdam, FlashSGD, FlashSGDW, FlashLion, optimi_adamw, adopt_adamw, muon, dion, and schedule_free_adamw. * feat: introduce Pydantic schema validation for dataset, attention, and training configurations. * feat: add e2e tests for custom optimizers including optimi_adamw, adopt_adamw, muon, dion, schedule_free_adamw, came_pytorch, and flash optimizers. * test: add e2e tests for custom optimizers including optimi_adamw, adopt_adamw, muon, dion, schedule_free_adamw, came_pytorch, and flash optimizers. * test: fix assertion in flash optimizers test to compare class names directly * fix: address PR review - reuse require_torch_2_7_0 decorator, remove fsdp_config.version check, extract shared FSDP version helper, remove unused imports and optim_args * chore: lint --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2026-03-20 08:24:44 -04:00
NanoCode012	5a5cf30b26	fix: add dequant bf16 repo (#3507 ) [skip ci]	2026-03-20 17:11:46 +07:00
Avaya Aggarwal	7ddfb2d8a0	cleanup: remove dead SDPA patches (#3488 ) [skip ci] Transformers 5.x routes attention through sdpa_attention.py and no longer calls the _prepare_4d_causal_attention_mask* or _expand_mask functions that these patches targeted. This makes the following patches dead code: - llama_patch_multipack.py (patched _prepare_4d_causal_attention_mask*) - llama_expand_mask.py (patched _expand_mask, never called) - Related utility functions in monkeypatch/utils.py Closes axolotl-ai-cloud/axolotl#3331	2026-03-20 17:10:41 +07:00
Owen Arliawan	c57acef2c7	Qwen3.5-MoE example config with lora_target_modules regex (#3515 ) [skip ci] * lora target modules with regex * updates * fsdp for non moe * update wording * chore: cleanup and lint * chore: cleanup docs from merge --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2026-03-20 16:52:46 +07:00
Lorenzo Baraldi	038ffe3f26	fix: solved double sequence partition from SequenceParallelContextManager and Accelerate's native CP (#3498 )	2026-03-20 16:27:24 +07:00
VED	c13cb7c853	feat: add nemotron config (#3506 ) * nemotron config exp * Update examples/nemotron/nemotron-mini-4b-qlora.yaml Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>	2026-03-20 16:23:42 +07:00
VED	b3823cc6b0	fix: gemma3 configs (#3500 ) [skip ci] * gemma fft , text fix * good lint	2026-03-20 16:14:06 +07:00
VED	113d275bd9	qwen docs + new config (#3499 ) [skip ci] * qwen docs + new config * docss lint * simplify comments * read me * lint comments * Update docs/multimodal.qmd * Update docs/multimodal.qmd * Update examples/qwen3.5/9b-fft-vision.yaml * chore: fix link and incorrect points --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> Co-authored-by: NanoCode012 <nano@axolotl.ai>	2026-03-20 16:13:34 +07:00
VED	7920fe74ec	fix num_labels= 1 test fail (#3493 ) [skip ci] * trl_num_lables=1 * casual num_lables=1,rwd model * lint	2026-03-20 16:12:23 +07:00
Wing Lian	1fc86d5295	Scattermoe LoRA optimizations (#3513 ) * optimize moe + lora * more scattermoe optims * selective dequant * add correctness unit tests and benchmarks for scattermoe + lora * handle base+lora split kernel for older moe models * chore: lint * fix casting for H200 and B200 * register pressure estimation and pruning for h200/b200 * use soft limit for pruning * qkv patch for qwen3.5moe * support text_model for qwen3.5 moe * nesting of qwen3 * use udpated cce with zero3 support * Fix decomposed backward for QKV and O projections eliminates B @ A materialization in LoRA attention backward, replacing full [out, in] matmuls with two small [T, R] matmuls.	2026-03-19 23:07:42 -04:00
Wing Lian	bb483ad4c4	make the CI fail GitHub Actions on test failures (#3517 ) * make the CI fail GitHub Actions on test failures * use model bundle * install zstd for compressed model artifact	2026-03-19 08:29:24 -04:00
Wing Lian	163bd4dd5a	use custom triton kernels for entropy from logits and selective softmax (#3510 ) * use custom triton kernels for entropy from logits and selective softmax * PR comments fixes * fix out of bounds, include tests, include benchmarks * chore: lint	2026-03-19 02:02:43 -04:00
Wing Lian	f291ac029c	fix for flaky tests in lora ops kernels w autotune (#3511 ) [skip ci] * fix for flaky tests in lora ops kernels w autotune * attempt 2 to fix	2026-03-19 01:18:47 -04:00
Wing Lian	5ef3f28340	Support for Async GRPO (#3486 ) * async grpo support * implement data producer * use fast async * handle call to create data producer * fix liger kernel setup * fix replay buffer * chore: lint * make gpus go brrr * chore: lint * inplace div_, unwrap model for logits in bf16 * fuse selective softmax and empty cuda cache on each scoring step * remove waiting for synch time and fix race * make fp8 work and allow lora kernels w rl * grpo with lora vllm sync and fixes for sharded distributed * update docs * more patches so it works against trl main * address PR feedback for corerabbit	2026-03-17 11:42:47 -04:00
Aarush	999b3fec2e	fix: replace shell=True subprocess with argument list in modal CLI (#3487 ) * fix: replace shell=True subprocess with argument list in modal CLI Using shell=True with a formatted string containing docker_image (a user-controlled value) is a command injection risk (Bandit B602). Replace with an argument list, which passes args directly to the process without shell interpretation, removing the nosec annotation. * fix: add nosec annotation to suppress bandit B603/B607 warnings Removing shell=True (B602) surfaces B603 (subprocess without shell) and B607 (partial executable path for 'docker'). Use bare # nosec to suppress both, consistent with other nosec usages in the codebase.	2026-03-17 08:53:13 -04:00
Wing Lian	8f3fb517b3	consolidate behavioud of routing in scattermoe kernels (#3475 ) * consolidate behavioud of routing in scattermoe kernels * collect telemetry on best chosen autotuned kernel * properly collect data * Fix property name and get smem too * handle issues raised by coderabbit * add tests for parity before refactoring	2026-03-16 23:47:40 -04:00
Wing Lian	830e9f7eaf	automatically enable tf32 if supported (#3473 ) [skip ci] * automatically enable tf32 if supported * update fixtures * handle only when True * Address CR comments * address readability from pr comment * simplify	2026-03-16 23:47:00 -04:00
NanoCode012	d230cbbde3	chore(doc): update readme (#3503 ) [skip ci]	2026-03-17 09:43:24 +07:00
NanoCode012	a098df527b	feat: add Mistral Small 4 (#3502 ) * feat: add mistral small 4 * fix: update mistral common * fix: deepcopy when passing in tokenizer * feat: add doc on reasoning and thinking section * fix: don't use custom tokenizer and quantize experts * chore: update docs and configs * chore: update doc to follow official name * feat: update cce to include mistral4 * chore: move * fix: naming * fix: test mock breaking get_text_config check * fix: enable CCE and add expert block targetting to configs * chore: docs * fix: use act checkpointing * chore: doc * chore: docs * chore: docs	2026-03-17 09:39:05 +07:00
NanoCode012	7da5f94379	feat: add FA4 (#3481 ) * feat: add FA4 * chore: update docs * fix: recommend FA4 for those with compatible devices * fix: adjust import check and add head_dim check * chore: add limitation to doc * fix: log warning and quit if cannot import validator * chore: simplify * fix: add caveat with FA2 shadow dir	2026-03-16 00:13:18 -04:00
NanoCode012	4a5876df7a	fix: explicit set workflow permission and move secrets to necessary (#3484 ) [skip ci] * fix: explicit set workflow permission and move secrets to necessary steps only * fix: comment * fix: more permission restrict * chore: add read for pypi	2026-03-16 00:13:05 -04:00
Aarush	defee62d99	fix: fix CONTRIBUTING.md placeholders, bare except clauses, and add convert.py tests (#3485 ) [skip ci] * docs: fix codestyle placeholders in CONTRIBUTING.md Replace unresolved {codestyle} and {URLofCodestyle} template variables with Ruff, the project's actual linter/formatter as configured in .pre-commit-config.yaml. * fix: replace bare except clauses with specific exception types - quantization.py: use except ImportError for optional torchao imports (consistent with line 48 which already uses ImportError correctly) - cli/config.py: use except (RuntimeError, AssertionError) for CUDA device property query Prevents masking unrelated errors like KeyboardInterrupt or SystemExit. * test: add unit tests for convert.py JSON/JSONL utilities Cover FileReader, FileWriter, StdoutWriter, JsonParser, JsonlSerializer, and JsonToJsonlConverter with 8 test cases including roundtrip and edge case (empty list) scenarios. Previously this module had zero test coverage. * fix: address CodeRabbit review feedback - quantization.py: catch (ImportError, RuntimeError) for optional torchao imports; CUDA wheel/GPU mismatches raise RuntimeError, not ImportError - convert.py: remove unused output_file_path parameter from JsonToJsonlConverter.convert() — FileWriter already holds the output path from construction - tests/test_convert.py: update call site to match new signature	2026-03-16 00:12:40 -04:00
VED	f56efdb4ab	fix: high eval loss w/ sample packing (#3478 ) [skip ci] * check if eval_sp * radable condition	2026-03-15 22:11:23 -04:00
NanoCode012	d8a646c80d	chore: logging cleanup (#3482 ) [skip ci]	2026-03-15 22:10:57 -04:00
VED	a806704e94	moe quant patch for merge miss match (#3483 ) * moe quant patch for merge miss match * lint * revert test + fix moe patch * comment fixxes * e2e tests * mismatch fixx tested * mis match fix wwith vllm compatablity + test * comment lint * fix: missing os import, duplicate no op * chore: simplify comments --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2026-03-15 22:10:30 -04:00
Wing Lian	d8a05744d7	Reverts commits `79908b3c6`, `083c5a042`, `e1ff75624`, `ff77fa248`. (#3496 ) The non-root user approach had multiple issues with RunPod compatibility, sudo PATH handling, and tmux in exec sessions. Restoring root as the default user for now.	2026-03-13 11:54:09 -04:00
Wing Lian	ff77fa2488	preserve env for root -> ubuntu user (#3495 )	2026-03-13 10:19:34 -04:00
Wing Lian	e1ff756245	become the ubuntu user when root logs in (#3494 )	2026-03-13 09:06:54 -04:00
Wing Lian	083c5a0421	check ubuntu user and set uv python dir (#3492 )	2026-03-12 23:20:54 -04:00
Wing Lian	79908b3c6e	use ubuntu user instead of root for uv docker images (#3491 )	2026-03-12 20:41:13 -04:00
Wing Lian	819b157c7b	swap around what we're building for docker (#3490 ) * remove cloud configuration we don't base image for * but we do want it for uv	2026-03-11 21:45:13 -04:00
Wing Lian	fccc712dae	builds for py312-cu128-torch2.9.1 (#3489 )	2026-03-11 20:09:03 -04:00
NanoCode012	23ad40bdd5	fix: disable async load when loading quantized bnb	2026-03-11 13:18:27 +07:00
NanoCode012	cf4d550c88	fix: reduce permissions for preview docs CI (#3480 ) [skip ci]	2026-03-09 08:04:31 -04:00
Wing Lian	43b1c80aa6	load weights synchronously so they can be converted and not OOM: (#3477 )	2026-03-07 07:09:24 -05:00
Wing Lian	a36aaa70ce	add gpu tests for scattermoe (#3474 ) [skip ci]	2026-03-07 00:00:48 -05:00
Wing Lian	80f7088ad1	update setuptools so trl can be installed from main for nightlies (#3471 ) * update setuptools so trl can be installed from main for nightlies * run the nightly in the PR CI on change * use range request, don't use cu129 in CI since it's not supported with AO * run multigpu ci if CCE install script changes	2026-03-06 14:59:25 -05:00
Wing Lian	46b9f40f2a	bump dev version to 0.16.0.dev0 (#3472 ) [skip ci]	2026-03-06 14:59:00 -05:00
Wing Lian	8f19169eb0	tag for v0.15.0 release (#3470 ) v0.15.0	2026-03-06 12:55:11 -05:00

2659 Commits