NanoCode012
4e55871112
feat: Add opt-out Telemetry ( #3237 )
...
* initial telemetry manager impl
* adding todo
* updates
* updates
* progress on telemetry: config load, process, model load, train start / end, error tracking
* update error file path sanitization function; adding more error tracking
* updated sanitization logic, tests
* adding runtime metrics (cpu + gpu memory, steps/s, etc.)
* tests for runtime metrics telemetry and assoc. callback
* small update / fix
* simplifying path redaction
* sleep on all ranks in distributed setting
* adding back in base_model redaction w/ whitelist
* fix
* doc update
* improved redaction, send system info during model config load telemetry, etc.
* adding runtime metrics / system info additional accelerator support, etc.
* adding runtime metrics / system info additional accelerator support, etc.
* remove duplicate info
* fixes
* fix issue with tests in ci
* distributed fix
* opt-in version of telemetry
* enable / disable logic update
* docs fix
* doc update
* minor fixes
* simplifying
* slight changes
* fix
* lint
* update posthog dep
* coderabbit comments
* fix: opt-in model
* fix: increase time since last
* fix: increase whitelist orgs
* fix: posthog init and shutdown
* fix: imports
* fix: also check grad norm
* fix: duplicate plugin_manager calls
* fix: bad merge
* chore: update docs
* fix: cache process per comment
* fix: error handling
* fix: tests
* Revert "fix: error handling"
This reverts commit 22d1ea5755.
* fix: test telemetry error_handled bool
* fix: revert test
* chore: final doc fixes
---------
Co-authored-by: Dan Saunders <danjsaund@gmail.com >
Co-authored-by: Dan Saunders <dan@axolotl.ai >
2025-11-18 11:35:25 +07:00
VED
dcf24fd24e
feat: save checkpoint after training started ( #3233 )
...
* add:config parameters for checkpoint
* callback main
* test file_type fix
* lint
* unit
* simplify dict/obj handling
* Update src/axolotl/utils/schemas/dynamic_checkpoint.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Delete tests/e2e/integrations/__init__.py
* remove hard code path in test
* device check
* lint
* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
* Update src/axolotl/utils/schemas/dynamic_checkpoint.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
* lint-2
* remove: signal-based checkpoints
* lint
* remove signal tests
* add:is_main_process
* lint
* add is_distributed() for tests
* remove nested is_main_process
* Update src/axolotl/utils/schemas/dynamic_checkpoint.py
Co-authored-by: Wing Lian <wing.lian@gmail.com >
* Update src/axolotl/utils/schemas/dynamic_checkpoint.py
Co-authored-by: Wing Lian <wing.lian@gmail.com >
* add user_defined_filename
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
Co-authored-by: Wing Lian <wing.lian@gmail.com >
2025-11-13 10:21:05 -05:00
NanoCode012
9901ee5602
fix: voxtralprocessor broken ( #3255 ) [skip ci]
...
* fix: voxtralprocessor broken
* chore: add todo
* chore: wording
2025-11-13 10:18:42 -05:00
xzuyn
dd78f2e0cc
Fix: warmup_steps: 0 & warmup_ratio: 0 not disabling warmup ( #3254 )
...
* fix unintentional falsy checks
* chore: lint
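The unintentional-falsy-check bug this commit fixes can be sketched as follows (hypothetical function and argument names for illustration; not Axolotl's actual config handling):

```python
def resolve_warmup(warmup_steps, warmup_ratio, total_steps, default_ratio=0.03):
    """Resolve the number of warmup steps from config values.

    A buggy version would use `if warmup_steps:` / `if warmup_ratio:`, which
    treats an explicit 0 as "unset" and falls through to the default ratio,
    so `warmup_steps: 0` could not disable warmup. Checking `is not None`
    distinguishes "explicitly 0" from "not provided".
    """
    if warmup_steps is not None:
        return warmup_steps
    if warmup_ratio is not None:
        return int(total_steps * warmup_ratio)
    return int(total_steps * default_ratio)
```

With this check, `warmup_steps: 0` and `warmup_ratio: 0` both resolve to zero warmup steps rather than silently re-enabling the default.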
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-11-11 10:32:06 +07:00
Eduard Zl
b54f9c942b
_get_tools in ChatTemplateStrategy : function "parameters" can be dict or string ( #3238 )
...
* When training on function calls, the "tools" elements of a dataset can contain the same parameter name with different types, and the dataset fails to load such a training set. This fix allows the "parameters" element of a function call to be a string (produced by running "json.dumps" when preparing the training dataset). The _get_tools function iterates over tool definitions: if a "parameters" element is a dict, it is kept as-is; if it is a string, it is converted back to a dict by invoking "json.loads" on the string value.
* feat: add doc on tool parameters json loading
* feat: add tests for parameters json string
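The dict-or-string handling described above can be sketched like this (a minimal standalone illustration; function name and tool shape are assumptions, not Axolotl's exact internals):

```python
import json


def normalize_tool_parameters(tools):
    """Accept tool definitions whose "parameters" field is either a dict
    or a JSON string (e.g. serialized with json.dumps during data prep),
    and return definitions where it is always a dict."""
    normalized = []
    for tool in tools:
        tool = dict(tool)  # shallow copy so the input isn't mutated
        params = tool.get("parameters")
        if isinstance(params, str):
            # A string is assumed to be serialized JSON; parse it back.
            tool["parameters"] = json.loads(params)
        normalized.append(tool)
    return normalized
```

Serializing "parameters" to a string up front sidesteps PyArrow schema conflicts when the same parameter name appears with different types across rows, and this normalization restores the dict form at load time.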
---------
Co-authored-by: ezlotnik <eduard_zlotnik@intuit.com >
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-11-11 09:04:28 +07:00
NanoCode012
11eb36585a
feat: add arg to enable dft in liger ( #3125 )
...
* feat: add arg to enable dft in liger
* feat: add tests use_token_scaling
* fix: test
* fix: move check to args
2025-11-10 21:37:47 +07:00
NanoCode012
d0c846fc5e
feat: add granitemoeshared and granitemoehybrid ( #3158 )
2025-11-10 21:35:45 +07:00
Wing Lian
b5fcc2f14b
log cumulative total trained tokens ( #3252 )
...
* log cumulative total trained tokens
* use is_distributed helper
2025-11-07 16:04:00 -05:00
VED
ed2e8cacd6
feat:openenv rollout_func ( #3239 ) [skip ci]
...
* feat:openenv rollout_func
* chore lint
* docs
* add:docs processing_class
* tests
* lint
2025-11-07 08:51:40 -05:00
Lê Nam Khánh
80270a92fa
Fix typos in some files ( #3250 ) [skip ci]
2025-11-07 08:21:20 -05:00
Wing Lian
98333e639a
upgrade trl to 0.24.0 and liger to 0.6.3 ( #3230 )
...
* upgrade trl to 0.24.0
* fix reward collator init
* use newer DataCollatorForPreference instead
* DataCollatorForPreference doesn't use padding kwarg
* fix input id labels
* fix fbgemm-gpu version for pytorch versions
* tweak pinned deps
* transformers doesn't support hub 1.0 yet
* upgrade liger dep to 0.6.3
* set TORCH_CUDA_ARCH_LIST correctly
2025-10-29 18:02:16 -04:00
Dan Saunders
9d4d39e939
Diffusion trainer fix: shift logits to align with input tokens ( #3191 )
...
* shift logits for diffusion generate
* delete unused
* diffusion trainer: token shift
2025-10-27 14:42:01 +07:00
VED
4dc018992d
Feat/opentelemetry ( #3215 )
2025-10-22 19:16:55 -07:00
NanoCode012
243620394a
fix: force train split for json,csv,txt for test_datasets and misc doc changes ( #3226 )
...
* fix: force train split for json,csv,txt for test_datasets
* feat(doc): add info on mixing datasets for VLM
* feat(doc): max memory
* fix(doc): clarify lr groups
* fix: add info on vision not being dropped
* feat: add qwen3-vl to multimodal docs
* fix: add moe blocks to arch list
* feat(doc): improve mistral docs
* chore: add helpful link [skip-e2e]
* fix: add vram usage for mistral small
* Update link in docs/faq.qmd
Co-authored-by: salman <salman.mohammadi@outlook.com >
---------
Co-authored-by: Wing Lian <wing@axolotl.ai >
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-10-22 15:23:20 -07:00
Qingyang Wu
3750fdcf79
Fix trainer dataloader slow loading issue ( #3219 )
...
* Fix trainer dataloader handling in src/axolotl/core/trainers/base.py
* update comment to reflect torch version
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com >
2025-10-22 21:22:14 +07:00
Matthew Hambrecht
613bcf90e5
fix: enable_sleep_mode -> vllm_enable_sleep_mode ( #3225 )
...
Co-authored-by: Matthew Hambrecht <matthew.hambrecht@patapsco.ai >
2025-10-22 06:55:26 -07:00
NanoCode012
8bb871b5cf
fix: deepspeed with context parallel ( #3220 )
2025-10-20 14:06:58 +07:00
Leonard
87565ecc05
Add chat_template.argilla_chat support for DPO datasets ( #3202 )
...
* Add chat_template.argilla_chat support for DPO datasets
Creates a new chat_template.argilla_chat prompt strategy for handling
DPO datasets where chosen/rejected fields contain full conversations
(messages + final response), following the pattern of chatml.argilla_chat
and llama3.argilla_chat.
- Add argilla_chat() function to chat_template.py
- Add chat_template.argilla_chat to RLHF documentation
- Add test coverage for argilla_chat with multiple tokenizers
Dataset format:
{
"chosen": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
],
"rejected": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}
* Fix chat_template.argilla_chat return value contract and add docstring
- Return (transform_fn, dataset_kwargs) tuple instead of bare transform_fn
- Add remove_columns specification for field_chosen and field_rejected
- Add comprehensive docstring with Args/Returns sections
- Update tests to unpack tuple return value
Addresses PR feedback to maintain consistency with chat_template.default()
and properly specify columns to remove after dataset transformation.
* Update tests/prompt_strategies/test_dpo_chat_templates.py
Co-authored-by: Wing Lian <wing.lian@gmail.com >
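The return-value contract described above can be sketched as follows (a hypothetical simplification: the signature, field handling, and keys are illustrative, not the strategy's actual implementation):

```python
def argilla_chat(cfg=None, **kwargs):
    """Sketch of the (transform_fn, dataset_kwargs) contract.

    chosen/rejected each hold a full conversation (messages plus the
    final assistant response); the shared prompt is everything before
    the final assistant turn.
    """

    def transform_fn(sample, tokenizer=None):
        prompt = sample["chosen"][:-1]
        return {
            "prompt": prompt,
            "chosen": sample["chosen"][-1]["content"],
            "rejected": sample["rejected"][-1]["content"],
        }

    # Tell the loader which original columns to drop after transforming.
    dataset_kwargs = {"remove_columns": ["chosen", "rejected"]}
    return transform_fn, dataset_kwargs
```

Returning the tuple (rather than a bare transform_fn) keeps the strategy consistent with chat_template.default() and lets the loader remove the consumed columns.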
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com >
2025-10-17 17:00:26 +07:00
NanoCode012
93ba57396f
fix: qwen3_vl attention config ( #3216 )
2025-10-17 10:35:03 +07:00
NanoCode012
aa1240acd8
fix: transformers deprecate load_in_Xbit in model_kwargs ( #3205 )
...
* fix: transformers deprecate load_in_Xbit in model_kwargs
* fix: test to read from quantization_config kwarg
* fix: test
* fix: access
* fix: test weirdly entering incorrect config
2025-10-16 16:07:27 +07:00
NanoCode012
8c7f63cf97
fix: unpack cce imported incorrectly ( #3212 ) [skip ci]
2025-10-13 17:19:15 +07:00
VED
cd856b45b1
feat:add support dataset_num_processes ( #3129 ) [skip ci]
...
* feat:add support dataset_num_processes
* chore
* required changes
* requested changes
* required changes
* required changes
* required changes
* elif get_default_process_count()
* add:del data
* Update cicd/Dockerfile.jinja
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
* Update cicd/single_gpu.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
---------
Co-authored-by: salman <salman.mohammadi@outlook.com >
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com >
2025-10-13 17:18:12 +07:00
salman
143dea4753
FSDPConfig (#3170 )
2025-10-10 14:44:25 +01:00
Hitesh Sagtani
bc2ffb8204
fix: Enable KD plugin support for PEFT/LoRA adapters ( #3207 )
...
- Fix _loss_function attribute not found on base model with PEFT
- Fix mismatched attribute name (loss_function vs _loss_function)
- Set _loss_function on unwrapped base model for PEFT
- Enable previously skipped test_llama_lora_kd test
- Add test config fixes for LoRA kernel compatibility
Fixes https://github.com/axolotl-ai-cloud/axolotl/issues/3206
2025-10-10 08:57:00 -04:00
Wing Lian
08b8fa62cc
only calculate packed ds length once if using a large world size ( #3210 )
2025-10-09 14:18:46 -04:00
Wing Lian
3a5c97e6e5
use can_device_access_peer for P2P checks ( #3209 ) [skip ci]
...
* use can_device_access_peer for P2P checks
* also log warn when automatically setting NCCL_P2P_DISABLE=1
2025-10-09 14:17:31 -04:00
VED
37f78c8592
add chat_template_jinja to wandb ( #3192 ) [skip ci]
...
* add chat_template_jinja to wandb
* temp_ct_file.flush()
* Update src/axolotl/utils/callbacks/__init__.py
Co-authored-by: Wing Lian <wing.lian@gmail.com >
* Update src/axolotl/utils/callbacks/__init__.py
Co-authored-by: Wing Lian <wing.lian@gmail.com >
* Apply suggestion from @winglian
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com >
2025-10-09 12:05:54 -04:00
NanoCode012
ab63b92c38
feat: add lfm2 family and latest moe model ( #3208 )
...
* feat: add lfm2 family and latest moe model
* fix: use ml-cross-entropy for lfm2 examples
2025-10-09 10:47:41 -04:00
Manh Nguyen
6f8ce024d1
Remove check_torch_compile_deepspeed ( #3195 ) [skip ci]
...
Signed-off-by: nguyen599 <pnvmanh2123@gmail.com >
2025-10-08 11:27:01 -04:00
Wing Lian
d0e9c3c1c5
When using Ray use prepare for dataloader fixes ( #3198 )
...
* make sure to use ray prepare for dataloader fixes
* ray tests use 2.7.0+
* don't call init_distributed w ray and deepspeed
* handle dict deepspeed config
* better handling of dict deepspeed config
* use json.dumps
* guard to_dict
* wrap import for optional ray
2025-10-08 10:43:41 -04:00
Wing Lian
130637a3fa
upgrade transformers to 4.57.0 ( #3201 )
...
* upgrade transformers to 4.57.0
* remove deprecated autoawq and use latest peft
* remove autoawq from setuptools script
* fix imports
* make sure torchvision is installed
* remove support for BetterTransformer
* skip fsdp_qlora_prequant test
* more robust error reporting
2025-10-08 08:43:46 -04:00
VED
377c510e95
sleep model support ( #3135 )
...
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-10-08 12:39:21 +01:00
VED
a6bfbe3400
torch_dtype -> dtype ( #3177 )
...
* torch_dtype -> dtype
* torch_dtype -> dtype
2025-10-01 15:02:51 +07:00
Dan Saunders
f4376748f3
debug log: multiprocess race condition fix ( #3188 )
2025-09-26 15:07:39 -04:00
Grant Holmes (Ren)
850c1a5f8d
Add FSDP v2 swap memory support + QLoRA compatibility fixes ( #3167 )
...
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-09-26 10:23:59 +01:00
NanoCode012
7fa8ac40cd
Feat(cce): add qwen3_vl, qwen3_vl_moe, granitemoeshared, granitemoehybrid, and upgraded all cce patches ( #3178 )
...
* feat: upgrade cce with patches for transformers 4.56
* feat: add missing models to cce readme
2025-09-26 12:11:29 +07:00
Dan Saunders
f9748c4dc5
Cp fix ( #3182 )
...
* patch transformers to allow CP + FA2
* nits
* only patch in CP > 1 case
2025-09-25 12:03:50 -04:00
陈华杰
e8b962d47f
feat: support training with JSON string tool arguments ( #3136 )
...
* feat: support training with JSON string tool arguments; fix PyArrow data type inconsistent error
* feat: raise error for tool call arguments decode
* Add test_chat_templates_tool_call_string_arguments.py
Add test for string arguments
* fix: change to correct qwen3 tokenizer
* fix: update docs to clarify arguments json
* chore: lint
* fix: duplicate
* chore: revert
* feat: add error to faq
* fix: remove duplicate fixture
---------
Co-authored-by: caoqinping <caoqinping@lixiang.com >
Co-authored-by: gamersover-blog <1611885128@qq.com >
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-09-25 12:06:21 +07:00
NanoCode012
55d1be2ae6
fix: unify default for conversations_field [skip-e2e] ( #3070 )
...
* fix: unify default for conversations_field
* fix: suggestion to remove defaults
2025-09-23 21:22:15 +07:00
NanoCode012
08d831c3d5
Feat: add qwen3-next (w packing+cce) ( #3150 )
...
* feat: upgrade cce for qwen3-next
* feat: add sample qwen3 config
* feat: add packing patch for chunk_gated_delta_rule
* feat: add qwen3 link
* fix: tuple name
* feat: add tested qwen3 config
* fix: improve log
* feat: add patch for fla without packing
* fix: remove fla patch for standard mode
* feat: enable packing
* feat: add qwen3-next tests
* chore: move tests
2025-09-23 11:31:15 +07:00
AlexHT Hung
7be8740c5c
fix(rl): pass max_prompt_len to training args as max_prompt_length ( #3113 )
...
* pass max_prompt_len to training args as max_prompt_length
* Update rl.py
* refactor
* format
* fix: default for max_prompt_length
* fix: defaults for trainer
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-09-19 17:34:28 +07:00
NanoCode012
c51d6b06c3
feat: add apertus model and cce ( #3144 ) [skip ci]
...
* feat: add apertus, glm4v, glm4v_moe cce
* fix: arcee docs
* feat: add apertus
* feat: added vram usage
* fix: add apertus note
* feat: update doc on apertus xielu
* fix: add monkeypatch for xielu activation issue
* fix: simplify env
* feat: pin commit
* feat: add packing
* chore: move patch calling
* Update examples/apertus/README.md
Co-authored-by: salman <salman.mohammadi@outlook.com >
* Update examples/apertus/README.md
Co-authored-by: salman <salman.mohammadi@outlook.com >
* Update examples/apertus/README.md
Co-authored-by: salman <salman.mohammadi@outlook.com >
---------
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-09-19 17:34:04 +07:00
NanoCode012
09959fac70
Feat: add Magistral Small 2509 and native mistral3 tokenizer support ( #3165 )
...
* feat: update mistral common
* feat: add mistral3processor
* fix: loading
* fix: cast pixel_values to fp32
* fix: image tensor conversion
* feat: add FA2 support for pixtral based models
* fix: update mistral small 3.1 to use native tokenizer
* fix: install tips
* fix: improve info on sample dataset files
* chore: move mistral configs into subfolders
* fix: remove unneeded patch
* fix: indent
* feat: add integration tests
* chore: move
* feat: add magistral 2509 docs and example
* fix: convert tensor to bool
* feat: expand tests
* chore: move tests
2025-09-18 15:42:20 +07:00
Dan Saunders
4065bc14c6
Debug log, logging improvements ( #3159 )
...
* simplify logging
* remove comment
* progress on debug.log
* add debug-level logger for file log
* simplify
* case insensitivity; 3rd party logging improvements
* simplify
* fix
* tests
* lint
* nits
* nit
* Update tests/test_utils_tee.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* cleanup / comments
* fix
* oops
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-17 13:27:03 -04:00
Wing Lian
86d6ee7c05
upgrade trl and accelerate ( #3161 )
...
* upgrade trl==0.23.0
* upgrade accelerate patch fix
* add hints when using gradient_checkpointing with DPO
* set gradient-checkpointing properly
2025-09-16 14:53:01 -04:00
Wing Lian
d4cff1b7bb
improve setting of NCCL_P2P_DISABLE on runpod ( #3132 ) [skip ci]
...
* improve setting of NCCL_P2P_DISABLE on runpod
* use recs from review
2025-09-16 14:52:45 -04:00
Wing Lian
1ef6c196f7
setup env vars for ray train for FSDP ( #3130 ) [skip ci]
2025-09-16 14:52:29 -04:00
salman
58d67bf98d
Migrate QAT API; fix axolotl quantize for QAT-ed models; add NVFP4 ( #3107 )
2025-09-12 10:55:50 +01:00
salman
9406c0c488
log before eval step ( #3148 ) [skip-ci]
2025-09-11 11:19:30 +01:00
Dan Saunders
1b53c49e1a
text diffusion training plugin ( #3067 )
...
* diffusion training plugin
* cleanup
* nits
* fixes + improvements
* add back in reinit_weights (clobbered?); masking / pretrain fixes
* nits
* cleanup; tests draft
* sample generation, tests fixes
* fixes
* nits
* add inference support; add auto-mask token support
* nits
* nits
* progress
* simplify logging
* lint
* prefix args with diffusion_
* coderabbito
* tests fix
* nit
* nits
* cleanup + nits
* nits
* fix SFT sample gen
* fixes
* fix
* comments
* comments
* lint
* reward model lora fix
* cleanup; fix pretraining_dataset case
* gradio inference
* update cfgs
* update cfgs
* train, generation parity, cleanup
* fix
* simplify
* test
* test fix
2025-09-10 20:27:00 -04:00