Commit Graph

1366 Commits

Author SHA1 Message Date
NanoCode012
4e55871112 feat: Add opt-out Telemetry (#3237)
* initial telemetry manager impl

* adding todo

* updates

* updates

* progress on telemetry: config load, process, model load, train start / end, error tracking

* update error file path sanitization function; adding more error tracking

* updated sanitization logic, tests

* adding runtime metrics (cpu + gpu memory, steps/s, etc.)

* tests for runtime metrics telemetry and assoc. callback

* small update / fix

* simplifying path redaction

* sleep on all ranks in distributed setting

* adding back in base_model redaction w/ whitelist

* fix

* doc update

* improved redaction, send system info during model config load telemetry, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* remove duplicate info

* fixes

* fix issue with tests in ci

* distributed fix

* opt-in version of telemetry

* enable / disable logic update

* docs fix

* doc update

* minor fixes

* simplifying

* slight changes

* fix

* lint

* update posthog dep

* coderabbit comments

* fix: opt-in model

* fix: increase time since last

* fix: increase whitelist orgs

* fix: posthog init and shutdown

* fix: imports

* fix: also check grad norm

* fix: duplicate plugin_manager calls

* fix: bad merge

* chore: update docs

* fix: cache process per comment

* fix: error handling

* fix: tests

* Revert "fix: error handling"

This reverts commit 22d1ea5755.

* fix: test telemetry error_handled bool

* fix: revert test

* chore: final doc fixes

---------

Co-authored-by: Dan Saunders <danjsaund@gmail.com>
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-11-18 11:35:25 +07:00
VED
dcf24fd24e feat: save checkpoint after training started (#3233)
* add:config parameters for checkpoint

* callback main

* test file_type fix

* lint

* unit

* simplify dict/obj handeling

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Delete tests/e2e/integrations/__init__.py

* remove hard code path in test

* device check

* lint

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* lint-2

* remove: singal based checkpoints

* lint

* remove signal tests

* add:is_main_process

* lint

* addis_d:istributed() for tests

* remove nested is_main_process

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* add user_defined_filename

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-11-13 10:21:05 -05:00
NanoCode012
9901ee5602 fix: voxtralprocessor broken (#3255) [skip ci]
* fix: voxtralprocessor broken

* chore: add todo

* chore: wording
2025-11-13 10:18:42 -05:00
xzuyn
dd78f2e0cc Fix: warmup_steps: 0 & warmup_ratio: 0 not disabling warmup (#3254)
* fix unintentional falsy checks

* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 10:32:06 +07:00
Eduard Zl
b54f9c942b _get_tools in ChatTemplateStrategy : function "parameters" can be dict or string (#3238)
* When training of function calls, "tools" elements of a dataset can contain same parameter name but with different types. Datasets fails to load such training set. This fix allows "parameters" element of function call to be string( by running "json.dumps" in preparation of training data set). The _get_tools function will iterate over tool definitions, if "parameters" element is dict, it will keep that way, if it is a string, it will be converted to dict by invoking "json.loads" on string value.

* feat: add doc on tool parameters json loading

* feat: add tests for parameters json string

---------

Co-authored-by: ezlotnik <eduard_zlotnik@intuit.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 09:04:28 +07:00
NanoCode012
11eb36585a feat: add arg to enable dft in liger (#3125)
* feat: add arg to enable dft in liger

* feat: add tests use_token_scaling

* fix: test

* fix: move check to args
2025-11-10 21:37:47 +07:00
NanoCode012
d0c846fc5e feat: add granitemoeshared and granitemoehybrid (#3158) 2025-11-10 21:35:45 +07:00
Wing Lian
b5fcc2f14b log cumulative total trained tokens (#3252)
* log cumulative total trained tokens

* use is_distributed helper
2025-11-07 16:04:00 -05:00
VED
ed2e8cacd6 feat:openenv rollout_func (#3239) [skip ci]
* feat:openenv rollout_func

* chore lint

* docs

* add:docs processing_class

* tests

* lint
2025-11-07 08:51:40 -05:00
Lê Nam Khánh
80270a92fa Fix typos in some files (#3250) [skip ci] 2025-11-07 08:21:20 -05:00
Wing Lian
98333e639a upgrade trl to 0.24.0 and liger to 0.6.3 (#3230)
* upgrade trl to 0.24.0

* fix reward collator init

* use newer DataCollatorForPreference instead

* DataCollatorForPreference doesn't use padding kwarg

* fix input id labels

* fix fbgemm-gpu version for pytorch versions

* tweak pinned deps

* transformers doesn't support hub 1.0 yet

* upgrade liger dep to 0.6.3

* set TORCH_CUDA_ARCH_LIST correctly
2025-10-29 18:02:16 -04:00
Dan Saunders
9d4d39e939 Diffusion trainer fix: shift logits to align with input tokens (#3191)
* shift logits for diffusion generate

* delete unused

* diffusion trainer: token shift
2025-10-27 14:42:01 +07:00
VED
4dc018992d Feat/opentelemetry (#3215) 2025-10-22 19:16:55 -07:00
NanoCode012
243620394a fix: force train split for json,csv,txt for test_datasets and misc doc changes (#3226)
* fix: force train split for json,csv,txt for test_datasets

* feat(doc): add info on mixing datasets for VLM

* feat(doc): max memory

* fix(doc): clarify lr groups

* fix: add info on vision not being dropped

* feat: add qwen3-vl to multimodal docs

* fix: add moe blocks to arch list

* feat(doc): improve mistral docs

* chore: add helpful link [skip-e2e]

* fix: add vram usage for mistral small

* Update link in docs/faq.qmd

Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-10-22 15:23:20 -07:00
Qingyang Wu
3750fdcf79 Fix trainer dataloader slow loading issue (#3219)
* Fix trainer dataloader handling in src/axolotl/core/trainers/base.py

* update comment to reflect torch version

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-10-22 21:22:14 +07:00
Matthew Hambrecht
613bcf90e5 fix: enable_sleep_mode -> vllm_enable_sleep_mode (#3225)
Co-authored-by: Matthew Hambrecht <matthew.hambrecht@patapsco.ai>
2025-10-22 06:55:26 -07:00
NanoCode012
8bb871b5cf fix: deepspeed with context parallel (#3220) 2025-10-20 14:06:58 +07:00
Leonard
87565ecc05 Add chat_template.argilla_chat support for DPO datasets (#3202)
* Add chat_template.argilla_chat support for DPO datasets

  Creates a new chat_template.argilla_chat prompt strategy for handling
  DPO datasets where chosen/rejected fields contain full conversations
  (messages + final response), following the pattern of chatml.argilla_chat
  and llama3.argilla_chat.

  - Add argilla_chat() function to chat_template.py
  - Add chat_template.argilla_chat to RLHF documentation
  - Add test coverage for argilla_chat with multiple tokenizers

  Dataset format:
  {
    "chosen": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ],
    "rejected": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ]
  }

* Fix chat_template.argilla_chat return value contract and add docstring

- Return (transform_fn, dataset_kwargs) tuple instead of bare transform_fn
- Add remove_columns specification for field_chosen and field_rejected
- Add comprehensive docstring with Args/Returns sections
- Update tests to unpack tuple return value

Addresses PR feedback to maintain consistency with chat_template.default()
and properly specify columns to remove after dataset transformation.

* Update tests/prompt_strategies/test_dpo_chat_templates.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-10-17 17:00:26 +07:00
NanoCode012
93ba57396f fix: qwen3_vl attention config (#3216) 2025-10-17 10:35:03 +07:00
NanoCode012
aa1240acd8 fix: transformers deprecate load_in_Xbit in model_kwargs (#3205)
* fix: transformers deprecate load_in_Xbit in model_kwargs

* fix: test to read from quantization_config kwarg

* fix: test

* fix: access

* fix: test weirdly entering incorrect config
2025-10-16 16:07:27 +07:00
NanoCode012
8c7f63cf97 fix: unpack cce imported incorrectly (#3212) [skip ci] 2025-10-13 17:19:15 +07:00
VED
cd856b45b1 feat:add support dataset_num_processes (#3129) [skip ci]
* feat:add support dataset_num_processes

* chore

* required changes

* requested chnages

* required chnages

* required changes

* required changes

* elif get_default_process_count()

* add:del data

* Update cicd/Dockerfile.jinja

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update cicd/single_gpu.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2025-10-13 17:18:12 +07:00
salman
143dea4753 FSDPConfig (#3170) 2025-10-10 14:44:25 +01:00
Hitesh Sagtani
bc2ffb8204 fix: Enable KD plugin support for PEFT/LoRA adapters (#3207)
- Fix _loss_function attribute not found on base model with PEFT
- Fix mismatched attribute name (loss_function vs _loss_function)
- Set _loss_function on unwrapped base model for PEFT
- Enable previously skipped test_llama_lora_kd test
- Add test config fixes for LoRA kernel compatibility

Fixes https://github.com/axolotl-ai-cloud/axolotl/issues/3206
2025-10-10 08:57:00 -04:00
Wing Lian
08b8fa62cc only calculate packed ds length once if using a large world size (#3210) 2025-10-09 14:18:46 -04:00
Wing Lian
3a5c97e6e5 use can_device_access_peer for P2P checks (#3209) [skip ci]
* use can_device_access_peer for P2P checks

* also log warn when automatically setting NCCL_P2P_DISABLE=1
2025-10-09 14:17:31 -04:00
VED
37f78c8592 add chat_template_jinja to wandb (#3192) [skip ci]
* add chat_template_jinja to wandb

* temp_ct_file.flush()

* Update src/axolotl/utils/callbacks/__init__.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* Update src/axolotl/utils/callbacks/__init__.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* Apply suggestion from @winglian

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-10-09 12:05:54 -04:00
NanoCode012
ab63b92c38 feat: add lfm2 family and latest moe model (#3208)
* feat: add lfm2 family and latest moe model

* fix: use ml-cross-entropy for lfm2 examples
2025-10-09 10:47:41 -04:00
Manh Nguyen
6f8ce024d1 Remove check_torch_compile_deepspeed (#3195) [skip ci]
Signed-off-by: nguyen599 <pnvmanh2123@gmail.com>
2025-10-08 11:27:01 -04:00
Wing Lian
d0e9c3c1c5 When using Ray use prepare for dataloader fixes (#3198)
* make sure to use ray prepare for dataloader fixes

* ray tests use 2.7.0+

* don't call init_distributed w ray and deepspeed

* handle dict deepspeed config

* better handling of dict deepspeed config

* use json.dumps

* guard to_dict

* wrap import for optional ray
2025-10-08 10:43:41 -04:00
Wing Lian
130637a3fa upgrade transformers to 4.57.0 (#3201)
* upgrade transformers to 4.57.0

* remove deprecated autoawq and use latest peft

* remove autoawq from setuptools script

* fix imports

* make sure torchvision is installed

* remove support for BetterTransformer

* skip fsdp_qlora_prequant test

* more robust error reporting
2025-10-08 08:43:46 -04:00
VED
377c510e95 sleep model support (#3135)
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-10-08 12:39:21 +01:00
VED
a6bfbe3400 torch_dtype -> dtype (#3177)
* torch_dtype -> dtype

* torch_dtype -> dtype
2025-10-01 15:02:51 +07:00
Dan Saunders
f4376748f3 debug log: multiprocess race condition fix (#3188) 2025-09-26 15:07:39 -04:00
Grant Holmes (Ren)
850c1a5f8d Add FSDP v2 swap memory support + QLoRA compatibility fixes (#3167)
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-09-26 10:23:59 +01:00
NanoCode012
7fa8ac40cd Feat(cce): add qwen3_vl, qwen3_vl_moe, granitemoeshared, granitemoehybrid, and upgraded all cce patches (#3178)
* feat: upgrade cce with patches for transformers 4.56

* feat: add missing models to cce readme
2025-09-26 12:11:29 +07:00
Dan Saunders
f9748c4dc5 Cp fix (#3182)
* patch transformers to allow CP + FA2

* nits

* only patch in CP > 1 case
2025-09-25 12:03:50 -04:00
陈华杰
e8b962d47f feat: support training with JSON string tool arguments (#3136)
* feat: support training with JSON string tool arguments; fix PyArrow data type inconsistent error

* feat: raise error for tool call arguments decode

* Add test_chat_templates_tool_call_string_arguments.py

Add test for string arguments

* fix: change to correct qwen3 tokenizer

* fix: update docs to clarify arguments json

* chore: lint

* fix: duplicate

* chore: revert

* feat: add error to faq

* fix: remove duplicate fixture

---------

Co-authored-by: caoqinping <caoqinping@lixiang.com>
Co-authored-by: gamersover-blog <1611885128@qq.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-09-25 12:06:21 +07:00
NanoCode012
55d1be2ae6 fix: unify default for conversations_field [skip-e2e] (#3070)
* fix: unify default for conversations_field

* fix: suggestion to remove defaults
2025-09-23 21:22:15 +07:00
NanoCode012
08d831c3d5 Feat: add qwen3-next (w packing+cce) (#3150)
* feat: upgrade cce for qwen3-next

* feat: add sample qwen3 config

* feat: add packing patch for chunk_gated_delta_rule

* feat: add qwen3 link

* fix: tuple name

* feat: add tested qwen3 config

* fix: improve log

* feat: add patch for fla without packing

* fix: remove fla patch for standard mode

* feat: enable packing

* feat: add qwen3-next tests

* chore: move tests
2025-09-23 11:31:15 +07:00
AlexHT Hung
7be8740c5c fix(rl): pass max_prompt_len to training args as max_prompt_length (#3113)
* pass max_prompt_len to training args as max_prompt_length

* Update rl.py

* refactor

* format

* fix: default for max_prompt_length

* fix: defaults for trainer

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-09-19 17:34:28 +07:00
NanoCode012
c51d6b06c3 feat: add apertus model and cce (#3144) [skip ci]
* feat: add apertus, glm4v, glm4v_moe cce

* fix: arcee docs

* feat: add apertus

* feat: added vram usage

* fix: add apertus note

* feat: update doc on apertus xielu

* fix: add monkeypatch for xielu activation issue

* fix: simplify env

* feat: pin commit

* feat: add packing

* chore: move patch calling

* Update examples/apertus/README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update examples/apertus/README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update examples/apertus/README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-09-19 17:34:04 +07:00
NanoCode012
09959fac70 Feat: add Magistral Small 2509 and native mistral3 tokenizer support (#3165)
* feat: update mistral common

* feat: add mistral3processor

* fix: loading

* fix: cast pixel_values to fp32

* fix: image tensor conversion

* feat: add FA2 support for pixtral based models

* fix: update mistral small 3.1 to use native tokenizer

* fix: install tips

* fix: improve info on sample dataset files

* chore: move mistral configs into subfolders

* fix: remove unneeded patch

* fix: indent

* feat: add integration tests

* chore: move

* feat: add magistral 2509 docs and example

* fix: convert tensor to bool

* feat: expand tests

* chore: move tests
2025-09-18 15:42:20 +07:00
Dan Saunders
4065bc14c6 Debug log, logging improvements (#3159)
* simplify logging

* remove comment

* progress on debug.log

* add debug-level logger for file log

* simplify

* case insensitivity; 3rd party logging improvements

* simplify

* fix

* tests

* lint

* nits

* nit

* Update tests/test_utils_tee.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* cleanup / comments

* fix

* oops

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-17 13:27:03 -04:00
Wing Lian
86d6ee7c05 upgrade trl and accelerate (#3161)
* upgrade trl==0.23.0

* upgrade accelerate patch fix

* add hints when using gradient_checkpointing with DPO

* set gradient-checpointing properly
2025-09-16 14:53:01 -04:00
Wing Lian
d4cff1b7bb improve setting of NCCL_P2P_DISABLE on runpod (#3132) [skip ci]
* improve setting of NCCL_P2P_DISABLE on runpod

* use recs from review
2025-09-16 14:52:45 -04:00
Wing Lian
1ef6c196f7 setup env vars for ray train for FSDP (#3130) [skip ci] 2025-09-16 14:52:29 -04:00
salman
58d67bf98d Migrate QAT API; fix axolotl quantize for QAT-ed models; add NVFP4 (#3107) 2025-09-12 10:55:50 +01:00
salman
9406c0c488 log before eval step (#3148) [skip-ci] 2025-09-11 11:19:30 +01:00
Dan Saunders
1b53c49e1a text diffusion training plugin (#3067)
* diffusion training plugin

* cleanup

* nits

* fixes + improvements

* add back in reinit_weights (clobbered?); masking / pretrain fixes

* nits

* cleanup; tests draft

* sample generation, tests fixes

* fixes

* nits

* add inference support; add auto-mask token support

* nits

* nits

* progress

* simplify logging

* lint

* prefix args with diffusion_

* coderabbito

* tests fix

* nit

* nits

* cleanup + nits

* nits

* fix SFT sample gen

* fixes

* fix

* comments

* comments

* lint

* reward model lora fix

* cleanup; fix pretraining_dataset case

* gradio inference

* update cfgs

* update cfgs

* train, generation parity, cleanup

* fix

* simplify

* test

* test fix
2025-09-10 20:27:00 -04:00