Commit Graph

25 Commits

Author SHA1 Message Date
tgoab
530a0c0bf0 Changes from dataset_processes to dataset_num_proc (#3352) [skip ci]
* changes from dataset_processes to dataset_num_proc

* deprecation message improved

---------

Co-authored-by: Juliana Nieto Cárdenas <jnietoca@purdue.edu>
2026-02-10 17:44:17 +07:00
Wing Lian
fc4e37920b transformers v5 upgrade (#3272)
* Prepare for transformers v5 upgrade

* fix hf cli

* update for hf hub changes

* fix tokenizer apply_chat_template args

* remap include_tokens_per_second

* fix tps

* handle migration for warmup

* use latest hf hub

* Fix scan -> ls

* fix import

* fix for renaming of mistral common tokenizer -> backend

* update for fixed tokenization for llama

* Skip phi35 tests for now

* remove mistral patch fixed upstream in huggingface/transformers#41439

* use namespacing for patch

* don't rely on sdist for e2e tests for now

* run modal ci without waiting too

* Fix dep for ci

* fix imports

* Fix fp8 check

* fsdp2 fixes

* fix version handling

* update fsdp version tests for new v5 behavior

* Fail multigpu tests after 3 failures

* skip known v5 broken tests for now and cleanup

* bump deps

* unmark skipped test

* re-enable test_fsdp_qlora_prequant_packed test

* increase multigpu ci timeout

* skip broken gemma3 test

* reduce timeout back to original 120min now that the hanging test is skipped

* fix for un-necessary collator for pretraining with bsz=1

* fix: safe_serialization deprecated in transformers v5 rc01 (#3318)

* torch_dtype deprecated

* load model in float32 for consistency with tests

* revert some test fixtures back

* use hf cache ls instead of scan

* don't strip fsdp_version

more fsdp_version fixes for v5
fix version in fsdp_config
fix aliasing
fix fsdp_version check
check fsdp_version is 2 in both places

* Transformers v5 rc2 (#3347)

* bump dep

* use latest fbgemm, grab model config as part of fixture, un-skip test

* import AutoConfig

* don't need more problematic autoconfig when specifying config.json manually

* add fixtures for argilla ultrafeedback datasets

* download phi4-reasoning

* fix arg

* update tests for phi fast tokenizer changes

* use explicit model types for gemma3

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>

* fix: AutoModelForVision2Seq -> AutoModelForImageTextToText

* chore: remove duplicate

* fix: attempt fix gemma3 text mode

* chore: lint

* ga release of v5

* need property setter for name_or_path for mistral tokenizer

* vllm not compatible with transformers v5

* setter for chat_template w mistral too

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
2026-01-27 17:08:24 -05:00
VED
d0d26d5064 feat: Add GDPO Support (#3353)
* gdpo support - test left

* lint

* fixes for vllm serv

* test advantages

* docs

* lint

* lint =

* gdpo simple + lint

* lint nit

* example

* lint

* trl 0.27.0

* blocklist

* test assert rmv

* add validation check for GDPO + sum_then_normalize

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2026-01-21 17:22:45 -05:00
salman
bbd3486f57 Distributed Muon Optimizer (#3264)
* init

* working

* updating configs

* removing unneeded files

* lint

* comments

* lint

* fix regex match

* bump contribs version

* comments

* fixing tests and imports

* muon imports in test v2

* test cleanup

* bump contribs version

---------

Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
2025-12-19 10:43:47 -05:00
Wing Lian
bfdc9a8249 upgrade trl and other hf deps (#3249)
* upgrade trl and other hf deps

* skip simpo for now
2025-11-06 16:06:03 -05:00
VED
cd856b45b1 feat:add support dataset_num_processes (#3129) [skip ci]
* feat:add support dataset_num_processes

* chore

* required changes

* requested changes

* required changes

* required changes

* required changes

* elif get_default_process_count()

* add:del data

* Update cicd/Dockerfile.jinja

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update cicd/single_gpu.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2025-10-13 17:18:12 +07:00
Dan Saunders
79ddaebe9a Add ruff, remove black, isort, flake8, pylint (#3092)
* black, isort, flake8 -> ruff

* remove unused

* add back needed import

* fix
2025-08-23 23:37:33 -04:00
Wing Lian
d6b81b3683 update training args check for new defaults (#3051) [skip ci]
* update training args check for new defaults

* skip check for now
2025-08-10 11:26:22 -04:00
salman
294c7fe7a6 Distributed/ND-Parallel (#2977) 2025-07-31 15:25:02 -04:00
Wing Lian
db5f6f4693 limit num_proc when saving datasets to disk (#2948) [skip ci]
* limit num_proc when saving datasets to disk

* enforce at least 1 in case it rounds down to 0, and sane divisor is at least 8 rows per worker to save

* update fixtures with dataset processes since that should never be NoneType

* improve reusability for tests
2025-07-21 11:39:38 -04:00
Wing Lian
cd079b5536 Tensor parallel w DeepSpeed AutoTP (#2574)
* support for deepspeed autotp

* bump to latest deepspeed that supports deepcompile too

* add deepcompile support too

* fix total steps calculation for TP

* setup fixture for tp

* update ds config to ensure weights are gathered for checkpoint

* fix duplicate validation names

* chore: lint
2025-07-14 21:33:48 -04:00
Dan Saunders
00cda8cc70 Data loader refactor (#2707)
* data loading refactor (wip)

* updates

* progress

* pytest

* pytest fix

* lint

* zero_first -> filelock, more simplifications

* small simplification

* import change

* nit

* lint

* simplify dedup

* couldnt resist

* review comments WIP

* continued wip

* minor changes

* fix; remove contrived test

* further refactor

* set default seed in pydantic config

* lint

* continued simplification

* lint

* renaming and nits

* filelock tests

* fix

* fix

* lint

* remove nullable arg

* remove unnecessary code

* moving dataset save fn to shared module

* remove debug print

* matching var naming

* fn name change

* coderabbit comments

* naming nit

* fix test
2025-06-10 19:53:07 -04:00
NanoCode012
6778856804 Fix: RL base feature parity (#2133)
* feat: add num_proc and load from cache for rl mapping

* fix: refactor sft and rl trainer to set same base args

* feat: add report_to to set run name

* fix: consolidate handling of fp16, bf16, tf32 kwarg

* chore: consolidate eval_strat, loraplus, lr sched, max_length

* fix: deprecate old types

* fix: adding missing Any

* fix: max_steps incorrectly set

* fix: remove unnecessary datacollator kwarg insert and pop

* fix: update default max_steps

* fix: add missing weight_decay handling

* fix: ignore max_length for grpo

* feat: update CI on trainer_builder

* fix: comments

* improve handling of warmup/logging steps

* use transformers default for logging steps, not None

* fix: remove redundant override

* fix: lint

* feat: allow custom optim for rl methods

* fix: duplicate optim setting

* fix(test): set sequence_parallel_degree default in base cfg

* feat: add handling for seed and SP/ring-attn config

* chore: add back return typing from rebase

* fix(test): use RLType directly to skip needing to validate

* feat: split training builder into sub modules

* fix: remove deprecated clause

* chore: add missing config to doc

* fix: update quarto autodoc

* fix: import path for trainer builder and submodules

* fix: remove redundant configs from rebase mistake

* chore: simplify dynamo check

* fix: optimizer_cls_and_kwargs to be passed into trainer_kwargs

* fix: add missing rex from rebase

* fix: move pop optimizer_cls_and_kwargs

* fix: pop optimizer cls in rl too

* fix: leftover bug from rebase

* fix: update handling of trainer_cls in RL

* fix: address pr feedback

* feat: call hook_pre_create_trainer for rl

* chore: lint

* fix: return notimplemented for ppo

* feat: moved torch compile to base and refactor collator setting

* chore: remove unused importlib.util import

* fix: optimizer cls not being popped

* feat: move epoch setting to base

* fix: catch unhandled custom optimizer

* fix: remove duplicate lora plus setting

* chore: refactor if condition

* chore: refactor set_base_training_args into smaller modules

* fix: address TrainerBuilderBase class variables to instance var

* fix: add handling for beta3 and epsilon2

* fix: change to pass dict via arg instead of updating dict

* chore: simplify if condition

* fix: force access to lr & weight decay in case not provided to early error

* fix: remove log sweep

* chore: refactor if condition

* fix: address renamed cfg

* fix: improve handling of cosine hyp

* fix: remove unused params

* chore: refactor

* chore: clarify doc safetensors

* fix: update import path to be unified following comments

* fix: duplicate kwargs passed

* feat: return separate trainer_kwargs

* chore: refactor

* chore: refactor based on comments

* chore: refactor based on comments

* fix: move gpustats callback to base

* chore: create trainer_cls_args first based on comments

* fix: ipo label smoothing passed incorrectly

* feat: add optimizer parity for RL methods with test

* feat: add parity for optimizer in RM/PRM and add test

* fix: remove redundant function override for orpo/cpo batch metrics

* fix: improve handling of dpo_label_smoothing and merge issue

* fix: test fixture returning wrong field

* fix: address avoid direct modify fixture

* chore: minor refactor

* Revert "chore: refactor"

This reverts commit 99c8859eb0.

* feat: rename trainer_builder to builders

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-30 11:21:47 +07:00
Dan Saunders
b5f1e53a0f models.py -> loaders/ module refactor (#2680)
* models.py -> loaders/ module refactor

* refactor ModelLoader class

* plugin manager changes

* circular import fix

* pytest

* pytest

* minor improvements

* fix

* minor changes

* fix test

* remove dead code

* coderabbit comments

* lint

* fix

* coderabbit suggestion I liked

* more coderabbit

* review comments, yak shaving

* lint

* updating in light of SP ctx manager changes

* review comment

* review comment 2
2025-05-23 15:51:11 -04:00
NanoCode012
798b5f5cfd fix(RL): address plugin rl overwriting trainer_cls (#2697) [skip ci]
* fix: plugin rl overwrite trainer_cls

* feat(test): add test to catch trainer_cls is not None
2025-05-22 19:19:12 +07:00
NanoCode012
cf0c79d52e fix: minor patches for multimodal (#2441)
* fix: update chat_template

* fix: handle gemma3 showing a lot of no content for turn 0

* fix: remove unknown config from examples

* fix: test

* fix: temporary disable gemma2 test

* fix: stop overwriting config.text_config unnecessarily

* fix: handling of set cache to the text_config section

* feat: add liger gemma support and bump liger to 0.5.5

* fix: add double use_cache setting

* fix: add support for final_logit_softcap in CCE for gemma2/3

* fix: set use_cache before model load

* feat: add missing layernorm override

* fix: handle gemma3 rmsnorm

* fix: use wrapper to pass dim as hidden_size

* fix: change dim to positional

* fix: patch with wrong mlp

* chore: refactor use_cache handling

* fix import issues

* fix tests.e2e.utils import

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-03-31 13:40:12 +07:00
Wing Lian
05f03b541a hf offline decorator for tests to workaround rate limits (#2452) [skip ci]
* hf offline decorator for tests to workaround rate limits

* fail quicker so we can see logs

* try new cache name

* limit files downloaded

* phi mini predownload

* offline decorator for phi tokenizer

* handle meta llama 8b offline too

* make sure to return fixtures if they are wrapped too

* more fixes

* more things offline

* more offline things

* fix the env var

* fix the model name

* handle gemma also

* force reload of modules to recheck offline status

* prefetch mistral too

* use reset_sessions so hub picks up offline mode

* more fixes

* rename so it doesn't seem like a context manager

* fix backoff

* switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline

* include additional dataset

* more fixes

* more fixes

* replace tiny shakespeare dataset

* skip some tests for now

* use more robust check using snapshot download to determine if a dataset name is on the hub

* typo for skip reason

* use local_files_only

* more fixtures

* remove local only

* use tiny shakespeare as pretrain dataset and streaming can't be offline even if precached

* make sure fixtures aren't offline

improve the offline reset
try bumping version of datasets
reorder reloading and setting
prime a new cache
run the tests now with fresh cache
try with a static cache

* now run all the ci again with hopefully a correct cache

* skip wonky tests for now

* skip wonky tests for now

* handle offline mode for model card creation
2025-03-28 19:20:46 -04:00
Dan Saunders
c907ac173e adding pre-commit auto-update GH action and bumping plugin versions (#2428)
* adding pre-commit auto-update GH action and bumping plugin versions

* running updated pre-commit plugins

* sorry to revert, but pylint complained

* Update .pre-commit-config.yaml

Co-authored-by: Wing Lian <wing.lian@gmail.com>

---------

Co-authored-by: Dan Saunders <dan@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-03-21 11:02:43 -04:00
salman
ac471a697a updating to fused (#2293) 2025-01-30 11:45:56 -05:00
Wing Lian
ce5bcff750 various tests fixes for flakey tests (#2110)
* add mhenrichsen/alpaca_2k_test with revision dataset download fixture for flaky tests

* log slowest tests

* pin pynvml==11.5.3

* fix load local hub path

* optimize for speed w smaller models and val_set_size

* replace pynvml

* make the resume from checkpoint e2e faster

* make tests smaller
2024-12-02 17:28:58 -05:00
Wing Lian
cd2d89f467 wip add new proposed message structure (#1904)
* wip add new proposed message structure

* tokenization

* wip

* wip transform builder

* wip make the chat dataset loadable

* wip chatml + llama 3 new chat objects

* chore: lint

* chore: lint

* fix tokenization

* remove dacite dependency since we're using pydantic now

* fix handling when already correctly split in messages

* make sure to remove chat features from tokenized ds

* move chat to be a input transform for messages

* make sure llama3 has the bos token

* remove non-working special token code

* fix messages strat loader
2024-10-13 12:15:18 -04:00
Wing Lian
7d1d22f72f ORPO Trainer replacement (#1551)
* WIP use trl ORPOTrainer

* fixes to make orpo work with trl

* fix the chat template loading

* make sure to handle the special tokens and add_generation for assistant turn too
2024-04-19 17:25:36 -04:00
NanoCode012
ff939d8a64 fix(dataset): normalize tokenizer config and change hash from tokenizer class to tokenizer path (#1298)
* fix(dataset): normalize tokenizer config and change hash from tokenizer class to tokenizer path

* fix: normalize config
2024-03-25 15:34:54 +09:00
Wing Lian
78c5b1979e add gptneox embeddings, fix phi2 inputs, also fix the casting (#1083) 2024-01-10 22:32:43 -05:00
Wing Lian
f243c2186d RL/DPO (#935)
* ipo-dpo trainer

* fix missing abstract method

* chatml template, grad checkpointing kwargs support

* fix steps calc for RL and add dataloader kwargs

* wip to fix dpo and start ppo

* more fixes

* refactor to generalize map fn

* fix dataset loop and handle argilla pref dataset

* set training args

* load reference model on separate gpu if more than one device

* no auto upload to hub for dpo, don't add lora adapters to ref model for dpo

* fixes for rl training

* support for ipo from yaml

* set dpo training args from the config, add tests

* chore: lint

* set sequence_len for model in test

* add RLHF docs
2024-01-04 18:22:55 -05:00