* fix: drop long seq even if not sample packing
* fix: logging import
* fix: handle cfg being passed as None
* fix: try to fix logging
* fix: refactor call to not use accelerate log
* fix: try to fix circular import issue
* fix: don't drop sequences when skipping prepare
* chore: remove duplicate line
* fix: update warning to mention that sequences will be trimmed
* fix: do not drop seq if input_ids don't exist
* fix: increase RM unittest sequence length to reduce trim warnings
* fix: solve conflicts
* fix: default min_seq_len in case of None
* refactor trainer to prevent circular dependencies later
fix loader default
KD dataset loading and KD with logprobs
filter bad rows
make batch smaller
handle padding/collation for KD datasets
make it work
flipped the slice
cross entropy loss coefficient during KD
make sure to multiply against the correct loss
chore: lint
triton wip
no where support
v2 trial
no torch.exp inside triton kernel
no log etc
no torch.tensor
v3
fix kwarg
don't use triton for now
better rescaling for temperatures
hash for temperature too
use kd_alpha in the correct loss method
fix kd loss so it's causal (fixes repeating tokens)
var naming and add todo
chore: lint
refactor so we can easily add new loss functions
add license block
remove references to triton kd for now
handle token/logprob shifting
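The KD-loss commits above (temperature rescaling, kd_alpha, making the loss causal, token/logprob shifting) roughly describe a forward-KL distillation loss over the teacher's stored top-k logprobs, blended with the ordinary cross-entropy loss. A minimal sketch of that idea, with all names (kd_temperature, kd_alpha, the tensor layout) being illustrative rather than the exact trainer code:

```python
import torch
import torch.nn.functional as F


def kd_loss_sketch(
    student_logits,      # (batch, seq_len, vocab)
    teacher_token_ids,   # (batch, seq_len - 1, top_k) teacher top-k token ids
    teacher_logprobs,    # (batch, seq_len - 1, top_k) teacher logprobs, already shifted
    label_mask,          # (batch, seq_len - 1) 1 where the target token is trainable
    ce_loss,             # ordinary causal-LM cross-entropy loss (scalar tensor)
    kd_temperature=1.0,
    kd_alpha=0.5,
):
    # Shift student logits so position t scores token t+1 (the "causal" fix);
    # later commits move this shift into preprocessing instead of the loss.
    student_logits = student_logits[:, :-1, :]

    # Gather the student scores at the teacher's top-k token ids and apply the
    # distillation temperature; compute in full precision for stability.
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_token_ids)
    student_logprobs = F.log_softmax(student_topk.float() / kd_temperature, dim=-1)

    # Forward KL against the teacher's (re-normalized) top-k distribution.
    teacher_probs = teacher_logprobs.float().exp()
    kl = (teacher_probs * (teacher_logprobs.float() - student_logprobs)).sum(dim=-1)
    kd = (kl * label_mask).sum() / label_mask.sum().clamp(min=1)

    # Blend distillation loss with the standard cross-entropy loss via kd_alpha.
    return kd_alpha * kd + (1.0 - kd_alpha) * ce_loss
```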
support for custom trainer classes from plugins
refactor kd chat template loader
move more things to kd plugin
remove moved class from import
make plugin setup concise
increase logging around loading plugins
add copyrights
remove duplicate code
more info on preprocess for kd and fix import
be a bit pickier about loading dynamic prompt strategies
kd sample packing
make loss torch script compat
support streaming for processing sft datasets?
improve iterable support
ensure that batch vs single is done properly
tweak check for batched prompt data
reward can use same batch check
fix reward trainer calls for tokenization
improve check for batched
reward model doesn't work well with batched
add kd trainer e2e test
linting
rename test files so it gets picked up
make the kd e2e fit in vram for ci and add lora version
set lora_dropout explicitly
lower lr
make sure to set tokenizer from l3 70b and save safetensors
make sure to use the correct tokenizer
fix adapter model check
make sure to use tensorboard to capture loss for checks
chore: lint
chore: lint
improve logprob masking and shift in trainer
more fixes
try tests for kd on l40s
don't shift student logits for kd
no batching for kd chat templates
make sure to truncate logprobs if there are more than top_k
change up logic so we always truncate to top_k
use iter instead of tuple
fix finding the top-k rather than assuming first position has the correct val
apply z-score scaling to kd
kd loss needs to be calculated in full precision
Always re-normalize teacher distribution
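Several of the commits above (always truncating to top_k, finding the true top-k rather than trusting the stored order, full-precision math, re-normalizing the teacher distribution) amount to cleaning up the stored teacher logprobs before the loss is computed. A hedged sketch of that preprocessing, with illustrative variable names:

```python
import torch


def prepare_teacher_distribution(teacher_logprobs, teacher_token_ids, top_k):
    # Keep the true top-k entries instead of assuming the first positions
    # already hold the largest values (and truncate if more were stored).
    topk_logprobs, topk_idx = torch.topk(teacher_logprobs, k=top_k, dim=-1)
    topk_ids = torch.gather(teacher_token_ids, dim=-1, index=topk_idx)

    # Re-normalize the truncated distribution in full precision so the kept
    # probabilities sum to 1 before the KD loss is computed.
    probs = topk_logprobs.float().exp()
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs.log(), topk_ids
```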
various fixes
* support for configurable top-k/softmax ordering
* add attribute check for filter rows and lint
* fix logic
* handle none case for conversion to int
* fix student logit off by one
* set kd_temp to 1.0 for test loss
* address PR feedback
* misc fixes for garbage collection and L40S w NCCL P2P
* patch bnb fix for triton check
* chore: lint
* change up import
* try patching differently
* remove patch for bnb fix for now
* more verbose checks and tweak train loss threshold
* current
not a clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer
* address comments
* use axolotl train in multigpu tests and add ray tests for multi-gpu
* accelerate uses underscores for main_process_port arg
* chore: lint
* fix order of accelerate args
* include ray train in docker images
* fix bf16 resolution behavior
* move dtype logic
* x
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* rename
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* add to sidebar
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* Apply suggestions from code review
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* Update docs/ray-integration.qmd
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* pre-commit fixes
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* use output_dir instead of hardcoded saves path
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* bugfix storage dir
* change type for resources_per_worker
---------
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* fix for pretrain with packing
* fix model name and loss expected
* make sure to check with micro batch size for pretraining
* change loss thresholds based on parametrization
* make tests smaller for CI
* fix pretrain packing
* fix pretrain packing test
* address pr feedback
* fix: use text_column even when not packing for pretraining
* feat: update test to check when not packing
* chore: lint
* Update src/axolotl/utils/data/pretraining.py
Co-authored-by: Wing Lian <wing.lian@gmail.com>
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* add helper to verify the correct model output file exists
* more checks using helper
* chore: lint
* fix import and relora model check
* workaround for trl trainer saves
* remove stray print
* need to update deepspeed version in extras too
* fix patch import
* fix monkeypatch reloading in tests and deepspeed patch
* remove duplicated functionality fixture
* reset LlamaForCausalLM too in fixtures for cce patch
* reset llama attn too
* disable xformers patch for cce
* skip problematic test on low usage functionality
* bump transformers and trl
* fix: update trainer.log signature
* fix trl trainer.log interfaces
* broken 🦥 with latest transformers
* skip parent, call grandparent - yeah, super janky
* update HF HUB env var and fix reward trainer log since it doesn't directly override log
* also bump accelerate
* patches for llama ga
* detab the code to check
* fix whitespace for patch check
* play nicely with CI tests since we patch every time
* fix pop default in case it doesn't exist
* more tweaks to make patches nicer in CI
* fix detab for when there are possibly multiple patches
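The "detab the code to check" and whitespace-check commits describe verifying that the upstream source still matches an expected snippet before monkeypatching, with both sides dedented so repeated patching in CI does not break the comparison. A rough illustration of the idea; the helper name and usage are hypothetical:

```python
import inspect
import textwrap


def source_contains(fn, expected_snippet: str) -> bool:
    # Dedent ("detab") both the live source and the expected snippet so that
    # indentation inside a class, or a previously applied patch, does not make
    # the comparison fail when the test suite patches more than once.
    source = textwrap.dedent(inspect.getsource(fn))
    return textwrap.dedent(expected_snippet) in source
```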
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* reduce test concurrency to avoid HF rate limiting, test suite parity
* make val_set_size smaller to speed up e2e tests
* more retries for pytest fixture downloads
* val_set_size was too small
* move retry_on_request_exceptions to data utils and add retry strategy
* pre-download ultrafeedback as a test fixture
* refactor download retry into its own fn
* don't import from data utils
* use retry mechanism now for fixtures
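The retry-related commits (moving retry_on_request_exceptions into data utils, reusing it for test fixtures) describe wrapping flaky Hub downloads in a retry loop. A minimal sketch under the assumption that requests exceptions are what gets retried; the decorator below is illustrative, not the exact helper:

```python
import functools
import time

import requests


def retry_on_request_exceptions(max_retries=3, delay=5.0):
    # Retry the wrapped download on transient HTTP failures such as Hub
    # rate limiting, sleeping between attempts before giving up.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except requests.exceptions.RequestException:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```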
* plugin preparation needs to happen first so registration can occur before building the plugin args
use yaml.dump
include dataset and more assertions
* attempt to manually register plugins rather than use fn
* fix fixture
* remove fixture
* move cli test to patched dir
* fix cce validation
* fix optimizer reset
* set states to reset for 8bit optimizers and handle quantile runtime error for embeddings
* fix relora test to check grad_norm
* use flash attn for relora and tweak hyperparams for test
* fix messages field for test dataset
* feat: add cut_cross_entropy
* fix: add to input
* fix: remove from setup.py
* feat: refactor into an integration
* chore: ignore lint
* feat: add test for cce
* fix: set max_steps for liger test
* chore: Update base model following suggestion
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* chore: update special_tokens following suggestion
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* chore: remove with_temp_dir following comments
* fix: plugins aren't loaded
* chore: update quotes in error message
* chore: lint
* chore: lint
* feat: enable FA on test
* chore: refactor get_pytorch_version
* fix: lock cce commit version
* fix: remove subclassing UT
* fix: downcast even if not using FA and config check
* feat: add test to check different attentions
* feat: add install to CI
* chore: refactor to use parametrize for attention
* fix: pytest not detecting test
* feat: handle torch lower than 2.4
* fix args/kwargs to match docs
* use release version cut-cross-entropy==24.11.4
* fix quotes
* fix: use named params for clarity for modal builder
* fix: handle install from pip
* fix: test check only top level module install
* fix: re-add import check
* uninstall existing version if no transformers submodule in cce
* more dataset fixtures into the cache
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* fix: handle legacy conversation data format and check image in data
* feat: add test for llama vision
* feat: add max_steps to test
* fix: incorrect indent and return preprocess
* feat: use smaller model and dataset
* chore: add extra config for sharegpt dataset
* add mhenrichsen/alpaca_2k_test with revision dataset download fixture for flaky tests
* log slowest tests
* pin pynvml==11.5.3
* fix load local hub path
* optimize for speed w smaller models and val_set_size
* replace pynvml
* make the resume from checkpoint e2e faster
* make tests smaller
* see if unsloth installs cleanly in ci
* check unsloth install on regular tests, not sdist
* fix ampere check exception for ci
* use cached_property instead
* add an e2e test for unsloth qlora
* reduce seq len and mbsz to prevent oom in ci
* add checks for fp16 and sdp_attention
* pin unsloth to a specific release
* add unsloth to docker image too
* fix flash attn xentropy patch
* fix loss, add check for loss when using fa_xentropy
* fix special tokens for test
* typo
* test fa xentropy with and without gradient accum
* pr feedback changes
* support separate lr for embeddings, similar to loraplus
* add test case for train w lr embedding scale
* use kwarg for optimizer
* make sure to handle the optimizer creation
* make sure to handle for embedding_lr too
* use smollm for e2e, check for embeddings lr first before wdecay
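The "separate lr for embeddings" commits follow the same pattern as loraplus: build optimizer parameter groups so embedding weights get their own learning rate (and are checked before the weight-decay grouping). A hedged sketch of the parameter-group construction only; the group layout and names are illustrative:

```python
import torch


def embedding_param_groups(model, lr, embedding_lr, weight_decay=0.0):
    embed_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Route the token-embedding / lm_head weights into their own group.
        if "embed_tokens" in name or "lm_head" in name:
            embed_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": embed_params, "lr": embedding_lr, "weight_decay": 0.0},
    ]


# optimizer = torch.optim.AdamW(embedding_param_groups(model, 2e-5, 2e-6))
```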
* Update `get_unpad_data` patching for multipack
* Update src/axolotl/utils/models.py
* Update src/axolotl/utils/models.py
* Add test case
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* remove the bos token from dpo outputs
* don't forget to fix prompt_input_ids too
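The two BOS commits above describe stripping a duplicated leading BOS token from the tokenized DPO fields (including prompt_input_ids). A small illustrative sketch:

```python
def strip_duplicate_bos(token_ids, bos_token_id):
    # Drop a leading BOS so it is not doubled when the prompt and completion
    # ids are concatenated downstream; applied to prompt_input_ids as well.
    if token_ids and token_ids[0] == bos_token_id:
        return token_ids[1:]
    return token_ids
```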
* use processing_class instead of tokenizer
* fix for processing class
* add more test cases for gradient accumulation and fix zero3
* swap out for smaller model
* fix missing return
* fix missing pad_token in config
* support concurrency for multigpu testing
* cast empty deepspeed to empty string for zero3 check
* fix temp_dir as fixture so parametrize works properly
* fix test file for multigpu evals
* don't use default
* don't use default for fsdp_state_dict_type
* don't use llama tokenizer w smollm
* also automatically cancel multigpu for concurrency
* upgrade liger to 0.3.1
* update docs and example
* skip duplicate code check
* Update src/axolotl/integrations/liger/args.py
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* Update README.md
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* add logging
* chore: lint
* add test case
* upgrade liger and transformers
* also upgrade accelerate
* use kwargs to support patch release
* make sure prepared path is empty for test
* use transformers 4.46.1 since 4.46.2 breaks fsdp
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* remove skipped test
* use mean_resizing_embeddings with qlora and added tokens
* use </s> as pad_token to prevent resize of embeddings
* make sure local hub test saves to a tmp dir
* use Path so concatenation works
* make sure to use tmp_ds_path for data files
* feat: support new arg num_items_in_batch
* use kwargs to manage extra unknown kwargs for now
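The num_items_in_batch commits refer to newer transformers releases passing a num_items_in_batch argument into loss computation; accepting it via **kwargs keeps a trainer override working across versions. A hedged sketch of that signature handling (the subclass and loss body are illustrative, not the actual trainer code):

```python
from transformers import Trainer


class KwargTolerantTrainer(Trainer):
    """Hypothetical subclass; only the signature handling is the point."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Newer transformers releases pass num_items_in_batch so losses can be
        # scaled correctly under gradient accumulation; pop it from **kwargs so
        # the override keeps working on older releases that never send it.
        num_items_in_batch = kwargs.pop("num_items_in_batch", None)
        outputs = model(**inputs)
        loss = outputs["loss"]
        if num_items_in_batch is not None:
            # a sum-then-divide over num_items_in_batch would go here
            pass
        return (loss, outputs) if return_outputs else loss
```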
* upgrade against upstream transformers main
* make sure trl is on latest too
* fix for upgraded trl
* fix: handle trl and transformer signature change
* feat: update trl to handle transformer signature
* RewardDataCollatorWithPadding no longer has max_length
* handle updated signature for tokenizer vs processor class
* invert logic for tokenizer vs processor class
* processing_class, not processor class
* also handle processing class in dpo
* handle model name w model card creation
* upgrade transformers and add a loss check test
* fix install of tbparse requirements
* make sure to add tbparse to req
* feat: revert kwarg to positional kwarg to be explicit
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* add ds zero3 to multigpu biweekly tests
* fix for upstream api change
* use updated accelerate and fix deepspeed tests
* stringify the Path, and run multigpu tests if the multigpu tests change for a PR
* use correct json rather than yaml
* revert accelerate for deepspeed