* fix: drop long seq even if not sample packing
* fix: logging import
* fix: handle cfg being passed as None
* fix: try to fix logging
* fix: refactor call to not use accelerate log
* fix: try to fix circular import issue
* fix: don't drop sequences when skipping prepare
* chore: remove duplicate line
* fix: update warning to mention that sequences will be trimmed
* fix: do not drop seq if input_ids don't exist
* fix: increase RM unittest sequence length to reduce trim warnings
* fix: solve conflicts
* fix: default min_seq_len in case of None
* refactor trainer to prevent circular dependencies later
fix loader default
KD dataset loading and KD with logprobs
filter bad rows
make batch smaller
handle padding/collation for KD datasets
make it work
flipped the slice
cross entropy loss coefficient during KD
make sure to multiply against the correct loss
chore: lint
triton wip
no where support
v2 trial
no torch.exp inside triton kernel
no log etc
no torch.tensor
v3
fix kwarg
don't use triton for now
better rescaling for temperatures
hash for temperature too
use kd_alpha in the correct loss method
fix kd loss so it's causal (fixes repeating tokens)
var naming and add todo
chore: lint
refactor so we can easily add new loss functions
add license block
remove references to triton kd for now
handle token/logprob shifting
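The KD-loss commits above (temperature rescaling, kd_alpha, making the loss causal, token/logprob shifting) roughly describe a forward-KL distillation loss over the teacher's stored top-k logprobs, blended with the ordinary cross-entropy loss. A minimal sketch of that idea, with all names (kd_temperature, kd_alpha, the tensor layout) being illustrative rather than the exact trainer code:

```python
import torch
import torch.nn.functional as F


def kd_loss_sketch(
    student_logits,      # (batch, seq_len, vocab)
    teacher_token_ids,   # (batch, seq_len - 1, top_k) teacher top-k token ids
    teacher_logprobs,    # (batch, seq_len - 1, top_k) teacher logprobs, already shifted
    label_mask,          # (batch, seq_len - 1) 1 where the target token is trainable
    ce_loss,             # ordinary causal-LM cross-entropy loss (scalar tensor)
    kd_temperature=1.0,
    kd_alpha=0.5,
):
    # Shift student logits so position t scores token t+1 (the "causal" fix);
    # later commits move this shift into preprocessing instead of the loss.
    student_logits = student_logits[:, :-1, :]

    # Gather the student scores at the teacher's top-k token ids and apply the
    # distillation temperature; compute in full precision for stability.
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_token_ids)
    student_logprobs = F.log_softmax(student_topk.float() / kd_temperature, dim=-1)

    # Forward KL against the teacher's (re-normalized) top-k distribution.
    teacher_probs = teacher_logprobs.float().exp()
    kl = (teacher_probs * (teacher_logprobs.float() - student_logprobs)).sum(dim=-1)
    kd = (kl * label_mask).sum() / label_mask.sum().clamp(min=1)

    # Blend distillation loss with the standard cross-entropy loss via kd_alpha.
    return kd_alpha * kd + (1.0 - kd_alpha) * ce_loss
```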
support for custom trainer classes from plugins
refactor kd chat template loader
move more things to kd plugin
remove moved class from import
make plugin setup concise
increase logging around loading plugins
add copyrights
remove duplicate code
more info on preprocess for kd and fix import
be a bit pickier about loading dynamic prompt strategies
kd sample packing
make loss torch script compat
support streaming for processing sft datasets?
improve iterable support
ensure that batch vs single is done properly
tweak check for batched prompt data
reward can use same batch check
fix reward trainer calls for tokenization
improve check for batched
reward model doesn't work well with batched
add kd trainer e2e test
linting
rename test files so it gets picked up
make the kd e2e fit in vram for ci and add lora version
set lora_dropout explicitly
lower lr
make sure to set tokenizer from l3 70b and save safetensors
make sure to use the correct tokenizer
fix adapter model check
make sure to use tensorboard to capture loss for checks
chore: lint
chore: lint
improve logprob masking and shift in trainer
more fixes
try tests for kd on l40s
don't shift student logits for kd
no batching for kd chat templates
make sure to truncate logprobs if there are more than top_k
change up logic so we always truncate to top_k
use iter instead of tuple
fix finding the top-k rather than assuming first position has the correct val
apply z-score scaling to kd
kd loss needs to be calculated in full precision
Always re-normalize teacher distribution
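Several of the commits above (always truncating to top_k, finding the true top-k rather than trusting the stored order, full-precision math, re-normalizing the teacher distribution) amount to cleaning up the stored teacher logprobs before the loss is computed. A hedged sketch of that preprocessing, with illustrative variable names:

```python
import torch


def prepare_teacher_distribution(teacher_logprobs, teacher_token_ids, top_k):
    # Keep the true top-k entries instead of assuming the first positions
    # already hold the largest values (and truncate if more were stored).
    topk_logprobs, topk_idx = torch.topk(teacher_logprobs, k=top_k, dim=-1)
    topk_ids = torch.gather(teacher_token_ids, dim=-1, index=topk_idx)

    # Re-normalize the truncated distribution in full precision so the kept
    # probabilities sum to 1 before the KD loss is computed.
    probs = topk_logprobs.float().exp()
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs.log(), topk_ids
```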
various fixes
* support for configurable top-k/softmax ordering
* add attribute check for filter rows and lint
* fix logic
* handle none case for conversion to int
* fix student logit off by one
* set kd_temp to 1.0 for test loss
* address PR feedback
* misc fixes for garbage collection and L40S w NCCL P2P
* patch bnb fix for triton check
* chore: lint
* change up import
* try patching differently
* remove patch for bnb fix for now
* more verbose checks and tweak train loss threshold
* current
not a clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer
* address comments
* use axolotl train in multigpu tests and add ray tests for multi-gpu
* accelerate uses underscores for main_process_port arg
* chore: lint
* fix order of accelerate args
* include ray train in docker images
* fix bf16 resolution behavior
* move dtype logic
* x
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* rename
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* add to sidebar
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* Apply suggestions from code review
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* Update docs/ray-integration.qmd
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* pre-commit fixes
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* use output_dir instead of hardcoded saves path
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* bugfix storage dir
* change type for resources_per_worker
---------
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* fix for pretrain with packing
* fix model name and loss expected
* make sure to check with micro batch size for pretraining
* change loss thresholds based on parametrization
* make tests smaller for CI
* fix pretrain packing
* fix pretrain packing test
* address pr feedback
* fix: use text_column even when not packing for pretraining
* feat: update test to check when not packing
* chore: lint
* Update src/axolotl/utils/data/pretraining.py
Co-authored-by: Wing Lian <wing.lian@gmail.com>
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* add helper to verify the correct model output file exists
* more checks using helper
* chore: lint
* fix import and relora model check
* workaround for trl trainer saves
* remove stray print
* need to update deepspeed version in extras too
* fix patch import
* fix monkeypatch reloading in tests and deepspeed patch
* remove duplicated functionality fixture
* reset LlamaForCausalLM too in fixtures for cce patch
* reset llama attn too
* disable xformers patch for cce
* skip problematic test on low usage functionality
* bump transformers and trl
* fix: update trainer.log signature
* fix trl trainer.log interfaces
* broken 🦥 with latest transformers
* skip parent, call grandparent - yeah, super janky
* update HF HUB env var and fix reward trainer log since it doesn't directly override log
* also bump accelerate
* patches for llama ga
* detab the code to check
* fix whitespace for patch check
* play nicely with CI tests since we patch every time
* fix pop default in case it doesn't exist
* more tweaks to make patches nicer in CI
* fix detab for when there are possibly multiple patches
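The "detab the code to check" and whitespace-check commits describe verifying that the upstream source still matches an expected snippet before monkeypatching, with both sides dedented so repeated patching in CI does not break the comparison. A rough illustration of the idea; the helper name and usage are hypothetical:

```python
import inspect
import textwrap


def source_contains(fn, expected_snippet: str) -> bool:
    # Dedent ("detab") both the live source and the expected snippet so that
    # indentation inside a class, or a previously applied patch, does not make
    # the comparison fail when the test suite patches more than once.
    source = textwrap.dedent(inspect.getsource(fn))
    return textwrap.dedent(expected_snippet) in source
```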
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* reduce test concurrency to avoid HF rate limiting, test suite parity
* make val_set_size smaller to speed up e2e tests
* more retries for pytest fixture downloads
* val_set_size was too small
* move retry_on_request_exceptions to data utils and add retry strategy
* pre-download ultrafeedback as a test fixture
* refactor download retry into its own fn
* don't import from data utils
* use retry mechanism now for fixtures
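The retry-related commits (moving retry_on_request_exceptions into data utils, reusing it for test fixtures) describe wrapping flaky Hub downloads in a retry loop. A minimal sketch under the assumption that requests exceptions are what gets retried; the decorator below is illustrative, not the exact helper:

```python
import functools
import time

import requests


def retry_on_request_exceptions(max_retries=3, delay=5.0):
    # Retry the wrapped download on transient HTTP failures such as Hub
    # rate limiting, sleeping between attempts before giving up.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except requests.exceptions.RequestException:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```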
* plugin preparation needs to happen first so registration can occur before building the plugin args
use yaml.dump
include dataset and more assertions
* attempt to manually register plugins rather than use fn
* fix fixture
* remove fixture
* move cli test to patched dir
* fix cce validation
* fix optimizer reset
* set states to reset for 8bit optimizers and handle quantile runtime error for embeddings
* fix relora test to check grad_norm
* use flash attn for relora and tweak hyperparams for test
* fix messages field for test dataset
* feat: add cut_cross_entropy
* fix: add to input
* fix: remove from setup.py
* feat: refactor into an integration
* chore: ignore lint
* feat: add test for cce
* fix: set max_steps for liger test
* chore: Update base model following suggestion
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* chore: update special_tokens following suggestion
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* chore: remove with_temp_dir following comments
* fix: plugins aren't loaded
* chore: update quotes in error message
* chore: lint
* chore: lint
* feat: enable FA on test
* chore: refactor get_pytorch_version
* fix: lock cce commit version
* fix: remove subclassing UT
* fix: downcast even if not using FA and config check
* feat: add test to check different attentions
* feat: add install to CI
* chore: refactor to use parametrize for attention
* fix: pytest not detecting test
* feat: handle torch lower than 2.4
* fix args/kwargs to match docs
* use release version cut-cross-entropy==24.11.4
* fix quotes
* fix: use named params for clarity for modal builder
* fix: handle install from pip
* fix: test check only top level module install
* fix: re-add import check
* uninstall existing version if no transformers submodule in cce
* more dataset fixtures into the cache
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* fix: handle legacy conversation data format and check image in data
* feat: add test for llama vision
* feat: add max_steps to test
* fix: incorrect indent and return preprocess
* feat: use smaller model and dataset
* chore: add extra config for sharegpt dataset
* add mhenrichsen/alpaca_2k_test with revision dataset download fixture for flaky tests
* log slowest tests
* pin pynvml==11.5.3
* fix load local hub path
* optimize for speed w smaller models and val_set_size
* replace pynvml
* make the resume from checkpoint e2e faster
* make tests smaller
* see if unsloth installs cleanly in ci
* check unsloth install on regular tests, not sdist
* fix ampere check exception for ci
* use cached_property instead
* add an e2e test for unsloth qlora
* reduce seq len and mbsz to prevent oom in ci
* add checks for fp16 and sdp_attention
* pin unsloth to a specific release
* add unsloth to docker image too
* fix flash attn xentropy patch
* fix loss, add check for loss when using fa_xentropy
* fix special tokens for test
* typo
* test fa xentropy with and without gradient accum
* pr feedback changes
* support separate lr for embeddings, similar to loraplus
* add test case for train w lr embedding scale
* use kwarg for optimizer
* make sure to handle the optimizer creation
* make sure to handle for embedding_lr too
* use smollm for e2e, check for embeddings lr first before wdecay
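The "separate lr for embeddings" commits follow the same pattern as loraplus: build optimizer parameter groups so embedding weights get their own learning rate (and are checked before the weight-decay grouping). A hedged sketch of the parameter-group construction only; the group layout and names are illustrative:

```python
import torch


def embedding_param_groups(model, lr, embedding_lr, weight_decay=0.0):
    embed_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Route the token-embedding / lm_head weights into their own group.
        if "embed_tokens" in name or "lm_head" in name:
            embed_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": embed_params, "lr": embedding_lr, "weight_decay": 0.0},
    ]


# optimizer = torch.optim.AdamW(embedding_param_groups(model, 2e-5, 2e-6))
```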
* Update `get_unpad_data` patching for multipack
* Update src/axolotl/utils/models.py
* Update src/axolotl/utils/models.py
* Add test case
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* remove the bos token from dpo outputs
* don't forget to fix prompt_input_ids too
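The two BOS commits above describe stripping a duplicated leading BOS token from the tokenized DPO fields (including prompt_input_ids). A small illustrative sketch:

```python
def strip_duplicate_bos(token_ids, bos_token_id):
    # Drop a leading BOS so it is not doubled when the prompt and completion
    # ids are concatenated downstream; applied to prompt_input_ids as well.
    if token_ids and token_ids[0] == bos_token_id:
        return token_ids[1:]
    return token_ids
```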
* use processing_class instead of tokenizer
* fix for processing class
* add more test cases for gradient accumulation and fix zero3
* swap out for smaller model
* fix missing return
* fix missing pad_token in config
* support concurrency for multigpu testing
* cast empty deepspeed to empty string for zero3 check
* fix temp_dir as fixture so parametrize works properly
* fix test file for multigpu evals
* don't use default
* don't use default for fsdp_state_dict_type
* don't use llama tokenizer w smollm
* also automatically cancel multigpu for concurrency
* upgrade liger to 0.3.1
* update docs and example
* skip duplicate code check
* Update src/axolotl/integrations/liger/args.py
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* Update README.md
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* add logging
* chore: lint
* add test case
* upgrade liger and transformers
* also upgrade accelerate
* use kwargs to support patch release
* make sure prepared path is empty for test
* use transformers 4.46.1 since 4.46.2 breaks fsdp
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* remove skipped test
* use mean_resizing_embeddings with qlora and added tokens
* use </s> as pad_token to prevent resize of embeddings
* make sure local hub test saves to a tmp dir
* use Path so concatenation works
* make sure to use tmp_ds_path for data files
* feat: support new arg num_items_in_batch
* use kwargs to manage extra unknown kwargs for now
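The num_items_in_batch commits refer to newer transformers releases passing a num_items_in_batch argument into loss computation; accepting it via **kwargs keeps a trainer override working across versions. A hedged sketch of that signature handling (the subclass and loss body are illustrative, not the actual trainer code):

```python
from transformers import Trainer


class KwargTolerantTrainer(Trainer):
    """Hypothetical subclass; only the signature handling is the point."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Newer transformers releases pass num_items_in_batch so losses can be
        # scaled correctly under gradient accumulation; pop it from **kwargs so
        # the override keeps working on older releases that never send it.
        num_items_in_batch = kwargs.pop("num_items_in_batch", None)
        outputs = model(**inputs)
        loss = outputs["loss"]
        if num_items_in_batch is not None:
            # a sum-then-divide over num_items_in_batch would go here
            pass
        return (loss, outputs) if return_outputs else loss
```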
* upgrade against upstream transformers main
* make sure trl is on latest too
* fix for upgraded trl
* fix: handle trl and transformer signature change
* feat: update trl to handle transformer signature
* RewardDataCollatorWithPadding no longer has max_length
* handle updated signature for tokenizer vs processor class
* invert logic for tokenizer vs processor class
* processing_class, not processor class
* also handle processing class in dpo
* handle model name w model card creation
* upgrade transformers and add a loss check test
* fix install of tbparse requirements
* make sure to add tbparse to req
* feat: revert kwarg to positional kwarg to be explicit
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* add ds zero3 to multigpu biweekly tests
* fix for upstream api change
* use updated accelerate and fix deepspeed tests
* stringify the Path, and run multigpu tests if the multigpu tests change for a PR
* use correct json rather than yaml
* revert accelerate for deepspeed