* keep gate in fp32 for loras
* add e2e check for lora w/o flash attention for mixtral to check gate
* add checks for gate in fp32 for mixtral, add typehints to train outputs
* mixtral doesn't support basic lora 🤦
add lora tests @ 16bit and fix gate layer check
fix the parameter name, was using the old disco name
don't lora over the gate so we can check that is in fp32
fix dtype check
* ensure we're using fp16/bf16 for 16bit and qlora is always going to be in uint8
* fix: `train_on_inputs: true` ignored for sharegpt
* enable unit test for train_on_inputs for sharegpt
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* attempt to also run e2e tests that needs gpus
* fix stray quote
* checkout specific github ref
* dockerfile for tests with proper checkout
ensure wandb is dissabled for docker pytests
clear wandb env after testing
clear wandb env after testing
make sure to provide a default val for pop
tryin skipping wandb validation tests
explicitly disable wandb in the e2e tests
explicitly report_to None to see if that fixes the docker e2e tests
split gpu from non-gpu unit tests
skip bf16 check in test for now
build docker w/o cache since it uses branch name ref
revert some changes now that caching is fixed
skip bf16 check if on gpu w support
* pytest skip for auto-gptq requirements
* skip mamba tests for now, split multipack and non packed lora llama tests
* split tests that use monkeypatches
* fix relative import for prev commit
* move other tests using monkeypatches to the correct run
* fix double eos token for chatml
* isolate fix to chatml conversation
* fix add special tokens to include rstrip
* add test for train_on_inputs for sharegpt
* don't use rstrip for chatml
* restore to current phi modeling code from phi-2
* enable gradient checkpointing
* don't cast everything to float32 all the time
* gradient checkpointing for phi2 ParallelBlock module too
* fix enabling flash attn for phi2
* add comment about import
* fix phi2 example
* fix model type check for tokenizer
* revert float32 -> bf16 casting changes
* support fused dense flash attn
* fix the repo for flash-attn
* add package name for subdir pkg
* fix the data collator when not using sample packing
* install packaging for pytests in ci
* also fix setup to not install flash attn fused dense subdir if not extras
* split out the fused-dense-lib in extra requires
* don't train w group_by_length for phi
* update integration test to use phi2
* set max steps and save steps for phi e2e tests
* try to workaround ssave issue in ci
* skip phi2 e2e test for now
* [Feat] streaming multipack
* WIP make continued pretraining work w multipack
* fix up hadrcoding, lint
* fix dict check
* update test for updated pretraining multipack code
* fix hardcoded data collator fix for multipack pretraining
* fix the collator to be the max length for multipack pretraining
* don't bother with latest tag for test
* cleanup docker build/test
---------
Co-authored-by: jinwonkim93@github.com <jinwonkim>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* ipo-dpo trainer
* fix missing abstract method
* chatml template, grad checkpointing kwargs support
* fix steps calc for RL and add dataloader kwargs
* wip to fix dpo and start ppo
* more fixes
* refactor to generalize map fn
* fix dataset loop and handle argilla pref dataset
* set training args
* load reference model on seperate gpu if more than one device
* no auto upload to hub for dpo, don't add lora adapters to ref model for dpo
* fixes for rl training
* support for ipo from yaml
* set dpo training args from the config, add tests
* chore: lint
* set sequence_len for model in test
* add RLHF docs
* bump transformers and update attention class map name
* also run the tests in docker
* add mixtral e2e smoke test
* fix base name for docker image in test
* mixtral lora doesn't seem to work, at least check qlora
* add testcase for mixtral w sample packing
* check monkeypatch for flash attn multipack
* also run the e2e tests in docker
* use all gpus to run tests in docker ci
* use privileged mode too for docker w gpus
* rename the docker e2e actions for gh ci
* set privileged mode for docker and update mixtral model self attn check
* use fp16/bf16 for mixtral w fa2
* skip e2e tests on docker w gpus for now
* tests to validate mistral and mixtral patches
* fix rel import
* Feat: Auto add to modules_to_save when adding tokens
* fix: swap to error instead of warning
* feat: add check when special_tokens differ and add test
* start at index 0
* add test to check for missing turns
* apply black
* Update test_prompt_tokenizers.py
* Update src/axolotl/monkeypatch/fastchat_conversation_turns.py
Co-authored-by: Motoki Wu <tokestermw@gmail.com>
* fix linting
* apply black
* add more tests for llama/sharegpt
* make logic clearer
---------
Co-authored-by: Motoki Wu <tokestermw@gmail.com>
* Respect sequence_len in config for `type: llama2_chat`
It was hardcoded to `4096` I am not sure why? This updates it to pull from the config.
cc: @winglian
* Update llama2_chat.py
* apply black formatting
* fix tokenizer
* update test data
* lint fixtures
* support for mamba
* more mamba fixes
* use fork for mamba kwargs fix
* grad checkpointing doesn't work
* fix extras for mamaba
* mamba loss fix
* use fp32 and remove verbose logging
* mamba fixes
* fix collator for mamba
* set model_type on training_args
* don't save safetensors for mamba
* update mamba config to disable safetensor checkpooints, install for tests
* no evals for mamba tests
* handle save_pretrained
* handle unused safetensors arg
* Feat: Update to handle wandb env better
* chore: rename wandb_run_id to wandb_name
* feat: add new recommendation and update config
* fix: indent and pop disabled env if project passed
* feat: test env set for wandb and recommendation
* feat: update to use wandb_name and allow id
* chore: add info to readme
* add phi modeling from hf
* update for packing and use new modeling class for phi
* update e2e tests for phi to use new model name
* update example phi to also use new phi model name
* use AutoModelForCausalLM for phi lora since sample packing isn't supported
* use tensorboard to see if resume from checkpoint works
* make sure e2e test is either fp16 or bf16
* set max_steps and save limit so we have the checkpoint when testing resuming
* fix test parameters
* support for sharegpt with assistant talking first, better masking of assistant token, allow remap of roles from dataset
* invalid role is actually not possible
* update tokenized fixture for corrected labels
* Allow usage of native Mistral FA when no sample_packing
* fix: do not apply custom patch when sample_pack off
* chore: lint
* chore: pin transformer to v4.35.0.dev0
* fix: split sample_packing to separate test
* Fix(cfg): Check save_strategy cfg conflict with save_steps
* Fix(cfg): Check evaluation_strategy cfg conflict with eval_steps
* chore: add extra check for steps only
* use fastchat conversations template
* require fastchat (fschat) pip install
* handle roles dynamically from conversation
* tweak fastchat conversation with a monkeypatch to get individual turns
* fix up so it works with multiple conversation styles, and don't strip the turns
* fix sharegpt fixture now that we're using a more correct tokenization
* use a new prompter and support fastchat conversation type
* use sharegpt from prompt strategies now
* update docs, add chatml template
* add a newline after im_end token
* ensure we correctly set system message
* update per PR feedback to handle deprecated sharegpt types
* don't add duplicate wandb req
* make sharegpt fields configurable from yml
* llama2 fixes
* don't fail fatally when turns are improper
* phi sequence packing
* sample packing fixes
* fix linting
* fix inference and phi e2e tests
* update phi example now that sample packing works
* wandb import keeps getting moved around
* return without packing prep/len
* fix remove columns
* fix encode arguments
* add error when max steps not set
* fix test
---------
Co-authored-by: Jan Philipp Harries <jphme@users.noreply.github.com>
* fix attetion mask with packing
* set position ids and use block diagonal attn mask
* fix expand mask for multiple batch items, make sure we pad position_ids
* don't move masks to cpu
* use multi pack dataloader w random sampler
* add position_ids back
* more fixes for dataloader integration
* est total tokens, fix field loop
* more fixes, position_ids seems broken
* more fixes for sample packing
* use distributed sampler, avoid accelerate prepare
* use accelerator prepare for dataloader
* fix for position_ids w packing
* Update src/axolotl/utils/dataloader.py
* validation for sample packing and doc
* more fixes for 4k and optimizations
* optimized expand mask fn
* better handling of variance in multipack dataloader length and trainer hanging when it runs out of data
* fix rounding of len of batches to int
* better handling so that all devices have the same dataloader len
* fix step calc for packing
* pass sample packing efficiency to training args
* add a test for the mask expansion for sequence packing
* only process eval dataset for packing if not None
* don't split batches when packing
* weighted CE losses
* weighted CEL fixes
* limit packing to sequences of max seq len
* seq_len_multiple for packing
* make sure the chunk size is an int
* sample_packing_seq_len_multiplier config
* use cumulative seq len with var len flash attn v2 w packing
* properly calculate max len
* fix flash-attn, xformers, packing, support chatml
* fix chatml system prompt for openorca, legacy tokenizer opts
* add chatml
* add unit tests for cum seq lens, add ability to build cu_seq_lens from positional ids, fix prompt test
* fix test and pylint checks
* more packing and dataset optimizations and fixes
* filter w multiple cpus
* more fixes and optimizations
* fixes and go back to distributed sampler since batch sampler won't work
* fix counts by accounting for num devices
* fix steps calculation
* previous accelerate is still most performant
* add numba to requirements.
* use custom distributed checks
* fix sampler to prevent overfit w new epochs
* let's not cleanup the cached datasets
* calculate cum seq lens with pos_ids instead of mask, simplify packing params, fix distributed barrier
* speed optimizations and set accelerate fsdp env vars
* optimize dataset concatenation?
* more optimizations for dataset handling
* fix import for annotation
* manual pre-commit fixes
* another sum optimization and bug fix for calc steps
* fix packing estimations
* fix formatting
* pylint problems
* add back flash attention branch for handling unpacked sequences seperately
* Address PR feedback
* add optional sample packing config params to readme