Commit Graph

186 Commits

Author SHA1 Message Date
Dan Saunders
e4fd7aad0b quartodoc integration 2025-03-21 11:12:09 -04:00
Dan Saunders
c907ac173e adding pre-commit auto-update GH action and bumping plugin versions (#2428)
* adding pre-commit auto-update GH action and bumping plugin versions

* running updated pre-commit plugins

* sorry to revert, but pylint complained

* Update .pre-commit-config.yaml

Co-authored-by: Wing Lian <wing.lian@gmail.com>

---------

Co-authored-by: Dan Saunders <dan@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-03-21 11:02:43 -04:00
Wing Lian
aae4337f40 add 12.8.1 cuda to the base matrix (#2426)
* add 12.8.1 cuda to the base matrix

* use nightly

* bump deepspeed and set no binary

* deepspeed binary fixes hopefully

* install deepspeed by itself

* multiline fix

* make sure ninja is installed

* try with reversion of packaging/setuptools/wheel install

* use license instead of license-file

* try rolling back packaging and setuptools versions

* comment out license for validation for now

* make sure packaging version is consistent

* more parity across tests and docker images for packaging/setuptools
2025-03-21 10:17:25 -04:00
Wing Lian
04f6324833 build cloud images with torch 2.6.0 (#2413)
* build cloud images with torch 2.6.0

* nightlies too
2025-03-13 23:28:51 -04:00
Wing Lian
a4170030ab don't install extraneous old version of pydantic in ci and make sure to run multigpu ci (#2355) 2025-02-21 22:06:29 -05:00
Wing Lian
ffae8d6a95 GRPO (#2307) 2025-02-13 16:01:01 -05:00
NanoCode012
fd8cb32547 chore: remove redundant py310 from tests (#2316) 2025-02-07 21:34:16 -05:00
NanoCode012
5bbad5ef93 feat: add torch2.6 to ci (#2311) 2025-02-07 07:28:54 -05:00
Wing Lian
a971eb4ce6 Torch 2.6 support for base docker image (#2312) 2025-02-05 09:24:02 -05:00
Wing Lian
a20f17689b set MODAL_IMAGE_BUILDER_VERSION=2024.10 to 2024.10 to test latest builder (#2302)
* set MODAL_IMAGE_BUILDER_VERSION=2024.10 to 2024.10 to test latest builder

* chore: lint

* remove fastapi and pydantic extras
2025-01-31 20:19:20 -05:00
Wing Lian
78ce268848 KD Trainer w logprobs (#2303)
* refactor trainer to prevent circular dependencies later

fix loader default
KD dataset loading and KD with logprobs
filter bad rows
make batch smaller
handle padding/collation for KD datasets
make it work
flipped the slice
cross entropy loss coefficient during KD
make sure to multiply against the correct loss
chore: lint
triton wip
no where support
v2 trial
no torch.exp inside triton kernel
no log etc
no torch.tensor
v3
fix kwarg
don't use triton for now
better rescaling for temperatures
hash for temperature too
use kd_alpha in the correct loss method
fix kd loss so it's causal (fixes repeating tokens)
var naming and add todo
chore: lint
refactor so we can easily add new loss functions
add license block
remove references to triton kd for now
handle token/logprob shifting
support for custom trainer classes from plugins
refactor kd chat template loader
move more things to kd plugin
remove moved class from import
make plugin setup concise
increase logging around loading plugins
add copyrights
remove duplicate code
more info on preprocess for kd and fix import
be a bit pickier about loading dynamic prompt strategies
kd sample packing
make loss torch script compat
support streaming for processing sft datasets?
improve iterable support
ensure that batch vs single is done properly
tweak check for batched prompt data
reward can use same batch check
fix reward trainer calls for tokenization
improve check for batched
reward model doesn't work well with batched
add kd trainer e2e test
linting
rename test files so it gets picked up
make the kd e2e fit in vram for ci and add lora version
set lora_dropout explicitly
lower lr
make sure to set tokenizer from l3 70b and save safetensors
make sure to use the correct tokenizer
fix adapter model check
make sure to use tensorboard to capture loss for checks
chore: lint
chore: lint
improve logprob masking and shift in trainer
more fixes
try tests for kd on l40s
don't shift student logits for kd
no batching for kd chat templates
make sure to truncate logprobs if there are more than top_k
change up logic so we always truncate to top_k
use iter instead of tuple
fix finding the top-k rather than assuming first position has the correct val
apply z-score scaling to kd
kd loss needs to be calculated in full precision
Always re-normalize teacher distribution
various fixes

* support for configurable top-k/softmax ordering

* add attribute check for filter rows and lint

* fix logic

* handle none case for conversion to int

* fix student logit off by one

* set kd_temp to 1.0 for test loss

* address PR feedback
2025-01-31 20:18:52 -05:00
Wing Lian
1063d82b51 match the cuda version for 2.4.1 build w/o tmux (#2299) 2025-01-30 11:46:09 -05:00
salman
c071a530f7 removing 2.3.1 (#2294) 2025-01-28 23:23:44 -05:00
Wing Lian
d8b4027200 use 2.5.1 docker images as latest tag as it seems stable (#2198) 2025-01-10 08:35:25 -05:00
Wing Lian
5e0124e2ab update modal version for ci (#2242) 2025-01-09 21:01:02 +00:00
Wing Lian
3c1921e400 add hf cache caching for GHA (#2247)
* add hf cache caching for GHA

* use modal volume to cache hf data

* make sure to update the cache as we add new fixtures in conftest
2025-01-09 20:59:54 +00:00
Wing Lian
7faf2b6e8e Merge group queue (#2248)
* add support for merge groups

* also lint merge groups
2025-01-09 15:49:00 -05:00
Wing Lian
02629c7cdf parity for nightly ci - make sure to install setuptools (#2176) [skip ci] 2024-12-11 20:14:55 -05:00
Wing Lian
d009ead101 fix build w pyproject to respect installed torch version (#2168)
* fix build w pyproject to respect installed torch version

* include in manifest

* disable duplicate code check for now

* move parser so it can be found

* add checks for correct pytorch version so this doesn't slip by again
2024-12-10 16:25:25 -05:00
Wing Lian
6aa31b44c6 make sure to checkout tag before creating release (#2164)
2024-12-09 14:20:16 -05:00
Wing Lian
9001859b0b fix release command (#2163) [skip ci] 2024-12-09 14:12:45 -05:00
Wing Lian
0c25bc07a2 use manual version for now (#2156) 2024-12-08 21:09:12 -05:00
Wing Lian
5e9fa33f3d reduce test concurrency to avoid HF rate limiting, test suite parity (#2128)
* reduce test concurrency to avoid HF rate limiting, test suite parity

* make val_set_size smaller to speed up e2e tests

* more retries for pytest fixture downloads

* val_set_size was too small

* move retry_on_request_exceptions to data utils and add retry strategy

* pre-download ultrafeedback as a test fixture

* refactor download retry into its own fn

* don't import from data utils

* use retry mechanism now for fixtures
2024-12-06 10:20:20 -05:00
Dan Saunders
08fa133177 Fix broken CLI; remove duplicate metadata from setup.py (#2136)
* Fix broken CLI; remove duplicate metadata from setup.py

* Adding tests.yml CLI check

* updating

* remove test with requests to github due to rate limiting

---------

Co-authored-by: Dan Saunders <dan@axolotl.ai>
2024-12-06 10:19:54 -05:00
Dan Saunders
fc973f4322 CLI Implementation with Click (#2107)
* Initial CLI implementation with click package

* Adding fetch command for pulling examples and deepspeed configs

* Automating default options for CliArgs classes

* Mimicking existing no config behavior

* bugfix in choose_config

* Updating fetch to sync instead of re-download

* bugfix

* isort fix

* fixing yaml isort order

* pre-commit fixes

* simplifying argument parsing -- pass through kwargs to do_cli

* make accelerate launch default for non-preprocess commands

* fixing arg handling

* testing None placeholder approach

* removing hacky --use-gpu argument to preprocess command

* Adding brief README documentation for CLI

* remove (New)

* Initial CLI pytest tests

* progress on CLI pytest

* adding inference CLI tests; cleanup

* Refactor train CLI tests to remove various mocking

* Major CLI test refactor; adding remaining CLI codepath test coverage

* pytest fixes

* remove integration markers

* parallelizing examples, deepspeed config downloads; rename test to match other CLI test naming

* moving cli pytest due to isolation issues; cleanup

* testing fixes; various minor improvements

* fix

* tests fix

* Update tests/cli/conftest.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

---------

Co-authored-by: Dan Saunders <dan@axolotl.ai>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-12-05 22:11:48 -05:00
Wing Lian
e2882dd749 drop unnecessary BNB_CUDA_VERSION env var from docker as it just results in warnings (#2121) [skip ci]
* drop unnecessary BNB_CUDA_VERSION env var from docker as it just results in warnings

* make sure to run tests when cicd Dockerfile changes
2024-12-04 12:25:47 -05:00
NanoCode012
bd8436bc6e feat: add cut_cross_entropy (#2091)
* feat: add cut_cross_entropy

* fix: add to input

* fix: remove from setup.py

* feat: refactor into an integration

* chore: ignore lint

* feat: add test for cce

* fix: set max_steps for liger test

* chore: Update base model following suggestion

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* chore: update special_tokens following suggestion

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* chore: remove with_temp_dir following comments

* fix: plugins aren't loaded

* chore: update quotes in error message

* chore: lint

* chore: lint

* feat: enable FA on test

* chore: refactor get_pytorch_version

* fix: lock cce commit version

* fix: remove subclassing UT

* fix: downcast even if not using FA and config check

* feat: add test to check different attentions

* feat: add install to CI

* chore: refactor to use parametrize for attention

* fix: pytest not detecting test

* feat: handle torch lower than 2.4

* fix args/kwargs to match docs

* use release version cut-cross-entropy==24.11.4

* fix quotes

* fix: use named params for clarity for modal builder

* fix: handle install from pip

* fix: test check only top level module install

* fix: re-add import check

* uninstall existing version if no transformers submodule in cce

* more dataset fixtures into the cache

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2024-12-03 08:22:22 -05:00
Wing Lian
9f6d0b5587 use pytest sugar and verbose for more info during ci (#2112) [skip ci]
* use pytest sugar and verbose for more info during ci

* also run test suite when test requirements or cicd.sh changes

* also on PR too
2024-12-02 20:14:40 -05:00
Wing Lian
a4f4a56d77 build causal_conv1d and mamba-ssm into the base image (#2113)
* build causal_conv1d and mamba-ssm into the base image

* also build base images on changes to Dockerfile-base and base workflow yaml
2024-12-02 18:27:46 -05:00
Wing Lian
5f1d98e8fc add e2e tests for Unsloth qlora and test the builds (#2093)
* see if unsloth installs cleanly in ci

* check unsloth install on regular tests, not sdist

* fix ampere check exception for ci

* use cached_property instead

* add an e2e test for unsloth qlora

* reduce seq len and mbsz to prevent oom in ci

* add checks for fp16 and sdp_attention

* pin unsloth to a specific release

* add unsloth to docker image too

* fix flash attn xentropy patch

* fix loss, add check for loss when using fa_xentropy

* fix special tokens for test

* typo

* test fa xentropy with and without gradient accum

* pr feedback changes
2024-11-29 20:38:49 -05:00
Wing Lian
94fc223f6c actions/create-release is unmaintained, and doesn't create proper release notes (#2098) [skip ci] 2024-11-21 14:32:41 -05:00
Wing Lian
db51a9e4cb use pep440 instead of semver (#2088) [skip ci] 2024-11-19 15:02:10 -05:00
Wing Lian
e9c3a2aec0 add missing dunder-init for monkeypatches and add tests for install from sdist (#2085)
* add missing dunder-init for monkeypatches and add tests for install from sdist

* fix gha name

* reduce matrix for sdist test
2024-11-19 12:43:30 -05:00
Wing Lian
5f6f9186e4 make sure action has permission to create release (#2083) [skip ci] 2024-11-19 10:43:02 -05:00
Wing Lian
a77c8a71cf fix brackets on docker ci builds, add option to skip e2e builds [skip e2e] (#2080) [skip ci] 2024-11-19 10:29:31 -05:00
Wing Lian
8403c67156 don't build bdist (#2076) [skip ci] 2024-11-18 12:36:03 -05:00
Wing Lian
c06b8f0243 increase worker count to 8 for basic pytests (#2075) [skip ci] 2024-11-18 11:52:35 -05:00
Wing Lian
2f20cb7ebf upgrade datasets==3.1.0 and add upstream check (#2067) [skip ci] 2024-11-15 19:08:38 -05:00
Wing Lian
71d4030b79 gradient accumulation tests, embeddings w pad_token fix, smaller models (#2059)
* add more test cases for gradient accumulation and fix zero3

* swap out for smaller model

* fix missing return

* fix missing pad_token in config

* support concurrency for multigpu testing

* cast empty deepspeed to empty string for zero3 check

* fix temp_dir as fixture so parametrize works properly

* fix test file for multigpu evals

* don't use default

* don't use default for fsdp_state_dict_type

* don't use llama tokenizer w smollm

* also automatically cancel multigpu for concurrency
2024-11-14 12:59:00 -05:00
Wing Lian
ba219b51a5 fix duplicate base build (#2061) [skip ci] 2024-11-14 10:31:19 -05:00
Wing Lian
5be8e13d35 make sure to add tags for versioned tag on cloud docker images (#2060) 2024-11-14 10:24:49 -05:00
Wing Lian
659ee5d723 don't cancel the tests on main automatically for concurrency (#2055) [skip ci] 2024-11-13 17:07:41 -05:00
Wing Lian
c5eb9ea2c2 fix push to main and tag semver build for docker ci (#2054) 2024-11-13 14:04:28 -05:00
Wing Lian
01881c3113 make sure to tag images in docker for tagged releases (#2051) [skip ci]
* make sure to tag images in docker for tagged releases

* fix tag event
2024-11-13 13:15:49 -05:00
Wing Lian
0e8eb96e07 run pypi release action on tag create w version (#2047) 2024-11-13 10:21:48 -05:00
NanoCode012
28924fc791 feat: cancel ongoing tests if new CI is triggered (#2046) [skip ci] 2024-11-13 10:06:59 -05:00
Wing Lian
f68fb71005 update actions version for node16 deprecation (#2037) [skip ci]
* update actions version for node16 deprecation

* update pre-commit/action to use 3.0.1 for actions/cache@v4 dep

* update docker/setup-buildx-action too to v3
2024-11-11 15:09:11 -05:00
Wing Lian
9bc3ee6c75 add axolotlai docker hub org to publish list (#2031)
* add axolotlai docker hub org to publish list

* fix to use latest actions docker metadata version

* fix list in yaml for expected format for action

* missed a change
2024-11-11 09:48:19 -05:00
Wing Lian
e20b15bee3 make publish to pypi manually dispatchable as a workflow (#2026) [skip ci] 2024-11-08 14:18:16 -05:00
Wing Lian
3cb2d75de1 upgrade pytorch to 2.5.1 (#2024) 2024-11-08 10:46:24 -05:00