Commit Graph

1922 Commits

Author SHA1 Message Date
Sung Ching Liu
0ef1f011fe Merge branch 'main' into flx_attn_support 2025-02-11 23:31:56 -05:00
Sung Ching Liu
44f64ab627 Update faq.qmd (#2319)
* Update faq.qmd

Added Q&A for being stuck on saving preprocessed datasets

* Update faq.qmd

added details on preprocessing on cpu

* Update faq.qmd

* Update faq.qmd
2025-02-11 13:18:31 -05:00
NanoCode012
826f1b1494 feat(doc): Add multi-node torchrun info (#2304) 2025-02-08 06:02:02 -05:00
NanoCode012
526e5ee8b8 fix(config): missing config not being documented and fix model_ override (#2317)
* fix(config): missing config not being documented and fix model_ space override

* fix: delete redundant field
2025-02-08 06:01:48 -05:00
NanoCode012
fd8cb32547 chore: remove redundant py310 from tests (#2316) 2025-02-07 21:34:16 -05:00
NanoCode012
e48e2df4dd feat: update FA to 2.7.4.post1 which includes torch2.6 binary (#2315) 2025-02-07 21:34:01 -05:00
Wing Lian
b7616022ab bump transformers to 4.48.3 (#2318) 2025-02-07 21:33:44 -05:00
Wing Lian
1faf1a5c5a batch add of spectrum snr results (#2320) 2025-02-07 21:33:14 -05:00
Sunny Liu
c0a1d205c7 packed doc mask starts at 1, 0 means masked out 2025-02-07 14:44:52 -05:00
NanoCode012
5bbad5ef93 feat: add torch2.6 to ci (#2311) 2025-02-07 07:28:54 -05:00
Wing Lian
a971eb4ce6 Torch 2.6 support for base docker image (#2312) 2025-02-05 09:24:02 -05:00
Sunny Liu
d0e739da24 attempt at getting around bf16 error 2025-02-04 21:57:21 -05:00
Sunny Liu
3f6be519d5 stack 2025-02-04 21:25:13 -05:00
Sunny Liu
adcbc7459b misc 2025-02-04 21:17:50 -05:00
Sunny Liu
470ba65c44 make doc mask instead of the whole block mask in collator 2025-02-04 20:27:39 -05:00
NanoCode012
a620d481e2 fix: drop long seq even if not sample packing (#2211)
* fix: drop long seq even if not sample packing

* fix: logging import

* fix: cfg passed being none

* fix: try to fix logging

* fix: refactor call to not use accelerate log

* fix: try to fix circular import issue

* fix: don't drop when skip prepare

* chore: remove duplicate line

* fix: update warning to mention that sequences will be trimmed

* fix: do not drop seq if input_ids don't exist

* fix: increase RM unittest sequence length to reduce trim warnings

* fix: solve conflicts

* fix: default min_seq_len in case of None
2025-02-04 09:43:35 -05:00
Sunny Liu
8e1adc154d stuff 2025-02-02 20:36:14 -05:00
Sunny Liu
e5b36900e4 misc 2025-02-02 20:32:03 -05:00
Sunny Liu
9f6c89b12b undo my stupidity 2025-02-02 20:25:53 -05:00
Sunny Liu
b0871c8d3b attempt - mask padding 2025-02-02 20:18:49 -05:00
bursteratom
d3ea379a23 figure out slight diff from flash result 2025-02-02 01:45:54 -05:00
bursteratom
0ebab63309 test 2025-02-02 01:27:15 -05:00
bursteratom
e98581f6f5 BLOCK SIZE 2025-02-02 01:22:23 -05:00
bursteratom
b832b11c8f stuff 2025-02-02 00:51:43 -05:00
bursteratom
b692d394b1 more test 2025-02-02 00:48:57 -05:00
bursteratom
2319e5276d more test 2025-02-02 00:48:15 -05:00
bursteratom
9a43a0925d more test 2025-02-02 00:45:30 -05:00
bursteratom
10de67e8ea more test 2025-02-02 00:43:41 -05:00
bursteratom
fa7355404c test 2025-02-02 00:38:35 -05:00
bursteratom
907424a2e8 stuff 2025-02-02 00:29:09 -05:00
Sunny Liu
3f4fd3c1eb remove padding self attention 2025-02-01 22:47:10 -05:00
Wing Lian
158330ab60 [feature] sweeps (#2171) 2025-02-01 21:11:18 -05:00
Wing Lian
80e1468b8d better handling of multipack dataset length (#2296) 2025-02-01 21:10:34 -05:00
Sunny Liu
48c3c47071 vanills mask 2025-02-01 14:23:37 -05:00
Sunny Liu
3ed9c117fb try vanilla mask 2025-02-01 14:09:13 -05:00
Wing Lian
a20f17689b set MODAL_IMAGE_BUILDER_VERSION=2024.10 to 2024.10 to test latest builder (#2302)
* set MODAL_IMAGE_BUILDER_VERSION=2024.10 to 2024.10 to test latest builder

* chore: lint

* remove fastapi and pydantic extras
2025-01-31 20:19:20 -05:00
Wing Lian
78ce268848 KD Trainer w logprobs (#2303)
* refactor trainer to prevent circular dependencies later

fix loader default
KD dataset loading and KD with logprobs
filter bad rows
make batch smaller
handle padding/collation for KD datasets
make it work
flipped the slice
cross entropy loss coefficient during KD
make sure to multiply against the correct loss
chore: lint
triton wip
no where support
v2 trial
no torch.exp inside triton kernel
no log etc
no torch.tensor
v3
fix kwarg
don't use triton for now
better rescaling for temperatures
hash for temperature too
use kd_alpha in the correct loss method
fix kd loss so it's causal (fixes repeating tokens)
var naming and add todo
chore: lint
refactor so we can easily add new loss functions
add license block
remove references to triton kd for now
handle token/logprob shifting
support for custom trainer classes from plugins
refactor kd chat template loader
move more things to kd plugin
remove moved class from import
make plugin setup concise
increase logging around loading plugins
add copyrights
remove duplicate code
more info on preprocess for kd and fix import
be a bit pickier about loading dynamic prompt strategies
kd sample packing
make loss torch script compat
support streaming for processing sft datasts?
improve iterable support
ensure that batch vs single is done properly
tweak check for batched prompt data
reward can use same batch check
fix reward trainer calls for tokenization
improve check for batched
reward model doesn't work well with batched
add kd trainer e2e test
linting
rename test files so it gets picked up
make the kd e2e fit in vram for ci and add lora version
set lora_dropout explicitly
lower lr
make sure to set tokenizer from l3 70b and save safetensors
make sure to use the correct tokenizer
fix adapter model check
make sure to use tensorboard to capture loss for checks
chore: lint
chore: lint
improve logprob masking and shift in trainer
more fixes
try tests for kd on l40s
don't shift student logits for kd
no batching for kd chat templates
make sure to truncate logprobs if there are more than top_k
change up logic so we always truncate to top_k
use iter instead of tuple
fix finding the top-k rather than assuming first position has the correct val
apply z-score scaling to kd
kd loss needs to be calculated in full precision
Always re-normalize teacher distribution
various fixes

* support for configurable top-k/softmax ordering

* add attribute check for filter rows and lint

* fix logic

* handle none case for conversion to int

* fix student logit off by one

* set kd_temp to 1.0 for test loss

* address PR feedback
2025-01-31 20:18:52 -05:00
NanoCode012
d425d5d3c3 fix: add warning for invalid eval_steps or save_steps (#2298) 2025-01-31 08:58:25 -05:00
Wing Lian
cf17649ef3 Misc fixes 20250130 (#2301)
* misc fixes for garbage collection and L40S w NCCL P2P

* patch bnb fix for triton check

* chore: lint

* change up import

* try patching differently

* remove patch for bnb fix for now

* more verbose checks and tweak train loss threshold
2025-01-31 08:58:04 -05:00
Sunny Liu
84960003ed reset llama_patch_multipack.py 2025-01-30 14:40:18 -05:00
Sunny Liu
93a268e43d --no-verify
fixes silly mistake
2025-01-30 14:08:26 -05:00
Sunny Liu
065f6d477e flex batching WIP 2025-01-30 14:04:59 -05:00
Dan Saunders
6f294c3d8d refactor README; hardcode links to quarto docs; add additional quarto doc pages (#2295)
* refactor README; hardcode links to quarto docs; add additional quarto doc pages

* updates

* review comments

* update

---------

Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-01-30 12:49:21 -05:00
Sunny Liu
96ad741cd5 flex batching WIP 2025-01-30 12:35:25 -05:00
Wing Lian
6f713226dd make save_safetensors: true the default (#2292)
* make save_safetensors: true the default

* revert change to model output check
2025-01-30 11:48:48 -05:00
Wing Lian
1063d82b51 match the cuda version for 2.4.1 build w/o tmux (#2299) 2025-01-30 11:46:09 -05:00
salman
ac471a697a updating to fused (#2293) 2025-01-30 11:45:56 -05:00
Wing Lian
8779997ba5 native support for modal cloud from CLI (#2237)
* native support for modal cloud from CLI

* do lm_eval in cloud too

* Fix the sub call to lm-eval

* lm_eval option to not post eval, and append not extend

* cache bust when using branch, grab sha of latest image tag, update lm-eval dep

* allow minimal yaml for lm eval

* include modal in requirements

* update link in README to include utm

* pr feedback

* use chat template

* revision support

* apply chat template as arg

* add wandb name support, allow explicit a100-40gb

* cloud is optional

* handle accidental setting of tasks with a single task str

* document the modal cloud yaml for clarity [skip ci]

* cli docs

* support spawn vs remote for lm-eval

* Add support for additional docker commands in modal image build

* cloud config shouldn't be a dir

* Update README.md

Co-authored-by: Charles Frye <cfrye59@gmail.com>

* fix annotation args

---------

Co-authored-by: Charles Frye <cfrye59@gmail.com>
2025-01-30 11:34:02 -05:00
bursteratom
ba88bc7840 wip flex block mask creation 2025-01-29 00:25:25 -05:00
Eric Tang
268543a3be Ray Train Axolotl Integration (#2251)
* current

not clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer

* address comments

* use axolotl train in multigpu tests and add ray tests for multi-gpu

* accelerate uses underscores for main_process_port arg

* chore: lint

* fix order of accelerate args

* include ray train in docker images

* current

not clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer

* address comments

* use axolotl train in multigpu tests and add ray tests for multi-gpu

* accelerate uses underscores for main_process_port arg

* chore: lint

* fix order of accelerate args

* include ray train in docker images

* fix bf16 resolution behavior

* move dtype logic

* x

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>

* rename

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>

* add to sidebar

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>

* Apply suggestions from code review

Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>

* Update docs/ray-integration.qmd

Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>

* pre-commit fixes

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>

* use output_dir instead of hardcoded saves path

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* bugfix storage dir

* change type\ for resources_per_worker

---------

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2025-01-29 00:10:19 -05:00