47 Commits

Author SHA1 Message Date
NanoCode012
40a88e8c4a Feat: Add sharegpt multirole (#1137)
* feat(prompt): support multiple roles for sharegpt

* fix: add handling of empty role back

* feat: rebased and allowed more dynamic roles via config

* fix: variable

* chore: update message

* feat: add vicuna format

* fix: JSON serializable error

* fix: typing

* fix: don't remap for unknown keys

* fix: add roles to pydantic

* feat: add test

* chore: remove leftover print

* chore: remove leftover comment

* chore: remove print

* fix: update test to use chatml
2024-03-19 20:51:49 +09:00
Brian Fitzgerald
b7d8a7dc4d Add Glaive conversation format support (#1365)
* Add Glaive conversation format support

* fix black formatting errors

* Fix black and pylint formatting errors

* only set role_key_tool if provided in the dataset constructor

* Update src/axolotl/prompt_strategies/sharegpt.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* sharegpt test

* tokenizer test

* fix formatting

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-03-10 20:50:25 -04:00
Oleh Kuznetsov
af0243021c Standardize system prompt format for AlpacaPrompter (#1190) [skip ci] 2024-01-24 14:27:01 -05:00
kallewoof
450e04d3c4 fix: remove excessive newlines in system prompt(s) for alpaca (#936) 2023-12-13 16:40:02 +09:00
Wing Lian
14706504e3 various bugfixes (#856)
* various bugfixes

use latest tinyllama release
check if val_set_size is empty first
update sdp and xformers llama patches for updated upstream transformers
fix system prompt when no input
calculate total and total supervised tokens even when not sample packing

* add fix for when eval size is estimated to be too small

* should be len 1 for dataset length

* add catchall kwargs
2023-11-15 12:23:18 -05:00
Wing Lian
1a6309c8a6 cleanup the old multipack dataloader (#841) 2023-11-12 05:39:09 -05:00
Casper
e50ab072e2 Create preprocess CLI (#785)
* Create preprocess CLI

* Print prompt template if debugging

* Add print for unsupported prompters

* Formatting

* Formatting

* Refactor variables

* Formatting

* Formatting

* Formatting

* Formatting
2023-10-26 09:35:42 -04:00
Wing Lian
f30afe4544 misc sharegpt fixes (#723)
* support for sharegpt with assistant talking first, better masking of assistant token, allow remap of roles from dataset

* invalid role is actually not possible

* update tokenized fixture for corrected labels
2023-10-13 11:04:39 -04:00
Wing Lian
e7d3e2dbb6 use fastchat conversations template (#578)
* use fastchat conversations template

* require fastchat (fschat) pip install

* handle roles dynamically from conversation

* tweak fastchat conversation with a monkeypatch to get individual turns

* fix up so it works with multiple conversation styles, and don't strip the turns

* fix sharegpt fixture now that we're using a more correct tokenization

* use a new prompter and support fastchat conversation type

* use sharegpt from prompt strategies now

* update docs, add chatml template

* add a newline after im_end token

* ensure we correctly set system message

* update per PR feedback to handle deprecated sharegpt types

* don't add duplicate wandb req

* make sharegpt fields configurable from yml

* llama2 fixes

* don't fail fatally when turns are improper
2023-09-27 12:10:45 -04:00
Wing Lian
97d3776ce6 split completion text to sequence_len (#616) 2023-09-21 21:51:25 -04:00
kingbri
995557bdf3 Prompters: ShareGPT: Allow for custom system prompts
If a system prompt is present in a conversation, add it instead of
using the default.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-09-01 13:53:05 -04:00
Wing Lian
d2e7f27240 support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
* support user defined prompters, pretokenized datasets in config, local parquet, local arrow files

* fix user defined dataset types

* fix for system prompts

* fix tests

* fix checks for parquet and arrow

* aha moment that d.data_files isn't used

* add documentation for ds_type to add support for parquet and arrow
2023-08-20 09:17:49 -04:00
florian peyron
63fdb5a7fb Error msg for sharegpt if conv has less than 2 msg (#379) 2023-08-14 17:40:40 +09:00
Wing Lian
2bb0b78975 Attention mask and position id fixes for packing (#285)
* fix attention mask with packing

* set position ids and use block diagonal attn mask

* fix expand mask for multiple batch items, make sure we pad position_ids

* don't move masks to cpu

* use multi pack dataloader w random sampler

* add position_ids back

* more fixes for dataloader integration

* est total tokens, fix field loop

* more fixes, position_ids seems broken

* more fixes for sample packing

* use distributed sampler, avoid accelerate prepare

* use accelerator prepare for dataloader

* fix for position_ids w packing

* Update src/axolotl/utils/dataloader.py

* validation for sample packing and doc

* more fixes for 4k and optimizations

* optimized expand mask fn

* better handling of variance in multipack dataloader length and trainer hanging when it runs out of data

* fix rounding of len of batches to int

* better handling so that all devices have the same dataloader len

* fix step calc for packing

* pass sample packing efficiency to training args

* add a test for the mask expansion for sequence packing

* only process eval dataset for packing if not None

* don't split batches when packing

* weighted CE losses

* weighted CEL fixes

* limit packing to sequences of max seq len

* seq_len_multiple for packing

* make sure the chunk size is an int

* sample_packing_seq_len_multiplier config

* use cumulative seq len with var len flash attn v2 w packing

* properly calculate max len

* fix flash-attn, xformers, packing, support chatml

* fix chatml system prompt for openorca, legacy tokenizer opts

* add chatml

* add unit tests for cum seq lens, add ability to build cu_seq_lens from positional ids, fix prompt test

* fix test and pylint checks

* more packing and dataset optimizations and fixes

* filter w multiple cpus

* more fixes and optimizations

* fixes and go back to distributed sampler since batch sampler won't work

* fix counts by accounting for num devices

* fix steps calculation

* previous accelerate is still most performant

* add numba to requirements.

* use custom distributed checks

* fix sampler to prevent overfit w new epochs

* let's not cleanup the cached datasets

* calculate cum seq lens with pos_ids instead of mask, simplify packing params, fix distributed barrier

* speed optimizations and set accelerate fsdp env vars

* optimize dataset concatenation?

* more optimizations for dataset handling

* fix import for annotation

* manual pre-commit fixes

* another sum optimization and bug fix for calc steps

* fix packing estimations

* fix formatting

* pylint problems

* add back flash attention branch for handling unpacked sequences separately

* Address PR feedback

* add optional sample packing config params to readme
2023-08-12 15:14:56 -04:00
NanoCode012
e37d9358e6 Fix(message): Improve error message for bad format (#365) 2023-08-13 01:16:18 +09:00
theobjectivedad
b1f4f7a34d Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var 2023-07-15 12:29:35 +00:00
theobjectivedad
553a86b52c Adding logging enhancement 2023-07-14 07:26:19 -05:00
Wing Lian
3a38271276 add tests and support for loader for sys prompt data 2023-06-25 22:28:07 -04:00
Wing Lian
8d20e0a3d3 initial wip to get sys prompt from dataset 2023-06-25 22:28:07 -04:00
Wing Lian
c7dee56b87 add typehints 2023-06-11 19:52:34 -04:00
Wing Lian
aac4b7691e add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed 2023-06-11 19:42:25 -04:00
NanoCode012
25eeeeba0b Fix sharegpt prompt 2023-05-31 02:55:21 +09:00
NanoCode012
37293dce07 Apply isort then black 2023-05-31 02:53:53 +09:00
NanoCode012
e9650d3ae4 Fix mypy typing 2023-05-31 02:53:53 +09:00
NanoCode012
69722aeef4 Remove fixme disable 2023-05-31 02:53:23 +09:00
NanoCode012
5658717dbd Remove disable too many arg 2023-05-31 02:53:23 +09:00
NanoCode012
cb4f0e9342 Lint prompters.py 2023-05-31 02:53:23 +09:00
NanoCode012
9ac1884323 Fix: Remove base class inherit for CompletionPrompter 2023-05-28 00:51:35 +09:00
Wing Lian
004820209d Update src/axolotl/prompters.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2023-05-25 12:21:02 -04:00
Wing Lian
a5d739b66b fixes w/ example for super basic lora starter 2023-05-25 11:59:08 -04:00
Wing Lian
ce34d64e8a apply black formatting 2023-05-24 22:59:33 -04:00
Wing Lian
e8aacfbd7c more qlora support 2023-05-24 14:33:18 -04:00
Wing Lian
1d5ab84486 optionally be able to specify alpaca or chat style prompts 2023-05-20 18:16:22 -04:00
Wing Lian
13650732f8 concise multiple choice and tldr summarize 2023-05-17 11:29:17 -04:00
Wing Lian
b46bc02f0a add alpaca multiple choice instruct dataset support 2023-05-16 21:45:34 -04:00
Wing Lian
5e37144754 fix prompters, especially the sharegpt prompter 2023-05-15 22:15:36 -04:00
Wing Lian
2bc1a5bde1 black formatting 2023-05-10 16:01:08 -04:00
NanoCode012
174b74ddc9 Rename variable to use same convention 2023-05-09 02:49:44 +09:00
NanoCode012
cf681537ec Add CompletionPrompt type 2023-05-09 02:49:44 +09:00
Wing Lian
a12fb0a8da Jeopardy bot! (#17)
* support for jeopardy dataset

* commit the final config for jeopardy bot
2023-05-08 03:21:40 -04:00
Wing Lian
5159d00a86 fix sharegpt tokenization, refactor tokenization debugging 2023-04-30 00:23:53 -04:00
Wing Lian
8d437853c8 fix sharegpt handling from hf, don't worry about loading llama if using earlier transformers release 2023-04-24 09:41:35 -04:00
Wing Lian
6045345d6b WIP large refactor to make finetune script a little more manageable (#3) 2023-04-18 14:01:38 -04:00
Wing Lian
81de0efc18 add support for alpaca reflect training (#2) 2023-04-18 08:34:05 -04:00
Wing Lian
a6028d302e black formatting 2023-04-14 07:25:52 -04:00
Wing Lian
8d959a7e26 make it work with pythia in the cloud 2023-04-14 07:24:55 -04:00
Wing Lian
ce24f5e246 WIP for axolotl trainer 2023-04-14 00:20:05 -04:00