Commit Graph

15 Commits

Author SHA1 Message Date
NanoCode012
21326e4ef3 chore: lint 2024-10-11 11:40:42 +07:00
NanoCode012
e3efa29cf5 fix: test 2024-10-11 11:11:19 +07:00
NanoCode012
2038255052 Merge branch 'main' into cj_tokenizer_default_prompt_template 2024-10-10 20:25:37 +07:00
NanoCode012
203ae28704 fix: refactor artifact left from main merge 2024-10-10 17:16:41 +07:00
NanoCode012
f61e2fc7dc chore: remove redundant function 2024-10-10 16:15:15 +07:00
NanoCode012
b8056d04d9 Merge branch 'main' into cj_tokenizer_default_prompt_template 2024-10-10 16:11:07 +07:00
Wing Lian
e1915f5625 Multimodal Vision Llama - rudimentary support (#1940)
---------

Co-authored-by: Sunny <sunny@Sunnys-MacBook-Air.local>
Co-authored-by: sunny <sunnyliu19981005@gmail.com>
2024-10-02 21:02:48 -04:00
Keith Stevens
7b9f669a3a Trigger the original tokenization behavior when no advanced turn settings are provided (#1915) 2024-09-14 08:22:54 -04:00
Chirag Jain
eb188acbd4 Add option chat_template_jinja to provide a jinja template 2024-07-31 01:43:40 +05:30
Chirag Jain
34ea51dcf3 Fix lint and bug post merge from main 2024-07-30 23:59:38 +05:30
Chirag Jain
fd7538dca7 Merge branch 'main' into cj_tokenizer_default_prompt_template 2024-07-30 23:48:43 +05:30
Adam Brusselback
55cc214c76 Add flexible configuration options for chat_template dataset training (#1756)
* Add flexible configuration options for chat dataset training

- Introduce roles_to_train parameter to set training labels by role
- Add train_on_eos option to configure training on end-of-sequence tokens
- Implement per-message training configuration in dataset
- Allow fine-grained control over training specific portions of messages
- Add message_field_training and message_field_training_detail settings
- Implement mapping between dataset character offsets and tokenized prompt
- Enhance test suite to cover new functionality

* Fix missing field inits, things weren't working from yaml.

* chore: lint

* Revert test repo back to NousResearch after opening PR to fix the tokenizer_config.json.

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-07-28 21:48:57 -04:00
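The roles_to_train and train_on_eos options named in the commit above can be pictured with a small label-masking sketch. This is an illustration only, not the project's implementation: the helper name mask_labels_by_role, the token ids, and the treatment of train_on_eos as a simple boolean are all assumptions made for this example. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_labels_by_role(turns, roles_to_train, train_on_eos=True):
    """Build a flat label list from (role, token_ids) turns.

    Hypothetical helper for illustration. Each turn's token_ids is assumed
    to end with an end-of-turn token. Tokens from roles outside
    roles_to_train are replaced with IGNORE_INDEX so they add no loss.
    """
    labels = []
    for role, token_ids in turns:
        if role in roles_to_train:
            turn_labels = list(token_ids)
            # Optionally exclude the trailing end-of-turn token from the loss.
            if not train_on_eos and turn_labels:
                turn_labels[-1] = IGNORE_INDEX
        else:
            turn_labels = [IGNORE_INDEX] * len(token_ids)
        labels.extend(turn_labels)
    return labels


# Made-up token ids; 2 stands in for the end-of-turn token.
turns = [
    ("user", [11, 12, 2]),
    ("assistant", [21, 22, 23, 2]),
]
labels = mask_labels_by_role(turns, roles_to_train={"assistant"})
labels_no_eos = mask_labels_by_role(turns, {"assistant"}, train_on_eos=False)
```

With roles_to_train set to only the assistant role, the user turn contributes no labels, and disabling train_on_eos additionally masks the assistant turn's final token.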
Chirag Jain
4e38cea6b8 Add tests 2024-07-12 09:04:59 +05:30
Keith Stevens
cc11c6bce2 Generalizing the chat_template prompt strategy (#1660) [skip ci]
The strategy now supports configuring several fields:
* The data field holding message arrays
* the role and content fields for each message
* role mapping from source to target types

additionally this adds a sample llama3-8b instruct template using the chat template
2024-05-28 11:24:13 -04:00
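The configurable fields described in the commit above (the field holding the message array, the per-message role and content fields, and a role mapping) can be sketched as a small normalization step. This is a minimal sketch under stated assumptions: the function normalize_messages, its parameter names, and the sample dataset layout are hypothetical and chosen for this example, not taken from the project's code.

```python
def normalize_messages(
    sample,
    field_messages="conversations",  # assumed name of the field holding the message array
    role_field="from",               # assumed name of the per-message role field
    content_field="value",           # assumed name of the per-message content field
    role_map=None,                   # assumed mapping from source role names to target role names
):
    """Convert one dataset row into a list of {"role", "content"} dicts."""
    role_map = role_map or {}
    normalized = []
    for message in sample[field_messages]:
        source_role = message[role_field]
        normalized.append(
            {
                # Unmapped roles pass through unchanged.
                "role": role_map.get(source_role, source_role),
                "content": message[content_field],
            }
        )
    return normalized


# Hypothetical row in a "from"/"value" layout.
sample = {
    "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"},
    ]
}
messages = normalize_messages(sample, role_map={"human": "user", "gpt": "assistant"})
```

The resulting list uses the role and content keys that chat templates typically expect, regardless of how the source dataset names them.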
Leonard
7c2bf3091f Fix llama3 chat_template (extra <|eot_id|> on last turn) (#1635)
* Fix llama3 chat_template (the {{eos_token}} leads to an extra <|eot_id|> being added in the last turn). Output now matches official Llama 3 Instruct model

* add tests

* chore: lint

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-05-21 09:08:53 -04:00
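The problem fixed in the commit above, where a trailing {{eos_token}} duplicated the final <|eot_id|>, can be reproduced with a plain-Python stand-in for the template. This is a simplified sketch, not the actual Jinja template: the render and render_turn helpers are hypothetical, and the example assumes the tokenizer's eos_token renders as <|eot_id|>, as the commit message implies.

```python
EOT = "<|eot_id|>"


def render_turn(role, content):
    """Render one turn in the Llama 3 Instruct layout; each turn already ends with EOT."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}{EOT}"


def render(messages, append_eos=False, eos_token=EOT):
    """Hypothetical stand-in for the chat template.

    append_eos=True mimics a template that ends with {{eos_token}}.
    """
    text = "".join(render_turn(m["role"], m["content"]) for m in messages)
    if append_eos:
        text += eos_token
    return text


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
buggy = render(messages, append_eos=True)  # last turn ends with two EOT markers
fixed = render(messages)                   # last turn ends with exactly one
```

Because every turn already closes with <|eot_id|>, appending the eos token again produces a doubled marker on the last turn; dropping the trailing append leaves one marker per turn.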