* fix: use apply_chat_template to find turn boundaries and allow tool_calling field
* fix: keys to include in turn
* feat(doc): explicitly recommend setting train_on_eos and roles_to_train
* fix: eos not being masked for tool due to template padding
* chore: clear up docs
* fix: default messages format, train_on_eos: turn, and train on all assistant msg
* fix: properly warn if empty content
* feat: parametrize chat_template tests to test different tokenizers
* fix: set proper default for message key
* fix: update defaults to match load function
* fix: change defaults to use new
* feat: add tool_calling dataset
* feat: add tool_calling test
* fix: add handling of edge case of mistral tokenizer with only system prompt
* feat: refactor all tests to follow source code
* fix: remove unnecessary eos_token from phi35
* fix test for phi3.5 since eos was dropped from chat_template
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* Allow using tokenizer's default chat template with fallbacks
Summary of changes:
1. Adds `tokenizer_default` as an option for `chat_template` in the
`chat_template` prompt strategy, which allows using the chat template
from the tokenizer's tokenizer_config.json
2. Allows falling back to chat templates available in axolotl if the
tokenizer does not have a chat template
3. Adds a mistral chat template which supports a system message, taken
from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja
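The lookup order described in points 1 and 2 can be sketched roughly as follows. This is an illustration only, not axolotl's actual code: `resolve_chat_template`, its parameters, and the `BUILTIN_TEMPLATES` table are hypothetical names, and the ChatML string is a simplified stand-in for a bundled template.

```python
from typing import Optional

# Hypothetical stand-in for the chat templates bundled with the framework.
BUILTIN_TEMPLATES = {
    "chatml": (
        "{% for m in messages %}<|im_start|>{{ m['role'] }}\n"
        "{{ m['content'] }}<|im_end|>\n{% endfor %}"
    ),
}


def resolve_chat_template(
    choice: str,
    tokenizer_template: Optional[str] = None,
    fallback: Optional[str] = None,
) -> str:
    """Pick a chat template string (hypothetical helper).

    choice: "tokenizer_default" or the name of a bundled template.
    tokenizer_template: the template shipped with the tokenizer, if any.
    fallback: name of a bundled template to use when the tokenizer has none.
    """
    if choice == "tokenizer_default":
        if tokenizer_template:
            # Prefer the template the model was already trained with.
            return tokenizer_template
        if fallback is not None:
            return BUILTIN_TEMPLATES[fallback]
        raise ValueError("tokenizer has no chat template and no fallback was given")
    if choice not in BUILTIN_TEMPLATES:
        raise ValueError(f"unknown chat template: {choice}")
    return BUILTIN_TEMPLATES[choice]
```

With a real Hugging Face tokenizer, the value passed as `tokenizer_template` would typically come from its `chat_template` attribute.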
---
Why?
Many popular models are not trained with the chatml format. As a result,
for the model to correctly learn chatml, we have to turn on train_on_inputs,
which requires more compute and time. If we can use the chat template the
model has already learned, we only need to train on the output tokens.
---
Todo:
- Write tests
* Add tests
* Fix lint and bug post merge from main
* Add option `chat_template_jinja` to provide a jinja template
* remove custom mistral template
* Address review comments and add docs
* Update docs/dataset-formats/conversation.qmd
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* fix: set default to tokenizer template
* Merge branch 'main' into cj_tokenizer_default_prompt_template
* chore: remove redundant function
* fix: re-arrange enum declaration position
* fix: refactor artifact left from main merge
* feat(doc): updated config with chat template options and clarified examples
* chore: clarify doc
* chore: added example for non-default template
* chore: refactor
* fix: test
* fix: config being dropped and unittest to catch that
* chore: lint
* chore: skip duplicate
* fix: rename var after merge
* feat: add test for levy's dpo case
* fix: remove default setting on edge case where chat template overridden in dataset section
* feat: handle sharegpt deprecation better in docs
* feat: add example using fallback
* feat: handles chat_template requiring specific user/assistant order
* fix: update test based on new defaults
* fix: imported name incorrectly updated on merge
* chore: lint
* fix: update dummy message to prevent potential overlap with real content
* fix(doc): formatting
* fix: update bradleyterry to use new chat_template
---------
Co-authored-by: Chirag Jain <jain.chirag925@gmail.com>
* wip add new proposed message structure
* tokenization
* wip
* wip transform builder
* wip make the chat dataset loadable
* wip chatml + llama 3 new chat objects
* chore: lint
* chore: lint
* fix tokenization
* remove dacite dependency since we're using pydantic now
* fix handling when already correctly split in messages
* make sure to remove chat features from tokenized ds
* move chat to be a input transform for messages
* make sure llama3 has the bos token
* remove non-working special token code
* fix messages strat loader
* Add flexible configuration options for chat dataset training
- Introduce roles_to_train parameter to set training labels by role
- Add train_on_eos option to configure training on end-of-sequence tokens
- Implement per-message training configuration in dataset
- Allow fine-grained control over training specific portions of messages
- Add message_field_training and message_field_training_detail settings
- Implement mapping between dataset character offsets and tokenized prompt
- Enhance test suite to cover new functionality
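A rough sketch of how `roles_to_train` and `train_on_eos` could translate into training labels is shown below. It is not the strategy's real implementation: `build_labels`, the turn tuples, and the accepted `train_on_eos` values ("all", "turn", "last", anything else meaning none) are assumptions made for illustration. The only external fact relied on is that `-100` is the default label index ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(turns, roles_to_train=("assistant",), train_on_eos="turn"):
    """Build a flat label list from already-tokenized turns (hypothetical helper).

    turns: list of (role, content_token_ids, eos_token_id) tuples.
    roles_to_train: roles whose content tokens contribute to the loss.
    train_on_eos: "all" trains on every EOS, "turn" only on EOS tokens of
        trainable turns, "last" only on the final EOS, anything else on none.
    """
    labels = []
    last_idx = len(turns) - 1
    for idx, (role, content_ids, eos_id) in enumerate(turns):
        trainable = role in roles_to_train
        # Content tokens are kept for trainable roles and masked otherwise.
        labels.extend(content_ids if trainable else [IGNORE_INDEX] * len(content_ids))
        if train_on_eos == "all":
            keep_eos = True
        elif train_on_eos == "turn":
            keep_eos = trainable
        elif train_on_eos == "last":
            keep_eos = idx == last_idx
        else:
            keep_eos = False
        labels.append(eos_id if keep_eos else IGNORE_INDEX)
    return labels


# Token ids here are made up; 9 plays the role of the EOS token.
turns = [("user", [1, 2], 9), ("assistant", [3, 4], 9)]
print(build_labels(turns))
```

Per-message settings such as `message_field_training` would override the role-based decision for individual messages; that part is left out of the sketch.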
* Fix missing field inits, things weren't working from yaml.
* chore: lint
* Revert test repo back to NousResearch after opening PR to fix the tokenizer_config.json.
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Implementing a basic chat_template strategy for DPO datasets
This mimics the sft chat_template strategy such that users can:
* Specify the messages field
* Specify the per message role and content fields
* Specify the chosen and rejected fields
* Let the tokenizer construct the raw prompt
* Ensure the chosen and rejected fields don't have any prefix tokens
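One way to picture the steps above: render the prompt through the chat template with a generation prompt appended, then render the conversation once more with each candidate reply and keep only the part after the prompt. The sketch below makes several assumptions. `render_chatml` is a simplified stand-in for a tokenizer's chat template, and `build_dpo_example` and its field-name parameters are hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML renderer standing in for a tokenizer's chat template."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


def build_dpo_example(
    sample,
    render=render_chatml,
    messages_field="messages",
    chosen_field="chosen",
    rejected_field="rejected",
    role_field="role",
    content_field="content",
):
    """Turn one dataset row into prompt / chosen / rejected strings."""
    messages = [
        {"role": m[role_field], "content": m[content_field]}
        for m in sample[messages_field]
    ]
    # The template, not the user, decides what the raw prompt looks like.
    prompt = render(messages, add_generation_prompt=True)

    def completion(field):
        reply = {
            "role": sample[field][role_field],
            "content": sample[field][content_field],
        }
        full = render(messages + [reply])
        # Drop the shared prefix so the completion carries no prompt tokens.
        return full[len(prompt):]

    return {
        "prompt": prompt,
        "chosen": completion(chosen_field),
        "rejected": completion(rejected_field),
    }
```

Slicing by prompt length only works when the rendered conversation starts with the rendered prompt, which holds for this toy renderer but should be checked for real templates.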
* Adding additional dpo chat template unittests
* Rename test class
Allow message objects to carry an additional `weight` key, which can be set
to 0 (or 1) to mask that message out of training (or leave it unmasked),
similar to [1]. This is helpful for training the model to be robust and
capable of recovering from a bad assistant message.
A missing `weight` key defaults to weight 1 to guarantee backward compatibility.
[1]: https://github.com/mistralai/mistral-finetune
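The `weight` rule can be illustrated with a small sketch. The helper name `apply_message_weights` and the pre-tokenized input are assumptions, and role-based masking is deliberately left out so that only the weight behavior is visible.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def apply_message_weights(messages, token_ids_per_message):
    """Mask messages whose `weight` is 0 (hypothetical helper).

    messages: list of message dicts, each optionally carrying `weight`.
    token_ids_per_message: token ids for each message, in the same order.
    """
    labels = []
    for message, ids in zip(messages, token_ids_per_message):
        # A missing key behaves like weight 1, so older datasets are unaffected.
        weight = message.get("weight", 1)
        if weight not in (0, 1):
            raise ValueError(f"weight must be 0 or 1, got {weight!r}")
        labels.extend(ids if weight == 1 else [IGNORE_INDEX] * len(ids))
    return labels


# Made-up token ids: the bad first answer is masked, the correction is trained on.
messages = [
    {"role": "assistant", "content": "wrong answer", "weight": 0},
    {"role": "assistant", "content": "corrected answer"},
]
print(apply_message_weights(messages, [[11, 12], [13, 14, 15]]))
```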
The strategy now supports configuring several fields:
* The data field holding message arrays
* The role and content fields for each message
* Role mapping from source to target types
Additionally, this adds a sample llama3-8b instruct template using the chat template
* Fix llama3 chat_template (the {{eos_token}} leads to an extra <|eot_id|> being added in the last turn). Output now matches official Llama 3 Instruct model
* add tests
* chore: lint
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Add Glaive conversation format support
* fix black formatting errors
* Fix black and pylint formatting errors
* only set role_key_tool if provided in the dataset constructor
* Update src/axolotl/prompt_strategies/sharegpt.py
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* sharegpt test
* tokenizer test
* fix formatting
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* plain input/output prompt strategy w/o chat templates
* disable duplicate code check
* make sure to add an eos/eot token to the end of the output so it will stop
* multi turn segment support and test
* add system message to template
* readme update
* added code to register new system message
* register chatml template for test
---------
Co-authored-by: Mads Henrichsen <mads@BrbartiendeMads.lan>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* fix: `train_on_inputs: true` ignored for sharegpt
* enable unit test for train_on_inputs for sharegpt
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* fix double eos token for chatml
* isolate fix to chatml conversation
* fix add special tokens to include rstrip
* add test for train_on_inputs for sharegpt
* don't use rstrip for chatml