fix: use apply_chat_template to find turn boundaries and allow tool_calling field (#2179) [skip ci]

* fix: use apply_chat_template to find turn boundaries and allow tool_calling field

* fix: keys to include in turn

* feat(doc): explicitly recommend setting train_on_eos and roles_to_train

* fix: eos not being masked for tool due to template padding

* chore: clear up docs

* fix: default messages format, train_on_eos: turn, and train on all assistant msg

* fix: properly warn if empty content

* feat: parametrize chat_template tests to test different tokenizers

* fix: set proper default for message key

* fix: update defaults to match load function

* fix: change defaults to use new

* feat: add tool_calling dataset

* feat: add tool_calling test

* fix: add handling of edge case of mistral tokenizer with only system prompt

* feat: refactor all test to follow source code

* fix: remove unnecessary eos_token from phi35

* fix test for phi3.5 since eos was dropped from chat_template

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
This commit is contained in:
NanoCode012
2024-12-18 04:42:21 +07:00
committed by GitHub
parent 339f3c67e2
commit 10cfecf02e
7 changed files with 924 additions and 352 deletions

View File

@@ -68,6 +68,8 @@ We recommend checking the below examples for other usecases.
datasets:
- path: ...
type: chat_template
roles_to_train:
train_on_eos:
```
2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
@@ -77,7 +79,7 @@ chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
roles_to_train: ["assistant"] # default value
```
3. Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
@@ -87,7 +89,6 @@ chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
```
4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.
@@ -99,7 +100,6 @@ chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message
datasets:
- path: ...
type: chat_template
roles_to_train: ["assistant"]
```
5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation