* chore(doc): add faq when having no default chat_template * Update docs/dataset-formats/conversation.qmd Co-authored-by: salman <salman.mohammadi@outlook.com> * Update docs/faq.qmd Co-authored-by: salman <salman.mohammadi@outlook.com> --------- Co-authored-by: salman <salman.mohammadi@outlook.com>
165 lines
4.7 KiB
Plaintext
165 lines
4.7 KiB
Plaintext
---
|
|
title: Conversation
|
|
description: Conversation format for supervised fine-tuning.
|
|
order: 3
|
|
---
|
|
|
|
## sharegpt
|
|
|
|
::: {.callout-important}
|
|
ShareGPT is deprecated!. Please see [chat_template](#chat_template) section below.
|
|
:::
|
|
|
|
## pygmalion
|
|
|
|
```{.json filename="data.jsonl"}
|
|
{"conversations": [{"role": "...", "value": "..."}]}
|
|
```
|
|
|
|
## chat_template
|
|
|
|
Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. Support using tokenizer's template, a supported template, or custom jinja2.
|
|
|
|
```{.json filename="data.jsonl"}
|
|
{"conversations": [{"role": "...", "content": "..."}]}
|
|
```
|
|
|
|
See [configs](../config.qmd) for full configs and supported templates.
|
|
|
|
### Migrating from sharegpt
|
|
|
|
Most configs can be adapted as follows:
|
|
|
|
```yaml
|
|
# old
|
|
chat_template: chatml
|
|
datasets:
|
|
- path: ...
|
|
type: sharegpt
|
|
conversation: chatml
|
|
|
|
# new (if using tokenizer's chat_template)
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
|
|
field_messages: conversations
|
|
message_property_mappings:
|
|
role: from
|
|
content: value
|
|
|
|
# new (if setting a new chat_template like chatml, gemma, etc)
|
|
chat_template: chatml
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
|
|
field_messages: conversations
|
|
message_property_mappings:
|
|
role: from
|
|
content: value
|
|
```
|
|
|
|
We recommend checking the below examples for other usecases.
|
|
|
|
### Examples
|
|
|
|
1. Using the default chat template in the tokenizer_config.json on OpenAI messages format, training on only last message.
|
|
|
|
```yaml
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
roles_to_train:
|
|
train_on_eos:
|
|
```
|
|
|
|
::: {.callout-tip}
|
|
If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
|
|
:::
|
|
|
|
2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
|
|
|
|
```yaml
|
|
chat_template: gemma # this overwrites the tokenizer's chat_template
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
roles_to_train: ["assistant"] # default value
|
|
```
|
|
|
|
3. Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
|
|
|
|
```yaml
|
|
chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer's chat_template
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
```
|
|
|
|
4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.
|
|
|
|
```yaml
|
|
# chat_template: jinja # `jinja` will be implied if the `chat_template_jinja` is set and this field is empty
|
|
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"
|
|
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
```
|
|
|
|
::: {.callout-important}
|
|
Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
|
|
:::
|
|
|
|
5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
|
|
|
|
For a data sample that looks like:
|
|
|
|
```{.json filename="data.jsonl"}
|
|
{
|
|
"conversations": [
|
|
{"from": "system", "value": "You are an AI assistant.", "train": false},
|
|
{"from": "human", "value": "Hello", "train": false},
|
|
{"from": "assistant", "value": "Hello", "train": true},
|
|
{"from": "human", "value": "How are you?", "train": true},
|
|
{
|
|
"from": "assistant",
|
|
"value": "I'm doing very well, thank you!",
|
|
"train_detail": [
|
|
{"begin_offset": 0, "end_offset": 8, "train": false},
|
|
{"begin_offset": 9, "end_offset": 18, "train": true},
|
|
{"begin_offset": 19, "end_offset": 30, "train": false},
|
|
],
|
|
},
|
|
{
|
|
"from": "human",
|
|
"value": "I'm doing very well, thank you!",
|
|
"train": true,
|
|
},
|
|
{"from": "assistant", "value": "Hi there!", "train": true}
|
|
]
|
|
}
|
|
```
|
|
|
|
The configuration would look like:
|
|
|
|
```yaml
|
|
datasets:
|
|
- path: ...
|
|
type: chat_template
|
|
chat_template: tokenizer_default
|
|
field_messages: conversations
|
|
message_property_mappings:
|
|
role: from
|
|
content: value
|
|
roles_to_train: []
|
|
train_on_eos: turn
|
|
message_field_training: train
|
|
message_field_training_detail: train_detail
|
|
```
|
|
|
|
::: {.callout-tip}
|
|
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
|
|
:::
|