---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---

## sharegpt

Conversations where `from` is `human` or `gpt`. Optionally, the first turn may use the role `system` to override the default system prompt.

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```

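As a rough illustration, one record in this layout could be assembled and serialized in Python as sketched below. The helper name `make_sharegpt_record` and the sample text are hypothetical and not part of axolotl.

```python
import json


def make_sharegpt_record(turns, system=None):
    """Build one sharegpt-style record from (role, text) pairs.

    `turns` is a list of ("human" | "gpt", text) tuples. If `system` is
    given, it becomes an optional first turn with the role "system".
    """
    conversations = []
    if system is not None:
        conversations.append({"from": "system", "value": system})
    for role, text in turns:
        if role not in ("human", "gpt"):
            raise ValueError(f"unexpected role: {role!r}")
        conversations.append({"from": role, "value": text})
    return {"conversations": conversations}


record = make_sharegpt_record(
    [("human", "What is 2 + 2?"), ("gpt", "4")],
    system="You are a concise assistant.",
)

# A .jsonl file holds one JSON object per line, so each record is dumped
# without embedded newlines.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Appending each `line` plus a newline to `data.jsonl` would yield a file in the shape shown above.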
Note: `type: sharegpt` enables additional config options:

- `conversation`: converts the data to one of many conversation templates. Refer to the `name` of each template [here](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py) for the available options.
- `roles`: specifies which roles count as input and which as output. This is useful for datasets with custom roles, such as `tool`, so that they can be masked correctly.
- `field_human`: the key to use instead of `human` in the conversation.
- `field_model`: the key to use instead of `gpt` in the conversation.

```yaml
datasets:
  - path: ...
    type: sharegpt

    conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
    field_human: # Optional[str]. Human key to use for conversation.
    field_model: # Optional[str]. Assistant key to use for conversation.
    # Add additional keys from your dataset as input or output roles
    roles:
      input: # Optional[List[str]]. These will be masked based on train_on_input
      output: # Optional[List[str]].
```

## pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

## sharegpt.load_role

Conversations where the key `role` is used instead of `from`.

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

## sharegpt.load_guanaco

Conversations where `from` is `prompter` or `assistant` instead of the default sharegpt roles.

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```

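To make the difference from the default sharegpt roles concrete, the sketch below renames guanaco-style roles to `human`/`gpt`. The mapping `prompter` to `human` and `assistant` to `gpt` is an assumption drawn from the description above, and `guanaco_to_sharegpt` is a hypothetical helper, not axolotl's loader.

```python
# Assumed correspondence between guanaco-style and default sharegpt roles.
GUANACO_TO_SHAREGPT = {"prompter": "human", "assistant": "gpt"}


def guanaco_to_sharegpt(record):
    """Return a copy of `record` with guanaco roles renamed to sharegpt roles.

    Roles that are not in the mapping (for example "system") pass through
    unchanged.
    """
    return {
        "conversations": [
            {
                "from": GUANACO_TO_SHAREGPT.get(turn["from"], turn["from"]),
                "value": turn["value"],
            }
            for turn in record["conversations"]
        ]
    }


guanaco_record = {
    "conversations": [
        {"from": "prompter", "value": "Name a prime number."},
        {"from": "assistant", "value": "7"},
    ]
}
converted = guanaco_to_sharegpt(guanaco_record)
print([turn["from"] for turn in converted["conversations"]])
```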
## sharegpt.load_ultrachat

Conversations where the turns field is `messages`, the human role is `user`, and the model role is `assistant`.

```{.json filename="data.jsonl"}
{"messages": [{"user": "...", "assistant": "..."}]}
```

## sharegpt_jokes

Creates a chat in which the bot is asked to tell a joke and then explain why the joke is funny.

```{.json filename="data.jsonl"}
{"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
```

## chat_template

The chat template strategy uses a Jinja2 template that converts a list of messages into a prompt. This template is usually stored in `tokenizer_config.json` under the key `chat_template`.

Conversational data normally looks like this:

```{.json filename="data.jsonl"}
{"messages": [{"role": "...", "content": "..."}]}
```

with roles usually being `system`, `user`, `assistant`, etc.
However, all fields can be customized using the following configuration:

```yaml
datasets:
  - path: ...
    # Set type to `chat_template` to use this strategy
    type: chat_template
    # The name of the chat template to use for training. The following values are supported:
    # - tokenizer_default: Uses the chat template available in tokenizer_config.json. Raises an error if the tokenizer has no chat template. This is the default value.
    # - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py
    # - tokenizer_default_fallback_*: where * is the name of the chat template to fall back to, e.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not available in the tokenizer.
    # - jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field.
    chat_template: tokenizer_default
    # Custom jinja template for the chat template. This is only used if chat_template is set to `jinja` or `null` (in which case chat_template is automatically set to `jinja`). Default is null.
    chat_template_jinja: null
    # The key in the data example that contains the messages. Default is "conversations".
    field_messages: conversations
    # The key in the message turn that contains the role. Default is "from".
    message_field_role: from
    # The key in the message turn that contains the content. Default is "value".
    message_field_content: value
    # Role mapping for the messages. This can be useful if you are combining data from multiple sources that use different role names.
    roles:
      human: user
      user: user
      assistant: assistant
      gpt: assistant
      system: system
    # Roles to train on. The tokens from these roles will be considered for the loss. Default is ["gpt", "assistant"]
    roles_to_train: ["gpt", "assistant"]
    # Which EOS tokens to train on in the conversation. Possible values are:
    # - all: train on all EOS tokens
    # - turn: train on the EOS token at the end of each trainable turn
    # - last: train on the last EOS token in the conversation
    # - none: do not train on EOS tokens
    # Default is "turn".
    train_on_eos: turn
    # The key in the message turn that indicates whether the tokens of a turn should be considered for training. This is an advanced option for selectively training on certain turns in addition to `roles_to_train`. Default is "training".
    message_field_training: training
    # The key in the message turn that contains the training details. This is an advanced option for selectively training on certain tokens within a turn. Default is "train_detail".
    message_field_training_detail: train_detail
```

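The field and role options above can be read as a normalization step that turns each dataset row into a list of `role`/`content` messages before the template is applied. The following is a minimal sketch of that idea, not axolotl's implementation. `normalize_messages` is a hypothetical helper, and passing unmapped roles through unchanged is an assumption.

```python
# Mirrors the `roles` mapping shown in the configuration above.
DEFAULT_ROLES = {
    "human": "user",
    "user": "user",
    "assistant": "assistant",
    "gpt": "assistant",
    "system": "system",
}


def normalize_messages(
    example,
    field_messages="conversations",
    message_field_role="from",
    message_field_content="value",
    roles=DEFAULT_ROLES,
):
    """Convert one dataset row into [{"role": ..., "content": ...}, ...]."""
    messages = []
    for turn in example[field_messages]:
        source_role = turn[message_field_role]
        messages.append(
            {
                # Assumption: roles missing from the mapping are kept as-is.
                "role": roles.get(source_role, source_role),
                "content": turn[message_field_content],
            }
        )
    return messages


row = {
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there!"},
    ]
}
normalized = normalize_messages(row)
print(normalized)
```

With the defaults, a sharegpt-style row is mapped onto `user`/`assistant` roles. For data already in the OpenAI layout, the same helper would be called with `field_messages="messages"`, `message_field_role="role"`, and `message_field_content="content"`.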
### Examples

1. Using the default chat template from `tokenizer_config.json` with the OpenAI messages format

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user: user
      assistant: assistant
      human: user
      gpt: assistant
      system: system
    roles_to_train: ["assistant"]
```

2. Using a custom jinja template with the OpenAI messages format

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: jinja
    chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user: user
      assistant: assistant
      human: user
      gpt: assistant
      system: system
    roles_to_train: ["assistant"]
```

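To see what prompt the custom template in example 2 would produce, its branching logic can be mirrored in plain Python. This is only a hand-written equivalent for illustration. The real strategy renders the Jinja source itself, and both the function name `render_example_template` and the `<s>` value for `bos_token` are assumptions.

```python
def render_example_template(messages, bos_token="<s>"):
    """Plain-Python mirror of the jinja template from example 2."""
    parts = [bos_token]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append("<|system|>\n" + content + "<|end|>\n")
        elif role == "user":
            # A user turn also opens the assistant turn that follows it.
            parts.append("<|user|>\n" + content + "<|end|>\n<|assistant|>\n")
        elif role == "assistant":
            parts.append(content + "<|end|>\n")
        # Any other role is silently skipped, as in the template.
    return "".join(parts)


prompt = render_example_template(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
)
print(prompt)
```

Note that the `<|assistant|>` header is emitted as part of the user turn, so an assistant turn contributes only its content and the closing `<|end|>` marker.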
3. Using fine-grained control over which tokens and turns to train on in a conversation

For a data sample that looks like:

```{.json filename="data.jsonl"}
{
  "conversations": [
    {"from": "system", "value": "You are an AI assistant.", "train": false},
    {"from": "human", "value": "Hello", "train": false},
    {"from": "assistant", "value": "Hello", "train": true},
    {"from": "human", "value": "How are you?", "train": true},
    {
      "from": "assistant",
      "value": "I'm doing very well, thank you!",
      "train_detail": [
        {"begin_offset": 0, "end_offset": 8, "train": false},
        {"begin_offset": 9, "end_offset": 18, "train": true},
        {"begin_offset": 19, "end_offset": 30, "train": false}
      ]
    },
    {
      "from": "human",
      "value": "I'm doing very well, thank you!",
      "train": true
    },
    {"from": "assistant", "value": "Hi there!", "train": true}
  ]
}
```

The configuration would look like:

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles:
      human: human
      user: human
      assistant: assistant
      gpt: assistant
      system: system
    roles_to_train: []
    train_on_eos: turn
    message_field_training: train
    message_field_training_detail: train_detail
```
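
The sketch below illustrates how the `train` and `train_detail` fields in the sample might select text for the loss. It works on characters only and is not axolotl's implementation. `trainable_spans` is a hypothetical helper, and treating `end_offset` as inclusive is an assumption suggested by the sample, where offsets 0 to 30 cover the full 31-character string.

```python
def trainable_spans(turn, field_training="train", field_detail="train_detail"):
    """Return the pieces of a turn's text that would be trained on.

    If the turn carries per-span details, only spans marked `train: true`
    are returned. Otherwise the whole text is returned when the turn-level
    flag is true, and nothing when it is false or absent.
    """
    value = turn["value"]
    details = turn.get(field_detail)
    if details:
        return [
            # Assumption: end_offset is inclusive, hence the + 1.
            value[d["begin_offset"] : d["end_offset"] + 1]
            for d in details
            if d["train"]
        ]
    return [value] if turn.get(field_training) else []


detailed_turn = {
    "from": "assistant",
    "value": "I'm doing very well, thank you!",
    "train_detail": [
        {"begin_offset": 0, "end_offset": 8, "train": False},
        {"begin_offset": 9, "end_offset": 18, "train": True},
        {"begin_offset": 19, "end_offset": 30, "train": False},
    ],
}
print(trainable_spans(detailed_turn))
print(trainable_spans({"from": "human", "value": "Hello", "train": False}))
```

Under that assumption, only the middle span of the detailed turn is selected, while turns without `train_detail` are taken or skipped as a whole according to their `train` flag.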