---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---

## chat_template

The chat template strategy uses a jinja2 template that converts a list of messages into a prompt. It supports using the tokenizer's built-in template, a named supported template, or a custom jinja2 template.

```{.json filename="data.jsonl"}
{"messages": [{"role": "...", "content": "..."}, {"role": "...", "content": "..."}, ...]}
```
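
To preview what a template produces for a row like this, you can render it directly with the tokenizer. Below is a minimal sketch using Hugging Face `transformers`; the model name is only an example, so substitute your own base model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example model

row = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}

# Render the row with the tokenizer's built-in chat_template, without tokenizing,
# to inspect the exact prompt string that will be trained on.
print(tokenizer.apply_chat_template(row["messages"], tokenize=False))
```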

See [configs](../config-reference.qmd) for full configs and supported templates.

### Migrating from sharegpt

Most configs can be adapted as follows:

```yaml
# old
chat_template: chatml
datasets:
  - path: ...
    type: sharegpt
    conversation: chatml

# new (if using tokenizer's chat_template)
datasets:
  - path: ...
    type: chat_template

    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

# new (if setting a new chat_template like chatml, gemma, etc.)
chat_template: chatml
datasets:
  - path: ...
    type: chat_template

    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
```
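
If you would rather convert the data itself than remap fields in the config, the mapping is mechanical. Below is a minimal sketch; the `human`/`gpt` role names follow the common sharegpt convention and may differ in your dataset.

```python
def sharegpt_to_messages(row: dict) -> dict:
    """Convert sharegpt-style {"from", "value"} turns to OpenAI-style {"role", "content"}."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return {
        "messages": [
            {"role": role_map.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in row["conversations"]
        ]
    }
```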

We recommend checking the examples below for other use cases.

### Examples

#### Training on last message

(Legacy) Uses the default chat template from the tokenizer_config.json on OpenAI messages format, training on only the last message.

```yaml
datasets:
  - path: ...
    type: chat_template
    roles_to_train: []
    train_on_eos: last
```

::: {.callout-tip}
If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
:::

#### Overriding default chat template

Uses the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
  - path: ...
    type: chat_template
    roles_to_train: ["assistant"] # default value
```

::: {.callout-note}
If you want to use the tokenizer's built-in chat_template, set `chat_template: tokenizer_default` (this is the default).
:::

#### Using default chat template with fallback

Uses the tokenizer_config.json's chat template, falling back to `chatml` if the tokenizer does not define one, on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: tokenizer_default_fallback_chatml # tokenizer's template, falling back to chatml if unset
datasets:
  - path: ...
    type: chat_template
```

#### Custom Jinja template

Uses a custom jinja template on OpenAI messages format, training on all assistant messages.

```yaml
# chat_template: jinja # `jinja` is implied if `chat_template_jinja` is set and this field is empty
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"

datasets:
  - path: ...
    type: chat_template
```
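
To sanity-check a custom template before training, you can render it with `jinja2` directly. This is a minimal sketch of the same template as above, with placeholder messages and `bos_token`:

```python
from jinja2 import Template

# The same template as chat_template_jinja above, split across lines for readability.
# The "\\n" keeps the \n escape sequence in the jinja source, as in the YAML string.
template = Template(
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'system' %}{{ '<|system|>' + '\\n' + message['content'] + '<|end|>' + '\\n' }}"
    "{% elif message['role'] == 'user' %}{{ '<|user|>' + '\\n' + message['content'] + '<|end|>' + '\\n' + '<|assistant|>' + '\\n' }}"
    "{% elif message['role'] == 'assistant' %}{{ message['content'] + '<|end|>' + '\\n' }}"
    "{% endif %}{% endfor %}"
)

print(template.render(bos_token="<s>", messages=[
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]))
```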

::: {.callout-important}
Please make sure that your `tokenizer.eos_token` is the same as the EOS (End-of-Sequence) token in the template. Otherwise, set `eos_token` under `special_tokens: `.
:::

#### Using template with different token for EOT and EOS

- If you are using a template whose EOT (End-of-Turn) token differs from the EOS token, or which has multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens: ` config. The handling of EOT tokens follows `train_on_eos: `, which defaults to `turn`.

```yaml
eot_tokens:
  - "[/INST]"
  # - "[/SYSTEM_PROMPT]"

datasets:
  - path: ...
    type: chat_template

# optional
train_on_eot: turn # defaults to the value of train_on_eos (which defaults to turn)
```

::: {.callout-tip}
See the [config documentation](../config-reference.qmd) for detailed explanations of the "turn", "last", and "all" options for training on tokens.
:::

::: {.callout-note}
Using `eot_tokens` requires each listed token that appears in the `chat_template` to be a single token in the tokenizer. Otherwise, the tokenizer will split it and cause unexpected behavior.

You can add those tokens as new tokens under `tokens: ` or (recommended) override unused added_tokens via `added_tokens_overrides: `. See [config](../config-reference.qmd) for more details.
:::
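
A quick way to verify this before training is to tokenize each candidate token and check whether it stays whole. A minimal sketch; the model name is only an example, so substitute your own base model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")  # example model

for token in ["[/INST]"]:  # your eot_tokens
    pieces = tokenizer.tokenize(token)
    status = "single token, ok" if len(pieces) == 1 else f"splits into {pieces}"
    print(f"{token!r}: {status}")
```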

- Continuing from the previous example, if you want to train on the EOT token of every trainable turn but only the last EOS token, set `train_on_eos: last`.

```yaml
eot_tokens:
  - "[/INST]"
  # ...

datasets:
  - path: ...
    type: chat_template

train_on_eos: last
train_on_eot: turn
```

::: {.callout-tip}
If the EOS token only appears at the end of a prompt, `train_on_eos: last` is equivalent to `train_on_eos: turn`, so you can generally leave both at their defaults and omit them.
:::

#### Using tool use

Instead of passing `tools` via the system prompt, an alternative is to keep the `tools` in a separate column and load them via the `chat_template`, letting the template build the tool prompt dynamically.

```json
{
  "tools": [
    {
      "type": "...",
      "function": {
        "name": "...",
        "description": "...",
        "parameters": {
          "type": "...",
          "properties": {
            // ...
          },
          "required": ["..."]
        }
      }
    }
  ],
  "messages": [
    // ...
    {
      "role": "assistant", // call the function via assistant
      "tool_calls": [
        {
          "id": "...", // required only for mistral
          "type": "function",
          "function": {
            "name": "...",
            "arguments": {
              "...": "..."
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "...", // required only for mistral
      "name": "...",
      "content": "..."
    }
  ]
}
```

::: {.callout-note}
Tool definitions need to follow the [JSON Schema](https://json-schema.org/learn/getting-started-step-by-step).
:::

::: {.callout-warning}
If you have tool arguments with the same name but different dtypes (like `"time": string` and `"time": number`), save `arguments: ` as a JSON string to prevent `datasets` from failing to cast the column.

```json
"arguments": "{\"...\": \"...\"}"
```

The same applies to tool `parameters: `.

```json
"parameters": "{\"...\": \"...\"}"
```

:::
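
When preparing such a dataset, the serialization can be done once up front with `json.dumps`; axolotl converts a JSON-string `parameters` field back into a dict with `json.loads` at load time. Below is a minimal sketch (the helper name is ours, not an axolotl API); `row` is one record in the format above:

```python
import json

def stringify_tool_schemas(row: dict) -> dict:
    """Serialize nested tool schemas so `datasets` infers plain string columns."""
    for tool in row.get("tools", []):
        params = tool.get("function", {}).get("parameters")
        if isinstance(params, dict):
            tool["function"]["parameters"] = json.dumps(params)
    for message in row.get("messages", []):
        for call in message.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments")
            if isinstance(args, dict):
                call["function"]["arguments"] = json.dumps(args)
    return row
```

Apply this to every record before writing the JSONL file, since the dtype clash otherwise surfaces when `datasets` loads it.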

An example config for Llama4:

```yaml
chat_template: llama4
datasets:
  - path: Nanobit/text-tools-2k-test
    type: chat_template
    # field_tools: tools # default is `tools`
```

::: {.callout-tip}
Look into the `chat_template` you are using to see whether it supports `tools` and what role it expects for the tool answer. In the example above, the `llama4` template expects the tool answer in the `tool` or `ipython` role.
:::

#### Using fine-grained control over token masking

(Advanced) Using fine-grained control over which tokens and turns to train on in a conversation.

For a data sample that looks like:

```{.json filename="data.jsonl"}
{
  "conversations": [
    {"from": "system", "value": "You are an AI assistant.", "train": false},
    {"from": "human", "value": "Hello", "train": false},
    {"from": "assistant", "value": "Hello", "train": true},
    {"from": "human", "value": "How are you?", "train": true},
    {
      "from": "assistant",
      "value": "I'm doing very well, thank you!",
      "train_detail": [
        {"begin_offset": 0, "end_offset": 8, "train": false},
        {"begin_offset": 9, "end_offset": 18, "train": true},
        {"begin_offset": 19, "end_offset": 30, "train": false}
      ]
    },
    {
      "from": "human",
      "value": "I'm doing very well, thank you!",
      "train": true
    },
    {"from": "assistant", "value": "Hi there!", "train": true}
  ]
}
```

The configuration would look like:

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles_to_train: []
    train_on_eos: turn
    message_field_training: train
    message_field_training_detail: train_detail
```

::: {.callout-tip}
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
:::
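
The offsets are character positions within the message `value`; in the sample above the spans are contiguous and `end_offset` is inclusive (`0-8`, `9-18`, `19-30` cover the 31-character string). A small hypothetical helper (not part of axolotl) to build a `train_detail` entry that trains on a single substring:

```python
def train_detail_for(value: str, substring: str) -> list[dict]:
    """Mark only `substring` inside `value` as trainable, masking the rest.

    Assumes inclusive character offsets, matching the sample above.
    """
    start = value.index(substring)
    end = start + len(substring) - 1
    spans = []
    if start > 0:
        spans.append({"begin_offset": 0, "end_offset": start - 1, "train": False})
    spans.append({"begin_offset": start, "end_offset": end, "train": True})
    if end < len(value) - 1:
        spans.append({"begin_offset": end + 1, "end_offset": len(value) - 1, "train": False})
    return spans

# Reproduces the train_detail in the sample above:
print(train_detail_for("I'm doing very well, thank you!", " very well"))
```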

#### Reasoning split

(For the Qwen3 template only) Enables reasoning split, where the reasoning is separated from the content and passed into the template as its own field.

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: qwen3
    split_thinking: true
```

For example, the content can look like:

```json
{
  "content": "<think>Some thinking outputs</think>Output after thinking."
}
```

After the split, it will look like:

```json
{
  "reasoning_content": "Some thinking outputs",
  "content": "Output after thinking."
}
```
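
Conceptually, the split behaves like the following sketch (a rough illustration, not axolotl's actual implementation):

```python
import re

def split_thinking(content: str) -> dict:
    """Split a leading <think>...</think> block away from the visible content."""
    match = re.match(r"<think>(.*?)</think>(.*)", content, flags=re.DOTALL)
    if not match:
        return {"content": content}
    return {"reasoning_content": match.group(1).strip(), "content": match.group(2).strip()}

print(split_thinking("<think>Some thinking outputs</think>Output after thinking."))
# {'reasoning_content': 'Some thinking outputs', 'content': 'Output after thinking.'}
```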

## sharegpt

::: {.callout-important}
ShareGPT is deprecated! Please see the [chat_template](#chat_template) section.
:::

## pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```