---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 3
---

## chat_template

The Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. It supports using the tokenizer's template, one of the built-in supported templates, or a custom jinja2 template.

```{.json filename="data.jsonl"}
{"messages": [{"role": "...", "content": "..."}, {"role": "...", "content": "..."}, ...]}
```

See [configs](../config-reference.qmd) for full configs and supported templates.

### Migrating from sharegpt

Most configs can be adapted as follows:

```yaml
# old
chat_template: chatml
datasets:
  - path: ...
    type: sharegpt
    conversation: chatml

# new (if using tokenizer's chat_template)
datasets:
  - path: ...
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

# new (if setting a new chat_template like chatml, gemma, etc)
chat_template: chatml
datasets:
  - path: ...
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
```

We recommend checking the examples below for other use cases.

### Examples

#### Training on last message (Legacy)

Using the default chat template in the tokenizer_config.json on OpenAI messages format, training only on the last message.

```yaml
datasets:
  - path: ...
    type: chat_template

    roles_to_train: [] # with no roles to train, only the last message is trained (legacy behavior)
    train_on_eos: last # train only on the final EOS token
```

::: {.callout-tip}
If you receive an error like "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null.", it means the tokenizer does not have a default `chat_template`. Follow the examples below instead to set a custom `chat_template`.
:::

#### Overriding default chat template

Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: gemma # this overwrites the tokenizer's chat_template
datasets:
  - path: ...
    type: chat_template
    roles_to_train: ["assistant"] # default value
```

::: {.callout-note}
If you want to use the tokenizer's built-in chat_template, use `chat_template: tokenizer_default` (this is set by default).
:::

#### Using default chat template with fallback

Using the tokenizer_config.json's chat template, or `chatml` as a fallback if the former does not exist, on OpenAI messages format, training on all assistant messages.

```yaml
chat_template: tokenizer_default_fallback_chatml # falls back to chatml if the tokenizer has no chat_template
datasets:
  - path: ...
    type: chat_template
```

#### Custom Jinja template

Using a custom jinja template on OpenAI messages format, training on all assistant messages.

```yaml
# chat_template: jinja # `jinja` is implied if `chat_template_jinja` is set and this field is empty
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|system|>' + '\n' + message['content'] + '<|end|>' + '\n'}}{% elif (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif message['role'] == 'assistant' %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}"
datasets:
  - path: ...
    type: chat_template
```

::: {.callout-important}
Please make sure that your `tokenizer.eos_token` is the same as the EOS (End-of-Sequence) token in the template. If it is not, set `eos_token` under `special_tokens:`.
:::
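For instance, with the custom template above, whose turns end with `<|end|>`, a minimal sketch (assuming `<|end|>` should also serve as the EOS token) would be:

```yaml
special_tokens:
  eos_token: "<|end|>" # must match the EOS token used by the template
```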
#### Using template with different token for EOT and EOS

- If you are using a template whose EOT (End-of-Turn) token differs from the EOS token, or which uses multiple EOT tokens (like Mistral V7 Tekken), set the `eot_tokens:` config. The handling of EOT tokens follows `train_on_eot:`, which defaults to the value of `train_on_eos:` (which itself defaults to `turn`).

```yaml
eot_tokens:
  - "[/INST]"
  # - "[/SYSTEM_PROMPT]"

datasets:
  - path: ...
    type: chat_template

# optional
train_on_eot: turn # defaults read from train_on_eos (which defaults to turn)
```

::: {.callout-tip}
See the [config documentation](../config-reference.qmd) for detailed explanations of the "turn", "last", and "all" options for training on tokens.
:::

::: {.callout-note}
Using `eot_tokens` requires each token that exists in the `chat_template` to be a single token in the tokenizer. Otherwise, the tokenizer will split the token and cause unexpected behavior. You can add those tokens as new tokens under `tokens:` or (recommended) override unused added_tokens via `added_tokens_overrides:`. See [config](../config-reference.qmd) for more details.
:::

- Continuing from the previous example, if you want to train on the EOT token of every trainable turn but only on the last EOS token, set `train_on_eos: last`.

```yaml
eot_tokens:
  - "[/INST]"
  # ...

datasets:
  - path: ...
    type: chat_template

train_on_eos: last
train_on_eot: turn
```

::: {.callout-tip}
If the EOS token only appears at the end of a prompt, `train_on_eos: last` is equivalent to `train_on_eos: turn`, so you can generally leave these at their defaults and omit them.
:::

#### Using tools

Instead of passing `tools` via the system prompt, an alternative method is to keep the `tools` in a separate column and load them via the `chat_template`, letting the template build the tool prompt dynamically.

```json
{
  "tools": [
    {
      "type": "...",
      "function": {
        "name": "...",
        "description": "...",
        "parameters": {
          "type": "...",
          "properties": {
            // ...
          },
          "required": ["..."]
        }
      }
    }
  ],
  "messages": [
    // ...
    {
      // call the function via the assistant role
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "...",
            "arguments": {
              "...": "..."
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "name": "...",
      "content": "..."
    }
  ]
}
```

::: {.callout-note}
Tools need to follow the [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
:::

```yaml
chat_template: llama4
datasets:
  - path: ...
    type: chat_template
    # field_tools: tools # default is `tools`
```

::: {.callout-tip}
Look into the `chat_template` you are using to see whether it supports `tools` and what role it expects for the tool answer. In the example above, the `llama4` template expects the tool answer in the `tool` or `ipython` role.
:::
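As a concrete illustration, a single training row might look like the following sketch (the `get_current_weather` function, its parameters, and all values are hypothetical; the exact fields your data needs depend on the `chat_template` you use):

```json
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "Name of the city"}
          },
          "required": ["city"]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "arguments": {"city": "Paris"}
          }
        }
      ]
    },
    {"role": "tool", "name": "get_current_weather", "content": "18°C, partly cloudy"},
    {"role": "assistant", "content": "It is currently 18°C and partly cloudy in Paris."}
  ]
}
```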
#### Using fine-grained control over token masking (Advanced)

Using fine-grained control over which tokens and turns to train on in a conversation.

For a data sample that looks like:

```{.json filename="data.jsonl"}
{
    "conversations": [
        {"from": "system", "value": "You are an AI assistant.", "train": false},
        {"from": "human", "value": "Hello", "train": false},
        {"from": "assistant", "value": "Hello", "train": true},
        {"from": "human", "value": "How are you?", "train": true},
        {
            "from": "assistant",
            "value": "I'm doing very well, thank you!",
            "train_detail": [
                {"begin_offset": 0, "end_offset": 8, "train": false},
                {"begin_offset": 9, "end_offset": 18, "train": true},
                {"begin_offset": 19, "end_offset": 30, "train": false}
            ]
        },
        {
            "from": "human",
            "value": "I'm doing very well, thank you!",
            "train": true
        },
        {"from": "assistant", "value": "Hi there!", "train": true}
    ]
}
```

The configuration would look like:

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: tokenizer_default
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles_to_train: []
    train_on_eos: turn
    message_field_training: train
    message_field_training_detail: train_detail
```

::: {.callout-tip}
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
:::

#### Reasoning split (For Qwen3 template only)

Enable reasoning split, where the reasoning is split from the content and passed into the template as a separate field.

```yaml
datasets:
  - path: ...
    type: chat_template
    chat_template: qwen3
    split_thinking: true
```

For example, content can look like:

```json
{
    "content": "<think>Some thinking outputs</think>Output after thinking."
}
```

After the split, it will look like:

```json
{
    "reasoning_content": "Some thinking outputs",
    "content": "Output after thinking."
}
```

## sharegpt

::: {.callout-important}
ShareGPT is deprecated! Please see the [chat_template](#chat_template) section.
:::

## pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```
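A minimal config sketch for this format, assuming the legacy `pygmalion` dataset type (as with sharegpt, prefer `chat_template` for new projects):

```yaml
datasets:
  - path: ...
    type: pygmalion
```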