<li><ahref="#pre-training-from-hugging-face-hub-datasets"id="toc-pre-training-from-hugging-face-hub-datasets"class="nav-link"data-scroll-target="#pre-training-from-hugging-face-hub-datasets">Pre-training from Hugging Face hub datasets</a></li>
<li><ahref="#pre-training-from-local-dataset-files"id="toc-pre-training-from-local-dataset-files"class="nav-link"data-scroll-target="#pre-training-from-local-dataset-files">Pre-training from local dataset files</a></li>
<li><ahref="#pre-training-without-streaming"id="toc-pre-training-without-streaming"class="nav-link"data-scroll-target="#pre-training-without-streaming">Pre-training without streaming</a></li>
<p>Pre-training is the usual choice when training on large text corpora. Because these datasets are so large, downloading the entire dataset before training begins would be prohibitively time-consuming. Axolotl supports <ahref="https://huggingface.co/docs/datasets/en/stream">streaming</a>, which loads data into memory in batches rather than all at once.</p>
<p>A sample format for a pre-training dataset is as follows:</p>
<p>Pre-training trains on raw text corpora with no input masking. The dataset format is simple:</p>
<spanid="cb1-3"><ahref="#cb1-3"aria-hidden="true"tabindex="-1"></a><spanclass="er">...</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>It is typically recommended to save your dataset as <code>.jsonl</code> due to its flexibility and simplicity.</p>
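<p>As a minimal sketch of producing such a file, the snippet below writes rows with a single <code>text</code> key to a <code>.jsonl</code> file and reads them back. The file name <code>corpus.jsonl</code>, the helper names, and the sample strings are placeholders chosen for illustration, not part of Axolotl.</p>

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(texts, path):
    """Write one JSON object per line, each holding a single "text" key."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical output location; any writable path works.
out_path = Path(tempfile.mkdtemp()) / "corpus.jsonl"
write_jsonl(["first row", "second row"], out_path)
rows = read_jsonl(out_path)
```

<p>Each line is an independent JSON document, so rows can be appended or streamed without parsing the whole file.</p>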
<p>Axolotl supports loading from a Hugging Face hub repo or from local files.</p>
<spanid="cb3-6"><ahref="#cb3-6"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="kw">-</span><spanclass="at"> C.jsonl</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>While we recommend <code>.jsonl</code>, you can also use any of the other formats (<code>csv</code>, <code>parquet</code>, <code>arrow</code>, <code>SQL</code>, <code>Webdataset</code>) supported by <ahref="https://huggingface.co/docs/datasets/loading#local-and-remote-files"><code>datasets.load_dataset</code></a>.</p>
<h3class="anchored"data-anchor-id="pre-training-without-streaming">Pre-training without streaming</h3>
<p>If the dataset is small enough to be loaded entirely into memory, another way to run pre-training is to use the <code>completion</code> format. The entire dataset is then pre-tokenized up front instead of being tokenized on demand as in streaming.</p>
<p>One benefit is that tokenization can be performed separately on a CPU-only machine, and the tokenized dataset can then be transferred to a GPU machine for training, which saves costs.</p>
<spanid="cb4-3"><ahref="#cb4-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> completion</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb5-6"><ahref="#cb5-6"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> completion</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb1-2"><ahref="#cb1-2"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"text"</span><spanclass="fu">:</span><spanclass="st">"second row"</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>For large corpora that don’t fit in memory, use <code>pretraining_dataset</code> with <ahref="../../docs/streaming.html">streaming</a>. Data is tokenized on-demand during training.</p>
<spanid="cb2-5"><ahref="#cb2-5"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">split</span><spanclass="kw">:</span><spanclass="at"> train</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>For <code>completion</code> only, Axolotl splits any text that exceeds the context length into multiple smaller prompts. If you would like this behavior for <code>pretraining_dataset</code> too, please let us know or help make a PR!</p>
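<p>The idea of splitting an over-long text can be sketched as follows. This is an illustration under simple assumptions (fixed-size, non-overlapping chunks over already-tokenized ids), not Axolotl's actual implementation; the function name is hypothetical.</p>

```python
def split_into_chunks(token_ids, sequence_len):
    """Cut a token list into consecutive pieces of at most sequence_len tokens."""
    if sequence_len <= 0:
        raise ValueError("sequence_len must be positive")
    return [
        token_ids[start:start + sequence_len]
        for start in range(0, len(token_ids), sequence_len)
    ]


# Ten toy token ids with a context length of four.
chunks = split_into_chunks(list(range(10)), 4)
```

<p>The final chunk may be shorter than the context length; how such remainders are handled in practice is outside the scope of this sketch.</p>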
<p>Streaming requires <code>max_steps</code> in your config because Axolotl cannot infer the dataset size. One step = <code>sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus</code> tokens.</p>
</div>
</div>
<p>See <ahref="../../docs/streaming.html">Streaming Datasets</a> for full configuration details.</p>
<p>When using streaming for large datasets, Axolotl does not know in advance how large the dataset is and does not know when to stop.</p>
<p>Therefore, it is necessary to set <code>max_steps: int</code> in your config for pre-training to run, so that Axolotl knows when to stop training.</p>
<p>One step is equal to <code>sequence_len * micro_batch_size * gradient_accumulation_steps * total_num_gpus</code> tokens.</p>
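<p>To pick a value for <code>max_steps</code>, the formula above can be turned into a small calculation. The sketch below is only arithmetic; the concrete numbers (sequence length, batch size, GPU count, token budget) are made-up examples, and the result is an upper bound on tokens seen, since padded positions also count towards a step.</p>

```python
import math


def tokens_per_step(sequence_len, micro_batch_size,
                    gradient_accumulation_steps, total_num_gpus):
    """Number of token positions consumed by one optimizer step."""
    return (sequence_len * micro_batch_size
            * gradient_accumulation_steps * total_num_gpus)


def steps_for_token_budget(token_budget, **step_config):
    """Smallest max_steps that covers at least token_budget tokens."""
    return math.ceil(token_budget / tokens_per_step(**step_config))


# Example values only; substitute the ones from your own config.
config = dict(sequence_len=2048, micro_batch_size=2,
              gradient_accumulation_steps=4, total_num_gpus=8)
per_step = tokens_per_step(**config)
max_steps = steps_for_token_budget(1_000_000_000, **config)
```

<p>With these example values, one step covers 131072 tokens, so a budget of one billion tokens needs <code>max_steps: 7630</code>.</p>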
<p>For datasets that fit in memory, use <code>type: completion</code> under <code>datasets:</code>. The entire dataset is pre-tokenized before training, which can be done on a CPU-only machine.</p>
<spanid="cb3-3"><ahref="#cb3-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> completion</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb6-3"><ahref="#cb6-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb4-3"><ahref="#cb4-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>We recommend this approach when you want granular control over the prompt formatting, special tokens, and masking, whilst letting Axolotl handle the tokenization. This is very useful if your dataset has unique prompts that differ across samples and where one single general template wouldn’t suffice.</p>
<p>In the example below, you can see that the prompt has no fixed structure. At the same time, it’s very flexible as there are no constraints on how your prompt can look.</p>
<spanid="cb7-20"><ahref="#cb7-20"aria-hidden="true"tabindex="-1"></a><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb5-20"><ahref="#cb5-20"aria-hidden="true"tabindex="-1"></a><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Each prompt must have a key called <code>segments</code>, which is a list of <code>{ text, label }</code>.</p>
<spanid="cb8-3"><ahref="#cb8-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> input_output</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb6-3"><ahref="#cb6-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> input_output</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
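<p>How <code>segments</code> translate into masked labels can be sketched as below. The sketch assumes that <code>label: true</code> means "train on this segment" and that masked positions use <code>-100</code>, the default ignore index of PyTorch's cross-entropy loss. The character-level tokenizer and the helper names are stand-ins for illustration only.</p>

```python
IGNORE_INDEX = -100  # assumed mask value; PyTorch's cross-entropy ignores it by default


def toy_tokenize(text):
    """Stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_labels(segments, tokenize=toy_tokenize):
    """Concatenate segment tokens and mask every segment whose label is false."""
    input_ids, labels = [], []
    for segment in segments:
        ids = tokenize(segment["text"])
        input_ids.extend(ids)
        if segment["label"]:
            labels.extend(ids)  # trained on: the label equals the token id
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out of the loss
    return input_ids, labels


segments = [
    {"text": "Hi ", "label": False},
    {"text": "yo", "label": True},
]
input_ids, labels = build_labels(segments)
```

<p>Because masking is decided per segment, any part of a prompt can be included in or excluded from the loss without a template.</p>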
<p>Here’s a quick rundown on <code>chat_template</code>: A <code>chat_template</code> is a Jinja2 template which formats a list of messages into a prompt.</p>
<p>An example of a prompt formatted into a popular template called ChatML can be seen below:</p>
<spanid="cb9-9"><ahref="#cb9-9"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"How can I help you?"</span></span>
<spanid="cb9-13"><ahref="#cb9-13"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"Can you add 3+5?"</span></span>
<spanid="cb9-17"><ahref="#cb9-17"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"The answer is 8."</span></span>
<spanid="cb9-20"><ahref="#cb9-20"aria-hidden="true"tabindex="-1"></a><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb7-9"><ahref="#cb7-9"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"How can I help you?"</span></span>
<spanid="cb7-13"><ahref="#cb7-13"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"Can you add 3+5?"</span></span>
<spanid="cb7-17"><ahref="#cb7-17"aria-hidden="true"tabindex="-1"></a><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"The answer is 8."</span></span>
<spanid="cb7-20"><ahref="#cb7-20"aria-hidden="true"tabindex="-1"></a><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>The ChatML template is as follows:</p>
<preclass="jinja2"><code>{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}</code></pre>
<p>The above prompt formatted into this template will result in:</p>
@@ -1066,9 +1067,9 @@ The answer is 8.<|im_end|></code></pre>
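<p>For readers less familiar with Jinja2, the ChatML template can be mirrored in plain Python. This is a sketch equivalent to the template text shown above, not code from Axolotl; the roles attached to the two sample messages are assumed for illustration.</p>

```python
def render_chatml(messages, add_generation_prompt=False):
    """Format a list of {"role", "content"} messages the way the ChatML template does."""
    prompt = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        prompt += "<|im_start|>assistant\n"
    return prompt


# Roles here are assumed; only the contents come from the sample above.
messages = [
    {"role": "user", "content": "Can you add 3+5?"},
    {"role": "assistant", "content": "The answer is 8."},
]
rendered = render_chatml(messages)
```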
<p>Older conversation datasets with the following format are colloquially called <code>sharegpt</code> datasets.</p>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb12"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb12-1"><ahref="#cb12-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"conversations"</span><spanclass="fu">:</span><spanclass="ot">[</span><spanclass="fu">{</span><spanclass="dt">"from"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"value"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span><spanclass="ot">]</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb10"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb10-1"><ahref="#cb10-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"conversations"</span><spanclass="fu">:</span><spanclass="ot">[</span><spanclass="fu">{</span><spanclass="dt">"from"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"value"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span><spanclass="ot">]</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Newer conversation datasets usually follow the OpenAI format.</p>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb13"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb13-1"><ahref="#cb13-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"messages"</span><spanclass="fu">:</span><spanclass="ot">[</span><spanclass="fu">{</span><spanclass="dt">"role"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span><spanclass="ot">]</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb11"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb11-1"><ahref="#cb11-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"messages"</span><spanclass="fu">:</span><spanclass="ot">[</span><spanclass="fu">{</span><spanclass="dt">"role"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"content"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span><spanclass="ot">]</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Axolotl supports both formats and also lets you customize the key names.</p>
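<p>The two layouts differ only in their key names, which the following sketch makes explicit by renaming the keys of a <code>sharegpt</code>-style row into the OpenAI-style layout. The function and the sample row are hypothetical; in Axolotl itself this kind of renaming is expressed through config rather than code.</p>

```python
def rename_message_keys(row, field_messages="conversations",
                        role_key="from", content_key="value"):
    """Convert {"conversations": [{"from", "value"}]} into {"messages": [{"role", "content"}]}."""
    return {
        "messages": [
            {"role": turn[role_key], "content": turn[content_key]}
            for turn in row[field_messages]
        ]
    }


# Made-up sample row in the older layout.
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there"},
    ]
}
openai_row = rename_message_keys(sharegpt_row)
```

<p>Note that the role values (<code>human</code>, <code>gpt</code>) pass through unchanged here; only the keys are renamed.</p>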
</section>
<sectionid="chat-template-usage"class="level4">
@@ -1084,49 +1085,49 @@ The answer is 8.<|im_end|></code></pre>
<p>There are a lot of <code>chat_templates</code> out there. Axolotl supports the common ones: <ahref="https://github.com/axolotl-ai-cloud/axolotl/blob/860609392184cf62a7e0ca676658b170e059ce6c/src/axolotl/utils/chat_templates.py#L17">supported chat templates</a>. For example, to use ChatML, it would be <code>chat_template: chatml</code>.</p>
<p>However, it is also possible to use the template already configured in the tokenizer by specifying <code>chat_template: tokenizer_default</code>. If you want a fallback (in case a tokenizer does not have one pre-configured), you can set <code>chat_template: tokenizer_default_fallback_chatml</code> to fall back to the ChatML template when no tokenizer template is found.</p>
<p>One last but powerful approach is to bring your own template. This can be set via:</p>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb14"><preclass="sourceCode yaml code-with-copy"><codeclass="sourceCode yaml"><spanid="cb14-1"><ahref="#cb14-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">chat_template_jinja</span><spanclass="kw">:</span><spanclass="co"> # your template</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb12"><preclass="sourceCode yaml code-with-copy"><codeclass="sourceCode yaml"><spanid="cb12-1"><ahref="#cb12-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">chat_template_jinja</span><spanclass="kw">:</span><spanclass="co"> # your template</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb15-3"><ahref="#cb15-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">field_messages</span><spanclass="kw">:</span><spanclass="at"> messages</span><spanclass="co"> # this should point to the key containing the list of conversations</span></span>
<spanid="cb15-4"><ahref="#cb15-4"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">message_property_mappings</span><spanclass="kw">:</span><spanclass="co"> # this is a mapping from keys in your dataset to keys in chat_template</span></span>
<spanid="cb15-6"><ahref="#cb15-6"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">content</span><spanclass="kw">:</span><spanclass="at"> content</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb13-3"><ahref="#cb13-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">field_messages</span><spanclass="kw">:</span><spanclass="at"> messages</span><spanclass="co"> # this should point to the key containing the list of conversations</span></span>
<spanid="cb13-4"><ahref="#cb13-4"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">message_property_mappings</span><spanclass="kw">:</span><spanclass="co"> # this is a mapping from keys in your dataset to keys in chat_template</span></span>
<spanid="cb13-6"><ahref="#cb13-6"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">content</span><spanclass="kw">:</span><spanclass="at"> content</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>In some <code>chat_templates</code> (e.g. <ahref="https://huggingface.co/google/gemma-2b-it/blob/main/tokenizer_config.json#L1507">Gemma</a>), the roles are hardcoded to <code>user</code> and <code>assistant</code>. Consequently, you may need to map the roles in your dataset to these. We currently have some defaults that should work for common datasets, but if you get a <code>KeyError</code>, you will need to add a mapping for your roles. Here is an example of what it would look like:</p>
<spanid="cb16-8"><ahref="#cb16-8"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="kw">-</span><spanclass="at"> human</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb14-8"><ahref="#cb14-8"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="kw">-</span><spanclass="at"> human</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>In the example above, all <code>gpt</code> and <code>model</code> values are converted to <code>assistant</code>. All <code>human</code> values are converted to <code>user</code>.</p>
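<p>The effect of such a role mapping can be sketched in a few lines. The dictionary below mirrors the mapping described above (<code>gpt</code> and <code>model</code> to <code>assistant</code>, <code>human</code> to <code>user</code>); the helper names are hypothetical, and the sketch only illustrates why an unmapped role surfaces as a <code>KeyError</code>.</p>

```python
# Target role -> list of dataset role names that should be converted to it.
roles = {
    "assistant": ["gpt", "model"],
    "user": ["human"],
}


def build_lookup(role_mapping):
    """Invert the mapping so each source role points at its target role."""
    lookup = {}
    for target, sources in role_mapping.items():
        lookup[target] = target  # already-normalized roles map to themselves
        for source in sources:
            lookup[source] = target
    return lookup


def normalize_role(role, lookup):
    """Return the template role; an unmapped role raises KeyError."""
    return lookup[role]


lookup = build_lookup(roles)
normalized = [normalize_role(r, lookup) for r in ["human", "gpt", "model", "user"]]
```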
<p>The common use case for <code>chat_template</code> is chat messages, so it is common to mask all non-assistant messages. Assistant messages are the bot messages that you want the model to learn from.</p>
<p>To train on all <code>assistant</code> messages, you would set the following configs.</p>
<spanid="cb17-4"><ahref="#cb17-4"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">train_on_eos</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="st">"turn"</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb15-4"><ahref="#cb15-4"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">train_on_eos</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="st">"turn"</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Setting <code>train_on_eos: "turn"</code> masks the EOS tokens of every turn that isn’t an assistant turn. The other options, <code>all</code> and <code>last</code>, choose which EOS tokens to train on.</p>
<p>If you want to train on both the <code>assistant</code> and <code>narrator</code> roles, add <code>narrator</code> to the list of <code>roles_to_train</code>. You also need to add it to the <code>roles</code> mapping above.</p>
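<p>The three <code>train_on_eos</code> options can be pictured as a per-turn decision about whether the end-of-turn token contributes to the loss. The sketch below is one reading of the option names (<code>turn</code>: only trained roles, <code>all</code>: every turn, <code>last</code>: only the final turn) and is not taken from Axolotl's source; the function name is hypothetical.</p>

```python
def eos_is_trained(turn_roles, roles_to_train, train_on_eos):
    """Return one flag per turn: True if that turn's EOS token is trained on."""
    last_index = len(turn_roles) - 1
    flags = []
    for index, role in enumerate(turn_roles):
        if train_on_eos == "all":
            flags.append(True)
        elif train_on_eos == "turn":
            flags.append(role in roles_to_train)
        elif train_on_eos == "last":
            flags.append(index == last_index)
        else:
            raise ValueError(f"unknown train_on_eos value: {train_on_eos!r}")
    return flags


turns = ["user", "assistant", "user", "assistant"]
per_turn = eos_is_trained(turns, ["assistant"], "turn")
every = eos_is_trained(turns, ["assistant"], "all")
only_last = eos_is_trained(turns, ["assistant"], "last")
```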
<spanid="cb18-10"><ahref="#cb18-10"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">narrator</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="kw">[</span><spanclass="st">"narrator"</span><spanclass="kw">]</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb16-10"><ahref="#cb16-10"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">narrator</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="kw">[</span><spanclass="st">"narrator"</span><spanclass="kw">]</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Because a <code>chat_template</code> may use hardcoded EOS/EOT tokens that differ from the tokenizer’s EOS, it is highly recommended to set them explicitly. For example, <code>ChatML</code> uses <code><|im_end|></code> to end turns.</p>
<spanid="cb19-2"><ahref="#cb19-2"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">eos_token</span><spanclass="kw">:</span><spanclass="at"><|im_end|></span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb17-2"><ahref="#cb17-2"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">eos_token</span><spanclass="kw">:</span><spanclass="at"><|im_end|></span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb20-28"><ahref="#cb20-28"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">eos_token</span><spanclass="kw">:</span><spanclass="at"><|im_end|></span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb18-28"><ahref="#cb18-28"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">eos_token</span><spanclass="kw">:</span><spanclass="at"><|im_end|></span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>If this config were applied to the sample dataset above, the output would look like the following (it can be retrieved via <code>axolotl preprocess config.yaml --debug</code>):</p>
<p>Instruction datasets are used to train instruction-following models and comprise a prompt, containing an instruction, and a single response. In contrast to chat datasets which may be multi-turn, instruct datasets are typically single-turn.</p>
<p>An example of a common format is Alpaca:</p>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb22"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb22-1"><ahref="#cb22-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"instruction"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"input"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"output"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb20"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb20-1"><ahref="#cb20-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"instruction"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"input"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"output"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>A prompt can then be built from those keys.</p>
<pre><code>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
<spanid="cb24-3"><ahref="#cb24-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> alpaca</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb22-3"><ahref="#cb22-3"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">type</span><spanclass="kw">:</span><spanclass="at"> alpaca</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>Axolotl supports many kinds of instruction dataset. All of them can be found in the <ahref="../../docs/dataset-formats/inst_tune.html">Instruction Dataset Documentation</a> with their respective type and sample row format.</p>
<p>Due to the myriad possibilities of instruction formats, Axolotl allows customizing your own instruction format without having to dive into the code directly.</p>
<p>In the example below, a sample row is formatted into the <code>mistral_v1</code> format.</p>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb25"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb25-1"><ahref="#cb25-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"input"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"output"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb26-15"><ahref="#cb26-15"aria-hidden="true"tabindex="-1"></a><spanclass="co"> # single-line example without input</span></span>
<spanid="cb26-16"><ahref="#cb26-16"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">no_input_format</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="st">"[INST] {instruction} [/INST]"</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<divclass="code-copy-outer-scaffold"><divclass="sourceCode"id="cb23"><preclass="sourceCode json code-with-copy"><codeclass="sourceCode json"><spanid="cb23-1"><ahref="#cb23-1"aria-hidden="true"tabindex="-1"></a><spanclass="fu">{</span><spanclass="dt">"input"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">,</span><spanclass="dt">"output"</span><spanclass="fu">:</span><spanclass="st">"..."</span><spanclass="fu">}</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanid="cb24-15"><ahref="#cb24-15"aria-hidden="true"tabindex="-1"></a><spanclass="co"> # single-line example without input</span></span>
<spanid="cb24-16"><ahref="#cb24-16"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">no_input_format</span><spanclass="kw">:</span><spanclass="at"></span><spanclass="st">"[INST] {instruction} [/INST]"</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<p>The config states that the <code>field_instruction</code> is actually named <code>input</code>, and that the <code>field_input</code> is empty because this sample has no <code>input</code>. Generally, <code>instruction</code> can be thought of as the question to the model, <code>input</code> as the additional information, and <code>output</code> as the response. Neither an <code>input</code> nor a <code>system</code> is required. In the end, the most important part is to understand what format you want the prompt to have and how you can customize this for your use case.</p>
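<p>What this configuration does to a row can be sketched with ordinary string formatting. The format string is the <code>no_input_format</code> shown above; the helper name and the sample row are invented for illustration, and the sketch covers only the case without an <code>input</code> field.</p>

```python
NO_INPUT_FORMAT = "[INST] {instruction} [/INST]"


def build_prompt(row, field_instruction="input", field_output="output",
                 no_input_format=NO_INPUT_FORMAT):
    """Fill the instruction slot from the configured column and return (prompt, response)."""
    prompt = no_input_format.format(instruction=row[field_instruction])
    return prompt, row[field_output]


# Made-up row whose question lives in the "input" column.
row = {"input": "What is 2+2?", "output": "4"}
prompt, response = build_prompt(row)
```

<p>The prompt part would typically be masked and the response trained on, but that depends on the rest of the configuration.</p>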
<p>Reference: <ahref="../../docs/dataset-formats/inst_tune.html#how-to-add-custom-prompt-format">Custom Instruct Prompt Format Documentation</a>.</p>
<spanid="cb1-3"><ahref="#cb1-3"aria-hidden="true"tabindex="-1"></a><spanclass="er">...</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
<spanclass="screen-reader-only">Note</span>Streaming is recommended for large datasets
Note
</div>
</div>
<divclass="callout-body-container callout-body">
<p>Axolotl usually loads the entire dataset into memory, which can be challenging for large datasets. Use the following config to enable streaming:</p>
<spanid="cb2-5"><ahref="#cb2-5"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">text_column</span><spanclass="kw">:</span><spanclass="co"> # column in dataset with the data, usually `text`</span></span>
<spanid="cb2-8"><ahref="#cb2-8"aria-hidden="true"tabindex="-1"></a><spanclass="at"></span><spanclass="fu">skip</span><spanclass="kw">:</span><spanclass="co"> # number of rows of data to skip over from the beginning</span></span></code></pre></div><buttontitle="Copy to Clipboard"class="code-copy-button"><iclass="bi"></i></button></div>
</div>
<p>Pre-training documentation has been consolidated:</p>
<ul>
<li><strong>Streaming pretraining</strong> (large datasets): See <ahref="../../docs/streaming.html#pretraining-with-streaming">Streaming Datasets</a></li>
<li><strong>Non-streaming pretraining</strong> (<code>type: completion</code>): See <ahref="../../docs/dataset-formats/index.html#pre-training">Dataset Formats</a></li>
<spanclass="menu-text">Training Stability & Debugging</span></a>
</div>
</li>
<liclass="sidebar-item">
<divclass="sidebar-item-container">