@@ -103,6 +103,15 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
"search-label": "Search"
}
}</script>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-9KYCVJBNMQ"></script>
<script type="text/javascript">
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
</script>
<link rel="stylesheet" href="../../styles.css">
@@ -538,19 +547,6 @@ Tip
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="er">...</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>It is typically recommended to save your dataset as <code>.jsonl</code> due to its flexibility and simplicity.</p>
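<p>As a minimal sketch of producing such a file, the snippet below writes one JSON object per line and reads it back. The <code>text</code> field name, the sample rows, and the helper names <code>write_jsonl</code> and <code>read_jsonl</code> are assumptions made for illustration; they are not taken from Axolotl itself.</p>

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample rows; the "text" field name is an assumption.
samples = [{"text": "first document"}, {"text": "second document"}]

# Write to a temporary directory so the sketch does not touch real data.
out_path = Path(tempfile.mkdtemp()) / "A.jsonl"
write_jsonl(samples, out_path)
round_trip = read_jsonl(out_path)
```

<p>Because each line is an independent JSON object, rows can be appended or streamed without parsing the whole file.</p>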
<p>Axolotl supports loading from a Hugging Face hub repo or from local files.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>For pre-training only, Axolotl splits any text that exceeds the context length into multiple smaller prompts.</p>
</div>
</div>
<section id="pre-training-from-hugging-face-hub-datasets" class="level3">
<h3 class="anchored" data-anchor-id="pre-training-from-hugging-face-hub-datasets">Pre-training from Hugging Face hub datasets</h3>
<p>As an example, to train using a Hugging Face dataset <code>hf_org/name</code>, you can pass the following config:</p>
@@ -575,14 +571,26 @@ Important
<div class="sourceCode" id="cb4"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="fu">datasets</span><span class="kw">:</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> hf_org/name</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">type</span><span class="kw">:</span><span class="at"> completion</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>From local files (either example works):</p>
<p>From local files:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">datasets</span><span class="kw">:</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> A.jsonl</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">type</span><span class="kw">:</span><span class="at"> completion</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> json</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">data_files</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">"A.jsonl"</span><span class="kw">,</span><span class="at"> </span><span class="st">"B.jsonl"</span><span class="kw">,</span><span class="at"> </span><span class="st">"C.jsonl"</span><span class="kw">]</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">type</span><span class="kw">:</span><span class="at"> completion</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> B.jsonl</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">type</span><span class="kw">:</span><span class="at"> completion</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>For <code>completion</code> only, Axolotl splits any text that exceeds the context length into multiple smaller prompts. If you would like this behavior for <code>pretraining_dataset</code> too, please let us know or help by opening a PR!</p>
</div>
</div>
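<p>To illustrate the general idea of splitting an over-long text, here is a minimal sketch that cuts a list of token IDs into consecutive chunks no longer than the context length. This is not Axolotl's actual implementation; the function name <code>split_into_chunks</code> and the toy token IDs are hypothetical, and a real pipeline would tokenize the text first.</p>

```python
def split_into_chunks(token_ids, max_len):
    """Cut a token sequence into consecutive pieces of at most max_len tokens.

    Illustrative only: not Axolotl's implementation.
    """
    if max_len < 1:
        raise ValueError("max_len must be a positive integer")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


# Toy example: ten placeholder token IDs and a context length of four.
token_ids = list(range(10))
chunks = split_into_chunks(token_ids, 4)
```

<p>Each chunk then becomes its own training prompt. The last chunk may be shorter than the context length.</p>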
</section>
<section id="pre-training-dataset-configuration-tips" class="level3">
<h3 class="anchored" data-anchor-id="pre-training-dataset-configuration-tips">Pre-training dataset configuration tips</h3>