Built site for gh-pages

Quarto GHA Workflow Runner
2025-07-30 19:52:35 +00:00
parent b1bf58e8e6
commit ca0e437362
17 changed files with 7424 additions and 1811 deletions

@@ -485,7 +485,11 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<ul>
<li><a href="#accelerate" id="toc-accelerate" class="nav-link active" data-scroll-target="#accelerate">Accelerate</a></li>
<li><a href="#raytrain" id="toc-raytrain" class="nav-link" data-scroll-target="#raytrain">Raytrain</a></li>
<li><a href="#torchrun" id="toc-torchrun" class="nav-link" data-scroll-target="#torchrun">Torchrun</a></li>
<li><a href="#torchrun" id="toc-torchrun" class="nav-link" data-scroll-target="#torchrun">Torchrun</a>
<ul class="collapse">
<li><a href="#option-1-new-axolotl-cli-with-launcher-args-recommended" id="toc-option-1-new-axolotl-cli-with-launcher-args-recommended" class="nav-link" data-scroll-target="#option-1-new-axolotl-cli-with-launcher-args-recommended">Option 1: New Axolotl CLI with launcher args (Recommended)</a></li>
<li><a href="#option-2-direct-torchrun-legacy" id="toc-option-2-direct-torchrun-legacy" class="nav-link" data-scroll-target="#option-2-direct-torchrun-legacy">Option 2: Direct torchrun (Legacy)</a></li>
</ul></li>
</ul>
</nav>
</div>
@@ -575,8 +579,14 @@ Important
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="bu">export</span> <span class="va">NCCL_SOCKET_IFNAME</span><span class="op">=</span><span class="st">"eth0,en,eth,em,bond"</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="bu">export</span> <span class="va">NCCL_BUFFSIZE</span><span class="op">=</span>2097152</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Run the following on each node:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">torchrun</span> <span class="at">--nnodes</span> <span class="va">$num_nodes</span> <span class="at">--nproc_per_node</span> <span class="va">$gpu_per_node</span> <span class="at">--rdzv_id</span> <span class="va">$rdzv_id</span> <span class="at">--rdzv_backend</span> c10d <span class="at">--rdzv_endpoint</span> <span class="st">"</span><span class="va">$head_node_ip</span><span class="st">:</span><span class="va">$head_node_port</span><span class="st">"</span> <span class="at">-m</span> axolotl.cli.train config.yaml</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Please make sure to substitute the placeholder variables.</p>
<section id="option-1-new-axolotl-cli-with-launcher-args-recommended" class="level3">
<h3 class="anchored" data-anchor-id="option-1-new-axolotl-cli-with-launcher-args-recommended">Option 1: New Axolotl CLI with launcher args (Recommended)</h3>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">axolotl</span> train config.yaml <span class="at">--launcher</span> torchrun <span class="at">--</span> <span class="at">--nnodes</span> <span class="va">$num_nodes</span> <span class="at">--nproc_per_node</span> <span class="va">$gpu_per_node</span> <span class="at">--rdzv_id</span> <span class="va">$rdzv_id</span> <span class="at">--rdzv_backend</span> c10d <span class="at">--rdzv_endpoint</span> <span class="st">"</span><span class="va">$head_node_ip</span><span class="st">:</span><span class="va">$head_node_port</span><span class="st">"</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</section>
<section id="option-2-direct-torchrun-legacy" class="level3">
<h3 class="anchored" data-anchor-id="option-2-direct-torchrun-legacy">Option 2: Direct torchrun (Legacy)</h3>
<div class="sourceCode" id="cb5"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="ex">torchrun</span> <span class="at">--nnodes</span> <span class="va">$num_nodes</span> <span class="at">--nproc_per_node</span> <span class="va">$gpu_per_node</span> <span class="at">--rdzv_id</span> <span class="va">$rdzv_id</span> <span class="at">--rdzv_backend</span> c10d <span class="at">--rdzv_endpoint</span> <span class="st">"</span><span class="va">$head_node_ip</span><span class="st">:</span><span class="va">$head_node_port</span><span class="st">"</span> <span class="at">-m</span> axolotl.cli.train config.yaml</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Please make sure to substitute the placeholder variables:</p>
<ul>
<li><code>num_nodes</code>: Number of nodes (containing GPUs)</li>
<li><code>gpu_per_node</code>: Number of GPUs per node</li>
@@ -584,22 +594,11 @@ Important
<li><code>head_node_port</code>: Port on the head node; make sure the other machines can connect to it (default: 29400)</li>
<li><code>rdzv_id</code>: A unique job ID shared by all nodes taking part in the job.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>You need to call <code>axolotl.cli.train</code> instead of <code>axolotl train</code>, as the latter calls Accelerate under the hood.</p>
</div>
</div>
<p>The new CLI approach (Option 1) is recommended as it provides consistent argument handling and works seamlessly with other Axolotl CLI features.</p>
<p>More information on the available options can be found in the PyTorch docs <a href="https://pytorch.org/docs/stable/elastic/run.html">here</a>.</p>
</section>
</section>
</main> <!-- /main -->