Built site for gh-pages

This commit is contained in:
Quarto GHA Workflow Runner
2025-02-08 11:02:53 +00:00
parent ca4cd4192e
commit 7ef6b7ee2d
6 changed files with 594 additions and 550 deletions


@@ -329,7 +329,9 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
<h2 id="toc-title">On this page</h2>
<ul>
<li><a href="#machine-configuration" id="toc-machine-configuration" class="nav-link active" data-scroll-target="#machine-configuration">Machine configuration</a></li>
<li><a href="#accelerate" id="toc-accelerate" class="nav-link active" data-scroll-target="#accelerate">Accelerate</a></li>
<li><a href="#raytrain" id="toc-raytrain" class="nav-link" data-scroll-target="#raytrain">Raytrain</a></li>
<li><a href="#torchrun" id="toc-torchrun" class="nav-link" data-scroll-target="#torchrun">Torchrun</a></li>
</ul>
</nav>
</div>
@@ -360,6 +362,24 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
</header>
<p>Below are three ways to run multi-node training in Axolotl.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Each machine needs a copy of Axolotl; we suggest using the same commit on every machine to ensure compatibility.</p>
<p>You will also need the same model configuration file on each machine.</p>
<p>Make sure the main machine is reachable by the other machines.</p>
</div>
</div>
<section id="accelerate" class="level1">
<h1>Accelerate</h1>
<p>You will need to create a configuration for Accelerate, either by running <code>accelerate config</code> and following the instructions, or by using the preset below:</p>
<p><code>~/.cache/huggingface/accelerate/default_config.yaml</code></p>
<div class="sourceCode" id="cb1"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">compute_environment</span><span class="kw">:</span><span class="at"> LOCAL_MACHINE</span></span>
@@ -379,7 +399,7 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a><span class="fu">tpu_use_cluster</span><span class="kw">:</span><span class="at"> </span><span class="ch">false</span></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a><span class="fu">tpu_use_sudo</span><span class="kw">:</span><span class="at"> </span><span class="ch">false</span></span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a><span class="fu">use_cpu</span><span class="kw">:</span><span class="at"> </span><span class="ch">false</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Configure your model to use FSDP in the Axolotl yaml. For example:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">fsdp</span><span class="kw">:</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> full_shard</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> auto_wrap</span></span>
@@ -387,12 +407,43 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">fsdp_offload_params</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">fsdp_state_dict_type</span><span class="kw">:</span><span class="at"> FULL_STATE_DICT</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">fsdp_transformer_layer_cls_to_wrap</span><span class="kw">:</span><span class="at"> LlamaDecoderLayer</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<section id="machine-configuration" class="level2">
<h2 class="anchored" data-anchor-id="machine-configuration">Machine configuration</h2>
<p>On each machine you need a copy of Axolotl; we suggest using the same commit to ensure compatibility.</p>
<p>You will also need the same model configuration file on each machine.</p>
<p>On the main machine only, make sure the port you set as <code>main_process_port</code> is open for TCP and reachable by the other machines.</p>
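<p>To check that the port is reachable from a worker machine, a quick probe such as the following can be used (the IP and port are placeholders; substitute the main machine's address and your <code>main_process_port</code>):</p>
<div class="sourceCode"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># run from a worker node; succeeds if the main machine's port is open
nc -zv 10.0.0.1 29500</code></pre></div>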
<p>Now launch with accelerate on each machine as you normally would; the processes will start once accelerate has been launched on every machine.</p>
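<p>For example, a typical per-machine launch might look like the following (the config file name is a placeholder for your own Axolotl yaml):</p>
<div class="sourceCode"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># run this on every machine; training starts once all machines have launched
accelerate launch -m axolotl.cli.train config.yaml</code></pre></div>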
</section>
<section id="raytrain" class="level1">
<h1>Ray Train</h1>
<p>Please see the Ray Train documentation <a href="../docs/ray-integration.html">here</a>.</p>
</section>
<section id="torchrun" class="level1">
<h1>Torchrun</h1>
<p>If you are using InfiniBand, we recommend using torchrun to utilize the full bandwidth.</p>
<p>Set the following environment variables (adjust the buffer size and socket interface name for your system):</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">export NCCL_IB_DISABLE=0</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at">export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="at">export NCCL_BUFFSIZE=2097152</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Run the following on each node:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">torchrun</span> <span class="at">--nnodes</span> <span class="va">$num_nodes</span> <span class="at">--nproc_per_node</span> <span class="va">$gpu_per_node</span> <span class="at">--rdzv_id</span> <span class="va">$rdzv_id</span> <span class="at">--rdzv_backend</span> c10d <span class="at">--rdzv_endpoint</span> <span class="st">"</span><span class="va">$head_node_ip</span><span class="st">:</span><span class="va">$head_node_port</span><span class="st">"</span> <span class="at">-m</span> axolotl.cli.train config.yaml</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Please make sure to substitute the placeholder variables.</p>
<ul>
<li><code>num_nodes</code>: Number of nodes (containing GPUs)</li>
<li><code>gpu_per_node</code>: Number of GPUs per node</li>
<li><code>head_node_ip</code>: IP of the head node (make sure other machines can connect to this)</li>
<li><code>head_node_port</code>: Port of the head node (make sure other machines can connect to this. Default 29400)</li>
<li><code>rdzv_id</code>: A unique job ID that is used by the job across nodes.</li>
</ul>
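<p>For instance, on a cluster of 2 nodes with 8 GPUs each, the launch might look like the following (the IP, port, and job ID are illustrative values):</p>
<div class="sourceCode"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># run on every node; all nodes must share the same rdzv_id and rdzv_endpoint
torchrun --nnodes 2 --nproc_per_node 8 --rdzv_id axolotl_job_1 --rdzv_backend c10d --rdzv_endpoint "10.0.0.1:29400" -m axolotl.cli.train config.yaml</code></pre></div>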
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>You need to call <code>axolotl.cli.train</code> instead of <code>axolotl train</code>, as the latter calls accelerate under the hood.</p>
</div>
</div>
<p>More info on the available configuration options can be found in the PyTorch docs <a href="https://pytorch.org/docs/stable/elastic/run.html">here</a>.</p>
</section>