Built site for gh-pages

This commit is contained in:
Quarto GHA Workflow Runner
2025-07-15 00:16:43 +00:00
parent 1659bb9f82
commit 9564d8f7c6
193 changed files with 2897 additions and 823 deletions


@@ -425,6 +425,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<a href="../../docs/sequence_parallelism.html" class="sidebar-item-text sidebar-link">
<span class="menu-text">Sequence Parallelism</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../../docs/gradient_checkpointing.html" class="sidebar-item-text sidebar-link">
<span class="menu-text">Gradient Checkpointing and Activation Offloading</span></a>
</div>
</li>
</ul>
</li>
@@ -472,7 +478,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<li><a href="#classes" id="toc-classes" class="nav-link" data-scroll-target="#classes">Classes</a>
<ul class="collapse">
<li><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" id="toc-axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" class="nav-link" data-scroll-target="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer">CPU_Offloaded_Gradient_Checkpointer</a></li>
<li><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" id="toc-axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" class="nav-link" data-scroll-target="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</a></li>
</ul></li>
</ul></li>
</ul>
@@ -502,10 +507,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<td><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer">CPU_Offloaded_Gradient_Checkpointer</a></td>
<td>Saves VRAM by offloading checkpointed activations to CPU RAM.</td>
</tr>
<tr class="even">
<td><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</a></td>
<td>A monkey patch of the <code>CheckpointFunction</code> in torch/utils/checkpoint.py that offloads the first tensor to CPU during the forward pass and moves it back to CUDA during the backward pass. This allows significant memory savings at very long sequence lengths; e.g.&nbsp;for Llama 8B at a 100k seqlen it saves about 24GB per GPU: <code>((100_000*4096)*2*32/2**30)</code></td>
</tr>
</tbody>
</table>
<section id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" class="level3">
@@ -514,13 +515,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Saves VRAM by offloading checkpointed activations to CPU RAM.
The performance hit is tiny, since we mask the data movement with non-blocking copies.</p>
</section>
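The non-blocking movement mentioned above relies on pinned (page-locked) host memory: a device-to-host copy into pageable memory would synchronize with the GPU. A minimal sketch of such an offload helper, assuming PyTorch; the function name is invented for illustration and is not part of axolotl's API:

```python
import torch

def offload_to_cpu(t: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to host memory without blocking the GPU stream."""
    # A pinned host buffer lets the device-to-host copy run asynchronously,
    # overlapping with subsequent GPU kernels (pinning requires CUDA).
    buf = torch.empty(t.shape, dtype=t.dtype, device="cpu",
                      pin_memory=torch.cuda.is_available())
    buf.copy_(t, non_blocking=True)
    return buf
```

On a CPU-only machine the copy degrades gracefully to a plain synchronous copy.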
<section id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" class="level3">
<h3 class="anchored" data-anchor-id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</h3>
<div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload(</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>A monkey patch of the <code>CheckpointFunction</code> in torch/utils/checkpoint.py that offloads the first tensor to CPU during the forward pass and moves it back to CUDA during the backward pass. This allows significant memory savings at very long sequence lengths; e.g.&nbsp;for Llama 8B at a 100k seqlen it saves about 24GB per GPU: <code>((100_000*4096)*2*32/2**30)</code>
At very long sequence lengths (100k+), the overhead of copying to and from CPU is small, because the dense quadratic attention compute dominates.</p>
</section>
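The patched behavior described above can be sketched as a custom autograd Function: save the checkpointed input on CPU during forward, then restore it and recompute during backward. This is a hedged, minimal illustration of the technique, not axolotl's implementation; the class name is invented for the example, and the memory estimate mirrors the docstring's formula (hidden 4096, fp16 = 2 bytes, 32 layers).

```python
import torch

class CPUOffloadCheckpoint(torch.autograd.Function):
    """Sketch of gradient checkpointing with CPU activation offload."""

    @staticmethod
    def forward(ctx, run_function, hidden_states):
        ctx.run_function = run_function
        ctx.device = hidden_states.device
        # Offload the saved activation to CPU; with pinned memory the
        # non-blocking copy overlaps with subsequent GPU compute.
        ctx.saved_hidden = hidden_states.to("cpu", non_blocking=True)
        with torch.no_grad():
            return run_function(hidden_states)

    @staticmethod
    def backward(ctx, grad_output):
        # Bring the activation back to the original device and recompute
        # the forward with grad enabled, as gradient checkpointing does.
        hidden = (
            ctx.saved_hidden.to(ctx.device, non_blocking=True)
            .detach()
            .requires_grad_(True)
        )
        with torch.enable_grad():
            output = ctx.run_function(hidden)
        torch.autograd.backward(output, grad_output)
        return None, hidden.grad

# Per-GPU savings estimate from the docstring's formula
# (Llama 8B at a 100k seqlen):
savings_gb = (100_000 * 4096) * 2 * 32 / 2**30  # about 24.4 GB
```

A layer would be wrapped as `CPUOffloadCheckpoint.apply(layer, hidden_states)`; gradients match an unpatched backward pass, at the cost of recomputing the layer's forward.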