Built site for gh-pages
@@ -425,6 +425,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<a href="../../docs/sequence_parallelism.html" class="sidebar-item-text sidebar-link">
<span class="menu-text">Sequence Parallelism</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../../docs/gradient_checkpointing.html" class="sidebar-item-text sidebar-link">
<span class="menu-text">Gradient Checkpointing and Activation Offloading</span></a>
</div>
</li>
</ul>
</li>
@@ -472,7 +478,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<li><a href="#classes" id="toc-classes" class="nav-link" data-scroll-target="#classes">Classes</a>
<ul class="collapse">
<li><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" id="toc-axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" class="nav-link" data-scroll-target="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer">CPU_Offloaded_Gradient_Checkpointer</a></li>
<li><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" id="toc-axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" class="nav-link" data-scroll-target="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</a></li>
</ul></li>
</ul></li>
</ul>
@@ -502,10 +507,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<td><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer">CPU_Offloaded_Gradient_Checkpointer</a></td>
<td>Saves VRAM by offloading activations to CPU RAM.</td>
</tr>
<tr class="even">
<td><a href="#axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</a></td>
<td>This is a monkey patch of the CheckpointFunction in torch/utils/checkpoint.py that offloads the first tensor to the CPU during the forward pass and moves it back to CUDA during the backward pass. This allows significant memory savings at very long sequence lengths, e.g. for Llama 8B at a 100k sequence length it saves roughly 24 GB per GPU: <code>((100_000*4096)*2*32/2**30)</code></td>
</tr>
</tbody>
</table>
<section id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CPU_Offloaded_Gradient_Checkpointer" class="level3">
@@ -514,13 +515,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Saves VRAM by offloading activations to CPU RAM.
There is only a tiny performance hit, since the data movement is masked via non-blocking calls.</p>
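<p>As a hedged illustration of the pattern described above (copy the checkpointed activation to pinned CPU memory during the forward pass, bring it back to the GPU before recomputation in the backward pass), here is a minimal sketch. It is not the actual axolotl implementation; the class name OffloadToCpuSketch and its signature are illustrative.</p>
<pre class="sourceCode python"><code>import torch

class OffloadToCpuSketch(torch.autograd.Function):
    """Sketch of CPU-offloaded gradient checkpointing (illustrative, not axolotl's code)."""

    @staticmethod
    def forward(ctx, run_function, hidden_states):
        output = run_function(hidden_states)
        # Copy the activation to pinned CPU memory; non_blocking=True lets the
        # device-to-host transfer overlap with subsequent GPU compute.
        cpu_copy = torch.empty(
            hidden_states.shape, dtype=hidden_states.dtype, device="cpu", pin_memory=True
        )
        cpu_copy.copy_(hidden_states, non_blocking=True)
        ctx.run_function = run_function
        ctx.save_for_backward(cpu_copy)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (cpu_copy,) = ctx.saved_tensors
        # Move the activation back to the GPU and recompute the forward with grad enabled.
        hidden_states = cpu_copy.to("cuda", non_blocking=True).detach().requires_grad_(True)
        with torch.enable_grad():
            output = ctx.run_function(hidden_states)
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad</code></pre>
<p>A production version would also synchronize (e.g. record and wait on a CUDA event) before reusing the offloaded tensor, which is what keeps the non-blocking copies safe.</p>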
</section>
<section id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload" class="level3">
<h3 class="anchored" data-anchor-id="axolotl.monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload">CheckpointFunctionWithCPUOffload</h3>
<div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload(</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>This is a monkey patch of the CheckpointFunction in torch/utils/checkpoint.py that offloads the first tensor to the CPU during the forward pass and moves it back to CUDA during the backward pass. This allows significant memory savings at very long sequence lengths, e.g. for Llama 8B at a 100k sequence length it saves roughly 24 GB per GPU: <code>((100_000*4096)*2*32/2**30)</code>.
At very long sequence lengths (100k+), the overhead of copying to and from the CPU is small, because the dense quadratic attention compute dominates.</p>
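<p>As a quick sanity check on the savings figure quoted above, the arithmetic can be reproduced directly (assumed Llama-8B-like values: hidden size 4096, 32 layers, 2 bytes per bf16 element):</p>
<pre class="sourceCode python"><code># Rough per-GPU activation memory moved to CPU at a 100k sequence length.
seq_len, hidden_size, bytes_per_elem, num_layers = 100_000, 4096, 2, 32
gib_saved = seq_len * hidden_size * bytes_per_elem * num_layers / 2**30
print(f"~{gib_saved:.1f} GiB saved per GPU")  # ~24.4 GiB</code></pre>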
</section>