support offloading layers to CPU (#3512) [skip ci]
* support offloading layers to CPU
* chore: lint
* revert change
* update docs
@@ -1,5 +1,5 @@
---
-title: Gradient Checkpointing and Activation Offloading
+title: Gradient Checkpointing, Activation Offloading, and Layer Offloading
---

Gradient checkpointing and activation offloading are techniques used to optimize the performance of deep learning
@@ -27,3 +27,33 @@ The `activation_offloading: legacy` naively offloads activations to CPU and with

For resource-constrained environments with limited CPU memory, `activation_offloading: disk` offloads
activations to disk instead of CPU RAM so that much larger context lengths can be trained with minimal memory.

### Enabling Layer Offloading

```yaml
layer_offloading: true
```

Layer offloading reduces GPU memory usage by moving frozen (non-trainable) decoder layer parameters to CPU
and streaming them back to GPU one layer at a time during the forward and backward passes. This is
particularly useful for LoRA/QLoRA training where most of the model's parameters are frozen — only the
trainable adapter weights stay on GPU permanently.

During training, forward and backward hooks on each decoder layer handle the transfer automatically:

- **Forward pass:** Before a layer executes, its frozen params are loaded to GPU. The next layer is
  prefetched asynchronously on a separate CUDA stream for overlap.
- **Backward pass:** Same pattern in reverse — the current layer's frozen params are loaded and the
  previous layer is prefetched.

After each layer finishes, its frozen params are offloaded back to CPU pinned memory.
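
Below is a minimal PyTorch-style sketch of this hook pattern. It is not the project's actual implementation: the names (`offload_frozen_layers`, `load`, `offload`) are hypothetical, only the forward-pass hooks are shown, and details such as the mirrored backward hooks and persistent pinned buffers are omitted.

```python
# Minimal sketch of forward-pass layer offloading with prefetching, assuming a
# CUDA device and a model whose decoder layers live in an nn.ModuleList.
# Illustrative only; names and structure are hypothetical.
import torch
import torch.nn as nn


def offload_frozen_layers(layers: nn.ModuleList, device: torch.device) -> None:
    copy_stream = torch.cuda.Stream()

    def frozen_params(layer: nn.Module):
        return [p for p in layer.parameters() if not p.requires_grad]

    # Park all frozen parameters in pinned CPU memory up front.
    for layer in layers:
        for p in frozen_params(layer):
            p.data = p.data.cpu().pin_memory()

    def load(layer: nn.Module) -> None:
        # Copy frozen params to GPU on the side stream so the copy overlaps compute.
        with torch.cuda.stream(copy_stream):
            for p in frozen_params(layer):
                p.data = p.data.to(device, non_blocking=True)

    def offload(layer: nn.Module) -> None:
        # Return frozen params to pinned CPU memory once the layer is done.
        for p in frozen_params(layer):
            p.data = p.data.cpu().pin_memory()

    for i, layer in enumerate(layers):
        def pre_forward(module, args, idx=i):
            load(module)  # no-op if this layer was already prefetched
            torch.cuda.current_stream().wait_stream(copy_stream)  # params must be resident
            if idx + 1 < len(layers):
                load(layers[idx + 1])  # prefetch the next layer asynchronously

        def post_forward(module, args, output, idx=i):
            offload(module)

        layer.register_forward_pre_hook(pre_forward)
        layer.register_forward_hook(post_forward)

    # A full implementation also mirrors these hooks for the backward pass (load the
    # current layer, prefetch the previous one) and coordinates with autograd's saved
    # tensors; both are omitted here.
```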

This approach trades some CPU-GPU transfer overhead for significant GPU memory savings — the freed memory
is roughly equal to the size of all frozen parameters across all decoder layers, minus one layer's worth
that is kept on GPU at any given time.
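
As a rough, hypothetical illustration of that estimate (the per-layer parameter count below is an approximation for a Llama-7B-style model, not a measured figure):

```python
# Back-of-the-envelope estimate of memory freed by layer offloading for a
# hypothetical Llama-7B-style model trained with LoRA in bf16.
num_layers = 32                 # decoder layers
params_per_layer = 202_000_000  # ~202M frozen params per decoder layer (approximate)
bytes_per_param = 2             # bf16

frozen_bytes = num_layers * params_per_layer * bytes_per_param
resident_bytes = params_per_layer * bytes_per_param  # one layer stays on GPU at a time

print(f"approx. GPU memory freed: {(frozen_bytes - resident_bytes) / 1e9:.1f} GB")  # ~12.5 GB
```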

**Requirements:**

- CUDA GPU (CPU-only training is not supported for this feature)
- Works with any HuggingFace model architecture that uses decoder layers (Llama, Mistral, Qwen, etc.)
- Best combined with LoRA/QLoRA where most parameters are frozen
@@ -54,6 +54,13 @@ These techniques save VRAM by changing how activations are handled.
- Activation Offloading: moves activations to CPU RAM or disk, trading I/O overhead for VRAM.
- Learn more: [Gradient Checkpointing and Offloading Docs](gradient_checkpointing.qmd)

### Layer Offloading

Offloads frozen (non-trainable) decoder layer parameters to CPU and streams them back to GPU one layer at a time during forward/backward passes using CUDA stream prefetching. Especially effective for LoRA/QLoRA where most parameters are frozen.

- **Config:** `layer_offloading: true`
- **Learn more:** [Layer Offloading Docs](gradient_checkpointing.qmd#enabling-layer-offloading)

### Cut Cross Entropy (CCE)

Reduces VRAM usage by using an optimized cross-entropy loss calculation.