---
title: Gradient Checkpointing and Activation Offloading
---

Gradient checkpointing and activation offloading reduce the memory footprint of training deep learning models: rather than keeping every intermediate activation in GPU memory, activations are recomputed during the backward pass or temporarily moved off the GPU, trading some extra computation and data movement for lower peak GPU memory usage.

### Enabling Gradient Checkpointing

```yaml
gradient_checkpointing: true
```
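
Under the hood, gradient checkpointing frees a layer's intermediate activations after the forward pass and recomputes them on the fly during backward. As a rough illustration of the mechanism (a minimal standalone sketch using PyTorch's `torch.utils.checkpoint`, not this project's exact implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# The block's internal activations are not kept for backward; they are
# recomputed when backward reaches this segment, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```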

### Enabling Activation Offloading

```yaml
gradient_checkpointing: true # required for activation offloading
activation_offloading: true
```

Activation offloading supports three variants:

The default `activation_offloading: true` offloads activations to CPU RAM and uses CUDA streams to overlap the offloading transfers with computation.

`activation_offloading: legacy` naively offloads activations to CPU without the stream-based overlap or other optimizations.

For resource-constrained environments with limited CPU memory, `activation_offloading: disk` offloads activations to disk instead of CPU RAM, so that much larger context lengths can be trained with minimal memory.
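
To make the offloading idea concrete, here is a minimal sketch of naive CPU offloading built on PyTorch's `torch.autograd.graph.saved_tensors_hooks`, roughly the behavior of the `legacy` variant (the default variant additionally overlaps the host/device copies with compute using CUDA streams). This is an illustration only, not this project's exact implementation:

```python
import torch

def pack_to_cpu(tensor):
    # Called when autograd saves an activation for backward:
    # move it to CPU RAM and remember its original device.
    return tensor.device, tensor.to("cpu")

def unpack_from_cpu(packed):
    # Called when backward needs the activation: move it back.
    device, tensor = packed
    return tensor.to(device)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    out = model(x)
out.sum().backward()
```

PyTorch also ships a ready-made version of this pattern, `torch.autograd.graph.save_on_cpu(pin_memory=True)`, which additionally pins the host memory for faster transfers.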