Gradient Checkpointing and Activation Offloading
Gradient checkpointing and activation offloading are techniques for reducing the memory footprint of training deep learning models: checkpointing recomputes activations during the backward pass instead of storing them, while offloading moves saved activations off the GPU.
Enabling Gradient Checkpointing

```yaml
gradient_checkpointing: true
```
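Conceptually, this flag trades compute for memory: instead of keeping every intermediate activation for the backward pass, a checkpointed region stores only its inputs and recomputes its activations when gradients are needed. Below is a minimal sketch of the idea using PyTorch's `torch.utils.checkpoint` API; the `Block` and `CheckpointedModel` modules are hypothetical stand-ins, illustrating the technique rather than this framework's internal implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)


class CheckpointedModel(torch.nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Store only the block input; the block's internal activations
            # are recomputed during the backward pass instead of being kept.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedModel()
x = torch.randn(8, 128, 256, requires_grad=True)
model(x).sum().backward()  # recomputation happens inside backward()
```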
Enabling Activation Offloading

```yaml
gradient_checkpointing: true # required for activation offloading
activation_offloading: true
```
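For intuition about what offloading does, PyTorch ships a saved-tensors hook, `torch.autograd.graph.save_on_cpu`, that implements a simple form of CPU activation offloading. The sketch below is illustrative only; the toy model is hypothetical, and this framework's own implementation is more sophisticated (for example, overlapping transfers with compute, as described in the variants list below):

```python
import torch

# Toy model; requires a CUDA device, since offloading moves GPU
# activations to host RAM.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# Every tensor saved for backward inside this context is moved to CPU
# (pin_memory=True enables asynchronous copies) and paged back to the
# GPU when the backward pass needs it.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)

y.sum().backward()  # activations are copied back from CPU here
```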
Activation offloading variants:

- The default `activation_offloading: true` offloads activations to CPU and uses CUDA streams to overlap the offload transfers with computation.
- `activation_offloading: legacy` naively offloads activations to CPU without additional optimizations.
- For resource-constrained environments with limited CPU memory, `activation_offloading: disk` offloads activations to disk instead of CPU RAM, so much larger context lengths can be trained with minimal memory (see the sketch after this list).
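The disk variant can be approximated with PyTorch's generic `saved_tensors_hooks`, which intercept tensors as they are saved for the backward pass. The `pack_to_disk` and `unpack_from_disk` helpers below are hypothetical illustrations of the idea, not this framework's actual code:

```python
import os
import tempfile
import uuid

import torch


def pack_to_disk(tensor: torch.Tensor) -> str:
    # Serialize the saved activation to a temp file; autograd keeps only the path.
    path = os.path.join(tempfile.gettempdir(), f"act_{uuid.uuid4().hex}.pt")
    torch.save(tensor, path)
    return path


def unpack_from_disk(path: str) -> torch.Tensor:
    # Load the activation back when the backward pass asks for it.
    tensor = torch.load(path)
    os.remove(path)
    return tensor


model = torch.nn.Linear(512, 512)
x = torch.randn(16, 512, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_disk, unpack_from_disk):
    y = model(x)

y.sum().backward()  # each activation is read back from disk on demand
```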