Activation Offloading w CUDA Streams (#2900) [skip ci]

* use cuda streams for activation offloading
* use torch native ops
* update cfg schema for streams
* fix literal constructor for set
* use context for training step so it doesn't affect evals
* disable streams
* auto gc on eval steps
* use activation_offloading config arg
* add docs for gradient checkpointing
* handle validation for gc/ao
* use cuda streams for act offloading
* add more validation for AC w/o GC
* fix docs
* move activation_offloading lower in definition so it doesn't break args/kwargs
* fix kd due to import order
2025-07-14 20:10:20 -04:00
parent aa684122f1
commit 99187cd208
14 changed files with 154 additions and 186 deletions
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -276,6 +276,7 @@ website:
            - docs/torchao.qmd
            - docs/custom_integrations.qmd
            - docs/sequence_parallelism.qmd
+            - docs/gradient_checkpointing.qmd

        - section: "Troubleshooting"
          contents: