diff --git a/.nojekyll b/.nojekyll
index fa762a488..c40bb00a7 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-610a007e
\ No newline at end of file
+ddc3e283
\ No newline at end of file
diff --git a/docs/api/monkeypatch.gradient_checkpointing.offload_cpu.html b/docs/api/monkeypatch.gradient_checkpointing.offload_cpu.html
index 56bee7618..705b36559 100644
--- a/docs/api/monkeypatch.gradient_checkpointing.offload_cpu.html
+++ b/docs/api/monkeypatch.gradient_checkpointing.offload_cpu.html
@@ -472,6 +472,7 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
Saves VRAM by smartly offloading activations to CPU RAM. Tiny hit to performance, since we hide the data movement behind compute via non-blocking copies.
+monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload()
+This is a torch/utils/checkpoint.py CheckpointFunction monkey patch that offloads the first tensor to CPU during forward and back to CUDA during backward. This allows significant memory savings when using a very long seqlen, e.g. for llama 8b at 100k it's 24GB saved per GPU: ((100_000*4096)*2*32/2**30).
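Spelling that expression out (a back-of-the-envelope sketch assuming bf16 activations and Llama 8B's hidden size of 4096 across its 32 layers):

# Rough size of the per-layer checkpointed input activations kept off the GPU.
seq_len = 100_000        # tokens
hidden_size = 4096       # Llama 8B
bytes_per_elem = 2       # bf16
n_layers = 32            # Llama 8B transformer layers

saved_gib = seq_len * hidden_size * bytes_per_elem * n_layers / 2**30
print(f"~{saved_gib:.1f} GiB saved per GPU")  # ~24.4 GiB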
+In the case of a very long seqlen (100k+), the copying to/from CPU overhead is not significant, because the dense quadratic attention compute will dominate.
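For illustration, a minimal sketch of the idea behind such a patch (this is not the project's actual implementation; it assumes the checkpointed block takes and returns a single tensor):

import torch

class CheckpointWithCPUOffloadSketch(torch.autograd.Function):
    """Illustrative sketch only: checkpoint a block and park its input
    activation in CPU RAM between forward and backward."""

    @staticmethod
    def forward(ctx, run_function, hidden_states):
        ctx.run_function = run_function
        ctx.device = hidden_states.device
        # Offload the (potentially huge) activation; non_blocking=True lets the
        # copy overlap with the block's own GPU compute (fully asynchronous only
        # with a pinned CPU destination buffer).
        ctx.hidden_cpu = hidden_states.detach().to("cpu", non_blocking=True)
        with torch.no_grad():
            return run_function(hidden_states)

    @staticmethod
    def backward(ctx, grad_output):
        # Copy the activation back to the GPU and recompute the block with grad
        # enabled, as torch.utils.checkpoint.CheckpointFunction does.
        hidden_states = ctx.hidden_cpu.to(ctx.device, non_blocking=True)
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.run_function(hidden_states)
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad

Usage would look like output = CheckpointWithCPUOffloadSketch.apply(block, hidden_states); since the real class is applied as a monkey patch of torch.utils.checkpoint.CheckpointFunction, existing gradient-checkpointing call sites presumably pick it up unchanged.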
Start from Stage 1 -> Stage 2 -> Stage 3.
+Using ZeRO Stage 3 with Single-GPU training
+ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
+WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500
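A minimal sketch of how that can be wired up in a training script (the model, config values, and the extra RANK variable are illustrative assumptions, not taken from the text above):

import os
import torch
import deepspeed

# Single-process "distributed" environment, as described above.
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("RANK", "0")               # assumption: global rank also set to 0
os.environ.setdefault("MASTER_ADDR", "0.0.0.0")
os.environ.setdefault("MASTER_PORT", "29500")

ds_config = {                                    # hypothetical minimal config
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(4096, 4096)              # stand-in for the real model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

With WORLD_SIZE=1 there are no other ranks to shard across, so the practical benefit of Stage 3 on a single GPU comes from combining it with options such as CPU offload of parameters and optimizer states.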