From 3e13ea033ee7e99f7b4d24ed7604da041058ff03 Mon Sep 17 00:00:00 2001
From: Quarto GHA Workflow Runner
+CPU_Offloaded_Gradient_Checkpointer
+Saves VRAM by smartly offloading to RAM. Tiny hit to performance, since we mask the movement via non-blocking calls.
+
+
+CheckpointFunctionWithCPUOffload
+This is a torch/utils/checkpoint.py CheckpointFunction monkey patch that offloads the first tensor to CPU during forward and back to CUDA during backward. This allows significant memory savings when using a very long seqlen, e.g. for Llama 8B at 100k seqlen it saves about 24GB per GPU:
+((100_000*4096)*2*32/2**30)
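+My reading of the factors in this estimate (an assumption, taking 2-byte bf16/fp16 activations and Llama-8B's 4096 hidden size and 32 layers):
+
+# seqlen * hidden_size * bytes_per_element * num_layers, converted to GiB
+(100_000 * 4096) * 2 * 32 / 2**30   # ~24.4 GiB per GPU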
+monkeypatch.gradient_checkpointing.offload_cpu.CheckpointFunctionWithCPUOffload()
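+A minimal sketch of the idea follows; it is not the project's actual implementation, just an illustration of a CheckpointFunction-style torch.autograd.Function that parks the first input in pinned CPU memory during forward and copies it back before recomputation in backward, then gets monkey-patched over torch.utils.checkpoint.CheckpointFunction. The forward(run_function, preserve_rng_state, *args) signature mirrors torch's reentrant CheckpointFunction; the pinned-buffer handling, the assumption that all *args are tensors, and the omission of RNG-state restoration are simplifications of mine.
+
+import torch
+import torch.utils.checkpoint as torch_checkpoint
+
+
+class CheckpointFunctionWithCPUOffload(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, run_function, preserve_rng_state, *args):
+        ctx.run_function = run_function
+        first = args[0]
+        ctx.first_device = first.device
+        # Park the (large) first activation in a pinned CPU buffer; with
+        # non_blocking=True the device-to-host copy overlaps with ongoing GPU
+        # compute, which is what hides most of the offload cost.
+        ctx.first_cpu = torch.empty(first.shape, dtype=first.dtype, pin_memory=True)
+        ctx.first_cpu.copy_(first, non_blocking=True)
+        # Only the remaining (small) inputs keep GPU-resident references, so the
+        # big first activation can be freed once the forward pass moves on.
+        ctx.save_for_backward(*args[1:])
+        with torch.no_grad():
+            outputs = run_function(*args)
+        return outputs
+
+    @staticmethod
+    def backward(ctx, *grad_outputs):
+        # Copy the offloaded activation back to its original device, then
+        # recompute the forward with grad enabled, as vanilla checkpointing does.
+        first = ctx.first_cpu.to(ctx.first_device, non_blocking=True)
+        first.requires_grad_(True)
+        rest = [t.detach().requires_grad_(t.requires_grad) for t in ctx.saved_tensors]
+        with torch.enable_grad():
+            outputs = ctx.run_function(first, *rest)
+        if torch.is_tensor(outputs):
+            outputs = (outputs,)
+        torch.autograd.backward(outputs, grad_outputs)
+        # No gradients for run_function and preserve_rng_state.
+        return (None, None, first.grad) + tuple(t.grad for t in rest)
+
+
+# The monkey patch: route the reentrant checkpoint path (use_reentrant=True)
+# through the offloading variant.
+torch_checkpoint.CheckpointFunction = CheckpointFunctionWithCPUOffload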
+In the case of a very long seqlen (100k+), the copying to/from CPU overhead is not big, because dense quadratic attention compute will dominate.
+
+Start from ZeRO Stage 1 -> Stage 2 -> Stage 3.
+
+Using ZeRO Stage 3 with Single-GPU training
+ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
+WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500
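+A minimal sketch of that setup, setting only the variables listed above from inside the training process (how the training script itself is launched is left open; nothing beyond these four variables is implied by the doc):
+
+import os
+
+# Pretend to be a 1-process "distributed" job so the distributed backend
+# (and with it ZeRO Stage 3) can initialize on a single GPU.
+os.environ["WORLD_SIZE"] = "1"
+os.environ["LOCAL_RANK"] = "0"
+os.environ["MASTER_ADDR"] = "0.0.0.0"
+os.environ["MASTER_PORT"] = "29500"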