utils.gradient_checkpointing.offload_disk
+utils.gradient_checkpointing.offload_disk
DISCO - DIsk-based Storage and Checkpointing with Optimized prefetching
+Classes
+| Name | +Description | +
|---|---|
| Disco | +Disco: DIsk-based Storage and Checkpointing with Optimized prefetching | +
| DiskOffloadManager | +Manages offloaded tensors and handles prefetching in a separate thread. | +
Disco
+utils.gradient_checkpointing.offload_disk.Disco()Disco: DIsk-based Storage and Checkpointing with Optimized prefetching +Advanced disk-based gradient checkpointer with prefetching.
+Methods
+| Name | +Description | +
|---|---|
| backward | +Backward pass that loads activations from disk with prefetching | +
| forward | +Forward pass that offloads activations to disk asynchronously | +
| get_instance | +Get or create the offload manager | +
backward
+utils.gradient_checkpointing.offload_disk.Disco.backward(ctx, *grad_outputs)Backward pass that loads activations from disk with prefetching
+forward
+utils.gradient_checkpointing.offload_disk.Disco.forward(
+ ctx,
+ forward_function,
+ hidden_states,
+ *args,
+ prefetch_size=1,
+ prefetch_to_gpu=True,
+ save_workers=4,
+)Forward pass that offloads activations to disk asynchronously
+get_instance
+utils.gradient_checkpointing.offload_disk.Disco.get_instance(
+ prefetch_size=1,
+ prefetch_to_gpu=True,
+ save_workers=4,
+)Get or create the offload manager
+DiskOffloadManager
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager(
+ self,
+ prefetch_size=3,
+ prefetch_to_gpu=True,
+ save_workers=4,
+)Manages offloaded tensors and handles prefetching in a separate thread. +Includes synchronization to prevent race conditions.
+Methods
+| Name | +Description | +
|---|---|
| cleanup | +Clean up all temp files and stop prefetch thread with proper synchronization | +
| cleanup_tensor | +Clean up a specific tensor file after it’s been used | +
| load_tensor | +Load tensor from disk or prefetch cache with proper synchronization | +
| save_tensor | +Save tensor to disk asynchronously and return file path with thread-safe operations | +
| trigger_prefetch | +Trigger prefetching of the next N tensors with proper synchronization | +
| wait_for_save | +Wait for a tensor to be saved to disk | +
cleanup
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.cleanup()Clean up all temp files and stop prefetch thread with proper synchronization
+cleanup_tensor
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.cleanup_tensor(
+ file_path,
+)Clean up a specific tensor file after it’s been used
+load_tensor
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.load_tensor(
+ file_path,
+ target_device='cuda',
+)Load tensor from disk or prefetch cache with proper synchronization
+save_tensor
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.save_tensor(tensor)Save tensor to disk asynchronously and return file path with thread-safe operations
+trigger_prefetch
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.trigger_prefetch(
+ n=None,
+)Trigger prefetching of the next N tensors with proper synchronization
+wait_for_save
+utils.gradient_checkpointing.offload_disk.DiskOffloadManager.wait_for_save(
+ file_path,
+ timeout=None,
+)Wait for a tensor to be saved to disk
+ + +