Checkpoint Saving
1 Overview
Axolotl supports on-demand checkpoint saving during training. You can trigger checkpoints via file-based triggers (for programmatic control) or Control+C (for interactive use).
2 File-Based Checkpoint Trigger
2.1 Configuration
Enable in your config:

dynamic_checkpoint:
  enabled: true
  check_interval: 100 # Optional: check every N steps (default: 100)
  trigger_file_path: "axolotl_checkpoint.save" # Optional: custom filename

Options:
- enabled: Set to true to enable the trigger (required).
- check_interval: Steps between file checks. Default: 100. Lower values respond faster but add I/O overhead.
- trigger_file_path: Custom trigger filename. Default: axolotl_checkpoint.save
2.2 How It Works
- Rank 0 checks for the trigger file in output_dir every check_interval steps.
- When detected, the file is deleted and a checkpoint is saved.
- In distributed training, rank 0 broadcasts to synchronize all ranks.
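The polling behavior described above can be illustrated with a minimal single-process sketch. This is not Axolotl's actual implementation: the function name `should_save_checkpoint` and the `demo` driver are hypothetical, and the rank 0 broadcast used in distributed training is deliberately left out.

```python
import tempfile
from pathlib import Path


def should_save_checkpoint(step, output_dir, check_interval=100,
                           trigger_name="axolotl_checkpoint.save"):
    """Hypothetical helper: True if a trigger file was found and consumed."""
    # Only touch the filesystem every `check_interval` steps.
    if step % check_interval != 0:
        return False
    trigger = Path(output_dir) / trigger_name
    if not trigger.exists():
        return False
    # Delete the trigger so the same request is not served twice.
    trigger.unlink()
    return True


def demo():
    """Simulate 300 steps with a trigger file created at step 120."""
    saved_at = []
    with tempfile.TemporaryDirectory() as tmp:
        for step in range(1, 301):
            if step == 120:
                (Path(tmp) / "axolotl_checkpoint.save").touch()
            if should_save_checkpoint(step, tmp):
                saved_at.append(step)  # a real trainer would save here
    return saved_at
```

In this simulation the trigger created at step 120 is picked up at step 200, the next multiple of the check interval, and is not seen again at step 300 because it was deleted on detection.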
2.3 Usage
Command line:

touch /path/to/output_dir/axolotl_checkpoint.save

Programmatic:

from pathlib import Path
Path("/path/to/output_dir/axolotl_checkpoint.save").touch()

The checkpoint is saved within the next check_interval steps. The trigger file is auto-deleted after detection, so you can create it multiple times.
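Because the trigger file is deleted on detection, a script can use its disappearance as a signal that the request was picked up. The sketch below assumes exactly that behavior. `request_checkpoint`, `timeout`, and `poll_seconds` are hypothetical names for illustration, not part of Axolotl. Note that deletion indicates detection, so the checkpoint itself may still be writing when the function returns.

```python
import time
from pathlib import Path


def request_checkpoint(output_dir, trigger_name="axolotl_checkpoint.save",
                       timeout=600.0, poll_seconds=1.0):
    """Create the trigger file, then wait until the trainer deletes it.

    Returns True once the file has disappeared (the request was detected),
    or False if `timeout` seconds pass first.
    """
    trigger = Path(output_dir) / trigger_name
    trigger.touch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not trigger.exists():
            return True
        time.sleep(poll_seconds)
    return False
```

A reasonable `timeout` is somewhat longer than check_interval multiplied by the time per training step, since detection can take up to a full interval.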
Custom filename:
dynamic_checkpoint:
  enabled: true
  trigger_file_path: "my_trigger.save"

touch /path/to/output_dir/my_trigger.save

3 Control+C (SIGINT) Checkpoint
Pressing Ctrl+C during training saves the model state and exits gracefully. Note: This saves only the model weights, not the optimizer state. For resumable checkpoints, use the file-based trigger.
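For readers unfamiliar with the general pattern, here is a generic Python sketch of "save on Ctrl+C, then exit". It is not Axolotl's handler: `train`, `run_step`, and `save_model` are hypothetical stand-ins. It relies only on the fact that Python raises KeyboardInterrupt in the main thread when SIGINT arrives.

```python
def train(num_steps, run_step, save_model):
    """Run `run_step` up to `num_steps` times; on Ctrl+C, save and stop.

    Returns the number of completed steps.
    """
    completed = 0
    try:
        for step in range(num_steps):
            run_step(step)
            completed += 1
    except KeyboardInterrupt:
        # Interrupted by Ctrl+C (SIGINT): persist what we have, then stop.
        # In this sketch only the model is saved, mirroring the note above
        # that optimizer state is not included.
        save_model(completed)
    return completed
```

The callbacks keep the sketch independent of any training framework. In practice `save_model` would write the weights to the output directory.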
4 Best Practices
- Check interval: Lower values (10-50) for fast training, default 100 for slower training.
- Distributed training: Create the trigger file once; rank 0 handles synchronization.
- Resume: Dynamic checkpoints can be resumed like regular checkpoints via resume_from_checkpoint.
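When choosing a check interval, it can help to estimate how long a request might wait before it is detected. Since the file is checked once every check_interval steps, the wait is at most one full interval. The helper below is a hypothetical illustration of that arithmetic, and `seconds_per_step` is something you would measure from your own run.

```python
def worst_case_wait_seconds(check_interval, seconds_per_step):
    """Upper bound on the delay between creating the trigger and detection."""
    return check_interval * seconds_per_step


# Default interval with 2-second steps: up to 200 seconds before detection.
slow_run = worst_case_wait_seconds(100, 2.0)
# Interval of 50 with half-second steps: up to 25 seconds.
fast_run = worst_case_wait_seconds(50, 0.5)
```

The estimate covers detection only. Writing the checkpoint to disk takes additional time.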
5 Example
output_dir: ./outputs/lora-out
save_steps: 500 # Scheduled checkpoints

dynamic_checkpoint:
  enabled: true
  check_interval: 50

This enables scheduled checkpoints every 500 steps plus on-demand saves via file trigger (checked every 50 steps).