---
title: "Checkpoint Saving"
format:
  html:
    toc: true
    toc-depth: 2
    number-sections: true
execute:
  enabled: false
---

## Overview

Axolotl supports on-demand checkpoint saving during training. You can trigger a checkpoint via a file-based trigger (for programmatic control) or Control+C (for interactive use).

## File-Based Checkpoint Trigger

### Configuration

Enable it in your config:

```yaml
dynamic_checkpoint:
  enabled: true
  check_interval: 100 # Optional: check every N steps (default: 100)
  trigger_file_path: "axolotl_checkpoint.save" # Optional: custom filename
```

**Options:**

- `enabled`: `true` to enable (required)
- `check_interval`: Steps between file checks. Default: 100. Lower values respond faster but incur more I/O overhead.
- `trigger_file_path`: Custom trigger filename. Default: `axolotl_checkpoint.save`

### How It Works

1. Rank 0 checks `output_dir` for the trigger file every `check_interval` steps
2. When the file is detected, it is deleted and a checkpoint is saved
3. In distributed training, rank 0 broadcasts the decision so all ranks save in sync (see the sketch below)
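
To make the broadcast step concrete, here is a minimal sketch of the check-and-broadcast pattern, assuming `torch.distributed`; the function name and details are hypothetical, not Axolotl's actual implementation.

```python
# Hypothetical sketch, not Axolotl's actual code: rank 0 checks for the
# trigger file, and all ranks agree on the result via a broadcast.
from pathlib import Path

import torch
import torch.distributed as dist


def should_save_checkpoint(output_dir: str, trigger_file: str) -> bool:
    """Return True on every rank when the trigger file was found."""
    flag = torch.zeros(1, dtype=torch.uint8)  # NCCL would need this on GPU
    if not dist.is_initialized() or dist.get_rank() == 0:
        trigger = Path(output_dir) / trigger_file
        if trigger.exists():
            trigger.unlink()  # delete so the trigger can be re-created later
            flag.fill_(1)
    if dist.is_initialized():
        dist.broadcast(flag, src=0)  # every rank receives rank 0's decision
    return bool(flag.item())
```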

### Usage

**Command line:**

```bash
touch /path/to/output_dir/axolotl_checkpoint.save
```

**Programmatic:**

```python
from pathlib import Path

# Creating the trigger file in output_dir requests a checkpoint
Path("/path/to/output_dir/axolotl_checkpoint.save").touch()
```

The checkpoint is saved within the next `check_interval` steps. The trigger file is deleted automatically after detection, so you can create it again whenever you want another checkpoint.
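
If you are scripting around a run, you may want to block until the requested checkpoint actually lands. A hypothetical helper along these lines polls `output_dir` for a new checkpoint directory; it assumes the default Hugging Face-style `checkpoint-<step>` naming and is not part of Axolotl itself.

```python
# Hypothetical convenience helper (not part of Axolotl): request a
# checkpoint, then poll until a new checkpoint-<step> directory appears.
import time
from pathlib import Path


def request_and_wait(output_dir, trigger_file="axolotl_checkpoint.save",
                     timeout=600.0, poll_seconds=5.0):
    out = Path(output_dir)
    before = {p.name for p in out.glob("checkpoint-*")}
    (out / trigger_file).touch()  # ask the trainer for a checkpoint
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new = {p.name for p in out.glob("checkpoint-*")} - before
        if new:
            # Return the newest checkpoint by step number
            return out / max(new, key=lambda n: int(n.split("-")[-1]))
        time.sleep(poll_seconds)
    return None  # timed out; the run may not have hit the next check yet
```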

**Custom filename:**

```yaml
dynamic_checkpoint:
  enabled: true
  trigger_file_path: "my_trigger.save"
```

```bash
touch /path/to/output_dir/my_trigger.save
```

## Control+C (SIGINT) Checkpoint

Pressing `Ctrl+C` during training saves the model state and exits gracefully. **Note:** this saves only the model weights, not the optimizer state. For resumable checkpoints, use the file-based trigger.
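
`Ctrl+C` delivers SIGINT to the foreground process, so if training runs in the background or under a job manager you can send the same signal yourself. A small illustration (the PID is a placeholder for your training process):

```python
# Send the same SIGINT that Ctrl+C would deliver; 12345 is a
# placeholder for the training process's actual PID.
import os
import signal

os.kill(12345, signal.SIGINT)
```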

## Best Practices

- **Check interval**: Use lower values (10-50) for fast training runs; the default of 100 suits slower training
- **Distributed training**: Create the trigger file once; rank 0 handles synchronization across ranks
- **Resume**: Dynamic checkpoints can be resumed like regular checkpoints via `resume_from_checkpoint` (see the snippet below)
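
For instance, resuming from a dynamically saved checkpoint looks the same as resuming from a scheduled one (the path below is illustrative):

```yaml
# Illustrative path: point this at the checkpoint directory to resume from
resume_from_checkpoint: ./outputs/lora-out/checkpoint-1234
```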

## Example

```yaml
output_dir: ./outputs/lora-out
save_steps: 500 # Scheduled checkpoints

dynamic_checkpoint:
  enabled: true
  check_interval: 50
```

This enables scheduled checkpoints every 500 steps plus on-demand saves via the file trigger (checked every 50 steps).