Files

PraMamba 8aab807e67 feat: Add SwanLab integration for experiment tracking (#3334)

* feat(swanlab): add SwanLab integration for experiment tracking

SwanLab integration provides comprehensive experiment tracking and monitoring for Axolotl training.

Features:
- Hyperparameter logging
- Training metrics tracking
- RLHF completion logging
- Performance profiling
- Configuration validation and conflict detection

Includes:
- Plugin in src/axolotl/integrations/swanlab/
- Callback in src/axolotl/utils/callbacks/swanlab.py
- Tests in tests/integrations/test_swanlab.py
- Examples in examples/swanlab/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(swanlab): address PR #3334 review feedback from winglian and CodeRabbit

- Change use_swanlab default to True (winglian)
- Clear buffer after periodic logging to prevent duplicates (CodeRabbit Major)
- Add safe exception handling in config fallback (CodeRabbit)
- Use context managers for file operations (CodeRabbit)
- Replace LOG.error with LOG.exception for better debugging (CodeRabbit)
- Sort __all__ alphabetically (CodeRabbit)
- Add language specifiers to README code blocks (CodeRabbit)
- Fix end-of-file newline in README (pre-commit)

Resolves actionable comments and nitpicks from CodeRabbit review.
Addresses reviewer feedback from @winglian.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* only run swanlab integration tests if package is available

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>

2026-01-06 09:19:18 -05:00

custom_trainer_profiling.py
dpo-swanlab-completions.yml
dpo-swanlab-full-featured.yml
lora-swanlab-profiling.yml
README.md

README.md

SwanLab Integration Examples

This directory contains example configurations demonstrating SwanLab integration with Axolotl.

Examples Overview

1. DPO with Completion Logging

File: dpo-swanlab-completions.yml

Demonstrates DPO (Direct Preference Optimization) training with RLHF completion table logging.

Features:

Basic SwanLab experiment tracking
Completion table logging (prompts, chosen/rejected responses, rewards)
Memory-bounded buffer for long training runs
Cloud sync configuration

Best for: RLHF practitioners who want to analyze model outputs qualitatively

Quick start:

export SWANLAB_API_KEY=your-api-key
accelerate launch -m axolotl.cli.train examples/swanlab/dpo-swanlab-completions.yml

2. LoRA with Performance Profiling

File: lora-swanlab-profiling.yml

Demonstrates standard LoRA fine-tuning with performance profiling enabled.

Features:

SwanLab experiment tracking
Automatic profiling of trainer methods
Profiling metrics visualization
Performance optimization guidance

Best for: Engineers optimizing training performance and comparing different configurations

Quick start:

export SWANLAB_API_KEY=your-api-key
accelerate launch -m axolotl.cli.train examples/swanlab/lora-swanlab-profiling.yml

3. Full-Featured DPO Production Setup

File: dpo-swanlab-full-featured.yml

Comprehensive production-ready configuration with ALL SwanLab features enabled.

Features:

Experiment tracking with team workspace
RLHF completion logging
Performance profiling
Lark (Feishu) team notifications
Private deployment support
Production checklist and troubleshooting

Best for: Production RLHF training with team collaboration

Quick start:

export SWANLAB_API_KEY=your-api-key
export SWANLAB_LARK_WEBHOOK_URL=https://open.feishu.cn/...
export SWANLAB_LARK_SECRET=your-webhook-secret
accelerate launch -m axolotl.cli.train examples/swanlab/dpo-swanlab-full-featured.yml

4. Custom Trainer Profiling (Python)

File: custom_trainer_profiling.py

Python code examples showing how to add SwanLab profiling to custom trainers.

Features:

@swanlab_profile decorator examples
Context manager profiling for fine-grained timing
ProfilingConfig for advanced filtering and throttling
Multiple profiling patterns and best practices

Best for: Advanced users creating custom trainers

Usage:

from custom_trainer_profiling import CustomTrainerWithProfiling
# See file for detailed examples and patterns
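
The two profiling patterns listed above (a decorator and a context manager) can be illustrated with a standard-library sketch. The names `profile_method`, `profile_block`, `TIMINGS`, and `ToyTrainer` are hypothetical stand-ins and are not the `@swanlab_profile` API itself; see custom_trainer_profiling.py for the real decorator. Only the "profiling/" metric prefix is taken from this README.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from functools import wraps

# Hypothetical in-memory sink; a real integration would forward these to the tracker.
TIMINGS = defaultdict(list)


def profile_method(func):
    """Record the wall-clock duration of each call under 'profiling/<name>'."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the wrapped method raises.
            TIMINGS[f"profiling/{func.__name__}"].append(time.perf_counter() - start)

    return wrapper


@contextmanager
def profile_block(name):
    """Time an arbitrary block of code for fine-grained measurements."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[f"profiling/{name}"].append(time.perf_counter() - start)


class ToyTrainer:
    """Stand-in for a custom trainer; not an Axolotl class."""

    @profile_method
    def training_step(self, batch):
        with profile_block("forward"):
            total = sum(batch)
        return total


result = ToyTrainer().training_step([1, 2, 3])
```

The decorator suits whole methods such as training_step, while the context manager isolates a region inside a method without restructuring it.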

Feature Matrix

| Example | Tracking | Completion Logging | Profiling | Lark Notifications | Team Workspace |
| --- | --- | --- | --- | --- | --- |
| dpo-swanlab-completions.yml | ✅ | ✅ | ✅ (auto) | ➖ (commented) | ➖ (commented) |
| lora-swanlab-profiling.yml | ✅ | ➖ (disabled) | ✅ (auto) | ➖ (commented) | ➖ (commented) |
| dpo-swanlab-full-featured.yml | ✅ | ✅ | ✅ (auto) | ✅ | ✅ |
| custom_trainer_profiling.py | N/A | N/A | ✅ (manual) | N/A | N/A |

Configuration Quick Reference

Basic SwanLab Setup

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-project
swanlab_experiment_name: my-experiment
swanlab_mode: cloud  # cloud, local, offline, disabled

RLHF Completion Logging

swanlab_log_completions: true
swanlab_completion_log_interval: 100  # Log every 100 steps
swanlab_completion_max_buffer: 128    # Memory-bounded buffer
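
To make the interaction between the log interval and the memory-bounded buffer concrete, here is a minimal sketch. `CompletionBuffer` is a hypothetical class, not the plugin's implementation; it assumes the oldest rows are dropped when the buffer is full and that the buffer is cleared after each periodic flush.

```python
from collections import deque


class CompletionBuffer:
    """Hypothetical bounded buffer that is flushed every `log_interval` steps."""

    def __init__(self, max_size=128, log_interval=100):
        # deque(maxlen=...) silently discards the oldest row once full,
        # so memory stays bounded on long runs.
        self.rows = deque(maxlen=max_size)
        self.log_interval = log_interval

    def add(self, prompt, chosen, rejected):
        self.rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

    def maybe_flush(self, step):
        """Return buffered rows on interval boundaries, else an empty list."""
        if step == 0 or step % self.log_interval != 0 or not self.rows:
            return []
        flushed = list(self.rows)
        self.rows.clear()  # avoid logging the same rows twice
        return flushed


buf = CompletionBuffer(max_size=3, log_interval=2)
for step in range(1, 6):
    buf.add(f"p{step}", f"c{step}", f"r{step}")
flushed = buf.maybe_flush(step=4)
```

With max_size=3, only the three most recent completions survive until the flush, which is the trade-off a small buffer makes in exchange for bounded memory.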

Lark Team Notifications

swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-webhook-secret  # Required for production
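
For context on what the webhook secret is used for, the sketch below follows the signature scheme described in Feishu's custom bot documentation: an HMAC-SHA256 keyed with "timestamp, newline, secret" over an empty message, then Base64-encoded. Treat it as an assumption to verify against the current Feishu docs; this README does not confirm that the plugin signs requests in exactly this way, and `lark_sign` is a hypothetical helper.

```python
import base64
import hashlib
import hmac
import time


def lark_sign(secret, timestamp):
    """Compute a Feishu-style webhook signature (assumed scheme, see lead-in)."""
    string_to_sign = f"{timestamp}\n{secret}"
    # The combined string is the HMAC key; the message itself is empty.
    digest = hmac.new(string_to_sign.encode("utf-8"), digestmod=hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


timestamp = int(time.time())
payload = {
    "timestamp": str(timestamp),
    "sign": lark_sign("your-webhook-secret", timestamp),
    "msg_type": "text",
    "content": {"text": "training started"},
}
```

The payload is only built here, not sent, so the sketch can be run without a real webhook URL.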

Team Workspace

swanlab_workspace: my-research-team

Private Deployment

swanlab_web_host: https://swanlab.yourcompany.com
swanlab_api_host: https://api.swanlab.yourcompany.com

Authentication

Recommended: Environment Variable

export SWANLAB_API_KEY=your-api-key
export SWANLAB_LARK_WEBHOOK_URL=https://open.feishu.cn/...
export SWANLAB_LARK_SECRET=your-webhook-secret

Alternative: Config File (less secure)

swanlab_api_key: your-api-key
swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-webhook-secret
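
The two authentication routes can be pictured as a lookup with a fallback. The sketch below assumes the environment variable takes precedence over the config file value; that ordering is an assumption for illustration, and `resolve_setting` is a hypothetical helper, not part of the plugin.

```python
import os


def resolve_setting(cfg, cfg_key, env_var, environ=None):
    """Return the env var value if set, else the config value, else None."""
    environ = os.environ if environ is None else environ
    # Assumed precedence: environment first, config file second.
    value = environ.get(env_var) or cfg.get(cfg_key)
    return value or None


cfg = {"swanlab_api_key": "key-from-config"}
fake_env = {"SWANLAB_API_KEY": "key-from-env"}
api_key = resolve_setting(cfg, "swanlab_api_key", "SWANLAB_API_KEY", fake_env)
```

Keeping secrets in the environment means the YAML file can be committed or shared without leaking credentials, which is why the environment variable route is the recommended one.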

Common Use Cases

Use Case 1: Migrate from WandB to SwanLab

Start with lora-swanlab-profiling.yml, add your model and dataset config, and disable WandB:

use_swanlab: true
use_wandb: false
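
A quick way to catch a half-finished migration is to check how many loggers a config enables. The sketch below is illustrative only: `use_swanlab` and `use_wandb` appear in this README, while `use_mlflow` is an assumed key name, and `check_single_logger` is a hypothetical helper rather than the plugin's own conflict detection.

```python
# use_mlflow is an assumed key name; adjust to match your config schema.
LOGGER_FLAGS = ("use_swanlab", "use_wandb", "use_mlflow")


def enabled_loggers(cfg):
    """List the logger flags that are truthy in a parsed config dict."""
    return [flag for flag in LOGGER_FLAGS if cfg.get(flag)]


def check_single_logger(cfg):
    """Return a warning string if more than one logger is enabled, else None."""
    active = enabled_loggers(cfg)
    if len(active) > 1:
        return f"Multiple loggers enabled: {', '.join(active)}"
    return None


migrated = {"use_swanlab": True, "use_wandb": False}
half_migrated = {"use_swanlab": True, "use_wandb": True}
warning = check_single_logger(half_migrated)
```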

Use Case 2: Analyze DPO Model Outputs

Use dpo-swanlab-completions.yml and adjust the completion logging interval to your training length. Set one value, not both:

swanlab_completion_log_interval: 50     # Short runs: log more often
# swanlab_completion_log_interval: 200  # Long runs: log less often

Use Case 3: Optimize Training Performance

Use lora-swanlab-profiling.yml and run one experiment per optimization setting:

Baseline: flash_attention: false, gradient_checkpointing: false
Flash Attention: flash_attention: true
Gradient Checkpointing: gradient_checkpointing: true
Both: flash_attention: true, gradient_checkpointing: true

Compare the profiling metrics across runs in the SwanLab dashboard.
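
The four runs form a 2x2 grid over the two flags, so the overrides can be generated rather than written by hand. This is a sketch: the experiment names are hypothetical, and how the overrides are merged into lora-swanlab-profiling.yml is left to your own tooling.

```python
from itertools import product

# Hypothetical experiment names keyed by (flash_attention, gradient_checkpointing).
NAMES = {
    (False, False): "baseline",
    (True, False): "flash-attention",
    (False, True): "gradient-checkpointing",
    (True, True): "both",
}


def experiment_matrix():
    """Build one config override dict per combination of the two flags."""
    runs = []
    for flash, ckpt in product([False, True], repeat=2):
        runs.append(
            {
                "flash_attention": flash,
                "gradient_checkpointing": ckpt,
                "swanlab_experiment_name": NAMES[(flash, ckpt)],
            }
        )
    return runs


runs = experiment_matrix()
```

Giving each run a distinct swanlab_experiment_name keeps the four sets of profiling metrics separable when comparing them side by side.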

Use Case 4: Production RLHF with Team Collaboration

Use dpo-swanlab-full-featured.yml and set up the team workspace and Lark notifications:

swanlab_workspace: ml-team
swanlab_lark_webhook_url: ...
swanlab_lark_secret: ...

Viewing Your Experiments

Cloud Mode

Visit https://swanlab.cn and navigate to your project.

Dashboard sections:

Metrics: Training loss, learning rate, profiling metrics
Tables: RLHF completions (for DPO/KTO/ORPO/GRPO)
Config: Hyperparameters and configuration
System: Resource usage (GPU, memory, CPU)
Files: Logged artifacts

Local Mode

swanlab watch ./swanlog
# Open browser to http://localhost:5092

Troubleshooting

SwanLab not initializing

# Check API key
echo $SWANLAB_API_KEY

# Verify SwanLab is installed
pip show swanlab

# Check config
grep -A 5 "use_swanlab" your-config.yml
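
The same three checks can be scripted as a preflight step before launching a run. This is a sketch using only the standard library; `preflight` is a hypothetical helper, and treating "cloud" as the default mode is an assumption. A missing API key is reported as a hint only, since you may already be authenticated by another route.

```python
import importlib.util
import os


def preflight(cfg, environ=None):
    """Return a list of likely reasons SwanLab would not initialize."""
    environ = os.environ if environ is None else environ
    problems = []
    if not cfg.get("use_swanlab"):
        problems.append("use_swanlab is not true in the config")
    # find_spec checks importability without actually importing the package.
    if importlib.util.find_spec("swanlab") is None:
        problems.append("swanlab package is not installed")
    mode = cfg.get("swanlab_mode", "cloud")  # assumed default
    has_key = environ.get("SWANLAB_API_KEY") or cfg.get("swanlab_api_key")
    if mode == "cloud" and not has_key:
        problems.append("no SWANLAB_API_KEY found for cloud mode")
    return problems


issues = preflight({"use_swanlab": True, "swanlab_mode": "offline"}, environ={})
```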

Completions not appearing

Verify you're using an RLHF trainer (DPO/KTO/ORPO/GRPO)
Check swanlab_log_completions: true
Wait for swanlab_completion_log_interval steps
Look for "Registered SwanLab RLHF completion logging" in logs

Lark notifications not working

Test webhook manually: curl -X POST "$SWANLAB_LARK_WEBHOOK_URL" ...
Verify SWANLAB_LARK_SECRET is set correctly
Check bot is added to Lark group chat
Look for "Registered Lark notification callback" in logs

Profiling metrics not appearing

Verify use_swanlab: true
Check SwanLab is initialized (look for init log message)
Profiling metrics are under "profiling/" namespace
Profiling auto-enabled when SwanLab is enabled

Performance Notes

Overhead Comparison

| Feature | Overhead per Step | Memory Usage |
| --- | --- | --- |
| Basic tracking | < 0.1% | ~10 MB |
| Completion logging | < 0.5% | ~64 KB (buffer=128) |
| Profiling | < 0.1% | ~1 KB |
| Total | < 0.7% | ~10 MB |
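
To translate the per-step percentages into wall-clock terms, the sketch below multiplies the table's upper bounds by a run length. The step time and step count are made-up inputs for illustration, and the result is an upper bound under the table's figures, not a measurement.

```python
# Upper bounds from the table above, expressed as fractions of step time.
OVERHEAD_UPPER_BOUND = {
    "basic_tracking": 0.001,
    "completion_logging": 0.005,
    "profiling": 0.001,
}


def max_extra_seconds(step_seconds, num_steps, features):
    """Upper bound on added wall-clock time for the enabled features."""
    fraction = sum(OVERHEAD_UPPER_BOUND[name] for name in features)
    return step_seconds * num_steps * fraction


# Hypothetical run: 10,000 steps at 2 seconds per step, all features enabled.
extra = max_extra_seconds(2.0, 10_000, OVERHEAD_UPPER_BOUND)
```

For this hypothetical run, the bound works out to a little over two minutes on top of roughly five and a half hours of training.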

Best Practices

Use ONE logging tool in production (disable WandB/MLflow when using SwanLab)
Adjust completion log interval based on training length (100-200 steps)
Keep completion buffer size reasonable (128-512)
Profile critical path methods first (training_step, compute_loss)
Use ProfilingConfig to throttle high-frequency operations

Contributing

Found an issue or have an improvement? Please submit a PR or open an issue:
