PraMamba 8aab807e67 feat: Add SwanLab integration for experiment tracking (#3334)

* feat(swanlab): add SwanLab integration for experiment tracking

SwanLab integration provides comprehensive experiment tracking and monitoring for Axolotl training.

Features:
- Hyperparameter logging
- Training metrics tracking
- RLHF completion logging
- Performance profiling
- Configuration validation and conflict detection

Includes:
- Plugin in src/axolotl/integrations/swanlab/
- Callback in src/axolotl/utils/callbacks/swanlab.py
- Tests in tests/integrations/test_swanlab.py
- Examples in examples/swanlab/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(swanlab): address PR #3334 review feedback from winglian and CodeRabbit

- Change use_swanlab default to True (winglian)
- Clear buffer after periodic logging to prevent duplicates (CodeRabbit Major)
- Add safe exception handling in config fallback (CodeRabbit)
- Use context managers for file operations (CodeRabbit)
- Replace LOG.error with LOG.exception for better debugging (CodeRabbit)
- Sort __all__ alphabetically (CodeRabbit)
- Add language specifiers to README code blocks (CodeRabbit)
- Fix end-of-file newline in README (pre-commit)

Resolves actionable comments and nitpicks from CodeRabbit review.
Addresses reviewer feedback from @winglian.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* only run swanlab integration tests if package is available

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>

2026-01-06 09:19:18 -05:00

SwanLab Integration for Axolotl

SwanLab is an open-source, lightweight AI experiment tracking and visualization tool that provides a platform for tracking, recording, comparing, and collaborating on experiments.

This integration enables seamless experiment tracking and visualization of Axolotl training runs using SwanLab.

Features

📊 Automatic Metrics Logging: Training loss, learning rate, and other metrics are automatically logged
🎯 Hyperparameter Tracking: Model configuration and training parameters are tracked
📈 Real-time Visualization: Monitor training progress in real-time through SwanLab dashboard
☁️ Cloud & Local Support: Works in both cloud-synced and offline modes
🔄 Experiment Comparison: Compare multiple training runs easily
🤝 Team Collaboration: Share experiments with team members
🎭 RLHF Completion Logging: Automatically log model outputs during DPO/KTO/ORPO/GRPO training for qualitative analysis
⚡ Performance Profiling: Built-in profiling decorators to measure and optimize training performance
🔔 Lark Notifications: Send real-time training updates to team chat (Feishu/Lark integration)

Installation

pip install swanlab

Quick Start

1. Register for SwanLab (optional; only needed for cloud mode)

If you want to use cloud sync features, register at https://swanlab.cn to get your API key.

2. Configure Axolotl Config File

Add SwanLab configuration to your Axolotl YAML config:

# Enable SwanLab plugin
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

# SwanLab configuration
use_swanlab: true
swanlab_project: my-llm-project
swanlab_experiment_name: qwen-finetune-v1
swanlab_mode: cloud  # Options: cloud, local, offline, disabled
swanlab_workspace: my-team  # Optional: organization name
swanlab_api_key: YOUR_API_KEY  # Optional: can also use env var SWANLAB_API_KEY

3. Run Training

# Set API key via environment variable (recommended)
export SWANLAB_API_KEY=your-api-key-here

# Or login once
swanlab login

# Run training as usual
accelerate launch -m axolotl.cli.train your-config.yaml

Configuration Options

Basic Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_swanlab` | bool | `false` | Enable SwanLab tracking |
| `swanlab_project` | str | `None` | Project name (required) |
| `swanlab_experiment_name` | str | `None` | Experiment name |
| `swanlab_description` | str | `None` | Experiment description |
| `swanlab_mode` | str | `cloud` | Sync mode: `cloud`, `local`, `offline`, `disabled` |

Advanced Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `swanlab_workspace` | str | `None` | Workspace/organization name |
| `swanlab_api_key` | str | `None` | API key (prefer env var) |
| `swanlab_web_host` | str | `None` | Private deployment web host |
| `swanlab_api_host` | str | `None` | Private deployment API host |
| `swanlab_log_model` | bool | `false` | Log model checkpoints (coming soon) |
| `swanlab_lark_webhook_url` | str | `None` | Lark (Feishu) webhook URL for team notifications |
| `swanlab_lark_secret` | str | `None` | Lark webhook HMAC secret for authentication |
| `swanlab_log_completions` | bool | `true` | Enable RLHF completion table logging (DPO/KTO/ORPO/GRPO) |
| `swanlab_completion_log_interval` | int | `100` | Steps between completion logging |
| `swanlab_completion_max_buffer` | int | `128` | Max completions to buffer (memory bound) |

Configuration Examples

Example 1: Basic Cloud Sync

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: llama-finetune
swanlab_experiment_name: llama-3-8b-instruct-v1
swanlab_mode: cloud

Example 2: Offline/Local Mode

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: local-experiments
swanlab_experiment_name: test-run-1
swanlab_mode: local  # or 'offline'

Example 3: Team Workspace

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: research-project
swanlab_experiment_name: experiment-42
swanlab_workspace: my-research-team
swanlab_mode: cloud

Example 4: Private Deployment

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: internal-project
swanlab_experiment_name: secure-training
swanlab_mode: cloud
swanlab_web_host: https://swanlab.yourcompany.com
swanlab_api_host: https://api.swanlab.yourcompany.com

Team Notifications with Lark (Feishu)

SwanLab supports sending real-time training notifications to your team chat via Lark (Feishu), ByteDance's enterprise collaboration platform. This is especially useful for:

Production training monitoring: Get alerts when training starts, completes, or encounters errors
Team collaboration: Keep your ML team informed about long-running experiments
Multi-timezone teams: Team members can check training progress without being online

Prerequisites

Lark Bot Setup: Create a custom bot in your Lark group chat
Webhook URL: Get the webhook URL from your Lark bot settings
HMAC Secret (recommended): Enable signature verification in your Lark bot for security

For detailed Lark bot setup instructions, see Lark Custom Bot Documentation.

Example 5: Basic Lark Notifications

Send training notifications to a Lark group chat:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: production-training
swanlab_experiment_name: llama-3-finetune-v2
swanlab_mode: cloud

# Lark notification (basic, no HMAC verification)
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx

Note: This configuration will work, but you'll see a security warning recommending HMAC secret configuration.

Example 6: Lark Notifications with HMAC Security (Recommended)

For production use, enable HMAC signature verification:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: production-training
swanlab_experiment_name: llama-3-finetune-v2
swanlab_mode: cloud

# Lark notification with HMAC authentication
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
swanlab_lark_secret: your-webhook-secret-key

Why HMAC secret matters:

Prevents unauthorized parties from sending fake notifications to your Lark group
Ensures notifications genuinely come from your training jobs
Required for production deployments with sensitive training data
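
For reference, the signature check that the secret enables can be sketched as follows. This is a minimal illustration that assumes Lark's documented custom-bot signing scheme (HMAC-SHA256 keyed with the timestamp, a newline, and the secret, over an empty message, then Base64-encoded). The function name `lark_sign` and the example payload are hypothetical, and the SwanLab plugin's internal implementation may differ.

```python
import base64
import hashlib
import hmac
import time


def lark_sign(secret: str, timestamp: int) -> str:
    """Return a Base64-encoded HMAC-SHA256 signature for a Lark webhook call.

    Assumption: the key is the timestamp, a newline, and the secret, and the
    signed message is empty (Lark's custom-bot signing scheme).
    """
    key = f"{timestamp}\n{secret}".encode("utf-8")
    digest = hmac.new(key, b"", digestmod=hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


# Example: fields a signed request would carry alongside the message itself.
timestamp = int(time.time())
payload = {
    "timestamp": str(timestamp),
    "sign": lark_sign("your-webhook-secret-key", timestamp),
    "msg_type": "text",
    "content": {"text": "Training started"},
}
```

Because the signature depends on both the secret and the current timestamp, a party that only knows the webhook URL cannot produce a valid request.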

Example 7: Team Workspace + Lark Notifications

Combine team workspace collaboration with Lark notifications:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: research-project
swanlab_experiment_name: multimodal-experiment-42
swanlab_workspace: ml-research-team
swanlab_mode: cloud

# Notify team via Lark when training starts/completes
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
swanlab_lark_secret: your-webhook-secret-key

What Notifications Are Sent?

SwanLab's Lark integration sends notifications for key training events:

Training Start: When your experiment begins
Training Complete: When training finishes successfully
Training Errors: If training crashes or encounters critical errors
Metric Milestones: Configurable alerts for metric thresholds (if configured in SwanLab)

Each notification includes:

Experiment name and project
Training status
Key metrics (loss, learning rate)
Direct link to SwanLab dashboard

Lark Configuration Validation

The plugin validates your Lark configuration at startup:

✅ Valid Configurations

# Option 1: No Lark (default)
use_swanlab: true
swanlab_project: my-project
# No swanlab_lark_webhook_url → Lark disabled, no warnings

# Option 2: Lark with HMAC secret (recommended)
use_swanlab: true
swanlab_project: my-project
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
swanlab_lark_secret: your-secret
# ✅ Logs: "Registered Lark notification callback with HMAC authentication"

# Option 3: Lark without secret (works but not recommended)
use_swanlab: true
swanlab_project: my-project
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
# ⚠️ Logs: "Registered Lark notification callback (no HMAC secret)"
# ⚠️ Warning: "Lark webhook has no secret configured. For production use, set 'swanlab_lark_secret'..."

Security Best Practices

Always use HMAC secret in production:

swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-secret-key  # ✅ Add this!

Store secrets in environment variables (even better):

# In your training script/environment
export SWANLAB_LARK_WEBHOOK_URL="https://open.feishu.cn/..."
export SWANLAB_LARK_SECRET="your-secret-key"

Then in config:

# SwanLab plugin will auto-detect environment variables
use_swanlab: true
swanlab_project: my-project
# Lark URL and secret read from env vars

Rotate webhook secrets periodically: Update your Lark bot's secret every 90 days
Use separate webhooks for dev/prod: Don't mix development and production notifications

Distributed Training

Lark notifications are automatically deduplicated in distributed training:

Only rank 0 sends notifications
Other GPU ranks skip Lark registration
Prevents duplicate messages in multi-GPU training

# Running on 4 GPUs
torchrun --nproc_per_node=4 -m axolotl.cli.train config.yml

# Expected logs:
# [Rank 0] Registered Lark notification callback with HMAC authentication
# [Rank 1-3] (no Lark registration messages)

RLHF Completion Table Logging

For RLHF (Reinforcement Learning from Human Feedback) training methods like DPO, KTO, ORPO, and GRPO, SwanLab can log model completions (prompts, chosen/rejected responses, rewards) to a visual table for qualitative analysis. This helps you:

Inspect model behavior: See actual model outputs during training
Debug preference learning: Compare chosen vs rejected responses
Track reward patterns: Monitor how rewards evolve over training
Share examples with team: Visual tables in SwanLab dashboard

Features

✅ Automatic detection: Works with DPO, KTO, ORPO, GRPO trainers
✅ Memory-safe buffering: Bounded buffer prevents memory leaks in long training runs
✅ Periodic logging: Configurable logging interval to reduce overhead
✅ Rich visualization: SwanLab tables show prompts, responses, and metrics side-by-side

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `swanlab_log_completions` | bool | `true` | Enable completion logging for RLHF trainers |
| `swanlab_completion_log_interval` | int | `100` | Log completions to SwanLab every N training steps |
| `swanlab_completion_max_buffer` | int | `128` | Maximum completions to buffer (memory bound) |

Example: DPO Training with Completion Logging

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: dpo-training
swanlab_experiment_name: llama-3-dpo-v1
swanlab_mode: cloud

# RLHF completion logging (enabled by default)
swanlab_log_completions: true
swanlab_completion_log_interval: 100  # Log every 100 steps
swanlab_completion_max_buffer: 128    # Keep last 128 completions

# DPO-specific config
rl: dpo
datasets:
  - path: /path/to/preference_dataset
    type: chatml.intel

Example: Disable Completion Logging

If you're doing a quick test run or don't need completion tables:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: dpo-training

# Disable completion logging
swanlab_log_completions: false

Supported RLHF Trainers

The completion logging callback automatically activates for these trainer types:

DPO (Direct Preference Optimization): Logs prompts, chosen, rejected, reward_diff
KTO (Kahneman-Tversky Optimization): Logs prompts, completions, labels, rewards
ORPO (Odds Ratio Preference Optimization): Logs prompts, chosen, rejected, log_odds_ratio
GRPO (Group Relative Policy Optimization): Logs prompts, completions, rewards, advantages
CPO (Contrastive Preference Optimization): Logs prompts, chosen, rejected

For non-RLHF trainers (standard supervised fine-tuning), the completion callback is automatically skipped.

How It Works

Auto-detection: Plugin detects trainer type at initialization
Buffering: Completions are buffered in memory (up to swanlab_completion_max_buffer)
Periodic logging: Every swanlab_completion_log_interval steps, buffer is logged to SwanLab
Memory safety: Old completions are automatically dropped when buffer is full (uses collections.deque)
Final flush: Remaining completions are logged when training completes
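
The buffering, periodic logging, and final flush steps above can be sketched roughly as below. This is an illustration only, not the plugin's actual callback: the class name `CompletionBuffer`, the `sink` callable, and the small demo values are all hypothetical.

```python
from collections import deque
from typing import Callable


class CompletionBuffer:
    """Bounded buffer that sends its rows to a sink every `log_interval` steps."""

    def __init__(self, max_buffer: int, log_interval: int,
                 sink: Callable[[list], None]) -> None:
        self._buffer = deque(maxlen=max_buffer)  # oldest rows drop when full
        self._log_interval = log_interval
        self._sink = sink

    def add(self, step: int, row: dict) -> None:
        """Buffer one completion row and flush on interval boundaries."""
        self._buffer.append({"step": step, **row})
        if step > 0 and step % self._log_interval == 0:
            self.flush()

    def flush(self) -> None:
        """Send buffered rows to the sink, then clear them to avoid duplicates."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()


# Demo with tiny values: keep at most 3 rows, log every 5 steps.
logged = []
buf = CompletionBuffer(max_buffer=3, log_interval=5, sink=logged.append)
for step in range(1, 8):
    buf.add(step, {"prompt": f"p{step}"})
buf.flush()  # final flush when training completes
```

In this demo the first upload holds only steps 3 to 5, because steps 1 and 2 were dropped when the buffer filled, and the final flush uploads steps 6 and 7.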

Viewing Completion Tables

After training starts, you can view completion tables in your SwanLab dashboard:

Navigate to your experiment in SwanLab
Look for the "rlhf_completions" table in the metrics panel
The table shows:
- step: Training step when completion was generated
- prompt: Input prompt
- chosen: Preferred response (DPO/ORPO)
- rejected: Non-preferred response (DPO/ORPO)
- completion: Model output (KTO/GRPO)
- reward_diff/reward: Reward metrics
- Trainer-specific metrics (e.g., log_odds_ratio for ORPO)

Memory Management

The completion buffer is memory-bounded to prevent memory leaks:

# Internal implementation uses deque with maxlen
from collections import deque

buffer = deque(maxlen=128)  # Old completions automatically dropped

Memory usage estimate:

Average completion: ~500 characters (prompt + responses)
Buffer size 128: ~64 KB (negligible)
Buffer size 1024: ~512 KB (still small)

Recommendation: Default buffer size (128) works well for most cases. Increase to 512-1024 only if you need to review more historical completions.
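
The estimate above is simple multiplication, shown here as a small helper. The function name is hypothetical, and the one-byte-per-character figure is an assumption that matches the numbers quoted above; real Python strings carry extra per-object overhead, so treat the result as a lower bound.

```python
def estimate_buffer_bytes(max_buffer: int, avg_chars: int = 500,
                          bytes_per_char: int = 1) -> int:
    """Rough lower bound on the text held by a full completion buffer."""
    return max_buffer * avg_chars * bytes_per_char


for size in (128, 512, 1024):
    kilobytes = estimate_buffer_bytes(size) / 1000
    print(f"{size:>5} completions: about {kilobytes:.0f} KB")
```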

Performance Impact

Completion logging has minimal overhead:

Buffering: O(1) append operation, negligible CPU/memory
Logging: Only happens every N steps (default: 100)
Network: SwanLab batches table uploads efficiently

Expected overhead: < 0.5% per training step

Troubleshooting

Completions not appearing in SwanLab

Cause: Trainer may not be logging completion data in the expected format.

Diagnostic steps:

Check trainer type detection in logs:

INFO: SwanLab RLHF completion logging enabled for DPOTrainer (type: dpo)

Verify your trainer is an RLHF trainer (DPO/KTO/ORPO/GRPO)
Check if trainer logs completion data (this depends on TRL version)

Note: The current implementation expects trainers to log completion data in the logs dict during on_log() callback. Some TRL trainers may not expose this data by default. You may need to patch the trainer to expose completions.

Buffer fills up too quickly

Cause: High logging frequency with small buffer size.

Solution: Increase buffer size or logging interval:

swanlab_completion_log_interval: 200  # Log less frequently
swanlab_completion_max_buffer: 512    # Larger buffer

Memory usage growing over time

Cause: Buffer should be bounded, so this indicates a bug.

Solution:

Verify swanlab_completion_max_buffer is set
Check SwanLab version is up to date
Report issue with memory profiling data

Performance Profiling

SwanLab integration includes profiling utilities to measure and log execution time of trainer methods. This helps you:

Identify bottlenecks: Find slow operations in your training loop
Optimize performance: Track improvements after optimization changes
Monitor distributed training: See per-rank timing differences
Debug hangs: Detect methods that take unexpectedly long

Features

✅ Zero-config profiling: Automatic timing of key trainer methods
✅ Decorator-based: Easy to add profiling to custom methods with @swanlab_profile
✅ Context manager: Fine-grained profiling with swanlab_profiling_context()
✅ Advanced filtering: ProfilingConfig for throttling and minimum duration thresholds
✅ Exception-safe: Logs duration even if function raises an exception

Basic Usage: Decorator

Add profiling to any trainer method with the @swanlab_profile decorator:

from axolotl.integrations.swanlab.profiling import swanlab_profile

class MyCustomTrainer(AxolotlTrainer):
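    @swanlab_profile
    def training_step(self, model, inputs, *args, **kwargs):
        # Forward extra arguments so the override works across Trainer versions
        return super().training_step(model, inputs, *args, **kwargs)

    @swanlab_profile
    def prediction_step(self, model, inputs, prediction_loss_only, *args, **kwargs):
        # Forward extra arguments such as ignore_keys to the parent class
        return super().prediction_step(
            model, inputs, prediction_loss_only, *args, **kwargs
        )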

The decorator automatically:

Measures execution time with high-precision timer
Logs to SwanLab as profiling/Time taken: ClassName.method_name
Only logs if SwanLab is enabled (use_swanlab: true)
Gracefully handles exceptions (logs duration, then re-raises)
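
To make the behavior listed above concrete, here is a minimal sketch of a timing decorator with the same shape. It is not the source of `@swanlab_profile`: the names `timed`, `log_metric`, and `DemoTrainer` are hypothetical, `log_metric` only records values in a dict instead of calling SwanLab, and logging the duration in seconds is an assumption.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical stand-in for the SwanLab logging call; it only records values.
recorded: dict = {}


def log_metric(name: str, value: float) -> None:
    recorded[name] = value


def timed(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log the wall-clock duration of a method, even if it raises."""

    @functools.wraps(func)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(self, *args, **kwargs)
        finally:
            # Runs on success and on failure; any exception then propagates.
            elapsed = time.perf_counter() - start
            name = f"profiling/Time taken: {type(self).__name__}.{func.__name__}"
            log_metric(name, elapsed)

    return wrapper


class DemoTrainer:
    @timed
    def training_step(self) -> str:
        return "ok"

    @timed
    def failing_step(self) -> None:
        raise RuntimeError("boom")
```

The `try`/`finally` pair is what gives the "logs duration, then re-raises" behavior without catching the exception explicitly.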

Advanced Usage: Context Manager

For fine-grained profiling within a method:

from axolotl.integrations.swanlab.profiling import swanlab_profiling_context

class MyTrainer(AxolotlTrainer):
    def complex_training_step(self, model, inputs):
        # Profile just the forward pass
        with swanlab_profiling_context(self, "forward_pass"):
            outputs = model(**inputs)

        # Profile just the backward pass
        with swanlab_profiling_context(self, "backward_pass"):
            loss = outputs.loss
            loss.backward()

        return outputs

Advanced Usage: ProfilingConfig

Filter and throttle profiling logs with ProfilingConfig:

from axolotl.integrations.swanlab.profiling import (
    swanlab_profiling_context_advanced,
    ProfilingConfig,
)

# Create custom profiling config
profiling_config = ProfilingConfig(
    enabled=True,
    min_duration_ms=1.0,    # Only log if duration > 1ms
    log_interval=10,        # Log every 10th call
)

class MyTrainer(AxolotlTrainer):
    def frequently_called_method(self, data):
        with swanlab_profiling_context_advanced(
            self,
            "frequent_op",
            config=profiling_config
        ):
            # This only logs every 10th call, and only if it takes > 1ms
            result = expensive_computation(data)
        return result

ProfilingConfig Parameters:

enabled: Enable/disable profiling globally (default: True)
min_duration_ms: Minimum duration to log in milliseconds (default: 0.1)
log_interval: Log every Nth function call (default: 1 = log all)

Use cases:

High-frequency methods: Use log_interval=100 to reduce logging overhead
Filter noise: Use min_duration_ms=1.0 to skip very fast operations
Debugging: Use log_interval=1, min_duration_ms=0.0 to log everything
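
How the three parameters might combine into a single log-or-skip decision is sketched below. This is an assumption about the logic, not the actual `ProfilingConfig` code: `ThrottleConfig` and `should_log` are hypothetical names, and treating a duration exactly equal to `min_duration_ms` as loggable is a guess.

```python
from dataclasses import dataclass, field


@dataclass
class ThrottleConfig:
    """Hypothetical mirror of the three documented ProfilingConfig fields."""

    enabled: bool = True
    min_duration_ms: float = 0.1
    log_interval: int = 1
    _calls: dict = field(default_factory=dict)  # per-name call counters

    def should_log(self, name: str, duration_ms: float) -> bool:
        """Decide whether this call's duration should be logged."""
        if not self.enabled:
            return False
        self._calls[name] = self._calls.get(name, 0) + 1
        if self._calls[name] % self.log_interval != 0:
            return False  # throttled: not the Nth call
        return duration_ms >= self.min_duration_ms


cfg = ThrottleConfig(min_duration_ms=1.0, log_interval=10)
decisions = [cfg.should_log("frequent_op", 2.0) for _ in range(30)]
```

With `log_interval=10`, only the 10th, 20th, and 30th calls pass the throttle, so 30 calls produce three log entries.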

Viewing Profiling Metrics

In your SwanLab dashboard, profiling metrics appear under the "profiling" namespace:

profiling/Time taken: AxolotlTrainer.training_step
profiling/Time taken: AxolotlTrainer.prediction_step
profiling/Time taken: MyTrainer.forward_pass
profiling/Time taken: MyTrainer.backward_pass

You can:

Track over time: See if methods get faster/slower during training
Compare runs: Compare profiling metrics across experiments
Identify regressions: Detect if a code change slowed down training

Configuration in Axolotl Config

Profiling is automatically enabled when SwanLab is enabled. No additional config needed:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-project

# Profiling is automatically enabled
# Add @swanlab_profile decorators to your custom trainer methods

To disable profiling while keeping SwanLab enabled:

# In your custom trainer code
from axolotl.integrations.swanlab.profiling import DEFAULT_PROFILING_CONFIG

# Disable profiling globally
DEFAULT_PROFILING_CONFIG.enabled = False

Performance Impact

Decorator overhead: ~2-5 microseconds per call (negligible)
Context manager overhead: ~1-3 microseconds (negligible)
Logging overhead: Only when SwanLab is enabled and method duration exceeds threshold
Network overhead: SwanLab batches metrics efficiently

Expected overhead: < 0.1% per training step (effectively zero)

Best Practices

Profile bottlenecks first: Start by profiling suspected slow operations
Use min_duration_ms: Filter out fast operations (< 1ms) to reduce noise
Throttle high-frequency calls: Use log_interval for methods called > 100 times/step
Profile across runs: Compare profiling metrics before/after optimization
Monitor distributed training: Check for rank-specific slowdowns

Example: Complete Profiling Setup

from axolotl.integrations.swanlab.profiling import (
    swanlab_profile,
    swanlab_profiling_context_advanced,  # used in _prepare_inputs below
    ProfilingConfig,
)

class OptimizedTrainer(AxolotlTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Custom profiling config for high-frequency operations
        self.fast_op_config = ProfilingConfig(
            enabled=True,
            min_duration_ms=0.5,
            log_interval=50,
        )

    @swanlab_profile
    def training_step(self, model, inputs, *args, **kwargs):
        """Main training step - always profile."""
        return super().training_step(model, inputs, *args, **kwargs)

    @swanlab_profile
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """Loss computation - always profile."""
        return super().compute_loss(model, inputs, return_outputs, **kwargs)

    def _prepare_inputs(self, inputs):
        """High-frequency operation - throttled profiling."""
        with swanlab_profiling_context_advanced(
            self,
            "prepare_inputs",
            config=self.fast_op_config,
        ):
            return super()._prepare_inputs(inputs)

Troubleshooting

Profiling metrics not appearing in SwanLab

Cause: SwanLab is not enabled or not initialized.

Solution:

# Ensure SwanLab is enabled
use_swanlab: true
swanlab_project: my-project

Check logs for:

INFO: SwanLab initialized for project: my-project

Too many profiling metrics cluttering dashboard

Cause: Profiling every function call for high-frequency operations.

Solution: Use ProfilingConfig with throttling:

config = ProfilingConfig(
    min_duration_ms=1.0,    # Skip fast ops
    log_interval=100,       # Log every 100th call
)

Profiling overhead impacting training speed

Cause: Profiling itself should have negligible overhead (< 0.1%). If you see > 1% slowdown, this indicates a bug.

Solution:

Disable profiling temporarily to confirm:

DEFAULT_PROFILING_CONFIG.enabled = False

Report issue with profiling data and trainer details

Profiling shows inconsistent timing

Cause: Normal variation due to GPU warmup, data loading, or system load.

Solution:

Ignore first few steps (warmup period)
Look at average/median timing over many steps
Use log_interval to reduce noise from individual outliers

Complete Config Example

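Here's a complete example that integrates SwanLab into a full training config. The project, experiment, and workspace names come from an RVQ-Alpha training run; substitute your own: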

base_model: /path/to/your/model
model_type: Qwen2ForCausalLM

# SwanLab Integration
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

use_swanlab: true
swanlab_project: RVQ-Alpha-Training
swanlab_experiment_name: Qwen2.5-7B-MetaQA-Perturb-P020
swanlab_description: "Training on MetaQA and Perturbation datasets with NEW-RVQ encoding"
swanlab_mode: cloud
swanlab_workspace: single-cell-genomics

# Training configuration
sequence_len: 32768
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 2
learning_rate: 2e-5
optimizer: adamw_torch_fused

# Datasets
datasets:
  - path: /path/to/dataset
    type: chat_template

# Output
output_dir: ./outputs

Modes Explained

`cloud` Mode (Default)

Syncs experiments to SwanLab cloud in real-time
Requires API key and internet connection
Best for: Team collaboration, remote monitoring

`local` Mode

Saves experiments locally only
No cloud sync
Best for: Local development, air-gapped environments

`offline` Mode

Saves metadata locally
Can sync to cloud later using swanlab sync
Best for: Unstable internet connections, where you want to sync later

`disabled` Mode

Turns off SwanLab completely
No logging or tracking
Best for: Debugging, testing

Configuration Validation & Conflict Detection

SwanLab integration includes comprehensive validation and conflict detection to help you catch configuration errors early and avoid performance issues.

Required Fields Validation

The plugin validates your configuration at startup and provides clear error messages with solutions:

Missing Project Name

# ❌ INVALID: use_swanlab enabled but no project
use_swanlab: true
# Error: SwanLab enabled but 'swanlab_project' is not set.

Solution:

# ✅ VALID: Provide project name
use_swanlab: true
swanlab_project: my-project

Invalid Mode

# ❌ INVALID: Unknown mode
use_swanlab: true
swanlab_project: my-project
swanlab_mode: invalid-mode
# Error: Invalid swanlab_mode: 'invalid-mode'. Valid options: cloud, local, offline, disabled

Solution:

# ✅ VALID: Use one of the valid modes
use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud  # or: local, offline, disabled

Empty Project Name

# ❌ INVALID: Empty string project name
use_swanlab: true
swanlab_project: ""
# Error: swanlab_project cannot be an empty string.

Solution:

# ✅ VALID: Provide non-empty project name
use_swanlab: true
swanlab_project: my-project
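
The three required-field checks above can be summarized in one small function. This is a sketch of the rules as documented, not the plugin's code: the function name and the choice of `ValueError` are assumptions, while the error messages are copied from the examples above.

```python
VALID_MODES = ("cloud", "local", "offline", "disabled")


def validate_swanlab_config(cfg: dict) -> None:
    """Raise ValueError for the three documented configuration errors."""
    if not cfg.get("use_swanlab"):
        return  # nothing to validate when SwanLab is off

    project = cfg.get("swanlab_project")
    if project is None:
        raise ValueError("SwanLab enabled but 'swanlab_project' is not set.")
    if project == "":
        raise ValueError("swanlab_project cannot be an empty string.")

    mode = cfg.get("swanlab_mode", "cloud")  # cloud is the documented default
    if mode not in VALID_MODES:
        raise ValueError(
            f"Invalid swanlab_mode: '{mode}'. "
            f"Valid options: {', '.join(VALID_MODES)}"
        )


# A valid configuration passes silently.
validate_swanlab_config({"use_swanlab": True, "swanlab_project": "my-project"})
```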

Cloud Mode API Key Warning

When using cloud mode without an API key, you'll receive a warning with multiple solutions:

use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud
# No API key set
# Warning: SwanLab cloud mode enabled but no API key found.

Solutions:

Set environment variable: export SWANLAB_API_KEY=your-api-key
Add to config (less secure): swanlab_api_key: your-api-key
Run swanlab login before training
Use swanlab_mode: local for offline tracking
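
The condition behind this warning can be approximated as below. The function name is hypothetical, and the check only looks at the `SWANLAB_API_KEY` environment variable and the `swanlab_api_key` config field; it does not detect credentials stored by `swanlab login`, which the real plugin may handle differently.

```python
import os
from typing import Mapping, Optional


def needs_api_key_warning(cfg: Mapping, environ: Optional[Mapping] = None) -> bool:
    """Return True when cloud mode is active and no API key is visible."""
    environ = os.environ if environ is None else environ
    if not cfg.get("use_swanlab"):
        return False
    if cfg.get("swanlab_mode", "cloud") != "cloud":
        return False  # local, offline, and disabled modes need no key
    has_key = bool(environ.get("SWANLAB_API_KEY") or cfg.get("swanlab_api_key"))
    return not has_key


# Cloud mode with no key anywhere: the warning condition is met.
warn = needs_api_key_warning({"use_swanlab": True, "swanlab_mode": "cloud"}, {})
```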

Multi-Logger Performance Warnings

Using multiple logging tools simultaneously (SwanLab + WandB + MLflow + Comet) can impact training performance:

Two Loggers - Warning

use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

# Warning: Multiple logging tools enabled: SwanLab, WandB
# Expected overhead: ~3.0% per training step.

Impact:

Performance overhead: ~1-2% per logger (cumulative)
Increased memory usage
Longer training time per step
Potential config/callback conflicts

Recommendations:

Choose ONE primary logging tool for production training
Use multiple loggers only for:
- Migration period (transitioning between tools)
- Short comparison runs
- Debugging specific tool issues
Monitor system resources (CPU, memory) during training

Three+ Loggers - Error-Level Warning

use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

use_mlflow: true
mlflow_tracking_uri: http://localhost:5000

# ERROR: 3 logging tools enabled simultaneously!
# Expected overhead: ~4.5% per training step.
# STRONGLY RECOMMEND: Disable all but ONE logging tool

Why This Matters:

With 3 loggers: ~4-5% overhead per step → significant slowdown over long training
Example: 10,000 steps at 2s/step (20,000 s total) → ~800-1,000 seconds extra (13-17 minutes)
Memory overhead scales with number of loggers
Rare edge cases with callback ordering conflicts
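
The cumulative overhead can be worked out with a short calculation. The helper below is illustrative: the function name is hypothetical, and the 1.5% per-logger figure is an assumed midpoint of the ~1-2% range, chosen because it reproduces the ~3.0% and ~4.5% values shown in the warnings above.

```python
def extra_seconds(num_loggers: int, steps: int, seconds_per_step: float,
                  overhead_per_logger: float = 0.015) -> float:
    """Estimate extra wall-clock time added by running several loggers."""
    total_overhead = num_loggers * overhead_per_logger  # cumulative fraction
    return steps * seconds_per_step * total_overhead


# Three loggers, 10,000 steps at 2 s/step (20,000 s of baseline training).
added = extra_seconds(num_loggers=3, steps=10_000, seconds_per_step=2.0)
print(f"about {added:.0f} s extra ({added / 60:.0f} minutes)")
```

Under these assumptions three loggers add roughly 900 seconds, or about 15 minutes, to a 20,000-second run.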

Auto-Enable Logic

For convenience, SwanLab will auto-enable if you specify a project without setting use_swanlab:

# This configuration:
swanlab_project: my-project

# Automatically becomes:
use_swanlab: true
swanlab_project: my-project
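
In code, the auto-enable rule amounts to the following sketch. The function name is hypothetical, and leaving an explicit `use_swanlab: false` untouched is an assumption based on the wording "without setting use_swanlab".

```python
def apply_auto_enable(cfg: dict) -> dict:
    """Return a copy of cfg with use_swanlab enabled when only a project is set."""
    out = dict(cfg)
    if out.get("swanlab_project") and out.get("use_swanlab") is None:
        out["use_swanlab"] = True
    return out


enabled = apply_auto_enable({"swanlab_project": "my-project"})
```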

Distributed Training Detection

In distributed training scenarios (multi-GPU), the plugin automatically detects and reports:

use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud

# When running with torchrun --nproc_per_node=4:
# Info: Distributed training detected (world_size=4)
# Info: SwanLab mode: cloud
# Info: Only rank 0 will initialize SwanLab
# Info: Other ranks will skip SwanLab to avoid conflicts

Why Only Rank 0:

Avoids duplicate experiment runs
Reduces network/cloud API overhead on worker ranks
Prevents race conditions in metric logging
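
A rank-0 guard of this kind can be sketched as follows. It relies on the `RANK` environment variable that `torchrun` sets for each process; the function names and the `init_fn` callable are hypothetical, and the plugin may detect the rank through other means.

```python
import os
from typing import Callable, Mapping, Optional


def is_main_process(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True on rank 0, or when no distributed launcher has set RANK."""
    environ = os.environ if environ is None else environ
    return int(environ.get("RANK", "0")) == 0


def maybe_init_tracking(init_fn: Callable[[], None],
                        environ: Optional[Mapping[str, str]] = None) -> bool:
    """Run init_fn only on the main process; report whether it ran."""
    if not is_main_process(environ):
        return False  # worker ranks skip tracking to avoid duplicate runs
    init_fn()
    return True


# Simulate four ranks: only rank 0 initializes tracking.
initialized = []
for rank in range(4):
    maybe_init_tracking(lambda: initialized.append(rank), {"RANK": str(rank)})
```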

Authentication

Method 1: Environment Variable (Recommended)

export SWANLAB_API_KEY=your-api-key-here

Method 2: Login Command

swanlab login
# Enter your API key when prompted

Method 3: Config File

swanlab_api_key: your-api-key-here

What Gets Logged?

Automatically Logged Metrics

Training loss
Learning rate
Gradient norm
Training steps
Epoch progress

Automatically Logged Config

Model configuration (base_model, model_type)
Training hyperparameters (learning_rate, batch_size, etc.)
Optimizer settings
Parallelization settings (FSDP, DeepSpeed, Context Parallel)
Axolotl configuration file
DeepSpeed configuration (if used)

Viewing Your Experiments

Cloud Mode

Visit https://swanlab.cn and navigate to your project to view:

Real-time training metrics
Hyperparameter comparison
System resource usage
Configuration files

Local Mode

# Start local dashboard
swanlab watch ./swanlog

# Open browser to http://localhost:5092

Integration with Existing Tools

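SwanLab can run alongside other tracking tools, although enabling more than one logger triggers the warning described under "Multi-Logger Performance Warnings":

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

# Use both SwanLab and WandB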
use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

Troubleshooting

Configuration Errors

Error: "SwanLab enabled but 'swanlab_project' is not set"

Cause: You enabled SwanLab (use_swanlab: true) but forgot to specify a project name.

Solution:

use_swanlab: true
swanlab_project: my-project  # Add this line

Error: "Invalid swanlab_mode: 'xxx'"

Cause: You provided an invalid mode value.

Solution: Use one of the valid modes:

swanlab_mode: cloud     # or: local, offline, disabled

Error: "swanlab_project cannot be an empty string"

Cause: You set swanlab_project: "" (empty string).

Solution: Either provide a valid name or remove the field:

# Option 1: Provide valid name
swanlab_project: my-project

# Option 2: Remove the field entirely
# swanlab_project: ""  <- Remove this line

Import Errors

Error: "SwanLab is not installed"

Cause: SwanLab package is not installed in your environment.

Solution:

pip install swanlab
# or
pip install 'swanlab>=0.3.0'  # quote the specifier so the shell does not treat > as a redirect

Performance Issues

Warning: "Multiple logging tools enabled"

Cause: You have multiple experiment tracking tools enabled (e.g., SwanLab + WandB + MLflow).

Impact: ~1-2% performance overhead per logger, cumulative.

Solution: For production training, disable all but one logger:

# Option 1: Keep only SwanLab
use_swanlab: true
swanlab_project: my-project
use_wandb: false      # Disable others
use_mlflow: false

# Option 2: Keep only WandB
use_swanlab: false
use_wandb: true
wandb_project: my-project

Exception: Multiple loggers are acceptable for:

Short comparison runs (< 100 steps)
Migration testing between logging tools
Debugging logger-specific issues

Distributed Training Issues

SwanLab creates duplicate runs in multi-GPU training

Cause: All ranks are initializing SwanLab instead of just rank 0.

Expected Behavior: The plugin automatically ensures only rank 0 initializes SwanLab. You should see:

Info: Distributed training detected (world_size=4)
Info: Only rank 0 will initialize SwanLab
Info: Other ranks will skip SwanLab to avoid conflicts

If you see duplicates:

Check your plugin is loaded correctly
Verify you're using the latest SwanLab integration code
Check logs for initialization messages on all ranks
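
When checking logs per rank, it can help to confirm which rank each process believes it is. The sketch below reads the `RANK` environment variable that launchers such as torchrun export. The helper name `is_main_process` is hypothetical, and the plugin's own rank detection may work differently.

```python
import os


def is_main_process(environ=None) -> bool:
    """Return True when this process is rank 0.

    Falls back to rank 0 when RANK is unset (single-process training).
    Illustrative only; not the plugin's actual check.
    """
    env = os.environ if environ is None else environ
    return int(env.get("RANK", "0")) == 0


if __name__ == "__main__":
    print(f"RANK={os.environ.get('RANK', 'unset')} main={is_main_process()}")
```

If more than one process reports itself as the main process, the launcher is probably not setting `RANK`, which would explain duplicate runs.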

SwanLab not logging metrics

Solution: Ensure SwanLab is initialized before training starts. The plugin automatically handles this in pre_model_load.

API Key errors

Solution:

# Verify API key
echo $SWANLAB_API_KEY

# Re-login
swanlab login

Cloud sync issues

Solution: Use offline mode and sync later:

swanlab_mode: offline

Then sync when ready:

swanlab sync ./swanlog

Plugin not loaded

Solution: Verify plugin path in config:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin  # Correct path

Lark Notification Issues

Error: "Failed to import SwanLab Lark plugin"

Cause: Your SwanLab version doesn't include the Lark plugin (requires SwanLab >= 0.3.0).

Solution:

# Upgrade SwanLab to latest version
pip install --upgrade swanlab

# Or install specific version
pip install 'swanlab>=0.3.0'

Warning: "Lark webhook has no secret configured"

Cause: You provided swanlab_lark_webhook_url but no swanlab_lark_secret.

Impact: Lark notifications will work, but without HMAC authentication (security risk).

Solution: Add HMAC secret for production use:

swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
swanlab_lark_secret: your-webhook-secret  # Add this line

When it's OK to skip the secret:

Local development and testing
Internal networks with restricted access
Non-sensitive training experiments

When a secret is required:

Production training jobs
Training with proprietary data
Multi-team shared Lark groups
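
For context on what the secret protects, here is a minimal sketch of the signing scheme described in Feishu's custom-bot documentation: HMAC-SHA256 keyed with the timestamp and secret joined by a newline, computed over an empty message, then Base64 encoded. The function name `lark_sign` is hypothetical. SwanLab's Lark plugin handles signing internally, and its implementation may differ from this sketch.

```python
import base64
import hashlib
import hmac
import time


def lark_sign(timestamp: int, secret: str) -> str:
    """Compute a Feishu/Lark custom-bot signature (illustrative sketch)."""
    # The key is "<timestamp>\n<secret>"; the signed message is empty.
    key = f"{timestamp}\n{secret}".encode("utf-8")
    digest = hmac.new(key, digestmod=hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


if __name__ == "__main__":
    ts = int(time.time())
    print({"timestamp": str(ts), "sign": lark_sign(ts, "your-webhook-secret")})
```

Without a secret, anyone who learns the webhook URL can post to the group, which is why the secret matters for shared or production channels.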

Error: "Failed to register Lark callback"

Cause: Invalid webhook URL or network connectivity issues.

Diagnostic steps:

# 1. Test webhook URL manually
curl -X POST "YOUR_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"msg_type":"text","content":{"text":"Test from Axolotl"}}'

# 2. Check SwanLab version
pip show swanlab

# 3. Verify webhook URL format
# Should start with: https://open.feishu.cn/open-apis/bot/v2/hook/

Solution:

Verify webhook URL is correct (copy from Lark bot settings)
Check network connectivity to Lark API
Ensure webhook is not expired (Lark webhooks can expire)
Regenerate webhook URL in Lark bot settings if needed
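
The diagnostic steps above can also be scripted. The sketch below checks the URL prefix given in this section and sends the same test payload as the curl command, using only the standard library. The function names are hypothetical, and the prefix check covers only the feishu.cn format shown above.

```python
import json
import urllib.request

# Prefix taken from the "Verify webhook URL format" step above.
WEBHOOK_PREFIX = "https://open.feishu.cn/open-apis/bot/v2/hook/"


def looks_like_lark_webhook(url: str) -> bool:
    """Return True if the URL has the expected prefix plus a hook token."""
    return url.startswith(WEBHOOK_PREFIX) and len(url) > len(WEBHOOK_PREFIX)


def build_test_payload(text: str = "Test from Axolotl") -> bytes:
    """Build the same JSON body as the curl command above."""
    return json.dumps({"msg_type": "text", "content": {"text": text}}).encode("utf-8")


def send_test_message(url: str, timeout: float = 10.0) -> str:
    """POST the test payload and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=build_test_payload(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Call `send_test_message` only with a real webhook URL. A network error here points to connectivity problems, while an error body from Lark points to an expired or invalid webhook.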

Lark notifications not received

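Cause: Several things can prevent delivery, including a missing callback registration, an invalid webhook, reading logs from the wrong rank, SwanLab not being initialized, or the bot not being in the group.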

Diagnostic checklist:

Check training logs for Lark registration confirmation:

# Expected log message (rank 0 only):
INFO: Registered Lark notification callback with HMAC authentication

Verify webhook in Lark: Test webhook manually (see above)

Check distributed training: Only rank 0 sends notifications

# If running multi-GPU, check rank 0 logs specifically
grep "Registered Lark" logs/rank_0.log

Verify SwanLab is initialized: Lark callback needs SwanLab to be running

use_swanlab: true  # Must be enabled
swanlab_project: my-project  # Must be set

Check Lark bot permissions: Ensure bot is added to the target group chat

Duplicate Lark notifications in multi-GPU training

Expected Behavior: This should not happen, because only rank 0 sends notifications.

If you see duplicates:

Check that all GPUs are using the same config file
Verify plugin is loaded correctly on all ranks
Check logs for unexpected Lark initialization on non-zero ranks
Ensure RANK or LOCAL_RANK environment variables are set correctly

Solution: If duplicates persist after these checks, treat it as a bug and report it with:

Full training command
Logs from all ranks
Config file

Comparison: SwanLab vs WandB

Feature	SwanLab	WandB
Open Source	✅ Yes	❌ No
Self-Hosting	✅ Easy	⚠️ Complex
Free Tier	✅ Generous	⚠️ Limited
Chinese Support	✅ Native	⚠️ Limited
Offline Mode	✅ Full support	✅ Supported
Integration	🆕 New	✅ Mature

Advanced Usage

Custom Logging

You can add custom metrics in your callbacks:

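import swanlab

# In your custom callback, after the plugin has initialized SwanLab.
# `value` and `epoch_num` are placeholders for your own metric and epoch counter.
value = 0.42
epoch_num = 1

swanlab.log({
    "custom_metric": value,
    "epoch": epoch_num,
})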

Experiment Comparison

# Compare multiple experiments
swanlab compare run1 run2 run3

Support

Documentation: https://docs.swanlab.cn
GitHub: https://github.com/SwanHubX/SwanLab
Issues: Report bugs at GitHub Issues

License

This integration follows the Axolotl Community License Agreement.

Acknowledgements

This integration is built on top of:

SwanLab - Experiment tracking tool
Transformers - SwanLabCallback
Axolotl - Training framework
