Files

Wing Lian e4032fc90f Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602)

* upgrade to torchao 0.17.0

* chore: lint

* refactor attention handling

* replace legacy attention boolean flags with capability properties

Replace checks with capability-based properties derived from attn_implementation

This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property

* compute attn capability flags in normalizer instead of properties

* make attn_implementation the single source of truth

* move attention-dependent validators to mode=after

* migrate remaining consumers to canonical attn_implementation

* expand attention tests + rewrite docs

* migrate example configs to canonical attn_implementation

* update doc snippets + reject gemma4-hybrid with non-FA2 backend

* remove dead gemma4 branch in _set_attention_config

* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests

* drop "Phase 2" naming from attn-implementation tests

* regroup attn_implementation tests by feature concern

* clean up verbose comments and remove MD

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x

In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False).  When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.

The old code placed `return_dict=True` inside processor_kwargs.  In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning).  The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:

  IndexError: too many indices for tensor of dimension 2

The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped).  No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.

Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.

Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): process_rows returns dict (BatchFeature) shape

Two related changes for the multimodal chat collator under transformers 5.x:

1. Wrap apply_chat_template result in dict(...) so process_rows returns
   a plain dict rather than a BatchFeature instance. BatchFeature is a
   Mapping but not a dict; downstream code that did
     batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
   would index on a tensor when the result wasn't dict-shaped, raising
     IndexError: too many indices for tensor of dimension 2

2. Soften the regression test's contract from `dict` to `Mapping` so it
   exercises the actual semantic guarantee (key/value access) rather
   than the implementation detail (dict vs BatchFeature). Test guards
   against the original transformers 5.x breakage where apply_chat_template's
   return_dict default went from True to False.

Includes regression test under tests/test_mm_chat_collator.py.

Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.

Signed-off-by: Wing Lian <wing@axolotl.ai>

---------

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

2026-05-05 10:15:18 -04:00

File	Last commit	Date
custom_trainer_profiling.py	feat: Add SwanLab integration for experiment tracking (#3334)	2026-01-06 09:19:18 -05:00
dpo-swanlab-completions.yml	Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602)	2026-05-05 10:15:18 -04:00
dpo-swanlab-full-featured.yml	Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602)	2026-05-05 10:15:18 -04:00
lora-swanlab-profiling.yml	Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602)	2026-05-05 10:15:18 -04:00
README.md	feat: Add SwanLab integration for experiment tracking (#3334)	2026-01-06 09:19:18 -05:00

README.md

SwanLab Integration Examples

This directory contains example configurations demonstrating SwanLab integration with Axolotl.

Examples Overview

1. DPO with Completion Logging

File: dpo-swanlab-completions.yml

Demonstrates DPO (Direct Preference Optimization) training with RLHF completion table logging.

Features:

Basic SwanLab experiment tracking
Completion table logging (prompts, chosen/rejected responses, rewards)
Memory-bounded buffer for long training runs
Cloud sync configuration

Best for: RLHF practitioners who want to analyze model outputs qualitatively

Quick start:

export SWANLAB_API_KEY=your-api-key
accelerate launch -m axolotl.cli.train examples/swanlab/dpo-swanlab-completions.yml

2. LoRA with Performance Profiling

File: lora-swanlab-profiling.yml

Demonstrates standard LoRA fine-tuning with performance profiling enabled.

Features:

SwanLab experiment tracking
Automatic profiling of trainer methods
Profiling metrics visualization
Performance optimization guidance

Best for: Engineers optimizing training performance and comparing different configurations

Quick start:

export SWANLAB_API_KEY=your-api-key
accelerate launch -m axolotl.cli.train examples/swanlab/lora-swanlab-profiling.yml

3. Full-Featured DPO Production Setup

File: dpo-swanlab-full-featured.yml

Comprehensive production-ready configuration with ALL SwanLab features enabled.

Features:

Experiment tracking with team workspace
RLHF completion logging
Performance profiling
Lark (Feishu) team notifications
Private deployment support
Production checklist and troubleshooting

Best for: Production RLHF training with team collaboration

Quick start:

export SWANLAB_API_KEY=your-api-key
export SWANLAB_LARK_WEBHOOK_URL=https://open.feishu.cn/...
export SWANLAB_LARK_SECRET=your-webhook-secret
accelerate launch -m axolotl.cli.train examples/swanlab/dpo-swanlab-full-featured.yml

4. Custom Trainer Profiling (Python)

File: custom_trainer_profiling.py

Python code examples showing how to add SwanLab profiling to custom trainers.

Features:

@swanlab_profile decorator examples
Context manager profiling for fine-grained timing
ProfilingConfig for advanced filtering and throttling
Multiple profiling patterns and best practices

Best for: Advanced users creating custom trainers

Usage:

from custom_trainer_profiling import CustomTrainerWithProfiling
# See file for detailed examples and patterns
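
The two profiling patterns listed above (a decorator and a context manager) can be sketched with the standard library alone. This is a minimal illustration, not the code in custom_trainer_profiling.py and not the real @swanlab_profile decorator: `profile_method`, `profile_block`, `TIMINGS`, and `ToyTrainer` are hypothetical names, and durations are collected in a local dict rather than sent to SwanLab.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from functools import wraps

# Hypothetical in-memory sink: metric name -> list of durations in seconds.
TIMINGS = defaultdict(list)


def profile_method(func):
    """Time every call of a method and record it under 'profiling/<Class>.<method>'."""

    @wraps(func)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return func(self, *args, **kwargs)
        finally:
            name = f"profiling/{type(self).__name__}.{func.__name__}"
            TIMINGS[name].append(time.perf_counter() - start)

    return wrapper


@contextmanager
def profile_block(name):
    """Time an arbitrary block of code for finer-grained measurements."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[f"profiling/{name}"].append(time.perf_counter() - start)


class ToyTrainer:
    """Stand-in for a custom trainer; the 'work' is just a sum."""

    @profile_method
    def training_step(self, batch):
        with profile_block("forward"):
            total = sum(batch)
        return total


result = ToyTrainer().training_step([1, 2, 3])
for metric, durations in TIMINGS.items():
    print(metric, f"{durations[-1]:.6f}s")
```

The decorator suits whole methods such as training_step or compute_loss, while the context manager isolates a single phase inside a method.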

Feature Matrix

Example	Tracking	Completion Logging	Profiling	Lark Notifications	Team Workspace
dpo-swanlab-completions.yml	✅	✅	✅ (auto)	➖ (commented)	➖ (commented)
lora-swanlab-profiling.yml	✅	➖ (disabled)	✅ (auto)	➖ (commented)	➖ (commented)
dpo-swanlab-full-featured.yml	✅	✅	✅ (auto)	✅	✅
custom_trainer_profiling.py	N/A	N/A	✅ (manual)	N/A	N/A

Configuration Quick Reference

Basic SwanLab Setup

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-project
swanlab_experiment_name: my-experiment
swanlab_mode: cloud  # cloud, local, offline, disabled

RLHF Completion Logging

swanlab_log_completions: true
swanlab_completion_log_interval: 100  # Log every 100 steps
swanlab_completion_max_buffer: 128    # Memory-bounded buffer
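
To make the interaction between the log interval and the buffer size concrete, here is a rough sketch of a memory-bounded completion buffer. It is an assumption about how such a buffer could behave, not the plugin's actual implementation: `CompletionBuffer` and its methods are hypothetical, and the oldest rows are simply discarded once the buffer is full.

```python
from collections import deque


class CompletionBuffer:
    """Hypothetical bounded buffer; NOT the plugin's real class."""

    def __init__(self, max_size=128, log_interval=100):
        # deque(maxlen=...) silently drops the oldest row when full,
        # so memory use stays bounded on long runs.
        self.rows = deque(maxlen=max_size)
        self.log_interval = log_interval

    def add(self, step, prompt, chosen, rejected):
        self.rows.append(
            {"step": step, "prompt": prompt, "chosen": chosen, "rejected": rejected}
        )

    def should_flush(self, step):
        # Flush on every multiple of the interval (e.g. steps 100, 200, ...).
        return step > 0 and step % self.log_interval == 0

    def flush(self):
        rows = list(self.rows)
        self.rows.clear()
        return rows


buffer = CompletionBuffer(max_size=128, log_interval=100)
for step in range(1, 301):
    buffer.add(step, f"prompt {step}", "chosen text", "rejected text")

flushed = buffer.flush()
print(len(flushed), flushed[0]["step"])
```

In this sketch, 300 rows are added without an intermediate flush, so only the most recent 128 survive. A buffer smaller than the number of rows produced per interval therefore loses the earliest completions.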

Lark Team Notifications

swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-webhook-secret  # Required for production
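
The secret is used to sign each webhook request. The sketch below follows the signing scheme for Lark custom bots as I understand it (HMAC-SHA256 keyed with "timestamp, newline, secret" over an empty message, then Base64), so check it against the official Lark documentation before relying on it. `lark_sign` and `build_lark_payload` are hypothetical helper names, and the code only builds the payload without sending anything.

```python
import base64
import hashlib
import hmac
import json
import time


def lark_sign(timestamp: int, secret: str) -> str:
    """Compute the request signature (assumed Lark custom-bot scheme)."""
    string_to_sign = f"{timestamp}\n{secret}"
    digest = hmac.new(
        string_to_sign.encode("utf-8"), digestmod=hashlib.sha256
    ).digest()
    return base64.b64encode(digest).decode("utf-8")


def build_lark_payload(text: str, secret: str, timestamp=None) -> dict:
    """Build a signed text-message payload; the caller decides how to POST it."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    return {
        "timestamp": str(timestamp),
        "sign": lark_sign(timestamp, secret),
        "msg_type": "text",
        "content": {"text": text},
    }


# Fixed timestamp only so the example is reproducible.
payload = build_lark_payload(
    "Training finished", "your-webhook-secret", timestamp=1700000000
)
print(json.dumps(payload, indent=2))
```

Because the timestamp is part of the signature, a wrong system clock is one possible cause of rejected notifications even when the secret is correct.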

Team Workspace

swanlab_workspace: my-research-team

Private Deployment

swanlab_web_host: https://swanlab.yourcompany.com
swanlab_api_host: https://api.swanlab.yourcompany.com

Authentication

Recommended: Environment Variable

export SWANLAB_API_KEY=your-api-key
export SWANLAB_LARK_WEBHOOK_URL=https://open.feishu.cn/...
export SWANLAB_LARK_SECRET=your-webhook-secret

Alternative: Config File (less secure)

swanlab_api_key: your-api-key
swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-webhook-secret
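
One way to picture the two options is a lookup that prefers the environment variable and falls back to the config file. The precedence shown here is an assumption made for illustration, not documented plugin behavior, and `resolve_setting` is a hypothetical helper.

```python
import os


def resolve_setting(env_name, config, config_key, environ=None):
    """Return (value, source), preferring the environment over the config.

    The env-first precedence is an assumption for this sketch.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_name)
    if value:
        return value, "env"
    value = config.get(config_key)
    if value:
        return value, "config"
    return None, "missing"


# A fake environment keeps the example independent of the real shell.
fake_env = {"SWANLAB_API_KEY": "key-from-env"}
config = {
    "swanlab_api_key": "key-from-config",
    "swanlab_lark_secret": "secret-from-config",
}

api_key = resolve_setting("SWANLAB_API_KEY", config, "swanlab_api_key", fake_env)
lark_secret = resolve_setting(
    "SWANLAB_LARK_SECRET", config, "swanlab_lark_secret", fake_env
)
print(api_key, lark_secret)
```

Keeping secrets in the environment avoids committing them to version control along with the YAML file, which is why the config-file route is marked as less secure.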

Common Use Cases

Use Case 1: Migrate from WandB to SwanLab

Start with lora-swanlab-profiling.yml, add your model/dataset config, disable WandB:

use_swanlab: true
use_wandb: false

Use Case 2: Analyze DPO Model Outputs

Use dpo-swanlab-completions.yml, adjust completion logging interval based on your training length:

# Set only one value; a YAML file should not repeat the same key.
swanlab_completion_log_interval: 50     # More frequent for short training
# swanlab_completion_log_interval: 200  # Less frequent for long training

Use Case 3: Optimize Training Performance

Use lora-swanlab-profiling.yml, run multiple experiments with different optimizations:

Baseline: flash_attention: false, gradient_checkpointing: false
Flash Attention: flash_attention: true
Gradient Checkpointing: gradient_checkpointing: true
Both: flash_attention: true, gradient_checkpointing: true

Compare profiling metrics in SwanLab dashboard.
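
The four runs above form a small grid over two flags, which can be generated rather than written by hand. The sketch below builds only the override dictionaries. The key names are copied from the list above and should be checked against your Axolotl version, and using `swanlab_experiment_name` to label each run is an illustrative choice.

```python
from itertools import product

# Flags taken from the list above; each one is toggled off and on.
flags = {
    "flash_attention": [False, True],
    "gradient_checkpointing": [False, True],
}

runs = []
for values in product(*flags.values()):
    overrides = dict(zip(flags.keys(), values))
    enabled = [name for name, on in overrides.items() if on]
    run_name = "+".join(enabled) if enabled else "baseline"
    # A distinct experiment name makes the runs easy to compare side by side.
    runs.append({"swanlab_experiment_name": run_name, **overrides})

for run in runs:
    print(run)
```

Each dictionary can then be merged into a copy of lora-swanlab-profiling.yml before launching the corresponding run.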

Use Case 4: Production RLHF with Team Collaboration

Use dpo-swanlab-full-featured.yml, set up team workspace and Lark notifications:

swanlab_workspace: ml-team
swanlab_lark_webhook_url: ...
swanlab_lark_secret: ...

Viewing Your Experiments

Cloud Mode

Visit https://swanlab.cn and navigate to your project.

Dashboard sections:

Metrics: Training loss, learning rate, profiling metrics
Tables: RLHF completions (for DPO/KTO/ORPO/GRPO)
Config: Hyperparameters and configuration
System: Resource usage (GPU, memory, CPU)
Files: Logged artifacts

Local Mode

swanlab watch ./swanlog
# Open browser to http://localhost:5092

Troubleshooting

SwanLab not initializing

# Check API key
echo $SWANLAB_API_KEY

# Verify SwanLab is installed
pip show swanlab

# Check config
grep -A 5 "use_swanlab" your-config.yml
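
The same three checks can be combined into one small preflight script. This is a sketch using only the standard library. `preflight` is a hypothetical helper, the config is scanned as plain text rather than parsed as YAML, and the API-key check is assumed to matter mainly for cloud mode.

```python
import importlib.util
import os
import re


def preflight(config_text, environ=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("SWANLAB_API_KEY"):
        problems.append("SWANLAB_API_KEY is not set")
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("swanlab") is None:
        problems.append("swanlab package is not installed")
    # Simple text match for a top-level 'use_swanlab: true' line.
    if not re.search(r"^use_swanlab:\s*true\b", config_text, flags=re.MULTILINE):
        problems.append("use_swanlab: true not found in config")
    return problems


example_config = "use_swanlab: true\nswanlab_project: my-project\n"
for problem in preflight(example_config, environ={"SWANLAB_API_KEY": "your-api-key"}):
    print("WARNING:", problem)
```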

Completions not appearing

Verify you're using an RLHF trainer (DPO/KTO/ORPO/GRPO)
Check swanlab_log_completions: true
Wait for swanlab_completion_log_interval steps
Look for "Registered SwanLab RLHF completion logging" in logs

Lark notifications not working

Test webhook manually: curl -X POST "$SWANLAB_LARK_WEBHOOK_URL" ...
Verify SWANLAB_LARK_SECRET is set correctly
Check that the bot has been added to the Lark group chat
Look for "Registered Lark notification callback" in logs

Profiling metrics not appearing

Verify use_swanlab: true
Check SwanLab is initialized (look for init log message)
Look for profiling metrics under the "profiling/" namespace
Note that profiling is enabled automatically whenever SwanLab is enabled

Performance Notes

Overhead Comparison

Feature	Overhead per Step	Memory Usage
Basic tracking	< 0.1%	~10 MB
Completion logging	< 0.5%	~64 KB (buffer=128)
Profiling	< 0.1%	~1 KB
Total	< 0.7%	~10 MB

Best Practices

Use ONE logging tool in production (disable WandB/MLflow when using SwanLab)
Adjust completion log interval based on training length (100-200 steps)
Keep completion buffer size reasonable (128-512)
Profile critical path methods first (training_step, compute_loss)
Use ProfilingConfig to throttle high-frequency operations

Contributing

Found an issue or have an improvement? Please submit a PR or open an issue:
