diff --git a/.nojekyll b/.nojekyll
index 6b0c7b8ae..9c1fda326 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-eac6727e
\ No newline at end of file
+8763ebce
\ No newline at end of file
diff --git a/docs/cli.html b/docs/cli.html
index 6e0983219..87f9fb3fa 100644
--- a/docs/cli.html
+++ b/docs/cli.html
@@ -944,13 +944,15 @@ the CLI commands, their usage, and common examples.
# Basic evaluation
axolotl lm-eval config.yml
Configuration options:
-# List of tasks to evaluate
-lm_eval_tasks:
- - arc_challenge
- - hellaswag
-lm_eval_batch_size: # Batch size for evaluation
-output_dir: # Directory to save evaluation results
See LM Eval Harness for more details.
+lm_eval_model: # model to evaluate (local or hf path)
+
+# List of tasks to evaluate
+lm_eval_tasks:
+ - arc_challenge
+ - hellaswag
+lm_eval_batch_size: # Batch size for evaluation
+output_dir: # Directory to save evaluation results
See LM Eval Harness integration docs for full configuration details.
Please see reference here
MoE (Mixture of Experts) kernels speed up training for MoE layers and reduce VRAM costs. In transformers v5, batched_mm and grouped_mm were integrated as built-in options via the experts_implementation config kwarg:
class ExpertsInterface(GeneralInterface):
+ _global_mapping = {
+ "batched_mm": batched_mm_experts_forward,
+ "grouped_mm": grouped_mm_experts_forward,
+    }
In our custom integration, we add support for ScatterMoE, which is even faster and more efficient than grouped_mm.
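For intuition, here is a minimal sketch of what a batched experts matmul computes versus a per-expert loop (toy shapes assumed for illustration; this is not the transformers implementation):
import torch

num_experts, tokens_per_expert, d_in, d_out = 4, 8, 16, 16
x = torch.randn(num_experts, tokens_per_expert, d_in)  # tokens grouped per expert
w = torch.randn(num_experts, d_in, d_out)              # one weight matrix per expert

# Looped reference: run each expert's matmul separately
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# batched_mm-style: a single batched matmul over the expert dimension
out_bmm = torch.bmm(x, w)

assert torch.allclose(out_loop, out_bmm, atol=1e-4)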
plugins:
- - "axolotl.integrations.kd.KDPlugin"
-
-kd_trainer: True
-kd_ce_alpha: 0.1
-kd_alpha: 0.9
-kd_temperature: 1.0
-
-torch_compile: True # torch>=2.6.0, recommended to reduce vram
-
-datasets:
- - path: ...
- type: "axolotl.integrations.kd.chat_template"
- field_messages: "messages_combined"
-  logprobs_field: "llm_text_generation_vllm_logprobs" # for kd only, field of logprobs
Add the following to your axolotl YAML config:
+plugins:
+ - axolotl.integrations.kernels.KernelsPlugin
+
+use_kernels: true
+use_scattermoe: true
Important: Setting experts_implementation is incompatible with use_scattermoe.
The KernelsPlugin runs before model loading and:
- Downloads the optimized kernels from the axolotl-ai-co/scattermoe Hub repo.
- Patches the SparseMoeBlock forward method with the optimized ScatterMoE implementation.
This works for any MoE model in transformers that uses a SparseMoeBlock class (Mixtral, Qwen2-MoE, OLMoE, etc.).
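As a rough illustration of the patching step (a hedged sketch, not axolotl's actual code), the effect is equivalent to swapping the block's forward method before the model is constructed; MixtralSparseMoeBlock is one concrete SparseMoeBlock class:
from transformers.models.mixtral import modeling_mixtral

def scattermoe_forward(self, hidden_states):
    # the optimized ScatterMoE kernel call would go here
    ...

# Patch before from_pretrained() builds the model, so every MoE block
# picks up the replacement forward
modeling_mixtral.MixtralSparseMoeBlock.forward = scattermoe_forward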
ScatterMoE uses softmax -> top-k routing, so results may differ from the baseline for some model architectures (GPT-OSS, GLM_MOE_DSA).
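A minimal sketch of why the routing order matters (illustrative logits, top-2 of four experts): softmax-then-top-k takes its weights from the full softmax, while top-k-then-softmax renormalizes over only the selected experts, so the mixing weights differ even when the selected experts are identical.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # router logits for 4 experts
k = 2

# softmax -> topk (ScatterMoE-style): weights come from the full softmax
w_full, idx = torch.topk(torch.softmax(logits, dim=-1), k)  # does not sum to 1

# topk -> softmax: softmax over only the top-k logits
top_logits, _ = torch.topk(logits, k)
w_topk = torch.softmax(top_logits, dim=-1)                  # sums to 1

print(w_full, w_topk)  # same experts, different mixing weights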
+We tested MegaBlocks but were unable to ensure numerical accuracy, so we did not integrate it. It was also incompatible with many newer model architectures in transformers.
+Please see reference here
+plugins:
+ - "axolotl.integrations.kd.KDPlugin"
+
+kd_trainer: True
+kd_ce_alpha: 0.1
+kd_alpha: 0.9
+kd_temperature: 1.0
+
+torch_compile: True # torch>=2.6.0, recommended to reduce vram
+
+datasets:
+ - path: ...
+ type: "axolotl.integrations.kd.chat_template"
+ field_messages: "messages_combined"
+  logprobs_field: "llm_text_generation_vllm_logprobs" # for kd only, field of logprobs
An example dataset can be found at axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample
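For intuition, a hedged sketch of how the knobs above typically combine (illustrative, not axolotl's exact implementation): the loss is a weighted sum of the usual cross-entropy and a temperature-scaled KL term against the teacher logprobs stored in logprobs_field.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logprobs, labels,
            kd_ce_alpha=0.1, kd_alpha=0.9, kd_temperature=1.0):
    # standard next-token cross-entropy on the labels
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1), ignore_index=-100)
    # KL(teacher || student) at temperature T, scaled by T^2 (Hinton et al.);
    # assumes the stored teacher logprobs were produced at the same T
    t = kd_temperature
    student_logprobs = F.log_softmax(student_logits / t, dim=-1)
    kl = F.kl_div(student_logprobs, teacher_logprobs,
                  log_target=True, reduction="batchmean") * (t * t)
    return kd_ce_alpha * ce + kd_alpha * kl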
Please see reference here
Axolotl with llmcompressor extras:
-pip install "axolotl[llmcompressor]"
+pip install "axolotl[llmcompressor]"
Requires llmcompressor >= 0.5.1
This will install all necessary dependencies to fine-tune sparsified models using the integration.
To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:
-plugins:
- - axolotl.integrations.llm_compressor.LLMCompressorPlugin
-
-llmcompressor:
- recipe:
- finetuning_stage:
- finetuning_modifiers:
- ConstantPruningModifier:
- targets: [
- 're:.*q_proj.weight',
- 're:.*k_proj.weight',
- 're:.*v_proj.weight',
- 're:.*o_proj.weight',
- 're:.*gate_proj.weight',
- 're:.*up_proj.weight',
- 're:.*down_proj.weight',
- ]
- start: 0
-  save_compressed: true
+plugins:
+ - axolotl.integrations.llm_compressor.LLMCompressorPlugin
+
+llmcompressor:
+ recipe:
+ finetuning_stage:
+ finetuning_modifiers:
+ ConstantPruningModifier:
+ targets: [
+ 're:.*q_proj.weight',
+ 're:.*k_proj.weight',
+ 're:.*v_proj.weight',
+ 're:.*o_proj.weight',
+ 're:.*gate_proj.weight',
+ 're:.*up_proj.weight',
+ 're:.*down_proj.weight',
+ ]
+ start: 0
+  save_compressed: true
This plugin does not apply pruning or sparsification itself; it is intended for fine-tuning models that have already been sparsified.
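Before fine-tuning, it can be worth sanity-checking that the checkpoint really is sparse. A hedged sketch ("path/to/sparse/model" is a placeholder):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/sparse/model")
for name, param in model.named_parameters():
    if name.endswith("q_proj.weight"):
        sparsity = (param == 0).float().mean().item()
        print(f"{name}: {sparsity:.1%} zeros")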
Pre-sparsified checkpoints can be:
- Generated using LLMCompressor
@@ -1287,22 +1335,22 @@ The quick brown fox jumps over the loud dog
After fine-tuning your sparse model, you can leverage vLLM for efficient inference. You can also use LLMCompressor to apply additional quantization to your fine-tuned sparse model before inference for even greater performance benefits:
-from vllm import LLM, SamplingParams
-
-prompts = [
- "Hello, my name is",
- "The president of the United States is",
- "The capital of France is",
- "The future of AI is",
-]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM("path/to/your/sparse/model")
-outputs = llm.generate(prompts, sampling_params)
-
-for output in outputs:
- prompt = output.prompt
- generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+from vllm import LLM, SamplingParams
+
+prompts = [
+ "Hello, my name is",
+ "The president of the United States is",
+ "The capital of France is",
+ "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM("path/to/your/sparse/model")
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+ prompt = output.prompt
+ generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
For more details on vLLM’s capabilities and advanced configuration options, see the official vLLM documentation.
Run evaluation on model using the popular lm-evaluation-harness library.
See https://github.com/EleutherAI/lm-evaluation-harness
-plugins:
- - axolotl.integrations.lm_eval.LMEvalPlugin
-
-lm_eval_tasks:
- - gsm8k
- - hellaswag
- - arc_easy
-
-lm_eval_batch_size: # Batch size for evaluation
-output_dir: # Directory to save evaluation results
There are two ways to use the LM Eval integration:
+When training with the plugin enabled, evaluation runs automatically after training completes:
+plugins:
+ - axolotl.integrations.lm_eval.LMEvalPlugin
+
+lm_eval_tasks:
+ - gsm8k
+ - hellaswag
+ - arc_easy
+
+lm_eval_batch_size: # Batch size for evaluation
+
+output_dir:
Run training as usual:
+axolotl train config.yml
Evaluate any model directly without training:
+lm_eval_model: meta-llama/Llama-2-7b-hf
+
+plugins:
+ - axolotl.integrations.lm_eval.LMEvalPlugin
+
+lm_eval_tasks:
+ - gsm8k
+ - hellaswag
+ - arc_easy
+
+lm_eval_batch_size: 8
+output_dir: ./outputs
Run evaluation:
+axolotl lm-eval config.yml
The model to evaluate is selected in the following priority order:
1. lm_eval_model - Explicit model path or HuggingFace repo (highest priority)
2. hub_model_id - Trained model pushed to HuggingFace Hub
3. output_dir - Local checkpoint directory containing trained model weights
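A small sketch of that fallback order (resolve_eval_model is a hypothetical helper, not the plugin's internals):
def resolve_eval_model(cfg: dict) -> str:
    # the first configured source wins, mirroring the priority list above
    for key in ("lm_eval_model", "hub_model_id", "output_dir"):
        if cfg.get(key):
            return cfg[key]
    raise ValueError("no model source configured for lm-eval")

# lm_eval_model wins even though output_dir is also set
print(resolve_eval_model({"output_dir": "./outputs",
                          "lm_eval_model": "meta-llama/Llama-2-7b-hf"}))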
-@misc{eval-harness,
-  author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
- title = {A framework for few-shot language model evaluation},
- month = 07,
- year = 2024,
- publisher = {Zenodo},
- version = {v0.4.3},
- doi = {10.5281/zenodo.12608602},
- url = {https://zenodo.org/records/12608602}
-}
+@misc{eval-harness,
+ author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
+ title = {A framework for few-shot language model evaluation},
+ month = 07,
+ year = 2024,
+ publisher = {Zenodo},
+ version = {v0.4.3},
+ doi = {10.5281/zenodo.12608602},
+ url = {https://zenodo.org/records/12608602}
+}
Please see reference here
See https://github.com/linkedin/Liger-Kernel
-plugins:
- - axolotl.integrations.liger.LigerPlugin
-liger_rope: true
-liger_rms_norm: true
-liger_glu_activation: true
-liger_layer_norm: true
-liger_fused_linear_cross_entropy: true
-
-liger_use_token_scaling: true
+plugins:
+ - axolotl.integrations.liger.LigerPlugin
+liger_rope: true
+liger_rms_norm: true
+liger_glu_activation: true
+liger_layer_norm: true
+liger_fused_linear_cross_entropy: true
+
+liger_use_token_scaling: true
-@article{hsu2024ligerkernelefficienttriton,
- title={Liger Kernel: Efficient Triton Kernels for LLM Training},
- author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
- year={2024},
- eprint={2410.10989},
- archivePrefix={arXiv},
- primaryClass={cs.LG},
- url={https://arxiv.org/abs/2410.10989},
- journal={arXiv preprint arXiv:2410.10989},
-}
+@article{hsu2024ligerkernelefficienttriton,
+ title={Liger Kernel: Efficient Triton Kernels for LLM Training},
+ author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
+ year={2024},
+ eprint={2410.10989},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2410.10989},
+ journal={arXiv preprint arXiv:2410.10989},
+}
Please see reference here
Spectrum is a tool for scanning and evaluating the Signal-to-Noise Ratio (SNR) of layers in large language models. By identifying the top n% of layers with the highest SNR, you can optimize training efficiency.
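The idea behind spectrum_top_fraction, as a hedged sketch (SNR values here are made up; Spectrum computes them from the actual weight matrices):
def select_trainable_layers(snr_by_layer: dict, top_fraction: float = 0.5):
    # rank layers by SNR and keep only the top fraction trainable
    ranked = sorted(snr_by_layer, key=snr_by_layer.get, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:keep])

snr = {"layers.0.mlp": 3.2, "layers.1.mlp": 0.4,
       "layers.2.mlp": 1.8, "layers.3.mlp": 0.9}
print(select_trainable_layers(snr, 0.5))  # {'layers.0.mlp', 'layers.2.mlp'}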
plugins:
- - axolotl.integrations.spectrum.SpectrumPlugin
-
-spectrum_top_fraction: 0.5
-spectrum_model_name: meta-llama/Meta-Llama-3.1-8B
+plugins:
+ - axolotl.integrations.spectrum.SpectrumPlugin
+
+spectrum_top_fraction: 0.5
+spectrum_model_name: meta-llama/Meta-Llama-3.1-8B
-@misc{hartford2024spectrumtargetedtrainingsignal,
- title={Spectrum: Targeted Training on Signal to Noise Ratio},
- author={Eric Hartford and Lucas Atkins and Fernando Fernandes Neto and David Golchinfar},
- year={2024},
- eprint={2406.06623},
- archivePrefix={arXiv},
- primaryClass={cs.LG},
- url={https://arxiv.org/abs/2406.06623},
-}
+@misc{hartford2024spectrumtargetedtrainingsignal,
+ title={Spectrum: Targeted Training on Signal to Noise Ratio},
+ author={Eric Hartford and Lucas Atkins and Fernando Fernandes Neto and David Golchinfar},
+ year={2024},
+ eprint={2406.06623},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2406.06623},
+}
Please see reference here
-pip install swanlab
+pip install swanlab
Add SwanLab configuration to your Axolotl YAML config:
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: my-llm-project
-swanlab_experiment_name: qwen-finetune-v1
-swanlab_mode: cloud # Options: cloud, local, offline, disabled
-swanlab_workspace: my-team # Optional: organization name
-swanlab_api_key: YOUR_API_KEY # Optional: can also use env var SWANLAB_API_KEY
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: my-llm-project
+swanlab_experiment_name: qwen-finetune-v1
+swanlab_mode: cloud # Options: cloud, local, offline, disabled
+swanlab_workspace: my-team # Optional: organization name
+swanlab_api_key: YOUR_API_KEY # Optional: can also use env var SWANLAB_API_KEY
-export SWANLAB_API_KEY=your-api-key-here
-
-swanlab login
-
-accelerate launch -m axolotl.cli.train your-config.yaml
+export SWANLAB_API_KEY=your-api-key-here
+
+swanlab login
+
+accelerate launch -m axolotl.cli.train your-config.yaml
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: llama-finetune
-swanlab_experiment_name: llama-3-8b-instruct-v1
-swanlab_mode: cloud
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: llama-finetune
+swanlab_experiment_name: llama-3-8b-instruct-v1
+swanlab_mode: cloud
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: local-experiments
-swanlab_experiment_name: test-run-1
-swanlab_mode: local # or 'offline'
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: local-experiments
+swanlab_experiment_name: test-run-1
+swanlab_mode: local # or 'offline'
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: research-project
-swanlab_experiment_name: experiment-42
-swanlab_workspace: my-research-team
-swanlab_mode: cloud
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: research-project
+swanlab_experiment_name: experiment-42
+swanlab_workspace: my-research-team
+swanlab_mode: cloud
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: internal-project
-swanlab_experiment_name: secure-training
-swanlab_mode: cloud
-swanlab_web_host: https://swanlab.yourcompany.com
-swanlab_api_host: https://api.swanlab.yourcompany.com
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: internal-project
+swanlab_experiment_name: secure-training
+swanlab_mode: cloud
+swanlab_web_host: https://swanlab.yourcompany.com
+swanlab_api_host: https://api.swanlab.yourcompany.com
Send training notifications to a Lark group chat:
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: production-training
-swanlab_experiment_name: llama-3-finetune-v2
-swanlab_mode: cloud
-
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: production-training
+swanlab_experiment_name: llama-3-finetune-v2
+swanlab_mode: cloud
+
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
Note: This configuration will work, but you’ll see a security warning recommending HMAC secret configuration.
For production use, enable HMAC signature verification:
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: production-training
-swanlab_experiment_name: llama-3-finetune-v2
-swanlab_mode: cloud
-
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
-swanlab_lark_secret: your-webhook-secret-key
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: production-training
+swanlab_experiment_name: llama-3-finetune-v2
+swanlab_mode: cloud
+
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
+swanlab_lark_secret: your-webhook-secret-key
Why HMAC secret matters:
- Prevents unauthorized parties from sending fake notifications to your Lark group
- Ensures notifications genuinely come from your training jobs
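For context, a hedged sketch of the signature scheme the secret enables (the standard Lark/Feishu custom-bot signing; illustrative, not the plugin's code): the sender HMAC-SHA256-signs "timestamp\nsecret" and ships the base64 digest with the message.
import base64
import hashlib
import hmac
import time

def lark_sign(secret: str, timestamp: int) -> str:
    # "timestamp\nsecret" is the HMAC key; the message is empty
    string_to_sign = f"{timestamp}\n{secret}"
    digest = hmac.new(string_to_sign.encode("utf-8"),
                      digestmod=hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")

ts = int(time.time())
payload = {
    "timestamp": str(ts),
    "sign": lark_sign("your-webhook-secret-key", ts),
    "msg_type": "text",
    "content": {"text": "training finished"},
}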
@@ -1716,17 +1799,17 @@ By identifying the top n% of layers with the highest SNR, you can optimize train
The plugin validates your Lark configuration at startup.
Best practices:
- Always use HMAC secret in production
- Store secrets in environment variables (even better), then reference them in config
- Rotate webhook secrets periodically: update your Lark bot’s secret every 90 days
- Use separate webhooks for dev/prod: don’t mix development and production notifications
Example 7: Team Workspace + Lark Notifications
Combine team workspace collaboration with Lark notifications:
plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: research-project
-swanlab_experiment_name: multimodal-experiment-42
-swanlab_workspace: ml-research-team
-swanlab_mode: cloud
-
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
-swanlab_lark_secret: your-webhook-secret-keyplugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: research-project
+swanlab_experiment_name: multimodal-experiment-42
+swanlab_workspace: ml-research-team
+swanlab_mode: cloud
+
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
+swanlab_lark_secret: your-webhook-secret-key
What Notifications Are Sent?
@@ -1746,34 +1829,34 @@ By identifying the top n% of layers with the highest SNR, you can optimize train
✅ Valid Configurations
-use_swanlab: true
-swanlab_project: my-project
-
-use_swanlab: true
-swanlab_project: my-project
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
-swanlab_lark_secret: your-secret
-
-use_swanlab: true
-swanlab_project: my-project
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
+use_swanlab: true
+swanlab_project: my-project
+
+use_swanlab: true
+swanlab_project: my-project
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
+swanlab_lark_secret: your-secret
+
+use_swanlab: true
+swanlab_project: my-project
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
Security Best Practices
@@ -1784,7 +1867,7 @@ By identifying the top n% of layers with the highest SNR, you can optimize train
- Only rank 0 sends notifications
- Other GPU ranks skip Lark registration
- Prevents duplicate messages in multi-GPU training
-swanlab_lark_webhook_url: https://open.feishu.cn/...
-swanlab_lark_secret: your-secret-key # ✅ Add this!
+swanlab_lark_webhook_url: https://open.feishu.cn/...
+swanlab_lark_secret: your-secret-key # ✅ Add this!
-# In your training script/environment
-export SWANLAB_LARK_WEBHOOK_URL="https://open.feishu.cn/..."
-export SWANLAB_LARK_SECRET="your-secret-key"
+# In your training script/environment
+export SWANLAB_LARK_WEBHOOK_URL="https://open.feishu.cn/..."
+export SWANLAB_LARK_SECRET="your-secret-key"
-# SwanLab plugin will auto-detect environment variables
-use_swanlab: true
-swanlab_project: my-project
-# Lark URL and secret read from env vars
+# SwanLab plugin will auto-detect environment variables
+use_swanlab: true
+swanlab_project: my-project
+# Lark URL and secret read from env vars
-torchrun --nproc_per_node=4 -m axolotl.cli.train config.yml
+torchrun --nproc_per_node=4 -m axolotl.cli.train config.yml
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: dpo-training
-swanlab_experiment_name: llama-3-dpo-v1
-swanlab_mode: cloud
-
-swanlab_log_completions: true
-swanlab_completion_log_interval: 100 # Log every 100 steps
-swanlab_completion_max_buffer: 128 # Keep last 128 completions
-
-rl: dpo
-datasets:
- - path: /path/to/preference_dataset
-  type: chatml.intel
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: dpo-training
+swanlab_experiment_name: llama-3-dpo-v1
+swanlab_mode: cloud
+
+swanlab_log_completions: true
+swanlab_completion_log_interval: 100 # Log every 100 steps
+swanlab_completion_max_buffer: 128 # Keep last 128 completions
+
+rl: dpo
+datasets:
+ - path: /path/to/preference_dataset
+  type: chatml.intel
If you’re doing a quick test run or don’t need completion tables:
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: dpo-training
-
-swanlab_log_completions: false
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: dpo-training
+
+swanlab_log_completions: false
For non-RLHF trainers (standard supervised fine-tuning), the completion callback is automatically skipped.
The completion buffer (swanlab_completion_max_buffer) is memory-bounded to prevent memory leaks:
-from collections import deque
-
-buffer = deque(maxlen=128) # Old completions automatically dropped
+from collections import deque
+
+buffer = deque(maxlen=128) # Old completions automatically dropped
Memory usage estimate:
- Average completion: ~500 characters (prompt + responses)
- Buffer size 128: ~64 KB (negligible)
@@ -1952,8 +2035,8 @@ By identifying the top n% of layers with the highest SNR, you can optimize train
Cause: High logging frequency with small buffer size.
Solution: Increase buffer size or logging interval:
-swanlab_completion_log_interval: 200 # Log less frequently
-swanlab_completion_max_buffer: 512 # Larger buffer
+swanlab_completion_log_interval: 200 # Log less frequently
+swanlab_completion_max_buffer: 512 # Larger buffer
Add profiling to any trainer method with the @swanlab_profile decorator:
from axolotl.integrations.swanlab.profiling import swanlab_profile
-
-class MyCustomTrainer(AxolotlTrainer):
- @swanlab_profile
- def training_step(self, model, inputs):
- # Your training step logic
- return super().training_step(model, inputs)
-
- @swanlab_profile
- def prediction_step(self, model, inputs, prediction_loss_only):
- # Your prediction logic
-        return super().prediction_step(model, inputs, prediction_loss_only)
+from axolotl.integrations.swanlab.profiling import swanlab_profile
+
+class MyCustomTrainer(AxolotlTrainer):
+ @swanlab_profile
+ def training_step(self, model, inputs):
+ # Your training step logic
+ return super().training_step(model, inputs)
+
+ @swanlab_profile
+ def prediction_step(self, model, inputs, prediction_loss_only):
+        return super().prediction_step(model, inputs, prediction_loss_only)
The decorator automatically:
+ return super().prediction_step(model, inputs, prediction_loss_only)The decorator automatically:
1. Measures execution time with high-precision timer
2. Logs to SwanLab as profiling/Time taken: ClassName.method_name
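For intuition, a hedged sketch of what such a decorator can look like (profile_sketch is hypothetical; the real swanlab_profile lives in axolotl.integrations.swanlab.profiling):
import time
from functools import wraps

def profile_sketch(fn):
    @wraps(fn)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(self, *args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            name = f"{type(self).__name__}.{fn.__name__}"
            # the real integration calls swanlab.log here
            print({f"profiling/Time taken: {name}": elapsed_ms})
    return wrapper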
For fine-grained profiling within a method, use the context manager shown in Advanced Usage below. Filter and throttle profiling logs with ProfilingConfig.
Profiling is automatically enabled when SwanLab is enabled; no additional config is needed.
Using multiple logging tools simultaneously (SwanLab + WandB + MLflow + Comet) can impact training performance. Impact:
- Performance overhead: ~1-2% per logger (cumulative)
- Increased memory usage
@@ -2311,14 +2394,14 @@ profiling/Time taken: MyTrainer.backward_pass
Why This Matters:
- With 3 loggers: ~4-5% overhead per step → significant slowdown over long training
- Example: 10,000 steps at 2s/step → ~400-500 seconds extra (6-8 minutes)
@@ -2329,17 +2412,17 @@ profiling/Time taken: MyTrainer.backward_pass
For convenience, SwanLab will auto-enable if you specify a project without setting use_swanlab (see Auto-Enable Logic below). In distributed training scenarios (multi-GPU), the plugin automatically detects ranks and initializes only on rank 0. Why only rank 0:
- Avoids duplicate experiment runs
- Reduces network/cloud API overhead on worker ranks
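A minimal sketch of the rank-0 gating idea (assumes the RANK environment variable that torchrun sets; not the plugin's actual code):
import os

def is_main_process() -> bool:
    # torchrun/accelerate export RANK; default to 0 for single-process runs
    return int(os.environ.get("RANK", "0")) == 0

if is_main_process():
    print("rank 0: initialize SwanLab and register the Lark callback")
else:
    print("worker rank: skip experiment tracking")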
@@ -2350,15 +2433,15 @@ profiling/Time taken: MyTrainer.backward_pass
SwanLab can work alongside other tracking tools (see Integration with Existing Tools below), but for production training, disable all but one logger. Exception: multiple loggers are acceptable for:
- Short comparison runs (< 100 steps)
- Migration testing between logging tools
@@ -2489,22 +2572,22 @@ Info: Other ranks will skip SwanLab to avoid conflicts
When it’s OK to skip the HMAC secret:
- Local development and testing
- Internal networks with restricted access
@@ -2536,11 +2619,11 @@ Info: Other ranks will skip SwanLab to avoid conflicts
Cause: Invalid webhook URL or network connectivity issues. Diagnostic steps are shown below. Solution:
1. Verify webhook URL is correct (copy from Lark bot settings)
2. Check network connectivity to Lark API
@@ -2557,11 +2640,11 @@ Info: Other ranks will skip SwanLab to avoid conflicts
INFO: Registered Lark notification callback with HMAC authentication
- Verify webhook in Lark: test the webhook manually (see above)
- Check distributed training: only rank 0 sends notifications
- Verify SwanLab is initialized: the Lark callback needs SwanLab to be running
- Check Lark bot permissions: ensure the bot is added to the target group chat
@@ -2008,45 +2091,45 @@ By identifying the top n% of layers with the highest SNR, you can optimize train
Advanced Usage: Context Manager
from axolotl.integrations.swanlab.profiling import swanlab_profiling_context
-
-class MyTrainer(AxolotlTrainer):
- def complex_training_step(self, model, inputs):
- # Profile just the forward pass
- with swanlab_profiling_context(self, "forward_pass"):
- outputs = model(**inputs)
-
- # Profile just the backward pass
- with swanlab_profiling_context(self, "backward_pass"):
- loss = outputs.loss
- loss.backward()
-
-        return outputs
+from axolotl.integrations.swanlab.profiling import swanlab_profiling_context
+
+class MyTrainer(AxolotlTrainer):
+ def complex_training_step(self, model, inputs):
+ # Profile just the forward pass
+ with swanlab_profiling_context(self, "forward_pass"):
+ outputs = model(**inputs)
+
+ # Profile just the backward pass
+ with swanlab_profiling_context(self, "backward_pass"):
+ loss = outputs.loss
+ loss.backward()
+
+        return outputs
Advanced Usage: ProfilingConfig
Filter and throttle profiling logs with ProfilingConfig:
-from axolotl.integrations.swanlab.profiling import (
- swanlab_profiling_context_advanced,
- ProfilingConfig,
-)
-
-profiling_config = ProfilingConfig(
- enabled=True,
- min_duration_ms=1.0, # Only log if duration > 1ms
- log_interval=10, # Log every 10th call
-)
-
-class MyTrainer(AxolotlTrainer):
- def frequently_called_method(self, data):
- with swanlab_profiling_context_advanced(
- self,
- "frequent_op",
- config=profiling_config
- ):
- # This only logs every 10th call, and only if it takes > 1ms
- result = expensive_computation(data)
-        return result
+from axolotl.integrations.swanlab.profiling import (
+ swanlab_profiling_context_advanced,
+ ProfilingConfig,
+)
+
+profiling_config = ProfilingConfig(
+ enabled=True,
+ min_duration_ms=1.0, # Only log if duration > 1ms
+ log_interval=10, # Log every 10th call
+)
+
+class MyTrainer(AxolotlTrainer):
+ def frequently_called_method(self, data):
+ with swanlab_profiling_context_advanced(
+ self,
+ "frequent_op",
+ config=profiling_config
+ ):
+ # This only logs every 10th call, and only if it takes > 1ms
+ result = expensive_computation(data)
+        return result
ProfilingConfig parameters:
- enabled: Enable/disable profiling globally (default: True)
- min_duration_ms: Minimum duration to log in milliseconds (default: 0.1)
@@ -2071,15 +2154,15 @@ profiling/Time taken: MyTrainer.backward_pass
Configuration in Axolotl Config
plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: my-project
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: my-project
To disable profiling while keeping SwanLab enabled:
-from axolotl.integrations.swanlab.profiling import DEFAULT_PROFILING_CONFIG
-
-DEFAULT_PROFILING_CONFIG.enabled = False
+from axolotl.integrations.swanlab.profiling import DEFAULT_PROFILING_CONFIG
+
+DEFAULT_PROFILING_CONFIG.enabled = False
Performance Impact
@@ -2103,41 +2186,41 @@ profiling/Time taken: MyTrainer.backward_pass
Example: Complete Profiling Setup
-from axolotl.integrations.swanlab.profiling import (
- swanlab_profile,
- swanlab_profiling_context,
- ProfilingConfig,
-)
-
-class OptimizedTrainer(AxolotlTrainer):
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
-
- # Custom profiling config for high-frequency operations
- self.fast_op_config = ProfilingConfig(
- enabled=True,
- min_duration_ms=0.5,
- log_interval=50,
- )
-
- @swanlab_profile
- def training_step(self, model, inputs):
- """Main training step - always profile."""
- return super().training_step(model, inputs)
-
- @swanlab_profile
- def compute_loss(self, model, inputs, return_outputs=False):
- """Loss computation - always profile."""
- return super().compute_loss(model, inputs, return_outputs)
-
- def _prepare_inputs(self, inputs):
- """High-frequency operation - throttled profiling."""
- with swanlab_profiling_context_advanced(
- self,
- "prepare_inputs",
- config=self.fast_op_config,
- ):
-        return super()._prepare_inputs(inputs)
+from axolotl.integrations.swanlab.profiling import (
+ swanlab_profile,
+ swanlab_profiling_context,
+ ProfilingConfig,
+)
+
+class OptimizedTrainer(AxolotlTrainer):
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ # Custom profiling config for high-frequency operations
+ self.fast_op_config = ProfilingConfig(
+ enabled=True,
+ min_duration_ms=0.5,
+ log_interval=50,
+ )
+
+ @swanlab_profile
+ def training_step(self, model, inputs):
+ """Main training step - always profile."""
+ return super().training_step(model, inputs)
+
+ @swanlab_profile
+ def compute_loss(self, model, inputs, return_outputs=False):
+ """Loss computation - always profile."""
+ return super().compute_loss(model, inputs, return_outputs)
+
+ def _prepare_inputs(self, inputs):
+ """High-frequency operation - throttled profiling."""
+ with swanlab_profiling_context_advanced(
+ self,
+ "prepare_inputs",
+ config=self.fast_op_config,
+ ):
+        return super()._prepare_inputs(inputs)
Troubleshooting
@@ -2145,8 +2228,8 @@ profiling/Time taken: MyTrainer.backward_pass
Profiling metrics not appearing in SwanLab
Cause: SwanLab is not enabled or not initialized. Solution: ensure SwanLab is enabled in your config:
-use_swanlab: true
-swanlab_project: my-project
+use_swanlab: true
+swanlab_project: my-project
Check logs for:
INFO: SwanLab initialized for project: my-project
Too many profiling metrics cluttering dashboard
Cause: Profiling every function call for high-frequency operations. Solution: Use ProfilingConfig with throttling:
-config = ProfilingConfig(
- min_duration_ms=1.0, # Skip fast ops
- log_interval=100, # Log every 100th call
-)
+config = ProfilingConfig(
+ min_duration_ms=1.0, # Skip fast ops
+ log_interval=100, # Log every 100th call
+)
Profiling overhead impacting training speed
@@ -2179,32 +2262,32 @@ profiling/Time taken: MyTrainer.backward_pass
Complete Config Example
-base_model: /path/to/your/model
-model_type: Qwen2ForCausalLM
-
-plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
- - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
-
-use_swanlab: true
-swanlab_project: RVQ-Alpha-Training
-swanlab_experiment_name: Qwen2.5-7B-MetaQA-Perturb-P020
-swanlab_description: "Training on MetaQA and Perturbation datasets with NEW-RVQ encoding"
-swanlab_mode: cloud
-swanlab_workspace: single-cell-genomics
-
-sequence_len: 32768
-micro_batch_size: 1
-gradient_accumulation_steps: 1
-num_epochs: 2
-learning_rate: 2e-5
-optimizer: adamw_torch_fused
-
-datasets:
- - path: /path/to/dataset
- type: chat_template
-
-output_dir: ./outputs
+base_model: /path/to/your/model
+model_type: Qwen2ForCausalLM
+
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+ - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
+
+use_swanlab: true
+swanlab_project: RVQ-Alpha-Training
+swanlab_experiment_name: Qwen2.5-7B-MetaQA-Perturb-P020
+swanlab_description: "Training on MetaQA and Perturbation datasets with NEW-RVQ encoding"
+swanlab_mode: cloud
+swanlab_workspace: single-cell-genomics
+
+sequence_len: 32768
+micro_batch_size: 1
+gradient_accumulation_steps: 1
+num_epochs: 2
+learning_rate: 2e-5
+optimizer: adamw_torch_fused
+
+datasets:
+ - path: /path/to/dataset
+ type: chat_template
+
+output_dir: ./outputs
Modes Explained
@@ -2250,36 +2333,36 @@ profiling/Time taken: MyTrainer.backward_pass
Missing Project Name
-use_swanlab: true
+use_swanlab: true
-use_swanlab: true
-swanlab_project: my-project
+use_swanlab: true
+swanlab_project: my-project
Invalid Mode
-use_swanlab: true
-swanlab_project: my-project
-swanlab_mode: invalid-mode
+use_swanlab: true
+swanlab_project: my-project
+swanlab_mode: invalid-mode
-use_swanlab: true
-swanlab_project: my-project
-swanlab_mode: cloud # or: local, offline, disabled
+use_swanlab: true
+swanlab_project: my-project
+swanlab_mode: cloud # or: local, offline, disabled
Empty Project Name
-use_swanlab: true
-swanlab_project: ""
+use_swanlab: true
+swanlab_project: ""
-use_swanlab: true
-swanlab_project: my-project
+use_swanlab: true
+swanlab_project: my-project
Cloud Mode API Key Warning
When using cloud mode without an API key, you’ll receive a warning with multiple solutions:
-use_swanlab: true
-swanlab_project: my-project
-swanlab_mode: cloud
+use_swanlab: true
+swanlab_project: my-project
+swanlab_mode: cloud
Solutions:
1. Set environment variable: export SWANLAB_API_KEY=your-api-key
2. Add to config (less secure): swanlab_api_key: your-api-key
@@ -2291,11 +2374,11 @@ profiling/Time taken: MyTrainer.backward_pass
Two Loggers - Warning
-use_swanlab: true
-swanlab_project: my-project
-
-use_wandb: true
-wandb_project: my-project
+use_swanlab: true
+swanlab_project: my-project
+
+use_wandb: true
+wandb_project: my-project
Three+ Loggers - Error-Level Warning
-use_swanlab: true
-swanlab_project: my-project
-
-use_wandb: true
-wandb_project: my-project
-
-use_mlflow: true
-mlflow_tracking_uri: http://localhost:5000
+use_swanlab: true
+swanlab_project: my-project
+
+use_wandb: true
+wandb_project: my-project
+
+use_mlflow: true
+mlflow_tracking_uri: http://localhost:5000
Auto-Enable Logic
SwanLab auto-enables when you specify a project without setting use_swanlab:
-swanlab_project: my-project
-
-use_swanlab: true
-swanlab_project: my-project
+swanlab_project: my-project
+
+use_swanlab: true
+swanlab_project: my-project
Distributed Training Detection
-use_swanlab: true
-swanlab_project: my-project
-swanlab_mode: cloud
+use_swanlab: true
+swanlab_project: my-project
+swanlab_mode: cloud
Method 1: Environment Variable (Recommended)
-export SWANLAB_API_KEY=your-api-key-here
+export SWANLAB_API_KEY=your-api-key-here
Method 2: Login Command
-swanlab login
+swanlab login
Method 3: Config File
-swanlab_api_key: your-api-key-here
+swanlab_api_key: your-api-key-here
What Gets Logged?
@@ -2397,19 +2480,19 @@ profiling/Time taken: MyTrainer.backward_pass
Local Mode
-swanlab watch ./swanlog
+swanlab watch ./swanlog
Integration with Existing Tools
plugins:
- - axolotl.integrations.swanlab.SwanLabPlugin
-
-use_swanlab: true
-swanlab_project: my-project
-
-use_wandb: true
-wandb_project: my-project
+plugins:
+ - axolotl.integrations.swanlab.SwanLabPlugin
+
+use_swanlab: true
+swanlab_project: my-project
+
+use_wandb: true
+wandb_project: my-project
Troubleshooting
@@ -2420,20 +2503,20 @@ profiling/Time taken: MyTrainer.backward_pass
Error: “SwanLab enabled but ‘swanlab_project’ is not set”
Cause: You enabled SwanLab (use_swanlab: true) but forgot to specify a project name. Solution:
-use_swanlab: true
-swanlab_project: my-project # Add this line
+use_swanlab: true
+swanlab_project: my-project # Add this line
Error: “Invalid swanlab_mode: ‘xxx’”
Cause: You provided an invalid mode value. Solution: Use one of the valid modes:
-swanlab_mode: cloud # or: local, offline, disabled
+swanlab_mode: cloud # or: local, offline, disabled
Error: “swanlab_project cannot be an empty string”
Cause: You set swanlab_project to an empty string. Solution: Either provide a valid name or remove the field:
-swanlab_project: my-project
+swanlab_project: my-project
Error: “SwanLab is not installed”
Cause: SwanLab package is not installed in your environment. Solution:
-pip install swanlab
-pip install swanlab>=0.3.0
+pip install swanlab
+pip install swanlab>=0.3.0
Cause: You have multiple experiment tracking tools enabled (e.g., SwanLab + WandB + MLflow). Impact: ~1-2% performance overhead per logger, cumulative. Solution: For production training, disable all but one logger:
-use_swanlab: true
-swanlab_project: my-project
-use_wandb: false # Disable others
-use_mlflow: false
-
-use_swanlab: false
-use_wandb: true
-wandb_project: my-project
+use_swanlab: true
+swanlab_project: my-project
+use_wandb: false # Disable others
+use_mlflow: false
+
+use_swanlab: false
+use_wandb: true
+wandb_project: my-project
API Key errors
Solution:
-echo $SWANLAB_API_KEY
-
-swanlab login
+echo $SWANLAB_API_KEY
+
+swanlab login
Cloud sync issues
Solution: Use offline mode and sync later:
-swanlab_mode: offline
+swanlab_mode: offline
Then sync when ready:
-swanlab sync ./swanlog
+swanlab sync ./swanlog
Plugin not loaded
Solution: Verify plugin path in config:
-plugins:
-  - axolotl.integrations.swanlab.SwanLabPlugin # Correct path
+plugins:
+  - axolotl.integrations.swanlab.SwanLabPlugin # Correct path
Lark Notification Issues
@@ -2512,17 +2595,17 @@ Info: Other ranks will skip SwanLab to avoid conflicts
Error: “Failed to import SwanLab Lark plugin”
Cause: Your SwanLab version doesn’t include the Lark plugin (requires SwanLab >= 0.3.0). Solution:
-pip install --upgrade swanlab
-
-pip install 'swanlab>=0.3.0'
+pip install --upgrade swanlab
+
+pip install 'swanlab>=0.3.0'
Warning: “Lark webhook has no secret configured”
Cause: You provided swanlab_lark_webhook_url but no swanlab_lark_secret. Impact: Lark notifications will work, but without HMAC authentication (security risk). Solution: Add HMAC secret for production use:
-swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
-swanlab_lark_secret: your-webhook-secret # Add this line
+swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
+swanlab_lark_secret: your-webhook-secret # Add this line
Error: “Failed to register Lark callback”
curl -X POST "YOUR_WEBHOOK_URL" \
- -H 'Content-Type: application/json' \
- -d '{"msg_type":"text","content":{"text":"Test from Axolotl"}}'
-
-pip show swanlab
+curl -X POST "YOUR_WEBHOOK_URL" \
+ -H 'Content-Type: application/json' \
+ -d '{"msg_type":"text","content":{"text":"Test from Axolotl"}}'
+
+pip show swanlab
-# If running multi-GPU, check rank 0 logs specifically
-grep "Registered Lark" logs/rank_0.log
+# If running multi-GPU, check rank 0 logs specifically
+grep "Registered Lark" logs/rank_0.log
-use_swanlab: true # Must be enabled
-swanlab_project: my-project # Must be set
+use_swanlab: true # Must be enabled
+swanlab_project: my-project # Must be set
+swanlab_project: my-project # Must be set
You can add custom metrics in your callbacks:
-import swanlab
-
-swanlab.log({
- "custom_metric": value,
- "epoch": epoch_num
-})
+import swanlab
+
+swanlab.log({
+ "custom_metric": value,
+ "epoch": epoch_num
+})
-swanlab compare run1 run2 run3
+swanlab compare run1 run2 run3
If you could not load your integration, please ensure you are pip installing in editable mode.
-pip install -e .
+pip install -e .
and correctly spelled the integration name in the config file.
-plugins:
-  - axolotl.integrations.your_integration_name.YourIntegrationPlugin
+plugins:
+ - axolotl.integrations.your_integration_name.YourIntegrationPlugin