cli.merge_lora.do_merge_lora(cfg)
Calls transformers’ merge_and_unload on the model given in the axolotl config
along with the LoRA adapters to combine them into a single base model.
Merges LoRA adapters with the base model using either a memory-efficient or legacy approach.
| Path | Speed | Multi-turn | Architecture |
|---|---|---|---|
| Async GRPO + Data Producer | Fastest (3x) | Yes | NemoGymDataProducer replaces vLLM generation |
| Standard GRPO + Data Producer | Baseline | Yes | Same producer, no async prefetch |
| Standard GRPO + /verify | Simplest | No | Reward function calls /verify directly |
| FSDP2 + /verify (2 GPU) | Distributed | No | fsdp_version: 2 |
Multi-turn uses nemo_gym_multi_turn: true, which auto-enables the async trainer’s
data producer protocol. The plugin’s NemoGymDataProducer calls NeMo Gym agent /run
endpoints and returns a RolloutDataset with proper IS correction, env_mask, and rewards.
All paths are tested end-to-end with Qwen3-0.6B + LoRA, logged to the wandb project nemo-gym-rl.
git clone https://github.com/NVIDIA-NeMo/Gym.git ~/Gym
cd ~/Gym
uv venv --python 3.12 && source .venv/bin/activate && uv sync

CFLAGS="" uv pip install pycosat --python .venv/bin/python --no-build-isolation

for dir in resources_servers/reasoning_gym resources_servers/example_single_tool_call responses_api_models/vllm_model responses_api_agents/simple_agent; do
  uv venv --seed --allow-existing --python 3.12 $dir/.venv
  CFLAGS="" uv pip install --python $dir/.venv/bin/python pycosat --no-build-isolation 2>/dev/null
  uv pip install --python $dir/.venv/bin/python -e . "ray[default]==2.52.1"
done

uv pip install --python resources_servers/reasoning_gym/.venv/bin/python \
  reasoning-gym matplotlib pillow cycler contourpy kiwisolver

This is the fully validated, highest-performance path. NeMo Gym’s agent server handles
multi-turn tool execution while axolotl’s async GRPO prefetches data in background threads.
Step 1: Create the NeMo Gym agent config
Create ~/Gym/configs/axolotl_tool_calling.yaml:
example_single_tool_call:
  resources_servers:
    example_single_tool_call:
      entrypoint: app.py
      domain: agent
      verified: false

policy_model:
  responses_api_models:
    vllm_model:
      entrypoint: app.py
      base_url: http://localhost:8000/v1
      api_key: dummy_key
      model: Qwen/Qwen3-0.6B  # Must match your training model
      return_token_id_information: true
      uses_reasoning_parser: false

example_single_tool_call_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: example_single_tool_call
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: weather
          type: example
          jsonl_fpath: resources_servers/example_single_tool_call/data/weather_tool_calling.jsonl

Step 2: Start three services
# Terminal 1: vLLM OpenAI server (GPU 0)
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B --max-model-len 2048 --gpu-memory-utilization 0.85

# Terminal 2: NeMo Gym servers
cd ~/Gym && .venv/bin/ng_run \
  "+config_paths=[configs/axolotl_tool_calling.yaml]" "+skip_venv_if_present=true"

# Terminal 3: Training (GPU 1)
cd experiments && CUDA_VISIBLE_DEVICES=1 CUDA_HOME=$HOME/env-claude-cu130/cuda_shim \
  axolotl train nemo_gym_async_agent.yaml

Step 3: Training config (nemo_gym_async_agent.yaml):
base_model: Qwen/Qwen3-0.6B
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 2048

rl: grpo
chat_template: tokenizer_default

trl:
  use_vllm: true
  vllm_mode: server
  vllm_server_host: localhost
  vllm_server_port: 8000
  vllm_lora_sync: true
  vllm_sync_interval: 5
  # Async GRPO: 3x faster than standard
  use_data_producer: true
  async_prefetch: true
  num_generations: 4
  max_completion_length: 512
  temperature: 0.8
  reward_funcs:
    - axolotl.integrations.nemo_gym.rewards.reward_env

plugins:
  - axolotl.integrations.nemo_gym.NemoGymPlugin

nemo_gym_enabled: true
nemo_gym_auto_start: false
nemo_gym_head_port: 11000
nemo_gym_multi_turn: true
nemo_gym_verify_timeout: 120
nemo_gym_datasets:
  - path: ~/Gym/resources_servers/example_single_tool_call/data/weather_tool_calling.jsonl
    server_name: example_single_tool_call

datasets:
  - path: ~/Gym/resources_servers/example_single_tool_call/data/weather_tool_calling.jsonl
    type: chat_template
    field_messages: responses_create_params.input
    message_field_content: content
    message_field_role: role

vllm:
  gpu_memory_utilization: 0.85
  max_model_len: 2048
  tensor_parallel_size: 1

learning_rate: 5e-6
micro_batch_size: 1
gradient_accumulation_steps: 4
max_steps: 30
gradient_checkpointing: true
bf16: true
output_dir: ./outputs/nemo_gym_async

use_wandb: true
wandb_project: nemo-gym-rl

For environments that only need single-turn verify (math, coding challenges), you don’t need
an agent server. The plugin’s reward function calls /verify directly.
base_model: Qwen/Qwen2.5-0.5B-Instruct
rl: grpo
chat_template: tokenizer_default

trl:
  use_vllm: true
  vllm_mode: colocate
  vllm_enable_sleep_mode: false
  num_generations: 8
  max_completion_length: 128
  temperature: 0.9
  reward_funcs:
    - axolotl.integrations.nemo_gym.rewards.reward_nemo_gym_verify

plugins:
  - axolotl.integrations.nemo_gym.NemoGymPlugin

nemo_gym_enabled: true
nemo_gym_auto_start: false
nemo_gym_head_port: 11000
nemo_gym_datasets:
  - path: ~/Gym/resources_servers/reasoning_gym/data/train_basic_arithmetic.jsonl
    server_name: reasoning_gym

datasets:
  - path: ~/Gym/resources_servers/reasoning_gym/data/train_basic_arithmetic.jsonl
    type: chat_template
    field_messages: responses_create_params.input
    message_field_content: content
    message_field_role: role

vllm:
  gpu_memory_utilization: 0.3
  max_model_len: 512
  tensor_parallel_size: 1

learning_rate: 1e-5
micro_batch_size: 4
gradient_accumulation_steps: 2
max_steps: 50
output_dir: ./outputs/nemo_gym_arithmetic

This path only needs ng_run with resource servers (no agent config):
cd ~/Gym && ng_run "+config_paths=[resources_servers/reasoning_gym/configs/resources_only.yaml]" "+skip_venv_if_present=true"

axolotl train → GRPO Trainer generates completions
  → NeMo Gym plugin reward_fn calls POST /verify on resource server
  → reward flows back to GRPO for advantage computation
┌─────────────┐      ┌──────────────┐      ┌──────────────────┐
│  axolotl    │      │  NeMo Gym    │────▶│  vLLM OpenAI     │
│  train      │────▶│  Agent /run  │◀────│  Server (GPU 0)  │
│  (GPU 1)    │      │              │      │  /v1/completions │
└─────────────┘      └──────┬───────┘      └──────────────────┘
                            │
                            ▼
                     ┌──────────────┐
                     │  Resource    │
                     │  Server      │
                     │  (tools +    │
                     │   verify)    │
                     └──────────────┘
The agent server orchestrates the entire multi-turn loop (a sketch follows this list):
1. Calls our vLLM server for model generation
2. Parses tool calls from model output
3. Executes tools against resource servers
4. Feeds tool results back to the model
5. Repeats until done, then calls /verify for reward
6. Returns token IDs + logprobs + reward to our rollout_func
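As a mental model only, here is a hypothetical sketch of such a loop. The endpoint paths and the execute_tool helper are invented for illustration; this is not the agent server’s actual code:

import json
import requests

def execute_tool(resource_url: str, call: dict) -> str:
    # Invented helper: forward the tool call to a resource-server endpoint.
    args = json.loads(call["function"].get("arguments") or "{}")
    r = requests.post(f"{resource_url}/tools/{call['function']['name']}", json=args, timeout=30)
    return r.text

def run_episode(messages, tools, vllm_url, resource_url, max_turns=8):
    for _ in range(max_turns):
        out = requests.post(
            f"{vllm_url}/v1/chat/completions",
            json={"model": "Qwen/Qwen3-0.6B", "messages": messages, "tools": tools},
            timeout=60,
        ).json()
        msg = out["choices"][0]["message"]
        messages.append(msg)
        if not msg.get("tool_calls"):
            break  # no more tool calls: the episode is done
        for call in msg["tool_calls"]:
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": execute_tool(resource_url, call),
            })
    # The resource server scores the finished trajectory.
    return requests.post(f"{resource_url}/verify", json={"messages": messages}, timeout=30).json()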
When nemo_gym_multi_turn: true, the plugin automatically forces use_data_producer: true,
which selects the AxolotlAsyncGRPOTrainer. The plugin then swaps the trainer’s data
producer with NemoGymDataProducer, which:
- Issues num_generations agent calls per prompt (one agent call per rollout)
- Runs the calls concurrently (aiohttp.gather)
- Packages the results into a RolloutDataset
- Sets _pending_policy_logps=True for deferred scoring

The main thread then runs _compute_deferred_scores(), which:
- Computes policy logprobs on the training model (GPU forward pass)
- Computes the IS correction using the agent’s sampling logprobs vs the training model’s logprobs (see the sketch below)
- Computes advantages with group-level normalization
- Keeps all downstream features working: replay buffer, re-roll, streaming, zero-adv skip
With async_prefetch: true, the data producer runs in a background thread, giving ~3x
speedup as generation and training overlap. With async_prefetch: false, it runs
synchronously on the main thread (still using the data producer protocol).
With vllm_lora_sync: true, the plugin (or async trainer) replaces NCCL-based weight
sync with a filesystem + HTTP handshake:
1. accelerator.get_state_dict() gathers LoRA weights from all ranks
2. The adapter is saved to /tmp/lora_sync_*/vN/
3. The plugin calls /set_lora_adapter/ on the vLLM server

Datasets support per-row environment routing via agent_ref:
{"agent_ref": {"name": "reasoning_gym"}, "responses_create_params": {...}}
+{"agent_ref": {"name": "instruction_following"}, "responses_create_params": {...}}Or use the simpler per-dataset routing:
nemo_gym_datasets:
  - path: reasoning_data.jsonl
    server_name: reasoning_gym
  - path: tool_data.jsonl
    server_name: example_single_tool_call

| Parameter | Type | Default | Description |
|---|---|---|---|
| nemo_gym_enabled | bool | null | Enable the NeMo Gym integration |
| nemo_gym_dir | str | ~/Gym | Path to NeMo Gym repo |
| nemo_gym_auto_clone | bool | true | Auto-clone NeMo Gym repo if missing |
| nemo_gym_auto_start | bool | true | Auto-start resource servers |
| nemo_gym_config_paths | list[str] | — | Server config YAMLs (relative to gym_dir) |
| nemo_gym_datasets | list[dict] | required | Dataset configs with path and optional server_name |
| nemo_gym_head_port | int | 11000 | Head server port |
| nemo_gym_server_timeout | int | 360 | Server startup timeout (seconds) |
| nemo_gym_verify_timeout | int | 30 | Per-request timeout (seconds) |
| nemo_gym_multi_turn | bool | false | Enable multi-turn via agent /run |
Each line must have responses_create_params with input messages:
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What's the weather in SF?"}],
    "tools": [{"name": "get_weather", "type": "function", "strict": true, "parameters": {...}}]
  }
}

For multi-turn agent routing, include agent_ref:

{"agent_ref": {"name": "my_agent"}, "responses_create_params": {...}}

Note: Tool definitions MUST include "strict": true and "additionalProperties": false for NeMo Gym agent compatibility.
The plugin provides two built-in reward functions; no user code is needed:

trl:
  reward_funcs:
    # Multi-turn (nemo_gym_multi_turn: true):
    # Passthrough: agent /run already computed the reward
    - axolotl.integrations.nemo_gym.rewards.reward_env

    # Single-turn (nemo_gym_multi_turn: false):
    # Calls /verify endpoints on NeMo Gym resource servers
    - axolotl.integrations.nemo_gym.rewards.reward_nemo_gym_verify

Both are also importable from Python:
from axolotl.integrations.nemo_gym import reward_env, reward_nemo_gym_verify

Common setup gotchas:
- pycosat build failure (GCC 13+): CFLAGS="" uv pip install pycosat --no-build-isolation
- Pin ray[default]==2.52.1 in all server venvs
- ng_run creates per-server venvs via Ray. Pre-build them and use +skip_venv_if_present=true
- strict field required: the agent server validates that tool definitions include strict: true

Start vLLM with LoRA + tool calling + runtime loading:
VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 \
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.7 \
  --enable-lora --max-lora-rank 64 \
  --enable-auto-tool-choice --tool-call-parser hermes
+ --enable-auto-tool-choice --tool-call-parser hermesVLLM_ALLOW_RUNTIME_LORA_UPDATING=1: Required for vllm_lora_sync: true. Without it, vLLM won’t expose the /v1/load_lora_adapter endpoint and weight sync will fail silently. The plugin warns if this endpoint is missing.
--enable-lora: Enables LoRA adapter support in vLLM
--enable-auto-tool-choice --tool-call-parser hermes: Required for Qwen3 tool calling
max_model_len must be > max_completion_length: Leave room for prompt tokens (~200). If equal, the NeMo Gym model proxy gets a 400 error and returns empty completions.
CUDA_HOME required: DeepSpeed import needs it for the nvcc shim
NCCL weight sync broken with vLLM 0.17: Use vllm_lora_sync: true (filesystem + HTTP via /v1/load_lora_adapter)
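As a quick smoke test (a sketch, assuming a server on localhost:8000 and an adapter directory you have saved), you can hit the runtime LoRA endpoint directly:

import requests

# POST a saved adapter directory to vLLM's runtime LoRA endpoint.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={"lora_name": "sync_v1", "lora_path": "/tmp/lora_sync_demo/v1"},
    timeout=30,
)
# A 404 here usually means VLLM_ALLOW_RUNTIME_LORA_UPDATING was not set.
print(resp.status_code, resp.text)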
Multi-turn requirements:
- An agent /run endpoint. Without an agent, the plugin falls back to single-turn /verify
- A responses_api_models server that proxies to your vLLM. See the agent config example above

| Feature | Axolotl + NeMo Gym | Unsloth + NeMo Gym | NeMo RL (native) |
|---|---|---|---|
| Server management | Automatic | Manual (notebook) | Built-in |
| Multi-environment | Per-row routing | Manual code | YAML config |
| Multi-turn / tool use | Agent /run delegation | No | Agent /run (Ray) |
| Async GRPO (3x speedup) | Yes | No | Yes |
| LoRA sync | Filesystem + HTTP | N/A | NCCL |
| Multi-GPU (FSDP2) | Yes | No | Yes (Ray) |
| Config-driven | Yes | No (code) | Yes |
Please see reference here
by Eric Hartford, Lucas Atkins, Fernando Fernandes, David Golchinfar
plugins:
  - axolotl.integrations.spectrum.SpectrumPlugin

spectrum_top_fraction: 0.5
spectrum_model_name: meta-llama/Meta-Llama-3.1-8B
@misc{hartford2024spectrumtargetedtrainingsignal,
      title={Spectrum: Targeted Training on Signal to Noise Ratio},
      author={Eric Hartford and Lucas Atkins and Fernando Fernandes Neto and David Golchinfar},
      year={2024},
      eprint={2406.06623},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.06623},
}

Please see reference here
pip install swanlab

Add SwanLab configuration to your Axolotl YAML config:
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-llm-project
swanlab_experiment_name: qwen-finetune-v1
swanlab_mode: cloud  # Options: cloud, local, offline, disabled
swanlab_workspace: my-team  # Optional: organization name
swanlab_api_key: YOUR_API_KEY  # Optional: can also use env var SWANLAB_API_KEY

export SWANLAB_API_KEY=your-api-key-here

swanlab login

accelerate launch -m axolotl.cli.train your-config.yaml

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: llama-finetune
swanlab_experiment_name: llama-3-8b-instruct-v1
swanlab_mode: cloud

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: local-experiments
swanlab_experiment_name: test-run-1
swanlab_mode: local  # or 'offline'

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: research-project
swanlab_experiment_name: experiment-42
swanlab_workspace: my-research-team
swanlab_mode: cloud

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: internal-project
swanlab_experiment_name: secure-training
swanlab_mode: cloud
swanlab_web_host: https://swanlab.yourcompany.com
swanlab_api_host: https://api.swanlab.yourcompany.com

Send training notifications to a Lark group chat:
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: production-training
swanlab_experiment_name: llama-3-finetune-v2
swanlab_mode: cloud

swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx

Note: This configuration will work, but you’ll see a security warning recommending HMAC secret configuration.
For production use, enable HMAC signature verification:
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: production-training
swanlab_experiment_name: llama-3-finetune-v2
swanlab_mode: cloud

swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
swanlab_lark_secret: your-webhook-secret-key

Why HMAC secret matters (a signing sketch follows this list):
- Prevents unauthorized parties from sending fake notifications to your Lark group
- Ensures notifications genuinely come from your training jobs
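For reference, Lark’s custom-bot signature is an HMAC-SHA256 keyed on "{timestamp}\n{secret}" with an empty message body, base64-encoded. A minimal sketch (my own illustration of the documented Lark scheme, not the plugin’s code):

import base64
import hashlib
import hmac
import time

def lark_sign(secret: str, timestamp: int) -> str:
    # Lark verifies HMAC-SHA256 keyed on "{timestamp}\n{secret}" over an
    # empty message, then base64-encodes the digest.
    key = f"{timestamp}\n{secret}".encode("utf-8")
    digest = hmac.new(key, digestmod=hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")

print(lark_sign("your-webhook-secret-key", int(time.time())))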
Example 7: Team Workspace + Lark Notifications
Combine team workspace collaboration with Lark notifications:
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: research-project
swanlab_experiment_name: multimodal-experiment-42
swanlab_workspace: ml-research-team
swanlab_mode: cloud

swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxx
swanlab_lark_secret: your-webhook-secret-key

What Notifications Are Sent?
The plugin validates your Lark configuration at startup:

✅ Valid Configurations
use_swanlab: true
swanlab_project: my-project

use_swanlab: true
swanlab_project: my-project
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
swanlab_lark_secret: your-secret

use_swanlab: true
swanlab_project: my-project
swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx

Security Best Practices

Always use HMAC secret in production:
swanlab_lark_webhook_url: https://open.feishu.cn/...
swanlab_lark_secret: your-secret-key  # ✅ Add this!
Store secrets in environment variables (even better):

# In your training script/environment
export SWANLAB_LARK_WEBHOOK_URL="https://open.feishu.cn/..."
export SWANLAB_LARK_SECRET="your-secret-key"

Then in config:
-# SwanLab plugin will auto-detect environment variables
-use_swanlab: true
-swanlab_project: my-project
-# Lark URL and secret read from env vars# SwanLab plugin will auto-detect environment variables
+use_swanlab: true
+swanlab_project: my-project
+# Lark URL and secret read from env varsRotate webhook secrets periodically: Update your Lark bot’s secret every 90 days
Use separate webhooks for dev/prod: Don’t mix development and production notifications
torchrun --nproc_per_node=4 -m axolotl.cli.train config.yml

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: dpo-training
swanlab_experiment_name: llama-3-dpo-v1
swanlab_mode: cloud

swanlab_log_completions: true
swanlab_completion_log_interval: 100  # Log every 100 steps
swanlab_completion_max_buffer: 128  # Keep last 128 completions

rl: dpo
datasets:
  - path: /path/to/preference_dataset
    type: chatml.intel

If you’re doing a quick test run or don’t need completion tables:
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: dpo-training

swanlab_log_completions: false

For non-RLHF trainers (standard supervised fine-tuning), the completion callback is automatically skipped.
The completion buffer (controlled by swanlab_completion_max_buffer) is memory-bounded to prevent memory leaks:
from collections import deque

buffer = deque(maxlen=128)  # Old completions automatically dropped

Memory usage estimate:
- Average completion: ~500 characters (prompt + responses)
- Buffer size 128: ~64 KB (negligible)
Cause: High logging frequency with small buffer size.
Solution: Increase buffer size or logging interval:
swanlab_completion_log_interval: 200  # Log less frequently
swanlab_completion_max_buffer: 512  # Larger buffer

Add profiling to any trainer method with the @swanlab_profile decorator:
from axolotl.integrations.swanlab.profiling import swanlab_profile

class MyCustomTrainer(AxolotlTrainer):
    @swanlab_profile
    def training_step(self, model, inputs):
        # Your training step logic
        return super().training_step(model, inputs)

    @swanlab_profile
    def prediction_step(self, model, inputs, prediction_loss_only):
        # Your prediction logic
        return super().prediction_step(model, inputs, prediction_loss_only)

The decorator automatically:
1. Measures execution time with high-precision timer
2. Logs to SwanLab as profiling/Time taken: ClassName.method_name

For fine-grained profiling within a method, use the context manager under Advanced Usage below. To filter and throttle profiling logs, use ProfilingConfig. Profiling is automatically enabled when SwanLab is enabled; no additional config is needed, and it can be disabled while keeping SwanLab enabled. A complete example integrating SwanLab with RVQ-Alpha training appears below, and the plugin validates your configuration at startup with clear error messages (see Troubleshooting).

Using multiple logging tools simultaneously (SwanLab + WandB + MLflow + Comet) can impact training performance:
- Performance overhead: ~1-2% per logger (cumulative)
- Increased memory usage

Why this matters:
- With 3 loggers: ~4-5% overhead per step → significant slowdown over long training
- Example: 10,000 steps at 2s/step → ~400-500 seconds extra (6-8 minutes)

For convenience, SwanLab will auto-enable if you specify a project without setting use_swanlab. In distributed training scenarios (multi-GPU), the plugin automatically detects rank, and only rank 0 logs.

Why only rank 0:
- Avoids duplicate experiment runs
- Reduces network/cloud API overhead on worker ranks

SwanLab can work alongside other tracking tools, but each enabled tracker adds ~1-2% overhead, cumulative; for production training, disable all but one logger. Exceptions where multiple loggers are acceptable:
- Short comparison runs (< 100 steps)
- Migration testing between logging tools

The Lark notification plugin requires SwanLab >= 0.3.0. If you provide swanlab_lark_webhook_url without swanlab_lark_secret, notifications still work but without HMAC authentication (a security risk); add the secret for production use. When it’s OK to skip the secret:
- Local development and testing
- Internal networks with restricted access

If Lark notifications fail (invalid webhook URL or network connectivity issues), diagnose as follows:
1. Verify webhook URL is correct (copy from Lark bot settings)
2. Check network connectivity to Lark API

On success, the logs show:
INFO: Registered Lark notification callback with HMAC authentication

If notifications still don’t arrive:
- Verify webhook in Lark: test the webhook manually (see Troubleshooting)
- Check distributed training: only rank 0 sends notifications
- Verify SwanLab is initialized: the Lark callback needs SwanLab running
- Check Lark bot permissions: ensure the bot is added to the target group chat
Advanced Usage: Context Manager
For fine-grained profiling within a method:

from axolotl.integrations.swanlab.profiling import swanlab_profiling_context

class MyTrainer(AxolotlTrainer):
    def complex_training_step(self, model, inputs):
        # Profile just the forward pass
        with swanlab_profiling_context(self, "forward_pass"):
            outputs = model(**inputs)

        # Profile just the backward pass
        with swanlab_profiling_context(self, "backward_pass"):
            loss = outputs.loss
            loss.backward()

        return outputs

Advanced Usage: ProfilingConfig
Filter and throttle profiling logs with ProfilingConfig:

from axolotl.integrations.swanlab.profiling import (
    swanlab_profiling_context_advanced,
    ProfilingConfig,
)

profiling_config = ProfilingConfig(
    enabled=True,
    min_duration_ms=1.0,  # Only log if duration > 1ms
    log_interval=10,  # Log every 10th call
)

class MyTrainer(AxolotlTrainer):
    def frequently_called_method(self, data):
        with swanlab_profiling_context_advanced(
            self,
            "frequent_op",
            config=profiling_config
        ):
            # This only logs every 10th call, and only if it takes > 1ms
            result = expensive_computation(data)
            return result

Parameters:
- enabled: Enable/disable profiling globally (default: True)
- min_duration_ms: Minimum duration to log in milliseconds (default: 0.1)
Configuration in Axolotl Config
Profiling is automatically enabled when SwanLab is enabled; no additional config is needed:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-project

To disable profiling while keeping SwanLab enabled:

from axolotl.integrations.swanlab.profiling import DEFAULT_PROFILING_CONFIG

DEFAULT_PROFILING_CONFIG.enabled = False

Performance Impact
Example: Complete Profiling Setup
from axolotl.integrations.swanlab.profiling import (
    swanlab_profile,
    swanlab_profiling_context_advanced,
    ProfilingConfig,
)

class OptimizedTrainer(AxolotlTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Custom profiling config for high-frequency operations
        self.fast_op_config = ProfilingConfig(
            enabled=True,
            min_duration_ms=0.5,
            log_interval=50,
        )

    @swanlab_profile
    def training_step(self, model, inputs):
        """Main training step - always profile."""
        return super().training_step(model, inputs)

    @swanlab_profile
    def compute_loss(self, model, inputs, return_outputs=False):
        """Loss computation - always profile."""
        return super().compute_loss(model, inputs, return_outputs)

    def _prepare_inputs(self, inputs):
        """High-frequency operation - throttled profiling."""
        with swanlab_profiling_context_advanced(
            self,
            "prepare_inputs",
            config=self.fast_op_config,
        ):
            return super()._prepare_inputs(inputs)

Troubleshooting
Profiling metrics not appearing in SwanLab
Cause: SwanLab is not enabled or not initialized. Ensure your config has:

use_swanlab: true
swanlab_project: my-project

Solution: Check logs for:
INFO: SwanLab initialized for project: my-project

Too many profiling metrics cluttering dashboard
Solution: Use ProfilingConfig with throttling:

config = ProfilingConfig(
    min_duration_ms=1.0,  # Skip fast ops
    log_interval=100,  # Log every 100th call
)

Profiling overhead impacting training speed
Complete Config Example
base_model: /path/to/your/model
model_type: Qwen2ForCausalLM

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

use_swanlab: true
swanlab_project: RVQ-Alpha-Training
swanlab_experiment_name: Qwen2.5-7B-MetaQA-Perturb-P020
swanlab_description: "Training on MetaQA and Perturbation datasets with NEW-RVQ encoding"
swanlab_mode: cloud
swanlab_workspace: single-cell-genomics

sequence_len: 32768
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 2
learning_rate: 2e-5
optimizer: adamw_torch_fused

datasets:
  - path: /path/to/dataset
    type: chat_template

output_dir: ./outputs

Modes Explained
Missing Project Name
use_swanlab: true

Solution:
use_swanlab: true
swanlab_project: my-project

Invalid Mode
use_swanlab: true
swanlab_project: my-project
swanlab_mode: invalid-mode

Solution:
use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud  # or: local, offline, disabled

Empty Project Name
use_swanlab: true
swanlab_project: ""

Solution:
use_swanlab: true
swanlab_project: my-project

Cloud Mode API Key Warning
When using cloud mode without an API key, you’ll receive a warning with multiple solutions:

use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud

Solutions:
1. Set environment variable: export SWANLAB_API_KEY=your-api-key
2. Add to config (less secure): swanlab_api_key: your-api-key
Two Loggers - Warning
use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

Three+ Loggers - Error-Level Warning
use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

use_mlflow: true
mlflow_tracking_uri: http://localhost:5000

Auto-Enable Logic
If you set swanlab_project without use_swanlab, SwanLab auto-enables:

swanlab_project: my-project

is treated as:

use_swanlab: true
swanlab_project: my-project

Distributed Training Detection
use_swanlab: true
swanlab_project: my-project
swanlab_mode: cloud

Method 1: Environment Variable (Recommended)
export SWANLAB_API_KEY=your-api-key-here

Method 2: Login Command
swanlab login

Method 3: Config File
swanlab_api_key: your-api-key-here

What Gets Logged?
Local Mode
swanlab watch ./swanlog

Integration with Existing Tools
plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin

use_swanlab: true
swanlab_project: my-project

use_wandb: true
wandb_project: my-project

Troubleshooting
Error: “SwanLab enabled but ‘swanlab_project’ is not set”
Cause: You enabled SwanLab (use_swanlab: true) but forgot to specify a project name.

Solution:
use_swanlab: true
swanlab_project: my-project  # Add this line

Error: “Invalid swanlab_mode: ‘xxx’”
Cause: You provided an invalid mode value.

Solution: Use one of the valid modes:
swanlab_mode: cloud  # or: local, offline, disabled

Error: “swanlab_project cannot be an empty string”
swanlab_project: "" (empty string).swanlab_project: my-projectswanlab_project: my-projectError: “SwanLab is not installed”
Cause: SwanLab package is not installed in your environment.

Solution:
pip install swanlab
pip install swanlab>=0.3.0

Cause: You have multiple experiment tracking tools enabled. Solution: For production training, disable all but one logger:

use_swanlab: true
swanlab_project: my-project
use_wandb: false  # Disable others
use_mlflow: false

use_swanlab: false
use_wandb: true
wandb_project: my-project

API Key errors
echo $SWANLAB_API_KEY

swanlab login

Cloud sync issues
Solution: Use offline mode and sync later:
swanlab_mode: offline

Then sync when ready:
swanlab sync ./swanlog

Plugin not loaded
Solution: Verify plugin path in config:

plugins:
  - axolotl.integrations.swanlab.SwanLabPlugin  # Correct path

Lark Notification Issues
Error: “Failed to import SwanLab Lark plugin”
Cause: Your SwanLab version doesn’t include the Lark plugin (requires SwanLab >= 0.3.0).

Solution:
pip install --upgrade swanlab

pip install 'swanlab>=0.3.0'

Warning: “Lark webhook has no secret configured”
Cause: You provided swanlab_lark_webhook_url but no swanlab_lark_secret. Notifications will work, but without HMAC authentication (security risk). Solution: Add the secret:

swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/xxx
swanlab_lark_secret: your-webhook-secret  # Add this line

Error: “Failed to register Lark callback”
Cause: Invalid webhook URL or network connectivity issues. Diagnostic steps:

curl -X POST "YOUR_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"msg_type":"text","content":{"text":"Test from Axolotl"}}'

pip show swanlab

# If running multi-GPU, check rank 0 logs specifically
grep "Registered Lark" logs/rank_0.log

use_swanlab: true  # Must be enabled
swanlab_project: my-project  # Must be set
You can add custom metrics in your callbacks:
import swanlab

swanlab.log({
    "custom_metric": value,
    "epoch": epoch_num
})

swanlab compare run1 run2 run3

If you could not load your integration, please ensure you are pip installing in editable mode.
pip install -e .

and that you correctly spelled the integration name in the config file.
plugins:
  - axolotl.integrations.your_integration_name.YourIntegrationPlugin

This method uses the same dataset format as DPO.
EBFT (Energy-Based Fine-Tuning) fine-tunes language models by optimizing a feature-matching loss rather than relying on external reward functions. A frozen copy of the model extracts embeddings from both generated and ground-truth completions, and the generator is updated via REINFORCE to match the ground-truth feature moments.
Paper: “Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models” (Jelassi et al., 2026)

EBFT supports two modes: structured (vLLM-based generation) and strided (single-GPU, no vLLM).
base_model: Qwen/Qwen3-4B

rl: ebft

ebft:
  feature_layers: [0.25, 0.5, 0.75]  # Extract features at 25%, 50%, 75% depth
  embed_method: last_token
  use_whitening: false
  alignment_coef: 1.0  # Cosine similarity reward weight
  diversity_coef: 1.0  # Pairwise dot product penalty
  ce_coef: 0.0  # Cross-entropy on GT tokens (0 = off)

trl:
  num_generations: 4
  max_completion_length: 256
  temperature: 0.7
  use_vllm: true
  vllm_server_host: 0.0.0.0
  vllm_server_port: 8000
  vllm_lora_sync: true  # LoRA adapter sync (recommended)
  vllm_sync_interval: 3
  use_data_producer: true
  async_prefetch: true  # Set false for sync mode
  scale_rewards: true
  loss_type: grpo
  epsilon: 0.2

vllm:
  gpu_memory_utilization: 0.5
  max_model_len: 2048

datasets:
  - path: nvidia/OpenCodeInstruct
    type: ebft_opencode.transform
    split: train[:500]

adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_linear: true

# Terminal 1: Start vLLM
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve config.yaml

# Terminal 2: Train
CUDA_VISIBLE_DEVICES=1 axolotl train config.yaml

Strided mode works on unstructured text (raw code, prose). No vLLM is needed; it runs on a single GPU.
base_model: meta-llama/Llama-3.2-1B

rl: ebft

ebft:
  mode: strided
  stride: 8
  context_length: 8
  generate_max_len: 8
  n_samples_per_prompt: 4
  temperature: 0.6
  feature_layers: [0.25, 0.5, 0.75]
  embed_method: last_token
  use_whitening: true
  alignment_coef: 1.0
  diversity_coef: 1.0
  rl_coef: 1.0
  ce_coef: 0.03
  advantage_estimator: rloo

datasets:
  - path: nvidia/OpenCodeInstruct
    type: ebft_strided_structured.transform
    split: train[:1%]

flash_attention: false
flex_attention: true  # Strided mode uses flex_attention
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true  # Required for flex_attention

CUDA_VISIBLE_DEVICES=0 axolotl train config.yaml

See examples/ebft/ for complete example configs covering Llama 1B/3B/8B and Qwen3 4B/8B models in both modes.
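 
To build intuition for the strided parameters, here is a hypothetical illustration (my own sketch, not axolotl’s implementation) of how a document could be sliced into context/target blocks: every stride tokens, take context_length tokens of context and treat the next generate_max_len ground-truth tokens as the target span:

def strided_blocks(token_ids, stride=8, context_length=8, generate_max_len=8):
    # Yields (context, ground-truth target) pairs anchored every `stride` tokens.
    blocks = []
    for anchor in range(context_length, len(token_ids) - generate_max_len, stride):
        ctx = token_ids[anchor - context_length:anchor]
        target = token_ids[anchor:anchor + generate_max_len]
        blocks.append((ctx, target))
    return blocks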
| Parameter | Default | Description |
|---|---|---|
| ebft.feature_layers | [0.25, 0.5, 0.75] | Layer depths for feature extraction (fractional) |
| ebft.embed_method | last_token | Feature pooling: last_token, mean_pooling, concat |
| ebft.use_whitening | false | SVD whitening of feature dimensions |
| ebft.alignment_coef | 1.0 | Cosine similarity reward weight |
| ebft.diversity_coef | 1.0 | Pairwise dot product penalty weight |
| ebft.ce_coef | 0.0 | Cross-entropy loss on ground-truth tokens |
| ebft.mode | structured | structured (vLLM) or strided (no vLLM) |
| ebft.stride | — | Tokens between anchor points (strided mode) |
| ebft.context_length | — | Context window per block (strided mode) |
| ebft.generate_max_len | — | Tokens to generate per block (strided mode) |
| ebft.n_samples_per_prompt | — | Rollouts per document (strided mode) |
| ebft.advantage_estimator | grpo | grpo or rloo (strided mode) |
NeMo Gym provides 50+ verified RL environments (math, coding, tool-use, reasoning) with deterministic reward signals. The axolotl integration supports both single-turn (call /verify after generation) and multi-turn (agent-based tool execution via /run).
For environments that only need answer verification (math, coding challenges), no agent server is needed; the reward function calls /verify directly on the resource server.
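As a heavily simplified sketch of what a single /verify round trip looks like (the payload schema here is hypothetical; the real logic lives in axolotl.integrations.nemo_gym.rewards.reward_nemo_gym_verify):

import requests

def verify_reward(head_url, server_name, responses_create_params, response, timeout=30):
    # Hypothetical request shape: send the prompt params and the model's
    # completion to the resource server, get a scalar reward back.
    resp = requests.post(
        f"{head_url}/{server_name}/verify",
        json={"responses_create_params": responses_create_params, "response": response},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json().get("reward", 0.0)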
base_model: Qwen/Qwen2.5-0.5B-Instruct

rl: grpo
chat_template: tokenizer_default

trl:
  use_vllm: false  # Colocate mode (single GPU)
  num_generations: 4
  max_completion_length: 128
  temperature: 0.9
  reward_funcs:
    - axolotl.integrations.nemo_gym.rewards.reward_nemo_gym_verify

plugins:
  - axolotl.integrations.nemo_gym.NemoGymPlugin

nemo_gym_enabled: true
nemo_gym_dir: ~/Gym
nemo_gym_auto_start: false
nemo_gym_head_port: 11000
nemo_gym_datasets:
  - path: resources_servers/reasoning_gym/data/train_basic_arithmetic.jsonl
    server_name: reasoning_gym

datasets:
  - path: ~/Gym/resources_servers/reasoning_gym/data/train_basic_arithmetic.jsonl
    type: chat_template
    field_messages: responses_create_params.input
    message_field_content: content
    message_field_role: role

# Terminal 1: Start NeMo Gym resource server
cd ~/Gym && .venv/bin/ng_run \
  "+config_paths=[resources_servers/reasoning_gym/configs/resources_only.yaml]" \
  "+skip_venv_if_present=true"

# Terminal 2: Train
CUDA_VISIBLE_DEVICES=0 axolotl train config.yaml

Note: nemo_gym_datasets.path is relative to nemo_gym_dir. Don’t use absolute paths or they will be double-joined.
For environments with tool use (weather, search, databases), an agent server orchestrates multi-turn interactions: generate → parse tool calls → execute tools → feed results back → repeat until done.
base_model: Qwen/Qwen3-0.6B

rl: grpo
chat_template: tokenizer_default

adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

trl:
  use_vllm: true
  vllm_mode: server
  vllm_server_host: localhost
  vllm_server_port: 8000
  vllm_lora_sync: true
  vllm_sync_interval: 5
  use_data_producer: true
  async_prefetch: true  # 3x speedup
  num_generations: 4
  max_completion_length: 512
  temperature: 0.8
  reward_funcs:
    - axolotl.integrations.nemo_gym.rewards.reward_env

plugins:
  - axolotl.integrations.nemo_gym.NemoGymPlugin

nemo_gym_enabled: true
nemo_gym_auto_start: false
nemo_gym_head_port: 11000
nemo_gym_multi_turn: true
nemo_gym_verify_timeout: 120
nemo_gym_datasets:
  - path: resources_servers/example_single_tool_call/data/weather_tool_calling.jsonl
    server_name: example_single_tool_call

datasets:
  - path: ~/Gym/resources_servers/example_single_tool_call/data/weather_tool_calling.jsonl
    type: chat_template
    field_messages: responses_create_params.input
    message_field_content: content
    message_field_role: role

vllm:
  gpu_memory_utilization: 0.85
  max_model_len: 2048

Multi-turn requires three services running:
# Terminal 1: vLLM with LoRA + tool calling
VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 CUDA_VISIBLE_DEVICES=0 \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B --max-model-len 2048 \
  --gpu-memory-utilization 0.85 \
  --enable-lora --max-lora-rank 64 \
  --enable-auto-tool-choice --tool-call-parser hermes

# Terminal 2: NeMo Gym servers (resource + model proxy + agent)
cd ~/Gym && .venv/bin/ng_run \
  "+config_paths=[configs/axolotl_tool_calling.yaml]" \
  "+skip_venv_if_present=true"

# Terminal 3: Training
CUDA_VISIBLE_DEVICES=1 axolotl train config.yaml

Multi-turn requires a NeMo Gym agent config YAML that defines three components: a resource server (tools + /verify), a model server proxy (forwards to your vLLM), and an agent server (orchestrates /run). See the NeMo Gym README for the agent config format.
# Clone and set up NeMo Gym
git clone https://github.com/NVIDIA-NeMo/Gym.git ~/Gym
cd ~/Gym
uv venv --python 3.12 && source .venv/bin/activate && uv sync

# Fix pycosat build (GCC 13+)
CFLAGS="" uv pip install pycosat --python .venv/bin/python --no-build-isolation

| Parameter | Type | Default | Description |
|---|---|---|---|
| nemo_gym_enabled | bool | — | Enable the NeMo Gym integration |
| nemo_gym_dir | str | ~/Gym | Path to NeMo Gym repo |
| nemo_gym_auto_start | bool | true | Auto-start resource servers |
| nemo_gym_head_port | int | 11000 | Head server port |
| nemo_gym_multi_turn | bool | false | Enable multi-turn via agent /run |
| nemo_gym_verify_timeout | int | 30 | Per-request timeout (seconds) |
| nemo_gym_datasets | list | required | Dataset configs with path and server_name |
| Function | Mode | Description |
|---|---|---|
| axolotl.integrations.nemo_gym.rewards.reward_nemo_gym_verify | Single-turn | Calls /verify, returns binary reward |
| axolotl.integrations.nemo_gym.rewards.reward_env | Multi-turn | Passthrough reward from agent /run |
datasets:
  - ds_type: json
    data_files:
      - orca_rlhf.jsonl
    split: train
    type: chatml.intel
+ type: chatml.intelTRL supports auto-unwrapping PEFT models for RL training paradigms which rely on a reference model. This significantly reduces memory pressure as an additional refreference model does not need to be loaded, and reference model log-probabilities can be obtained by disabling PEFT adapters. This is enabled by default. To turn it off, pass the following config:
# load ref model when adapter training.
rl_adapter_ref_model: true
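A minimal sketch (assuming a PEFT-wrapped model) of how reference log-probabilities can be recovered without a second model; this illustrates the mechanism, not TRL’s exact code:

import torch

def reference_logps(peft_model, input_ids, attention_mask):
    # Temporarily disabling the LoRA adapters recovers the frozen base policy,
    # so no second copy of the model is needed.
    with torch.no_grad(), peft_model.disable_adapter():
        logits = peft_model(input_ids=input_ids, attention_mask=attention_mask).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)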