The plugin used to unconditionally monkey-patch
VLLMClient.init_communicator to a no-op AND silently no-op
sync_weights when vllm_lora_sync was off. Combined, this turned the
trainer into a functional no-op whenever (a) the user ran NeMo Gym
+ LoRA without remembering to set vllm_lora_sync=true or (b) the
user ran NeMo Gym + full fine-tune (which had no working sync path
under the old code).
Replace both patches with:
1. A probe of the configured vLLM server's /openapi.json at
pre_model_load. Three transports are recognized:
- NCCL (/init_communicator/ + /update_named_param/) — TRL serve
and axolotl vllm-serve both expose this
- LoRA filesystem (/v1/load_lora_adapter or /set_lora_adapter/)
- HTTP base64 full-weight (/http_update_weights/) — axolotl
vllm-serve only
2. A pure-logic ``select_weight_sync_transport`` that picks the
right one for (server caps × adapter type).
3. ``init_communicator`` is only patched out when the server has no
NCCL routes; against TRL/axolotl serve modules it stays live so
full-finetune NCCL sync works.
4. ``post_trainer_create`` uses the selection table to install LoRA
filesystem sync OR leave the standard NCCL flow alone OR raise
NotImplementedError (HTTP — pending) OR raise a precise diagnosis
when no transport is viable. No more silent no-op trainers.
* nemo gym integration with grpo wip
* mostly working
* cleanup
* simplify
* update docs
* nemo gym support wip
* cleanup
* chore: lint
* address PR review and add more tests
* chore: lint
* post merge lora fixes for CI (#3536) [skip ci]
* post merge lora fixes for CI
* handle lora kernel auto-enable for moe without grouped_mm
* prefer not to import torch in schema validation
* address pr comments, add timeout, add tests
* roundup_power2_divisions not needed with newer pytorch versions (#3540)
* roundup_power2_divisions not needed with newer pytorch versions
* remove typo
* update qwen3.5 moe 35b-a3b yaml for 5090
* more bug fixes
* fix tests to match updated trainer
* don't use fa2 for hooks test
* reset plugins on the instance
* retry download
* fix references to renamed axolotl_cfg property on trainer
* Fix ref to trainer cfg
* fix: robust handling of race condition on patching check (#3543) [skip ci]
* EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]
* EBFT wip
* fixes
* more fixeS
* add missing strided module
* ebft fixes for multi-turn
* make ebft work with async
* add example for ebft w qwen3.5
* fix for split thinking and update yaml for lora over linear attention only
* enforce_eager for vllm arg in schema
* fix sync weights
* fix multi-gpu
* handle updated sig for mm
* ddp fixes
* improve multi-gpu handling, don't calculate logits, adaptive completion length
* chore: lint
* chore: lint
* support completion_mean
* Address corereview feedback
* clamp min IS ratio
* Address PR code review
* more fixes identified
* address code review
* Fix property from rebase conflict
* fix for ebft sync and update docs
* make trainer loss patch check a solo test
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>