axolotl/tests at 69f165b39bc11e719b9b1c1ea9753f0e374392a2 - axolotl - Gitea

tocmo0nlord/axolotl

Wing Lian 69f165b39b probe vLLM weight-sync routes and select transport per server

The plugin used to unconditionally monkey-patch
VLLMClient.init_communicator to a no-op AND silently no-op
sync_weights when vllm_lora_sync was off. Combined, this turned the
trainer into a functional no-op whenever (a) the user ran NeMo Gym
+ LoRA without remembering to set vllm_lora_sync=true or (b) the
user ran NeMo Gym + full fine-tune (which had no working sync path
under the old code).

Replace both patches with:

1. A probe of the configured vLLM server's /openapi.json at
   pre_model_load. Three transports are recognized:
     - NCCL (/init_communicator/ + /update_named_param/) — TRL serve
       and axolotl vllm-serve both expose this
     - LoRA filesystem (/v1/load_lora_adapter or /set_lora_adapter/)
     - HTTP base64 full-weight (/http_update_weights/) — axolotl
       vllm-serve only

2. A pure-logic ``select_weight_sync_transport`` that picks the
   right one for (server caps × adapter type).

3. ``init_communicator`` is only patched out when the server has no
   NCCL routes; against TRL/axolotl serve modules it stays live so
   full-finetune NCCL sync works.

4. ``post_trainer_create`` uses the selection table to install LoRA
   filesystem sync OR leave the standard NCCL flow alone OR raise
   NotImplementedError (HTTP — pending) OR raise a precise diagnosis
   when no transport is viable. No more silent no-op trainers.
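
The probe and selection steps above can be sketched roughly as follows. This is an illustrative sketch only, not the plugin's code: the route strings come from the commit message, but `Transport`, `detect_transports`, `pick_transport`, and the preference order (LoRA filesystem first for adapters, then NCCL, then HTTP for full fine-tunes) are hypothetical names and assumptions. The input is assumed to be the already-parsed JSON body of the server's /openapi.json.

```python
from __future__ import annotations

from enum import Enum


class Transport(Enum):
    """Hypothetical labels for the three transports named in the commit."""

    NCCL = "nccl"
    LORA_FILESYSTEM = "lora_filesystem"
    HTTP_FULL = "http_full"


# Route names as listed in the commit message.
NCCL_ROUTES = ("/init_communicator/", "/update_named_param/")
LORA_ROUTES = ("/v1/load_lora_adapter", "/set_lora_adapter/")
HTTP_ROUTES = ("/http_update_weights/",)


def detect_transports(openapi: dict) -> set[Transport]:
    """Map the 'paths' keys of a parsed openapi.json to transports."""
    paths = set(openapi.get("paths", {}))
    found: set[Transport] = set()
    # NCCL needs both routes; the LoRA transport needs either one.
    if all(route in paths for route in NCCL_ROUTES):
        found.add(Transport.NCCL)
    if any(route in paths for route in LORA_ROUTES):
        found.add(Transport.LORA_FILESYSTEM)
    if any(route in paths for route in HTTP_ROUTES):
        found.add(Transport.HTTP_FULL)
    return found


def pick_transport(caps: set[Transport], uses_lora: bool) -> Transport:
    """Pure-logic selection; the preference order here is an assumption."""
    if uses_lora and Transport.LORA_FILESYSTEM in caps:
        return Transport.LORA_FILESYSTEM
    if Transport.NCCL in caps:
        return Transport.NCCL
    if not uses_lora and Transport.HTTP_FULL in caps:
        return Transport.HTTP_FULL
    # Fail loudly instead of training without weight sync.
    raise ValueError(
        f"no viable weight-sync transport (server={sorted(t.value for t in caps)}, "
        f"lora={uses_lora})"
    )


if __name__ == "__main__":
    spec = {"paths": {"/init_communicator/": {}, "/update_named_param/": {}}}
    print(pick_transport(detect_transports(spec), uses_lora=False))
```

Keeping the selection free of network calls, as the commit describes, lets the (server caps × adapter type) table be unit-tested without a running vLLM server. In the commit's design the caller would then raise NotImplementedError for the HTTP case until that path lands.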

2026-04-15 13:27:30 +00:00

feat: support dot-notation CLI args for nested config options (#3419)

2026-02-23 10:10:06 -05:00

Add precompute_ref_log_probs to config schema (#3555) [skip ci]

2026-04-01 13:28:40 -04:00

upgrade torchao to 0.17.0 (#3569)

2026-04-02 10:18:00 -04:00

Respect sequence_len in config for type: llama2_chat (#926)

2023-12-12 09:39:22 -08:00

probe vLLM weight-sync routes and select transport per server

2026-04-15 13:27:30 +00:00

[gemma4] use mixed Flash Attention and SDPA and add fused RMSNorm+RoPE Triton kernels (#3598)

2026-04-12 10:29:55 -04:00

Nemo gym integration (#3516) [skip ci]

2026-03-25 07:38:06 -04:00

allow bf16 flag but warn (#3563) [skip ci]

2026-04-01 09:54:01 -04:00

prompt_strategies

handle trainable/masked spans in content and reasoning content (#3592)

2026-04-10 14:11:10 -04:00

Fix DO_NOT_TRACK not being correctly handled (#3580)

2026-04-04 05:16:58 -04:00

validate batch shape against num_generations at config time

2026-04-15 13:27:30 +00:00

__init__.py

fix: minor patches for multimodal (#2441)

2025-03-31 13:40:12 +07:00

conftest.py

roundup_power2_divisions not needed with newer pytorch versions (#3540)

2026-03-24 15:40:05 -04:00

constants.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

hf_offline_utils.py

transformers v5 upgrade (#3272)

2026-01-27 17:08:24 -05:00

test_chunked_xentropy.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

test_context_parallel_batch_size.py

fix: correct total_num_steps and batch_size calculation with context parallelism (#3444)

2026-03-05 12:33:28 -05:00

test_convert.py

fix: fix CONTRIBUTING.md placeholders, bare except clauses, and add convert.py tests (#3485) [skip ci]

2026-03-16 00:12:40 -04:00

test_data.py

Fix: excess_length_strategy truncation method (#3401)

2026-02-25 11:31:11 +07:00

test_datasets.py

feature: raise on long sequence drop (#3321)

2025-12-22 13:59:49 -05:00

test_dict.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

test_ebft_kernels.py

EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]

2026-03-24 18:43:46 -04:00

test_ebft_strided_structured.py

EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]

2026-03-24 18:43:46 -04:00

test_exact_deduplication.py

feat:add support dataset_num_processes (#3129) [skip ci]

2025-10-13 17:18:12 +07:00

test_freeze.py

Train parameters exclusively in specific ranges (#1390)

2024-03-14 11:05:42 -04:00

test_http_weight_sync.py

EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]

2026-03-24 18:43:46 -04:00

test_loaders.py

fix: transformers deprecate load_in_Xbit in model_kwargs (#3205)

2025-10-16 16:07:27 +07:00

test_logging_config_file_capture.py

Debug log, logging improvements (#3159)

2025-09-17 13:27:03 -04:00

test_lora.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

test_normalize_config.py

transformers v5 upgrade (#3272)

2026-01-27 17:08:24 -05:00

test_opentelemetry_callback.py

Feat/opentelemetry (#3215)

2025-10-22 19:16:55 -07:00

test_packed_batch_sampler.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

test_packed_dataset.py

feat:add support dataset_num_processes (#3129) [skip ci]

2025-10-13 17:18:12 +07:00

test_packed_pretraining.py

Streaming SFT support (#3101)

2025-09-02 12:08:44 -04:00

test_perplexity.py

transformers v5 upgrade (#3272)

2026-01-27 17:08:24 -05:00

test_prompt_tokenizers.py

DPO transformers v0.29 fixes (#3560) [skip ci]

2026-03-31 19:04:53 -04:00

test_prompters.py

fix: prompt phi (#1845) [skip ci]

2024-08-22 11:46:57 -04:00

test_revision_parameter.py

fix: pass revision parameter to tokenizer and processor loaders (#3388) [skip ci]

2026-02-25 11:11:20 +07:00

test_save_deduplicated.py

fix: Save de-duplicated dataset during pre-processing (#3427)

2026-03-02 12:55:59 -05:00

test_schedulers.py

Add ruff, remove black, isort, flake8, pylint (#3092)

2025-08-23 23:37:33 -04:00

test_streaming.py

text diffusion training plugin (#3067)

2025-09-10 20:27:00 -04:00

test_tensor_parallel_batch_size.py

use new tf32 APIs for torch 2.9+ (#3467) [skip ci]

2026-03-06 11:40:32 -05:00

test_tokenizers.py

upgrade transformers==5.3.0 trl==0.29.0 kernels (#3459)

2026-03-06 09:11:20 -05:00

test_train.py

refactor dupes from merge/rebase (#2919) [skip ci]

2025-07-14 10:05:26 -04:00

test_triton_kernels.py

use custom triton kernels for entropy from logits and selective softmax (#3510)

2026-03-19 02:02:43 -04:00

test_utils_tee.py

Debug log, logging improvements (#3159)

2025-09-17 13:27:03 -04:00

test_validation_dataset.py

automatically enable tf32 if supported (#3473) [skip ci]

2026-03-16 23:47:00 -04:00