axolotl/tests at e672d37f338f107b83f255baf95f00dfef05cc88 - axolotl - Gitea

tocmo0nlord/axolotl


Manas Vardhan 474208b794 fix: Save de-duplicated dataset during pre-processing (#3427 )

* fix: run deduplication before saving dataset during preprocessing

Move deduplicate_and_log_datasets call before save_preprocessed_dataset
in both SFT and RL data loading pipelines. This ensures the saved
preprocessed dataset is already de-duplicated, so subsequent loads
from cache don't contain duplicates.

Fixes #2719
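
The reordering described above can be illustrated with a minimal sketch. This is not axolotl's code: `deduplicate`, `prepare`, and the in-memory `cache` dict are hypothetical stand-ins for `deduplicate_and_log_datasets`, the data loading pipeline, and the on-disk preprocessed dataset. It only shows why deduplicating before saving keeps duplicates out of later cache loads.

```python
# Hypothetical stand-ins for illustration; these are not axolotl's functions.

def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def prepare(rows, cache, cache_key, dedup=True):
    """Deduplicate first, then save, so the cached copy has no duplicates."""
    if cache_key in cache:
        # Later runs return whatever was saved, without reprocessing.
        return cache[cache_key]
    if dedup:
        rows = deduplicate(rows)
    cache[cache_key] = rows  # the save happens after deduplication
    return rows


cache = {}
raw = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
first = prepare(raw, cache, "demo")   # processes and saves
second = prepare(raw, cache, "demo")  # served from the cache
```

If the save ran before deduplication, `second` would still contain the repeated row, which is the behavior the commit message describes fixing.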

* fix: include deduplication flag in dataset hash and warn on skip_prepare_dataset+dedup

- Add dataset_exact_deduplication to the hash string in
  generate_dataset_hash_from_config so cached datasets are invalidated
  when the dedup setting changes.
- Log a warning when skip_prepare_dataset=True and
  dataset_exact_deduplication=True, since dedup will be silently
  skipped in that configuration (both SFT and RL paths).
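
As a rough sketch of the cache-invalidation idea in the first bullet: the function below is not the real `generate_dataset_hash_from_config`, and the `sequence_len` and `tokenizer` fields are assumed purely for illustration. It shows that once the dedup flag is part of the hashed string, toggling the flag produces a different cache key.

```python
import hashlib


def dataset_hash(cfg):
    """Hypothetical hash over a few config fields, including the dedup flag."""
    parts = [
        str(cfg.get("sequence_len")),
        str(cfg.get("tokenizer")),
        # Including the flag means a dedup change yields a different hash.
        str(cfg.get("dataset_exact_deduplication", False)),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


base = {"sequence_len": 2048, "tokenizer": "example-tokenizer"}
hash_without = dataset_hash({**base, "dataset_exact_deduplication": False})
hash_with = dataset_hash({**base, "dataset_exact_deduplication": True})
```

Here `hash_with` and `hash_without` differ, so a dataset cached with deduplication off would not be reused after it is turned on.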

* fix: add ValueError for skip_prepare+dedup, fix test mock target and formatting

- Add config validator (check_deduplication_with_skip_prepare) that raises
  ValueError when skip_prepare_dataset=True and dataset_exact_deduplication=True
- Replace runtime warnings in sft.py/rl.py with the validator check
- Fix RL test: patch axolotl.utils.data.rl.load_tokenizer instead of
  axolotl.loaders.load_tokenizer to properly mock the imported reference
- Fix ruff lint (remove unused imports) and formatting issues
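
The validator in the first bullet might look roughly like the sketch below. Only the function name and the two option names come from the commit message; the plain-dict signature and the error text are assumptions, and the real validator may be wired into the config schema differently.

```python
def check_deduplication_with_skip_prepare(cfg):
    """Reject configs where dedup is requested but preparation is skipped.

    Assumed signature: takes and returns a plain dict of config options.
    """
    if cfg.get("skip_prepare_dataset") and cfg.get("dataset_exact_deduplication"):
        raise ValueError(
            "dataset_exact_deduplication=True has no effect when "
            "skip_prepare_dataset=True; disable one of the two options."
        )
    return cfg


# A valid config passes through unchanged.
ok_cfg = check_deduplication_with_skip_prepare(
    {"skip_prepare_dataset": False, "dataset_exact_deduplication": True}
)

# The conflicting combination fails at validation time instead of
# silently skipping deduplication later.
try:
    check_deduplication_with_skip_prepare(
        {"skip_prepare_dataset": True, "dataset_exact_deduplication": True}
    )
    rejected = False
except ValueError:
    rejected = True
```

Raising at validation time replaces the earlier runtime warning, so the conflict surfaces before any data loading starts.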

* refactor: inline deduplicate function per review feedback

* fix test fixture, lint

---------

Co-authored-by: ManasVardhan <manasvardhan@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>

2026-03-02 12:55:59 -05:00

feat: support dot-notation CLI args for nested config options (#3419 )

2026-02-23 10:10:06 -05:00

upgrade transformers to 5.2.0 and torchao to 0.16.0 (#3407 )

2026-02-19 18:27:27 -05:00

mark slow tests that are timing out in CI (#3428 ) [skip ci]

2026-03-02 12:26:30 -05:00

Respect sequence_len in config for type: llama2_chat (#926 )

2023-12-12 09:39:22 -08:00

ScatterMoE LoRA support (#3410 )

2026-02-24 14:59:55 -05:00

transformers v5 upgrade (#3272 )

2026-01-27 17:08:24 -05:00

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

prompt_strategies

handle warnings from v5 upgrade (#3376 )

2026-01-28 06:45:01 -05:00

fix: remove telemetry warning (#3397 ) [skip ci]

2026-02-10 23:01:16 +07:00

Fix: excess_length_strategy truncation method (#3401 )

2026-02-25 11:31:11 +07:00

__init__.py

fix: minor patches for multimodal (#2441 )

2025-03-31 13:40:12 +07:00

conftest.py

transformers v5 upgrade (#3272 )

2026-01-27 17:08:24 -05:00

constants.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

hf_offline_utils.py

transformers v5 upgrade (#3272 )

2026-01-27 17:08:24 -05:00

test_chunked_xentropy.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_data.py

Fix: excess_length_strategy truncation method (#3401 )

2026-02-25 11:31:11 +07:00

test_datasets.py

feature: raise on long sequence drop (#3321 )

2025-12-22 13:59:49 -05:00

test_dict.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_exact_deduplication.py

feat:add support dataset_num_processes (#3129 ) [skip ci]

2025-10-13 17:18:12 +07:00

test_expand_mask.py

adding pre-commit auto-update GH action and bumping plugin versions (#2428 )

2025-03-21 11:02:43 -04:00

test_freeze.py

Train parameters exclusively in specific ranges (#1390 )

2024-03-14 11:05:42 -04:00

test_loaders.py

fix: transformers deprecate load_in_Xbit in model_kwargs (#3205 )

2025-10-16 16:07:27 +07:00

test_logging_config_file_capture.py

Debug log, logging improvements (#3159 )

2025-09-17 13:27:03 -04:00

test_lora.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_normalize_config.py

transformers v5 upgrade (#3272 )

2026-01-27 17:08:24 -05:00

test_opentelemetry_callback.py

Feat/opentelemetry (#3215 )

2025-10-22 19:16:55 -07:00

test_packed_batch_sampler.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_packed_dataset.py

feat:add support dataset_num_processes (#3129 ) [skip ci]

2025-10-13 17:18:12 +07:00

test_packed_pretraining.py

Streaming SFT support (#3101 )

2025-09-02 12:08:44 -04:00

test_perplexity.py

transformers v5 upgrade (#3272 )

2026-01-27 17:08:24 -05:00

test_prompt_tokenizers.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_prompters.py

fix: prompt phi (#1845 ) [skip ci]

2024-08-22 11:46:57 -04:00

test_revision_parameter.py

fix: pass revision parameter to tokenizer and processor loaders (#3388 ) [skip ci]

2026-02-25 11:11:20 +07:00

test_save_deduplicated.py

fix: Save de-duplicated dataset during pre-processing (#3427 )

2026-03-02 12:55:59 -05:00

test_schedulers.py

Add ruff, remove black, isort, flake8, pylint (#3092 )

2025-08-23 23:37:33 -04:00

test_streaming.py

text diffusion training plugin (#3067 )

2025-09-10 20:27:00 -04:00

test_tokenizers.py

mark slow tests that are timing out in CI (#3428 ) [skip ci]

2026-03-02 12:26:30 -05:00

test_train.py

refactor dupes from merge/rebase (#2919 ) [skip ci]

2025-07-14 10:05:26 -04:00

test_utils_tee.py

Debug log, logging improvements (#3159 )

2025-09-17 13:27:03 -04:00

test_validation_dataset.py

Distributed Muon Optimizer (#3264 )

2025-12-19 10:43:47 -05:00