feature: raise on long sequence drop (#3321)

* feature: raise on long sequence drop

  It is sometimes not desired that sequences are silently dropped from the dataset, especially when the dataset has been carefully crafted and pre-fitted for the training context. This would then suggest that an error occurred somewhere in the process. This feature adds a third value for excess_length_strategy called 'raise', which will raise a ValueError if a sequence is encountered that is too long and would have normally been dropped/truncated.

* tests: add excess_length_strategy tests
* doc: updated return value description for drop_long_seq_in_dataset
* add @enable_hf_offline
* fixed cfg modified after validate_config called
* hf offline fix
* fix tqdm desc when raise is used
* test: added test for non-batched case
* accidental code change revert
* test: use pytest.raises
* test: simplified drop_seq_len tests
* test: moved excess_length_strat test to test_data.py

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
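
The behaviour described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the axolotl implementation: the helper name `apply_excess_length_strategy`, its signature, and the plain-list data layout are hypothetical, and the other two strategy names ('drop' and 'truncate') are an assumption based on the "dropped/truncated" wording above.

```python
from typing import List


def apply_excess_length_strategy(
    samples: List[List[int]],
    sequence_len: int,
    strategy: str = "drop",  # assumed values: "drop", "truncate", "raise"
) -> List[List[int]]:
    """Hypothetical helper: handle token sequences longer than sequence_len."""
    if strategy not in ("drop", "truncate", "raise"):
        raise ValueError(f"unknown excess_length_strategy: {strategy!r}")

    kept: List[List[int]] = []
    for index, input_ids in enumerate(samples):
        if len(input_ids) <= sequence_len:
            # Sequence fits the training context: keep it unchanged.
            kept.append(input_ids)
        elif strategy == "raise":
            # Fail loudly instead of silently altering the dataset.
            raise ValueError(
                f"sample {index} has {len(input_ids)} tokens, "
                f"which exceeds sequence_len={sequence_len}"
            )
        elif strategy == "truncate":
            # Keep only the first sequence_len tokens.
            kept.append(input_ids[:sequence_len])
        # strategy == "drop": skip the over-long sample entirely.
    return kept
```

With 'raise', a dataset that was pre-fitted to the context length either passes through untouched or stops processing at the first over-long sample, which is the failure signal the commit message asks for.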
2025-12-23 03:59:49 +09:00
parent efeb5a4e41
commit 92ee4256f7
5 changed files with 67 additions and 7 deletions
--- a/tests/test_datasets.py
+++ b/tests/test_datasets.py
@@ -13,7 +13,9 @@ from transformers import PreTrainedTokenizer

 from axolotl.loaders.tokenizer import load_tokenizer
 from axolotl.utils.data.rl import prepare_preference_datasets
-from axolotl.utils.data.sft import _load_tokenized_prepared_datasets
+from axolotl.utils.data.sft import (
+    _load_tokenized_prepared_datasets,
+)
 from axolotl.utils.dict import DictDefault

 from tests.constants import (