Implement configurable handling of excess tokens in datasets

- Added `excess_token_handling` option to the configuration, allowing users to choose between "drop" and "truncate" for handling tokens exceeding the maximum sequence length.
- Introduced `truncate_or_drop_long_seq` function to manage both single and batched samples based on the selected handling method.
- Updated relevant dataset processing functions to utilize the new handling option, ensuring backward compatibility with existing "drop" behavior.
- Enhanced logging to reflect truncation actions in dataset processing.

This change improves flexibility in managing sequence lengths during training and evaluation.
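The behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the real `truncate_or_drop_long_seq` lives in the changed files, and the field names (`input_ids`, `attention_mask`) and helper names here are assumptions.

```python
def handle_single(sample, max_len=2048, handling="drop"):
    """Apply the excess_token_handling policy to one tokenized sample.

    sample is a dict mapping field names to token lists,
    e.g. {"input_ids": [...], "attention_mask": [...]}.
    Returns None when the sample is dropped; the caller filters those out.
    """
    if len(sample["input_ids"]) <= max_len:
        return sample
    if handling == "drop":
        return None
    if handling == "truncate":
        # Cut every field to the maximum length so they stay aligned.
        return {k: v[:max_len] for k, v in sample.items()}
    raise ValueError(f"unknown excess_token_handling: {handling!r}")


def handle_batched(batch, max_len=2048, handling="drop"):
    """Same policy for a batched sample: dict of field -> list of sequences."""
    keys = list(batch)
    n = len(batch[keys[0]])
    if handling == "drop":
        # Keep only the indices whose sequences fit within max_len.
        keep = [i for i in range(n) if len(batch["input_ids"][i]) <= max_len]
        return {k: [batch[k][i] for i in keep] for k in keys}
    if handling == "truncate":
        return {k: [seq[:max_len] for seq in batch[k]] for k in keys}
    raise ValueError(f"unknown excess_token_handling: {handling!r}")
```

With `handling="drop"` the sample count can shrink, while `"truncate"` preserves every sample at the cost of cutting off trailing tokens; the default keeps the pre-existing "drop" behavior.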
Author: mhenrhcsen
Date: 2025-05-12 14:08:43 +02:00
Commit: 9f68918f13 (parent 47e0e71bc8)
6 changed files with 247 additions and 41 deletions

@@ -332,6 +332,8 @@ dataset_shard_idx:
# The maximum length of an input to train with, this should typically be less than 2048
# as most models have a token/context limit of 2048
sequence_len: 2048
# How to handle tokens exceeding max sequence length - "drop" (default, removes sample) or "truncate" (cuts off excess tokens)
excess_token_handling: drop
# Pad inputs so each step uses constant sized buffers
# This will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
pad_to_sequence_len:
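A minimal sketch of how a loader might resolve the new option while staying backward compatible; `cfg` stands in for the parsed YAML config, and the lookup logic is an assumption about the implementation, not taken from the diff.

```python
# Older configs omit excess_token_handling entirely, so the loader falls
# back to "drop", preserving the previous behavior.
cfg = {"sequence_len": 2048}
handling = cfg.get("excess_token_handling", "drop")
print(handling)

# A config that opts in to truncation overrides the default.
cfg["excess_token_handling"] = "truncate"
handling = cfg.get("excess_token_handling", "drop")
print(handling)
```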