# Streaming Datasets
Streaming enables memory-efficient training with large datasets by loading data incrementally rather than loading the entire dataset into memory at once.
Use streaming when:

- Your dataset is too large to fit in memory (e.g. when you're doing pretraining with massive text corpora)
- You want to start training immediately without preprocessing the entire dataset

Streaming works with both remote and locally stored datasets!
Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
## Configuration

### Basic Streaming

Enable streaming mode by setting the `streaming` flag:

```yaml
streaming: true
```

### Pretraining with Streaming
For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:
```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

### SFT with Streaming
For supervised fine-tuning with streaming:
```yaml
streaming: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

## Configuration Options
### `streaming_multipack_buffer_size`

Controls the buffer size for multipack streaming (default: 10,000). This determines how many samples are buffered before packing. Larger buffers can improve packing efficiency but use more memory.
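For example, a smaller buffer can be used on memory-constrained machines; the value below is illustrative, not a recommendation:

```yaml
# Illustrative: trade some packing efficiency for a smaller in-memory buffer
streaming_multipack_buffer_size: 2000
```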
### `shuffle_merged_datasets`

When enabled, shuffles the streaming dataset using the buffer. This requires additional memory for the shuffle buffer.
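A minimal sketch of enabling buffered shuffling alongside streaming:

```yaml
streaming: true
shuffle_merged_datasets: true  # shuffles streamed samples via an in-memory buffer
```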
## Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are packed into a single sequence to maximize GPU utilization:
```yaml
sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true  # prevent cross-attention between packed samples
```

For more information, see our documentation on multipacking.
## Important Considerations

### Memory Usage
While streaming reduces memory usage compared to loading entire datasets, you still need to consider:

- You can control memory usage by adjusting `streaming_multipack_buffer_size`
- Sample packing requires buffering multiple samples
- Shuffling requires additional memory for the shuffle buffer
### Performance

- Streaming may have slightly higher latency than preprocessed datasets, since samples are processed on the fly
- Network speed matters when streaming from remote sources, and disk read speed matters when streaming a local dataset
- Consider using `axolotl preprocess` for smaller or more frequently used datasets
### Evaluation Datasets

Evaluation datasets are not streamed to ensure consistent evaluation metrics. They're loaded normally even when training uses streaming.
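As a sketch, an evaluation set can sit alongside a streamed training set. The eval dataset path below is a placeholder, and `test_datasets` is assumed here as the option for declaring a separate evaluation set:

```yaml
streaming: true                        # training data is streamed
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Assumption: a separate eval dataset declared this way is loaded
# normally (not streamed), per the note above
test_datasets:
  - path: your-org/your-eval-dataset   # placeholder path
    type: alpaca
    split: test
```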
## Examples

See the `examples/streaming/` directory for complete configuration examples:

- `pretrain.yaml`: Pretraining with a streaming dataset
- `sft.yaml`: Supervised fine-tuning with streaming