---
title: Streaming Datasets
description: How to use streaming mode for large-scale datasets and memory-efficient training
order: 10
---

Streaming enables memory-efficient training on large datasets by loading data incrementally rather than reading the entire dataset into memory at once.

Use streaming when:

- Your dataset is too large to fit in memory (e.g., pretraining on massive text corpora)
- You want to start training immediately without preprocessing the entire dataset

Streaming works with both remote and locally stored datasets!

::: {.callout-note}
Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
:::

## Configuration

### Basic Streaming

Enable streaming mode by setting the `streaming` flag:

```yaml
streaming: true
```

### Pretraining with Streaming

For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:

```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

### SFT with Streaming

For supervised fine-tuning with streaming:

```yaml
streaming: true

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

## Configuration Options

### `streaming_multipack_buffer_size`

Controls the buffer size for multipack streaming (default: 10,000). This determines how many samples are buffered before packing. Larger buffers can improve packing efficiency but use more memory.

### `shuffle_merged_datasets`

When enabled, the streaming dataset is shuffled using a shuffle buffer. This requires additional memory for the buffer.

## Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are packed into a single sequence to maximize GPU utilization:

```yaml
sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true  # prevent cross-attention between packed samples
```

For more information, see our [documentation](multipack.qmd) on multipacking.

## Important Considerations

### Memory Usage

While streaming reduces memory usage compared to loading entire datasets, you still need to consider:

- The multipack buffer holds `streaming_multipack_buffer_size` samples at a time; lower it to reduce memory usage
- Sample packing requires buffering multiple samples before they can be packed
- Shuffling requires additional memory for the shuffle buffer

### Performance

- Streaming may have slightly higher latency than preprocessed datasets, since samples are processed on the fly
- Network bandwidth matters when streaming from remote sources; disk read speed matters for local datasets
- Consider using `axolotl preprocess` for smaller or frequently reused datasets

### Evaluation Datasets

Evaluation datasets are not streamed, to ensure consistent evaluation metrics. They are loaded normally even when training uses streaming.

## Examples

See the `examples/streaming/` directory for complete configuration examples:

- `pretrain.yaml`: Pretraining with a streaming dataset
- `sft.yaml`: Supervised fine-tuning with streaming
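
If you don't have those example files handy, here is a minimal sketch that combines the options discussed on this page, including `shuffle_merged_datasets`, which has no standalone example above. The dataset path and buffer size mirror the SFT example and are illustrative, not recommendations:

```yaml
# Minimal streaming SFT sketch; values are illustrative, adapt to your setup
streaming: true
shuffle_merged_datasets: true  # shuffle via a buffer; uses extra memory

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

sample_packing: true
streaming_multipack_buffer_size: 10000  # samples buffered before packing
```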
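
If memory pressure is an issue, the multipack buffer is the first knob to turn: a smaller buffer trades packing efficiency for lower memory use. The value below is illustrative, not a recommendation:

```yaml
# Smaller buffer: less memory, but potentially less efficient packing
streaming_multipack_buffer_size: 2000
```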