---
title: Streaming Datasets
description: How to use streaming mode for large-scale datasets and memory-efficient training
order: 10
---

Streaming enables memory-efficient training with large datasets by loading data
incrementally rather than loading the entire dataset into memory at once.

Use streaming when:

- Your dataset is too large to fit in memory (e.g. pretraining on massive text corpora)
- You want to start training immediately without preprocessing the entire dataset

Streaming works with both remote and locally stored datasets!

::: {.callout-note}
Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
:::

## Configuration

### Basic Streaming

Enable streaming mode by setting the `streaming` flag:

```yaml
streaming: true
```

### Pretraining with Streaming

For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:

```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

### SFT with Streaming

For supervised fine-tuning with streaming:

```yaml
streaming: true

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```

## Configuration Options

### `streaming_multipack_buffer_size`

Controls the buffer size for multipack streaming (default: 10,000). This determines how
many samples are buffered before packing. Larger buffers can improve packing efficiency
but use more memory.
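
As a sketch of how you might tune this trade-off (the value below is illustrative, not a recommendation):

```yaml
# A smaller buffer lowers memory usage on constrained machines,
# at the cost of potentially less efficient packing
streaming_multipack_buffer_size: 5000
sample_packing: true
```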

### `shuffle_merged_datasets`

When enabled, shuffles the streaming dataset using a shuffle buffer. This requires
additional memory for the buffer.
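
For example, a minimal fragment enabling shuffling for a streamed run (combine with your dataset config):

```yaml
streaming: true
shuffle_merged_datasets: true
```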

## Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are
packed into a single sequence to maximize GPU utilization:

```yaml
sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true  # prevent cross-attention between packed samples
```

For more information, see our [documentation](multipack.qmd) on multipacking.

## Important Considerations

### Memory Usage

While streaming reduces memory usage compared to loading entire datasets, keep in mind:

- Memory usage can be controlled by adjusting `streaming_multipack_buffer_size`
- Sample packing requires buffering multiple samples
- Shuffling requires additional memory for the shuffle buffer
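
Putting these together, a memory-lean streaming setup might look like the following (illustrative values, not tuned recommendations; note that disabling shuffling changes the order in which samples are seen):

```yaml
streaming: true
sample_packing: true
streaming_multipack_buffer_size: 2000  # smaller buffer -> less memory, possibly worse packing
shuffle_merged_datasets: false         # skip the shuffle buffer to save memory
```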

### Performance

- Streaming may have slightly higher latency than preprocessed datasets, since samples are processed on-the-fly
- Network speed matters when streaming from remote sources; disk read speed matters for local datasets
- Consider using `axolotl preprocess` for smaller or more frequently used datasets

### Evaluation Datasets

Evaluation datasets are not streamed, to ensure consistent evaluation metrics. They're
loaded normally even when training uses streaming.

## Examples

See the `examples/streaming/` directory for complete configuration examples:

- `pretrain.yaml`: Pretraining with streaming dataset
- `sft.yaml`: Supervised fine-tuning with streaming