diff --git a/.nojekyll b/.nojekyll index f4b6a59e4..b6837083c 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -f4b50a99 \ No newline at end of file +c8f78714 \ No newline at end of file diff --git a/FAQS.html b/FAQS.html index eadebd0da..3aa04ba1c 100644 --- a/FAQS.html +++ b/FAQS.html @@ -343,6 +343,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); Dataset Preprocessing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + @@ -511,7 +516,10 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});

datasets

datasets

-

Module containing Dataset functionality

+

Module containing dataset functionality.

+

We want this to be a wrapper for an existing dataset that we have loaded. Let’s use the +concept of middlewares to wrap each dataset. We’ll use the collators later on to pad the +datasets.

Classes

@@ -523,72 +531,23 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); - - - -
ConstantLengthDatasetIterable dataset that returns constant length chunks of tokens from stream of
TokenizedPromptDataset Dataset that returns tokenized prompts from a stream of text files.
-
-

ConstantLengthDataset

-
datasets.ConstantLengthDataset(tokenizer, datasets, seq_length=2048)
-

Iterable dataset that returns constant length chunks of tokens from stream of -text files.

-
-

Parameters

- ------ - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NameTypeDescriptionDefault
tokenizerThe processor used for processing the data.required
datasetDataset with text files.required
seq_lengthLength of token sequences to return.2048
-
-

TokenizedPromptDataset

-
datasets.TokenizedPromptDataset(
-    prompt_tokenizer,
-    dataset,
-    process_count=None,
-    keep_in_memory=False,
-    **kwargs,
-)
+
datasets.TokenizedPromptDataset(
+    prompt_tokenizer,
+    dataset,
+    process_count=None,
+    keep_in_memory=False,
+    **kwargs,
+)

Dataset that returns tokenized prompts from a stream of text files.

-
-

Parameters

+
+

Parameters

diff --git a/docs/api/evaluate.html b/docs/api/evaluate.html index dd22b9949..cc662120b 100644 --- a/docs/api/evaluate.html +++ b/docs/api/evaluate.html @@ -378,6 +378,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); Dataset Preprocessing + + + - + @@ -1020,8 +1026,8 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); - - + + diff --git a/docs/api/integrations.base.html b/docs/api/integrations.base.html index b040a4d67..c412e1f99 100644 --- a/docs/api/integrations.base.html +++ b/docs/api/integrations.base.html @@ -378,6 +378,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); Dataset Preprocessing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
datasetsModule containing Dataset functionalityModule containing dataset functionality.
convertCopied from https://github.com/iShohei220/adopt
utils.data.pretrainingdata handling specific to pretrainingutils.data.streamingData handling specific to streaming datasets.
utils.data.sft

prepare_datasets

-
utils.data.sft.prepare_datasets(
-    cfg,
-    tokenizer,
-    processor=None,
-    preprocess_iterable=False,
-)
+
utils.data.sft.prepare_datasets(cfg, tokenizer, processor=None)

Prepare training and evaluation datasets based on configuration.

Parameters

----++++ @@ -572,12 +573,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); - - - - - -
Optional processor for multimodal datasets. None
preprocess_iterableboolWhether to use iterable preprocessing.False
diff --git a/docs/api/utils.data.pretraining.html b/docs/api/utils.data.streaming.html similarity index 98% rename from docs/api/utils.data.pretraining.html rename to docs/api/utils.data.streaming.html index 27de29126..159f22b7a 100644 --- a/docs/api/utils.data.pretraining.html +++ b/docs/api/utils.data.streaming.html @@ -7,7 +7,7 @@ -utils.data.pretraining – Axolotl +utils.data.streaming – Axolotl + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ + +
+ +
+ + +
+ + + +
+ +
+
+

Streaming Datasets

+
+ +
+
+ How to use streaming mode for large-scale datasets and memory-efficient training +
+
+ + +
+ + + + +
+ + + +
+ + +

Streaming enables memory-efficient training with large datasets by loading data +incrementally rather than loading the entire dataset into memory at once.

+

Use streaming when:

+
    +
  • Your dataset is too large to fit in memory (e.g. when you’re doing pretraining with massive text corpora)
  • +
  • You want to start training immediately without preprocessing the entire dataset
  • +
+

Streaming works with both remote and locally stored datasets!

+
+
+
+ +
+
+Note +
+
+
+

Streaming currently only supports a single dataset. Multi-dataset support will be added soon.

+
+
+
+

Configuration

+
+

Basic Streaming

+

Enable streaming mode by setting the streaming flag:

+
streaming: true
+
+
+

Pretraining with Streaming

+

For pretraining tasks, streaming is automatically enabled when using pretraining_dataset:

+
pretraining_dataset:
+  - path: HuggingFaceFW/fineweb-edu
+    type: pretrain
+    text_column: text
+    split: train
+
+# Optionally, enable sample packing
+streaming_multipack_buffer_size: 10000
+sample_packing: true
+
+
+

SFT with Streaming

+

For supervised fine-tuning with streaming:

+
streaming: true
+datasets:
+  - path: tatsu-lab/alpaca
+    type: alpaca
+    split: train
+
+# Optionally, enable sample packing
+streaming_multipack_buffer_size: 10000
+sample_packing: true
+
+
+
+

Configuration Options

+
+

streaming_multipack_buffer_size

+

Controls the buffer size for multipack streaming (default: 10,000). This determines how +many samples are buffered before packing. Larger buffers can improve packing efficiency +but use more memory.
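As a sketch (the values shown are illustrative defaults, not tuned recommendations), the buffer size can be set alongside sample packing:

```yaml
streaming: true
sample_packing: true
# Default is 10000; lower this to reduce memory,
# raise it to improve packing efficiency
streaming_multipack_buffer_size: 10000
```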

+
+
+

shuffle_merged_datasets

+

When enabled, shuffles the streaming dataset using the buffer. This requires additional +memory for the shuffle buffer.
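A minimal sketch combining shuffling with streaming, using only the flags documented in this section (the buffer size is illustrative):

```yaml
streaming: true
shuffle_merged_datasets: true
# The shuffle buffer consumes additional memory
# on top of the multipack buffer
streaming_multipack_buffer_size: 10000
```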

+
+
+
+

Sample Packing with Streaming

+

Sample packing is supported for streaming datasets. When enabled, multiple samples are +packed into a single sequence to maximize GPU utilization:

+
sample_packing: true
+streaming_multipack_buffer_size: 10000
+
+# For SFT: attention is automatically isolated between packed samples
+# For pretraining: control with pretrain_multipack_attn
+pretrain_multipack_attn: true  # prevent cross-attention between packed samples
+

For more information, see our documentation on multipacking.

+
+
+

Important Considerations

+
+

Memory Usage

+

While streaming reduces memory usage compared to loading entire datasets, you still need +to consider:

+
    +
  • The multipack buffer: memory usage can be controlled by adjusting streaming_multipack_buffer_size
  • +
  • Sample packing requires buffering multiple samples
  • +
  • Shuffling requires additional memory for the shuffle buffer
  • +
+
+
+

Performance

+
    +
  • Streaming may have slightly higher latency compared to preprocessed datasets, as samples are processed on-the-fly
  • +
  • Network speed and disk read speed matter when streaming from remote sources or local datasets, respectively
  • +
  • Consider using axolotl preprocess for smaller or more frequently used datasets
  • +
+
+
+

Evaluation Datasets

+

Evaluation datasets are not streamed to ensure consistent evaluation metrics. They’re +loaded normally even when training uses streaming.
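A hedged sketch of what this looks like in practice. The test_datasets key and the split names here are assumptions for illustration only (they do not appear in this section); the point is that the evaluation entry is loaded eagerly while training data streams:

```yaml
streaming: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Evaluation data is loaded normally, not streamed
test_datasets:  # hypothetical key for this example
  - path: tatsu-lab/alpaca
    type: alpaca
    split: test  # illustrative split name
```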

+
+
+
+

Examples

+

See the examples/streaming/ directory for complete configuration examples:

+
    +
  • pretrain.yaml: Pretraining with streaming dataset
  • +
  • sft.yaml: Supervised fine-tuning with streaming
  • +
+ + +
+ +
+ +
+ + + + + \ No newline at end of file diff --git a/docs/torchao.html b/docs/torchao.html index 673e65d98..c6ab20338 100644 --- a/docs/torchao.html +++ b/docs/torchao.html @@ -379,6 +379,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); Dataset Preprocessing + + + + + + +