datasets

Module containing Dataset functionality

Classes

Name	Description
ConstantLengthDataset	Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
TokenizedPromptDataset	Dataset that returns tokenized prompts from a stream of text files.

ConstantLengthDataset

datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

Args:
    tokenizer (Tokenizer): The tokenizer used to process the data.
    datasets (dataset.Dataset): Dataset with text files (listed as `dataset` in the original docstring; the signature names it `datasets`).
    seq_length (int): Length of the token sequences to return. Defaults to 2048.
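
To illustrate what "constant-length chunks" means, here is a minimal sketch in plain Python. It is not the library's implementation: the function `constant_length_chunks`, the `toy_tokenize` helper, the `eos_token_id` separator, and the choice to drop a trailing partial chunk are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int = 2048,
    eos_token_id: int = 0,  # assumed separator id, not taken from the library
) -> Iterator[List[int]]:
    """Yield fixed-size token chunks from a stream of texts (illustrative only)."""
    buffer: List[int] = []
    for text in texts:
        # Append this document's tokens, then a separator between documents.
        buffer.extend(tokenize(text))
        buffer.append(eos_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Leftover tokens shorter than seq_length are dropped in this sketch.


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one id (the word length) per word.
    return [len(word) for word in text.split()]


chunks = list(
    constant_length_chunks(["a bb ccc", "dddd ee"], toy_tokenize, seq_length=3)
)
```

Every chunk has exactly `seq_length` tokens, and a chunk may span the boundary between two source texts. A real implementation would also need to decide how to handle labels, attention masks, and the final partial chunk.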

TokenizedPromptDataset

datasets.TokenizedPromptDataset(
    self,
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.

Args:
    prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
    dataset (dataset.Dataset): Dataset with text files.
    process_count (int): Number of processes to use for tokenizing.
    keep_in_memory (bool): Whether to keep the tokenized dataset in memory.
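
The general pattern, a dataset that applies a tokenizing strategy to each record, can be sketched in plain Python as follows. This is a toy under stated assumptions, not the library's class: `ToyPromptStrategy`, `ToyTokenizedPromptDataset`, the `tokenize_prompt` method name, and the record fields `instruction` and `output` are all hypothetical, and `process_count` and `keep_in_memory` are omitted because the sketch tokenizes eagerly in a single process.

```python
from typing import Dict, List, Sequence


class ToyPromptStrategy:
    """Hypothetical stand-in for a prompt tokenizing strategy."""

    def tokenize_prompt(self, record: Dict[str, str]) -> Dict[str, List[int]]:
        # Build a prompt string, then map each character to its code point.
        prompt = f"{record['instruction']}\n{record['output']}"
        input_ids = [ord(ch) for ch in prompt]
        return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


class ToyTokenizedPromptDataset:
    """Tokenizes every record once up front and serves the results by index."""

    def __init__(self, prompt_tokenizer, dataset: Sequence[Dict[str, str]]):
        self._rows = [prompt_tokenizer.tokenize_prompt(r) for r in dataset]

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        return self._rows[idx]


records = [{"instruction": "Hi", "output": "Yo"}]
tokenized = ToyTokenizedPromptDataset(ToyPromptStrategy(), records)
```

Keeping the strategy separate from the dataset means the same dataset wrapper can serve different prompt formats by swapping in a different strategy object.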