datasets
Module containing Dataset functionality
Classes
| Name | Description |
|---|---|
| ConstantLengthDataset | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
ConstantLengthDataset
datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

Args:

| Name | Type | Description |
|---|---|---|
| tokenizer | Tokenizer | The processor used for processing the data. |
| datasets | dataset.Dataset | Dataset with text files. |
| seq_length | int | Length of token sequences to return. |
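
To make "constant-length chunks" concrete, here is a minimal sketch of the general packing idea. It is not the library's implementation. The names `toy_tokenize`, `constant_length_chunks`, and `eos_token_id` are hypothetical, and dropping the trailing partial chunk is an assumption made for brevity; a real implementation may pad it instead.

```python
from typing import Callable, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: each word becomes its length."""
    return [len(word) for word in text.split()]


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int,
    eos_token_id: int = 0,  # assumed separator id between documents
) -> Iterator[List[int]]:
    """Yield lists of exactly `seq_length` token ids from a stream of texts."""
    buffer: List[int] = []
    for text in texts:
        # Append the document's tokens followed by a separator token.
        buffer.extend(tokenize(text))
        buffer.append(eos_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Assumption: any leftover tokens shorter than seq_length are dropped.


chunks = list(
    constant_length_chunks(["a bb ccc", "dddd ee", "f"], toy_tokenize, seq_length=4)
)
```

Every element of `chunks` has length 4, and a chunk may span the boundary between two source texts, which is why a separator token is inserted between them.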
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
self,
prompt_tokenizer,
dataset,
process_count=None,
keep_in_memory=False,
**kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.

Args:

| Name | Type | Description |
|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. |
| dataset | dataset.Dataset | Dataset with text files. |
| process_count | int | Number of processes to use for tokenizing. |
| keep_in_memory | bool | Whether to keep the tokenized dataset in memory. |
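
The sketch below illustrates the idea of applying a prompt tokenizing strategy to each record of a dataset. It is only an illustration under stated assumptions: `ToyPromptStrategy`, its `tokenize_prompt` method, the record fields `instruction` and `output`, and `tokenize_records` are all hypothetical stand-ins, and plain Python lists are used in place of a real dataset object. Parallel tokenization (what `process_count` controls) is omitted here.

```python
from typing import Dict, List


class ToyPromptStrategy:
    """Hypothetical stand-in for a prompt tokenizing strategy."""

    def __init__(self, vocab: Dict[str, int]):
        self.vocab = vocab

    def tokenize_prompt(self, record: Dict[str, str]) -> Dict[str, List[int]]:
        # Build one prompt string from the record, then map words to ids.
        prompt = f"{record['instruction']} {record['output']}"
        input_ids = [self.vocab.get(word, 0) for word in prompt.split()]  # 0 = unknown
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": list(input_ids),
        }


def tokenize_records(
    strategy: ToyPromptStrategy, records: List[Dict[str, str]]
) -> List[Dict[str, List[int]]]:
    """Apply the strategy to every record, one at a time (single process)."""
    return [strategy.tokenize_prompt(record) for record in records]


vocab = {"add": 1, "two": 2, "numbers": 3, "ok": 4}
records = [
    {"instruction": "add two numbers", "output": "ok"},
    {"instruction": "subtract", "output": "ok"},
]
tokenized = tokenize_records(ToyPromptStrategy(vocab), records)
```

Each entry of `tokenized` is a dictionary of token id lists, one per input record, which is the shape a training loop would typically consume.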