datasets
Module containing Dataset functionality
Classes
| Name | Description |
|---|---|
| ConstantLengthDataset | Iterable dataset that returns constant length chunks of tokens from a stream of text files. |
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
ConstantLengthDataset
datasets.ConstantLengthDataset(tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant length chunks of tokens from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | The processor used for processing the data. | required |
| datasets | | Datasets with text files. | required |
| seq_length | int | Length of token sequences to return. | 2048 |
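The chunking behavior described above can be sketched in plain Python. This is a minimal illustration of the technique, not the library's implementation: it assumes already-tokenized sequences and omits the tokenization, shuffling, and EOS handling the real class performs.

```python
from typing import Iterable, Iterator, List

def constant_length_chunks(
    token_streams: Iterable[List[int]], seq_length: int = 8
) -> Iterator[List[int]]:
    """Concatenate token sequences and yield fixed-size chunks.

    Illustrative sketch only: tokens from consecutive examples are
    packed together so every yielded chunk has exactly seq_length ids.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        # Drain the buffer whenever a full chunk is available.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any remaining tokens shorter than seq_length are dropped.

# Three "examples" of 5, 7, and 4 tokens pack into two chunks of 8.
chunks = list(constant_length_chunks([[1] * 5, [2] * 7, [3] * 4], seq_length=8))
```

Packing examples this way avoids padding waste: every training sequence is exactly `seq_length` tokens, at the cost of occasionally splitting an example across chunk boundaries.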
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
prompt_tokenizer,
dataset,
process_count=None,
keep_in_memory=False,
**kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int | None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool | None | Whether to keep the tokenized dataset in memory. | False |
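The role of the `prompt_tokenizer` strategy can be sketched as follows. This is a hedged, self-contained illustration: `ToyPromptTokenizingStrategy` and `tokenize_dataset` are hypothetical stand-ins for a real `PromptTokenizingStrategy` and the dataset's internal `map` call, which in practice can fan work out across `process_count` processes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ToyPromptTokenizingStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy.

    Maps each character of the prompt text to an integer id via a
    tiny vocabulary; unknown characters map to 0.
    """
    vocab: Dict[str, int]

    def tokenize_prompt(self, prompt: Dict[str, str]) -> Dict[str, List[int]]:
        ids = [self.vocab.get(ch, 0) for ch in prompt["text"]]
        # Causal-LM style: labels mirror the input ids.
        return {"input_ids": ids, "labels": list(ids)}

def tokenize_dataset(strategy, rows):
    # Sequential stand-in for the dataset's map step; the real class
    # parallelizes this across process_count worker processes.
    return [strategy.tokenize_prompt(row) for row in rows]

strategy = ToyPromptTokenizingStrategy(vocab={"a": 1, "b": 2})
out = tokenize_dataset(strategy, [{"text": "ab"}, {"text": "ba"}])
```

Separating the tokenizing strategy from the dataset keeps prompt formats pluggable: the same dataset machinery can serve any strategy that exposes a `tokenize_prompt`-style method.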