datasets
Module containing Dataset functionality
Classes
| Name | Description |
|---|---|
| ConstantLengthDataset | Iterable dataset that returns constant length chunks of tokens from a stream of text files. |
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
ConstantLengthDataset
datasets.ConstantLengthDataset(tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant length chunks of tokens from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | The processor used for processing the data. | required |
| datasets | | Datasets with text files. | required |
| seq_length | int | Length of token sequences to return. | 2048 |
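The chunking behavior described above can be sketched in plain Python. This is a minimal illustration of the technique, not the library's implementation: it assumes already-tokenized sequences and omits the tokenization, shuffling, and EOS handling the real class performs.

```python
from typing import Iterable, Iterator, List

def constant_length_chunks(
    token_streams: Iterable[List[int]], seq_length: int = 8
) -> Iterator[List[int]]:
    """Concatenate token sequences and yield fixed-size chunks.

    Illustrative sketch only: tokens from consecutive examples are
    packed together so every yielded chunk has exactly seq_length ids.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        # Drain the buffer whenever a full chunk is available.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any remaining tokens shorter than seq_length are dropped.

# Three "examples" of 5, 7, and 4 tokens pack into two chunks of 8.
chunks = list(constant_length_chunks([[1] * 5, [2] * 7, [3] * 4], seq_length=8))
```

Packing examples this way avoids padding waste: every training sequence is exactly `seq_length` tokens, at the cost of occasionally splitting an example across chunk boundaries.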
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
prompt_tokenizer,
dataset,
process_count=None,
keep_in_memory=False,
**kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int | None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool | None | Whether to keep the tokenized dataset in memory. | False |
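The role of the `prompt_tokenizer` strategy can be sketched as follows. This is a hedged, self-contained illustration: `ToyPromptTokenizingStrategy` and `tokenize_dataset` are hypothetical stand-ins for a real `PromptTokenizingStrategy` and the dataset's internal `map` call, which in practice can fan work out across `process_count` processes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ToyPromptTokenizingStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy.

    Maps each character of the prompt text to an integer id via a
    tiny vocabulary; unknown characters map to 0.
    """
    vocab: Dict[str, int]

    def tokenize_prompt(self, prompt: Dict[str, str]) -> Dict[str, List[int]]:
        ids = [self.vocab.get(ch, 0) for ch in prompt["text"]]
        # Causal-LM style: labels mirror the input ids.
        return {"input_ids": ids, "labels": list(ids)}

def tokenize_dataset(strategy, rows):
    # Sequential stand-in for the dataset's map step; the real class
    # parallelizes this across process_count worker processes.
    return [strategy.tokenize_prompt(row) for row in rows]

strategy = ToyPromptTokenizingStrategy(vocab={"a": 1, "b": 2})
out = tokenize_dataset(strategy, [{"text": "ab"}, {"text": "ba"}])
```

Separating the tokenizing strategy from the dataset keeps prompt formats pluggable: the same dataset machinery can serve any strategy that exposes a `tokenize_prompt`-style method.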