quartodoc integration

Dan Saunders
2025-03-14 16:16:07 +00:00
committed by Dan Saunders
parent c907ac173e
commit e4fd7aad0b
18 changed files with 1005 additions and 1 deletion

docs/api/datasets.qmd Normal file

@@ -0,0 +1,44 @@
# datasets { #axolotl.datasets }
`datasets`
Module containing Dataset functionality
## Classes
| Name | Description |
| --- | --- |
| [ConstantLengthDataset](#axolotl.datasets.ConstantLengthDataset) | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
| [TokenizedPromptDataset](#axolotl.datasets.TokenizedPromptDataset) | Dataset that returns tokenized prompts from a stream of text files. |
### ConstantLengthDataset { #axolotl.datasets.ConstantLengthDataset }
```python
datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)
```
Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
Args:
tokenizer (Tokenizer): The tokenizer used to process the data.
datasets (dataset.Dataset): Datasets with text files.
seq_length (int): Length of the token sequences to return.
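A minimal usage sketch may help; it is not part of this page, and both the toy data and the assumption that source examples already carry `input_ids`, `attention_mask`, and `labels` columns are illustrative:
```python
from datasets import Dataset
from transformers import AutoTokenizer

from axolotl.datasets import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy pre-tokenized source; a real pipeline builds this upstream. The column
# set here is an assumption about what the packing logic reads.
ids = tokenizer("hello world")["input_ids"]
source = Dataset.from_dict(
    {
        "input_ids": [ids] * 8,
        "attention_mask": [[1] * len(ids)] * 8,
        "labels": [ids] * 8,
    }
)

# Packs the incoming token stream into fixed-size chunks of seq_length tokens.
packed = ConstantLengthDataset(tokenizer, [source], seq_length=2048)

for example in packed:  # iterable, not indexable: chunks arrive one at a time
    break
```
Because the dataset is iterable rather than random-access, it suits streaming training loops that draw uniform-length examples on demand.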
### TokenizedPromptDataset { #axolotl.datasets.TokenizedPromptDataset }
```python
datasets.TokenizedPromptDataset(
self,
prompt_tokenizer,
dataset,
process_count=None,
keep_in_memory=False,
**kwargs,
)
```
Dataset that returns tokenized prompts from a stream of text files.
Args:
prompt_tokenizer (PromptTokenizingStrategy): The prompt-tokenizing strategy used to process the data.
dataset (dataset.Dataset): Dataset with text files.
process_count (int): Number of processes to use for tokenizing.
keep_in_memory (bool): Whether to keep the tokenized dataset in memory.
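A sketch of typical usage follows; the strategy and prompter classes, and the constructor arguments shown for them, live elsewhere in the codebase and are assumptions here rather than something this page documents:
```python
from datasets import Dataset
from transformers import AutoTokenizer

from axolotl.datasets import TokenizedPromptDataset
# Assumed imports: any PromptTokenizingStrategy implementation would do here.
from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
from axolotl.prompters import AlpacaPrompter

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The positional arguments (prompter, tokenizer, train_on_inputs,
# sequence_len) are an assumption about the Alpaca strategy's constructor.
strategy = AlpacaPromptTokenizingStrategy(AlpacaPrompter(), tokenizer, False, 2048)

# Toy instruction-tuning rows in the Alpaca column layout.
raw = Dataset.from_dict(
    {"instruction": ["Say hi."], "input": [""], "output": ["Hi!"]}
)

tokenized = TokenizedPromptDataset(
    strategy,
    raw,
    process_count=1,       # parallel tokenization workers
    keep_in_memory=False,  # presumably forwarded to the underlying map call
)
```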