quartodoc integration
committed by Dan Saunders
parent c907ac173e
commit e4fd7aad0b
44 docs/api/datasets.qmd Normal file
@@ -0,0 +1,44 @@
# datasets { #axolotl.datasets }
`datasets`
Module containing Dataset functionality
## Classes
| Name | Description |
| --- | --- |
| [ConstantLengthDataset](#axolotl.datasets.ConstantLengthDataset) | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
| [TokenizedPromptDataset](#axolotl.datasets.TokenizedPromptDataset) | Dataset that returns tokenized prompts from a stream of text files. |
### ConstantLengthDataset { #axolotl.datasets.ConstantLengthDataset }
```python
datasets.ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)
```
Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
Args:

- `tokenizer` (Tokenizer): The processor used to process the data.
- `datasets` (dataset.Dataset): Datasets with text files.
- `seq_length` (int): Length of the token sequences to return.
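The core idea behind a constant-length dataset is packing a stream of tokenized examples into fixed-size chunks. A minimal, self-contained sketch of that packing logic (the `pack` helper and the toy token lists are illustrative, not part of the axolotl API):

```python
def pack(token_streams, seq_length):
    """Concatenate token streams and yield fixed-length chunks.

    Tokens left over after the last full chunk are dropped,
    mirroring the constant-length behaviour described above.
    """
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        # Emit as many full-length chunks as the buffer allows.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]


# Toy "tokenized" examples of varying length:
streams = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack(streams, seq_length=4))
# chunks == [[1, 2, 3, 4], [5, 6, 7, 8]]; the trailing token 9 is dropped
```

The real class additionally handles tokenization and attention masks; this sketch only shows why every yielded chunk has exactly `seq_length` tokens.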
### TokenizedPromptDataset { #axolotl.datasets.TokenizedPromptDataset }
```python
datasets.TokenizedPromptDataset(
    self,
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)
```
Dataset that returns tokenized prompts from a stream of text files.
Args:

- `prompt_tokenizer` (PromptTokenizingStrategy): The prompt-tokenizing strategy used to process the data.
- `dataset` (dataset.Dataset): Dataset with text files.
- `process_count` (int): Number of processes to use for tokenizing.
- `keep_in_memory` (bool): Whether to keep the tokenized dataset in memory.
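Conceptually, this class applies a prompt-tokenizing strategy to every example in a dataset. A hedged sketch of that pattern, with plain Python in place of the Hugging Face `datasets` machinery (`ToyPromptStrategy` and `tokenize_dataset` are hypothetical names, not part of axolotl):

```python
class ToyPromptStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy:
    turns a prompt string into a dict of token ids (here, char codes)."""

    def tokenize_prompt(self, example):
        ids = [ord(c) for c in example["text"]]
        return {"input_ids": ids, "labels": ids}


def tokenize_dataset(strategy, dataset):
    # Apply the strategy to every example, analogous to mapping
    # a tokenizing function over a datasets.Dataset.
    return [strategy.tokenize_prompt(example) for example in dataset]


data = [{"text": "hi"}, {"text": "ok"}]
tokenized = tokenize_dataset(ToyPromptStrategy(), data)
# tokenized[0]["input_ids"] == [104, 105]
```

In the real class, `process_count` and `keep_in_memory` would be forwarded to the underlying dataset-mapping call to control parallelism and caching; the sketch only shows the per-example strategy application.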