---
title: Pre-training
description: Data format for a pre-training completion task.
order: 1
---

Pre-training data has no prompt template and no roles. The only required field is `text`:

```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```

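A file in this format can be produced with a few lines of Python. The sketch below is illustrative only: the helper name `write_pretraining_jsonl` and the sample strings are assumptions for the example, not part of Axolotl.

```python
import json
from pathlib import Path


def write_pretraining_jsonl(texts, path):
    """Write one JSON object per line, each with a single `text` field."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    out = write_pretraining_jsonl(["first row", "second row"], "data.jsonl")
    print(out.read_text(encoding="utf-8"), end="")
```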

:::{.callout-note}

### Streaming is recommended for large datasets

Axolotl usually loads the entire dataset into memory, which can be a problem for large datasets. Use the following config to enable streaming:

```{.yaml filename="config.yaml"}
pretraining_dataset:
  - name:
    path:
    split:
    text_column: # column in dataset with the data, usually `text`
    type: pretrain
    trust_remote_code:
    skip: # number of rows of data to skip over from the beginning
```

:::
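To make the streaming idea concrete, the sketch below reads a JSONL file lazily, one row at a time, and skips a number of leading rows, which is roughly what the `skip` option describes. It is a conceptual illustration and not Axolotl's implementation; `stream_text_rows` and its parameters are hypothetical names chosen for this example.

```python
import json
import os
import tempfile
from itertools import islice


def stream_text_rows(path, text_column="text", skip=0):
    """Lazily yield the text column of each JSONL row, skipping the first `skip` rows."""
    with open(path, encoding="utf-8") as f:
        # Drop blank lines so that `skip` counts actual data rows
        rows = (line for line in f if line.strip())
        # islice consumes the rows lazily, so the file is never fully held in memory
        for line in islice(rows, skip, None):
            yield json.loads(line)[text_column]


if __name__ == "__main__":
    # Build a small throwaway file so the demo is self-contained
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write('{"text": "first row"}\n{"text": "second row"}\n')
    print(list(stream_text_rows(tmp.name, skip=1)))  # prints ['second row']
    os.remove(tmp.name)
```

Because the function is a generator, rows are parsed only as they are consumed, so memory use stays flat regardless of file size.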