* feat: update doc contents * chore: move batch vs ga docs * feat: update lambdalabs instructions * fix: refactor dev instructions
27 lines
576 B
Plaintext
27 lines
576 B
Plaintext
---
|
|
title: Pre-training
|
|
description: Data format for a pre-training completion task.
|
|
order: 1
|
|
---
|
|
|
|
For pretraining, there is no prompt template or roles. The only required field is `text`:
|
|
|
|
```{.json filename="data.jsonl"}
|
|
{"text": "first row"}
|
|
{"text": "second row"}
|
|
...
|
|
```
|
|
|
|
:::{.callout-note}
|
|
|
|
### Streaming is recommended for large datasets
|
|
|
|
Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
|
|
|
|
```{.yaml filename="config.yaml"}
|
|
pretraining_dataset: # hf path only
|
|
...
|
|
```
|
|
|
|
:::
|