Fix(doc): clarify data loading for local datasets and splitting samples (#2726) [skip ci]
* fix(doc): remove incorrect json dataset loading method
* fix(doc): clarify splitting only happens in completion mode
* fix: update local file loading on config doc
* fix: typo
@@ -36,10 +36,6 @@ It is typically recommended to save your dataset as `.jsonl` due to its flexibil
 
 Axolotl supports loading from a Hugging Face hub repo or from local files.
 
-::: {.callout-important}
-For pre-training only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts.
-:::
-
 ### Pre-training from Hugging Face hub datasets
 
 As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:
@@ -77,18 +73,21 @@ datasets:
     type: completion
 ```
 
-From local files (either example works):
+From local files:
 
 ```yaml
 datasets:
   - path: A.jsonl
     type: completion
 
-  - path: json
-    data_files: ["A.jsonl", "B.jsonl", "C.jsonl"]
+  - path: B.jsonl
     type: completion
 ```
 
+::: {.callout-important}
+For `completion` only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts. If you are interested in having this for `pretraining_dataset` too, please let us know or help make a PR!
+:::
+
 ### Pre-training dataset configuration tips
 
 #### Setting max_steps
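The callout in this change says that, for `completion` datasets, a text longer than the context length is split into multiple smaller prompts. A minimal Python sketch of that idea follows. It is not Axolotl's actual implementation: the helper name `split_to_context` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
from typing import List


def split_to_context(tokens: List[str], context_len: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most context_len tokens.

    Hypothetical helper for illustration only; not part of Axolotl.
    """
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    # Step through the sequence in strides of context_len; the last chunk may be shorter.
    return [tokens[i:i + context_len] for i in range(0, len(tokens), context_len)]


# Toy input: whitespace "tokens" stand in for real tokenizer output (assumption).
text = "one two three four five six seven"
chunks = split_to_context(text.split(), context_len=3)
print(chunks)
```

Each chunk would then be treated as its own training prompt. A real pipeline would split on tokenizer IDs and might handle special tokens or overlap differently.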