Fix(doc): clarify data loading for local datasets and splitting samples (#2726) [skip ci]
* fix(doc): remove incorrect json dataset loading method
* fix(doc): clarify splitting only happens in completion mode
* fix: update local file loading on config doc
* fix: typo
@@ -98,8 +98,10 @@ plugins:
|
|||||||
# - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
|
# - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
|
||||||
|
|
||||||
# A list of one or more datasets to finetune the model with
|
# A list of one or more datasets to finetune the model with
|
||||||
|
# See https://docs.axolotl.ai/docs/dataset_loading.html for guide on loading datasets
|
||||||
|
# See https://docs.axolotl.ai/docs/dataset-formats/ for guide on dataset formats
|
||||||
datasets:
|
datasets:
|
||||||
# HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
|
# HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
|
||||||
- path: vicgalle/alpaca-gpt4
|
- path: vicgalle/alpaca-gpt4
|
||||||
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
|
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
|
||||||
type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
|
type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
|
||||||
@@ -221,7 +223,7 @@ datasets:
|
|||||||
# The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
|
# The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
|
||||||
shuffle_merged_datasets: true
|
shuffle_merged_datasets: true
|
||||||
|
|
||||||
Deduplicates datasets and test_datasets with identical entries.
|
# Deduplicates datasets and test_datasets with identical entries.
|
||||||
dataset_exact_deduplication: true
|
dataset_exact_deduplication: true
|
||||||
|
|
||||||
# A list of one or more datasets to eval the model with.
|
# A list of one or more datasets to eval the model with.
|
||||||
|
|||||||
@@ -36,10 +36,6 @@ It is typically recommended to save your dataset as `.jsonl` due to its flexibil
|
|||||||
|
|
||||||
Axolotl supports loading from a Hugging Face hub repo or from local files.
|
Axolotl supports loading from a Hugging Face hub repo or from local files.
|
||||||
|
|
||||||
::: {.callout-important}
|
|
||||||
For pre-training only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts.
|
|
||||||
:::
|
|
||||||
|
|
||||||
### Pre-training from Hugging Face hub datasets
|
### Pre-training from Hugging Face hub datasets
|
||||||
|
|
||||||
As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:
|
As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:
|
||||||
@@ -77,18 +73,21 @@ datasets:
|
|||||||
type: completion
|
type: completion
|
||||||
```
|
```
|
||||||
|
|
||||||
From local files (either example works):
|
From local files:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
datasets:
|
datasets:
|
||||||
- path: A.jsonl
|
- path: A.jsonl
|
||||||
type: completion
|
type: completion
|
||||||
|
|
||||||
- path: json
|
- path: B.jsonl
|
||||||
data_files: ["A.jsonl", "B.jsonl", "C.jsonl"]
|
|
||||||
type: completion
|
type: completion
|
||||||
```
|
```
|
||||||
|
|
||||||
|
::: {.callout-important}
|
||||||
|
For `completion` only, Axolotl would split texts if it exceeds the context length into multiple smaller prompts. If you are interested in having this for `pretraining_dataset` too, please let us know or help make a PR!
|
||||||
|
:::
|
||||||
|
|
||||||
### Pre-training dataset configuration tips
|
### Pre-training dataset configuration tips
|
||||||
|
|
||||||
#### Setting max_steps
|
#### Setting max_steps
|
||||||
|
|||||||
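The callout in this hunk says that, for `completion` datasets, a text longer than the context length is split into multiple smaller prompts. The snippet below is a minimal sketch of that idea, not Axolotl's implementation: the function name `split_into_chunks` and the `sequence_len` parameter are hypothetical, and it assumes the sample is already tokenized into a flat list of token ids.

```python
from typing import List


def split_into_chunks(token_ids: List[int], sequence_len: int) -> List[List[int]]:
    """Split one tokenized sample into consecutive chunks of at most
    `sequence_len` tokens. Hypothetical helper for illustration only."""
    if sequence_len <= 0:
        raise ValueError("sequence_len must be positive")
    return [
        token_ids[start:start + sequence_len]
        for start in range(0, len(token_ids), sequence_len)
    ]


# A 10-token sample with a context length of 4 becomes three prompts.
tokens = list(range(10))
print(split_into_chunks(tokens, sequence_len=4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A sample that already fits within the context length comes back as a single chunk, so under this sketch short texts are unaffected.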
@@ -54,7 +54,7 @@ datasets:
|
|||||||
|
|
||||||
#### Files
|
#### Files
|
||||||
|
|
||||||
Usually, to load a JSON file, you would do something like this:
|
To load a JSON file, you would do something like this:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from datasets import load_dataset
|
from datasets import load_dataset
|
||||||
@@ -66,20 +66,12 @@ Which translates to the following config:
|
|||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
datasets:
|
datasets:
|
||||||
- path: json
|
- path: data.json
|
||||||
data_files: /path/to/your/file.jsonl
|
|
||||||
```
|
|
||||||
|
|
||||||
However, to make things easier, we have added a few shortcuts for loading local dataset files.
|
|
||||||
|
|
||||||
You can just point the `path` to the file or directory along with the `ds_type` to load the dataset. The below example shows for a JSON file:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
datasets:
|
|
||||||
- path: /path/to/your/file.jsonl
|
|
||||||
ds_type: json
|
ds_type: json
|
||||||
```
|
```
|
||||||
|
|
||||||
|
In the example above, it can be seen that we can just point the `path` to the file or directory along with the `ds_type` to load the dataset.
|
||||||
|
|
||||||
This works for CSV, JSON, Parquet, and Arrow files.
|
This works for CSV, JSON, Parquet, and Arrow files.
|
||||||
|
|
||||||
::: {.callout-tip}
|
::: {.callout-tip}
|
||||||
|
|||||||
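The updated doc text says that pointing `path` at a local file together with a `ds_type` works for CSV, JSON, Parquet, and Arrow files. As a rough illustration of how a file extension could map to those `ds_type` values, here is a small sketch. The mapping table and the `guess_ds_type` helper are assumptions made for this example and are not taken from Axolotl's code.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-ds_type table covering the four formats the doc names.
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def guess_ds_type(path: str) -> Optional[str]:
    """Return a ds_type for a local file path, or None if the extension is unknown."""
    return EXTENSION_TO_DS_TYPE.get(Path(path).suffix.lower())


print(guess_ds_type("/path/to/your/file.jsonl"))  # → json
print(guess_ds_type("notes.txt"))                 # → None
```

Setting `ds_type` explicitly in the config, as the diff does, avoids relying on any such guess.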