Fix(doc): clarify data loading for local datasets and splitting samples (#2726) [skip ci]

* fix(doc): remove incorrect json dataset loading method

* fix(doc): clarify splitting only happens in completion mode

* fix: update local file loading on config doc

* fix: typo
NanoCode012
2025-05-28 15:48:22 +07:00
committed by GitHub
parent 4a8af60d34
commit 3e6948be97
3 changed files with 14 additions and 21 deletions


@@ -54,7 +54,7 @@ datasets:
 #### Files
-Usually, to load a JSON file, you would do something like this:
+To load a JSON file, you would do something like this:
 ```python
 from datasets import load_dataset
@@ -66,20 +66,12 @@ Which translates to the following config:
 ```yaml
 datasets:
-  - path: json
-    data_files: /path/to/your/file.jsonl
-```
-However, to make things easier, we have added a few shortcuts for loading local dataset files.
-You can just point the `path` to the file or directory along with the `ds_type` to load the dataset. The below example shows for a JSON file:
-```yaml
-datasets:
-  - path: /path/to/your/file.jsonl
+  - path: data.json
     ds_type: json
 ```
+In the example above, it can be seen that we can just point the `path` to the file or directory along with the `ds_type` to load the dataset.
 This works for CSV, JSON, Parquet, and Arrow files.
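As a rough illustration of how a `path` plus `ds_type` pair could map onto a Hugging Face `load_dataset` call, here is a minimal sketch. The helper `build_load_kwargs` and the `SUFFIX_TO_DS_TYPE` table are hypothetical names made up for this example, and guessing `ds_type` from the file suffix is an assumption of the sketch, not something this doc describes. It is not Axolotl's implementation.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table for illustration only; this is not Axolotl's code.
# The values are builder names accepted by datasets.load_dataset.
SUFFIX_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def build_load_kwargs(path: str, ds_type: Optional[str] = None) -> dict:
    """Return keyword arguments for datasets.load_dataset for a local file.

    If ds_type is omitted, it is guessed from the file suffix
    (an assumption of this sketch).
    """
    if ds_type is None:
        suffix = Path(path).suffix.lower()
        if suffix not in SUFFIX_TO_DS_TYPE:
            raise ValueError(f"Cannot infer ds_type from {path!r}")
        ds_type = SUFFIX_TO_DS_TYPE[suffix]
    # The builder name goes in `path`; the file location goes in `data_files`.
    return {"path": ds_type, "data_files": path}


kwargs = build_load_kwargs("data.json", ds_type="json")
print(kwargs)
# With the `datasets` package installed, the call would then be:
#   from datasets import load_dataset
#   dataset = load_dataset(**kwargs)
```

The sketch only builds the arguments, so it runs without the `datasets` package installed; the commented lines show where the actual load would happen.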
::: {.callout-tip}