---
title: Dataset Loading
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
---

## Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

## Loading Datasets
We use the `datasets` library to load datasets, via a mix of `load_dataset` and `load_from_disk`.

You may recognize the similarly named configs shared between `load_dataset` and the `datasets` section of the config file.
```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

::: {.callout-tip}

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common config to use is `path`, sometimes together with `data_files`.

:::

This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.

For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

For full details on the config, see [config.qmd](config.qmd).

::: {.callout-note}

You can set multiple datasets in the config file by adding more than one entry under `datasets`.

```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

:::

### Local dataset

#### Files

To load a JSON file, you would do something like this:
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```
Which translates to the following config:
```yaml
datasets:
  - path: data.json
    ds_type: json
```
In the example above, we simply point `path` to the file or directory and set `ds_type` to load the dataset.

This works for CSV, JSON, Parquet, and Arrow files.

::: {.callout-tip}

If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.

:::

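For instance, a minimal sketch of a config that relies on this inference (the file name here is only a placeholder) could look like:

```yaml
datasets:
  # no ds_type needed: the .parquet extension is used to infer the type
  - path: /path/to/your/data.parquet
```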
#### Directory

If you're loading a directory, point `path` to the directory.

Then, you have two options:

##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

- datasets saved with `datasets.save_to_disk`
- the entire directory of files (such as parquet/arrow files)

```yaml
datasets:
  - path: /path/to/your/directory
```
##### Loading specific files in directory

Provide `data_files` with a list of files to load.
```yaml
datasets:
  # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

  # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

  # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet
```
### HuggingFace Hub

The method you use to load the dataset depends on how the dataset was created: whether a folder was uploaded directly, or a HuggingFace Dataset was pushed.

::: {.callout-note}

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file.

:::

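For example, a sketch of a config for a private Hub dataset (the dataset name is a placeholder) could look like:

```yaml
# root-level flag to authenticate against the Hub
hf_use_auth_token: true

datasets:
  - path: org/private-dataset-name
```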
#### Folder uploaded

This means the dataset consists of one or more files uploaded directly to the Hub.
```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```
#### HuggingFace Dataset

This means the dataset was created as a HuggingFace Dataset and pushed to the Hub via `datasets.push_to_hub`.
```yaml
datasets:
  - path: org/dataset-name
```

::: {.callout-note}

Some other configs may be required, such as `name`, `split`, `revision`, or `trust_remote_code`, depending on the dataset.

:::

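As a rough sketch, such an entry could look like the following; the values shown are placeholders and depend entirely on the dataset you are loading:

```yaml
datasets:
  - path: org/dataset-name
    name: default     # dataset configuration name, if the dataset defines several
    split: train      # which split to load
    revision: main    # a specific branch, tag, or commit on the Hub
```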
### Remote Filesystems

Via the `storage_options` config under `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

::: {.callout-warning}

This is currently experimental. Please let us know if you run into any issues!

:::

The only difference between the providers is that you need to prepend the path with the respective protocol.
```yaml
datasets:
  # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

  # Directory
  - path: s3://bucket-name/path/to/your/directory
```
For directories, we load via `load_from_disk`.
#### S3

Prepend the path with `s3://`.

The credentials are pulled in the following order:

- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- from the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

::: {.callout-note}

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

:::

Other environment variables that can be set can be found in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).
#### GCS

Prepend the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the google metadata service
- anonymous access

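A config entry for GCS then only differs in the path prefix; for example (bucket and file names are placeholders):

```yaml
datasets:
  - path: gs://bucket-name/path/to/your/file.jsonl
```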
#### Azure

##### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`

##### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set can be found in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).

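As with the other providers, only the path prefix changes; a minimal sketch, with placeholder store/container names:

```yaml
datasets:
  # Gen 1
  - path: adl://store-name/path/to/your/file.jsonl

  # Gen 2
  - path: abfs://container-name/path/to/your/file.jsonl
```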
#### OCI

Prepend the path with `oci://`.

We attempt to read credentials in the following order:

- `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).

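A config entry for OCI again only changes the prefix; a sketch with placeholder names (ocifs typically addresses objects as `oci://bucket@namespace/...`):

```yaml
datasets:
  - path: oci://bucket-name@namespace/path/to/your/file.jsonl
```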
### HTTPS

The path should start with `https://`.
```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```
This must be publicly accessible.

## Next steps

Now that you know how to load datasets, you can learn how to load your specific dataset format into your target output format in the [dataset formats docs](dataset-formats).