# Dataset Loading

## Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.
## Loading Datasets

We use the `datasets` library to load datasets, with a mix of `load_dataset` and `load_from_disk`.

You may recognize the similarly named configs between `load_dataset` and the `datasets` section of the config file.
```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common config to use is `path`, sometimes along with `data_files`.

This matches the API of `datasets.load_dataset`, so if you're familiar with that, you will feel right at home.

For HuggingFace's guide to loading different dataset types, see here.

For full details on the config, see `config.qmd`.
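To make the mapping concrete, here is a rough sketch of how a config entry could translate into keyword arguments for `datasets.load_dataset`. This is illustrative only: the key names come from the config above, but the filtering logic is a simplified assumption, not the actual loader code.

```python
# A config entry with only the commonly used keys filled in
# ("org/dataset-name" is a placeholder).
config_entry = {
    "path": "org/dataset-name",
    "name": None,
    "data_files": None,
    "split": "train",
    "revision": None,
    "trust_remote_code": None,
}

# Drop unset keys so load_dataset can fall back to its own defaults.
kwargs = {k: v for k, v in config_entry.items() if v is not None and k != "path"}

# The call would then look like:
#   datasets.load_dataset(config_entry["path"], **kwargs)
print(kwargs)  # {'split': 'train'}
```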
You can set multiple datasets in the config file by adding more than one entry under `datasets`.
```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

### Local dataset

#### Files

Usually, to load a JSON file, you would do something like this:
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

Which translates to the following config:
```yaml
datasets:
  - path: json
    data_files: /path/to/your/file.jsonl
```

However, to make things easier, we have added a few shortcuts for loading local dataset files.

You can just point `path` to the file or directory, along with `ds_type`, to load the dataset. The example below shows this for a JSON file:
```yaml
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
```

This works for CSV, JSON, Parquet, and Arrow files.

If `path` points to a file and `ds_type` is not specified, we automatically infer the dataset type from the file extension, so you can omit `ds_type` if you'd like.
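The inference step could look something like this sketch. It is a hypothetical illustration of extension-based inference; the mapping table and the helper name `infer_ds_type` are assumptions, not the real implementation.

```python
from pathlib import Path

# Hypothetical mapping from file extension to dataset type.
EXT_TO_TYPE = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_ds_type(path):
    """Return the inferred dataset type, or None for unknown extensions."""
    return EXT_TO_TYPE.get(Path(path).suffix.lower())

print(infer_ds_type("/path/to/your/file.jsonl"))   # json
print(infer_ds_type("/path/to/your/file.parquet")) # parquet
```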
#### Directory

If you're loading a directory, you can point the path to the directory.

Then, you have two options:
##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

- datasets saved with `datasets.save_to_disk`
- an entire directory of files (such as parquet/arrow files)
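The fallback order above could be sketched roughly like this. This is a simplified illustration: the marker files checked here are what `datasets.save_to_disk` typically writes, but the real detection logic may differ.

```python
import os

def pick_loader(directory):
    """Decide how to load a local dataset directory (illustrative sketch)."""
    # datasets.save_to_disk writes metadata files such as these.
    markers = ("dataset_info.json", "state.json", "dataset_dict.json")
    if any(os.path.exists(os.path.join(directory, m)) for m in markers):
        return "load_from_disk"  # saved with datasets.save_to_disk
    return "load_dataset"        # plain directory of parquet/arrow/etc. files
```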
```yaml
datasets:
  - path: /path/to/your/directory
```

##### Loading specific files in directory

Provide `data_files` with a list of files to load.
```yaml
datasets:
  # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

  # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

  # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet
```

### HuggingFace Hub
The method you use to load the dataset depends on how it was created: whether a folder was uploaded directly, or a HuggingFace Dataset was pushed.

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file.
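For example (a sketch; the dataset name is a placeholder, and note that `hf_use_auth_token` sits at the root level of the config, not under `datasets`):

```yaml
hf_use_auth_token: true

datasets:
  - path: org/private-dataset-name
```

You will also need to be authenticated with HuggingFace, e.g. via `huggingface-cli login` or the `HF_TOKEN` environment variable.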
#### Folder uploaded

This means that the dataset is a single file or multiple files uploaded directly to the Hub.
```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```

#### HuggingFace Dataset
This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via `datasets.Dataset.push_to_hub`.
```yaml
datasets:
  - path: org/dataset-name
```

Depending on the dataset, some other configs may be required, such as `name`, `split`, `revision`, or `trust_remote_code`.
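For instance, a fuller entry might look like this (all values are placeholders for illustration):

```yaml
datasets:
  - path: org/dataset-name
    name: default             # dataset config/subset name
    split: train
    revision: main            # branch, tag, or commit hash
    trust_remote_code: true   # only if the dataset ships custom loading code
```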
### Remote Filesystems

Via the `storage_options` config under `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocol.
```yaml
datasets:
  # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

  # Directory
  - path: s3://bucket-name/path/to/your/directory
```

For directories, we load via `load_from_disk`.
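The same pattern applies to the other providers; only the protocol prefix changes (the bucket and container names below are placeholders):

```yaml
datasets:
  - path: gs://bucket-name/path/to/your/file.jsonl     # GCS
  - path: az://container-name/path/to/your/file.jsonl  # Azure Gen 2
  - path: oci://bucket-name/path/to/your/file.jsonl    # OCI
```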
#### S3

Prepend the path with `s3://`.

The credentials are pulled in the following order:

- the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

Other environment variables that can be set can be found in the boto3 docs.
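For example, the first option above can be satisfied by setting the variables, shown here from Python for illustration. All values are placeholders; in practice these usually come from your shell profile or CI secrets, and real keys should never be hard-coded.

```python
import os

# Placeholder credentials for illustration only; never hard-code real keys.
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
os.environ["AWS_SESSION_TOKEN"] = "your-session-token"  # temporary credentials only
```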
#### GCS

Prepend the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the google metadata service
- anonymous access
#### Azure

##### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`
##### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set can be found in the adlfs docs.
#### OCI

Prepend the path with `oci://`.

Credentials are read in the following order:

- the `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the ocifs docs.
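For example, the environment-variable option might be configured like this, shown from Python for illustration. The values are common defaults for `ocifs` setups, but treat them as placeholders for your own tenancy.

```python
import os

# Placeholder OCI settings for illustration; adjust to your tenancy.
os.environ["OCIFS_IAM_TYPE"] = "api_key"
os.environ["OCIFS_CONFIG_LOCATION"] = os.path.expanduser("~/.oci/config")
os.environ["OCIFS_CONFIG_PROFILE"] = "DEFAULT"
```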
#### HTTPS

The path should start with `https://`.

```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```

The file must be publicly accessible.
## Next steps

Now that you know how to load datasets, see the dataset formats docs to learn how to load your specific dataset format into your target output format.