fix: add links between dataset formats and dataset loading
This commit is contained in:
@@ -13,6 +13,13 @@ As there are a lot of available options in Axolotl, this guide aims to provide a
|
||||
|
||||
Axolotl supports 3 kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has their own dataset format which are described below.
|
||||
|
||||
::: {.callout-tip}
|
||||
|
||||
This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.
|
||||
|
||||
For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
|
||||
:::
|
||||
|
||||
## Pre-training
|
||||
|
||||
When aiming to train on large corpora of text datasets, pre-training is your go-to choice. Due to the size of these datasets, downloading the entire-datasets before beginning training would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) to only load batches into memory at a time.
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
---
|
||||
title: Dataset Loading
|
||||
description: Loading different datasets
|
||||
description: Understanding how to load datasets from different sources
|
||||
back-to-top-navigation: true
|
||||
toc: true
|
||||
toc-depth: 5
|
||||
@@ -8,7 +8,7 @@ toc-depth: 5
|
||||
|
||||
## Overview
|
||||
|
||||
Datasets can be loaded in a number of different ways depending on the format of the data and where it is stored.
|
||||
Datasets can be loaded in a number of different ways depending on the how it is saved (the extension of the file) and where it is stored.
|
||||
|
||||
## Loading Datasets
|
||||
|
||||
@@ -107,7 +107,7 @@ datasets:
|
||||
- path: /path/to/your/directory
|
||||
```
|
||||
|
||||
#### Loading specific files in directory
|
||||
##### Loading specific files in directory
|
||||
|
||||
Provide `data_files` with a list of files to load.
|
||||
|
||||
@@ -270,3 +270,7 @@ datasets:
|
||||
```
|
||||
|
||||
This must be publically accessible.
|
||||
|
||||
## Next steps
|
||||
|
||||
Now that you know how to load datasets, you can learn more on how to load your specific dataset format into your target output format [dataset formats docs](dataset-formats).
|
||||
|
||||
Reference in New Issue
Block a user