fix: add links between dataset formats and dataset loading

NanoCode012
2025-04-04 19:36:14 +07:00
parent 7c890f686e
commit 9b59a53e2d
2 changed files with 14 additions and 3 deletions


@@ -13,6 +13,13 @@ As there are a lot of available options in Axolotl, this guide aims to provide a
Axolotl supports three kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has its own dataset format, described below.
::: {.callout-tip}
This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.
For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
:::
## Pre-training
When training on large text corpora, pre-training is your go-to choice. Because of the size of these datasets, downloading an entire dataset before training begins would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) so that data is loaded into memory in batches rather than all at once.
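
As a rough illustration, a streaming pre-training entry might be configured along these lines. This is only a sketch: the dataset path is a hypothetical placeholder, and `type: pretrain` is an assumption that does not appear in this excerpt, so check the config reference before relying on it.

```yaml
# Sketch only: the path below is a placeholder, not a real dataset.
pretraining_dataset:
  - path: your-org/your-text-corpus  # hypothetical dataset to stream from
    type: pretrain                   # assumed dataset type for raw-text pre-training
```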


@@ -1,6 +1,6 @@
---
title: Dataset Loading
description: Loading different datasets
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
@@ -8,7 +8,7 @@ toc-depth: 5
## Overview
Datasets can be loaded in a number of different ways depending on the format of the data and where it is stored.
Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.
## Loading Datasets
@@ -107,7 +107,7 @@ datasets:
- path: /path/to/your/directory
```
#### Loading specific files in directory
##### Loading specific files in directory
Provide `data_files` with a list of files to load.
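
For instance, a minimal sketch that combines the directory `path` shown above with `data_files` could look like this. The file names are hypothetical placeholders, and whether they resolve relative to `path` is an assumption here.

```yaml
datasets:
  - path: /path/to/your/directory
    data_files:
      # Hypothetical file names; list only the files you want loaded.
      - file1.jsonl
      - file2.jsonl
```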
@@ -270,3 +270,7 @@ datasets:
```
This must be publicly accessible.
## Next steps
Now that you know how to load datasets, you can learn how to map your specific dataset format to your target output format in the [dataset formats docs](dataset-formats).