diff --git a/docs/dataset-formats/index.qmd b/docs/dataset-formats/index.qmd
index 121341e55..a071f1d56 100644
--- a/docs/dataset-formats/index.qmd
+++ b/docs/dataset-formats/index.qmd
@@ -13,6 +13,13 @@ As there are a lot of available options in Axolotl, this guide aims to provide a
 
 Axolotl supports 3 kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has their own dataset format which are described below.
 
+::: {.callout-tip}
+
+This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.
+
+For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
+:::
+
 ## Pre-training
 
 When aiming to train on large corpora of text datasets, pre-training is your go-to choice. Due to the size of these datasets, downloading the entire-datasets before beginning training would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) to only load batches into memory at a time.
diff --git a/docs/dataset_loading.qmd b/docs/dataset_loading.qmd
index d7376b009..09c8b0098 100644
--- a/docs/dataset_loading.qmd
+++ b/docs/dataset_loading.qmd
@@ -1,6 +1,6 @@
 ---
 title: Dataset Loading
-description: Loading different datasets
+description: Understanding how to load datasets from different sources
 back-to-top-navigation: true
 toc: true
 toc-depth: 5
@@ -8,7 +8,7 @@ toc-depth: 5
 
 ## Overview
 
-Datasets can be loaded in a number of different ways depending on the format of the data and where it is stored.
+Datasets can be loaded in a number of different ways depending on the how it is saved (the extension of the file) and where it is stored.
 
 ## Loading Datasets
 
@@ -107,7 +107,7 @@ datasets:
   - path: /path/to/your/directory
 ```
 
-#### Loading specific files in directory
+##### Loading specific files in directory
 
 Provide `data_files` with a list of files to load.
 
@@ -270,3 +270,7 @@ datasets:
 ```
 
 This must be publically accessible.
+
+## Next steps
+
+Now that you know how to load datasets, you can learn more on how to load your specific dataset format into your target output format [dataset formats docs](dataset-formats).
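
Review note: the callout added in `docs/dataset-formats/index.qmd` names JSONL as the guide's running format without showing what such a file looks like. A minimal sketch of the convention (one JSON object per line) using only the Python standard library might help readers. The helper names `write_jsonl` and `read_jsonl`, the file name `sample.jsonl`, and the `text` key are hypothetical and chosen for illustration only; they are not taken from Axolotl or from this diff.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as an independent JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample rows; the "text" key is illustrative only.
sample_rows = [{"text": "first example"}, {"text": "second example"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "sample.jsonl"
    write_jsonl(jsonl_path, sample_rows)
    # Each row occupies exactly one line of the file.
    raw_lines = jsonl_path.read_text(encoding="utf-8").splitlines()
    round_tripped = read_jsonl(jsonl_path)
```

Because every line parses on its own, a file like this can be read incrementally, which fits the streaming behaviour the Pre-training context line describes.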