Fix(doc): address missing doc changes (#2362)
Some checks failed
ci-cd / build-axolotl (<nil>, 124, 12.4.1, 3.11, 2.4.1) (push) Has been cancelled
ci-cd / build-axolotl (<nil>, 124, 12.4.1, 3.11, 2.6.0) (push) Has been cancelled
ci-cd / build-axolotl (vllm, 124, 12.4.1, true, 3.11, 2.5.1) (push) Has been cancelled
publish pypi / Create Release (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 124, 12.4.1, 3.11, 2.4.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 124, 12.4.1, true, 3.11, 2.5.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (<nil>, 124, 12.4.1, 3.11, 2.4.1) (push) Has been cancelled
publish pypi / Upload release to PyPI (push) Has been cancelled

* fix: add multiple tips about eos_token masking

* fix: format dataset preprocessing doc

* Update docs/dataset-formats/conversation.qmd

Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
This commit is contained in:
NanoCode012
2025-02-26 01:50:02 +07:00
committed by GitHub
parent 2efe1b4c09
commit 75cbd15301
4 changed files with 27 additions and 7 deletions

View File

@@ -166,7 +166,7 @@ datasets:
# IMPORTANT: The following fields determine which parts of the conversation to train on. # IMPORTANT: The following fields determine which parts of the conversation to train on.
# Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
# See examples at `docs/dataset-formats/conversation.qmd` # See examples at `docs/dataset-formats/conversation.qmd`
# Note: If the below 4 fields are empty, defaults to training only on the last message. # Note: If the below 4 fields are set to empty, defaults to training only on the last message.
# Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss. # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
roles_to_train: ["assistant"] # default roles_to_train: ["assistant"] # default
@@ -174,6 +174,7 @@ datasets:
# - all: train on all EOS tokens # - all: train on all EOS tokens
# - turn (default): train on the EOS token at the end of each trainable turn # - turn (default): train on the EOS token at the end of each trainable turn
# - last: train on the last EOS token in the conversation # - last: train on the last EOS token in the conversation
# TIP: Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
train_on_eos: last train_on_eos: last
# The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`. # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
message_field_training: training message_field_training: training

View File

@@ -104,6 +104,10 @@ datasets:
type: chat_template type: chat_template
``` ```
::: {.callout-important}
Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
:::
5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
For a data sample that looks like: For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
message_field_training_detail: train_detail message_field_training_detail: train_detail
``` ```
Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time. ::: {.callout-tip}
It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
:::

View File

@@ -3,8 +3,11 @@ title: Dataset Preprocessing
description: How datasets are processed description: How datasets are processed
--- ---
## Overview
Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the (dataset format)[../dataset-formats/] and prompt strategies to: the [dataset format](docs/dataset-formats) and prompt strategies to:
- parse the dataset based on the *dataset format* - parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy* - transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer - tokenize the dataset based on the configured model & tokenizer
@@ -12,10 +15,12 @@ the (dataset format)[../dataset-formats/] and prompt strategies to:
The processing of the datasets can happen one of two ways: The processing of the datasets can happen one of two ways:
1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug` 1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
2. When training is started 2. When training is started
What are the benefits of pre-processing? When training interactively or for sweeps ### What are the benefits of pre-processing?
When training interactively or for sweeps
(e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
training parameters so that it will intelligently pull from its cache when possible. training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
data is in the cache. data is in the cache.
What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined ### What are the edge cases?
Let's say you are writing a custom prompt strategy or using a user-defined
prompt template. Because the trainer cannot readily detect these changes, we cannot change the prompt template. Because the trainer cannot readily detect these changes, we cannot change the
calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set calculated hash value for the pre-processed dataset.
If you have `dataset_prepared_path: ...` set
and change your prompt templating logic, it may not pick up the changes you made and you will be and change your prompt templating logic, it may not pick up the changes you made and you will be
training over the old prompt. training over the old prompt.

View File

@@ -46,3 +46,7 @@ description: Frequently asked questions
**Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`** **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**
> A: This is likely an empty turn. > A: This is likely an empty turn.
**Q: The EOS/EOT token is incorrectly being masked or not being masked.**
> A: This is because of the mismatch between `tokenizer.eos_token` and EOS/EOT token in template. Please make sure to set `eos_token` under `special_tokens` to the same EOS/EOT token as in template.