Fix(doc): address missing doc changes (#2362)

* fix: add multiple tips about eos_token masking * fix: format dataset preprocessing doc * Update docs/dataset-formats/conversation.qmd Co-authored-by: salman <salman.mohammadi@outlook.com> --------- Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-02-26 01:50:02 +07:00
parent 2efe1b4c09
commit 75cbd15301
4 changed files with 27 additions and 7 deletions
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -166,7 +166,7 @@ datasets:
    # IMPORTANT: The following fields determine which parts of the conversation to train on.
    # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
    # See examples at `docs/dataset-formats/conversation.qmd`
-    # Note: If the below 4 fields are empty, defaults to training only on the last message.
+    # Note: If the below 4 fields are set to empty, defaults to training only on the last message.

    # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
    roles_to_train: ["assistant"]  # default
@@ -174,6 +174,7 @@ datasets:
    # - all: train on all EOS tokens
    # - turn (default): train on the EOS token at the end of each trainable turn
    # - last: train on the last EOS token in the conversation
+    # TIP: Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
    train_on_eos: last
    # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
    message_field_training: training
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -104,6 +104,10 @@ datasets:
    type: chat_template
 ```

+::: {.callout-important}
+Please make sure that your `tokenizer.eos_token` is same as EOS/EOT token in template. Otherwise, set `eos_token` under `special_tokens`.
+:::
+
 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation

 For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
    message_field_training_detail: train_detail
 ```

-Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time.
+::: {.callout-tip}
+It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
+:::
--- a/docs/dataset_preprocessing.qmd
+++ b/docs/dataset_preprocessing.qmd
@@ -3,8 +3,11 @@ title: Dataset Preprocessing
 description: How datasets are processed
 ---

+## Overview
+
 Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
-the (dataset format)[../dataset-formats/] and prompt strategies to:
+the [dataset format](docs/dataset-formats) and prompt strategies to:
+
 - parse the dataset based on the *dataset format*
 - transform the dataset to how you would interact with the model based on the *prompt strategy*
 - tokenize the dataset based on the configured model & tokenizer
@@ -12,10 +15,12 @@ the (dataset format)[../dataset-formats/] and prompt strategies to:

 The processing of the datasets can happen one of two ways:

-1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
 2. When training is started

-What are the benefits of pre-processing? When training interactively or for sweeps
+### What are the benefits of pre-processing?
+
+When training interactively or for sweeps
 (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
 slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
 training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
 setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
 data is in the cache.

-What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+### What are the edge cases?
+
+Let's say you are writing a custom prompt strategy or using a user-defined
 prompt template. Because the trainer cannot readily detect these changes, we cannot change the
-calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
+calculated hash value for the pre-processed dataset.
+
+If you have `dataset_prepared_path: ...` set
 and change your prompt templating logic, it may not pick up the changes you made and you will be
 training over the old prompt.
--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -46,3 +46,7 @@ description: Frequently asked questions
 **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**

 > A: This is likely an empty turn.
+
+**Q: The EOS/EOT token is incorrectly being masked or not being masked.**
+
+> A: This is because of the mismatch between `tokenizer.eos_token` and EOS/EOT token in template. Please make sure to set `eos_token` under `special_tokens` to the same EOS/EOT token as in template.