diff --git a/docs/config.qmd b/docs/config.qmd
index 8327e1488..3a11666a5 100644
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -166,7 +166,7 @@ datasets:
   # IMPORTANT: The following fields determine which parts of the conversation to train on.
   # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
   # See examples at `docs/dataset-formats/conversation.qmd`
-  # Note: If the below 4 fields are empty, defaults to training only on the last message.
+  # Note: If the 4 fields below are set to empty, training defaults to only the last message.
   # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
   roles_to_train: ["assistant"] # default
@@ -174,6 +174,7 @@ datasets:
   # - all: train on all EOS tokens
   # - turn (default): train on the EOS token at the end of each trainable turn
   # - last: train on the last EOS token in the conversation
+  # TIP: Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in your chat template. If it is not, set `eos_token` under `special_tokens`.
   train_on_eos: last
   # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
   message_field_training: training
diff --git a/docs/dataset-formats/conversation.qmd b/docs/dataset-formats/conversation.qmd
index d67e35876..8ce95b7b0 100644
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -104,6 +104,10 @@ datasets:
     type: chat_template
 ```
 
+::: {.callout-important}
+Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in your chat template. If it is not, set `eos_token` under `special_tokens`.
+:::
+
 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
 
 For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
     message_field_training_detail: train_detail
 ```
 
-Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time.
+::: {.callout-tip}
+It is not necessary to set both `message_field_training` and `message_field_training_detail` at the same time.
+:::
diff --git a/docs/dataset_preprocessing.qmd b/docs/dataset_preprocessing.qmd
index c99fce444..1075dc8e5 100644
--- a/docs/dataset_preprocessing.qmd
+++ b/docs/dataset_preprocessing.qmd
@@ -3,8 +3,11 @@ title: Dataset Preprocessing
 description: How datasets are processed
 ---
 
+## Overview
+
 Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
-the (dataset format)[../dataset-formats/] and prompt strategies to:
+the [dataset format](dataset-formats/) and prompt strategies to:
+
 - parse the dataset based on the *dataset format*
 - transform the dataset to how you would interact with the model based on the *prompt strategy*
 - tokenize the dataset based on the configured model & tokenizer
@@ -12,10 +15,12 @@
 The processing of the datasets can happen one of two ways:
 
-1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
 2. When training is started
 
-What are the benefits of pre-processing? When training interactively or for sweeps
+### What are the benefits of pre-processing?
+
+When training interactively or for sweeps
 (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
 slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
 training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
 setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
 data is in the cache.
 
-What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+### What are the edge cases?
+
+Let's say you are writing a custom prompt strategy or using a user-defined
 prompt template. Because the trainer cannot readily detect these changes, we cannot change the
-calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
+calculated hash value for the pre-processed dataset.
+
+If you have `dataset_prepared_path: ...` set
 and change your prompt templating logic, it may not pick up the changes you made and you will be
 training over the old prompt.
diff --git a/docs/faq.qmd b/docs/faq.qmd
index 0a181e022..1b5037db9 100644
--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -46,3 +46,7 @@ description: Frequently asked questions
 **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**
 
 > A: This is likely an empty turn.
+
+**Q: The EOS/EOT token is being masked (or not masked) incorrectly.**
+
+> A: This is caused by a mismatch between `tokenizer.eos_token` and the EOS/EOT token in the template. Set `eos_token` under `special_tokens` to match the template's EOS/EOT token.
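The `eos_token` override recommended in these hunks looks like this in an Axolotl config. A minimal sketch — the token value shown assumes a Llama-3-style chat template whose end-of-turn token is `<|eot_id|>`; substitute whatever EOS/EOT token your template actually emits:

```yaml
# Override the tokenizer's EOS token so it matches the chat template's
# end-of-turn token. "<|eot_id|>" is an assumption for illustration;
# use the token your template emits at the end of each turn.
special_tokens:
  eos_token: "<|eot_id|>"
```

With this set, `train_on_eos` will mask or train the intended token instead of a stale `tokenizer.eos_token`.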
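As a sketch of how the training-control fields from the `config.qmd` hunks fit together in one dataset entry (the dataset path is a placeholder, not from the docs):

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path for illustration
    type: chat_template
    # Only assistant turns contribute to the loss
    roles_to_train: ["assistant"]
    # Train on the EOS token at the end of each trainable turn (the default)
    train_on_eos: turn
```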