diff --git a/docs/config.qmd b/docs/config.qmd
index 8327e1488..3a11666a5 100644
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -166,7 +166,7 @@ datasets:
   # IMPORTANT: The following fields determine which parts of the conversation to train on.
   # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
   # See examples at `docs/dataset-formats/conversation.qmd`
-  # Note: If the below 4 fields are empty, defaults to training only on the last message.
+  # Note: If the 4 fields below are set to empty, training defaults to only the last message.
   # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
   roles_to_train: ["assistant"] # default
@@ -174,6 +174,7 @@ datasets:
   # - all: train on all EOS tokens
   # - turn (default): train on the EOS token at the end of each trainable turn
   # - last: train on the last EOS token in the conversation
+  # TIP: Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in your chat template. If it is not, set `eos_token` under `special_tokens`.
   train_on_eos: last
   # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
   message_field_training: training
diff --git a/docs/dataset-formats/conversation.qmd b/docs/dataset-formats/conversation.qmd
index d67e35876..8ce95b7b0 100644
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -104,6 +104,10 @@ datasets:
     type: chat_template
 ```
 
+::: {.callout-important}
+Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in your chat template. If it is not, set `eos_token` under `special_tokens`.
+:::
+
 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
 
 For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
     message_field_training_detail: train_detail
 ```
 
-Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time.
+::: {.callout-tip}
+It is not necessary to set both `message_field_training` and `message_field_training_detail` at the same time.
+:::
diff --git a/docs/dataset_preprocessing.qmd b/docs/dataset_preprocessing.qmd
index c99fce444..1075dc8e5 100644
--- a/docs/dataset_preprocessing.qmd
+++ b/docs/dataset_preprocessing.qmd
@@ -3,8 +3,11 @@ title: Dataset Preprocessing
 description: How datasets are processed
 ---
 
+## Overview
+
 Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
-the (dataset format)[../dataset-formats/] and prompt strategies to:
+the [dataset format](dataset-formats/) and prompt strategies to:
+
 - parse the dataset based on the *dataset format*
 - transform the dataset to how you would interact with the model based on the *prompt strategy*
 - tokenize the dataset based on the configured model & tokenizer
@@ -12,10 +15,12 @@
 The processing of the datasets can happen one of two ways:
 
-1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
 2. When training is started
 
-What are the benefits of pre-processing? When training interactively or for sweeps
+### What are the benefits of pre-processing?
+
+When training interactively or for sweeps
 (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
 slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
 training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
 setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
 data is in the cache.
 
-What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+### What are the edge cases?
+
+Let's say you are writing a custom prompt strategy or using a user-defined
 prompt template. Because the trainer cannot readily detect these changes, we cannot change the
-calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
+calculated hash value for the pre-processed dataset.
+
+If you have `dataset_prepared_path: ...` set
 and change your prompt templating logic, it may not pick up the changes you made and you will be
 training over the old prompt.
diff --git a/docs/faq.qmd b/docs/faq.qmd
index 0a181e022..1b5037db9 100644
--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -46,3 +46,7 @@ description: Frequently asked questions
 **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**
 
 > A: This is likely an empty turn.
+
+**Q: The EOS/EOT token is being masked (or not masked) incorrectly.**
+
+> A: This is caused by a mismatch between `tokenizer.eos_token` and the EOS/EOT token in the template. Set `eos_token` under `special_tokens` to match the template's EOS/EOT token.
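The `eos_token` override recommended in these hunks looks like this in an Axolotl config. A minimal sketch — the token value shown assumes a Llama-3-style chat template whose end-of-turn token is `<|eot_id|>`; substitute whatever EOS/EOT token your template actually emits:

```yaml
# Override the tokenizer's EOS token so it matches the chat template's
# end-of-turn token. "<|eot_id|>" is an assumption for illustration;
# use the token your template emits at the end of each turn.
special_tokens:
  eos_token: "<|eot_id|>"
```

With this set, `train_on_eos` will mask or train the intended token instead of a stale `tokenizer.eos_token`.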
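As a sketch of how the training-control fields from the `config.qmd` hunks fit together in one dataset entry (the dataset path is a placeholder, not from the docs):

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path for illustration
    type: chat_template
    # Only assistant turns contribute to the loss
    roles_to_train: ["assistant"]
    # Train on the EOS token at the end of each trainable turn (the default)
    train_on_eos: turn
```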