Feat: minor docs improvements for RLHF and faq on embeddings (#2401) [skip ci]

* feat: add doc on shrink_embeddings and custom calling
* chore: rename inference doc
* fix: clarify same config is used for all cli
* chore: rearrange order inference qmd
* feat: add simpo to doc
* fix: update defaults
* feat: add rl configs to doc
* fix: ensure beta consistent with trl.beta
* fix: clarify about lora/fft
* chore: rename title
* chore: fix language
* feat: move config reference higher
* Update docs/getting-started.qmd
  Co-authored-by: salman <salman.mohammadi@outlook.com>
* Update docs/rlhf.qmd
  Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-03-17 19:39:04 +07:00
parent 7235123d44
commit 51cd409488
7 changed files with 100 additions and 22 deletions
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -32,8 +32,9 @@ website:
          contents:
            - docs/getting-started.qmd
            - docs/installation.qmd
-            - docs/cli.qmd
            - docs/inference.qmd
+            - docs/cli.qmd
+            - docs/config.qmd

        - section: "Dataset Formats"
          contents: docs/dataset-formats/*
@@ -74,10 +75,6 @@ website:
            - docs/debugging.qmd
            - docs/nccl.qmd

-        - section: "Reference"
-          contents:
-            - docs/config.qmd
-
 format:
  html:
    theme: darkly
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -1,5 +1,5 @@
 ---
-title: Config options
+title: Config Reference
 description: A complete list of all configuration options.
 ---

@@ -30,6 +30,8 @@ tokenizer_legacy:
 # Resize the model embeddings when new tokens are added to multiples of 32
 # This is reported to improve training speed on some models
 resize_token_embeddings_to_32x:
+# Optional[bool] Whether to shrink the embeddings to len(tokenizer). By default, we won't shrink.
+shrink_embeddings:

 # (Internal use only)
 # Used to identify which the model is based on
@@ -205,10 +207,46 @@ test_datasets:
    data_files:
      - /workspace/data/eval.jsonl

-# use RL training: 'dpo', 'ipo', 'kto'
+# use RL training: 'dpo', 'ipo', 'kto', 'simpo', 'orpo', 'grpo'
 rl:
-# whether to perform weighting if doing DPO training. Boolean.
-dpo_use_weighting:
+rl_beta:  # Optional[float]. The beta parameter for the RL training.
+
+# dpo
+dpo_use_weighting:  # Optional[bool]. Whether to perform weighting.
+rpo_alpha: # Optional[float]. Weighting of NLL term in loss from RPO paper.
+
+# orpo
+orpo_alpha: 0.1  # Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to `beta` in `ORPOConfig` due to trl mapping.
+
+# kto
+kto_desirable_weight: # Optional[float]. Factor for desirable loss term in KTO loss.
+kto_undesirable_weight: # Optional[float]. Factor for undesirable loss term in KTO loss.
+
+# simpo
+cpo_alpha: 1.0  # Weight of the BC regularizer
+simpo_gamma: 0.5  # Target reward margin for the SimPO loss
+
+# grpo
+trl:
+  use_vllm: # Optional[bool]. Whether to use vLLM for RL training.
+  vllm_device: # Optional[str]. Device to use for vLLM.
+  vllm_gpu_memory_utilization: # Optional[float]. GPU memory utilization for vLLM.
+  vllm_max_model_len: # Optional[int]. Maximum length of the model for vLLM.
+  vllm_dtype: # Optional[str]. Data type for vLLM.
+
+  beta: # Optional[float]. Beta parameter for the RL training. Same as `rl_beta`.
+  max_completion_length: # Optional[int]. Maximum length of the completion for RL training.
+
+  reward_funcs: # Optional[list[str]]. List of reward functions to load. Paths must be importable from current dir.
+  reward_weights: # Optional[list[float]]. List of reward weights for the reward functions.
+
+  num_generations: # Optional[int]. Number of generations to sample.
+  log_completions: # Optional[bool]. Whether to log completions.
+
+  sync_ref_model: # Optional[bool]. Whether to sync the reference model.
+  ref_model_mixup_alpha: # Optional[float]. Mixup alpha for the reference model.
+  ref_model_sync_steps: # Optional[int]. Sync steps for the reference model.
+

 # reward modelling: `True` or `False`
 reward_model:
@@ -232,7 +270,7 @@ default_system_message: You are a helpful assistant. Please give a long and deta
 # subsequent training attempts load faster, relative path
 dataset_prepared_path: data/last_run_prepared
 # Push prepared dataset to hub
-push_dataset_to_hub: # repo path
+push_dataset_to_hub: # Optional[str] repo_org/repo_name
 # The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()`
 # if not set.
 dataset_processes: # defaults to os.cpu_count() if not set
--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -27,6 +27,16 @@ description: Frequently asked questions

 > A: This is usually an issue with the GPU. This can be resolved through setting the os environment variable `CUDA_VISIBLE_DEVICES=0`. If you are on runpod, this is usually a pod issue. Starting a new pod should take care of it.

+**Q: Received a size mismatch error (torch.Size of checkpoint vs. model) when merging or loading adapters.**
+
+> A: This is likely due to a vocab size mismatch. By default, Axolotl expands the model's embeddings if the tokenizer has more tokens than the model. Please use the `axolotl merge-lora` command to merge the adapters instead of using your own scripts.
+
+> On the other hand, if the model has more tokens than the tokenizer, Axolotl does not shrink the model's embeddings unless `shrink_embeddings: true` is set in the config.
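The resizing rule described in this answer can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Axolotl's implementation: the function name `resolve_embedding_size` and its parameters are hypothetical, and it ignores other options such as `resize_token_embeddings_to_32x`.

```python
def resolve_embedding_size(model_vocab_size: int, tokenizer_len: int,
                           shrink_embeddings: bool = False) -> int:
    """Hypothetical helper: return the embedding row count after loading.

    Mirrors the rule stated in the FAQ: expand when the tokenizer is larger,
    shrink only when `shrink_embeddings` is explicitly enabled.
    """
    if tokenizer_len > model_vocab_size:
        # Tokenizer has more tokens than the model: embeddings are expanded.
        return tokenizer_len
    if tokenizer_len < model_vocab_size and shrink_embeddings:
        # Model has more rows than the tokenizer and shrinking was requested.
        return tokenizer_len
    # Otherwise the model's embedding size is left untouched.
    return model_vocab_size


print(resolve_embedding_size(32000, 32004))        # expanded
print(resolve_embedding_size(32064, 32000))        # unchanged
print(resolve_embedding_size(32064, 32000, True))  # shrunk
```

A checkpoint saved after expansion has more embedding rows than the stock base model, which is why a merge script that loads the base model without the same resize can hit the `torch.Size` mismatch.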
+
+**Q: How do I call Axolotl from custom Python scripts?**
+
+> A: Since Axolotl is just Python, you can call it from your own scripts. See `src/axolotl/cli/main.py` for how each command is invoked.
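As an alternative to importing Axolotl's internals, a script can shell out to the CLI. The sketch below only builds and prints the command; `build_axolotl_command` is a hypothetical helper, `config.yml` is a placeholder path, and passing the config as a positional argument is an assumption to check against `src/axolotl/cli/main.py`.

```python
import shlex


def build_axolotl_command(subcommand: str, config_path: str) -> list[str]:
    """Hypothetical helper: assemble an `axolotl <subcommand> <config>` call."""
    return ["axolotl", subcommand, config_path]


cmd = build_axolotl_command("train", "config.yml")
print(shlex.join(cmd))

# To actually launch the run (requires Axolotl to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Going through the CLI keeps the script on the same code path as a manual run, at the cost of not having direct access to the Python objects involved.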
+
 ### Chat templates

 **Q: `jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____`**
--- a/docs/getting-started.qmd
+++ b/docs/getting-started.qmd
@@ -36,7 +36,9 @@ The YAML configuration file controls everything about your training. Here's what

 ```yaml
 base_model: NousResearch/Llama-3.2-1B
-# hub_model_id: username/custom_model_name
+
+load_in_8bit: true
+adapter: lora

 datasets:
  - path: teknium/GPT4-LLM-Cleaned
@@ -44,11 +46,15 @@ datasets:
 dataset_prepared_path: last_run_prepared
 val_set_size: 0.1
 output_dir: ./outputs/lora-out
-
-adapter: lora
-lora_model_dir:
 ```

+::: {.callout-tip}
+`load_in_8bit: true` and `adapter: lora` enable LoRA adapter finetuning.
+
+- To perform full finetuning, remove these two lines.
+- To perform QLoRA finetuning, replace them with `load_in_4bit: true` and `adapter: qlora`.
+:::
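The three modes in this tip differ only in a couple of config keys. A minimal Python sketch of that mapping, operating on a plain dict rather than Axolotl's config objects; `MODE_OVERRIDES` and `apply_finetune_mode` are hypothetical names used for illustration only.

```python
# Keys that select the finetuning mode, per the tip above.
MODE_KEYS = {"load_in_8bit", "load_in_4bit", "adapter"}

MODE_OVERRIDES = {
    "lora": {"load_in_8bit": True, "adapter": "lora"},
    "qlora": {"load_in_4bit": True, "adapter": "qlora"},
    "full": {},  # full finetuning: neither quantization nor adapter is set
}


def apply_finetune_mode(base_config: dict, mode: str) -> dict:
    """Hypothetical helper: return a copy of the config switched to `mode`."""
    # Drop any existing mode keys, then add the ones for the requested mode.
    cfg = {k: v for k, v in base_config.items() if k not in MODE_KEYS}
    cfg.update(MODE_OVERRIDES[mode])
    return cfg


base = {
    "base_model": "NousResearch/Llama-3.2-1B",
    "load_in_8bit": True,
    "adapter": "lora",
}
print(apply_finetune_mode(base, "qlora"))
print(apply_finetune_mode(base, "full"))
```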
+
 See our [Config options](config.qmd) for more details.

 ### Training {#sec-training}
@@ -56,7 +62,7 @@ See our [Config options](config.qmd) for more details.
 When you run `axolotl train`, Axolotl:

 1. Downloads the base model
-2. (If specified) applies LoRA adapter layers
+2. (If specified) applies QLoRA/LoRA adapter layers
 3. Loads and processes the dataset
 4. Runs the training loop
 5. Saves the trained model and / or LoRA weights
@@ -69,6 +75,8 @@ Let's modify the example for your own data:

 ```yaml
 base_model: NousResearch/Nous-Hermes-llama-1b-v1
+
+load_in_8bit: true
 adapter: lora

 # Training settings
@@ -104,8 +112,6 @@ format):
 {"instruction": "Classify this text", "input": "Not good at all", "output": "negative"}
 ```

-Please consult the supported [Dataset Formats](dataset-formats/) for more details.
-
 3. Run the training:

 ```bash
--- a/docs/inference.qmd
+++ b/docs/inference.qmd
@@ -1,5 +1,5 @@
 ---
-title: "Inference"
+title: "Inference and Merging"
 format:
  html:
    toc: true
@@ -9,10 +9,14 @@ execute:
  enabled: false
 ---

-This guide covers how to use your trained models for inference, including model loading, interactive testing, and common troubleshooting steps.
+This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.

 ## Quick Start {#sec-quickstart}

+::: {.callout-tip}
+For inference and merging, use the same config that was used for training.
+:::
+
 ### Basic Inference {#sec-basic}

 ::: {.panel-tabset}
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -298,7 +298,7 @@ The input format is a simple JSON input with customizable fields based on the ab

 ### IPO

-As IPO is just DPO with a different loss function, all supported options for DPO works here.
+As IPO is just DPO with a different loss function, all supported dataset formats for [DPO](#dpo) are also supported for IPO.

 ```yaml
 rl: ipo
@@ -344,8 +344,9 @@ ORPO supports the following types with the following dataset format:

 ```yaml
 rl: kto
-rl_beta: 0.5
-kto_desirable_weight: 0.2
+rl_beta: 0.1  # default
+kto_desirable_weight: 1.0  # default
+kto_undesirable_weight: 1.0  # default

 remove_unused_columns: false

@@ -544,6 +545,19 @@ To see other examples of custom reward functions, please see [TRL GRPO Docs](htt

 To see description of the configs, please see [TRLConfig](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/config/models/input/v0_4_1/trl.py).

+### SimPO
+
+SimPO uses [CPOTrainer](https://huggingface.co/docs/trl/main/en/cpo_trainer) but with an alternative loss function.
+
+```yaml
+rl: simpo
+rl_beta: 0.1  # default in CPOTrainer
+cpo_alpha: 1.0  # default in CPOTrainer
+simpo_gamma: 0.5  # default in CPOTrainer
+```
+
+This method uses the same dataset format as [DPO](#dpo).
+
 ### Using local dataset files

 ```yaml
--- a/src/axolotl/utils/config/models/input/v0_4_1/init.py
+++ b/src/axolotl/utils/config/models/input/v0_4_1/init.py
@@ -1,4 +1,5 @@
 """Module with Pydantic models for configuration."""
+
 # pylint: disable=too-many-lines

 import logging
@@ -1827,6 +1828,14 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
                data["torch_compile"] = False
        return data

+    @model_validator(mode="before")
+    @classmethod
+    def check_beta_and_trl_beta_match(cls, data):
+        if data.get("beta") and data.get("trl", {}).get("beta"):
+            if data["beta"] != data["trl"]["beta"]:
+                raise ValueError("beta and trl.beta must match or one must be removed")
+        return data
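The behavior of this validator can be tried outside Pydantic with a plain function over a dict. This is a sketch of the same check, not the class method itself: `check_beta_consistency` is a hypothetical name, and the `or {}` fallback (tolerating `trl: null`) is an addition that the validator in the diff does not make.

```python
def check_beta_consistency(data: dict) -> dict:
    """Hypothetical standalone version of the beta / trl.beta check."""
    # Assumption: treat a missing or null `trl` section as empty.
    trl_cfg = data.get("trl") or {}
    top_beta = data.get("beta")
    trl_beta = trl_cfg.get("beta")
    # As in the diff, the check uses truthiness, so 0 or None counts as unset.
    if top_beta and trl_beta and top_beta != trl_beta:
        raise ValueError("beta and trl.beta must match or one must be removed")
    return data


check_beta_consistency({"beta": 0.1, "trl": {"beta": 0.1}})  # passes
check_beta_consistency({"beta": 0.1})                        # passes

try:
    check_beta_consistency({"beta": 0.1, "trl": {"beta": 0.2}})
except ValueError as err:
    print(f"rejected: {err}")
```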
+

 def handle_legacy_message_fields_logic(data: dict) -> dict:
    """