diff --git a/docs/lora_optims.qmd b/docs/lora_optims.qmd
index 8bee20402..a7555a0a3 100644
--- a/docs/lora_optims.qmd
+++ b/docs/lora_optims.qmd
@@ -66,6 +66,10 @@ logic to be compatible with more of them.
 
+::: {.callout-tip}
+Check out our [LoRA optimizations blog](https://axolotlai.substack.com/p/accelerating-lora-fine-tuning-with).
+:::
+
 ## Usage
 
 These optimizations can be enabled in your Axolotl config YAML file. The
diff --git a/docs/reward_modelling.qmd b/docs/reward_modelling.qmd
index c9ac5f801..386dc1f57 100644
--- a/docs/reward_modelling.qmd
+++ b/docs/reward_modelling.qmd
@@ -41,6 +41,10 @@ Bradley-Terry chat templates expect single-turn conversations in the following f
 
 ### Process Reward Models (PRM)
 
+::: {.callout-tip}
+Check out our [PRM blog](https://axolotlai.substack.com/p/process-reward-models).
+:::
+
 Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
 ```yaml
 base_model: Qwen/Qwen2.5-3B
diff --git a/docs/rlhf.qmd b/docs/rlhf.qmd
index 773b159e8..ac1cf0393 100644
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -497,6 +497,10 @@ The input format is a simple JSON input with customizable fields based on the ab
 
 ### GRPO
 
+::: {.callout-tip}
+Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
+:::
+
 GRPO uses custom reward functions and transformations. Please have them ready locally. For ex, to load OpenAI's GSM8K and use a random reward for completions:
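
The final hunk ends just before the example it introduces. As a rough illustration of what such a "random reward for completions" function might look like (a sketch assuming the TRL-style convention of a reward callable that maps a batch of completions to one float score per completion; `random_reward_func` is a hypothetical name, not the file's actual snippet):

```python
import random

# Illustrative sketch only: a TRL/axolotl-style reward callable is assumed to
# accept the batch of generated completions (plus any extra dataset columns
# via **kwargs) and return one float reward per completion.
def random_reward_func(completions, **kwargs):
    # Assign each completion a random score in [0, 1] -- useful only as a
    # smoke test that the GRPO training loop runs end to end.
    return [random.uniform(0.0, 1.0) for _ in completions]
```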