Process reward models (#2241)

* adding model_cfg to set num_labels

* using a num_labels field instead

* linting

* WIP stepwise prompt tokenizer

* this should work?

* trainer working?

* pushing to runpod

* fixing saving

* updating conf

* updating config, adding docs

* adding stepwise supervision docpage

* updating tests

* adding test for dataset

* fixing tests

* linting

* addressing some comments

* adding additional cfg fields support

* updating tests, fixing cfg

* fixing tests

* updating loss

* Update test_process_reward_model_smollm2.py

* updating loss values and seed

* dumb pre-commit
salman committed 2025-01-29 05:08:33 +00:00 (committed by GitHub)
parent c071a530f7 · commit 54dd7abfc1
17 changed files with 542 additions and 25 deletions


@@ -187,6 +187,12 @@ rl:
# whether to perform weighting if doing DPO training. Boolean.
dpo_use_weighting:
# reward modelling: `True` or `False`
reward_model:
# process reward modelling: `True` or `False`
process_reward_model:
# The name of the chat template to use for training, following values are supported:
# - tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default value.
# - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py


@@ -0,0 +1,18 @@
---
title: Stepwise Supervised Format
description: Format for datasets with stepwise completions and labels
order: 3
---
## Stepwise Supervised
The stepwise supervised format is designed for chain-of-thought (CoT) reasoning datasets where each example contains multiple completion steps and a preference label for each step.
### Example
Here's a simple example of a stepwise supervised dataset entry:
```json
{
"prompt": "Which number is larger, 9.8 or 9.11?",
"completions": [
"The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
"Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8."
],
"labels": [true, false]
}
```
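
For a quick sanity check, an entry in this shape can be built and inspected with the `datasets` library. The sketch below only assumes `datasets` is installed and reuses the field names from the example above:

```python
# Minimal sketch: build an in-memory stepwise-supervised dataset and check
# that every completion step has a matching label.
from datasets import Dataset

rows = [
    {
        "prompt": "Which number is larger, 9.8 or 9.11?",
        "completions": [
            "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
            "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8.",
        ],
        "labels": [True, False],  # one correctness label per completion step
    }
]

ds = Dataset.from_list(rows)
assert all(len(row["completions"]) == len(row["labels"]) for row in ds)
print(ds[0]["labels"])
```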

docs/reward_modelling.qmd (new file, 47 lines)

@@ -0,0 +1,47 @@
---
title: "Reward Modelling"
description: "Reward models are used to guide models towards behaviors preferred by humans by training on large datasets annotated with human preferences."
---
### Overview
Reward modelling is a technique used to train models to predict the reward or value of a given input. This is particularly useful in reinforcement learning scenarios where the model needs to evaluate the quality of its actions or predictions.
We support the reward modelling techniques available in `trl`.
### (Outcome) Reward Models
Outcome reward models are trained on data which contains preference annotations for an entire interaction between the user and model (rather than per-turn or per-step).
```yaml
base_model: google/gemma-2-2b
model_type: AutoModelForSequenceClassification
num_labels: 1
tokenizer_type: AutoTokenizer
reward_model: true
chat_template: gemma
datasets:
- path: argilla/distilabel-intel-orca-dpo-pairs
type: bradley_terry.chat_template
val_set_size: 0.1
eval_steps: 100
```
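
Once trained, a model from this config emits a single scalar score per conversation. Below is a hedged sketch of scoring with a generic `AutoModelForSequenceClassification` checkpoint; the model name is a placeholder rather than an artifact this config produces, and the chat template is whatever the checkpoint's tokenizer ships with.

```python
# Sketch: score one full user/assistant exchange with a sequence-classification
# reward model (num_labels=1 -> a single scalar logit per sequence).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "your-org/your-reward-model"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

chat = [
    {"role": "user", "content": "Which number is larger, 9.8 or 9.11?"},
    {"role": "assistant", "content": "9.8 is larger, because 0.8 > 0.11."},
]
input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt")

with torch.no_grad():
    reward = model(input_ids).logits[0, 0].item()  # scalar reward for the whole exchange
print(reward)
```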
### Process Reward Models (PRM)
Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
```yaml
base_model: Qwen/Qwen2.5-3B
model_type: AutoModelForTokenClassification
num_labels: 2
process_reward_model: true
datasets:
- path: trl-lib/math_shepherd
type: stepwise_supervised
split: train
val_set_size: 0.1
eval_steps: 100
```
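
At inference time a PRM trained this way produces per-token logits, and step-level scores are usually read off at the positions of a step-separator token. The sketch below assumes a placeholder checkpoint name and a `"\n"` separator; the actual separator must match whatever the training data used.

```python
# Sketch: per-step scoring with a token-classification PRM (num_labels=2).
# The checkpoint name and the "\n" step separator are assumptions; use the
# separator the model was trained with.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "your-org/your-prm"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

prompt = "Which number is larger, 9.8 or 9.11?"
steps = [
    "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
    "Since 0.8 is greater than 0.11, the number 9.8 is larger than 9.11.",
]
separator = "\n"  # assumed step separator
text = prompt + separator + separator.join(steps) + separator

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, 2]

# Probability of the "good step" class at each separator position.
sep_id = tokenizer.encode(separator, add_special_tokens=False)[-1]
step_positions = (inputs["input_ids"][0] == sep_id).nonzero(as_tuple=True)[0]
step_probs = logits[0, step_positions].softmax(dim=-1)[:, 1]
print(step_probs.tolist())  # one score per reasoning step
```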