Process reward models (#2241)
* adding model_cfg to set num_labels * using a num_labels field instead * linting * WIP stepwise prompt tokenizer * this should work? * trainer working? * pushing to runpod * fixing saving * updating conf * updating config, adding docs * adding stepwise supervision docpage * updating tests * adding test for dataset * fixing tests * linting * addressing some comments * adding additional cfg fields support * updating tests, fixing cfg * fixing tests * updating loss * Update test_process_reward_model_smollm2.py * updating loss values and seed * dumb pre-commit
This commit is contained in:
47
docs/reward_modelling.qmd
Normal file
47
docs/reward_modelling.qmd
Normal file
@@ -0,0 +1,47 @@
|
||||
---
|
||||
title: "Reward Modelling"
|
||||
description: "Reward models are used to guide models towards behaviors which is preferred by humans, by training over large datasets annotated with human preferences. "
|
||||
---
|
||||
|
||||
### Overview
|
||||
|
||||
Reward modelling is a technique used to train models to predict the reward or value of a given input. This is particularly useful in reinforcement learning scenarios where the model needs to evaluate the quality of its actions or predictions.
|
||||
We support the reward modelling techniques supported by `trl`.
|
||||
|
||||
### (Outcome) Reward Models
|
||||
|
||||
Outcome reward models are trained using data which contains preference annotations for an entire interaction between the user and model (e.g. rather than per-turn or per-step).
|
||||
|
||||
```yaml
|
||||
base_model: google/gemma-2-2b
|
||||
model_type: AutoModelForSequenceClassification
|
||||
num_labels: 1
|
||||
tokenizer_type: AutoTokenizer
|
||||
|
||||
reward_model: true
|
||||
chat_template: gemma
|
||||
datasets:
|
||||
- path: argilla/distilabel-intel-orca-dpo-pairs
|
||||
type: bradley_terry.chat_template
|
||||
|
||||
val_set_size: 0.1
|
||||
eval_steps: 100
|
||||
```
|
||||
|
||||
### Process Reward Models (PRM)
|
||||
|
||||
Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
|
||||
```yaml
|
||||
base_model: Qwen/Qwen2.5-3B
|
||||
model_type: AutoModelForTokenClassification
|
||||
num_labels: 2
|
||||
|
||||
process_reward_model: true
|
||||
datasets:
|
||||
- path: trl-lib/math_shepherd
|
||||
type: stepwise_supervised
|
||||
split: train
|
||||
|
||||
val_set_size: 0.1
|
||||
eval_steps: 100
|
||||
```
|
||||
Reference in New Issue
Block a user