feat(doc): add more info on RewardModel datasets (#2391)
* fix: reduce title size
* feat(doc): add rm dataset info
* Update docs/reward_modelling.qmd following suggestion

Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
@@ -28,6 +28,17 @@ val_set_size: 0.1
 eval_steps: 100
 ```
 
+Bradley-Terry chat templates expect single-turn conversations in the following format:
+
+```json
+{
+    "system": "...", // optional
+    "input": "...",
+    "chosen": "...",
+    "rejected": "..."
+}
+```
+
 ### Process Reward Models (PRM)
 
 Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
@@ -45,3 +56,5 @@ datasets:
 val_set_size: 0.1
 eval_steps: 100
 ```
+
+Please see [stepwise_supervised](dataset-formats/stepwise_supervised.qmd) for more details on the dataset format.
@@ -14,7 +14,7 @@
 h1 {
   font-family: var(--font-title);
   font-weight: 400;
-  font-size: 6rem;
+  font-size: 5rem;
   line-height: 1.1;
   letter-spacing: -0.05em;
   font-feature-settings: "ss01" on;
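The Bradley-Terry record format added to the docs in this commit can be sanity-checked before training. The following is a minimal sketch, not part of the commit or of any library: the helper name `validate_bt_record` is hypothetical, and the only things taken from the diff are the field names (`system` optional; `input`, `chosen`, `rejected` required). Note that the `// optional` comment in the documented snippet is an annotation and would not parse as strict JSON, so real records should omit it.

```python
import json

# Field names come from the format documented in the diff above.
REQUIRED_KEYS = ("input", "chosen", "rejected")
OPTIONAL_KEYS = ("system",)


def validate_bt_record(record: dict) -> list[str]:
    """Hypothetical helper: list the problems found in one preference record.

    An empty list means the record matches the documented single-turn shape.
    """
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {key!r}")
    for key in OPTIONAL_KEYS:
        value = record.get(key)
        if value is not None and not isinstance(value, str):
            problems.append(f"optional field {key!r} must be a string when present")
    return problems


# Illustrative record only; the text is made up for this example.
example = {
    "system": "You are a helpful assistant.",
    "input": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France does not have a capital.",
}

# One record per line (JSON Lines) is an assumed storage layout, not a requirement.
line = json.dumps(example)
print(validate_bt_record(json.loads(line)))
```

A record missing `rejected`, for instance, would come back with a single problem string naming that field, which makes it straightforward to filter or report bad rows before they reach the trainer.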