diff --git a/docs/reward_modelling.qmd b/docs/reward_modelling.qmd
index 8baa93424..c9ac5f801 100644
--- a/docs/reward_modelling.qmd
+++ b/docs/reward_modelling.qmd
@@ -28,6 +28,17 @@ val_set_size: 0.1
 eval_steps: 100
 ```
 
+Bradley-Terry chat templates expect single-turn conversations in the following format:
+
+```json
+{
+    "system": "...", // optional
+    "input": "...",
+    "chosen": "...",
+    "rejected": "..."
+}
+```
+
 ### Process Reward Models (PRM)
 
 Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
@@ -45,3 +56,5 @@ datasets:
 val_set_size: 0.1
 eval_steps: 100
 ```
+
+Please see [stepwise_supervised](dataset-formats/stepwise_supervised.qmd) for more details on the dataset format.
diff --git a/styles.css b/styles.css
index 891349b4b..749ff4366 100644
--- a/styles.css
+++ b/styles.css
@@ -14,7 +14,7 @@ h1 {
   font-family: var(--font-title);
   font-weight: 400;
-  font-size: 6rem;
+  font-size: 5rem;
   line-height: 1.1;
   letter-spacing: -0.05em;
   font-feature-settings: "ss01" on;