* feat: add center_rewards_coefficient for reward modeling
  - Add center_rewards_coefficient parameter to Pydantic schema with paper reference
  - Pass parameter through base builder and causal builder to training args
  - Add documentation section with usage examples and theoretical background
  - Enable parameter in reward modeling example configs with recommended value
  - Enables reward centering for improved training stability in RLHF workflows

  Implements auxiliary loss from Eisenstein et al. 2023 (https://huggingface.co/papers/2312.09244) to incentivize mean-zero reward outputs without post-training normalization.
* Update description
* test: add unit tests for center_rewards_coefficient integration
* Update src/axolotl/core/builders/base.py
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update docs/reward_modelling.qmd
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update docs/reward_modelling.qmd
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* reference to TRL documentation.
* add new reward model configuration for qwen3 with comprehensive parameters
* Verified center_rewards_coefficient is correctly passed through the trainer builder to training arguments.
* Refactor reward modeling documentation to consolidate information on center_rewards_coefficient
* Remove unit tests for center_rewards_coefficient integration as part of codebase cleanup.
* linting
* nit
* Apply suggestions from code review
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* lint

---------

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
45 lines
905 B
YAML
base_model: Skywork/Skywork-Reward-V2-Qwen3-8B
model_type: AutoModelForSequenceClassification
num_labels: 1

reward_model: true
center_rewards_coefficient: 0.01 # Incentivize mean-zero rewards for improved stability
chat_template: qwen3
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template

val_set_size: 0.0
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: true

deepspeed: deepspeed_configs/zero1.json

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: linear
learning_rate: 0.00002

bf16: true
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
warmup_ratio: 0.1
logging_steps: 1
weight_decay: 0.01
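
The `center_rewards_coefficient` setting in this config adds an auxiliary term to the pairwise reward loss. A minimal sketch of how such a term could work follows. It assumes the penalty is the coefficient times the mean squared sum of the chosen and rejected rewards, which is my understanding of the form described in the TRL documentation. The function names `log_sigmoid` and `centered_reward_loss` are hypothetical and are not part of axolotl or TRL.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def centered_reward_loss(chosen, rejected, center_rewards_coefficient=None):
    """Mean Bradley-Terry loss with an optional reward-centering penalty.

    chosen / rejected: equal-length lists of scalar rewards, one pair per example.
    center_rewards_coefficient: weight of the penalty (assumed form, see lead-in).
    """
    if not chosen or len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must be non-empty and equal length")
    n = len(chosen)
    # Pairwise preference term: depends only on the reward difference.
    loss = -sum(log_sigmoid(c - r) for c, r in zip(chosen, rejected)) / n
    if center_rewards_coefficient is not None:
        # Penalty grows as the pair's rewards drift away from summing to zero.
        penalty = sum((c + r) ** 2 for c, r in zip(chosen, rejected)) / n
        loss += center_rewards_coefficient * penalty
    return loss


chosen = [2.0, 1.5]
rejected = [1.0, 0.5]
shifted_chosen = [c + 10.0 for c in chosen]
shifted_rejected = [r + 10.0 for r in rejected]

# Without the penalty, adding a constant to every reward leaves the loss unchanged.
base = centered_reward_loss(chosen, rejected)
base_shifted = centered_reward_loss(shifted_chosen, shifted_rejected)

# With the penalty, the shifted rewards are more expensive than the original ones.
centered = centered_reward_loss(chosen, rejected, 0.01)
centered_shifted = centered_reward_loss(shifted_chosen, shifted_rejected, 0.01)
```

The pairwise term only sees reward differences, so it cannot pin down the absolute scale of the rewards on its own. The penalty breaks that invariance, which is why a small coefficient such as the `0.01` used in this config can pull rewards toward zero mean without dominating the preference term.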