# RLHF (Beta)

## Overview

Reinforcement Learning from Human Feedback (RLHF) is a method whereby a language model is optimized using data collected from human feedback. Methods include, but are not limited to:

- Proximal Policy Optimization (PPO) (not yet supported in axolotl)
- Direct Preference Optimization (DPO) (the objective is sketched after this list)
- Identity Preference Optimization (IPO)
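
For illustration only (this is the standard DPO objective as stated in the literature, not anything axolotl-specific), DPO tunes the policy $\pi_\theta$ against a frozen reference model $\pi_{\mathrm{ref}}$ on preference pairs $(x, y_w, y_l)$ consisting of a prompt, a chosen response, and a rejected response:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
$$

Roughly speaking, IPO replaces the log-sigmoid term with a squared loss, which regularizes the policy more strongly toward the reference model.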

## RLHF using Axolotl

> [!IMPORTANT]
> This is a BETA feature and many features are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.

The various RL training methods are implemented in `trl` and wrapped via axolotl. Below are examples of how you can use various preference datasets to train models that use ChatML.

### DPO

```yaml
rl: true
datasets:
  - path: Intel/orca_dpo_pairs
    split: train
    type: intel_apply_chatml
  - path: argilla/ultrafeedback-binarized-preferences
    split: train
    type: argilla_apply_chatml
```
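
Assuming the snippet above is embedded in an otherwise complete axolotl config (base model, tokenizer, sequence length, and so on) saved as, for example, `dpo.yml` (a hypothetical filename), training is launched the same way as any other axolotl run, e.g. `accelerate launch -m axolotl.cli.train dpo.yml`.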

### IPO

```yaml
rl: ipo
```
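
As a minimal sketch (assuming you want to reuse the same preference datasets shown in the DPO example above, with the rest of your usual axolotl config filled in), an IPO run pairs `rl: ipo` with a `datasets` block such as:

```yaml
# Minimal sketch: IPO consumes the same preference-pair datasets as DPO;
# only the rl setting changes. Base model, tokenizer, sequence_len, etc.
# are assumed to be configured as in any other axolotl run.
rl: ipo
datasets:
  - path: Intel/orca_dpo_pairs
    split: train
    type: intel_apply_chatml
  - path: argilla/ultrafeedback-binarized-preferences
    split: train
    type: argilla_apply_chatml
```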