Process reward models (#2241)

* adding model_cfg to set num_labels
* using a num_labels field instead
* linting
* WIP stepwise prompt tokenizer
* this should work?
* trainer working?
* pushing to runpod
* fixing saving
* updating conf
* updating config, adding docs
* adding stepwise supervision docpage
* updating tests
* adding test for dataset
* fixing tests
* linting
* addressing some comments
* adding additional cfg fields support
* updating tests, fixing cfg
* fixing tests
* updating loss
* Update test_process_reward_model_smollm2.py
* updating loss values and seed
* dumb pre-commit
2025-01-29 05:08:33 +00:00
parent c071a530f7
commit 54dd7abfc1
17 changed files with 542 additions and 25 deletions
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
    hooks:
      - id: isort
 -   repo: https://github.com/PyCQA/flake8
-    rev: 6.0.0
+    rev: 6.1.0
    hooks:
    - id: flake8
 -   repo: https://github.com/PyCQA/pylint
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -187,6 +187,12 @@ rl:
 # whether to perform weighting if doing DPO training. Boolean.
 dpo_use_weighting:
 # reward modelling: `True` or `False`
 reward_model:
 # process reward modelling: `True` or `False`
 process_reward_model:
 # The name of the chat template to use for training, following values are supported:
 # - tokenizer_default: Uses the chat template that is available in the tokenizer_config.json. If the chat template is not available in the tokenizer, it will raise an error. This is the default value.
 # - alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates are available in the axolotl codebase at src/axolotl/utils/chat_templates.py
--- a/docs/dataset-formats/stepwise_supervised.qmd
+++ b/docs/dataset-formats/stepwise_supervised.qmd
@@ -0,0 +1,18 @@
 ---
 title: Stepwise Supervised Format
 description: Format for datasets with stepwise completions and labels
 order: 3
 ---
 ## Stepwise Supervised
 The stepwise supervised format is designed for chain-of-thought (COT) reasoning datasets where each example contains multiple completion steps and a preference label for each step.
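As a rough illustration of the shape this format implies (one label per completion step), the following sketch checks a single entry. `check_stepwise_entry` is a hypothetical helper written for this page, not part of axolotl, and the sample entry is made up.

```python
from typing import Any, Dict


def check_stepwise_entry(entry: Dict[str, Any]) -> None:
    """Raise ValueError if the entry does not follow the stepwise supervised layout."""
    # All three columns are expected to be present.
    for key in ("prompt", "completions", "labels"):
        if key not in entry:
            raise ValueError(f"missing field: {key}")
    if not isinstance(entry["prompt"], str):
        raise ValueError("prompt must be a string")
    # Every completion step needs exactly one label.
    if len(entry["completions"]) != len(entry["labels"]):
        raise ValueError("completions and labels must have the same length")


# A well-formed (hypothetical) entry passes silently.
check_stepwise_entry(
    {"prompt": "What is 2 + 2?", "completions": ["2 + 2 = 4."], "labels": [True]}
)
```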
 ### Example
 Here's a simple example of a stepwise supervised dataset entry:
 ```json
 {
  "prompt": "Which number is larger, 9.8 or 9.11?",
  "completions": [
    "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
    "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8."
  ],
  "labels": [true, false]
 }
 ```
--- a/docs/reward_modelling.qmd
+++ b/docs/reward_modelling.qmd
@@ -0,0 +1,47 @@
 ---
 title: "Reward Modelling"
 description: "Reward models are used to guide models towards behaviors that are preferred by humans, by training over large datasets annotated with human preferences."
 ---
 ### Overview
 Reward modelling is a technique used to train models to predict the reward or value of a given input. This is particularly useful in reinforcement learning scenarios where the model needs to evaluate the quality of its actions or predictions.
 We support the reward modelling techniques supported by `trl`.
 ### (Outcome) Reward Models
 Outcome reward models are trained using data which contains preference annotations for an entire interaction between the user and model (i.e. rather than per-turn or per-step).
 ```yaml
 base_model: google/gemma-2-2b
 model_type: AutoModelForSequenceClassification
 num_labels: 1
 tokenizer_type: AutoTokenizer
 reward_model: true
 chat_template: gemma
 datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template
 val_set_size: 0.1
 eval_steps: 100
 ```
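The `bradley_terry` dataset type refers to the Bradley-Terry preference model, under which a pairwise reward loss is commonly written as `-log(sigmoid(r_chosen - r_rejected))`. The sketch below only illustrates that formula on plain floats; it is not the code path used by axolotl or `trl`, and `bradley_terry_loss` is a hypothetical name.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response is scored further above the rejected one.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

With `num_labels: 1`, the sequence classification head emits a single scalar per sequence, which plays the role of the reward in this formula.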
 ### Process Reward Models (PRM)
 Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
 ```yaml
 base_model: Qwen/Qwen2.5-3B
 model_type: AutoModelForTokenClassification
 num_labels: 2
 process_reward_model: true
 datasets:
  - path: trl-lib/math_shepherd
    type: stepwise_supervised
    split: train
 val_set_size: 0.1
 eval_steps: 100
 ```
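To make the per-step supervision concrete, the sketch below lays out token-level labels roughly the way the `stepwise_supervised` strategy in this change does: a separator is appended to every step, the step's label is placed on its last token, and all other positions (including the prompt) are set to `-100` so that they are ignored. The token ids are made up and `layout_step_labels` is a hypothetical name; truncation and BOS handling are left out.

```python
from itertools import chain
from typing import Dict, List

IGNORE_INDEX = -100  # label value for positions that carry no supervision


def layout_step_labels(
    prompt_ids: List[int],
    steps_ids: List[List[int]],
    step_labels: List[int],
    separator_ids: List[int],
) -> Dict[str, List[int]]:
    # Append the separator to every step; the label sits on the step's last token.
    steps = [ids + separator_ids for ids in steps_ids]
    labels = [
        [IGNORE_INDEX] * (len(ids) - 1) + [label]
        for ids, label in zip(steps, step_labels)
    ]
    input_ids = prompt_ids + list(chain(*steps))
    full_labels = [IGNORE_INDEX] * len(prompt_ids) + list(chain(*labels))
    return {"input_ids": input_ids, "labels": full_labels}


example = layout_step_labels(
    prompt_ids=[1, 2, 3],
    steps_ids=[[10, 11], [20, 21, 22]],
    step_labels=[1, 0],
    separator_ids=[99],
)
print(example["labels"])
# [-100, -100, -100, -100, -100, 1, -100, -100, -100, 0]
```

With `num_labels: 2`, the token classification head then predicts one of two classes at each labelled separator position.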
--- a/examples/gemma2/reward-model.yaml
+++ b/examples/gemma2/reward-model.yaml
@@ -1,6 +1,7 @@
 base_model: google/gemma-2-2b
 # optionally might have model_type or tokenizer_type
 model_type: AutoModelForSequenceClassification
 num_labels: 1
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
--- a/examples/qwen2/prm.yaml
+++ b/examples/qwen2/prm.yaml
@@ -0,0 +1,72 @@
 base_model: Qwen/Qwen2.5-3B
 # optionally might have model_type or tokenizer_type
 model_type: AutoModelForTokenClassification
 num_labels: 2
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 load_in_8bit: false
 load_in_4bit: false
 strict: false
 process_reward_model: true
 chat_template:
 datasets:
  - path: trl-lib/math_shepherd
    type: stepwise_supervised
    step_separator: "\n"
    max_completion_length:
    train_on_last_step_only: false
 val_set_size: 0.2
 output_dir: ./outputs/out
 remove_unused_columns: false
 sequence_len: 2048
 sample_packing: false
 eval_sample_packing: false
 pad_to_sequence_len: true
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 1
 micro_batch_size: 8
 eval_batch_size: 8
 num_epochs: 1
 optimizer: adamw_torch
 lr_scheduler: cosine
 learning_rate: 0.0002
 train_on_inputs: false
 group_by_length: false
 bf16: true
 fp16:
 tf32:
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 early_stopping_patience:
 resume_from_checkpoint:
 local_rank:
 logging_steps: 1
 xformers_attention:
 flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:
 eval_table_size:
 eval_max_new_tokens: 128
 eval_steps: 100
 saves_per_epoch: 1
 debug:
 deepspeed:
 weight_decay: 0.0
 fsdp:
 fsdp_config:
 special_tokens:
--- a/examples/qwen2/reward-model.yaml
+++ b/examples/qwen2/reward-model.yaml
@@ -0,0 +1,67 @@
 base_model:  Qwen/Qwen2.5-0.5B
 # optionally might have model_type or tokenizer_type
 model_type: AutoModelForSequenceClassification
 num_labels: 1
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 load_in_8bit: false
 load_in_4bit: false
 strict: false
 reward_model: true
 chat_template: qwen_25
 datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template
 val_set_size: 0.0
 output_dir: ./outputs/out
 remove_unused_columns: false
 sequence_len: 2048
 sample_packing: false
 eval_sample_packing: false
 pad_to_sequence_len: true
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 4
 micro_batch_size: 2
 num_epochs: 4
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002
 train_on_inputs: false
 group_by_length: false
 bf16: true
 fp16:
 tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 early_stopping_patience:
 resume_from_checkpoint:
 local_rank:
 logging_steps: 1
 xformers_attention:
 flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:
 eval_table_size:
 eval_max_new_tokens: 128
 saves_per_epoch: 1
 debug:
 deepspeed:
 weight_decay: 0.0
 fsdp:
 fsdp_config:
 special_tokens:
--- a/src/axolotl/core/trainer_builder.py
+++ b/src/axolotl/core/trainer_builder.py
@@ -44,6 +44,8 @@ from trl import (
    KTOTrainer,
    ORPOConfig,
    ORPOTrainer,
    PRMConfig,
    PRMTrainer,
    RewardConfig,
    RewardTrainer,
 )
@@ -342,6 +344,13 @@ class AxolotlRewardConfig(AxolotlTrainingMixins, RewardConfig):
    """
@dataclass
 class AxolotlPRMConfig(AxolotlTrainingMixins, PRMConfig):
    """
    PRM config for PRM training
    """
 class SchedulerMixin(Trainer):
    """
    Mixin class for scheduler setup in CausalTrainer.
@@ -1244,6 +1253,14 @@ class AxolotlRewardTrainer(SchedulerMixin, RewardTrainer):
    tag_names = ["axolotl", "reward"]
 class AxolotlPRMTrainer(SchedulerMixin, PRMTrainer):
    """
    Extend the base trl.PRMTrainer for axolotl helpers
    """
    tag_names = ["axolotl", "prm"]
 class TrainerBuilderBase(abc.ABC):
    """
    Base class for trainer builder
@@ -1377,7 +1394,8 @@ class TrainerBuilderBase(abc.ABC):
 class HFCausalTrainerBuilder(TrainerBuilderBase):
    """
-    Build the HuggingFace training args/trainer for Causal models
+    Build the HuggingFace training args/trainer for causal models
    and reward modelling using TRL.
    """
    def get_callbacks(self):
@@ -1452,6 +1470,8 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
            return AxolotlMambaTrainer
        if self.cfg.reward_model:
            return AxolotlRewardTrainer
        if self.cfg.process_reward_model:
            return AxolotlPRMTrainer
        return AxolotlTrainer
    def build(self, total_num_steps):
@@ -1842,11 +1862,13 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
                "accelerator_config"
            ] = self.cfg.accelerator_config
-        training_args_cls = (
-            AxolotlTrainingArguments
-            if not self.cfg.reward_model
-            else AxolotlRewardConfig
-        )
+        if self.cfg.reward_model:
+            training_args_cls = AxolotlRewardConfig
+        elif self.cfg.process_reward_model:
+            training_args_cls = AxolotlPRMConfig
+        else:
            training_args_cls = AxolotlTrainingArguments
        training_args = training_args_cls(  # pylint: disable=unexpected-keyword-arg
            **training_arguments_kwargs,
        )
@@ -1880,9 +1902,9 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if eval_data_collator := self.build_collator(
            training_args, is_eval=True, **data_collator_kwargs
        ):
-            if not self.cfg.reward_model:
+            if not (self.cfg.reward_model or self.cfg.process_reward_model):
                trainer_kwargs["eval_data_collator"] = eval_data_collator
-        if not self.cfg.reward_model:
+        if not (self.cfg.reward_model or self.cfg.process_reward_model):
            trainer_kwargs["bench_data_collator"] = transformers.DataCollatorForSeq2Seq(
                self.tokenizer,
                return_tensors="pt",
@@ -1893,8 +1915,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
            trainer_kwargs["processing_class"] = self.tokenizer
        else:
            trainer_kwargs["tokenizer"] = self.tokenizer
-
-        if (trainer_cls is not AxolotlRewardTrainer) and self.cfg.datasets is not None:
+        if (
+            not (trainer_cls in [AxolotlRewardTrainer, AxolotlPRMTrainer])
            and self.cfg.datasets is not None
        ):
            trainer_kwargs["dataset_tags"] = [
                d["path"] for d in self.cfg.datasets if not Path(d["path"]).is_dir()
            ]
@@ -1984,7 +2008,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
 class HFRLTrainerBuilder(TrainerBuilderBase):
    """
-    Trainer factory class for DPO Trainer
+    Trainer factory class for TRL-based RLHF trainers (e.g. DPO)
    """
    def get_callbacks(self):
--- a/src/axolotl/datasets.py
+++ b/src/axolotl/datasets.py
@@ -52,6 +52,7 @@ class TokenizedPromptDataset(Dataset):
        if self.prompt_tokenizer.supports_batched:
            map_kwargs["batched"] = True
            map_kwargs["batch_size"] = 100
        return dataset.map(
            self.prompt_tokenizer.tokenize_prompt,
            num_proc=num_proc,
--- a/src/axolotl/prompt_strategies/stepwise_supervised.py
+++ b/src/axolotl/prompt_strategies/stepwise_supervised.py
@@ -0,0 +1,116 @@
 """
 Module for stepwise datasets, typically including a prompt and reasoning traces,
 and (optionally) per-step, or per-prompt-trace labels for reward modelling.
 """
 from itertools import chain
 from typing import Dict, List, Optional, Union
 from transformers import BatchEncoding, PreTrainedTokenizer
 from axolotl.prompt_tokenizers import IGNORE_INDEX
 from axolotl.utils.dict import DictDefault
 class StepwiseSupervisedPromptTokenizingStrategy:
    """
    Tokenizing strategy for supervised stepwise datasets, typically used for COT-reasoning.
    These datasets should include the following columns:
    - prompt: the prompt text
    - completions: a list of `n` completion steps
    - labels: a list of `n` labels indicating the "correctness" of each step
    """
    def __init__(
        self,
        tokenizer,
        sequence_len: int = 2048,
        step_separator: str = "\n",
        max_completion_length: Optional[int] = None,
        train_on_last_step_only: bool = False,
    ):
        self.tokenizer = tokenizer
        self.sequence_len = sequence_len
        self.step_separator = step_separator
        self.max_completion_length = max_completion_length
        self.train_on_last_step_only = train_on_last_step_only
    def tokenize_prompt(
        self, prompt: Dict[str, Union[str, List[str]]]
    ) -> BatchEncoding:
        # Inspired by TRL's PRMTrainer
        # https://github.com/huggingface/trl/blob/ed7de87dc766478c024b68f12530d1b0e7c3ff23/trl/trainer/prm_trainer.py#L206
        prompt_ids = self.tokenizer(prompt["prompt"], add_special_tokens=False)[
            "input_ids"
        ]
        completions_ids = [
            self.tokenizer(completion, add_special_tokens=False)["input_ids"]
            for completion in prompt["completions"]
        ]
        # Handle labels
        if self.train_on_last_step_only:
            labels = [IGNORE_INDEX] * (len(prompt["labels"]) - 1) + [
                int(prompt["labels"][-1])
            ]
        else:
            labels = [int(label) for label in prompt["labels"]]
        # Add step separators
        separator_ids = self.tokenizer.encode(
            self.step_separator, add_special_tokens=False
        )
        completions_ids = [completion + separator_ids for completion in completions_ids]
        # Create step-wise labels
        labels = [
            [IGNORE_INDEX] * (len(completion) - 1) + [label]  # type: ignore
            for completion, label in zip(completions_ids, labels)
        ]
        # Join all steps
        completion_ids = list(chain(*completions_ids))
        labels = list(chain(*labels))  # type: ignore
        # Handle max lengths
        if self.max_completion_length:
            completion_ids = completion_ids[: self.max_completion_length]
            labels = labels[: self.max_completion_length]
        # Add BOS token if model has one
        if self.tokenizer.bos_token_id is not None:
            prompt_ids = [self.tokenizer.bos_token_id] + prompt_ids
        # Combine prompt and completion
        input_ids = prompt_ids + completion_ids
        full_labels = [IGNORE_INDEX] * len(prompt_ids) + labels
        # Apply max sequence length
        if self.sequence_len:
            input_ids = input_ids[: self.sequence_len]
            full_labels = full_labels[: self.sequence_len]
        return {
            "input_ids": input_ids,
            "labels": full_labels,
            "attention_mask": [1] * len(input_ids),
        }
    @property
    def supports_batched(self):
        return False
 def load(
    tokenizer: PreTrainedTokenizer,
    cfg: DictDefault,
    ds_cfg: DictDefault,
 ) -> StepwiseSupervisedPromptTokenizingStrategy:
    return StepwiseSupervisedPromptTokenizingStrategy(
        tokenizer,
        cfg.sequence_len,
        step_separator=ds_cfg.get("step_separator", "\n"),
        max_completion_length=ds_cfg.max_completion_length,
        train_on_last_step_only=ds_cfg.get("train_on_last_step_only", False),
    )
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -259,7 +259,7 @@ def train(
                .decode("utf-8")
            }
            if cfg.datasets is not None:
-                if cfg.rl is not None or cfg.reward_model:
+                if cfg.rl is not None or cfg.reward_model or cfg.process_reward_model:
                    dataset_tags = [
                        d["path"] for d in cfg.datasets if not Path(d["path"]).is_dir()
                    ]
--- a/src/axolotl/utils/config/models/input/v0_4_1/init.py
+++ b/src/axolotl/utils/config/models/input/v0_4_1/init.py
@@ -236,6 +236,18 @@ class DPODataset(BaseModel):
    revision: Optional[str] = None
 class StepwiseSupervisedDataset(BaseModel):
    """Stepwise supervised dataset configuration subset"""
    path: Optional[str] = None
    split: Optional[str] = None
    data_files: Optional[List[str]] = None
    revision: Optional[str] = None
    step_separator: Optional[str] = None
    max_completion_length: Optional[int] = None
    train_on_last_step_only: Optional[bool] = None
 class UserDefinedKTOType(BaseModel):
    """User defined typing for KTO"""
@@ -626,12 +638,14 @@ class AxolotlInputConfig(
    rl: Optional[RLType] = None
    reward_model: Optional[bool] = None
    process_reward_model: Optional[bool] = None
    num_labels: Optional[int] = None
    dpo_use_weighting: Optional[
        bool
    ] = None  # whether to use weighting in DPO trainer. If none, default is false in the trainer.
-    datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset], min_length=1)] = None  # type: ignore
+    datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset, StepwiseSupervisedDataset], min_length=1)] = None  # type: ignore
-    test_datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset], min_length=1)] = None  # type: ignore
+    test_datasets: Optional[conlist(Union[SFTDataset, DPODataset, KTODataset, StepwiseSupervisedDataset], min_length=1)] = None  # type: ignore
    shuffle_merged_datasets: Optional[bool] = True
    dataset_prepared_path: Optional[str] = None
    dataset_shard_num: Optional[int] = None
--- a/src/axolotl/utils/data/sft.py
+++ b/src/axolotl/utils/data/sft.py
@@ -8,6 +8,8 @@ from typing import List, Tuple, Union
 from datasets import (
    Dataset,
    DatasetDict,
    Sequence,
    Value,
    concatenate_datasets,
    load_dataset,
    load_from_disk,
@@ -467,6 +469,17 @@ def get_dataset_wrapper(
            dataset,
            **ds_kwargs,
        )
    elif config_dataset.type.startswith("stepwise_supervised"):
        dataset_prompter = UnsupportedPrompter()
        ds_strategy = load(config_dataset.type, tokenizer, cfg, config_dataset)
        # we need to explicitly cast boolean labels to int
        # for compatibility with how trl's PRMTrainer works
        dataset = dataset.cast_column("labels", Sequence(Value("int64")))
        dataset_wrapper = TokenizedPromptDataset(
            ds_strategy,
            dataset,
            **ds_kwargs,
        )
    elif ds_strategy := load(
        config_dataset.type, tokenizer, cfg, config_dataset, processor=processor
    ):
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -138,7 +138,9 @@ def load_model_config(cfg):
    config_kwargs = {}
    if cfg.revision_of_model:
        config_kwargs["revision"] = cfg.revision_of_model
-
+    if cfg.num_labels:
        # num_labels is used to initialize classifier models
        config_kwargs["num_labels"] = cfg.num_labels
    try:
        model_config = AutoConfig.from_pretrained(
            model_config_name,
--- a/tests/e2e/test_process_reward_model_smollm2.py
+++ b/tests/e2e/test_process_reward_model_smollm2.py
@@ -0,0 +1,69 @@
 """
 E2E tests for process reward model w/ SmolLM2
 """
 import logging
 import os
 import unittest
 from axolotl.cli.args import TrainerCliArgs
 from axolotl.common.datasets import load_datasets
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault
 from .utils import check_model_output_exists, check_tensorboard, with_temp_dir
 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
 class TestProcessRewardSmolLM2(unittest.TestCase):
    """
    Test case for SmolLM2 process reward models
    """
    @with_temp_dir
    def test_prm(self, temp_dir):
        # pylint: disable=duplicate-code
        cfg = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "model_type": "AutoModelForTokenClassification",
                "num_labels": 2,
                "process_reward_model": True,
                "sequence_len": 512,
                "val_set_size": 0.0,
                "datasets": [
                    {
                        "path": "trl-lib/math_shepherd",
                        "type": "stepwise_supervised",
                        "step_separator": "\n",
                        "split": "train[:10%]",
                    },
                ],
                "max_steps": 100,
                "num_epochs": 1,
                "micro_batch_size": 4,
                "gradient_accumulation_steps": 1,
                "output_dir": temp_dir,
                "learning_rate": 0.0005,
                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
                "gradient_checkpointing": True,
                "warmup_ratio": 0.1,
                "use_tensorboard": True,
                "special_tokens": {"pad_token": "<|endoftext|>"},
                "seed": 42,
            }
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
        train(cfg=cfg, dataset_meta=dataset_meta)
        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss is too high"
        )
        check_model_output_exists(temp_dir, cfg)
--- a/tests/e2e/test_reward_model_smollm2.py
+++ b/tests/e2e/test_reward_model_smollm2.py
@@ -12,25 +12,25 @@ from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault
-from .utils import check_model_output_exists, with_temp_dir
+from .utils import check_model_output_exists, check_tensorboard, with_temp_dir
 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
-class TestRewardModelLoraLlama(unittest.TestCase):
+class TestRewardModelLoraSmolLM2(unittest.TestCase):
    """
    Test case for SmolLM2 reward models using LoRA
    """
    @with_temp_dir
-    def test_rm_fft(self, temp_dir):
+    def test_rm_lora(self, temp_dir):
        # pylint: disable=duplicate-code
        cfg = DictDefault(
            {
-                "base_model": "JackFram/llama-68m",
+                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "model_type": "AutoModelForSequenceClassification",
-                "tokenizer_type": "LlamaTokenizer",
+                "num_labels": 1,
                "chat_template": "alpaca",
                "reward_model": True,
                "sequence_len": 1024,
@@ -42,16 +42,16 @@ class TestRewardModelLoraLlama(unittest.TestCase):
                "lora_target_linear": True,
                "val_set_size": 0.0,
                "special_tokens": {
-                    "unk_token": "<unk>",
+                    "pad_token": "<|endoftext|>",
                    "bos_token": "<s>",
                    "eos_token": "</s>",
                },
                "datasets": [
                    {
                        "path": "argilla/distilabel-intel-orca-dpo-pairs",
                        "type": "bradley_terry.chat_template",
                        "split": "train[:10%]",
                    },
                ],
                "lora_modules_to_save": ["embed_tokens", "lm_head"],
                "remove_unused_columns": False,
                "max_steps": 10,
                "num_epochs": 1,
@@ -59,10 +59,11 @@ class TestRewardModelLoraLlama(unittest.TestCase):
                "gradient_accumulation_steps": 1,
                "output_dir": temp_dir,
                "learning_rate": 0.00001,
-                "optimizer": "adamw_bnb_8bit",
+                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
                "gradient_checkpointing": True,
                "warmup_ratio": 0.1,
                "use_tensorboard": True,
            }
        )
        normalize_config(cfg)
@@ -70,4 +71,7 @@ class TestRewardModelLoraLlama(unittest.TestCase):
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
        train(cfg=cfg, dataset_meta=dataset_meta)
        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss is too high"
        )
        check_model_output_exists(temp_dir, cfg)
--- a/tests/prompt_strategies/test_stepwise.py
+++ b/tests/prompt_strategies/test_stepwise.py
@@ -0,0 +1,63 @@
 """
 tests for stepwise supervised prompt strategy
 """
 import datasets
 import pytest
 from datasets import Dataset
 from transformers import AutoTokenizer
 from axolotl.datasets import TokenizedPromptDataset
 from axolotl.prompt_strategies.stepwise_supervised import (
    StepwiseSupervisedPromptTokenizingStrategy,
 )
 class TestStepWiseSupervisedPromptTokenizingStrategy:
    """
    Test class for stepwise supervised prompt strategy
    """
    @pytest.fixture()
    def stepwise_supervised_dataset(self):
        # pylint: disable=duplicate-code
        return Dataset.from_list(
            [
                {
                    "prompt": "Which number is larger, 9.8 or 9.11?",
                    "completions": [
                        "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
                        "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8.",
                        "Actually, this is incorrect. In decimal numbers, 0.8 is equal to 0.80, which is larger than 0.11. Therefore, 9.8 is larger than 9.11.",
                    ],
                    "labels": [True, False, False],
                }
            ]
        )
    @pytest.fixture()
    def tokenizer(self):
        return AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
    def test_stepwise_supervised_dataset(self, tokenizer, stepwise_supervised_dataset):
        strategy = StepwiseSupervisedPromptTokenizingStrategy(
            tokenizer,
            sequence_len=2048,
            step_separator="\n",
        )
        stepwise_supervised_dataset = stepwise_supervised_dataset.cast_column(
            "labels", datasets.Sequence(datasets.Value("int64"))
        )
        dataset_wrapper = TokenizedPromptDataset(
            strategy,
            stepwise_supervised_dataset,
            process_count=1,
        )
        labels = dataset_wrapper[0]["labels"]
        # expected labels are:
        # the prompt + first step are ignored, followed by the label for step 1 (True)
        # the second step, and its label (False)
        # the third step, and its label (False)
        expected = [-100] * 47 + [1] + [-100] * 29 + [0] + [-100] * 48 + [0]
        assert labels == expected