use alternate math-hard repo

apply chat template as arg
revision support
2025-01-13 08:46:35 -05:00 · 2025-01-12 17:38:32 -05:00 · 2025-01-12 05:17:03 -05:00 · 2025-01-11 23:18:27 -05:00 · 2025-01-08 08:38:06 -05:00 · 2025-01-07 15:13:18 -05:00
20 changed files with 897 additions and 262 deletions
--- a/README.md
+++ b/README.md
@@ -217,7 +217,7 @@ If you love axolotl, consider sponsoring the project by reaching out directly to
 ---
- [Modal](https://modal.com/) Modal lets you run data/AI jobs in the cloud, by just writing a few lines of Python. Customers use Modal to deploy Gen AI models at large scale, fine-tune LLM models, run protein folding simulations, and much more.
+- [Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl) Modal lets you run data/AI jobs in the cloud, by just writing a few lines of Python. Customers use Modal to deploy Gen AI models at large scale, fine-tune LLM models, run protein folding simulations, and much more.
 ---
--- a/deepspeed_configs/zero1_torch_compile.json
+++ b/deepspeed_configs/zero1_torch_compile.json
@@ -0,0 +1,27 @@
 {
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "compile": {
    "disable": false,
    "backend": "inductor"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
 }
--- a/examples/cloud/modal.yaml
+++ b/examples/cloud/modal.yaml
@@ -0,0 +1,15 @@
 volumes:
  - name: axolotl-data
    mount: /workspace/data
  - name: axolotl-artifacts
    mount: /workspace/artifacts
 secrets:
  - HF_TOKEN
  - WANDB_API_KEY
 branch: cli-cloud-modal
 gpu: h100
 gpu_count: 1
 memory: 128
 timeout: 86400
 timeout_preprocess: 14400
 memory_preprocess: 32
--- a/lm_eval-kd.yaml
+++ b/lm_eval-kd.yaml
@@ -0,0 +1,11 @@
 lm_eval_model: axolotl-ai-co/numina-8b-ep1-exp1
 lm_eval_tasks:
  - leaderboard_math_hard
 lm_eval_batch_size: 64
 apply_chat_template: false
 wandb_project: numina-kd-experiment
 wandb_entity: axolotl-ai
 bf16: true
 flash_attention: true
 output_dir: ./outputs/model-evals-out
--- a/requirements.txt
+++ b/requirements.txt
@@ -25,6 +25,7 @@ hf_transfer
 sentencepiece
 gradio==3.50.2
 modal==0.70.5
 pydantic==2.6.3
 addict
 fire
@@ -53,7 +54,7 @@ zstandard==0.22.0
 fastcore
 # lm eval harness
-lm_eval==0.4.4
+lm_eval==0.4.7
 langdetect==1.0.9
 immutabledict==4.2.0
 antlr4-python3-runtime==4.13.2
@@ -61,4 +62,4 @@ antlr4-python3-runtime==4.13.2
 torchao==0.7.0
 schedulefree==1.3.0
-axolotl-contribs-lgpl==0.0.1b2
+axolotl-contribs-lgpl==0.0.2
--- a/scripts/motd
+++ b/scripts/motd
@@ -1,10 +1,15 @@
-                                 dP            dP   dP
+     #@@ #@@      @@# @@#
-                                 88            88   88
+    @@  @@          @@  @@           =@@#                               @@                 #@    =@@#.
-      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
+    @@    #@@@@@@@@@    @@           #@#@=                              @@                 #@     .=@@
-      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
+      #@@@@@@@@@@@@@@@@@            =@# @#     ##=     ##    =####=+    @@      =#####+  =#@@###.   @@
-      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
+    @@@@@@@@@@/  +@@/  +@@          #@  =@=     #@=   @@   =@#+  +#@#   @@    =@#+  +#@#   #@.      @@
-      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP
+    @@@@@@@@@@  ##@@  ##@@         =@#   @#      =@# @#    @@      @@   @@    @@      #@   #@       @@
     @@@@@@@@@@@@@@@@@@@@          #@=+++#@=      =@@#     @@      @@   @@    @@      #@   #@       @@
                                  =@#=====@@     =@# @#    @@      @@   @@    @@      #@   #@       @@
    @@@@@@@@@@@@@@@@  @@@@        #@      #@=   #@=  +@@   #@#    =@#   @@.   =@#    =@#   #@.      @@
                                 =@#       @#  #@=     #@   =#@@@@#=    +#@@=  +#@@@@#=    .##@@+   @@
    @@@@  @@@@@@@@@@@@@@@@
 Welcome to the axolotl cloud image! If the you've mounted a disk to /workspace and the axolotl directory ie empty, run the following commands:
--- a/src/axolotl/cli/cloud/init.py
+++ b/src/axolotl/cli/cloud/init.py
@@ -0,0 +1,56 @@
 """
 launch axolotl in supported cloud platforms
 """
 from pathlib import Path
 from typing import Union
 import yaml
 from axolotl.cli import print_axolotl_text_art
 from axolotl.cli.cloud.modal_ import ModalCloud
 from axolotl.utils.dict import DictDefault
 def load_cloud_cfg(cloud_config: Union[Path, str]) -> DictDefault:
    """Load and validate cloud configuration."""
    # Load cloud configuration.
    with open(cloud_config, encoding="utf-8") as file:
        cloud_cfg: DictDefault = DictDefault(yaml.safe_load(file))
    return cloud_cfg
 def do_cli_preprocess(
    cloud_config: Union[Path, str],
    config: Union[Path, str] = Path("examples/"),
 ) -> None:
    print_axolotl_text_art()
    cloud_cfg = load_cloud_cfg(cloud_config)
    cloud = ModalCloud(cloud_cfg)
    with open(config, "r", encoding="utf-8") as file:
        config_yaml = file.read()
    cloud.preprocess(config_yaml)
 def do_cli_train(
    cloud_config: Union[Path, str],
    config: Union[Path, str] = Path("examples/"),
    accelerate: bool = True,
 ) -> None:
    print_axolotl_text_art()
    cloud_cfg = load_cloud_cfg(cloud_config)
    cloud = ModalCloud(cloud_cfg)
    with open(config, "r", encoding="utf-8") as file:
        config_yaml = file.read()
    cloud.train(config_yaml, accelerate=accelerate)
 def do_cli_lm_eval(
    cloud_config: Union[Path, str],
    config: Union[Path, str] = Path("examples/"),
 ) -> None:
    print_axolotl_text_art()
    cloud_cfg = load_cloud_cfg(cloud_config)
    cloud = ModalCloud(cloud_cfg)
    with open(config, "r", encoding="utf-8") as file:
        config_yaml = file.read()
    cloud.lm_eval(config_yaml)
--- a/src/axolotl/cli/cloud/base.py
+++ b/src/axolotl/cli/cloud/base.py
@@ -0,0 +1,18 @@
 """
 base class for cloud platforms from cli
 """
 from abc import ABC, abstractmethod
 class Cloud(ABC):
    """
    Abstract base class for cloud platforms.
    """
    @abstractmethod
    def preprocess(self, config_yaml: str, *args, **kwargs) -> None:
        pass
    @abstractmethod
    def train(self, config_yaml: str, accelerate: bool = True) -> str:
        pass
--- a/src/axolotl/cli/cloud/modal_.py
+++ b/src/axolotl/cli/cloud/modal_.py
@@ -0,0 +1,272 @@
 """
 Modal Cloud support from CLI
 """
 import copy
 import json
 import os
 import subprocess  # nosec B404
 from pathlib import Path
 from random import randint
 import modal
 from axolotl.cli.cloud.base import Cloud
 def run_cmd(cmd: str, run_folder: str, volumes=None):
    """Run a command inside a folder, with Modal Volume reloading before and commit on success."""
    # Ensure volumes contain latest files.
    if volumes:
        for _, vol in volumes.items():
            vol.reload()
    # modal workaround so it doesn't use the automounted axolotl
    new_env = copy.deepcopy(os.environ)
    if "PYTHONPATH" in new_env:
        del new_env["PYTHONPATH"]
    # Propagate errors from subprocess.
    if exit_code := subprocess.call(  # nosec B603
        cmd.split(), cwd=run_folder, env=new_env
    ):
        exit(exit_code)  # pylint: disable=consider-using-sys-exit
    # Commit writes to volume.
    if volumes:
        for _, vol in volumes.items():
            vol.commit()
 class ModalCloud(Cloud):
    """
    Modal Cloud implementation.
    """
    def __init__(self, config, app=None):
        self.config = config
        if not app:
            app = modal.App()
        self.app = app
        self.volumes = {}
        if config.volumes:
            for volume_config in config.volumes:
                _, mount, vol = self.create_volume(volume_config)
                self.volumes[mount] = (vol, volume_config)
    def get_env(self):
        res = {
            "HF_DATASETS_CACHE": "/workspace/data/huggingface-cache/datasets",
            "HF_HUB_CACHE": "/workspace/data/huggingface-cache/hub",
        }
        for key in self.config.get("env", []):
            if isinstance(key, str):
                if val := os.environ.get(key, ""):
                    res[key] = val
            elif isinstance(key, dict):
                (key_, val) = list(key.items())[0]
                res[key_] = val
        return res
    def get_image(self):
        docker_tag = "main-py3.11-cu124-2.5.1"
        if self.config.docker_tag:
            docker_tag = self.config.docker_tag
        docker_image = f"axolotlai/axolotl:{docker_tag}"
        # grab the sha256 hash from docker hub for this image+tag
        # this ensures that we always get the latest image for this tag, even if it's already cached
        try:
            manifest = subprocess.check_output(  # nosec B602
                f"docker manifest inspect {docker_image}",
                shell=True,
            ).decode("utf-8")
            sha256_hash = json.loads(manifest)["manifests"][0]["digest"]
        except subprocess.CalledProcessError:
            sha256_hash = None
        # create the image
        if sha256_hash:
            image = modal.Image.from_registry(f"axolotlai/axolotl@{sha256_hash}")
        else:
            image = modal.Image.from_registry(docker_image)
        # branch
        if self.config.branch:
            image = image.dockerfile_commands(
                [
                    # Random id for cache busting of branch commits
                    f"RUN echo '{str(randint(0, 1000000))}'",  # nosec B311
                    f"RUN cd /workspace/axolotl && git fetch && git checkout {self.config.branch}",
                    "RUN cd /workspace/ && git clone https://github.com/winglian/lm-evaluation-harness.git && cd lm-evaluation-harness && pip install -e .[math]",
                ]
            )
        if env := self.get_env():
            image = image.env(env)
        image = image.pip_install("fastapi==0.110.0", "pydantic==2.6.3")
        return image
    def get_secrets(self):
        res = []
        if self.config.secrets:
            for key in self.config.get("secrets", []):
                # pylint: disable=duplicate-code
                if isinstance(key, str):
                    if val := os.environ.get(key, ""):
                        res.append(modal.Secret.from_dict({key: val}))
                elif isinstance(key, dict):
                    (key_, val) = list(key.items())[0]
                    res.append(modal.Secret.from_dict({key_: val}))
        return res
    def create_volume(self, volume_config):
        name = volume_config.name
        mount = volume_config.mount
        return name, mount, modal.Volume.from_name(name, create_if_missing=True)
    def get_ephemeral_disk_size(self):
        return 1000 * 525  # 1 TiB
    def get_preprocess_timeout(self):
        if self.config.timeout_preprocess:
            return int(self.config.timeout_preprocess)
        return 60 * 60 * 3  # 3 hours
    def get_preprocess_memory(self):
        memory = 128  # default to 128GiB
        if self.config.memory:
            memory = int(self.config.memory)
        if self.config.memory_preprocess:
            memory = int(self.config.memory_preprocess)
        return 1024 * memory
    def get_preprocess_env(self):
        return self.app.function(
            image=self.get_image(),
            volumes={k: v[0] for k, v in self.volumes.items()},
            cpu=8.0,
            ephemeral_disk=self.get_ephemeral_disk_size(),
            memory=self.get_preprocess_memory(),
            timeout=self.get_preprocess_timeout(),
            secrets=self.get_secrets(),
        )
    def preprocess(self, config_yaml: str, *args, **kwargs):
        modal_fn = self.get_preprocess_env()(_preprocess)
        with modal.enable_output():
            with self.app.run(detach=True):
                modal_fn.remote(
                    config_yaml,
                    volumes={k: v[0] for k, v in self.volumes.items()},
                    *args,
                    **kwargs,
                )
    def get_train_timeout(self):
        if self.config.timeout:
            return int(self.config.timeout)
        return 60 * 60 * 24  # 24 hours
    def get_train_gpu(self):  # pylint: disable=too-many-return-statements
        count = self.config.gpu_count or 1
        family = self.config.gpu.lower() or "l40s"
        if family == "l40s":
            return modal.gpu.L40S(count=count)
        if family == "a100":
            return modal.gpu.A100(count=count, size="40GB")
        if family == "a100-80gb":
            return modal.gpu.A100(count=count, size="80GB")
        if family in ["a10", "a10g"]:
            return modal.gpu.A10G(count=count)
        if family == "h100":
            return modal.gpu.H100(count=count)
        if family == "t4":
            return modal.gpu.T4(count=count)
        if family == "l4":
            return modal.gpu.L4(count=count)
        raise ValueError(f"Unsupported GPU family: {family}")
    def get_train_memory(self):
        memory = 128  # default to 128GiB
        if self.config.memory:
            memory = int(self.config.memory)
        return 1024 * memory
    def get_train_env(self):
        return self.app.function(
            image=self.get_image(),
            volumes={k: v[0] for k, v in self.volumes.items()},
            cpu=16.0,
            gpu=self.get_train_gpu(),
            memory=self.get_train_memory(),
            timeout=self.get_train_timeout(),
            secrets=self.get_secrets(),
        )
    def train(self, config_yaml: str, accelerate: bool = True):
        modal_fn = self.get_train_env()(_train)
        with modal.enable_output():
            with self.app.run(detach=True):
                modal_fn.remote(
                    config_yaml,
                    accelerate=accelerate,
                    volumes={k: v[0] for k, v in self.volumes.items()},
                )
    def lm_eval(self, config_yaml: str):
        modal_fn = self.get_train_env()(_lm_eval)
        with modal.enable_output():
            with self.app.run(detach=True):
                modal_fn.remote(
                    config_yaml,
                    volumes={k: v[0] for k, v in self.volumes.items()},
                )
 def _preprocess(config_yaml: str, volumes=None):
    Path("/workspace/artifacts/axolotl").mkdir(parents=True, exist_ok=True)
    with open(
        "/workspace/artifacts/axolotl/config.yaml", "w", encoding="utf-8"
    ) as f_out:
        f_out.write(config_yaml)
    run_folder = "/workspace/artifacts/axolotl"
    run_cmd(
        "axolotl preprocess /workspace/artifacts/axolotl/config.yaml --dataset-processes=8",
        run_folder,
        volumes,
    )
 def _train(config_yaml: str, accelerate: bool = True, volumes=None):
    with open(
        "/workspace/artifacts/axolotl/config.yaml", "w", encoding="utf-8"
    ) as f_out:
        f_out.write(config_yaml)
    run_folder = "/workspace/artifacts/axolotl"
    if accelerate:
        accelerate_args = "--accelerate"
    else:
        accelerate_args = "--no-accelerate"
    run_cmd(
        f"axolotl train {accelerate_args} /workspace/artifacts/axolotl/config.yaml",
        run_folder,
        volumes,
    )
 def _lm_eval(config_yaml: str, volumes=None):
    with open(
        "/workspace/artifacts/axolotl/config.yaml", "w", encoding="utf-8"
    ) as f_out:
        f_out.write(config_yaml)
    run_folder = "/workspace/artifacts/axolotl"
    run_cmd(
        "axolotl lm-eval /workspace/artifacts/axolotl/config.yaml",
        run_folder,
        volumes,
    )
--- a/src/axolotl/cli/main.py
+++ b/src/axolotl/cli/main.py
@@ -13,6 +13,7 @@ from axolotl.cli.utils import (
    fetch_from_github,
 )
 from axolotl.common.cli import EvaluateCliArgs, PreprocessCliArgs, TrainerCliArgs
 from axolotl.integrations.lm_eval.cli import lm_eval
 from axolotl.utils import set_pytorch_cuda_alloc_conf
 from axolotl.utils.config.models.input.v0_4_1 import AxolotlInputConfig
@@ -25,15 +26,21 @@ def cli():
@cli.command()
@click.argument("config", type=click.Path(exists=True, path_type=str))
@click.option("--cloud", default=None, type=click.Path(exists=True, path_type=str))
@add_options_from_dataclass(PreprocessCliArgs)
@add_options_from_config(AxolotlInputConfig)
-def preprocess(config: str, **kwargs):
+def preprocess(config: str, cloud: Optional[str] = None, **kwargs):
    """Preprocess datasets before training."""
    kwargs = {k: v for k, v in kwargs.items() if v is not None}
-    from axolotl.cli.preprocess import do_cli
+    if cloud:
        from axolotl.cli.cloud import do_cli_preprocess
-    do_cli(config=config, **kwargs)
+        do_cli_preprocess(cloud_config=cloud, config=config)
    else:
        from axolotl.cli.preprocess import do_cli
        do_cli(config=config, **kwargs)
@cli.command()
@@ -43,25 +50,33 @@ def preprocess(config: str, **kwargs):
    default=True,
    help="Use accelerate launch for multi-GPU training",
 )
@click.option("--cloud", default=None, type=click.Path(exists=True, path_type=str))
@add_options_from_dataclass(TrainerCliArgs)
@add_options_from_config(AxolotlInputConfig)
-def train(config: str, accelerate: bool, **kwargs):
+def train(config: str, accelerate: bool, cloud: Optional[str], **kwargs):
    """Train or fine-tune a model."""
    kwargs = {k: v for k, v in kwargs.items() if v is not None}
    # Enable expandable segments for cuda allocation to improve VRAM usage
    set_pytorch_cuda_alloc_conf()
    from axolotl.cli.cloud import do_cli_train
    if accelerate:
-        base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.train"]
+        if cloud:
-        if config:
+            do_cli_train(cloud_config=cloud, config=config, accelerate=True)
-            base_cmd.append(config)
+        else:
-        cmd = build_command(base_cmd, kwargs)
+            base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.train"]
-        subprocess.run(cmd, check=True)  # nosec B603
+            if config:
                base_cmd.append(config)
            cmd = build_command(base_cmd, kwargs)
            subprocess.run(cmd, check=True)  # nosec B603
    else:
-        from axolotl.cli.train import do_cli
+        if cloud:
            do_cli_train(cloud_config=cloud, config=config, accelerate=False)
        else:
            from axolotl.cli.train import do_cli
-        do_cli(config=config, **kwargs)
+            do_cli(config=config, **kwargs)
@cli.command()
@@ -93,7 +108,7 @@ def evaluate(config: str, accelerate: bool, **kwargs):
@click.argument("config", type=click.Path(exists=True, path_type=str))
@click.option(
    "--accelerate/--no-accelerate",
-    default=True,
+    default=False,
    help="Use accelerate launch for multi-GPU inference",
 )
@click.option(
@@ -124,7 +139,7 @@ def inference(
    if lora_model_dir:
        kwargs["lora_model_dir"] = lora_model_dir
    if base_model:
-        kwargs["output_dir"] = base_model
+        kwargs["base_model"] = base_model
    if accelerate:
        base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.inference"]
@@ -254,6 +269,9 @@ def fetch(directory: str, dest: Optional[str]):
    fetch_from_github(f"{directory}/", dest)
 cli.add_command(lm_eval)
 def main():
    cli()
--- a/src/axolotl/core/trainer_builder.py
+++ b/src/axolotl/core/trainer_builder.py
@@ -56,6 +56,7 @@ from axolotl.monkeypatch.relora import ReLoRACallback, ReLoRAScheduler
 from axolotl.utils import is_comet_available, is_mlflow_available
 from axolotl.utils.callbacks import (
    EvalFirstStepCallback,
    GCCallback,
    GPUStatsCallback,
    LossWatchDogCallback,
    SaveAxolotlConfigtoWandBCallback,
@@ -67,7 +68,7 @@ from axolotl.utils.callbacks import (
 )
 from axolotl.utils.callbacks.lisa import lisa_callback_factory
 from axolotl.utils.callbacks.profiler import PytorchProfilerCallback
-from axolotl.utils.chat_templates import get_chat_template
+from axolotl.utils.chat_templates import get_chat_template_from_config
 from axolotl.utils.collators import (
    BatchSamplerDataCollatorForSeq2Seq,
    DataCollatorForSeq2Seq,
@@ -1452,6 +1453,8 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.loss_watchdog_threshold is not None:
            callbacks.append(LossWatchDogCallback(self.cfg))
        if self.cfg.gc_steps:
            callbacks.append(GCCallback(gc_steps=self.cfg.gc_steps))
        callbacks.append(SaveModelCallback())
        return callbacks
@@ -1831,8 +1834,8 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        training_arguments_kwargs["model_type"] = self.cfg.model_config_type
        training_arguments_kwargs["pretraining"] = bool(self.cfg.pretraining_dataset)
        if self.cfg.chat_template:
-            training_arguments_kwargs["chat_template"] = get_chat_template(
+            training_arguments_kwargs["chat_template"] = get_chat_template_from_config(
-                self.cfg.chat_template,
+                cfg=self.cfg,
                tokenizer=self.tokenizer,
            )
--- a/src/axolotl/integrations/lm_eval/init.py
+++ b/src/axolotl/integrations/lm_eval/init.py
@@ -2,9 +2,9 @@
 Module for the Plugin for LM Eval Harness
 """
 import subprocess  # nosec
 from datetime import datetime
 from axolotl.integrations.base import BasePlugin
 from axolotl.integrations.lm_eval.cli import build_lm_eval_command
 from .args import LMEvalArgs  # pylint: disable=unused-import. # noqa: F401
@@ -18,25 +18,19 @@ class LMEvalPlugin(BasePlugin):
        return "axolotl.integrations.lm_eval.LMEvalArgs"
    def post_train_unload(self, cfg):
-        tasks = ",".join(cfg.lm_eval_tasks)
+        if cfg.lm_eval_post_train:
-        fa2 = ",attn_implementation=flash_attention_2" if cfg.flash_attention else ""
+            # pylint: disable=duplicate-code
-        dtype = ",dtype=bfloat16" if cfg.bf16 else ",dtype=float16"
+            for lm_eval_args in build_lm_eval_command(
-        output_path = cfg.output_dir
+                cfg.lm_eval_tasks,
-        output_path += "" if cfg.output_dir.endswith("/") else "/"
+                bfloat16=cfg.bfloat16 or cfg.bf16,
-        output_path += "lm_eval_results/" + datetime.now().strftime("%Y%m%d_%H%M%S")
+                flash_attention=cfg.flash_attention,
-        subprocess.run(  # nosec
+                output_dir=cfg.output_dir,
-            [
+                batch_size=cfg.lm_eval_batch_size,
-                "lm_eval",
+                wandb_project=cfg.wandb_project,
-                "--model",
+                wandb_entity=cfg.wandb_entity,
-                "hf",
+                model=cfg.lm_eval_model or cfg.hub_model_id,
-                "--model_args",
+            ):
-                f"pretrained={cfg.output_dir}{fa2}{dtype}",
+                subprocess.run(  # nosec
-                "--tasks",
+                    lm_eval_args,
-                tasks,
+                    check=True,
-                "--batch_size",
+                )
                str(cfg.lm_eval_batch_size),
                "--output_path",
                output_path,
            ],
            check=True,
        )
--- a/src/axolotl/integrations/lm_eval/args.py
+++ b/src/axolotl/integrations/lm_eval/args.py
@@ -13,3 +13,5 @@ class LMEvalArgs(BaseModel):
    lm_eval_tasks: List[str] = []
    lm_eval_batch_size: Optional[int] = 8
    lm_eval_post_train: Optional[bool] = True
    lm_eval_model: Optional[str] = None
--- a/src/axolotl/integrations/lm_eval/cli.py
+++ b/src/axolotl/integrations/lm_eval/cli.py
@@ -0,0 +1,113 @@
 """
 axolotl CLI for running lm_eval tasks
 """
 import subprocess  # nosec
 from collections import defaultdict
 from datetime import datetime
 from typing import Optional
 import click
 import yaml
 from axolotl.utils.dict import DictDefault
 def build_lm_eval_command(
    tasks: list[str],
    bfloat16=True,
    flash_attention=False,
    output_dir="./",
    batch_size=8,
    wandb_project=None,
    wandb_entity=None,
    model=None,
    revision=None,
    apply_chat_template=None,
    fewshot_as_multiturn=None,
 ):
    tasks_by_num_fewshot: dict[str, list] = defaultdict(list)
    for task in tasks:
        num_fewshot = "-1"
        task_parts = task.split(":")
        task_name = task_parts[0]
        if len(task_parts) == 2:
            task_name, num_fewshot = task_parts
        tasks_by_num_fewshot[str(num_fewshot)].append(task_name)
    for num_fewshot, tasks_list in tasks_by_num_fewshot.items():
        tasks_str = ",".join(tasks_list)
        num_fewshot_val = num_fewshot if num_fewshot != "-1" else None
        pretrained = "pretrained="
        pretrained += model if model else output_dir
        fa2 = ",attn_implementation=flash_attention_2" if flash_attention else ""
        dtype = ",dtype=bfloat16" if bfloat16 else ",dtype=float16"
        revision = f",revision={revision}" if revision else ""
        output_path = output_dir
        output_path += "" if output_dir.endswith("/") else "/"
        output_path += "lm_eval_results/" + datetime.now().strftime("%Y%m%d_%H%M%S")
        lm_eval_args = [
            "lm_eval",
            "--model",
            "hf",
            "--model_args",
            f"{pretrained}{fa2}{dtype}{revision}",
            "--tasks",
            tasks_str,
            "--batch_size",
            str(batch_size),
            "--output_path",
            output_path,
        ]
        wandb_args = []
        if wandb_project:
            wandb_args.append(f"project={wandb_project}")
        if wandb_entity:
            wandb_args.append(f"entity={wandb_entity}")
        if wandb_args:
            lm_eval_args.append("--wandb_args")
            lm_eval_args.append(",".join(wandb_args))
        if apply_chat_template:
            lm_eval_args.append("--apply_chat_template")
        if num_fewshot_val:
            lm_eval_args.append("--num_fewshot")
            lm_eval_args.append(str(num_fewshot_val))
            if apply_chat_template and fewshot_as_multiturn:
                lm_eval_args.append("--fewshot_as_multiturn")
        yield lm_eval_args
@click.command()
@click.argument("config", type=click.Path(exists=True, path_type=str))
@click.option("--cloud", default=None, type=click.Path(exists=True, path_type=str))
 def lm_eval(config: str, cloud: Optional[str] = None):
    """
    use lm eval to evaluate a trained language model
    """
    if cloud:
        from axolotl.cli.cloud import do_cli_lm_eval
        do_cli_lm_eval(cloud_config=cloud, config=config)
    else:
        with open(config, encoding="utf-8") as file:
            cfg: DictDefault = DictDefault(yaml.safe_load(file))
        # pylint: disable=duplicate-code
        for lm_eval_args in build_lm_eval_command(
            cfg.lm_eval_tasks,
            bfloat16=cfg.bfloat16 or cfg.bf16,
            flash_attention=cfg.flash_attention,
            output_dir=cfg.output_dir,
            batch_size=cfg.lm_eval_batch_size,
            wandb_project=cfg.wandb_project,
            wandb_entity=cfg.wandb_entity,
            model=cfg.lm_eval_model or cfg.hub_model_id,
            revision=cfg.revision,
            apply_chat_template=cfg.apply_chat_template,
            fewshot_as_multiturn=cfg.fewshot_as_multiturn,
        ):
            subprocess.run(  # nosec
                lm_eval_args,
                check=True,
            )
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -1,5 +1,6 @@
 """Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
 import inspect
 import os
 import signal
 import sys
@@ -126,7 +127,20 @@ def train(
    )
    if cfg.fix_untrained_tokens:
-        fix_untrained_tokens(model, tokenizer, train_dataset)
+        # check if the `token_ids_to_fix` kwarg exists in the fix_untrained_tokens args
        sig = inspect.signature(fix_untrained_tokens)
        # if the function has the `token_ids_to_fix` arg, and fix_untrained_tokens is a list
        if "token_ids_to_fix" in sig.parameters and isinstance(
            cfg.fix_untrained_tokens, list
        ):
            fix_untrained_tokens(
                model,
                tokenizer,
                train_dataset,
                token_ids_to_fix=cfg.fix_untrained_tokens,
            )
        else:
            fix_untrained_tokens(model, tokenizer, train_dataset)
        if cfg.local_rank == 0:
            model.save_pretrained(
                str(Path(cfg.output_dir)), safe_serialization=safe_serialization
--- a/src/axolotl/utils/callbacks/init.py
+++ b/src/axolotl/utils/callbacks/init.py
@@ -2,6 +2,7 @@
 from __future__ import annotations
 import gc
 import logging
 import math
 import os
@@ -842,3 +843,17 @@ class SaveModelCallback(TrainerCallback):
    ):
        control.should_save = True
        return control
 class GCCallback(TrainerCallback):
    """Callback to garbage collect torch cache"""
    def __init__(self, gc_steps=None):
        self.gc_steps = gc_steps
    def on_step_end(
        self, args, state, control, **kwargs  # pylint: disable=unused-argument
    ):
        if state.global_step % self.gc_steps == 0:
            torch.cuda.empty_cache()
            gc.collect()
--- a/src/axolotl/utils/config/models/input/v0_4_1/init.py
+++ b/src/axolotl/utils/config/models/input/v0_4_1/init.py
@@ -666,6 +666,8 @@ class AxolotlInputConfig(
    loss_watchdog_threshold: Optional[float] = None
    loss_watchdog_patience: Optional[int] = None
    gc_steps: Optional[int] = None
    bf16: Optional[Union[Literal["auto"], bool]] = "auto"
    fp16: Optional[bool] = None
    bfloat16: Optional[bool] = None  # for non-AMP cases
@@ -792,7 +794,7 @@ class AxolotlInputConfig(
    chat_template_jinja: Optional[str] = None
    default_system_message: Optional[str] = None
-    fix_untrained_tokens: Optional[bool] = None
+    fix_untrained_tokens: Optional[Union[int, List[int]]] = None
    # INTERNALS - document for now, generally not set externally
    is_preprocess: Optional[bool] = None
--- a/src/axolotl/utils/data/pretraining.py
+++ b/src/axolotl/utils/data/pretraining.py
@@ -28,8 +28,10 @@ def encode_pretraining(
    )
    # Convert to PyTorch tensors
    input_ids = [torch.tensor(seq) for seq in res["input_ids"]]
    targets = [torch.tensor(seq) for seq in res["input_ids"]]
    attention_mask = [torch.tensor(seq) for seq in res["attention_mask"]]
    new_input_ids = []
    new_labels = []
    new_attention_mask = []
    # Append EOS and PAD tokens to input_ids, and correct attention_mask
    for i, _ in enumerate(input_ids):
@@ -40,22 +42,34 @@ def encode_pretraining(
            ),
            dim=0,
        )
        targets[i] = torch.cat(
            (
                targets[i],
                torch.tensor([tokenizer.eos_token_id, -100]),
            ),
            dim=0,
        )
        attention_mask[i] = torch.cat((attention_mask[i], torch.tensor([1, 0])), dim=0)
    # Concatenate tokens so that their lengths are less than max_tokens
    buffer_input_ids = torch.tensor([], dtype=torch.long)
    buffer_labels = torch.tensor([], dtype=torch.long)
    buffer_attention_mask = torch.tensor([], dtype=torch.long)
-    for ids, mask in zip(input_ids, attention_mask):
+    for ids, labels, mask in zip(input_ids, targets, attention_mask):
        if buffer_input_ids.numel() == max_tokens:
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        elif buffer_input_ids.numel() + ids.numel() <= max_tokens:
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
        else:
            buffer_input_ids = torch.cat(
@@ -69,6 +83,17 @@ def encode_pretraining(
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
@@ -81,11 +106,14 @@ def encode_pretraining(
                dim=0,
            )
            new_input_ids.append(buffer_input_ids)
            new_labels.append(buffer_labels)
            new_attention_mask.append(buffer_attention_mask)
            buffer_input_ids = torch.tensor([], dtype=torch.long)
            buffer_labels = torch.tensor([], dtype=torch.long)
            buffer_attention_mask = torch.tensor([], dtype=torch.long)
            buffer_input_ids = torch.cat((buffer_input_ids, ids), dim=0)
            buffer_labels = torch.cat((buffer_labels, labels), dim=0)
            buffer_attention_mask = torch.cat((buffer_attention_mask, mask), dim=0)
    if buffer_input_ids.numel() > 0:  # for any leftover tokens
@@ -101,6 +129,17 @@ def encode_pretraining(
                ),
                dim=0,
            )
            buffer_labels = torch.cat(
                (
                    buffer_labels,
                    torch.full(
                        (max_tokens - buffer_labels.numel(),),
                        -100,
                        dtype=torch.long,
                    ),
                ),
                dim=0,
            )
            buffer_attention_mask = torch.cat(
                (
                    buffer_attention_mask,
@@ -113,11 +152,12 @@ def encode_pretraining(
                dim=0,
            )
        new_input_ids.append(buffer_input_ids)
        new_labels.append(buffer_labels)
        new_attention_mask.append(buffer_attention_mask)
    ret = {
        "input_ids": [seq.tolist() for seq in new_input_ids],
-        "labels": [seq.tolist() for seq in new_input_ids],
+        "labels": [seq.tolist() for seq in new_labels],
        "attention_mask": [seq.tolist() for seq in new_attention_mask],
    }
--- a/src/axolotl/utils/data/sft.py
+++ b/src/axolotl/utils/data/sft.py
@@ -3,7 +3,7 @@
 import functools
 import logging
 from pathlib import Path
-from typing import List, Optional, Tuple, Union
+from typing import List, Tuple, Union
 from datasets import (
    Dataset,
@@ -12,8 +12,6 @@ from datasets import (
    load_dataset,
    load_from_disk,
 )
 from huggingface_hub import hf_hub_download
 from huggingface_hub.utils import HFValidationError
 from transformers import PreTrainedTokenizerBase
 from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
@@ -42,6 +40,7 @@ from axolotl.prompters import (
    UnsupportedPrompter,
 )
 from axolotl.utils.data.pretraining import wrap_pretraining_dataset
 from axolotl.utils.data.shared import load_dataset_w_config
 from axolotl.utils.data.utils import (
    deduplicate_and_log_datasets,
    md5,
@@ -85,6 +84,7 @@ def prepare_dataset(cfg, tokenizer, processor=None):
                    processor=processor,
                )
    else:
        # Load streaming dataset if pretraining_dataset is given
        path = cfg.pretraining_dataset
        split = "train"
        name = None
@@ -116,7 +116,18 @@ def prepare_dataset(cfg, tokenizer, processor=None):
        )
        # https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230
        train_dataset = train_dataset.with_format("torch")
        # Load eval dataset (non-streaming) if specified
        eval_dataset = None
        if cfg.test_datasets:
            _, eval_dataset, _ = load_prepare_datasets(
                tokenizer,
                cfg,
                DEFAULT_DATASET_PREPARED_PATH,
                split="test",
                processor=processor,
            )
        if cfg.dataset_exact_deduplication:
            LOG.info("Deduplication not available for pretrained datasets")
@@ -243,195 +254,9 @@ def load_tokenized_prepared_datasets(
        # pylint: disable=invalid-name
        for config_dataset in for_d_in_datasets(cfg_datasets):
-            ds: Optional[Union[Dataset, DatasetDict]] = None
+            ds: Union[Dataset, DatasetDict] = load_dataset_w_config(
-            ds_from_hub = False
+                config_dataset, use_auth_token
-            ds_trust_remote_code = config_dataset.trust_remote_code
+            )
            try:
                # this is just a basic check to see if the path is a
                # valid HF dataset that's loadable
                load_dataset(
                    config_dataset.path,
                    name=config_dataset.name,
                    streaming=True,
                    token=use_auth_token,
                    revision=config_dataset.revision,
                    trust_remote_code=ds_trust_remote_code,
                )
                ds_from_hub = True
            except (FileNotFoundError, ConnectionError, HFValidationError, ValueError):
                pass
            ds_from_cloud = False
            storage_options = {}
            remote_file_system = None
            if config_dataset.path.startswith("s3://"):
                try:
                    import aiobotocore.session  # type: ignore
                    import s3fs  # type: ignore
                except ImportError as exc:
                    raise ImportError(
                        "s3:// paths require aiobotocore and s3fs to be installed"
                    ) from exc
                # Takes credentials from ~/.aws/credentials for default profile
                s3_session = aiobotocore.session.AioSession(profile="default")
                storage_options = {"session": s3_session}
                remote_file_system = s3fs.S3FileSystem(**storage_options)
            elif config_dataset.path.startswith(
                "gs://"
            ) or config_dataset.path.startswith("gcs://"):
                try:
                    import gcsfs  # type: ignore
                except ImportError as exc:
                    raise ImportError(
                        "gs:// or gcs:// paths require gcsfs to be installed"
                    ) from exc
                # gcsfs will use default credentials from the environment else anon
                # https://gcsfs.readthedocs.io/en/latest/#credentials
                storage_options = {"token": None}
                remote_file_system = gcsfs.GCSFileSystem(**storage_options)
            # TODO: Figure out how to get auth creds passed
            # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"):
            #     try:
            #         import adlfs
            #     except ImportError as exc:
            #        raise ImportError(
            #            "adl:// or abfs:// paths require adlfs to be installed"
            #        ) from exc
            #     # Gen 1
            #     storage_options = {
            #         "tenant_id": TENANT_ID,
            #         "client_id": CLIENT_ID,
            #         "client_secret": CLIENT_SECRET,
            #     }
            #     # Gen 2
            #     storage_options = {
            #         "account_name": ACCOUNT_NAME,
            #         "account_key": ACCOUNT_KEY,
            #     }
            #     remote_file_system = adlfs.AzureBlobFileSystem(**storage_options)
            try:
                if remote_file_system and remote_file_system.exists(
                    config_dataset.path
                ):
                    ds_from_cloud = True
            except (FileNotFoundError, ConnectionError):
                pass
            # prefer local dataset, even if hub exists
            local_path = Path(config_dataset.path)
            if local_path.exists():
                if local_path.is_dir():
                    if config_dataset.data_files:
                        ds_type = get_ds_type(config_dataset)
                        ds = load_dataset(
                            ds_type,
                            name=config_dataset.name,
                            data_files=config_dataset.data_files,
                            streaming=False,
                            split=None,
                        )
                    else:
                        try:
                            ds = load_from_disk(config_dataset.path)
                        except FileNotFoundError:
                            ds = load_dataset(
                                config_dataset.path,
                                name=config_dataset.name,
                                streaming=False,
                                split=None,
                            )
                elif local_path.is_file():
                    ds_type = get_ds_type(config_dataset)
                    ds = load_dataset(
                        ds_type,
                        name=config_dataset.name,
                        data_files=config_dataset.path,
                        streaming=False,
                        split=None,
                    )
                else:
                    raise ValueError(
                        "unhandled dataset load: local path exists, but is neither a directory or a file"
                    )
            elif ds_from_hub:
                load_ds_kwargs = {}
                if config_dataset.split:
                    load_ds_kwargs["split"] = config_dataset.split
                ds = load_dataset(
                    config_dataset.path,
                    name=config_dataset.name,
                    streaming=False,
                    data_files=config_dataset.data_files,
                    token=use_auth_token,
                    revision=config_dataset.revision,
                    trust_remote_code=config_dataset.trust_remote_code,
                    **load_ds_kwargs,
                )
            elif ds_from_cloud and remote_file_system:
                if remote_file_system.isdir(config_dataset.path):
                    ds = load_from_disk(
                        config_dataset.path,
                        storage_options=storage_options,
                    )
                elif remote_file_system.isfile(config_dataset.path):
                    ds_type = get_ds_type(config_dataset)
                    ds = load_dataset(
                        ds_type,
                        name=config_dataset.name,
                        data_files=config_dataset.path,
                        streaming=False,
                        split=None,
                        storage_options=storage_options,
                        trust_remote_code=config_dataset.trust_remote_code,
                    )
            elif config_dataset.path.startswith("https://"):
                ds_type = get_ds_type(config_dataset)
                ds = load_dataset(
                    ds_type,
                    name=config_dataset.name,
                    data_files=config_dataset.path,
                    streaming=False,
                    split=None,
                    storage_options=storage_options,
                    trust_remote_code=config_dataset.trust_remote_code,
                )
            else:
                if isinstance(config_dataset.data_files, str):
                    fp = hf_hub_download(
                        repo_id=config_dataset.path,
                        repo_type="dataset",
                        filename=config_dataset.data_files,
                        revision=config_dataset.revision,
                    )
                elif isinstance(config_dataset.data_files, list):
                    fp = []
                    for file in config_dataset.data_files:
                        fp.append(
                            hf_hub_download(
                                repo_id=config_dataset.path,
                                repo_type="dataset",
                                filename=file,
                                revision=config_dataset.revision,
                            )
                        )
                else:
                    raise ValueError(
                        "data_files must be either a string or list of strings"
                    )
                ds = load_dataset(
                    "json",
                    name=config_dataset.name,
                    data_files=fp,
                    streaming=False,
                    split=None,
                )
            if not ds:
                raise ValueError("unhandled dataset load")
            d_base_type = d_prompt_style = None
            d_type = config_dataset.type
@@ -501,24 +326,6 @@ def load_tokenized_prepared_datasets(
    return dataset, prompters
 def get_ds_type(config_dataset: DictDefault):
    """
    Get the dataset type from the path if it's not specified
    """
    ds_type = "json"
    if config_dataset.ds_type:
        ds_type = config_dataset.ds_type
    elif ".parquet" in config_dataset.path:
        ds_type = "parquet"
    elif ".arrow" in config_dataset.path:
        ds_type = "arrow"
    elif ".csv" in config_dataset.path:
        ds_type = "csv"
    elif ".txt" in config_dataset.path:
        ds_type = "text"
    return ds_type
 def load_prepare_datasets(
    tokenizer: PreTrainedTokenizerBase,
    cfg,
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -0,0 +1,222 @@
 """
 dataset loading shared utils
 """
 from pathlib import Path
 from typing import Optional, Union
 from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
 from huggingface_hub import hf_hub_download
 from huggingface_hub.errors import HFValidationError
 from axolotl.utils.dict import DictDefault
 def get_ds_type(config_dataset: DictDefault):
    """
    Get the dataset type from the path if it's not specified
    """
    ds_type = "json"
    if config_dataset.ds_type:
        ds_type = config_dataset.ds_type
    elif ".parquet" in config_dataset.path:
        ds_type = "parquet"
    elif ".arrow" in config_dataset.path:
        ds_type = "arrow"
    elif ".csv" in config_dataset.path:
        ds_type = "csv"
    elif ".txt" in config_dataset.path:
        ds_type = "text"
    return ds_type
 def load_dataset_w_config(config_dataset, auth_token):
    # pylint: disable=invalid-name
    ds: Optional[Union[Dataset, DatasetDict]] = None  # pylint: disable=invalid-name
    ds_from_hub = False
    ds_trust_remote_code = config_dataset.trust_remote_code
    try:
        # this is just a basic check to see if the path is a
        # valid HF dataset that's loadable
        load_dataset(
            config_dataset.path,
            name=config_dataset.name,
            streaming=True,
            token=auth_token,
            revision=config_dataset.revision,
            trust_remote_code=ds_trust_remote_code,
        )
        ds_from_hub = True
    except (FileNotFoundError, ConnectionError, HFValidationError, ValueError):
        pass
    ds_from_cloud = False
    storage_options = {}
    remote_file_system = None
    if config_dataset.path.startswith("s3://"):
        try:
            import aiobotocore.session  # type: ignore
            import s3fs  # type: ignore
        except ImportError as exc:
            raise ImportError(
                "s3:// paths require aiobotocore and s3fs to be installed"
            ) from exc
        # Takes credentials from ~/.aws/credentials for default profile
        s3_session = aiobotocore.session.AioSession(profile="default")
        storage_options = {"session": s3_session}
        remote_file_system = s3fs.S3FileSystem(**storage_options)
    elif config_dataset.path.startswith("gs://") or config_dataset.path.startswith(
        "gcs://"
    ):
        try:
            import gcsfs  # type: ignore
        except ImportError as exc:
            raise ImportError(
                "gs:// or gcs:// paths require gcsfs to be installed"
            ) from exc
        # gcsfs will use default credentials from the environment else anon
        # https://gcsfs.readthedocs.io/en/latest/#credentials
        storage_options = {"token": None}
        remote_file_system = gcsfs.GCSFileSystem(**storage_options)
    # TODO: Figure out how to get auth creds passed
    # elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"):
    #     try:
    #         import adlfs
    #     except ImportError as exc:
    #        raise ImportError(
    #            "adl:// or abfs:// paths require adlfs to be installed"
    #        ) from exc
    #     # Gen 1
    #     storage_options = {
    #         "tenant_id": TENANT_ID,
    #         "client_id": CLIENT_ID,
    #         "client_secret": CLIENT_SECRET,
    #     }
    #     # Gen 2
    #     storage_options = {
    #         "account_name": ACCOUNT_NAME,
    #         "account_key": ACCOUNT_KEY,
    #     }
    #     remote_file_system = adlfs.AzureBlobFileSystem(**storage_options)
    try:
        if remote_file_system and remote_file_system.exists(config_dataset.path):
            ds_from_cloud = True
    except (FileNotFoundError, ConnectionError):
        pass
    # prefer local dataset, even if hub exists
    local_path = Path(config_dataset.path)
    if local_path.exists():
        if local_path.is_dir():
            if config_dataset.data_files:
                ds_type = get_ds_type(config_dataset)
                ds = load_dataset(  # pylint: disable=invalid-name
                    ds_type,
                    name=config_dataset.name,
                    data_files=config_dataset.data_files,
                    streaming=False,
                    split=None,
                )
            else:
                try:
                    ds = load_from_disk(
                        config_dataset.path
                    )  # pylint: disable=invalid-name
                except FileNotFoundError:
                    ds = load_dataset(
                        config_dataset.path,
                        name=config_dataset.name,
                        streaming=False,
                        split=None,
                    )
        elif local_path.is_file():
            ds_type = get_ds_type(config_dataset)
            ds = load_dataset(  # pylint: disable=invalid-name
                ds_type,
                name=config_dataset.name,
                data_files=config_dataset.path,
                streaming=False,
                split=None,
            )
        else:
            raise ValueError(
                "unhandled dataset load: local path exists, but is neither a directory or a file"
            )
    elif ds_from_hub:
        load_ds_kwargs = {}
        if config_dataset.split:
            load_ds_kwargs["split"] = config_dataset.split
        ds = load_dataset(
            config_dataset.path,
            name=config_dataset.name,
            streaming=False,
            data_files=config_dataset.data_files,
            token=auth_token,
            revision=config_dataset.revision,
            trust_remote_code=config_dataset.trust_remote_code,
            **load_ds_kwargs,
        )
    elif ds_from_cloud and remote_file_system:
        if remote_file_system.isdir(config_dataset.path):
            ds = load_from_disk(
                config_dataset.path,
                storage_options=storage_options,
            )
        elif remote_file_system.isfile(config_dataset.path):
            ds_type = get_ds_type(config_dataset)
            ds = load_dataset(
                ds_type,
                name=config_dataset.name,
                data_files=config_dataset.path,
                streaming=False,
                split=None,
                storage_options=storage_options,
                trust_remote_code=config_dataset.trust_remote_code,
            )
    elif config_dataset.path.startswith("https://"):
        ds_type = get_ds_type(config_dataset)
        ds = load_dataset(
            ds_type,
            name=config_dataset.name,
            data_files=config_dataset.path,
            streaming=False,
            split=None,
            storage_options=storage_options,
            trust_remote_code=config_dataset.trust_remote_code,
        )
    else:
        if isinstance(config_dataset.data_files, str):
            fp = hf_hub_download(
                repo_id=config_dataset.path,
                repo_type="dataset",
                filename=config_dataset.data_files,
                revision=config_dataset.revision,
            )
        elif isinstance(config_dataset.data_files, list):
            fp = []
            for file in config_dataset.data_files:
                fp.append(
                    hf_hub_download(
                        repo_id=config_dataset.path,
                        repo_type="dataset",
                        filename=file,
                        revision=config_dataset.revision,
                    )
                )
        else:
            raise ValueError("data_files must be either a string or list of strings")
        ds = load_dataset(
            "json",
            name=config_dataset.name,
            data_files=fp,
            streaming=False,
            split=None,
        )
    if not ds:
        raise ValueError("unhandled dataset load")
    return ds
Author	SHA1	Message	Date
Wing Lian	ee20600b9a	use alternate math-hard repo	2025-01-13 08:46:35 -05:00
Wing Lian	fd91de3ea6	apply chat template as arg	2025-01-12 17:38:32 -05:00
Wing Lian	530bf77cf9	revision support	2025-01-12 05:17:03 -05:00
Wing Lian	bfc91a91ca	use chat template	2025-01-11 23:18:27 -05:00
Wing Lian	5c226b600d	pr feedback	2025-01-08 08:38:06 -05:00
Wing Lian	af66f7c274	update link in README to include utm	2025-01-07 15:13:18 -05:00
Wing Lian	079f94ee99	include modal in requirements	2025-01-07 08:48:25 -05:00
Wing Lian	981ad965d0	allow minimal yaml for lm eval	2025-01-06 17:41:10 -05:00
Wing Lian	7ba701a355	cache bust when using branch, grab sha of latest image tag, update lm-eval dep	2025-01-06 16:19:08 -05:00
Wing Lian	0390bce7aa	lm_eval option to not post eval, and append not extend	2025-01-06 11:52:07 -05:00
Wing Lian	2741d8de23	Fix the sub call to lm-eval	2025-01-06 11:44:55 -05:00
Wing Lian	27a88f37cd	do lm_eval in cloud too	2025-01-06 11:17:14 -05:00
Wing Lian	6da8abc01f	native support for modal cloud from CLI	2025-01-05 21:49:53 -05:00
Wing Lian	3915abee4c	make sure padding is labeled as -100 for pretraining (#2227 )	2024-12-31 15:22:18 -05:00
NJordan72	7a38dbe674	fix: allow trainer builder to use custom jinja chat template (#2219 ) * fix: allow trainer builder to use custom jinja chat template * chore: use get_chat_template_from_config Co-authored-by: Chirag Jain <jain.chirag925@gmail.com> * fix: swap imports --------- Co-authored-by: Chirag Jain <jain.chirag925@gmail.com>	2024-12-24 16:18:50 -05:00
Wing Lian	e0a2eb2ebd	fix untrained tokens if specified explicitly from a list (#2210 )	2024-12-23 09:08:28 -05:00
Wing Lian	d852d7af7a	inference - don't default w accelerate, fix base model (#2216 ) [skip ci]	2024-12-23 07:48:41 -05:00
Wing Lian	3742deb1de	add deepspeed example with torch compile enabled (#2212 ) [skip ci]	2024-12-22 12:11:39 -05:00
Wing Lian	2312caaa98	GC every n steps (#2209 )	2024-12-21 17:38:33 -05:00
Wing Lian	307cf7c685	move the dataset loading from remote/disk to a shared function so we can re-use for RL (#2204 )	2024-12-20 21:43:52 -05:00
Dan Saunders	70541145f1	adding test_datasets compat with pretraining_dataset (streaming) (#2206 ) [skip ci]	2024-12-20 21:43:33 -05:00