log warning re: logged losses / gradient scaling per rank

using existing packed seqlens util
adding smoke test
2025-04-07 18:47:43 +00:00
145 changed files with 1583 additions and 1418 deletions
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -24,6 +24,13 @@ jobs:
      fail-fast: false
      matrix:
        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            axolotl_extras: vllm
+            num_gpus: 2
+            nightly_build: "true"
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -38,13 +45,6 @@ jobs:
            axolotl_extras: vllm
            num_gpus: 2
            nightly_build: "true"
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.6.0
-            axolotl_extras: vllm
-            num_gpus: 2
-            nightly_build: "true"
    runs-on: [self-hosted, modal]
    timeout-minutes: 120
    steps:
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -211,7 +211,7 @@ jobs:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
-            pytorch: 2.5.1
+            pytorch: 2.6.0
            num_gpus: 1
            axolotl_extras: vllm
    steps:
@@ -258,7 +258,7 @@ jobs:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
-            pytorch: 2.6.0
+            pytorch: 2.5.1
            num_gpus: 1
            axolotl_extras: vllm
    steps:
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -231,6 +231,7 @@ website:
            - docs/reward_modelling.qmd
            - docs/lr_groups.qmd
            - docs/lora_optims.qmd
+            - docs/dataset_loading.qmd

        - section: "Core Concepts"
          contents:
--- a/cicd/multigpu.sh
+++ b/cicd/multigpu.sh
@@ -2,5 +2,5 @@
 set -e

 # only run one test at a time so as not to OOM the GPU
-pytest -v -n2 /workspace/axolotl/tests/e2e/multigpu/ --ignore=/workspace/axolotl/tests/e2e/multigpu/solo/
-pytest -v -n1 /workspace/axolotl/tests/e2e/multigpu/solo/
+pytest -v  --durations=10 -n2 /workspace/axolotl/tests/e2e/multigpu/ --ignore=/workspace/axolotl/tests/e2e/multigpu/solo/
+pytest -v  --durations=10 -n1 /workspace/axolotl/tests/e2e/multigpu/solo/
--- a/docker/Dockerfile-base
+++ b/docker/Dockerfile-base
@@ -29,7 +29,7 @@ ENV PATH="/root/miniconda3/envs/py${PYTHON_VERSION}/bin:${PATH}"
 WORKDIR /workspace

 RUN python3 -m pip install --upgrade pip && pip3 install -U packaging==23.2 setuptools==75.8.0 wheel && \
-    python3 -m pip install --no-cache-dir -U torch==${PYTORCH_VERSION}+cu${CUDA} --extra-index-url https://download.pytorch.org/whl/cu$CUDA && \
+    python3 -m pip install --no-cache-dir -U torch==${PYTORCH_VERSION}+cu${CUDA} torchvision --extra-index-url https://download.pytorch.org/whl/cu$CUDA && \
    python3 -m pip install --no-cache-dir "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" && \
    python3 -m pip install --no-cache-dir "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main"

--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -109,7 +109,7 @@ datasets:
    preprocess_shards: # Optional[int] process dataset in N sequential chunks for memory efficiency (exclusive with `shards`)

    name: # Optional[str] name of dataset configuration to load
-    train_on_split: train # Optional[str] name of dataset split to load from
+    split: train # Optional[str] name of dataset split to load from
    revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
    trust_remote_code: # Optional[bool] Trust remote code for untrusted source

@@ -165,7 +165,9 @@ datasets:
      content: value
      # ...

-    # Optional[Dict[str, List]]. Roles mapping in the messages. The default is:
+    # Optional[Dict[str, List]]. Roles mapping in the messages.
+    # The format is {target_role: [source_roles]}. All source roles will be mapped to the target role.
+    # The default is:
    roles:
      user: ["human", "user"]
      assistant: ["gpt", "assistant"]
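Note on the `roles` comment added above: the `{target_role: [source_roles]}` direction is easy to misread, so here is a minimal sketch of the lookup it implies. The mapping below contains only the two entries visible in this hunk, `normalize_role` is a hypothetical helper name, and passing unknown roles through unchanged is an assumption, not necessarily what Axolotl does.

```python
# Mapping as documented in the hunk: {target_role: [source_roles]}.
# Only the two entries visible in the diff are reproduced here.
roles = {
    "user": ["human", "user"],
    "assistant": ["gpt", "assistant"],
}

# Invert it so each source role points at its target role.
source_to_target = {
    source: target for target, sources in roles.items() for source in sources
}


def normalize_role(role: str) -> str:
    """Map a source role to its target role (hypothetical helper).

    Assumption: roles missing from the mapping are returned unchanged.
    """
    return source_to_target.get(role, role)


print(normalize_role("gpt"))
print(normalize_role("human"))
```

Inverting the dictionary once keeps each per-message lookup a single `dict.get` call.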
--- a/docs/dataset-formats/index.qmd
+++ b/docs/dataset-formats/index.qmd
@@ -13,6 +13,13 @@ As there are a lot of available options in Axolotl, this guide aims to provide a

 Axolotl supports 3 kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has their own dataset format which are described below.

+::: {.callout-tip}
+
+This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.
+
+For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
+:::
+
 ## Pre-training

 When aiming to train on large corpora of text datasets, pre-training is your go-to choice. Due to the size of these datasets, downloading the entire-datasets before beginning training would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) to only load batches into memory at a time.
--- a/docs/dataset_loading.qmd
+++ b/docs/dataset_loading.qmd
@@ -0,0 +1,276 @@
+---
+title: Dataset Loading
+description: Understanding how to load datasets from different sources
+back-to-top-navigation: true
+toc: true
+toc-depth: 5
+---
+
+## Overview
+
+Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.
+
+## Loading Datasets
+
+We use the `datasets` library, with a mix of `load_dataset` and `load_from_disk`, to load datasets.
+
+You may recognize the similarly named configs between `load_dataset` and the `datasets` section of the config file.
+
+```yaml
+datasets:
+  - path:
+    name:
+    data_files:
+    split:
+    revision:
+    trust_remote_code:
+```
+
+::: {.callout-tip}
+
+Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most commonly used configs are `path` and sometimes `data_files`.
+
+:::
+
+This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.
+
+For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).
+
+For full details on the config, see [config.qmd](config.qmd).
+
+::: {.callout-note}
+
+You can set multiple datasets in the config file by adding more than one entry under `datasets`.
+
+```yaml
+datasets:
+  - path: /path/to/your/dataset
+  - path: /path/to/your/other/dataset
+```
+
+:::
+
+### Local dataset
+
+#### Files
+
+Usually, to load a JSON file, you would do something like this:
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("json", data_files="/path/to/your/file.jsonl")
+```
+
+Which translates to the following config:
+
+```yaml
+datasets:
+  - path: json
+    data_files: /path/to/your/file.jsonl
+```
+
+However, to make things easier, we have added a few shortcuts for loading local dataset files.
+
+You can just point the `path` to the file or directory along with the `ds_type` to load the dataset. The example below shows this for a JSON file:
+
+```yaml
+datasets:
+  - path: /path/to/your/file.jsonl
+    ds_type: json
+```
+
+This works for CSV, JSON, Parquet, and Arrow files.
+
+::: {.callout-tip}
+
+If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.
+
+:::
+
+#### Directory
+
+If you're loading a directory, you can point the `path` to the directory.
+
+Then, you have two options:
+
+##### Loading entire directory
+
+You do not need any additional configs.
+
+We will attempt to load in the following order:
+- datasets saved with `Dataset.save_to_disk`
+- loading entire directory of files (such as with parquet/arrow files)
+
+```yaml
+datasets:
+  - path: /path/to/your/directory
+```
+
+##### Loading specific files in directory
+
+Provide `data_files` with a list of files to load.
+
+```yaml
+datasets:
+    # single file
+  - path: /path/to/your/directory
+    ds_type: csv
+    data_files: file1.csv
+
+    # multiple files
+  - path: /path/to/your/directory
+    ds_type: json
+    data_files:
+      - file1.jsonl
+      - file2.jsonl
+
+    # multiple files for parquet
+  - path: /path/to/your/directory
+    ds_type: parquet
+    data_files:
+      - file1.parquet
+      - file2.parquet
+
+```
+
+### HuggingFace Hub
+
+The method you use to load the dataset depends on how the dataset was created: either a folder was uploaded directly, or a HuggingFace Dataset was pushed.
+
+::: {.callout-note}
+
+If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file.
+
+:::
+
+#### Folder uploaded
+
+This means that the dataset consists of one or more files uploaded directly to the Hub.
+
+```yaml
+datasets:
+  - path: org/dataset-name
+    data_files:
+      - file1.jsonl
+      - file2.jsonl
+```
+
+#### HuggingFace Dataset
+
+This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.
+
+```yaml
+datasets:
+  - path: org/dataset-name
+```
+
+::: {.callout-note}
+
+There are some other configs which may be required, like `name`, `split`, `revision`, `trust_remote_code`, etc., depending on the dataset.
+
+:::
+
+### Remote Filesystems
+
+Via the `storage_options` config under `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.
+
+::: {.callout-warning}
+
+This is currently experimental. Please let us know if you run into any issues!
+
+:::
+
+The only difference between the providers is that you need to prefix the path with the respective protocol.
+
+```yaml
+datasets:
+    # Single file
+  - path: s3://bucket-name/path/to/your/file.jsonl
+
+    # Directory
+  - path: s3://bucket-name/path/to/your/directory
+```
+
+For a directory, we load via `load_from_disk`.
+
+#### S3
+
+Prepend the path with `s3://`.
+
+The credentials are pulled in the following order:
+
+- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
+- from the `~/.aws/credentials` file
+- for nodes on EC2, the IAM metadata provider
+
+::: {.callout-note}
+
+We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to add a config option for this.
+
+:::
+
+Other environment variables that can be set can be found in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).
+
+#### GCS
+
+Prepend the path with `gs://` or `gcs://`.
+
+The credentials are loaded in the following order:
+
+- gcloud credentials
+- for nodes on GCP, the google metadata service
+- anonymous access
+
+#### Azure
+
+##### Gen 1
+
+Prepend the path with `adl://`.
+
+Ensure you have the following environment variables set:
+
+- `AZURE_STORAGE_TENANT_ID`
+- `AZURE_STORAGE_CLIENT_ID`
+- `AZURE_STORAGE_CLIENT_SECRET`
+
+##### Gen 2
+
+Prepend the path with `abfs://` or `az://`.
+
+Ensure you have the following environment variables set:
+
+- `AZURE_STORAGE_ACCOUNT_NAME`
+- `AZURE_STORAGE_ACCOUNT_KEY`
+
+Other environment variables that can be set can be found in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).
+
+#### OCI
+
+Prepend the path with `oci://`.
+
+The credentials are read in the following order:
+
+- `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
+- when running on an OCI resource, the resource principal
+
+Other environment variables:
+
+- `OCI_REGION_METADATA`
+
+Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).
+
+### HTTPS
+
+The path should start with `https://`.
+
+```yaml
+datasets:
+  - path: https://path/to/your/dataset/file.jsonl
+```
+
+This must be publicly accessible.
+
+## Next steps
+
+Now that you know how to load datasets, you can learn how to map your specific dataset format into your target output format in the [dataset formats docs](dataset-formats).
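Note on the new `docs/dataset_loading.qmd`: the page describes two lookups on the `path` value, one by protocol prefix and one by file extension. A minimal sketch of both follows, assuming only the prefixes and file types named in the page. `describe_path` and both tables are hypothetical names for illustration, the `.arrow` extension is an assumption, and the actual loader may resolve paths differently.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Hypothetical tables built from the prefixes listed in the docs page.
REMOTE_PREFIXES = {
    "s3://": "S3",
    "gs://": "GCS",
    "gcs://": "GCS",
    "adl://": "Azure Gen 1",
    "abfs://": "Azure Gen 2",
    "az://": "Azure Gen 2",
    "oci://": "OCI",
    "https://": "HTTPS",
}

# The page names CSV, JSON, Parquet, and Arrow; the exact extensions
# recognised by the loader are an assumption here.
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def describe_path(path: str) -> Tuple[str, Optional[str]]:
    """Return (source, inferred ds_type) for a `path` value."""
    source = "local or HuggingFace Hub"
    for prefix, name in REMOTE_PREFIXES.items():
        if path.startswith(prefix):
            source = name
            break
    # No recognised extension means no ds_type can be inferred.
    suffix = PurePosixPath(path).suffix.lower()
    return source, EXTENSION_TO_DS_TYPE.get(suffix)


print(describe_path("s3://bucket-name/path/to/your/file.jsonl"))
print(describe_path("/path/to/your/directory"))
```

A path without a recognised extension returns `None` for the type, which corresponds to the cases where the page says to set `ds_type` explicitly or to load a whole directory.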
--- a/docs/multimodal.qmd
+++ b/docs/multimodal.qmd
@@ -9,6 +9,7 @@ format:
 ## Supported Models

 - [Mllama](#sec-mllama)
+- [Llama4](#sec-llama4)
 - [Pixtral](#sec-pixtral)
 - [Llava-1.5](#sec-llava-15)
 - [Mistral-Small-3.1](#sec-mistral-small-31)
@@ -63,6 +64,14 @@ base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
 chat_template: llama3_2_vision
 ```

+### Llama4 {#sec-llama4}
+
+```yaml
+base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
+
+chat_template: llama4
+```
+
 ### Pixtral {#sec-pixtral}

 ```yaml
--- a/examples/cerebras/btlm-ft.yml
+++ b/examples/cerebras/btlm-ft.yml
@@ -8,9 +8,6 @@ tokenizer_type: GPT2Tokenizer
 trust_remote_code: true
 tokenizer_use_fast: true
 tokenizer_legacy: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false
 push_dataset_to_hub:
 hf_use_auth_token: true
@@ -34,7 +31,6 @@ lora_alpha:
 lora_dropout:
 lora_target_modules:
 lora_target_linear:
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -58,16 +54,12 @@ learning_rate: 0.000085
 train_on_inputs: true
 group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1

-xformers_attention:
 flash_attention: true
 sdp_attention:
 flash_optimum:
@@ -80,8 +72,6 @@ evals_per_epoch: 4
 saves_per_epoch: 1
 save_total_limit:

-debug:
-deepspeed:
 weight_decay: 0.1
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/cerebras/qlora.yml
+++ b/examples/cerebras/qlora.yml
@@ -22,7 +22,6 @@ lora_target_modules:
  - c_attn
  - c_proj
 lora_target_linear:
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -36,15 +35,10 @@ optimizer: paged_adamw_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.0002
-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
 xformers_attention: true
 flash_attention:
@@ -53,10 +47,6 @@ gptq_model_v1:
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/code-llama/13b/lora.yml
+++ b/examples/code-llama/13b/lora.yml
@@ -26,7 +26,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,29 +40,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/code-llama/13b/qlora.yml
+++ b/examples/code-llama/13b/qlora.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,18 @@ optimizer: paged_adamw_32bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/code-llama/34b/lora.yml
+++ b/examples/code-llama/34b/lora.yml
@@ -26,7 +26,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,29 +40,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/code-llama/34b/qlora.yml
+++ b/examples/code-llama/34b/qlora.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,18 @@ optimizer: paged_adamw_32bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/code-llama/7b/lora.yml
+++ b/examples/code-llama/7b/lora.yml
@@ -26,7 +26,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,29 +40,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/code-llama/7b/qlora.yml
+++ b/examples/code-llama/7b/qlora.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,18 @@ optimizer: paged_adamw_32bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/cohere/command-r-7b-qlora.yml
+++ b/examples/cohere/command-r-7b-qlora.yml
@@ -44,28 +44,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/dbrx/16bit-lora.yaml
+++ b/examples/dbrx/16bit-lora.yaml
@@ -3,9 +3,6 @@ base_model: LnL-AI/dbrx-base-converted-v2
 # hub_model_id: username/custom_model_name

 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -48,26 +45,20 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: false  # don't use with fsdp_activation_checkpointing
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch:
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/dbrx/8bit-lora.yaml
+++ b/examples/dbrx/8bit-lora.yaml
@@ -48,26 +48,20 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: false  # don't use with fsdp_activation_checkpointing
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch:
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/dbrx/fft-ds-zero3.yaml
+++ b/examples/dbrx/fft-ds-zero3.yaml
@@ -3,9 +3,6 @@ base_model: LnL-AI/dbrx-base-converted-v2
 # hub_model_id: username/custom_model_name

 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -35,25 +32,19 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch:
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 deepspeed: deepspeed_configs/zero3_bf16.json
--- a/examples/deepseek-v2/fft-fsdp-16b.yaml
+++ b/examples/deepseek-v2/fft-fsdp-16b.yaml
@@ -2,9 +2,6 @@ base_model: deepseek-ai/DeepSeek-V2-Lite
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -31,27 +28,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 2e-5

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 2
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 special_tokens:
 fsdp:
--- a/examples/deepseek-v2/qlora-fsdp-2_5.yaml
+++ b/examples/deepseek-v2/qlora-fsdp-2_5.yaml
@@ -52,27 +52,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 2e-5

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 2
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 special_tokens:
 fsdp:
--- a/examples/falcon/config-7b-lora.yml
+++ b/examples/falcon/config-7b-lora.yml
@@ -25,9 +25,7 @@ max_packed_sequence_len:
 lora_r: 16
 lora_alpha: 32
 lora_dropout: 0.0
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -41,15 +39,10 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.00003
-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
 xformers_attention: true
 flash_attention:
@@ -58,11 +51,7 @@ gptq_model_v1:
 warmup_steps: 40
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
--- a/examples/falcon/config-7b-qlora.yml
+++ b/examples/falcon/config-7b-qlora.yml
@@ -38,9 +38,7 @@ lora_alpha: 16
 # 0.05 for 33B and 65B models
 lora_dropout: 0.05
 # add LoRA modules on all linear layers of the base model
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -67,10 +65,7 @@ lr_scheduler: cosine
 # - 2e-4 for 7b & 13b
 # - 1e-4 for 33b & 64b
 learning_rate: 0.0002
-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true
 gradient_checkpointing: true
 # stop training after this many evaluation losses have increased in a row
@@ -78,7 +73,6 @@ gradient_checkpointing: true
 early_stopping_patience: 3
 resume_from_checkpoint:
 auto_resume_from_checkpoints: true
-local_rank:
 logging_steps: 1
 xformers_attention: true
 flash_attention:
@@ -87,11 +81,7 @@ gptq_model_v1:
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.000001
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
--- a/examples/falcon/config-7b.yml
+++ b/examples/falcon/config-7b.yml
@@ -7,9 +7,6 @@ tokenizer_type: AutoTokenizer

 # required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 gptq: false
 strict: false
 push_dataset_to_hub:
@@ -25,9 +22,7 @@ max_packed_sequence_len:
 lora_r: 64
 lora_alpha: 32
 lora_dropout: 0.0
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -41,15 +36,10 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.00003
-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
 xformers_attention: true
 flash_attention:
@@ -58,11 +48,7 @@ gptq_model_v1:
 warmup_steps: 40
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
--- a/examples/gemma/qlora.yml
+++ b/examples/gemma/qlora.yml
@@ -42,28 +42,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/gemma2/qlora.yml
+++ b/examples/gemma2/qlora.yml
@@ -48,28 +48,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/gemma2/reward-model.yaml
+++ b/examples/gemma2/reward-model.yaml
@@ -5,9 +5,6 @@ num_labels: 1
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 reward_model: true
@@ -38,8 +35,6 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true
@@ -47,21 +42,12 @@ tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/gemma3/gemma-3-1b-qlora.yml
+++ b/examples/gemma3/gemma-3-1b-qlora.yml
@@ -50,30 +50,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/gemma3/gemma-3-4b-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-qlora.yml
@@ -44,8 +44,6 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true
@@ -53,7 +51,6 @@ tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-local_rank:
 logging_steps: 1
 flash_attention: true
 eager_attention:
@@ -61,8 +58,4 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/gemma3/gemma-3-4b-vision-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-vision-qlora.yml
@@ -46,8 +46,6 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true
@@ -55,7 +53,6 @@ tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-local_rank:
 logging_steps: 1
 flash_attention: true
 eager_attention:
@@ -63,8 +60,4 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/gptj/qlora.yml
+++ b/examples/gptj/qlora.yml
@@ -18,9 +18,7 @@ max_packed_sequence_len:
 lora_r: 8
 lora_alpha: 32
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -34,15 +32,10 @@ optimizer: paged_adamw_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.0001
-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
 xformers_attention: true
 flash_attention:
@@ -51,10 +44,6 @@ gptq_model_v1:
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/jamba/qlora.yaml
+++ b/examples/jamba/qlora.yaml
@@ -40,26 +40,18 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 special_tokens:
--- a/examples/jamba/qlora_deepspeed.yaml
+++ b/examples/jamba/qlora_deepspeed.yaml
@@ -39,26 +39,20 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch:
 saves_per_epoch: 1
-debug:
+
 deepspeed: deepspeed_configs/zero2.json
 weight_decay: 0.0
 special_tokens:
--- a/examples/jamba/qlora_fsdp_large.yaml
+++ b/examples/jamba/qlora_fsdp_large.yaml
@@ -39,8 +39,6 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: true
 tf32: true

--- a/examples/jeopardy-bot/config.yml
+++ b/examples/jeopardy-bot/config.yml
@@ -33,13 +33,9 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.00003
-train_on_inputs: false
-group_by_length: false
 bf16: auto
 tf32: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 5
 xformers_attention: true
 flash_attention:
@@ -48,11 +44,7 @@ gptq_model_v1:
 warmup_steps: 20
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/llama-2/fft_optimized.yml
+++ b/examples/llama-2/fft_optimized.yml
@@ -4,9 +4,6 @@ model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -26,7 +23,6 @@ lora_r:
 lora_alpha:
 lora_dropout:
 lora_target_linear:
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,18 +37,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
@@ -61,11 +51,8 @@ flash_attn_fuse_mlp: true

 warmup_steps: 100
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
+
 deepspeed: #deepspeed_configs/zero2.json # multi-gpu only
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/llama-2/gptq-lora.yml
+++ b/examples/llama-2/gptq-lora.yml
@@ -10,8 +10,6 @@ gptq_disable_exllama: true

 tokenizer_use_fast: true
 tokenizer_legacy: true
-load_in_8bit: false
-load_in_4bit: false
 strict: false
 push_dataset_to_hub:
 hf_use_auth_token: true
@@ -33,7 +31,6 @@ lora_target_modules:
  - q_proj
  - v_proj
 lora_target_linear:
-lora_fan_in_fan_out:
 wandb_project:
 wandb_watch:
 wandb_name:
@@ -50,26 +47,19 @@ torchdistx_path:
 lr_scheduler: cosine
 lr_quadratic_warmup: true
 learning_rate: 0.000017
-train_on_inputs: false
-group_by_length: false
 bf16: false
 fp16: false
 float16: true
 tf32: true
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention:
 sdp_attention:
 flash_optimum:
 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
 special_tokens:
  bos_token: "<s>"
--- a/examples/llama-2/lisa.yml
+++ b/examples/llama-2/lisa.yml
@@ -4,9 +4,6 @@ model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -26,7 +23,6 @@ lora_r:
 lora_alpha:
 lora_dropout:
 lora_target_linear:
-lora_fan_in_fan_out:

 lisa_n_layers: 4
 lisa_step_interval: 20
@@ -45,18 +41,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 5e-5 # recommendation from lisa paper for 7b

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
@@ -65,13 +55,8 @@ flash_attn_fuse_mlp: true

 warmup_steps: 100
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/llama-2/loftq.yml
+++ b/examples/llama-2/loftq.yml
@@ -4,9 +4,6 @@ model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -26,7 +23,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 peft:
  loftq_config:
    loftq_bits: 4
@@ -44,29 +40,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/llama-2/lora.yml
+++ b/examples/llama-2/lora.yml
@@ -26,7 +26,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,29 +40,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/llama-2/qlora-fsdp.yml
+++ b/examples/llama-2/qlora-fsdp.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/llama-2/qlora.yml
+++ b/examples/llama-2/qlora.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,27 +41,16 @@ optimizer: paged_adamw_32bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/llama-2/relora.yml
+++ b/examples/llama-2/relora.yml
@@ -24,9 +24,7 @@ pad_to_sequence_len: true
 lora_r: 8
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 relora_steps: 150
 relora_warmup_steps: 10
@@ -45,28 +43,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/llama-3-vision/lora-11b.yaml
+++ b/examples/llama-3-vision/lora-11b.yaml
@@ -45,14 +45,11 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true

 gradient_checkpointing: true
-local_rank:
 logging_steps: 1
 flash_attention: true
 eager_attention:
@@ -60,8 +57,4 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/llama-3/fft-8b-liger-fsdp.yaml
+++ b/examples/llama-3/fft-8b-liger-fsdp.yaml
@@ -42,27 +42,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 2e-5

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 2
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/llama-3/fft-8b.yaml
+++ b/examples/llama-3/fft-8b.yaml
@@ -1,9 +1,6 @@
 base_model: NousResearch/Meta-Llama-3.1-8B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -30,29 +27,19 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 2e-5

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 2
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: <|end_of_text|>
--- a/examples/llama-3/instruct-dpo-lora-8b.yml
+++ b/examples/llama-3/instruct-dpo-lora-8b.yml
@@ -42,7 +42,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -57,28 +56,15 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/llama-3/instruct-lora-8b.yml
+++ b/examples/llama-3/instruct-lora-8b.yml
@@ -37,7 +37,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -52,30 +51,17 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
   pad_token: <|end_of_text|>
--- a/examples/llama-3/lora-1b-deduplicate-dpo.yml
+++ b/examples/llama-3/lora-1b-deduplicate-dpo.yml
@@ -58,7 +58,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -73,28 +72,15 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/llama-3/lora-1b-deduplicate-sft.yml
+++ b/examples/llama-3/lora-1b-deduplicate-sft.yml
@@ -31,7 +31,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_modules_to_save:
  - embed_tokens
  - lm_head
@@ -49,30 +48,17 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
   pad_token: <|end_of_text|>
--- a/examples/llama-3/lora-1b-kernels.yml
+++ b/examples/llama-3/lora-1b-kernels.yml
@@ -1,9 +1,6 @@
 base_model: NousResearch/Llama-3.2-1B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -24,7 +21,6 @@ lora_r: 16
 lora_alpha: 32
 # Currently, we don't support dropout with our custom Triton kernels
 # lora_dropout: 0.05
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -53,18 +49,12 @@ optimizer: adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -73,10 +63,6 @@ loss_watchdog_patience: 3
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"
--- a/examples/llama-3/lora-1b-ray.yml
+++ b/examples/llama-3/lora-1b-ray.yml
@@ -1,9 +1,6 @@
 base_model: NousResearch/Llama-3.2-1B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -24,7 +21,6 @@ pad_to_sequence_len: true
 lora_r: 16
 lora_alpha: 32
 lora_dropout: 0.05
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -47,18 +43,12 @@ optimizer: adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -67,11 +57,9 @@ loss_watchdog_patience: 3
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
+
 deepspeed: deepspeed_configs/zero3.json
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"

--- a/examples/llama-3/lora-1b-sample-packing-sequentially.yml
+++ b/examples/llama-3/lora-1b-sample-packing-sequentially.yml
@@ -33,7 +33,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_modules_to_save:
  - embed_tokens
  - lm_head
@@ -51,30 +50,17 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: <|end_of_text|>
--- a/examples/llama-3/lora-1b.yml
+++ b/examples/llama-3/lora-1b.yml
@@ -1,9 +1,6 @@
 base_model: NousResearch/Llama-3.2-1B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -24,7 +21,6 @@ pad_to_sequence_len: true
 lora_r: 16
 lora_alpha: 32
 lora_dropout: 0.05
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -47,18 +43,12 @@ optimizer: adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -67,10 +57,6 @@ loss_watchdog_patience: 3
 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"
--- a/examples/llama-3/lora-8b.yml
+++ b/examples/llama-3/lora-8b.yml
@@ -27,7 +27,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_modules_to_save:
  - embed_tokens
  - lm_head
@@ -45,30 +44,17 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
   pad_token: <|end_of_text|>
--- a/examples/llama-3/qlora-1b-kto.yaml
+++ b/examples/llama-3/qlora-1b-kto.yaml
@@ -32,7 +32,6 @@ lora_r: 32
 lora_alpha: 64
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -47,31 +46,19 @@ optimizer: adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 20
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"
--- a/examples/llama-3/qlora-1b.yml
+++ b/examples/llama-3/qlora-1b.yml
@@ -24,7 +24,6 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -47,18 +46,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -66,13 +59,7 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"
--- a/examples/llama-3/qlora-fsdp-405b.yaml
+++ b/examples/llama-3/qlora-fsdp-405b.yaml
@@ -24,7 +24,6 @@ pad_to_sequence_len: true
 lora_r: 16
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true

 gradient_accumulation_steps: 4
@@ -34,8 +33,6 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: true
 tf32: true

--- a/examples/llama-3/qlora-fsdp-70b.yaml
+++ b/examples/llama-3/qlora-fsdp-70b.yaml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 8
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.00001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/llama-3/qlora.yml
+++ b/examples/llama-3/qlora.yml
@@ -26,9 +26,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +41,17 @@ optimizer: paged_adamw_32bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: "<|end_of_text|>"
--- a/examples/llama4/scout-lora.yaml
+++ b/examples/llama4/scout-lora.yaml
@@ -0,0 +1,75 @@
+base_model: meta-llama/Llama-4-Scout-17B-16E
+model_type: Llama4ForConditionalGeneration
+  # Automatically upload checkpoint and final model to HF
+  # hub_model_id: username/custom_model_name
+
+strict: false
+
+  # torch_compile: true
+
+adapter: lora
+lora_r: 32
+lora_alpha: 64
+lora_target_modules:
+  - self_attn.q_proj
+  - self_attn.k_proj
+  - self_attn.v_proj
+  - self_attn.o_proj
+lora_modules_to_save:
+  - lm_head
+  - embed_tokens
+
+chat_template: llama4
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train[:20%]
+    field_messages: conversations
+    message_property_mappings:
+      role: from
+      content: value
+
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.0
+output_dir: ./outputs/out
+
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_torch_8bit
+lr_scheduler: cosine
+learning_rate: 2e-5
+
+bf16: true
+tf32: true
+
+# gradient_checkpointing: true
+# gradient_checkpointing_kwargs:
+#   use_reentrant: false
+logging_steps: 1
+flash_attention: true
+
+warmup_steps: 100
+evals_per_epoch: 2
+saves_per_epoch: 1
+weight_decay: 0.0
+fsdp:
+  - auto_wrap
+  - full_shard
+fsdp_config:
+  fsdp_version: 2
+  fsdp_offload_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: Llama4TextDecoderLayer
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_reshard_after_forward: true
+  fsdp_activation_checkpointing: true
+special_tokens:
+  pad_token: <|finetune_right_pad_id|>
+  eos_token: <|eot|>
--- a/examples/llava/lora-7b.yaml
+++ b/examples/llava/lora-7b.yaml
@@ -41,14 +41,11 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true

 gradient_checkpointing: true
-local_rank:
 logging_steps: 1
 flash_attention: true
 eager_attention:
@@ -56,8 +53,4 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/mamba/config.yml
+++ b/examples/mamba/config.yml
@@ -5,9 +5,6 @@ tokenizer_type: AutoTokenizer
 tokenizer_config: EleutherAI/gpt-neox-20b
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -38,27 +35,17 @@ train_on_inputs: false
 group_by_length: true

 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
 tokens:
 save_safetensors: False
--- a/examples/mistral/bigstral-ds-zero3.yaml
+++ b/examples/mistral/bigstral-ds-zero3.yaml
@@ -6,9 +6,6 @@ tokenizer_type: LlamaTokenizer
 # hub_model_id: username/custom_model_name

 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 unfrozen_parameters:
@@ -40,27 +37,19 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 save_total_limit: 1
 save_steps:
-debug:
+
 deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  eos_token: "<|im_end|>"
 tokens:
--- a/examples/mistral/config.yml
+++ b/examples/mistral/config.yml
@@ -4,9 +4,6 @@ model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -34,28 +31,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.000005

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/lora-mps.yml
+++ b/examples/mistral/lora-mps.yml
@@ -4,9 +4,6 @@ model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -28,7 +25,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -51,18 +47,13 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
 fp16: false
 tf32: true

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: false
 sdp_attention: true

@@ -71,12 +62,6 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_table_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/lora.yml
+++ b/examples/mistral/lora.yml
@@ -27,7 +27,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -50,18 +49,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -69,12 +62,6 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/mistral-dpo-qlora.yml
+++ b/examples/mistral/mistral-dpo-qlora.yml
@@ -40,7 +40,6 @@ lora_r: 8
 lora_alpha: 16
 lora_dropout: 0.2
 lora_target_linear: true
-lora_fan_in_fan_out:

 lora_target_modules:
  - gate_proj
@@ -67,31 +66,18 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: false
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<|im_start|>"
  eos_token: "<|im_end|>"
--- a/examples/mistral/mistral-qlora-fsdp.yml
+++ b/examples/mistral/mistral-qlora-fsdp.yml
@@ -32,7 +32,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -47,18 +46,12 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -66,10 +59,8 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/mistral/mistral-qlora-orpo.yml
+++ b/examples/mistral/mistral-qlora-orpo.yml
@@ -32,7 +32,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -55,18 +54,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -74,12 +67,6 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/mistral-small-3.1-24B-lora.yml
+++ b/examples/mistral/mistral-small-3.1-24B-lora.yml
@@ -43,14 +43,11 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true

 gradient_checkpointing: true
-local_rank:
 logging_steps: 1
 flash_attention: false # PixtralVisionModel does not support Flash Attention 2.0 yet.
 eager_attention:
@@ -58,9 +55,5 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
@@ -30,7 +30,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -45,18 +44,12 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -64,10 +57,8 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/mistral/mixtral-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-qlora-fsdp.yml
@@ -32,7 +32,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -47,18 +46,12 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -66,10 +59,8 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
+
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/mistral/mixtral.yml
+++ b/examples/mistral/mixtral.yml
@@ -41,7 +41,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 #lora_target_modules:
 #  - gate
 #  - q_proj
@@ -65,18 +64,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -84,12 +77,8 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
+
 deepspeed: deepspeed_configs/zero2.json
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mistral/mixtral_22.yml
+++ b/examples/mistral/mixtral_22.yml
@@ -6,9 +6,6 @@ tokenizer_type: LlamaTokenizer
 # hub_model_id: username/custom_model_name

 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 unfrozen_parameters:
@@ -38,27 +35,19 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0001

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 save_total_limit: 1
 save_steps:
-debug:
+
 deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  eos_token: "<|im_end|>"
 tokens:
--- a/examples/mistral/qlora.yml
+++ b/examples/mistral/qlora.yml
@@ -27,7 +27,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:
 lora_target_modules:
  - gate_proj
  - down_proj
@@ -50,18 +49,12 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 loss_watchdog_threshold: 5.0
@@ -69,12 +62,6 @@ loss_watchdog_patience: 3

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/mpt-7b/config.yml
+++ b/examples/mpt-7b/config.yml
@@ -35,26 +35,17 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.0000002
-train_on_inputs: false
-group_by_length: false
 bf16: auto
 tf32: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 5
-xformers_attention:
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0001
-fsdp:
-fsdp_config:
 tokens:
  pad_token: "<|padding|>"
  bos_token: "<|endoftext|>"
--- a/examples/openllama-3b/config.yml
+++ b/examples/openllama-3b/config.yml
@@ -4,9 +4,6 @@ model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false
 push_dataset_to_hub:
 datasets:
@@ -23,7 +20,6 @@ lora_alpha:
 lora_dropout:
 lora_target_modules:
 lora_target_linear:
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -37,29 +33,20 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.000003
-train_on_inputs: false
-group_by_length: false
 float16: true
 bf16: false
 fp16: false
 tf32: false
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/openllama-3b/lora.yml
+++ b/examples/openllama-3b/lora.yml
@@ -29,7 +29,6 @@ lora_target_modules:
  - v_proj
  - k_proj
  - o_proj
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -43,29 +42,19 @@ optimizer: adamw_bnb_8bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.0002
-train_on_inputs: false
-group_by_length: false
 bf16: false
 fp16: true
 tf32: false
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
 gptq_groupsize:
-s2_attention:
 gptq_model_v1:
 warmup_steps: 20
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/openllama-3b/qlora.yml
+++ b/examples/openllama-3b/qlora.yml
@@ -21,9 +21,7 @@ sample_packing: true
 lora_r: 8
 lora_alpha: 32
 lora_dropout: 0.05
-lora_target_modules:
 lora_target_linear: true
-lora_fan_in_fan_out:
 wandb_project:
 wandb_entity:
 wandb_watch:
@@ -37,28 +35,19 @@ optimizer: paged_adamw_32bit
 torchdistx_path:
 lr_scheduler: cosine
 learning_rate: 0.0002
-train_on_inputs: false
-group_by_length: false
 bf16: false
 fp16: true
 tf32: false
 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
--- a/examples/phi/lora-3.5.yaml
+++ b/examples/phi/lora-3.5.yaml
@@ -37,7 +37,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -52,28 +51,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bfloat16: true
 bf16: true
 fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
-s2_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 4
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/phi/phi-ft.yml
+++ b/examples/phi/phi-ft.yml
@@ -4,9 +4,6 @@ model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -27,7 +24,6 @@ lora_r:
 lora_alpha:
 lora_dropout:
 lora_target_linear:
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -45,30 +41,20 @@ max_grad_norm: 1.0
 lr_scheduler: cosine
 learning_rate: 0.000003

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: True
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 resize_token_embeddings_to_32x: true
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/phi/phi-qlora.yml
+++ b/examples/phi/phi-qlora.yml
@@ -27,7 +27,6 @@ lora_r: 64
 lora_alpha: 32
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -45,30 +44,20 @@ max_grad_norm: 1.0
 lr_scheduler: cosine
 learning_rate: 0.000003

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: True
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 resize_token_embeddings_to_32x: true
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/phi/phi2-ft.yml
+++ b/examples/phi/phi2-ft.yml
@@ -4,9 +4,6 @@ model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -27,7 +24,6 @@ lora_r:
 lora_alpha:
 lora_dropout:
 lora_target_linear:
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -45,30 +41,20 @@ max_grad_norm: 1.0
 lr_scheduler: cosine
 learning_rate: 0.000003

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: True
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
-fsdp:
-fsdp_config:
 resize_token_embeddings_to_32x: true
 special_tokens:
  pad_token: "<|endoftext|>"
--- a/examples/phi/phi3-ft-fsdp.yml
+++ b/examples/phi/phi3-ft-fsdp.yml
@@ -4,9 +4,6 @@ model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -28,7 +25,6 @@ lora_r:
 lora_alpha:
 lora_dropout:
 lora_target_linear:
-lora_fan_in_fan_out:

 wandb_project: phi3
 wandb_entity:
@@ -46,27 +42,19 @@ max_grad_norm: 1.0
 lr_scheduler: cosine
 learning_rate: 0.000003

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.1
 fsdp:
  - full_shard
--- a/examples/phi/phi3-ft.yml
+++ b/examples/phi/phi3-ft.yml
@@ -7,9 +7,6 @@ tokenizer_type: AutoTokenizer
 # hub_model_id: username/custom_model_name

 chat_template: phi_3
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -30,7 +27,6 @@ lora_r: 64
 lora_alpha: 32
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 gradient_accumulation_steps: 1
 micro_batch_size: 2
@@ -42,8 +38,6 @@ max_grad_norm: 1.0
 lr_scheduler: cosine
 learning_rate: 5.0e-6

-train_on_inputs: false
-group_by_length: false
 bf16: auto

 gradient_checkpointing: true
@@ -55,9 +49,9 @@ flash_attention: true

 eval_steps: 1000
 save_steps: 5000
-eval_table_size: 2
 eval_batch_size: 2
 eval_sample_packing: false
+eval_table_size: 2
 eval_max_new_tokens: 32
 eval_causal_lm_metrics: ["perplexity"]
 do_causal_lm_eval: true
--- a/examples/pixtral/lora-12b.yml
+++ b/examples/pixtral/lora-12b.yml
@@ -41,14 +41,11 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true

 gradient_checkpointing: true
-local_rank:
 logging_steps: 1
 flash_attention: false # PixtralVisionModel does not support Flash Attention 2.0 yet
 eager_attention:
@@ -56,10 +53,6 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
  pad_token: <pad>
--- a/examples/pythia-12b/config.yml
+++ b/examples/pythia-12b/config.yml
@@ -5,9 +5,6 @@ model_type: GPTNeoXForCausalLM
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 gptq: false
 device_map: auto
 datasets:
@@ -22,7 +19,6 @@ max_packed_sequence_len: 2048
 lora_r: 64
 lora_alpha: 32
 lora_dropout: 0.0
-lora_target_modules:
 lora_target_linear: true
 lora_fan_in_fan_out: true  # pythia/GPTNeoX lora specific
 wandb_project:
@@ -37,16 +33,10 @@ num_epochs: 5
 learning_rate: 0.00003
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
-train_on_inputs: false
-group_by_length: false
 bf16: false
 fp16: false
 float16: true
 tf32: true
 flash_optimum: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 gradient_checkpointing: true
-fsdp:
-fsdp_config:
--- a/examples/pythia/lora.yml
+++ b/examples/pythia/lora.yml
@@ -28,13 +28,9 @@ gradient_accumulation_steps: 1
 micro_batch_size: 4
 num_epochs: 4
 learning_rate: 0.00001
-train_on_inputs: false
-group_by_length: false
 bf16: auto
 tf32: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 weight_decay: 0.1
 evals_per_epoch: 4
 logging_steps: 1
--- a/examples/qwen/lora.yml
+++ b/examples/qwen/lora.yml
@@ -28,7 +28,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +42,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/qwen/qlora.yml
+++ b/examples/qwen/qlora.yml
@@ -28,7 +28,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -43,28 +42,16 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention:

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/qwen/qwen2-moe-lora.yaml
+++ b/examples/qwen/qwen2-moe-lora.yaml
@@ -3,9 +3,6 @@ base_model: Qwen/Qwen1.5-MoE-A2.7B
 # hub_model_id: username/custom_model_name

 trust_remote_code: true
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 datasets:
@@ -25,7 +22,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -40,28 +36,18 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/qwen/qwen2-moe-qlora.yaml
+++ b/examples/qwen/qwen2-moe-qlora.yaml
@@ -25,7 +25,6 @@ lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -40,28 +39,18 @@ optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/qwen2-vl/lora-7b.yaml
+++ b/examples/qwen2-vl/lora-7b.yaml
@@ -41,14 +41,11 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true

 gradient_checkpointing: true
-local_rank:
 logging_steps: 1
 flash_attention: true
 eager_attention:
@@ -56,8 +53,4 @@ eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/qwen2/dpo.yaml
+++ b/examples/qwen2/dpo.yaml
@@ -44,27 +44,15 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: false

 gradient_checkpointing: true
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
--- a/examples/qwen2/prm.yaml
+++ b/examples/qwen2/prm.yaml
@@ -5,9 +5,6 @@ num_labels: 2
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 process_reward_model: true
@@ -43,30 +40,19 @@ optimizer: adamw_torch
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32:
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 eval_steps: 100
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
--- a/examples/qwen2/qlora-fsdp.yaml
+++ b/examples/qwen2/qlora-fsdp.yaml
@@ -26,7 +26,6 @@ lora_r: 32
 lora_alpha: 64
 lora_dropout: 0.05
 lora_target_linear: true
-lora_fan_in_fan_out:

 wandb_project:
 wandb_entity:
@@ -41,27 +40,19 @@ optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: auto
-fp16:
 tf32: true

 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_steps: 10
 evals_per_epoch: 4
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
 fsdp:
  - full_shard
--- a/examples/qwen2/reward-model.yaml
+++ b/examples/qwen2/reward-model.yaml
@@ -5,9 +5,6 @@ num_labels: 1
 tokenizer_type: AutoTokenizer
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
-
-load_in_8bit: false
-load_in_4bit: false
 strict: false

 reward_model: true
@@ -38,8 +35,6 @@ optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002

-train_on_inputs: false
-group_by_length: false
 bf16: true
 fp16:
 tf32: true
@@ -47,21 +42,12 @@ tf32: true
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
-early_stopping_patience:
 resume_from_checkpoint:
-local_rank:
 logging_steps: 1
-xformers_attention:
 flash_attention: true

 warmup_ratio: 0.1
 evals_per_epoch:
-eval_table_size:
-eval_max_new_tokens: 128
 saves_per_epoch: 1
-debug:
-deepspeed:
 weight_decay: 0.0
-fsdp:
-fsdp_config:
 special_tokens:
Author	SHA1	Message	Date
Dan Saunders	954b989e88	log warning re: logged losses / gradient scaling per rank	2025-04-07 18:47:43 +00:00
Dan Saunders	c64c881460	using existing packed seqlens util	2025-04-07 18:47:43 +00:00
Dan Saunders	cefd57cecb	adding smoke test	2025-04-07 18:47:43 +00:00
Dan Saunders	2f3c52ea2f	pre-commit fix	2025-04-07 18:47:43 +00:00
Dan Saunders	741015b3cf	refactor and fix multipack seqlens	2025-04-07 18:47:43 +00:00
Dan Saunders	4188700b7b	working on masking fix	2025-04-07 18:47:43 +00:00
NanoCode012	9b89591ead	Feat: Add doc on loading datasets and support for Azure/OCI (#2482) * fix: remove unused config * feat: add doc on dataset loading * feat: enable azure and oci remote file system * feat: add adlfs and ocifs to requirements * fix: add links between dataset formats and dataset loading * fix: remove unused condition * Revert "fix: remove unused condition" This reverts commit `5fe13be73e`.	2025-04-07 12:41:13 -04:00
NanoCode012	31498d0230	fix(doc): clarify roles mapping in chat_template (#2490) [skip ci]	2025-04-07 12:40:32 -04:00
NanoCode012	d25daebea9	fix: duplicate llama4 chattemplate enum (#2500) * fix: duplicate llama4 chattemplate enum * fix: duplicate chat_template string	2025-04-07 12:39:19 -04:00
NanoCode012	e0e5d9b1d6	feat: add llama4 multimodal (#2499) * feat: add llama4 multimodal * feat: add torchvision to base docker * just use latest torchvision --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-04-07 10:49:29 -04:00
Wing Lian	8bbad21bfd	llama4 support (#2493) * llama4 support * add xet support [skip ci] * be flexible on transformers version and skip test on version * don't use deepspeed for the fix_untrained_tokens test * reordering to trigger torch 2.6.0 tests first * slightly smaller train set * use 4.51.0 for now * remove stray print, add llama4 chat template to schema, bump peft to 0.15.1 * patches to make llama4 performant * add preliminary fp8 support	2025-04-07 10:49:15 -04:00
Wing Lian	5f4af3665d	FSDP2 support (#2469) * fsdp2 support * use accelerate release 1.6.0 * allow 8bit optims with fsdp2 * liger + torch compile fix * add fsdp2 e2e tests * use transformers commit with fsdp2 support * skip zero3 tests for this PR for now * fix fsdp2 config for ci * make sure both flex and flash attn work with fsdp2, skip fix untrained tokens * okay, actually use fdsp2... * more fixes to flex for fsdp2 * make sure to patch all the loaded models * additional validation for fsdp2, bump dep versions	2025-04-06 17:08:01 -04:00
Sung Ching Liu	a8f38c367c	Flex Attention + Packing with BlockMask support (#2363)	2025-04-05 18:02:57 -04:00
Wing Lian	e7e0cd97ce	Update dependencies and show slow tests in CI (#2492) * use latest torchao, gradio, schedule-free * get info on slow tests * speed up tests by avoiding gradient checkpointing and reducing eval size	2025-04-05 17:41:31 -04:00
Wing Lian	949471039f	fix tokenizer overrides w gemma3 (#2488) * fix tokenizer overrides w gemma3 * fix offline wrapping	2025-04-05 01:25:44 -04:00
NanoCode012	de451f99a5	fix: cohere cce scaling wrong tensor (#2483)	2025-04-04 13:47:44 -04:00
Wing Lian	9f824ef76a	simplify the example configs to be more minimal and less daunting (#2486) [skip ci] * simplify the example configs to be more minimal and less daunting * drop empty s2_attention from example yamls	2025-04-04 13:47:26 -04:00
Wing Lian	dd66fb163c	check if fixture exists in the cache already (#2485) * check if fixture exists in the cache already * add docstring explaining what is going on	2025-04-04 13:47:01 -04:00
Dan Saunders	e0cc4f1a87	removing deepspeed guard for LoRA Triton kernels (#2480)	2025-04-03 14:50:56 -04:00