various fixes

Always re-normalize the teacher distribution
KD loss needs to be calculated in full precision
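
The two notes above concern the knowledge-distillation (KD) loss. Below is a minimal sketch in plain Python, not the code from this change. It assumes the teacher signal arrives as truncated top-k log-probabilities per token, which the commit message does not state. The helper names `renormalize` and `kd_loss` are hypothetical. Python floats are double precision, which here stands in for upcasting to float32 before computing the loss in a mixed-precision trainer.

```python
import math


def renormalize(teacher_logprobs):
    """Rescale truncated teacher log-probs so their probabilities sum to 1."""
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(teacher_logprobs)
    log_z = m + math.log(sum(math.exp(lp - m) for lp in teacher_logprobs))
    return [lp - log_z for lp in teacher_logprobs]


def kd_loss(student_logprobs, teacher_logprobs):
    """Forward KL(teacher || student) restricted to the teacher's top-k tokens."""
    # Always re-normalize, even if the caller believes the input is normalized.
    teacher = renormalize(teacher_logprobs)
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher, student_logprobs))


# Hypothetical top-3 teacher probabilities that sum to 0.9 rather than 1.0.
teacher = [math.log(0.6), math.log(0.2), math.log(0.1)]
student = [math.log(0.5), math.log(0.3), math.log(0.1)]
loss = kd_loss(student, teacher)
```

Without the re-normalization step, the teacher weights would sum to less than 1 and the loss would be scaled by however much probability mass the truncation dropped, which varies from token to token.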
2025-01-30 10:39:18 -05:00 · 2025-01-29 08:36:40 -05:00 · 2025-01-28 19:40:35 -05:00 · 2025-01-27 14:27:35 -05:00 · 2025-01-21 13:09:20 -05:00 · 2025-01-21 11:23:38 -05:00
227 changed files with 6798 additions and 3978 deletions
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -1,6 +1,7 @@
 name: lint
 on:
  # check on PRs, and manual triggers
+  merge_group:
  pull_request:
      paths:
       - '**.py'
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -25,7 +25,6 @@ jobs:
            python_version: "3.11"
            pytorch: 2.3.1
            axolotl_extras: mamba-ssm
-            is_latest: true
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -36,6 +35,7 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
+            is_latest: true
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
@@ -92,7 +92,6 @@ jobs:
            python_version: "3.11"
            pytorch: 2.3.1
            axolotl_extras:
-            is_latest: true
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -103,6 +102,7 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
+            is_latest: true
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -52,7 +52,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.63.64 jinja2
+          pip install modal==0.71.8 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
--- a/.github/workflows/tests-nightly.yml
+++ b/.github/workflows/tests-nightly.yml
@@ -129,7 +129,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.63.64 jinja2
+          pip install modal==0.71.8 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -1,6 +1,7 @@
 name: Tests
 on:
  # check on push/merge to main, PRs, and manual triggers
+  merge_group:
  push:
    branches:
      - "main"
@@ -60,6 +61,15 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4

+      - name: Restore HF cache
+        id: hf-cache-restore
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ runner.os }}-hf-hub-cache-${{ hashFiles('**/conftest.py') }}
+
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
@@ -100,6 +110,15 @@ jobs:
        run: |
          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;

+      - name: Save HF cache
+        id: hf-cache
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
+
  pytest-sdist:
    name: PyTest from Source Dist
    runs-on: ubuntu-latest
@@ -115,6 +134,15 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4

+      - name: Restore HF cache
+        id: hf-cache-restore
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ runner.os }}-hf-hub-cache-${{ hashFiles('**/conftest.py') }}
+
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
@@ -156,6 +184,15 @@ jobs:
        run: |
          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;

+      - name: Save HF cache
+        id: hf-cache
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
+
  docker-e2e-tests-1st:
    if: ${{ ! contains(github.event.commits[0].message, '[skip e2e]') && github.repository_owner == 'axolotl-ai-cloud' }}
    # this job needs to be run on self-hosted GPU runners...
@@ -170,7 +207,7 @@ jobs:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
-            pytorch: 2.4.1
+            pytorch: 2.5.1
            num_gpus: 1
            axolotl_extras:
    steps:
@@ -183,7 +220,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.63.64 jinja2
+          pip install modal==0.71.8 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
@@ -216,7 +253,7 @@ jobs:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
-            pytorch: 2.5.1
+            pytorch: 2.4.1
            num_gpus: 1
            axolotl_extras:
    steps:
@@ -229,7 +266,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.63.64 jinja2
+          pip install modal==0.71.8 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,7 @@
 **/axolotl.egg-info
 configs
 last_run_prepared/
+outputs
 .vscode
 _site/

--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -23,7 +23,7 @@ repos:
    hooks:
    - id: flake8
 -   repo: https://github.com/PyCQA/pylint
-    rev: v2.17.4
+    rev: v3.3.0
    hooks:
    - id: pylint
 -   repo: https://github.com/pre-commit/mirrors-mypy
--- a/.pylintrc
+++ b/.pylintrc
@@ -1,5 +1,5 @@
 [MASTER]
-init-hook="from pylint.config import find_pylintrc; import os, sys; sys.path.append(os.path.dirname(find_pylintrc()))"
+init-hook="from pylint.config import find_default_config_files; import sys; sys.path.append(next(find_default_config_files()).parent.as_posix())"

 [TYPECHECK]

@@ -12,3 +12,4 @@ generated-members=numpy.*, torch.*
 disable=missing-function-docstring, line-too-long, import-error,
    too-many-arguments, too-many-locals, too-many-statements, too-many-branches, too-few-public-methods,
    too-many-instance-attributes, fixme, import-outside-toplevel, logging-fstring-interpolation,
+    too-many-positional-arguments, possibly-used-before-assignment
--- a/cicd/Dockerfile.jinja
+++ b/cicd/Dockerfile.jinja
@@ -8,6 +8,7 @@ ENV PYTORCH_VERSION="{{ PYTORCH_VERSION }}"
 ENV GITHUB_REF="{{ GITHUB_REF }}"
 ENV GITHUB_SHA="{{ GITHUB_SHA }}"
 ENV NIGHTLY_BUILD="{{ NIGHTLY_BUILD }}"
+ENV HF_HOME="{{ HF_HOME }}"

 RUN apt-get update && \
    apt-get install -y --allow-change-held-packages vim curl nano libnccl2 libnccl-dev
--- a/cicd/cicd.sh
+++ b/cicd/cicd.sh
@@ -5,6 +5,6 @@ python -c "import torch; assert '$PYTORCH_VERSION' in torch.__version__"

 pytest -v --durations=10 -n8 --ignore=tests/e2e/ --ignore=tests/patched/ /workspace/axolotl/tests/
 # pytest -v --durations=10 -n8 --dist loadfile /workspace/axolotl/tests/patched/
-pytest -v --durations=10 -n1 --dist loadfile /workspace/axolotl/tests/e2e/patched/
-pytest -v --durations=10 -n1 --dist loadfile /workspace/axolotl/tests/e2e/integrations/
+pytest -v --durations=10 /workspace/axolotl/tests/e2e/patched/
+pytest -v --durations=10 /workspace/axolotl/tests/e2e/integrations/
 pytest -v --durations=10 --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/
--- a/cicd/multigpu.py
+++ b/cicd/multigpu.py
@@ -28,6 +28,7 @@ df_args = {
    "CUDA": os.environ.get("CUDA", "121"),
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
+    "HF_HOME": "/workspace/data/huggingface-cache/hub",
 }

 dockerfile_contents = df_template.render(**df_args)
@@ -48,6 +49,12 @@ cicd_image = (

 app = App("Axolotl CI/CD", secrets=[])

+hf_cache_volume = modal.Volume.from_name(
+    "axolotl-ci-hf-hub-cache", create_if_missing=True
+)
+VOLUME_CONFIG = {
+    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
+}

 N_GPUS = int(os.environ.get("N_GPUS", 2))
 GPU_CONFIG = modal.gpu.H100(count=N_GPUS)
@@ -67,6 +74,7 @@ def run_cmd(cmd: str, run_folder: str):
    timeout=60 * 60,
    cpu=8.0,
    memory=131072 * N_GPUS,
+    volumes=VOLUME_CONFIG,
 )
 def cicd_pytest():
    run_cmd("./cicd/multigpu.sh", "/workspace/axolotl")
--- a/cicd/tests.py
+++ b/cicd/tests.py
@@ -29,6 +29,7 @@ df_args = {
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
+    "HF_HOME": "/workspace/data/huggingface-cache/hub",
 }

 dockerfile_contents = df_template.render(**df_args)
@@ -50,9 +51,15 @@ cicd_image = (

 app = App("Axolotl CI/CD", secrets=[])

+hf_cache_volume = modal.Volume.from_name(
+    "axolotl-ci-hf-hub-cache", create_if_missing=True
+)
+VOLUME_CONFIG = {
+    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
+}

 N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = modal.gpu.A10G(count=N_GPUS)
+GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)


 def run_cmd(cmd: str, run_folder: str):
@@ -69,6 +76,7 @@ def run_cmd(cmd: str, run_folder: str):
    timeout=60 * 60,
    cpu=8.0,
    memory=131072,
+    volumes=VOLUME_CONFIG,
 )
 def cicd_pytest():
    run_cmd("./cicd/cicd.sh", "/workspace/axolotl")
--- a/deepspeed_configs/zero1_torch_compile.json
+++ b/deepspeed_configs/zero1_torch_compile.json
@@ -0,0 +1,27 @@
+{
+  "zero_optimization": {
+    "stage": 1,
+    "overlap_comm": true
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "fp16": {
+    "enabled": "auto",
+    "auto_cast": false,
+    "loss_scale": 0,
+    "initial_scale_power": 32,
+    "loss_scale_window": 1000,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "compile": {
+    "disable": false,
+    "backend": "inductor"
+  },
+  "gradient_accumulation_steps": "auto",
+  "gradient_clipping": "auto",
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu": "auto",
+  "wall_clock_breakdown": false
+}
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -127,34 +127,40 @@ datasets:
    # - tokenizer_default_fallback_*: where * is the name of the chat template to fallback to if the tokenizer does not have a chat template else default to tokenizer. E.g. tokenizer_default_fallback_chatml.
    # - jinja: Uses a custom jinja template for the chat template. The custom jinja template should be provided in the chat_template_jinja field.
    chat_template: tokenizer_default
-    # Custom jinja template for chat template. This will be only used if `chat_template` is set to `jinja` or empty (in which case chat_template is automatically set to `jinja`).
+
+    # Custom jinja chat template. Used only if `chat_template: jinja` or empty.
    chat_template_jinja:
-    # The key in the data example that contains the messages. Default is "messages".
+
+    # Key containing the messages (default: "messages")
    field_messages: messages
-    # The key in the message turn that contains the role. Default is "role".
+    # Key for role in each message (default: "role")
    message_field_role: role
-    # The key in the message turn that contains the content. Default is "content".
+    # Key for content in each message (default:  "content")
    message_field_content: content
-    # Optional[Dict[str, List]]. Roles mapping for the messages.
+
+    # Optional[Dict[str, List]]. Roles mapping in the messages. The default is:
    roles:
      user: ["human", "user"]
-      assistant: ["gpt", "assistant", "ai"]
+      assistant: ["gpt", "assistant"]
      system: ["system"]
+      tool: ["tool"]

-    ## NOTE: Leaving the below empty will default to using the simple legacy tokenization strategy where only last message is trained on.
+    # IMPORTANT: The following fields determine which parts of the conversation to train on.
+    # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
+    # See examples at `docs/dataset-formats/conversation.qmd`
+    # Note: If the below 4 fields are empty, defaults to training only on the last message.

    # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
-    roles_to_train: ["gpt", "assistant"]
+    roles_to_train: ["assistant"]  # default
    # Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
    # - all: train on all EOS tokens
-    # - turn: train on the EOS token at the end of each trainable turn
+    # - turn (default): train on the EOS token at the end of each trainable turn
    # - last: train on the last EOS token in the conversation
    train_on_eos: last
    # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
    message_field_training: training
    # The key in the message turn that contains the training details. Useful to selectively train on certain tokens in a turn.
    # The value of the key is a List[Dict] containing `begin_offset` (start character index in content), `end_offset` (end character index in content), and `train` (boolean whether to train).
-    # See example at `docs/dataset-formats/conversation.qmd`
    message_field_training_detail: train_detail


@@ -239,6 +245,9 @@ sample_packing_group_size: 100000
 # The number of samples which can be packed into one sequence. Increase if using a large sequence_len with many short samples.
 sample_packing_bin_size: 200

+# Use batch flattening for speedups when not using sample_packing
+batch_flattening:
+
 # Passed through to transformers when loading the model when launched without accelerate
 # Use `sequential` when training w/ model parallelism to limit memory
 device_map:
@@ -331,7 +340,8 @@ comet_experiment_config: # Dictionary for additional configuration settings, see
 output_dir: ./completed-model

 # Whether to use torch.compile and which backend to use
-torch_compile:  # bool
+# setting to `auto` will enable torch compile when torch>=2.5.1
+torch_compile:  # Optional[Union[Literal["auto"], bool]]
 torch_compile_backend:  # Optional[str]

 # Training hyperparameters
@@ -363,6 +373,10 @@ eval_table_size: # Approximate number of predictions sent to wandb depending on
 eval_max_new_tokens: # Total number of tokens generated for predictions sent to wandb. Default is 128
 eval_causal_lm_metrics: # HF evaluate metrics used during evaluation. Default is ["sacrebleu", "comet", "ter", "chrf", "perplexity"]

+profiler_steps: # enable the pytorch profiler to capture the first N steps of training to the output_dir.
+                # see https://pytorch.org/blog/understanding-gpu-memory-1/ for more information
+                # snapshots can be visualized @ https://pytorch.org/memory_viz
+
 loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
 loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)

--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -68,6 +68,8 @@ We recommend checking the below examples for other usecases.
 datasets:
  - path: ...
    type: chat_template
+    roles_to_train:
+    train_on_eos:
 ```

 2. Using the `gemma` chat template to override the tokenizer_config.json's chat template on OpenAI messages format, training on all assistant messages.
@@ -77,7 +79,7 @@ chat_template: gemma # this overwrites the tokenizer's chat_template
 datasets:
  - path: ...
    type: chat_template
-    roles_to_train: ["assistant"]
+    roles_to_train: ["assistant"]  # default value
 ```

 3. Using the tokenizer_config.json's chat template or `chatml` as fallback if the former's chat template does not exist, on OpenAI messages format, training on all assistant messages.
@@ -87,7 +89,6 @@ chat_template: tokenizer_default_fallback_chatml # this overwrites the tokenizer
 datasets:
  - path: ...
    type: chat_template
-    roles_to_train: ["assistant"]
 ```

 4. Using a custom jinja template on OpenAI messages format, training on all assistant messages.
@@ -99,7 +100,6 @@ chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message
 datasets:
  - path: ...
    type: chat_template
-    roles_to_train: ["assistant"]
 ```

 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
--- a/docs/dataset-formats/pretraining.qmd
+++ b/docs/dataset-formats/pretraining.qmd
@@ -19,7 +19,14 @@ For pretraining, there is no prompt template or roles.  The only required field
 Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

 ```{.yaml filename="config.yaml"}
-pretraining_dataset: # hf path only
+pretraining_dataset:
+  - name:
+    path:
+    split:
+    text_column: # column in dataset with the data, usually `text`
+    type: pretrain
+    trust_remote_code:
+    skip: # number of rows of data to skip over from the beginning
 ...
 ```

--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -29,7 +29,7 @@ datasets:
    type: chatml.intel
  - path: argilla/ultrafeedback-binarized-preferences
    split: train
-    type: chatml.argilla
+    type: chatml
 ```

 #### IPO
--- a/examples/cerebras/btlm-ft.yml
+++ b/examples/cerebras/btlm-ft.yml
@@ -1,6 +1,10 @@
 base_model: cerebras/btlm-3b-8k-base
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: GPT2Tokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true
 tokenizer_use_fast: true
 tokenizer_legacy: true
--- a/examples/cerebras/qlora.yml
+++ b/examples/cerebras/qlora.yml
@@ -1,4 +1,7 @@
 base_model: cerebras/Cerebras-GPT-1.3B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 load_in_4bit: true
 strict: false
--- a/examples/code-llama/13b/lora.yml
+++ b/examples/code-llama/13b/lora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-13b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/code-llama/13b/qlora.yml
+++ b/examples/code-llama/13b/qlora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-13b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/code-llama/34b/lora.yml
+++ b/examples/code-llama/34b/lora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-34b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/code-llama/34b/qlora.yml
+++ b/examples/code-llama/34b/qlora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-34b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/code-llama/7b/lora.yml
+++ b/examples/code-llama/7b/lora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/code-llama/7b/qlora.yml
+++ b/examples/code-llama/7b/qlora.yml
@@ -1,6 +1,9 @@
 base_model: codellama/CodeLlama-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: CodeLlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/dbrx/16bit-lora.yaml
+++ b/examples/dbrx/16bit-lora.yaml
@@ -1,4 +1,7 @@
 base_model: LnL-AI/dbrx-base-converted-v2
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/dbrx/8bit-lora.yaml
+++ b/examples/dbrx/8bit-lora.yaml
@@ -1,4 +1,7 @@
 base_model: LnL-AI/dbrx-base-converted-v2
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: true
--- a/examples/dbrx/fft-ds-zero3.yaml
+++ b/examples/dbrx/fft-ds-zero3.yaml
@@ -1,4 +1,7 @@
 base_model: LnL-AI/dbrx-base-converted-v2
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/deepseek-v2/fft-fsdp-16b.yaml
+++ b/examples/deepseek-v2/fft-fsdp-16b.yaml
@@ -1,4 +1,6 @@
 base_model: deepseek-ai/DeepSeek-V2-Lite
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/deepseek-v2/qlora-fsdp-2_5.yaml
+++ b/examples/deepseek-v2/qlora-fsdp-2_5.yaml
@@ -1,4 +1,7 @@
 base_model: axolotl-quants/DeepSeek-V2.5-bnb-nf4-bf16
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/falcon/config-7b-lora.yml
+++ b/examples/falcon/config-7b-lora.yml
@@ -1,7 +1,12 @@
 base_model: tiiuae/falcon-7b
-trust_remote_code: true
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
+trust_remote_code: true

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/falcon/config-7b-qlora.yml
+++ b/examples/falcon/config-7b-qlora.yml
@@ -1,10 +1,15 @@
 # 1b: tiiuae/falcon-rw-1b
 # 40b: tiiuae/falcon-40b
 base_model: tiiuae/falcon-7b
-# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
-trust_remote_code: true
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
+trust_remote_code: true
+

 load_in_8bit: false
 # enable 4bit for QLoRA
--- a/examples/falcon/config-7b.yml
+++ b/examples/falcon/config-7b.yml
@@ -1,7 +1,12 @@
 base_model: tiiuae/falcon-7b
-trust_remote_code: true
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
+trust_remote_code: true

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/gemma/qlora.yml
+++ b/examples/gemma/qlora.yml
@@ -1,7 +1,10 @@
 # use google/gemma-7b if you have access
 base_model: mhenrichsen/gemma-7b
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/gemma2/qlora.yml
+++ b/examples/gemma2/qlora.yml
@@ -1,6 +1,9 @@
 base_model: google/gemma-2-9b
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/gemma2/reward-model.yaml
+++ b/examples/gemma2/reward-model.yaml
@@ -1,6 +1,9 @@
 base_model: google/gemma-2-2b
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForSequenceClassification
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/gptj/qlora.yml
+++ b/examples/gptj/qlora.yml
@@ -1,4 +1,7 @@
 base_model: EleutherAI/gpt-j-6b
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 load_in_4bit: true
 strict: false
--- a/examples/jamba/qlora.yaml
+++ b/examples/jamba/qlora.yaml
@@ -1,4 +1,7 @@
 base_model: ai21labs/Jamba-v0.1
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/jamba/qlora_deepspeed.yaml
+++ b/examples/jamba/qlora_deepspeed.yaml
@@ -1,4 +1,6 @@
 base_model: ai21labs/Jamba-v0.1
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/jamba/qlora_fsdp_large.yaml
+++ b/examples/jamba/qlora_fsdp_large.yaml
@@ -1,5 +1,8 @@
 base_model: ai21labs/AI21-Jamba-1.5-Large
+# optionally might have model_type or tokenizer_type
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_4bit: true
 strict: false
--- a/examples/jeopardy-bot/config.yml
+++ b/examples/jeopardy-bot/config.yml
@@ -1,6 +1,10 @@
 base_model: huggyllama/llama-7b
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 datasets:
  - path: openaccess-ai-collective/jeopardy
--- a/examples/llama-2/fft_optimized.yml
+++ b/examples/llama-2/fft_optimized.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/llama-2/gptq-lora.yml
+++ b/examples/llama-2/gptq-lora.yml
@@ -1,8 +1,13 @@
 base_model: TheBloke/Llama-2-7B-GPTQ
-gptq: true
-gptq_disable_exllama: true
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+gptq: true
+gptq_disable_exllama: true
+
 tokenizer_use_fast: true
 tokenizer_legacy: true
 load_in_8bit: false
--- a/examples/llama-2/lisa.yml
+++ b/examples/llama-2/lisa.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/llama-2/loftq.yml
+++ b/examples/llama-2/loftq.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/llama-2/lora.yml
+++ b/examples/llama-2/lora.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-2/qlora-fsdp.yml
+++ b/examples/llama-2/qlora-fsdp.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/llama-2/qlora.yml
+++ b/examples/llama-2/qlora.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Llama-2-7b-hf
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/llama-3-vision/lora-11b.yaml
+++ b/examples/llama-3-vision/lora-11b.yaml
@@ -1,5 +1,9 @@
 base_model: alpindale/Llama-3.2-11B-Vision-Instruct
+# optionally might have model_type or tokenizer_type or processor_type
 processor_type: AutoProcessor
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 strict: false

 # these 3 lines are needed for now to handle vision chat templates w images
--- a/examples/llama-3/fft-8b-liger-fsdp.yaml
+++ b/examples/llama-3/fft-8b-liger-fsdp.yaml
@@ -1,4 +1,6 @@
 base_model: NousResearch/Meta-Llama-3.1-8B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 plugins:
  - axolotl.integrations.liger.LigerPlugin
--- a/examples/llama-3/fft-8b.yaml
+++ b/examples/llama-3/fft-8b.yaml
@@ -1,4 +1,6 @@
 base_model: NousResearch/Meta-Llama-3.1-8B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/llama-3/instruct-dpo-lora-8b.yml
+++ b/examples/llama-3/instruct-dpo-lora-8b.yml
@@ -1,6 +1,9 @@
 base_model: meta-llama/Meta-Llama-3-8B-Instruct
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-3/instruct-lora-8b.yml
+++ b/examples/llama-3/instruct-lora-8b.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Meta-Llama-3-8B-Instruct
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-3/lora-1b-deduplicate-dpo.yml
+++ b/examples/llama-3/lora-1b-deduplicate-dpo.yml
@@ -1,6 +1,9 @@
 base_model: meta-llama/Llama-3.2-1B
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-3/lora-1b-deduplicate-sft.yml
+++ b/examples/llama-3/lora-1b-deduplicate-sft.yml
@@ -1,6 +1,9 @@
 base_model: meta-llama/Llama-3.2-1B
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-3/lora-1b.yml
+++ b/examples/llama-3/lora-1b.yml
@@ -1,4 +1,6 @@
 base_model: NousResearch/Llama-3.2-1B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/llama-3/lora-8b.yml
+++ b/examples/llama-3/lora-8b.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Meta-Llama-3-8B
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/llama-3/qlora-1b-kto.yaml
+++ b/examples/llama-3/qlora-1b-kto.yaml
@@ -1,4 +1,6 @@
 base_model: meta-llama/Llama-3.2-1B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/llama-3/qlora-1b.yml
+++ b/examples/llama-3/qlora-1b.yml
@@ -1,4 +1,6 @@
 base_model: NousResearch/Llama-3.2-1B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/llama-3/qlora-fsdp-405b.yaml
+++ b/examples/llama-3/qlora-fsdp-405b.yaml
@@ -1,5 +1,8 @@
 base_model: hugging-quants/Meta-Llama-3.1-405B-BNB-NF4-BF16
+# optionally might have model_type or tokenizer_type
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_4bit: true
 strict: false
--- a/examples/llama-3/qlora-fsdp-70b.yaml
+++ b/examples/llama-3/qlora-fsdp-70b.yaml
@@ -1,6 +1,9 @@
 base_model: casperhansen/llama-3-70b-fp16
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: AutoTokenizer  # PreTrainedTokenizerFast
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/llama-3/qlora.yml
+++ b/examples/llama-3/qlora.yml
@@ -1,6 +1,9 @@
 base_model: NousResearch/Meta-Llama-3-8B
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/mamba/config.yml
+++ b/examples/mamba/config.yml
@@ -1,7 +1,10 @@
 base_model: state-spaces/mamba-2.8b
+# optionally might have model_type or tokenizer_type or tokenizer_config
 model_type: MambaLMHeadModel
 tokenizer_type: AutoTokenizer
 tokenizer_config: EleutherAI/gpt-neox-20b
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/mistral/bigstral-ds-zero3.yaml
+++ b/examples/mistral/bigstral-ds-zero3.yaml
@@ -1,6 +1,10 @@
 base_model: mistral-community/Mixtral-8x22B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/mistral/config.yml
+++ b/examples/mistral/config.yml
@@ -1,6 +1,9 @@
 base_model: mistralai/Mistral-7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/mistral/lora-mps.yml
+++ b/examples/mistral/lora-mps.yml
@@ -1,6 +1,9 @@
 base_model: mistralai/Mistral-7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/mistral/lora.yml
+++ b/examples/mistral/lora.yml
@@ -1,6 +1,9 @@
 base_model: mistralai/Mistral-7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/mistral/mistral-dpo-qlora.yml
+++ b/examples/mistral/mistral-dpo-qlora.yml
@@ -4,8 +4,11 @@
 #face problems with the special tokens.

 base_model: mistralai/Mistral-7B-Instruct-v0.2
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/mistral/mistral-qlora-fsdp.yml
+++ b/examples/mistral/mistral-qlora-fsdp.yml
@@ -1,6 +1,10 @@
 base_model: mistralai/Mixtral-8x7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/mistral/mistral-qlora-orpo.yml
+++ b/examples/mistral/mistral-qlora-orpo.yml
@@ -1,6 +1,9 @@
 base_model: mistralai/Mistral-7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
@@ -1,6 +1,9 @@
 base_model: mistral-community/Mixtral-8x22B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/mistral/mixtral-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-qlora-fsdp.yml
@@ -1,6 +1,10 @@
 base_model: mistralai/Mixtral-8x7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/mistral/mixtral.yml
+++ b/examples/mistral/mixtral.yml
@@ -1,6 +1,10 @@
 base_model: mistralai/Mixtral-8x7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/mistral/mixtral_22.yml
+++ b/examples/mistral/mixtral_22.yml
@@ -1,6 +1,10 @@
 base_model: mistral-community/Mixtral-8x22B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/mistral/qlora.yml
+++ b/examples/mistral/qlora.yml
@@ -1,6 +1,9 @@
 base_model: mistralai/Mistral-7B-v0.1
+# optionally might have model_type or tokenizer_type
 model_type: MistralForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/mpt-7b/config.yml
+++ b/examples/mpt-7b/config.yml
@@ -1,5 +1,9 @@
 base_model: mosaicml/mpt-7b
+# optionally might have model_type or tokenizer_type
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true  # required for mpt as their model class is not merged into transformers yet
 load_in_8bit: false
 datasets:
--- a/examples/openllama-3b/config.yml
+++ b/examples/openllama-3b/config.yml
@@ -1,6 +1,10 @@
 base_model: openlm-research/open_llama_3b_v2
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 load_in_4bit: false
 strict: false
--- a/examples/openllama-3b/lora.yml
+++ b/examples/openllama-3b/lora.yml
@@ -1,6 +1,10 @@
 base_model: openlm-research/open_llama_3b_v2
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: true
 load_in_4bit: false
 strict: false
--- a/examples/openllama-3b/qlora.yml
+++ b/examples/openllama-3b/qlora.yml
@@ -1,6 +1,10 @@
 base_model: openlm-research/open_llama_3b_v2
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 load_in_4bit: true
 strict: false
--- a/examples/phi/lora-3.5.yaml
+++ b/examples/phi/lora-3.5.yaml
@@ -1,6 +1,9 @@
 base_model: microsoft/Phi-3.5-mini-instruct
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/phi/phi-ft.yml
+++ b/examples/phi/phi-ft.yml
@@ -1,6 +1,9 @@
 base_model: microsoft/phi-1_5
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/phi/phi-qlora.yml
+++ b/examples/phi/phi-qlora.yml
@@ -1,6 +1,9 @@
 base_model: microsoft/phi-1_5
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/phi/phi2-ft.yml
+++ b/examples/phi/phi2-ft.yml
@@ -1,6 +1,9 @@
 base_model: microsoft/phi-2
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/phi/phi3-ft-fsdp.yml
+++ b/examples/phi/phi3-ft-fsdp.yml
@@ -1,6 +1,9 @@
 base_model: microsoft/Phi-3-mini-4k-instruct
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: false
--- a/examples/phi/phi3-ft.yml
+++ b/examples/phi/phi3-ft.yml
@@ -1,7 +1,11 @@
 base_model: microsoft/Phi-3-mini-4k-instruct
+# optionally might have model_type or tokenizer_type
 trust_remote_code: true
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 chat_template: phi_3

 load_in_8bit: false
--- a/examples/pythia-12b/config.yml
+++ b/examples/pythia-12b/config.yml
@@ -1,7 +1,11 @@
 base_model: EleutherAI/pythia-12b-deduped
 base_model_ignore_patterns: pytorch*  # prefer safetensors
+# optionally might have model_type or tokenizer_type
 model_type: GPTNeoXForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: false
 load_in_4bit: false
 gptq: false
--- a/examples/pythia/lora.yml
+++ b/examples/pythia/lora.yml
@@ -1,4 +1,7 @@
 base_model: EleutherAI/pythia-1.4b-deduped
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 load_in_8bit: true
 datasets:
  - path: teknium/GPT4-LLM-Cleaned
--- a/examples/qwen/lora.yml
+++ b/examples/qwen/lora.yml
@@ -1,6 +1,9 @@
 base_model: Qwen/Qwen-7B
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 trust_remote_code: true

--- a/examples/qwen/qlora.yml
+++ b/examples/qwen/qlora.yml
@@ -1,6 +1,9 @@
 base_model: Qwen/Qwen-7B
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 trust_remote_code: true

--- a/examples/qwen/qwen2-moe-lora.yaml
+++ b/examples/qwen/qwen2-moe-lora.yaml
@@ -1,4 +1,7 @@
 base_model: Qwen/Qwen1.5-MoE-A2.7B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/qwen/qwen2-moe-qlora.yaml
+++ b/examples/qwen/qwen2-moe-qlora.yaml
@@ -1,4 +1,7 @@
 base_model: Qwen/Qwen1.5-MoE-A2.7B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/qwen2/dpo.yaml
+++ b/examples/qwen2/dpo.yaml
@@ -1,4 +1,6 @@
 base_model: Qwen/Qwen2.5-0.5B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 strict: false

--- a/examples/qwen2/qlora-fsdp.yaml
+++ b/examples/qwen2/qlora-fsdp.yaml
@@ -1,4 +1,7 @@
 base_model: Qwen/Qwen2-7B
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/redpajama/config-3b.yml
+++ b/examples/redpajama/config-3b.yml
@@ -1,6 +1,10 @@
 base_model: togethercomputer/RedPajama-INCITE-Chat-3B-v1
+# optionally might have model_type or tokenizer_type
 model_type: GPTNeoXForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code:
 load_in_8bit: false
 datasets:
--- a/examples/replit-3b/config-lora.yml
+++ b/examples/replit-3b/config-lora.yml
@@ -1,4 +1,7 @@
 base_model: replit/replit-code-v1-3b
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true
 load_in_8bit: false
 datasets:
--- a/examples/stablelm-2/1.6b/fft.yml
+++ b/examples/stablelm-2/1.6b/fft.yml
@@ -1,6 +1,10 @@
 base_model: stabilityai/stablelm-2-1_6b
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: false
--- a/examples/stablelm-2/1.6b/lora.yml
+++ b/examples/stablelm-2/1.6b/lora.yml
@@ -1,6 +1,10 @@
 base_model: stabilityai/stablelm-2-1_6b
+# optionally might have model_type or tokenizer_type
 model_type: AutoModelForCausalLM
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
 trust_remote_code: true

 load_in_8bit: true
--- a/examples/starcoder2/qlora.yml
+++ b/examples/starcoder2/qlora.yml
@@ -1,4 +1,6 @@
 base_model: bigcode/starcoder2-3b
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: false
 load_in_4bit: true
--- a/examples/tiny-llama/lora-mps.yml
+++ b/examples/tiny-llama/lora-mps.yml
@@ -1,6 +1,9 @@
 base_model: TinyLlama/TinyLlama_v1.1
+# optionally might have model_type or tokenizer_type
 model_type: LlamaForCausalLM
 tokenizer_type: LlamaTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
--- a/examples/tiny-llama/lora.yml
+++ b/examples/tiny-llama/lora.yml
@@ -1,5 +1,8 @@
 base_model: TinyLlama/TinyLlama_v1.1
+# optionally might have model_type or tokenizer_type
 tokenizer_type: AutoTokenizer
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name

 load_in_8bit: true
 load_in_4bit: false
Author	SHA1	Message	Date
Wing Lian	f11227a35a	various fixes	2025-01-30 10:39:18 -05:00
Wing Lian	c434951dd6	Always re-normalize teacher distribution	2025-01-29 08:36:40 -05:00
Wing Lian	42d4732aaf	kd loss needs to be calculated in full precision	2025-01-28 19:40:35 -05:00
Wing Lian	2c9dfbed2e	apply z-score scaling to kd	2025-01-27 14:27:35 -05:00
Wing Lian	4e4a16cd8a	fix finding the top-k rather than assuming first position has the correct val	2025-01-21 13:09:20 -05:00
Wing Lian	67c1c8405e	use iter instead of tuple	2025-01-21 11:23:38 -05:00
Wing Lian	bded6df509	change up logic so we always truncate to top_k	2025-01-21 11:20:01 -05:00
Wing Lian	bb5e6f4b72	make sure to truncate logprobs if there are more than top_k	2025-01-21 10:26:27 -05:00
Wing Lian	32258c247e	no batching for kd chat templates	2025-01-15 08:22:29 -05:00
Wing Lian	04efcb102f	don't shift student logits for kd	2025-01-15 01:07:48 -05:00
Wing Lian	483defb9ae	try tests for kd on l40s	2025-01-14 23:56:00 -05:00
Wing Lian	35a84f2cb8	more fixes	2025-01-14 22:47:49 -05:00
Wing Lian	510cf45317	improve logprob masking and shift in trainer	2025-01-14 22:47:48 -05:00
Wing Lian	7232cbdeab	chore: lint	2025-01-14 22:47:48 -05:00
Wing Lian	e8fceb7091	chore: lint	2025-01-14 22:47:48 -05:00
Wing Lian	a5e0671738	make sure to use tensorboard to capture loss for checks	2025-01-14 22:47:48 -05:00
Wing Lian	b9847553af	fix adapter model check	2025-01-14 22:47:48 -05:00
Wing Lian	513ec9e03b	make sure to use the correct tokenizer	2025-01-14 22:47:48 -05:00
Wing Lian	530347856d	make sure to set tokenizer from l3 70b and save safetensors	2025-01-14 22:47:47 -05:00
Wing Lian	261e4fb619	lower lr	2025-01-14 22:47:47 -05:00
Wing Lian	158071e95f	set lora_dropout explicitly	2025-01-14 22:47:47 -05:00
Wing Lian	432f65f5e6	make the kd e2e fit in vram for ci and add lora version	2025-01-14 22:47:47 -05:00
Wing Lian	1d039f5486	rename test files so it gets picked up	2025-01-14 22:47:47 -05:00
Wing Lian	b9a42b396f	linting	2025-01-14 22:47:47 -05:00
Wing Lian	ff2fb0fc1b	add kd trainer e2e test	2025-01-14 22:47:47 -05:00
Wing Lian	317f290186	reward model doesn't work well with batched	2025-01-14 22:47:46 -05:00
Wing Lian	ab690f3f01	improve check for batched	2025-01-14 22:47:46 -05:00
Wing Lian	47932f21c4	fix reward trainer calls for tokenization	2025-01-14 22:47:46 -05:00
Wing Lian	808328e041	reward can use same batch check	2025-01-14 22:47:46 -05:00
Wing Lian	6784822cfb	tweak check for batched prompt data	2025-01-14 22:47:46 -05:00
Wing Lian	684b38291f	ensure that batch vs single is done properly	2025-01-14 22:47:46 -05:00
Wing Lian	01896b1bde	improve iterable support	2025-01-14 22:47:46 -05:00
Wing Lian	e659c01646	support streaming for processing sft datasts?	2025-01-14 22:47:45 -05:00
Wing Lian	204d6c43b4	make loss torch script compat	2025-01-14 22:47:45 -05:00
Wing Lian	d3c2b7ce9d	kd sample packing	2025-01-14 22:47:45 -05:00
Wing Lian	93dfff92f1	be a bit pickier about loading dynamic prompt strategies	2025-01-14 22:47:45 -05:00
Wing Lian	6e409d2d88	more info on preprocess for kd and fix import	2025-01-14 22:47:45 -05:00
Wing Lian	d5bc214300	remove duplicate code	2025-01-14 22:47:45 -05:00
Wing Lian	92c6c1087e	add copyrights	2025-01-14 22:47:45 -05:00
Wing Lian	feed96f95e	increase logging around loading plugins	2025-01-14 22:47:44 -05:00
Wing Lian	cba6165ae1	make plugin setup concise	2025-01-14 22:47:44 -05:00
Wing Lian	cdfcd69afa	remove moved class from import	2025-01-14 22:47:44 -05:00
Wing Lian	885653d52e	move more things to kd plugin	2025-01-14 22:47:44 -05:00
Wing Lian	27faacbf5a	refactor kd chat template loader	2025-01-14 22:47:44 -05:00
Wing Lian	c51b0337c1	support for custom trainer classes from plugins	2025-01-14 22:47:44 -05:00
Wing Lian	fa055f9f69	handle token/logprob shifting	2025-01-14 22:47:43 -05:00
Wing Lian	f60c623af0	remove references to triton kd for now	2025-01-14 22:47:43 -05:00
Wing Lian	746891eb5c	add license block	2025-01-14 22:47:43 -05:00
Wing Lian	f09b5da60b	refactor so we can easily add new loss functions	2025-01-14 22:47:43 -05:00
Wing Lian	689e1c10ba	chore: lint	2025-01-14 22:47:43 -05:00
Wing Lian	a5c085e003	var naming and add todo	2025-01-14 22:47:43 -05:00
Wing Lian	63146300b7	fix kd loss so it's causal (fixes repeating tokens)	2025-01-14 22:47:43 -05:00
Wing Lian	ca5e397fc5	use kd_alpha in the correct loss method	2025-01-14 22:47:42 -05:00
Wing Lian	3416302b0d	hash for temperature too	2025-01-14 22:47:42 -05:00
Wing Lian	7366efc4ca	better rescaling for temperatures	2025-01-14 22:47:42 -05:00
Wing Lian	d8d817eaed	don't use triton for now	2025-01-14 22:47:42 -05:00
Wing Lian	c0757e8a20	fix kwarg	2025-01-14 22:47:42 -05:00
Wing Lian	e565694914	v3	2025-01-14 22:47:42 -05:00
Wing Lian	081928e55b	no torch.tensor	2025-01-14 22:47:42 -05:00
Wing Lian	dc90c93894	no log etc	2025-01-14 22:47:41 -05:00
Wing Lian	18a46c338a	no torch.exp inside triton kernel	2025-01-14 22:47:41 -05:00
Wing Lian	119d586cf4	v2 trial	2025-01-14 22:47:41 -05:00
Wing Lian	c73acd7de0	no where support	2025-01-14 22:47:41 -05:00
Wing Lian	0b59a242d4	triton wip	2025-01-14 22:47:41 -05:00
Wing Lian	ed490517da	chore: lint	2025-01-14 22:47:41 -05:00
Wing Lian	00ce77e7ef	make sure to multiply against the correct loss	2025-01-14 22:47:41 -05:00
Wing Lian	ae545e0165	cross entropy loss coefficient during KD	2025-01-14 22:47:40 -05:00
Wing Lian	b592c05b93	flipped the slice	2025-01-14 22:47:40 -05:00
Wing Lian	7fe0ad088b	make it work	2025-01-14 22:47:40 -05:00
Wing Lian	ddcf5c68b3	handle padding/collation for KD datasets	2025-01-14 22:47:40 -05:00
Wing Lian	e633a12dbe	make batch smaller	2025-01-14 22:47:40 -05:00
Wing Lian	d584354ee4	filter bad rows	2025-01-14 22:47:40 -05:00
Wing Lian	303cfa71aa	KD dataset loading and KD with logprobs	2025-01-14 22:47:40 -05:00
Wing Lian	88b3198894	refactor trainer to prevent circular dependencies later fix loader default	2025-01-14 22:47:39 -05:00
jwongTensora	8606093921	fix for indexing error from token/embeddings mismatch (#2257) Co-authored-by: jwong <jwongTensora@gmail.com>	2025-01-14 22:09:29 -05:00
NanoCode012	cba5a457d9	fix: use text_column even when not packing for pretraining (#2254) * fix: use text_column even when not packing for pretraining * feat: update test to check when not packing * chore: lint * Update src/axolotl/utils/data/pretraining.py Co-authored-by: Wing Lian <wing.lian@gmail.com> --------- Co-authored-by: Wing Lian <wing@axolotl.ai> Co-authored-by: Wing Lian <wing.lian@gmail.com>	2025-01-14 22:08:56 -05:00
Wing Lian	19cd83d408	rename references to dpo dataset prep to pref data (#2258)	2025-01-14 22:07:55 -05:00
Dan Saunders	1ed4de73b6	CLI cleanup and documentation (#2244) * CLI init refactor * fix * cleanup and (partial) docs * Adding documentation and continuing cleanup (in progress) * remove finetune.py script * continued cleanup and documentation * pytest fixes * review comments * fix * Fix * typing fixes * make sure the batch dataset patcher for multipack is always loaded when handling datasets * review comments * fix --------- Co-authored-by: Dan Saunders <dan@axolotl.ai> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-01-13 17:55:29 +00:00
Wing Lian	f89e962119	skip over rows in pretraining dataset (#2223) * skip over rows in pretraining dataset * update docs	2025-01-13 10:44:45 -05:00
Wing Lian	bc1c9c20e3	assume empty lora dropout means 0.0 and add tests (#2243) * assume empty lora dropout means 0.0 and add tests * remove un-necessary arg * refactor based on pr feedback: * chore: lint	2025-01-13 10:44:11 -05:00
Wing Lian	dd26cc3c0f	add helper to verify the correct model output file exists (#2245) * add helper to verify the correct model output file exists * more checks using helper * chore: lint * fix import and relora model check * workaround for trl trainer saves * remove stray print	2025-01-13 10:43:29 -05:00
Wing Lian	d8b4027200	use 2.5.1 docker images as latest tag as it seems stable (#2198)	2025-01-10 08:35:25 -05:00
Wing Lian	fb3352e21c	rename liger test so it properly runs in ci (#2246)	2025-01-09 17:31:43 -05:00
NanoCode012	ed77e7001e	feat: add support for data_files in pretraining (#2238)	2025-01-09 21:04:13 +00:00
Wing Lian	7669a03fb4	update upstream HF deps (#2239) * bump axolotl contribs for upstream main conflicts: * bump datasets, tokenizer, trl * remove log workarounds in trl * bump lm-eval * remove unsloth_ import from critical path * remove llama fa2 from conftest * unsloth breaks with latest upstream	2025-01-09 21:01:59 +00:00
Vincenzo di Cicco	6553683170	Use SequentialSampler if curriculum_sampling is enabled with sample_packing (#2235)	2025-01-09 21:01:22 +00:00
Wing Lian	5e0124e2ab	update modal version for ci (#2242)	2025-01-09 21:01:02 +00:00
NanoCode012	2e8d7c1adb	fix: mistral nemo does not recognize token_type_ids in forward (#2233)	2025-01-09 21:00:36 +00:00
Wing Lian	3c1921e400	add hf cache caching for GHA (#2247) * add hf cache caching for GHA * use modal volume to cache hf data * make sure to update the cache as we add new fixtures in conftest	2025-01-09 20:59:54 +00:00
Wing Lian	7faf2b6e8e	Merge group queue (#2248) * add support for merge groups * also lint merge groups	2025-01-09 15:49:00 -05:00
salman	c1b920f291	Fixing OSX installation (#2231) * bumping version, removing non-osx compatible deps * updating pylintrc * fixing linters * reverting changes	2025-01-07 13:42:01 +00:00
Wing Lian	3915abee4c	make sure padding is labeled as -100 for pretraining (#2227)	2024-12-31 15:22:18 -05:00
NJordan72	7a38dbe674	fix: allow trainer builder to use custom jinja chat template (#2219) * fix: allow trainer builder to use custom jinja chat template * chore: use get_chat_template_from_config Co-authored-by: Chirag Jain <jain.chirag925@gmail.com> * fix: swap imports --------- Co-authored-by: Chirag Jain <jain.chirag925@gmail.com>	2024-12-24 16:18:50 -05:00
Wing Lian	e0a2eb2ebd	fix untrained tokens if specified explicitly from a list (#2210)	2024-12-23 09:08:28 -05:00
Wing Lian	d852d7af7a	inference - don't default w accelerate, fix base model (#2216) [skip ci]	2024-12-23 07:48:41 -05:00
Wing Lian	3742deb1de	add deepspeed example with torch compile enabled (#2212) [skip ci]	2024-12-22 12:11:39 -05:00
Wing Lian	2312caaa98	GC every n steps (#2209)	2024-12-21 17:38:33 -05:00
Wing Lian	307cf7c685	move the dataset loading from remote/disk to a shared function so we can re-use for RL (#2204)	2024-12-20 21:43:52 -05:00
Dan Saunders	70541145f1	adding test_datasets compat with pretraining_dataset (streaming) (#2206) [skip ci]	2024-12-20 21:43:33 -05:00
Wing Lian	42bd32a233	add outputs (symlink) to gitignore [skip ci] (#2205)	2024-12-19 20:14:43 -05:00
Dan Saunders	5b8fb5e939	remove cicd pytest xdist args (#2201) * remove cicd pytest xdist args * Delete outputs	2024-12-19 11:44:53 -05:00
Wing Lian	bd2a594b89	use DataCollatorWithFlattening when not sample packing (#2167)	2024-12-17 17:46:44 -05:00
Wing Lian	3798229d85	handle torch_compile set to auto (#2172) [skip ci] * handle torch_compile set to auto * update docs [skip ci] * add tests	2024-12-17 16:42:41 -05:00
NanoCode012	10cfecf02e	fix: use apply_chat_template to find turn boundaries and allow tool_calling field (#2179) [skip ci] * fix: use apply_chat_template to find turn boundaries and allow tool_calling field * fix: keys to include in turn * feat(doc): explicitly recommend setting train_on_eos and roles_to_train * fix: eos not being masked for tool due to template padding * chore: clear up docs * fix: default messages format, train_on_eos: turn, and train on all assistant msg * fix: properly warn if empty content * feat: parametrize chat_template tests to test different tokenizers * fix: set proper default for message key * fix: update defaults to match load function * fix: change defaults to use new * feat: add tool_calling dataset * feat: add tool_calling test * fix: add handling of edge case of mistral tokenizer with only system prompt * feat: refactor all test to follow source code * fix: remove unnecessary eos_token from phi35 * fix test for phi3.5 since eos was dropped from chat_template --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2024-12-17 16:42:21 -05:00
Wing Lian	339f3c67e2	dataset tags don't support https uris (#2195)	2024-12-17 13:58:53 -05:00
Wing Lian	d91feaffc8	upgrade to liger 0.5.2 (#2181) [skip ci]	2024-12-17 13:58:21 -05:00
Wing Lian	e246ceffa4	use axolotl contribs for fix_untrained_tokens (#2194) [skip ci] * use axolotl contribs for fix_untrained_tokens * remove the module we're replacing * Add check for using fix_untrained_tokens	2024-12-17 13:57:16 -05:00
Wing Lian	8ddc18ec8d	move the setting of PYTORCH_CUDA_ALLOC_CONF to the cli rather than train module (#2183) [skip ci] * move the setting of PYTORCH_CUDA_ALLOC_CONF to the cli rather than train module * move set_pytorch_cuda_alloc_conf to a different module to have fewer loaded dependencies for the CLI	2024-12-17 13:56:48 -05:00
Sunny Liu	1c14c4a15c	Add hub model id config options to all example yml files (#2196) [skip ci] * added hub model_id in example yml * add hub model id to example yml	2024-12-17 11:24:30 -05:00
Wing Lian	1f623e6cc8	transformers 4.47.1 (#2187) * transformers 4.47.1 * drop monkeypatches * can't remove patches yet * make flash attention forward ignore the loss kwargs * patch the flash attention in the modeling arch too * remove fsdp and deepspeed patches * cleanup PR * bump accelerate and torchao, also logically reorder/group requirements * meant to include torchao * use official patch release	2024-12-17 11:01:21 -05:00
Dan Saunders	f865464ae5	Basic evaluate CLI command / codepath (#2188) * basic evaluate CLI command / codepath * tests for evaluate CLI command * fixes and cleanup * review comments; slightly DRYing up things --------- Co-authored-by: Dan Saunders <dan@axolotl.ai>	2024-12-16 15:46:31 -05:00
Wing Lian	33090486d7	[feature] add pytorch profiling (#2182) * add pytorch profiling * kick off the profiler asap since things may get allcoated before train start * document feature * add url for visualizer [skip ci]	2024-12-16 12:38:43 -05:00
Wing Lian	effc4dc409	pin to 4.47.0 (#2180)	2024-12-12 20:17:12 -05:00