Compare commits


29 Commits

Author SHA1 Message Date
Dan Saunders
30981328fc draft config for devstral 2025-05-23 20:04:21 +00:00
Dan Saunders
b5f1e53a0f models.py -> loaders/ module refactor (#2680)
* models.py -> loaders/ module refactor

* refactor ModelLoader class

* plugin manager changes

* circular import fix

* pytest

* pytest

* minor improvements

* fix

* minor changes

* fix test

* remove dead code

* coderabbit comments

* lint

* fix

* coderabbit suggestion I liked

* more coderabbit

* review comments, yak shaving

* lint

* updating in light of SP ctx manager changes

* review comment

* review comment 2
2025-05-23 15:51:11 -04:00
Dan Saunders
8cde256db2 Remove unused const (#2714)
* remove unused const

* accidentally committed benchmark plot
2025-05-23 12:27:38 -04:00
Dan Saunders
5f8f817200 SP context manager update (#2699)
* utilize accelerate prepare_data_loader with patching

* lint

* cleanup, fix

* update to support DPO quirk

* coderabbit commits, cleanup, remove dead code

* fix

* move ring attn patching to sp ctx manager

* lint

* lint

* test fix

* test fix
2025-05-22 11:18:32 -04:00
NanoCode012
aa0492c366 feat: do not find turn indices if turn is not trainable (#2696)
* feat: do not find turn indices if turn is not trainable

* fix: handle edge case where train on eos/eot is all

* fix: improve warning message
2025-05-22 19:19:59 +07:00
NanoCode012
798b5f5cfd fix(RL): address plugin rl overwriting trainer_cls (#2697) [skip ci]
* fix: plugin rl overwrite trainer_cls

* feat(test): add test to catch trainer_cls is not None
2025-05-22 19:19:12 +07:00
NanoCode012
1c83a1a020 feat(doc): clarify minimum pytorch and cuda to use blackwell (#2704) [skip ci] 2025-05-22 19:18:27 +07:00
Dan Saunders
6aa41740df SP dataloader patching + removing custom sampler / dataloader logic (#2686)
* utilize accelerate prepare_data_loader with patching

* lint

* cleanup, fix

* update to support DPO quirk

* small change

* coderabbit commits, cleanup, remove dead code

* quarto fix

* patch fix

* review comments

* moving monkeypatch up one level

* fix
2025-05-21 11:20:20 -04:00
Wing Lian
a27b909c5c GRPO fixes (peft) (#2676)
* don't set peft_config on grpo to prevent double peft wrap

* remove overrides needed to support bug

* fix grpo tests

* require more CPU for multigpu to help with torch compile for vllm
2025-05-16 15:47:03 -04:00
xzuyn
6cb07b9d12 Fix for setting adam_beta3 and adam_epsilon2 for CAME Optimizer (#2654) [skip ci]
* make setting `adam_beta3` and `adam_epsilon2` work correctly

* update config docs so users know args are specific to CAME optim

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:50 -04:00
C080
288653adb6 Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifacts setting (#2675) [skip ci]
* Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifacts setting

* cleanup and lint

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:31 -04:00
NanoCode012
3a5b495a74 Fix: improve doc on merge/inference cli visibility (#2674)
* feat: improve visibility for merge doc

* feat: add tip on reuse config between modes
2025-05-16 13:07:40 -04:00
xzuyn
f661858fc4 Print dataset name (#2668) [skip ci] 2025-05-16 13:06:58 -04:00
Eric Meier
c837c4a424 Add missing init file to liger plugin (#2670) [skip ci] 2025-05-16 13:06:46 -04:00
michelyang
c9797de6bb Add num_proc to fix data set slow processing issue (#2681) [skip ci] 2025-05-16 13:06:20 -04:00
Wing Lian
8f8a7afb05 Add ci and images for CUDA 12.8 for B200s (#2683) [skip ci]
* Add ci and images for CUDA 12.8 for B200s

* add comments explaining CI [skip e2e]
2025-05-16 13:06:08 -04:00
NanoCode012
86472715da fix: remove doc string imports in monkeypatches (#2671) [skip ci] 2025-05-16 13:05:55 -04:00
Wing Lian
c0a0c7534c Activation checkpointing with offloading to disk with prefetch (#2663)
* offload activations to disk instead of CPU RAM

* add prefetch

* Disco :dance:

* include offload_disk in e2e test for AC

* document and make sure to cleanup

* fix annotation to match docs

* fix docs build

* address PR feedback
2025-05-13 16:39:39 -04:00
Wing Lian
7fa1089cea Atropos support (#2666) [skip ci]
* allow peft+liger+grpo and custom vllm serve for atropos support

* set trainer class for RL
2025-05-13 08:30:58 -04:00
Dan Saunders
80304c26a7 SP GRPO support + batch SP fixes (#2643)
* ctx manager for SP

* updates

* update

* further simplifying

* simplifying

* simplifying

* reorg

* batch api HF adapter for ring-flash-attn; cleanup and improvements

* update

* adding all batch ring-flash-attn methods via single adapter

* fix

* fixes for batch API funcs, simplify

* fix

* grpo sp support

* progress

* stronger subclassing of TRL GRPO trainer; custom distributed sampler

* subclassing constructor

* progress

* finalizing SP + GRPO trainer

* minimize diffs to GRPO trainer

* remove (most of) the custom GRPO trainer logic

* debug

* debug

* update

* update

* update

* progress

* cleanup

* cleanup

* minor changes

* update

* update

* update

* small changes

* updates

* cleanup; torch.compile ring_flash_attn functions to prevent numerical instability; lint

* spacing

* cleanup; log in pydantic model config only on main process

* remove comment

* fix sp sampler, update to latest upstream code, doc

* add docs

* update quartodoc autodoc contents

* fix, simplifications

* fixes + simplifications

* review comments

* lint

* removing main process only logs in favor of #2608

* fixes, additional smoke test

* updates

* more tests

* update

* fix grad accum bug (sort of)

* lint, tests

* todo
2025-05-12 17:52:40 -04:00
NanoCode012
67c4ea9c7c fix: disable auto lora kernel if dropout nonzero (#2655) [skip ci]
* fix: disable auto lora kernel if dropout nonzero

* Add comment from PR feedback

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-12 16:23:53 -04:00
Wing Lian
526ddb886d guard on deleting secrets from env (#2653) [skip ci] 2025-05-12 14:18:42 -04:00
Wing Lian
f34eef546a update doc and use P2P=LOC for brittle grpo test (#2649)
* update doc and skip brittle grpo test

* fix the path to run the multigpu tests

* increase timeout, use LOC instead of NVL

* typo

* use hf cache from s3 backed cloudfront

* mark grpo as flaky test due to vllm start
2025-05-12 14:17:25 -04:00
Wing Lian
c7b6790614 Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks (#2661)
* lean mistral ft tests, remove e2e torch 2.4.1 test

* make sure to pass save_only_model for RL

* more tests to make ci leaner, add cleanup to modal ci

* fix module for import in e2e tests

* use mp spawn to prevent deadlocks with packing

* make sure cleanup shell script is executable when cloned out
2025-05-12 10:51:18 -04:00
Dan Saunders
47e0e71bc8 don't sort multipack sampler (#2657)
* don't sort multipack sampler

* increased packing efficiency increases loss

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-09 20:28:58 -04:00
Wing Lian
0f3587174d swap tinymodels that have safetensors for some ci tests (#2641) 2025-05-07 15:06:07 -04:00
xzuyn
25e6c5f9bd Add CAME Optimizer (#2385) 2025-05-07 10:31:46 -04:00
NanoCode012
32f51bca35 fix(doc): clarify instruction to delinearize llama4 similar to cli doc (#2644) [skip ci] 2025-05-07 10:29:47 -04:00
NanoCode012
9daa04da90 Fix: improve error message on failed dataset load (#2637) [skip ci]
* fix(log): clarify error on dataset loading failed

* fix: add path for easy tracking of broken config

* fix: improve error message based on pr feedback
2025-05-07 10:29:05 -04:00
215 changed files with 5452 additions and 3644 deletions

View File

@@ -31,6 +31,11 @@ jobs:
           python_version: "3.11"
           pytorch: 2.7.0
           axolotl_extras:
+        - cuda: 128
+          cuda_version: 12.8.1
+          python_version: "3.11"
+          pytorch: 2.7.0
+          axolotl_extras:
     runs-on: axolotl-gpu-runner
     steps:
       - name: Checkout
@@ -94,6 +99,11 @@ jobs:
           python_version: "3.11"
           pytorch: 2.7.0
           axolotl_extras:
+        - cuda: 128
+          cuda_version: 12.8.1
+          python_version: "3.11"
+          pytorch: 2.7.0
+          axolotl_extras:
     runs-on: axolotl-gpu-runner
     steps:
       - name: Checkout

View File

@@ -3,7 +3,7 @@ name: docker-multigpu-tests-biweekly
 on:
   pull_request:
     paths:
-      - 'tests/e2e/multigpu/*.py'
+      - 'tests/e2e/multigpu/**.py'
       - 'requirements.txt'
      - 'setup.py'
      - 'pyproject.toml'

View File

@@ -18,9 +18,96 @@
     env:
       SKIP: no-commit-to-branch
+
+  preload-cache:
+    name: Preload HF cache
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python_version: ["3.11"]
+        pytorch_version: ["2.6.0"]
+    timeout-minutes: 20
+
+    env:
+      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
+
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+
+      - name: Restore HF cache
+        id: hf-cache-restore
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python_version }}
+          cache: 'pip' # caching pip dependencies
+
+      - name: upgrade pip
+        run: |
+          pip3 install --upgrade pip
+          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
+
+      - name: Install PyTorch
+        run: |
+          pip3 install torch==${{ matrix.pytorch_version }}
+
+      - name: Install dependencies
+        run: |
+          pip3 show torch
+          pip3 install --no-build-isolation -U -e .
+          python scripts/unsloth_install.py | sh
+          python scripts/cutcrossentropy_install.py | sh
+          pip3 install -r requirements-dev.txt -r requirements-tests.txt
+
+      - name: Make sure PyTorch version wasn't clobbered
+        run: |
+          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
+
+      - name: Ensure axolotl CLI was installed
+        run: |
+          axolotl --help
+
+      - name: Pre-Download dataset fixture
+        run: |
+          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
+
+      - name: Run tests
+        run: |
+          pytest -v tests/conftest.py
+
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          files: ./coverage.xml
+          flags: unittests,pytorch-${{ matrix.pytorch_version }}
+          fail_ci_if_error: false
+
+      - name: cleanup pip cache
+        run: |
+          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
+
+      - name: Save HF cache
+        id: hf-cache
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}

   pytest:
     name: PyTest
     runs-on: ubuntu-latest
+    needs: [preload-cache]
     strategy:
       fail-fast: false
       max-parallel: 2

View File

@@ -44,96 +44,102 @@
     env:
       SKIP: no-commit-to-branch

-  preload-cache:
-    name: Preload HF cache
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python_version: ["3.11"]
-        pytorch_version: ["2.6.0"]
-    timeout-minutes: 20
-
-    env:
-      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
-
-    steps:
-      - name: Check out repository code
-        uses: actions/checkout@v4
-
-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
-
-      - name: Setup Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python_version }}
-          cache: 'pip' # caching pip dependencies
-
-      - name: upgrade pip
-        run: |
-          pip3 install --upgrade pip
-          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
-
-      - name: Install PyTorch
-        run: |
-          pip3 install torch==${{ matrix.pytorch_version }}
-
-      - name: Install dependencies
-        run: |
-          pip3 show torch
-          pip3 install --no-build-isolation -U -e .
-          python scripts/unsloth_install.py | sh
-          python scripts/cutcrossentropy_install.py | sh
-          pip3 install -r requirements-dev.txt -r requirements-tests.txt
-
-      - name: Make sure PyTorch version wasn't clobbered
-        run: |
-          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
-
-      - name: Ensure axolotl CLI was installed
-        run: |
-          axolotl --help
-
-      - name: Pre-Download dataset fixture
-        run: |
-          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
-
-      - name: Run tests
-        run: |
-          pytest -v tests/conftest.py
-
-      - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@v5
-        with:
-          token: ${{ secrets.CODECOV_TOKEN }}
-          files: ./coverage.xml
-          flags: unittests,pytorch-${{ matrix.pytorch_version }}
-          fail_ci_if_error: false
-
-      - name: cleanup pip cache
-        run: |
-          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
-
-      - name: Save HF cache
-        id: hf-cache
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
+  # preload-cache:
+  #   name: Preload HF cache
+  #   runs-on: ubuntu-latest
+  #   strategy:
+  #     fail-fast: false
+  #     matrix:
+  #       python_version: ["3.11"]
+  #       pytorch_version: ["2.6.0"]
+  #   timeout-minutes: 20
+  #
+  #   env:
+  #     AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
+  #
+  #   steps:
+  #     - name: Check out repository code
+  #       uses: actions/checkout@v4
+  #
+  #     - name: Restore HF cache
+  #       id: hf-cache-restore
+  #       uses: actions/cache/restore@v4
+  #       with:
+  #         path: |
+  #           /home/runner/.cache/huggingface/hub/datasets--*
+  #           /home/runner/.cache/huggingface/hub/models--*
+  #         key: ${{ runner.os }}-hf-hub-cache-v2
+  #
+  #     - name: Restore Cache from S3
+  #       id: hf-cache-restore-s3
+  #       run: |
+  #         mkdir -p /home/runner/.cache/huggingface/hub
+  #         curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/ --use-compress-program unzstd
+  #
+  #     - name: Setup Python
+  #       uses: actions/setup-python@v5
+  #       with:
+  #         python-version: ${{ matrix.python_version }}
+  #         cache: 'pip' # caching pip dependencies
+  #
+  #     - name: upgrade pip
+  #       run: |
+  #         pip3 install --upgrade pip
+  #         pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
+  #
+  #     - name: Install PyTorch
+  #       run: |
+  #         pip3 install torch==${{ matrix.pytorch_version }}
+  #
+  #     - name: Install dependencies
+  #       run: |
+  #         pip3 show torch
+  #         pip3 install --no-build-isolation -U -e .
+  #         python scripts/unsloth_install.py | sh
+  #         python scripts/cutcrossentropy_install.py | sh
+  #         pip3 install -r requirements-dev.txt -r requirements-tests.txt
+  #
+  #     - name: Make sure PyTorch version wasn't clobbered
+  #       run: |
+  #         python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
+  #
+  #     - name: Ensure axolotl CLI was installed
+  #       run: |
+  #         axolotl --help
+  #
+  #     - name: Pre-Download dataset fixture
+  #       run: |
+  #         huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
+  #
+  #     - name: Run tests
+  #       run: |
+  #         pytest -v tests/conftest.py
+  #
+  #     - name: Upload coverage to Codecov
+  #       uses: codecov/codecov-action@v5
+  #       with:
+  #         token: ${{ secrets.CODECOV_TOKEN }}
+  #         files: ./coverage.xml
+  #         flags: unittests,pytorch-${{ matrix.pytorch_version }}
+  #         fail_ci_if_error: false
+  #
+  #     - name: cleanup pip cache
+  #       run: |
+  #         find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
+  #
+  #     - name: Save HF cache
+  #       id: hf-cache
+  #       uses: actions/cache/save@v4
+  #       with:
+  #         path: |
+  #           /home/runner/.cache/huggingface/hub/datasets--*
+  #           /home/runner/.cache/huggingface/hub/models--*
+  #         key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}

   pytest:
     name: PyTest
     runs-on: ubuntu-latest
-    needs: [preload-cache]
+    # needs: [preload-cache]
     strategy:
       fail-fast: false
       matrix:
@@ -145,14 +151,20 @@
       - name: Check out repository code
         uses: actions/checkout@v4

-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+      # - name: Restore HF cache
+      #   id: hf-cache-restore
+      #   uses: actions/cache/restore@v4
+      #   with:
+      #     path: |
+      #       /home/runner/.cache/huggingface/hub/datasets--*
+      #       /home/runner/.cache/huggingface/hub/models--*
+      #     key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/ --use-compress-program unzstd

       - name: Setup Python
         uses: actions/setup-python@v5
@@ -210,7 +222,7 @@
   pytest-sdist:
     name: PyTest from Source Dist
     runs-on: ubuntu-latest
-    needs: [preload-cache]
+    # needs: [preload-cache]
     strategy:
       fail-fast: false
       matrix:
@@ -222,14 +234,20 @@
       - name: Check out repository code
         uses: actions/checkout@v4

-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+      # - name: Restore HF cache
+      #   id: hf-cache-restore
+      #   uses: actions/cache/restore@v4
+      #   with:
+      #     path: |
+      #       /home/runner/.cache/huggingface/hub/datasets--*
+      #       /home/runner/.cache/huggingface/hub/models--*
+      #     key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/ --use-compress-program unzstd

       - name: Setup Python
         uses: actions/setup-python@v5
@@ -277,6 +295,7 @@
         find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;

   docker-e2e-tests-1st:
+    # Run this job first as a gate for running the remainder of the test matrix
     if: ${{ ! contains(github.event.commits[0].message, '[skip e2e]') && github.repository_owner == 'axolotl-ai-cloud' }}
     # this job needs to be run on self-hosted GPU runners...
     runs-on: [self-hosted, modal]
@@ -323,6 +342,8 @@
     # this job needs to be run on self-hosted GPU runners...
     runs-on: [self-hosted, modal]
     timeout-minutes: 90
+    # Only run the remainder of the matrix if the first e2e check passed;
+    # this is to save on wasted compute costs for known failures that get caught in the first run
     needs: [pre-commit, pytest, docker-e2e-tests-1st]
     strategy:
@@ -335,12 +356,6 @@
           pytorch: 2.6.0
           num_gpus: 1
           axolotl_extras: llmcompressor
-        - cuda: 124
-          cuda_version: 12.4.1
-          python_version: "3.11"
-          pytorch: 2.4.1
-          num_gpus: 1
-          axolotl_extras:
         - cuda: 124
           cuda_version: 12.4.1
           python_version: "3.11"
@@ -353,6 +368,12 @@
           pytorch: 2.7.0
           num_gpus: 1
           axolotl_extras:
+        - cuda: 128
+          cuda_version: 12.8.1
+          python_version: "3.11"
+          pytorch: 2.7.0
+          num_gpus: 1
+          axolotl_extras:
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -377,3 +398,43 @@
       - name: Run tests job on Modal
         run: |
           modal run cicd.e2e_tests
+
+  docker-e2e-cleanup:
+    runs-on: [self-hosted, modal]
+    timeout-minutes: 90
+    needs: [docker-e2e-tests]
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: vllm
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install Modal
+        run: |
+          python -m pip install --upgrade pip
+          pip install modal==0.71.8 jinja2
+      - name: Update env vars
+        run: |
+          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
+          echo "PYTORCH_VERSION=${{ matrix.pytorch}}" >> $GITHUB_ENV
+          echo "AXOLOTL_ARGS=${{ matrix.axolotl_args}}" >> $GITHUB_ENV
+          echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}" >> $GITHUB_ENV
+          echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
+          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
+          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
+          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
+      - name: Run tests job on Modal
+        run: |
+          modal run cicd.cleanup

View File

@@ -57,8 +57,10 @@ async def handler(job):
     logger.info("Training Complete.")

     # Cleanup
-    del os.environ["WANDB_API_KEY"]
-    del os.environ["HF_TOKEN"]
+    if "WANDB_API_KEY" in os.environ:
+        del os.environ["WANDB_API_KEY"]
+    if "HF_TOKEN" in os.environ:
+        del os.environ["HF_TOKEN"]

 runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})

View File

@@ -48,8 +48,22 @@ quartodoc:
     contents:
       - core.trainers.base
       - core.trainers.trl
+      - core.trainers.mamba
+      - core.trainers.relora
       - core.trainers.dpo.trainer
       - core.trainers.grpo.trainer
+      - core.trainers.grpo.sampler
+      - core.trainers.utils
+  - title: Mixins
+    desc: Mixin classes for augmenting trainers
+    contents:
+      - core.trainers.mixins.optimizer
+      - core.trainers.mixins.rng_state_loader
+      - core.trainers.mixins.scheduler
+  - title: Context Managers
+    desc: Context managers for altering trainer behaviors
+    contents:
+      - utils.ctx_managers.sequence_parallel
   - title: Prompt Strategies
     desc: Prompt formatting strategies
     contents:
@@ -86,7 +100,7 @@
       - kernels.swiglu
       - kernels.quantize
       - kernels.utils
-  - title: MonkeyPatches
+  - title: Monkey Patches
     desc: Runtime patches for model optimizations
     contents:
       - monkeypatch.llama_attn_hijack_flash
@@ -124,7 +138,8 @@
       - utils.optimizers.adopt
       - utils.data.pretraining
       - utils.data.sft
-      - utils.gradient_checkpointing.unsloth
+      - utils.gradient_checkpointing.offload_cpu
+      - utils.gradient_checkpointing.offload_disk
   - title: Schemas
     desc: Pydantic data models for Axolotl config
     contents:

View File

@@ -18,7 +18,7 @@ pytest -v --durations=10 \
   --cov-append

 # Run patched tests excluding lora kernels with coverage append
-pytest -v --durations=10 \
+pytest --full-trace -vvv --durations=10 \
   --ignore=tests/e2e/patched/lora_kernels \
   /workspace/axolotl/tests/e2e/patched \
   --cov=axolotl \

cicd/cleanup.py Normal file
View File

@@ -0,0 +1,19 @@
"""Modal app to run axolotl GPU cleanup"""
from .single_gpu import VOLUME_CONFIG, app, cicd_image, run_cmd
@app.function(
image=cicd_image,
timeout=60 * 60,
cpu=8.0,
memory=131072,
volumes=VOLUME_CONFIG,
)
def cleanup():
run_cmd("./cicd/cleanup.sh", "/workspace/axolotl")
@app.local_entrypoint()
def main():
cleanup.remote()

cicd/cleanup.sh Executable file
View File

@@ -0,0 +1,6 @@
#!/bin/bash
set -e
# cleanup old cache files for datasets processing and intermediate mappings
find /workspace/data/huggingface-cache/hub/datasets -name "cache-*" -type f -mtime +1 -exec rm {} \;
find /workspace/data/huggingface-cache/hub/datasets -name "*.lock" -type f -mtime +1 -exec rm {} \;

View File

@@ -1,75 +1,12 @@
 """Modal app to run axolotl GPU tests"""
-# pylint: disable=duplicate-code
-
-import os
-import pathlib
-import tempfile
-
-import jinja2
-import modal
-from jinja2 import select_autoescape
-from modal import App, Image
-
-cicd_path = pathlib.Path(__file__).parent.resolve()
-
-template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
-template_env = jinja2.Environment(
-    loader=template_loader, autoescape=select_autoescape()
-)
-df_template = template_env.get_template("Dockerfile.jinja")
-
-df_args = {
-    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
-    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
-    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
-    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
-    "CUDA": os.environ.get("CUDA", "121"),
-    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
-    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
-    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
-    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
-    "HF_HOME": "/workspace/data/huggingface-cache/hub",
-}
-
-dockerfile_contents = df_template.render(**df_args)
-
-temp_dir = tempfile.mkdtemp()
-with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
-    f.write(dockerfile_contents)
-
-cicd_image = Image.from_dockerfile(
-    pathlib.Path(temp_dir) / "Dockerfile",
-    context_mount=None,
-    force_build=True,
-    gpu="A10G",
-).env(df_args)
-
-app = App("Axolotl CI/CD", secrets=[])
-
-hf_cache_volume = modal.Volume.from_name(
-    "axolotl-ci-hf-hub-cache", create_if_missing=True
-)
-VOLUME_CONFIG = {
-    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
-}
-
-N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
-
-
-def run_cmd(cmd: str, run_folder: str):
-    import subprocess  # nosec
-
-    # Propagate errors from subprocess.
-    if exit_code := subprocess.call(cmd.split(), cwd=run_folder):  # nosec
-        exit(exit_code)  # pylint: disable=consider-using-sys-exit
+from .single_gpu import GPU_CONFIG, VOLUME_CONFIG, app, cicd_image, run_cmd


 @app.function(
     image=cicd_image,
     gpu=GPU_CONFIG,
-    timeout=60 * 60,
+    timeout=90 * 60,  # 90 min
     cpu=8.0,
     memory=131072,
     volumes=VOLUME_CONFIG,

View File

@@ -70,7 +70,7 @@ def run_cmd(cmd: str, run_folder: str):
 @app.function(
     image=cicd_image,
     gpu=GPU_CONFIG,
     timeout=90 * 60,
-    cpu=8.0,
+    cpu=16.0,
     memory=131072 * N_GPUS,
     volumes=VOLUME_CONFIG,
 )

cicd/single_gpu.py Normal file
View File

@@ -0,0 +1,66 @@
"""Modal app to run axolotl GPU tests"""
# pylint: disable=duplicate-code
import os
import pathlib
import tempfile
import jinja2
import modal
from jinja2 import select_autoescape
from modal import App, Image
cicd_path = pathlib.Path(__file__).parent.resolve()
template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
template_env = jinja2.Environment(
loader=template_loader, autoescape=select_autoescape()
)
df_template = template_env.get_template("Dockerfile.jinja")
df_args = {
"AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
"AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
"PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
"BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
"CUDA": os.environ.get("CUDA", "121"),
"GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
"GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
"NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
"CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
"HF_HOME": "/workspace/data/huggingface-cache/hub",
}
dockerfile_contents = df_template.render(**df_args)
temp_dir = tempfile.mkdtemp()
with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
f.write(dockerfile_contents)
cicd_image = Image.from_dockerfile(
pathlib.Path(temp_dir) / "Dockerfile",
context_mount=None,
force_build=True,
gpu="A10G",
).env(df_args)
app = App("Axolotl CI/CD", secrets=[])
hf_cache_volume = modal.Volume.from_name(
"axolotl-ci-hf-hub-cache", create_if_missing=True
)
VOLUME_CONFIG = {
"/workspace/data/huggingface-cache/hub": hf_cache_volume,
}
N_GPUS = int(os.environ.get("N_GPUS", 1))
GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
def run_cmd(cmd: str, run_folder: str):
import subprocess # nosec
# Propagate errors from subprocess.
if exit_code := subprocess.call(cmd.split(), cwd=run_folder): # nosec
exit(exit_code) # pylint: disable=consider-using-sys-exit

View File

@@ -19,7 +19,7 @@ coverage:
       if_no_uploads: error
       if_not_found: success
       if_ci_failed: error
-      only_pulls: false
+      only_pulls: true
     flags: null
     paths: null
   patch:

View File

@@ -505,6 +505,7 @@ save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of eac
 save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
 saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
 save_total_limit: # Checkpoints saved at a time
+save_only_model: # Save only the model weights, skipping the optimizer. Using this means you can't resume from checkpoints.
 # Maximum number of iterations to train for. It precedes num_epochs which means that
 # if both are set, num_epochs will not be guaranteed.
 # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
@@ -538,7 +539,7 @@ train_on_inputs: false
 # Note that training loss may have an oscillating pattern with this enabled.
 group_by_length: false

-# Whether to use gradient checkpointing. Available options are: true, false, "offload".
+# Whether to use gradient checkpointing. Available options are: true, false, "offload", "offload_disk".
 # https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
 gradient_checkpointing: false
 # additional kwargs to pass to the trainer for gradient checkpointing
@@ -612,6 +613,7 @@ lr_div_factor: # Learning rate div factor
 # - optimi_adamw
 # - ao_adamw_8bit
 # - ao_adamw_fp8
+# - came_pytorch
 optimizer:
 # Dictionary of arguments to pass to the optimizer
 optim_args:
@@ -631,7 +633,9 @@ weight_decay:
 # adamw hyperparams
 adam_beta1:
 adam_beta2:
+adam_beta3: # only used for CAME Optimizer
 adam_epsilon:
+adam_epsilon2: # only used for CAME Optimizer
 # Gradient clipping max norm
 max_grad_norm:
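Read together, the options these hunks document combine in a config along these lines (a sketch; the hyperparameter values shown are illustrative assumptions, not recommendations taken from the diff):

```yaml
save_only_model: true                # drops optimizer state, so runs can't resume from checkpoints
gradient_checkpointing: "offload_disk"

optimizer: came_pytorch
adam_beta1: 0.9
adam_beta2: 0.999
adam_beta3: 0.9999                   # CAME-only
adam_epsilon: 1e-30
adam_epsilon2: 1e-16                 # CAME-only
```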

View File

@@ -8,6 +8,10 @@ format:

 This section describes the different Docker images that are released by AxolotlAI at [Docker Hub](https://hub.docker.com/u/axolotlai).

+::: {.callout-important}
+For Blackwell GPUs, please use the tags with Pytorch 2.7.0 and CUDA 12.8.
+:::
+
 ## Base

 The base image is the most minimal image that can install Axolotl. It is based on the `nvidia/cuda` image. It includes python, torch, git, git-lfs, awscli, pydantic, and more.

View File

@@ -104,7 +104,7 @@ the `alpaca` dataset format, which has the following format:
 Please see our [Dataset Formats](dataset-formats) for more dataset formats and how to
 format them.

-2. Prepare your JSONL data in the specified format (in this case, the expected `alpaca
+2. Prepare your JSONL data in the specified format (in this case, the expected `alpaca`
 format):

 ```json
@@ -120,6 +120,12 @@ axolotl train my_training.yml

 ## Common Tasks {#sec-common-tasks}

+::: {.callout-tip}
+The same yaml file is used for training, inference, and merging.
+:::
+
 ### Testing Your Model {#sec-testing}

 After training, test your model:
@@ -128,6 +134,16 @@ After training, test your model:
 axolotl inference my_training.yml --lora-model-dir="./outputs/lora-out"
 ```

+More details can be found in [Inference](inference.qmd).
+
+### Using a UI {#sec-ui}
+
+Launch a Gradio interface:
+
+```bash
+axolotl inference my_training.yml --lora-model-dir="./outputs/lora-out" --gradio
+```
+
 ### Preprocessing Data {#sec-preprocessing}

 For large datasets, preprocess first:
@@ -136,14 +152,22 @@ For large datasets, preprocess first:
 axolotl preprocess my_training.yml
 ```

-### Using a UI {#sec-ui}
-
-Launch a Gradio interface:
+Please make sure to set `dataset_prepared_path: ` in your config to set the path to save the prepared dataset.
+
+More details can be found in [Dataset Preprocessing](dataset_preprocessing.qmd).
+
+### Merging LoRA weights {#sec-merging-lora}
+
+To merge the LoRA weights back into the base model, run:

 ```bash
-axolotl inference my_training.yml --lora-model-dir="./outputs/lora-out" --gradio
+axolotl merge-lora my_training.yml --lora-model-dir="./outputs/lora-out"
 ```
+
+The merged model will be saved in the `{output_dir}/merged` directory.
+
+More details can be found in [Merging LoRA weights](inference.qmd#sec-merging).

 ## Next Steps {#sec-next-steps}

 Now that you have the basics, you might want to:
@@ -156,6 +180,7 @@ Now that you have the basics, you might want to:
 Check our other guides for details on these topics:

 - [Configuration Guide](config.qmd) - Full configuration options
+- [Dataset Loading](dataset-loading.qmd) - Loading datasets from various sources
 - [Dataset Formats](dataset-formats) - Working with different data formats
 - [Multi-GPU Training](multi-gpu.qmd)
 - [Multi-Node Training](multi-node.qmd)
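To make the preprocessing note above concrete, the referenced setting is a single key (the path shown is illustrative; the devstral example config further down uses `last_run_prepared`):

```yaml
dataset_prepared_path: ./last_run_prepared
```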

View File

@@ -25,6 +25,10 @@ Please make sure to have Pytorch installed before installing Axolotl in your loc
 Follow the instructions at: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
 :::

+::: {.callout-important}
+For Blackwell GPUs, please use Pytorch 2.7.0 and CUDA 12.8.
+:::
+
 ### PyPI Installation (Recommended) {#sec-pypi}

 ```{.bash}
@@ -72,6 +76,10 @@ docker run --privileged --gpus '"all"' --shm-size 10g --rm -it \
 ```
 :::

+::: {.callout-important}
+For Blackwell GPUs, please use `axolotlai/axolotl:main-py3.11-cu128-2.7.0` or the cloud variant `axolotlai/axolotl-cloud:main-py3.11-cu128-2.7.0`.
+:::
+
 Please refer to the [Docker documentation](docker.qmd) for more information on the different Docker images that are available.

 ## Cloud Environments {#sec-cloud}

View File

@@ -87,20 +87,7 @@ We support sequence parallelism (SP) via the
 allows one to split up sequences across GPUs, which is useful in the event that a
 single sequence causes OOM errors during model training.

-First, install `ring-flash-attn`, recommended via `pip install axolotl[ring-flash-attn]`,
-or from source with `pip install .[ring-flash-attn]`.
-
-Your Axolotl YAML config should contain the following lines:
-
-```{.yaml}
-sequence_parallel_degree: 4 # Split each sequence into 4 parts, one per GPU
-flash_attention: true # Required with sequence parallelism
-
-# Optional; strides across the key dimension. Larger values use more memory but will make training faster.
-heads_k_stride: 1
-```
-
-See our [dedicated guide](sequence_parallelism.qmd) for more details.
+See our [dedicated guide](sequence_parallelism.qmd) for more information.

 ### FSDP + QLoRA {#sec-fsdp-qlora}

View File

@@ -3,8 +3,6 @@ title: Sequence Parallelism
 description: Train with long sequences split across multiple GPUs.
 ---

-# Sequence Parallelism
-
 Sequence parallelism is a technique that splits sequences across multiple GPUs,
 allowing you to train with very long sequences that wouldn't fit on a single GPU. Each
 GPU processes a different portion of the sequence, and the results are aggregated
@@ -27,7 +25,7 @@ To enable sequence parallelism, add the following to your configuration file:
 sequence_parallel_degree: 4 # Split sequences across 4 GPUs
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 heads_k_stride: 1
-# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
+# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
 # "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
 ring_attn_func:
 ```
@@ -43,7 +41,7 @@ When sequence parallelism is enabled:
 1. Each sequence is divided into equal chunks across the GPUs in a sequence parallel group
 2. The data collator handles the chunking of input_ids, attention_mask, labels, and position_ids
-3. Position IDs are adjusted to maintain proper relative positions, especially for packed sequences
+3. Position IDs are adjusted to maintain proper relative positions
 4. The trainer uses special ring communication patterns for attention operations

 ## Requirements
@@ -69,9 +67,11 @@
 ...

 sequence_parallel_degree: 4 # Split each sequence into 4 parts, one per GPU
-flash_attention: true # Required with sequence parallelism
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 heads_k_stride: 1
+# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
+# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
+ring_attn_func:
 ...
 ```
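Following the defaults spelled out in the comments above, a run without sample packing falls back to `batch_ring`; pinning it explicitly makes the choice visible (a sketch; degree and stride values are illustrative):

```yaml
sample_packing: false
sequence_parallel_degree: 4
heads_k_stride: 1
ring_attn_func: batch_ring   # also the default when sample_packing is false
```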

View File

@@ -59,7 +59,9 @@ gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
+sdp_attention:
+flash_optimum:
 gptq_groupsize:
 gptq_model_v1:

View File

@@ -39,7 +39,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10
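The example-config migrations above and below all follow one pattern: the single `attention:` enum is replaced by per-backend flags, of which at most one is enabled. A sketch of the resulting flag surface, using only names that appear in these diffs (the per-flag glosses are assumptions, not taken from the diffs):

```yaml
# Enable exactly one backend; leave the others unset.
flash_attention: true   # was `attention: flash`
xformers_attention:     # was `attention: xformers`
sdp_attention:          # was `attention: sdpa`
eager_attention:        # was `attention: eager`
flash_optimum:          # optimum/BetterTransformer path (assumed)
```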

View File

@@ -45,8 +45,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -45,8 +45,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -45,8 +45,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -49,8 +49,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:

View File

@@ -112,7 +112,9 @@
     "early_stopping_patience:\n",
     "resume_from_checkpoint:\n",
     "logging_steps: 1\n",
-    "attention: sdpa\n",
+    "xformers_attention:\n",
+    "flash_attention: false\n",
+    "sdp_attention: true\n",
     "\n",
     "warmup_steps: 1\n",
     "max_steps: 25\n",

View File

@@ -52,8 +52,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch:

View File

@@ -55,8 +55,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch:

View File

@@ -39,8 +39,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch:

View File

@@ -35,8 +35,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 2

View File

@@ -59,8 +59,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 2

View File

@@ -43,7 +43,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 40

View File

@@ -73,7 +73,8 @@ early_stopping_patience: 3
 resume_from_checkpoint:
 auto_resume_from_checkpoints: true
 logging_steps: 1
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10

View File

@@ -40,7 +40,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 40

View File

@@ -47,8 +47,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch: 4

View File

@@ -53,8 +53,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:

View File

@@ -43,8 +43,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:

View File

@@ -57,8 +57,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_ratio: 0.1
 evals_per_epoch:

View File

@@ -51,7 +51,8 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
   use_reentrant: false
 logging_steps: 1
-attention: flash
+flash_attention: true
+eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1

View File

@@ -53,7 +53,8 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
   use_reentrant: false
 logging_steps: 1
-attention: flash
+flash_attention: true
+eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1

View File

@@ -36,7 +36,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10

View File

@@ -47,8 +47,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch:

View File

@@ -46,8 +46,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch:

View File

@@ -45,8 +45,7 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
   use_reentrant: true
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 1

View File

@@ -37,7 +37,8 @@ bf16: auto
 tf32: true
 resume_from_checkpoint:
 logging_steps: 5
-attention: xformers
+xformers_attention: true
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20

View File

@@ -42,8 +42,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
 flash_attn_fuse_qkv: false

View File

@@ -53,7 +53,9 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention:
+sdp_attention:
+flash_optimum:
 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1

View File

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
 flash_attn_fuse_qkv: false

View File

@@ -45,8 +45,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -45,8 +45,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -48,8 +48,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -48,8 +48,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -50,7 +50,8 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-attention: flash
+flash_attention: true
+eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1

View File

@@ -49,8 +49,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 2

View File

@@ -34,8 +34,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 2

View File

@@ -61,8 +61,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -56,8 +56,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -77,8 +77,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -53,8 +53,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -54,8 +54,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

View File

@@ -48,8 +48,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

View File

@@ -55,8 +55,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -48,8 +48,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

View File

@@ -49,8 +49,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

View File

@@ -53,8 +53,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 20
 evals_per_epoch: 4

View File

@@ -51,8 +51,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

View File

@@ -39,8 +39,7 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
   use_reentrant: true
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

@@ -48,8 +48,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

@@ -46,8 +46,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

@@ -34,3 +34,5 @@ We provide a script to delinearize Llama 4 linearized models into regular Huggin
 ```bash
 axolotl delinearize-llama4 --model path/to/model_dir --output path/to/output_dir
 ```
+
+Note: This only works with the non-quantized linearized model. If you have an adapter, merge it with the *non-quantized linearized* model before delinearizing.
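The added note prescribes an order: merge first, then delinearize. A minimal sketch of that sequence, assuming a LoRA adapter trained against the linearized model; the `merge-lora` step and all paths here are illustrative, not taken from this diff:

```bash
# Illustrative sequence only; paths are placeholders.
# 1. Merge the adapter into the NON-QUANTIZED linearized base model.
axolotl merge-lora config.yml --lora-model-dir ./path/to/adapter
# 2. Only then delinearize the merged weights into a regular Hugging Face model.
axolotl delinearize-llama4 --model ./path/to/merged_dir --output ./path/to/output_dir
```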

@@ -46,7 +46,8 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-attention: flash
+flash_attention: true
+eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1

@@ -39,7 +39,7 @@ tf32: true
 gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-attention: eager
+flash_attention:
 warmup_steps: 10
 evals_per_epoch: 4

@@ -42,8 +42,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 save_total_limit: 1
 save_steps:

@@ -36,8 +36,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 10
 evals_per_epoch: 4

@@ -0,0 +1,48 @@
+base_model: mistralai/Devstral-Small-2505
+processor_type: AutoProcessor
+# these 3 lines are needed for now to handle vision chat templates w/ images
+skip_prepare_dataset: true
+remove_unused_columns: false
+sample_packing: false
+chat_template: mistral_v7_tekken
+datasets:
+  - path: HuggingFaceH4/llava-instruct-mix-vsft
+    type: chat_template
+    split: train[:1%]
+    field_messages: messages
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./outputs/out
+sequence_len: 2048
+pad_to_sequence_len: false
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+bf16: auto
+fp16:
+tf32: false
+gradient_checkpointing: true
+logging_steps: 1
+flash_attention: false
+eager_attention:
+warmup_ratio: 0.1
+evals_per_epoch: 1
+saves_per_epoch: 1
+weight_decay: 0.0
+special_tokens:
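Once saved, the draft config above runs like any other axolotl YAML; a minimal smoke-test invocation (the file name is a placeholder, not from this changeset):

```bash
# Assumes the config above was saved as devstral-vision-sft.yml (placeholder name).
axolotl train devstral-vision-sft.yml
```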

@@ -53,7 +53,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: sdpa
+flash_attention: false
+sdp_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -54,8 +54,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -71,7 +71,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: eager
+flash_attention: false
 warmup_steps: 10
 evals_per_epoch: 4

@@ -51,8 +51,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -59,8 +59,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -48,7 +48,9 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-attention: eager # PixtralVisionModel does not support Flash Attention 2.0 yet.
+flash_attention: false # PixtralVisionModel does not support Flash Attention 2.0 yet.
+eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1

@@ -49,8 +49,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -51,8 +51,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -69,8 +69,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -40,8 +40,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 save_total_limit: 1
 save_steps:

@@ -54,8 +54,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3

@@ -39,7 +39,7 @@ bf16: auto
 tf32: true
 resume_from_checkpoint:
 logging_steps: 5
-attention: eager
+flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20

@@ -39,8 +39,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20

@@ -47,8 +47,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20

@@ -40,8 +40,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20

@@ -48,8 +48,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 4

@@ -51,8 +51,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 4

@@ -48,8 +48,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 4

@@ -49,8 +49,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-attention: flash
+flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 4

@@ -44,8 +44,7 @@ gradient_checkpointing_kwargs:
   use_reentrant: True
 early_stopping_patience: 3
 logging_steps: 1
-attention: flash
+flash_attention: true
 eval_steps: 1000
 save_steps: 5000

Some files were not shown because too many files have changed in this diff.