axolotl/.github/workflows/multi-gpu-e2e.yml at 798c8fba897ceb2e238dfaac9bb1731d2f1cacab

NanoCode012 9de5b76336 feat: move to uv first (#3545)

* feat: move to uv first

* fix: update doc to uv first

* fix: merge dev/tests into uv pyproject

* fix: update docker docs to match current config

* fix: migrate examples to readme

* fix: add llmcompressor to conflict

* feat: rec uv sync with lockfile for dev/ci

* fix: update docker docs to clarify how to use uv images

* chore: docs

* fix: use system python, no venv

* fix: set backend cpu

* fix: only set for installing pytorch step

* fix: remove unsloth kernel and installs

* fix: remove U in tests

* fix: set backend in deps too

* chore: test

* chore: comments

* fix: attempt to lock torch

* fix: workaround torch cuda and not upgraded

* fix: forgot to push

* fix: missed source

* fix: nightly upstream loralinear config

* fix: nightly phi3 long rope not work

* fix: forgot commit

* fix: test phi3 template change

* fix: no more requirements

* fix: carry over changes from new requirements to pyproject

* chore: remove lockfile per discussion

* fix: set match-runtime

* fix: remove unneeded hf hub buildtime

* fix: duplicate cache delete on nightly

* fix: torchvision being overridden

* fix: migrate to uv images

* fix: leftover from merge

* fix: simplify base readme

* fix: update assertion message to be clearer

* chore: docs

* fix: change fallback for cicd script

* fix: match against main exactly

* fix: peft 0.19.1 change

* fix: e2e test

* fix: ci

* fix: e2e test

2026-04-21 10:16:03 -04:00

81 lines

2.8 KiB

YAML

 name: docker-multigpu-tests-biweekly
 on:
   pull_request:
     paths:
       - "tests/e2e/multigpu/**.py"
       - "pyproject.toml"
       - ".github/workflows/multi-gpu-e2e.yml"
       - "scripts/cutcrossentropy_install.py"
       - "src/axolotl/core/trainers/mixins/sequence_parallel.py"
       - "src/axolotl/utils/distributed.py"
   workflow_dispatch:
   schedule:
     - cron: "0 0 * * 1,4" # Runs at 00:00 UTC every Monday and Thursday
 # Cancel jobs on the same ref if a new one is triggered
 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
 permissions:
   contents: read
 env:
   MODAL_IMAGE_BUILDER_VERSION: "2025.06"
 jobs:
   test-axolotl-multigpu:
     if: ${{ ! contains(github.event.commits[0].message, '[skip e2e]') && github.repository_owner == 'axolotl-ai-cloud' && (github.event_name != 'pull_request' || !github.event.pull_request.draft) }}
     strategy:
       fail-fast: false
       matrix:
         include:
           #          - cuda: 129
           #            cuda_version: 12.9.1
           #            python_version: "3.12"
           #            pytorch: 2.9.1
           #            axolotl_extras: "fbgemm-gpu"
           #            num_gpus: 2
           #            dockerfile: "Dockerfile-uv.jinja"
           - cuda: 130
             cuda_version: 13.0.0
             python_version: "3.11"
             pytorch: 2.9.1
             axolotl_extras:
             #            axolotl_extras: fbgemm-gpu
             num_gpus: 2
           - cuda: 128
             cuda_version: 12.8.1
             python_version: "3.11"
             pytorch: 2.10.0
             axolotl_extras: "fbgemm-gpu"
             num_gpus: 2
     runs-on: [self-hosted, modal]
     timeout-minutes: 120
     steps:
       - name: Checkout
         uses: actions/checkout@v4
       - name: Install Python
         uses: actions/setup-python@v5
         with:
           python-version: "3.11"
       - name: Install Modal
         run: |
           python -m pip install --upgrade pip
           pip install modal==1.3.0.post1 jinja2
       - name: Update env vars
         run: |
           # matrix.axolotl_args is not defined in the matrix above, so AXOLOTL_ARGS is exported empty
           echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> "$GITHUB_ENV"
           echo "PYTORCH_VERSION=${{ matrix.pytorch }}" >> "$GITHUB_ENV"
           echo "AXOLOTL_ARGS=${{ matrix.axolotl_args }}" >> "$GITHUB_ENV"
           echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras }}" >> "$GITHUB_ENV"
           echo "CUDA=${{ matrix.cuda }}" >> "$GITHUB_ENV"
           echo "N_GPUS=${{ matrix.num_gpus }}" >> "$GITHUB_ENV"
           echo "E2E_DOCKERFILE=${{ matrix.dockerfile || 'Dockerfile-uv.jinja' }}" >> "$GITHUB_ENV"
       - name: Run tests job on Modal
         env:
           CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
         run: |
           modal run -m cicd.multigpu
