tag v0.12.1

fix ray train and add fsdp2 smoke test for ray trainer (#3053)
* add fsdp2 smoke test for ray trainer
* fix raytrain with fsdp2
2025-08-11 09:37:40 -04:00 · 2025-08-11 09:36:10 -04:00 · 2025-08-11 09:36:01 -04:00
14 changed files with 109 additions and 58 deletions
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -98,12 +98,6 @@ jobs:
            python_version: "3.11"
            pytorch: 2.7.1
            axolotl_extras:
-            is_latest:
-          - cuda: 126
-            cuda_version: 12.6.3
-            python_version: "3.11"
-            pytorch: 2.7.1
-            axolotl_extras: vllm
            is_latest: true
          - cuda: 128
            cuda_version: 12.8.1
@@ -157,18 +151,6 @@ jobs:
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras:
-          - cuda: 126
-            cuda_version: 12.6.3
-            python_version: "3.11"
-            pytorch: 2.7.1
-            axolotl_extras:
-            is_latest:
-          - cuda: 126
-            cuda_version: 12.6.3
-            python_version: "3.11"
-            pytorch: 2.7.1
-            axolotl_extras: vllm
-            is_latest: true
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -105,8 +105,7 @@ jobs:

      - name: Run tests
        run: |
-          pytest -v --durations=10 -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ --ignore=tests/monkeypatch/ tests/ --cov=axolotl --cov-report=xml
-          pytest -v --durations=10 tests/monkeypatch/ --cov=axolotl --cov-append --cov-report=xml
+          pytest -v --durations=10 -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/ --cov=axolotl --cov-report=xml
          pytest -v --durations=10 tests/patched/ --cov=axolotl --cov-append --cov-report=xml
          pytest -v --durations=10 tests/cli/ --cov=axolotl --cov-append --cov-report=xml

@@ -180,8 +179,8 @@ jobs:

      - name: Run tests
        run: |
-          pytest -v --durations=10 -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ --ignore=tests/monkeypatch/ tests/ --cov=axolotl --cov-report=xml
-          pytest -v --durations=10 tests/monkeypatch/ --cov=axolotl --cov-append --cov-report=xml
+          pytest -v --durations=10 -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
+          pytest -v --durations=10 tests/patched/
          pytest -v --durations=10 tests/cli/

      - name: cleanup pip cache
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,7 +3,7 @@ default_language_version:

 repos:
 -   repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v6.0.0
+    rev: v5.0.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
@@ -23,7 +23,7 @@ repos:
    hooks:
    - id: flake8
 -   repo: https://github.com/pylint-dev/pylint
-    rev: v3.3.8
+    rev: v3.3.7
    hooks:
    - id: pylint
 -   repo: https://github.com/pre-commit/mirrors-mypy
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -1,10 +0,0 @@
-cff-version: 1.2.0
-type: software
-title: "Axolotl: Post-Training for AI Models"
-message: "If you use this software, please cite it as below."
-authors:
-  - name: "Axolotl maintainers and contributors"
-repository-code: "https://github.com/axolotl-ai-cloud/axolotl"
-url: "https://axolotl.ai/"
-license: Apache-2.0
-date-released: "2023-05-30"
--- a/README.md
+++ b/README.md
@@ -149,20 +149,6 @@ Contributions are welcome! Please see our [Contributing Guide](https://github.co

 Interested in sponsoring? Contact us at [wing@axolotl.ai](mailto:wing@axolotl.ai)

-## 📝 Citing Axolotl
-
-If you use Axolotl in your research or projects, please cite it as follows:
-
-```bibtex
-@software{axolotl,
-  title = {Axolotl: Post-Training for AI Models},
-  author = {{Axolotl maintainers and contributors}},
-  url = {https://github.com/axolotl-ai-cloud/axolotl},
-  license = {Apache-2.0},
-  year = {2023}
-}
-```
-
 ## 📜 License

 This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
--- a/requirements.txt
+++ b/requirements.txt
@@ -14,7 +14,7 @@ packaging==23.2

 huggingface_hub>=0.33.0
 peft==0.17.0
-transformers @ git+https://github.com/vasqu/transformers@fix-fa-integration
+transformers==4.55.0
 tokenizers>=0.21.1
 accelerate==1.10.0
 datasets==4.0.0
--- a/src/axolotl/__init__.py
+++ b/src/axolotl/__init__.py
@@ -4,4 +4,4 @@ import pkgutil

 __path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package

-__version__ = "0.13.0.dev"
+__version__ = "0.12.1"
--- a/src/axolotl/cli/config.py
+++ b/src/axolotl/cli/config.py
@@ -153,14 +153,15 @@ def prepare_plugins(cfg: DictDefault):
        plugin_manager = PluginManager.get_instance()
        for plugin_name in cfg["plugins"]:
            plugin_manager.register(plugin_name)
-        for plugin in plugin_manager.plugins.values():
-            plugin.register(cfg)


 def plugin_set_cfg(cfg: DictDefault):
    if cfg.get("plugins"):
        plugin_manager = PluginManager.get_instance()
        plugin_manager.cfg = cfg
+        # now that we have the finalized cfg, register the plugins individually
+        for plugin in plugin_manager.plugins.values():
+            plugin.register(cfg)


 def load_cfg(
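Note on the hunk above: `plugin.register(cfg)` moves out of `prepare_plugins` and into `plugin_set_cfg`, so each plugin is registered against the finalized config instead of the raw one. The ordering can be sketched as follows. `ToyPlugin`, `ToyManager`, `normalize`, and the `micro_batch_size` default are hypothetical stand-ins chosen for illustration; they are not Axolotl's actual classes or validation logic.

```python
class ToyPlugin:
    """Hypothetical plugin that records the config it was registered with."""

    def __init__(self):
        self.seen = None

    def register(self, cfg):
        self.seen = dict(cfg)


class ToyManager:
    """Hypothetical stand-in for a plugin manager."""

    def __init__(self):
        self.plugins = {}
        self.cfg = None

    def add(self, name):
        self.plugins[name] = ToyPlugin()


def prepare_plugins(manager, cfg):
    # Step 1: only instantiate plugins; the config may still change.
    for name in cfg["plugins"]:
        manager.add(name)


def normalize(cfg):
    # Hypothetical validation step that fills in a default value.
    out = dict(cfg)
    out.setdefault("micro_batch_size", 1)
    return out


def plugin_set_cfg(manager, cfg):
    # Step 2: hand the finalized config to every plugin.
    manager.cfg = cfg
    for plugin in manager.plugins.values():
        plugin.register(cfg)


raw_cfg = {"plugins": ["toy"]}
manager = ToyManager()
prepare_plugins(manager, raw_cfg)
final_cfg = normalize(raw_cfg)
plugin_set_cfg(manager, final_cfg)
print(manager.plugins["toy"].seen)
```

In this sketch the plugin sees the defaulted `micro_batch_size` key, which it would have missed had `register` run inside `prepare_plugins`.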
--- a/src/axolotl/integrations/base.py
+++ b/src/axolotl/integrations/base.py
@@ -76,8 +76,8 @@ class BasePlugin:
    def __init__(self):
        """Initializes the BasePlugin."""

-    def register(self, cfg: dict):  # pylint: disable=unused-argument
-        """Registers the plugin with the given configuration as an unparsed dict.
+    def register(self, cfg: DictDefault):  # pylint: disable=unused-argument
+        """Registers the plugin with the given configuration.

        Args:
            cfg: The configuration for the plugin.
--- a/src/axolotl/loaders/patch_manager.py
+++ b/src/axolotl/loaders/patch_manager.py
@@ -73,6 +73,9 @@ class PatchManager:
        self._apply_voxtral_patches()

    def _apply_transformers_patches(self):
+        from axolotl.monkeypatch.transformers.modeling_flash_attention_utils import (
+            patch_prepare_from_posids,
+        )
        from axolotl.monkeypatch.transformers.trainer_loss_calc import (
            patch_evaluation_loop,
            patch_maybe_log_save_evaluate,
@@ -84,6 +87,7 @@ class PatchManager:
            and self.cfg.fsdp_version == 2
        )

+        patch_prepare_from_posids()
        patch_evaluation_loop(patch_fsdp2)
        patch_maybe_log_save_evaluate()

--- a/src/axolotl/monkeypatch/ring_attn/patch.py
+++ b/src/axolotl/monkeypatch/ring_attn/patch.py
@@ -18,7 +18,9 @@ from torch.distributed import DeviceMesh
 try:
    from transformers.modeling_flash_attention_utils import _flash_supports_window
 except ImportError:
-    _flash_supports_window = True
+    from transformers.modeling_flash_attention_utils import (
+        _flash_supports_window_size as _flash_supports_window,
+    )

 from axolotl.monkeypatch.utils import get_cu_seqlens_from_pos_ids
 from axolotl.utils.logging import get_logger
--- a/src/axolotl/monkeypatch/transformers/modeling_flash_attention_utils.py
+++ b/src/axolotl/monkeypatch/transformers/modeling_flash_attention_utils.py
@@ -0,0 +1,87 @@
+"""
+Monkey patch to fix transformers.modeling_flash_attention_utils.
+
+see https://github.com/huggingface/transformers/pull/39653/files
+"""
+
+import sys
+
+import torch
+
+
+def _prepare_from_posids(query, key, value, position_ids):
+    """
+    This function returns necessary arguments to call `flash_attn_varlen_func`.
+    All three query, key, value states will be flattened.
+    Cumulative lengths of each example in the batch will be extracted from position_ids.
+    NOTE: ideally cumulative lengths should be prepared at the data collator stage
+    Arguments:
+        query (`torch.Tensor`):
+            Query state with padding. Shape: (batch_size, query_length, num_heads, head_dim).
+        key (`torch.Tensor`):
+            Key state with padding. Shape: (batch_size, kv_seq_len, num_key_value_heads, head_dim).
+        value (`torch.Tensor`):
+            Value state with padding. Shape: (batch_size, kv_seq_len, num_key_value_heads, head_dim).
+        position_ids (`torch.Tensor`):
+            Int tensor of shape (batch_size, sequence_length) holding per-token position ids; a value of 0 marks the start of each packed sequence.
+    Return:
+        query (`torch.Tensor`):
+            Query state without padding. Shape: (total_target_length, num_heads, head_dim).
+        key (`torch.Tensor`):
+            Key state with padding. Shape: (total_source_length, num_key_value_heads, head_dim).
+        value (`torch.Tensor`):
+            Value state with padding. Shape: (total_source_length, num_key_value_heads, head_dim).
+        indices_q (`torch.Tensor`):
+            The indices of non-masked tokens from the flattened input target sequence.
+        (cu_seqlens_q, cu_seqlens_k) (`tuple[int]`):
+            The cumulative sequence lengths for the target (query) and source (key, value), used to index into ragged (unpadded) tensors. `cu_seqlens` shape is (batch_size + 1,).
+        (max_seqlen_in_batch_q, max_seqlen_in_batch_k) (`tuple[int]`):
+            Maximum sequence length in batch (`max_seqlen_in_batch_q` for the target sequence i.e. query, `max_seqlen_in_batch_k` for the source sequence i.e. key/value).
+    """
+    query = query.contiguous().view(-1, query.size(-2), query.size(-1))
+    key = key.contiguous().view(-1, key.size(-2), key.size(-1))
+    value = value.contiguous().view(-1, value.size(-2), value.size(-1))
+
+    position_ids = position_ids.flatten()
+    indices_q = torch.arange(
+        position_ids.size(0), device=position_ids.device, dtype=torch.int32
+    )
+
+    cu_seq_lens = torch.cat(
+        (
+            indices_q[position_ids == 0],
+            torch.tensor(
+                position_ids.size(), device=position_ids.device, dtype=torch.int32
+            ),
+        )
+    )
+    # NOTE: With torch compile, this will cause a graph break if you don't set
+    # `TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1` in the environment or call
+    # `torch._dynamo.config.capture_scalar_outputs = True` before doing the forward pass.
+    # This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
+    # requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
+    # https://github.com/Dao-AILab/flash-attention/blob/2dd8078adc1d9b74e315ee99718c0dea0de8eeb6/flash_attn/flash_attn_interface.py#L1423-L1424
+    # We should use cu_seq_lens instead of position_ids to get the max length since position_ids is not always increasing
+    # for some models (e.g. qwen2-vl).
+    max_length = cu_seq_lens.diff().max().item()
+    return (
+        query,
+        key,
+        value,
+        indices_q,
+        (cu_seq_lens, cu_seq_lens),
+        (max_length, max_length),
+    )
+
+
+def patch_prepare_from_posids():
+    import transformers.modeling_flash_attention_utils
+
+    transformers.modeling_flash_attention_utils._prepare_from_posids = (  # pylint: disable=protected-access
+        _prepare_from_posids
+    )
+    setattr(
+        sys.modules["transformers.modeling_flash_attention_utils"],
+        "_prepare_from_posids",
+        _prepare_from_posids,
+    )
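Note on the new file above: `_prepare_from_posids` derives the cumulative sequence lengths from the positions where `position_ids == 0`, then appends the total token count. The same arithmetic can be worked through in plain Python without torch. The helper name `cu_seqlens_from_position_ids` is hypothetical, and the sketch assumes a non-empty, flattened input whose first position id is 0.

```python
def cu_seqlens_from_position_ids(position_ids):
    """Return (cu_seqlens, max_len) for a flat list of packed position ids.

    Assumes the list is non-empty and starts with 0.
    """
    # Every 0 marks the first token of a packed sequence.
    starts = [i for i, pos in enumerate(position_ids) if pos == 0]
    # Close the last sequence with the total number of tokens.
    cu_seqlens = starts + [len(position_ids)]
    # Adjacent differences give the per-sequence lengths.
    lengths = [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]
    return cu_seqlens, max(lengths)


# Three sequences of lengths 3, 2 and 4 packed into one row.
packed = [0, 1, 2, 0, 1, 0, 1, 2, 3]
cu, max_len = cu_seqlens_from_position_ids(packed)
print(cu, max_len)  # [0, 3, 5, 9] 4
```

Taking the maximum over the length differences, rather than over the position ids themselves, matches the comment in the patch that position ids are not always increasing for some models.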
--- a/tests/core/test_builders.py
+++ b/tests/core/test_builders.py
@@ -281,9 +281,7 @@ class TestHFRLTrainerBuilder:
        # Other settings
        assert training_arguments.dataloader_num_workers == 1
        assert training_arguments.dataloader_pin_memory is True
-
-        # TODO(wing): restore once trl releases 0.22.0
-        # assert training_arguments.gradient_checkpointing is True
+        assert training_arguments.gradient_checkpointing is False

    def test_dpo_training_arguments(self, dpo_cfg, model, tokenizer):
        builder = HFRLTrainerBuilder(dpo_cfg, model, tokenizer)
--- a/tests/monkeypatch/test_trainer_loss_calc.py
+++ b/tests/monkeypatch/test_trainer_loss_calc.py
@@ -3,6 +3,7 @@
 import unittest

 from axolotl.monkeypatch.transformers.trainer_loss_calc import (
+    check_evaluation_loop_is_fsdp2_patchable,
    check_evaluation_loop_is_patchable,
    check_maybe_log_save_evaluate_is_patchable,
 )
@@ -19,6 +20,7 @@ class TestTrainerLossCalc(unittest.TestCase):
        the patched code changes upstream.
        """
        assert check_evaluation_loop_is_patchable()
+        assert check_evaluation_loop_is_fsdp2_patchable()
        assert check_maybe_log_save_evaluate_is_patchable()
Author	SHA1	Message	Date
Wing Lian	160ba459ea	tag v0.12.1	2025-08-11 09:37:40 -04:00
Wing Lian	7a09f76644	fix ray train and add fsdp2 smoke test for ray trainer (#3053) * add fsdp2 smoke test for ray trainer * fix raytrain with fsdp2	2025-08-11 09:36:10 -04:00
Wing Lian	47304c7f8a	use exec instead of subprocess to make ctrl+c nicer for cli (#3044) * use exec instead of subprocess to make ctrl+c nicer for cli * change var name to use_exec * simplify to bool * flush std* * patch subprocess as mock in test * fix tests * more test fixes	2025-08-11 09:36:01 -04:00