fix ddp for incorrect steps (#2915)

* fix ddp for incorrect steps
* add test

Liquid Foundation Model 2 support (#2905)
2025-07-14 07:51:16 -04:00 · 2025-07-12 11:41:34 -04:00 · 2025-07-12 11:40:30 -04:00 · 2025-07-12 11:40:01 -04:00 · 2025-07-12 11:39:51 -04:00 · 2025-07-12 15:18:01 +01:00
96 changed files with 1755 additions and 486 deletions
--- a/.github/workflows/preview-docs.yml
+++ b/.github/workflows/preview-docs.yml
@@ -28,6 +28,8 @@ jobs:
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.sha }}

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
@@ -50,10 +52,11 @@ jobs:

      - name: Netlify Publish
        uses: nwtgck/actions-netlify@v3.0
+        id: netlify
        with:
          publish-dir: './_site'
-          enable-pull-request-comment: true
-          enable-github-deployment: true
+          enable-pull-request-comment: false
+          enable-github-deployment: false
          github-token: ${{ secrets.GITHUB_TOKEN }}
          deploy-message: "Deployed On Netlify"
          github-deployment-environment: 'preview'
@@ -61,3 +64,13 @@ jobs:
        env:
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
+
+      - name: Update PR with preview link
+        if: ${{ steps.netlify.outcome == 'success' }}
+        uses: marocchino/sticky-pull-request-comment@v2
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          message: |
+            📖 **Documentation Preview**: ${{ steps.netlify.outputs.deploy-url }}
+
+            Deployed on Netlify from commit ${{ github.event.pull_request.head.sha }}
--- a/.runpod/src/config/config.yaml
+++ b/.runpod/src/config/config.yaml
@@ -97,7 +97,7 @@
 #       # 'no_input_format' cannot include {input}
 #       no_input_format: "{instruction} "

-#       # For `completion` datsets only, uses the provided field instead of `text` column
+#       # For `completion` datasets only, uses the provided field instead of `text` column
 #       field:

 # # Axolotl attempts to save the dataset as an arrow after packing the data together so
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -187,6 +187,7 @@ Instead of passing `tools` via the system prompt, an alternative method would be
            "role": "assistant", // call the function via assistant
            "tool_calls": [
                {
+                    "id": "...",  // required only for mistral
                    "type": "function",
                    "function": {
                        "name": "...",
@@ -199,6 +200,7 @@ Instead of passing `tools` via the system prompt, an alternative method would be
        },
        {
            "role": "tool",
+            "tool_call_id": "...",  // required only for mistral
            "name": "...",
            "content": "..."
        },
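The two fields added in this hunk have to agree with each other: the `tool_call_id` on a `tool` message points back to the `id` of an earlier assistant tool call. A minimal sketch of a consistency check, assuming messages are plain dicts shaped like the example above. `find_unmatched_tool_results` is a hypothetical helper, not part of Axolotl or mistral-common, and the id value is made up for illustration.

```python
def find_unmatched_tool_results(messages):
    """Return indices of `tool` messages whose `tool_call_id` does not match
    the `id` of any earlier assistant tool call."""
    seen_ids = set()
    unmatched = []
    for index, message in enumerate(messages):
        if message.get("role") == "assistant":
            # Collect the ids announced by the assistant's tool calls.
            for call in message.get("tool_calls") or []:
                if "id" in call:
                    seen_ids.add(call["id"])
        elif message.get("role") == "tool":
            # A tool result must reference an id seen earlier in the conversation.
            if message.get("tool_call_id") not in seen_ids:
                unmatched.append(index)
    return unmatched


# Illustrative conversation (hypothetical function name and id).
example_messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call00001",
                "type": "function",
                "function": {"name": "get_weather", "arguments": {"city": "Paris"}},
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call00001",
        "name": "get_weather",
        "content": "18C",
    },
]

print(find_unmatched_tool_results(example_messages))
```

Running a check like this over a dataset before training can surface rows where the ids were dropped or mismatched during export.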
--- a/docs/docker.qmd
+++ b/docs/docker.qmd
@@ -34,6 +34,7 @@ Tags examples:

 - `main-base-py3.11-cu128-2.7.1`
 - `main-base-py3.11-cu126-2.7.1`
+- `main-base-py3.11-cu126-2.7.0`
 - `main-base-py3.11-cu126-2.6.0`
 - `main-base-py3.11-cu124-2.6.0`

@@ -75,6 +76,7 @@ Tags examples:

 - `main-py3.11-cu128-2.7.1`
 - `main-py3.11-cu126-2.7.1`
+- `main-py3.11-cu126-2.7.0`
 - `main-py3.11-cu126-2.6.0`
 - `main-py3.11-cu124-2.6.0`
 - `main-latest`
--- a/docs/multi-gpu.qmd
+++ b/docs/multi-gpu.qmd
@@ -23,8 +23,6 @@ Axolotl supports several methods for multi-GPU training:

 ## DeepSpeed {#sec-deepspeed}

-DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It provides various optimization levels through ZeRO stages.
-
 ### Configuration {#sec-deepspeed-config}

 Add to your YAML config:
@@ -32,7 +30,6 @@ Add to your YAML config:
 ```{.yaml}
 deepspeed: deepspeed_configs/zero1.json
 ```
-
 ### Usage {#sec-deepspeed-usage}

 ```{.bash}
@@ -66,9 +63,75 @@ Start from Stage 1 -> Stage 2 -> Stage 3.

 :::

-## FSDP {#sec-fsdp}
+::: {.callout-tip}

-### Basic FSDP Configuration {#sec-fsdp-config}
+Using ZeRO Stage 3 with Single-GPU training
+
+ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
+`WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500`
+
+:::
+
+## Fully Sharded Data Parallel (FSDP) {#sec-fsdp}
+
+::: {.callout-note}
+
+FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.
+
+:::
+
+### Migrating from FSDP1 to FSDP2 {#sec-migrate-fsdp1-fsdp2}
+
+To migrate your config from FSDP1 to FSDP2, you must use the `fsdp_version` top-level config field to specify the FSDP version, and
+also follow the config field mapping below to update field names.
+
+#### Config mapping
+
+FSDP1 | FSDP2
+-------- | --------
+fsdp_sharding_strategy | reshard_after_forward
+fsdp_backward_prefetch_policy | **REMOVED**
+fsdp_backward_prefetch | **REMOVED**
+fsdp_forward_prefetch | **REMOVED**
+fsdp_sync_module_states | **REMOVED**
+fsdp_cpu_ram_efficient_loading | cpu_ram_efficient_loading
+fsdp_state_dict_type | state_dict_type
+fsdp_use_orig_params | **REMOVED**
+
+
+For example, if you were using the following FSDP1 config:
+
+```{.yaml}
+fsdp_version: 1
+fsdp_config:
+  fsdp_offload_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+```
+
+You can migrate to the following FSDP2 config:
+
+```{.yaml}
+fsdp_version: 2
+fsdp_config:
+  offload_params: false
+  cpu_ram_efficient_loading: true
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
+  state_dict_type: FULL_STATE_DICT
+  reshard_after_forward: true
+```
+
+### FSDP1 (deprecated) {#sec-fsdp-config}
+
+::: {.callout-note}
+
+Using `fsdp` to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use `fsdp_config` as above instead.
+
+:::

 ```{.yaml}
 fsdp:
@@ -80,6 +143,7 @@ fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
 ```

+
 ## Sequence parallelism {#sec-sequence-parallelism}

 We support sequence parallelism (SP) via the
--- a/docs/multi-node.qmd
+++ b/docs/multi-node.qmd
@@ -40,13 +40,13 @@ use_cpu: false

 Configure your model to use FSDP in the Axolotl yaml. For example:
 ```yaml
-fsdp:
-  - full_shard
-  - auto_wrap
+fsdp_version: 2
 fsdp_config:
-  fsdp_offload_params: true
-  fsdp_state_dict_type: FULL_STATE_DICT
-  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  offload_params: true
+  state_dict_type: FULL_STATE_DICT
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  reshard_after_forward: true
 ```

 All you have to do now is launch using accelerate as you would usually do on each machine and voila, the processes will start once you have launched accelerate on every machine.
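The renames in this hunk follow the FSDP1 to FSDP2 field mapping documented in `docs/multi-gpu.qmd`. As a rough illustration, the mapping can be sketched as a small function. This is a hypothetical helper, not Axolotl's `migrate_fsdp_config`; the `FULL_SHARD` to `true` mapping comes from the docs example, while the `SHARD_GRAD_OP` to `false` mapping is an assumption. It needs Python 3.9+ for `str.removeprefix`.

```python
# Hypothetical sketch; NOT Axolotl's actual `migrate_fsdp_config`.

# Fields the docs table lists as removed in FSDP2.
REMOVED_IN_FSDP2 = {
    "fsdp_backward_prefetch_policy",
    "fsdp_backward_prefetch",
    "fsdp_forward_prefetch",
    "fsdp_sync_module_states",
    "fsdp_use_orig_params",
}

# FULL_SHARD -> True follows the docs example; SHARD_GRAD_OP -> False is an assumption.
RESHARD_AFTER_FORWARD = {"FULL_SHARD": True, "SHARD_GRAD_OP": False}


def sketch_migrate_fsdp1_config(fsdp1_config):
    """Translate an FSDP1-style `fsdp_config` dict into FSDP2-style field names."""
    fsdp2_config = {}
    for key, value in fsdp1_config.items():
        if key in REMOVED_IN_FSDP2:
            continue  # no FSDP2 equivalent
        if key == "fsdp_sharding_strategy":
            # Raises KeyError for strategies this sketch does not cover.
            fsdp2_config["reshard_after_forward"] = RESHARD_AFTER_FORWARD[value]
            continue
        # Remaining fields only lose the `fsdp_` prefix.
        fsdp2_config[key.removeprefix("fsdp_")] = value
    return fsdp2_config


fsdp1_example = {
    "fsdp_offload_params": False,
    "fsdp_cpu_ram_efficient_loading": True,
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_transformer_layer_cls_to_wrap": "Qwen3DecoderLayer",
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_sharding_strategy": "FULL_SHARD",
}
print(sketch_migrate_fsdp1_config(fsdp1_example))
```

The top-level `fsdp_version: 2` field still has to be set separately; the sketch only covers the keys inside `fsdp_config`.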
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -17,7 +17,6 @@ feedback. Various methods include, but not limited to:
 - [Kahneman-Tversky Optimization (KTO)](#kto)
 - [Odds Ratio Preference Optimization (ORPO)](#orpo)
 - [Group Relative Policy Optimization (GRPO)](#grpo)
- Proximal Policy Optimization (PPO) (not yet supported in axolotl, if you're interested in contributing, please reach out!)


 ## RLHF using Axolotl
@@ -275,15 +274,14 @@ rl: dpo
 datasets:
  - path: ...
    split: train
-    type: user_defined.default
-
-    field_prompt: "prompt"
-    field_system: "system"
-    field_chosen: "chosen"
-    field_rejected: "rejected"
-    prompt_format: "{prompt}"
-    chosen_format: "{chosen}"
-    rejected_format: "{rejected}"
+    type:
+      field_prompt: "prompt"
+      field_system: "system"
+      field_chosen: "chosen"
+      field_rejected: "rejected"
+      prompt_format: "{prompt}"
+      chosen_format: "{chosen}"
+      rejected_format: "{rejected}"
 ```

 The input format is a simple JSON input with customizable fields based on the above config.
@@ -476,14 +474,13 @@ rl: kto
 datasets:
  - path: ...
    split: train
-    type: user_defined.default
-
-    field_prompt: "prompt"
-    field_system: "system"
-    field_completion: "completion"
-    field_label: "label"
-    prompt_format: "{prompt}"
-    completion_format: "{completion}"
+    type:
+      field_prompt: "prompt"
+      field_system: "system"
+      field_completion: "completion"
+      field_label: "label"
+      prompt_format: "{prompt}"
+      completion_format: "{completion}"
 ```

 The input format is a simple JSON input with customizable fields based on the above config.
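To make the nested `type:` mapping concrete, here is a minimal sketch of how such a config could be applied to one JSON row. It shows the DPO variant; KTO would use `completion` and `label` in place of `chosen` and `rejected`. `sketch_apply_dpo_type` is a hypothetical function for illustration, and Axolotl's own transform may differ in details such as system-prompt handling.

```python
def sketch_apply_dpo_type(row, type_cfg):
    """Map one raw dataset row to prompt/chosen/rejected strings
    using the field names and format strings from the `type:` mapping."""
    # Read the raw values using the configured column names.
    system = row.get(type_cfg["field_system"], "")
    prompt = row[type_cfg["field_prompt"]]
    chosen = row[type_cfg["field_chosen"]]
    rejected = row[type_cfg["field_rejected"]]
    # str.format ignores unused keyword arguments, so `system` is only
    # inserted when the format string actually contains `{system}`.
    return {
        "prompt": type_cfg["prompt_format"].format(system=system, prompt=prompt),
        "chosen": type_cfg["chosen_format"].format(chosen=chosen),
        "rejected": type_cfg["rejected_format"].format(rejected=rejected),
    }


dpo_type_cfg = {
    "field_prompt": "prompt",
    "field_system": "system",
    "field_chosen": "chosen",
    "field_rejected": "rejected",
    "prompt_format": "{prompt}",
    "chosen_format": "{chosen}",
    "rejected_format": "{rejected}",
}
dpo_row = {"system": "Be brief.", "prompt": "2+2?", "chosen": "4", "rejected": "5"}
print(sketch_apply_dpo_type(dpo_row, dpo_type_cfg))
```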
--- a/examples/archived/README.md
+++ b/examples/archived/README.md
@@ -0,0 +1,5 @@
+# Archived Examples
+
+This directory contains examples that are no longer maintained and may no longer be functional.
+
+We keep them around for archival purposes in case they are useful to others.
--- a/examples/archived/cerebras/btlm-ft.yml
+++ b/examples/archived/cerebras/btlm-ft.yml
--- a/examples/archived/cerebras/qlora.yml
+++ b/examples/archived/cerebras/qlora.yml
--- a/examples/archived/code-llama/13b/lora.yml
+++ b/examples/archived/code-llama/13b/lora.yml
--- a/examples/archived/code-llama/13b/qlora.yml
+++ b/examples/archived/code-llama/13b/qlora.yml
--- a/examples/archived/code-llama/34b/lora.yml
+++ b/examples/archived/code-llama/34b/lora.yml
--- a/examples/archived/code-llama/34b/qlora.yml
+++ b/examples/archived/code-llama/34b/qlora.yml
--- a/examples/archived/code-llama/7b/lora.yml
+++ b/examples/archived/code-llama/7b/lora.yml
--- a/examples/archived/code-llama/7b/qlora.yml
+++ b/examples/archived/code-llama/7b/qlora.yml
--- a/examples/archived/code-llama/README.md
+++ b/examples/archived/code-llama/README.md
--- a/examples/archived/dbrx/16bit-lora.yaml
+++ b/examples/archived/dbrx/16bit-lora.yaml
--- a/examples/archived/dbrx/8bit-lora.yaml
+++ b/examples/archived/dbrx/8bit-lora.yaml
--- a/examples/archived/dbrx/README.md
+++ b/examples/archived/dbrx/README.md
--- a/examples/archived/dbrx/fft-ds-zero3.yaml
+++ b/examples/archived/dbrx/fft-ds-zero3.yaml
--- a/examples/archived/deepcoder/deepcoder-14B-preview-lora.yml
+++ b/examples/archived/deepcoder/deepcoder-14B-preview-lora.yml
--- a/examples/archived/falcon/config-7b-lora.yml
+++ b/examples/archived/falcon/config-7b-lora.yml
--- a/examples/archived/falcon/config-7b-qlora.yml
+++ b/examples/archived/falcon/config-7b-qlora.yml
--- a/examples/archived/falcon/config-7b.yml
+++ b/examples/archived/falcon/config-7b.yml
--- a/examples/archived/gemma/qlora.yml
+++ b/examples/archived/gemma/qlora.yml
--- a/examples/archived/gptj/qlora.yml
+++ b/examples/archived/gptj/qlora.yml
--- a/examples/archived/jeopardy-bot/config.yml
+++ b/examples/archived/jeopardy-bot/config.yml
--- a/examples/archived/mpt-7b/README.md
+++ b/examples/archived/mpt-7b/README.md
--- a/examples/archived/mpt-7b/config.yml
+++ b/examples/archived/mpt-7b/config.yml
--- a/examples/archived/openllama-3b/README.md
+++ b/examples/archived/openllama-3b/README.md
--- a/examples/archived/openllama-3b/config.yml
+++ b/examples/archived/openllama-3b/config.yml
--- a/examples/archived/openllama-3b/lora.yml
+++ b/examples/archived/openllama-3b/lora.yml
--- a/examples/archived/openllama-3b/qlora.yml
+++ b/examples/archived/openllama-3b/qlora.yml
--- a/examples/archived/pythia-12b/README.md
+++ b/examples/archived/pythia-12b/README.md
--- a/examples/archived/pythia-12b/config.yml
+++ b/examples/archived/pythia-12b/config.yml
--- a/examples/archived/pythia/lora.yml
+++ b/examples/archived/pythia/lora.yml
--- a/examples/archived/qwen/README.md
+++ b/examples/archived/qwen/README.md
--- a/examples/archived/qwen/lora.yml
+++ b/examples/archived/qwen/lora.yml
--- a/examples/archived/qwen/qlora.yml
+++ b/examples/archived/qwen/qlora.yml
--- a/examples/archived/qwen/qwen2-moe-lora.yaml
+++ b/examples/archived/qwen/qwen2-moe-lora.yaml
--- a/examples/archived/qwen/qwen2-moe-qlora.yaml
+++ b/examples/archived/qwen/qwen2-moe-qlora.yaml
--- a/examples/archived/redpajama/README.md
+++ b/examples/archived/redpajama/README.md
--- a/examples/archived/redpajama/config-3b.yml
+++ b/examples/archived/redpajama/config-3b.yml
--- a/examples/archived/replit-3b/config-lora.yml
+++ b/examples/archived/replit-3b/config-lora.yml
--- a/examples/archived/stablelm-2/1.6b/fft.yml
+++ b/examples/archived/stablelm-2/1.6b/fft.yml
--- a/examples/archived/stablelm-2/1.6b/lora.yml
+++ b/examples/archived/stablelm-2/1.6b/lora.yml
--- a/examples/archived/stablelm-2/README.md
+++ b/examples/archived/stablelm-2/README.md
--- a/examples/archived/starcoder2/qlora.yml
+++ b/examples/archived/starcoder2/qlora.yml
--- a/examples/archived/tiny-llama/README.md
+++ b/examples/archived/tiny-llama/README.md
--- a/examples/archived/tiny-llama/lora-mps.yml
+++ b/examples/archived/tiny-llama/lora-mps.yml
--- a/examples/archived/tiny-llama/lora.yml
+++ b/examples/archived/tiny-llama/lora.yml
--- a/examples/archived/tiny-llama/pretrain.yml
+++ b/examples/archived/tiny-llama/pretrain.yml
--- a/examples/archived/tiny-llama/qlora.yml
+++ b/examples/archived/tiny-llama/qlora.yml
--- a/examples/archived/xgen-7b/xgen-7b-8k-qlora.yml
+++ b/examples/archived/xgen-7b/xgen-7b-8k-qlora.yml
--- a/examples/archived/yi-34B-chat/README.md
+++ b/examples/archived/yi-34B-chat/README.md
--- a/examples/archived/yi-34B-chat/qlora.yml
+++ b/examples/archived/yi-34B-chat/qlora.yml
--- a/examples/devstral/README.md
+++ b/examples/devstral/README.md
@@ -1,8 +1,12 @@
 # Finetune Devstral with Axolotl

-Devstral Small is a 24B parameter opensource model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505). This guide shows how to fine-tune it with Axolotl with multi-turn conversations with proper masking.
+Devstral Small is a 24B-parameter open-source model from MistralAI, available on HuggingFace as [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505) and [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507). `Devstral-Small-2507` is the latest version of the model and has [function calling](https://mistralai.github.io/mistral-common/usage/tools/) support.

-The model was fine-tuned ontop of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of upto 128k tokens.
+This guide shows how to fine-tune it with Axolotl on multi-turn conversations with proper masking.
+
+The model was fine-tuned on top of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of up to 128k tokens.
+
+Thanks to the team at MistralAI for giving us early access to prepare for this release.

 ## Getting started

@@ -17,11 +21,6 @@ cd axolotl

 pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation -e '.[flash-attn]'
-
-# Install the latest mistral-common from source
-pip3 uninstall mistral-common
-pip3 install git+https://github.com/mistralai/mistral-common.git@039465d
-
 ```

 2. Run the finetuning example:
@@ -39,6 +38,7 @@ Let us know how it goes. Happy finetuning! 🚀
 - You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the config.
 - Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
 - The dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
+- Learn how to use function calling with Axolotl at [docs](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#using-tool-use).

 ## Optimization Guides

@@ -57,6 +57,7 @@ In addition, we do not support overriding tokens yet.
 ## Related Resources

 - [MistralAI Devstral Blog](https://mistral.ai/news/devstral)
+- [MistralAI Devstral 1.1 Blog](https://mistral.ai/news/devstral-2507)
 - [Axolotl Docs](https://docs.axolotl.ai)
 - [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
 - [Axolotl Website](https://axolotl.ai)
--- a/examples/devstral/devstral-small-qlora.yml
+++ b/examples/devstral/devstral-small-qlora.yml
@@ -1,4 +1,4 @@
-base_model: mistralai/Devstral-Small-2505
+base_model: mistralai/Devstral-Small-2507

 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
--- a/examples/lfm2/README.md
+++ b/examples/lfm2/README.md
@@ -0,0 +1,7 @@
+# Liquid Foundation Models 2
+
+LFM2 support in transformers exists in the main branch, but is not yet included in the transformers release.
+
+```bash
+pip install --upgrade --no-deps --force-reinstall git+https://github.com/huggingface/transformers.git
+```
--- a/examples/lfm2/lfm2-350m-fft.yaml
+++ b/examples/lfm2/lfm2-350m-fft.yaml
@@ -0,0 +1,48 @@
+base_model: LiquidAI/LFM2-350M
+
+chunked_cross_entropy: true
+
+chat_template: tokenizer_default
+eot_tokens:
+  - "<|im_end|>"
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train[:20%]
+    field_messages: conversations
+    message_field_role: from
+    message_field_content: value
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.05
+output_dir: ./outputs/out
+
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 2
+micro_batch_size: 4
+num_epochs: 1
+optimizer: adamw_torch_fused
+lr_scheduler: cosine
+learning_rate: 5e-5
+
+bf16: true
+tf32: true
+
+gradient_checkpointing: false
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 2
+saves_per_epoch: 1
+
+weight_decay: 0.0
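For orientation, the batch settings in this config can be turned into a per-step sequence and token budget. A small worked sketch follows; the GPU count is an assumption (the config does not fix it), and the token figure is an upper bound that holds when every packed sequence is filled to `sequence_len`.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus):
    """Sequences consumed per optimizer step across all data-parallel ranks."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Values from lfm2-350m-fft.yaml; num_gpus=1 is an assumption.
sequences_per_step = effective_batch_size(
    micro_batch_size=4, gradient_accumulation_steps=2, num_gpus=1
)
max_tokens_per_step = sequences_per_step * 4096  # sequence_len

print(sequences_per_step)   # 8
print(max_tokens_per_step)  # 32768
```

On more GPUs the same config processes proportionally more sequences per optimizer step, so the number of optimizer steps per epoch drops accordingly.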
--- a/requirements.txt
+++ b/requirements.txt
@@ -68,4 +68,4 @@ schedulefree==1.4.1
 axolotl-contribs-lgpl==0.0.6
 axolotl-contribs-mit==0.0.3

-mistral-common==1.6.3
+mistral-common==1.7.0
--- a/src/axolotl/__init__.py
+++ b/src/axolotl/__init__.py
@@ -4,4 +4,4 @@ import pkgutil

 __path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package

-__version__ = "0.11.0"
+__version__ = "0.12.0.dev"
--- a/src/axolotl/cli/config.py
+++ b/src/axolotl/cli/config.py
@@ -16,6 +16,7 @@ from transformers.utils import is_torch_bf16_gpu_available
 from axolotl.integrations.base import PluginManager
 from axolotl.utils.comet_ import setup_comet_env_vars
 from axolotl.utils.config import (
+    migrate_fsdp_config,
    normalize_cfg_datasets,
    normalize_config,
    validate_config,
@@ -226,6 +227,7 @@ def load_cfg(
        },
    )

+    migrate_fsdp_config(cfg)
    prepare_optim_env(cfg)
    prepare_opinionated_env(cfg)
    normalize_config(cfg)
--- a/src/axolotl/cli/train.py
+++ b/src/axolotl/cli/train.py
@@ -109,6 +109,13 @@ def ray_train_func(kwargs: dict):
    # initialize accelerator before model instantiation
    Accelerator(gradient_accumulation_steps=cfg.gradient_accumulation_steps)

+    # Register plugins in Ray workers
+    if cfg.get("plugins"):
+        from axolotl.cli.config import plugin_set_cfg, prepare_plugins
+
+        prepare_plugins(cfg)
+        plugin_set_cfg(cfg)
+
    kwargs["cfg"] = cfg

    do_train(**kwargs)
--- a/src/axolotl/core/builders/base.py
+++ b/src/axolotl/core/builders/base.py
@@ -501,6 +501,10 @@ class TrainerBuilderBase(abc.ABC):
        if self.cfg.reward_model or self.cfg.rl:
            training_args_kwargs["max_length"] = self.cfg.sequence_len

+        if self.cfg.fsdp_config or self.cfg.fsdp:
+            training_args_kwargs["fsdp_config"] = self.cfg.fsdp_config
+            training_args_kwargs["fsdp"] = self.cfg.fsdp if self.cfg.fsdp else True
+
        self._configure_reporting(training_args_kwargs)
        self._configure_hub_parameters(training_args_kwargs)
        self._configure_scheduler(training_args_kwargs)
--- a/src/axolotl/core/builders/causal.py
+++ b/src/axolotl/core/builders/causal.py
@@ -151,14 +151,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        training_arguments_kwargs, trainer_kwargs = self._set_base_training_args(
            total_num_steps
        )
-
-        if self.cfg.fsdp:
-            training_arguments_kwargs["fsdp"] = self.cfg.fsdp
-            if self.cfg.fsdp_config:
-                training_arguments_kwargs["fsdp_config"] = {
-                    k.lstrip("fsdp_"): v for k, v in dict(self.cfg.fsdp_config).items()
-                }
-
        if self.cfg.adapter == "qlora":
            training_arguments_kwargs["qlora"] = True
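One detail of the removed block is worth noting: `str.lstrip` takes a set of characters to strip, not a prefix, so `k.lstrip("fsdp_")` can eat into the field name itself. The PR does not say whether this motivated the refactor; the snippet below only demonstrates the standard-library behaviour, with `str.removeprefix` (Python 3.9+) shown as the prefix-safe alternative.

```python
key = "fsdp_state_dict_type"

# lstrip removes any leading run of the characters f, s, d, p and _,
# which also swallows the leading "s" of "state".
stripped_chars = key.lstrip("fsdp_")

# removeprefix removes the exact prefix once, if present.
stripped_prefix = key.removeprefix("fsdp_")

print(stripped_chars)   # tate_dict_type
print(stripped_prefix)  # state_dict_type
```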

--- a/src/axolotl/core/builders/rl.py
+++ b/src/axolotl/core/builders/rl.py
@@ -208,7 +208,7 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            callbacks=self.get_callbacks(),
            **trainer_kwargs,
        )
-        if self.cfg.fsdp:
+        if self.cfg.fsdp_config or self.cfg.fsdp:
            ensure_dtype(trainer.model, dtype=self.cfg.torch_dtype)
            if self.cfg.rl in [RLType.DPO, RLType.IPO] and trainer.ref_model:
                ensure_dtype(trainer.ref_model, dtype=self.cfg.torch_dtype)
@@ -218,21 +218,3 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            trainer.add_callback(callback)

        return trainer
-
-
-class HFPPOTrainerBuilder(TrainerBuilderBase):
-    """
-    HF Factory class for PPO Trainer
-    """
-
-    def get_callbacks(self):
-        callbacks = super().get_callbacks()
-        return callbacks
-
-    def get_post_trainer_create_callbacks(self, trainer):
-        callbacks = super().get_post_trainer_create_callbacks(trainer=trainer)
-        return callbacks
-
-    def build(self, total_num_steps):
-        # TODO: build PPOConfig
-        raise NotImplementedError("PPO trainer builder is not implemented yet.")
--- a/src/axolotl/core/trainers/__init__.py
+++ b/src/axolotl/core/trainers/__init__.py
@@ -14,5 +14,4 @@ from .trl import (
    AxolotlORPOTrainer,
    AxolotlPRMTrainer,
    AxolotlRewardTrainer,
-    TRLPPOTrainer,
 )
--- a/src/axolotl/core/trainers/trl.py
+++ b/src/axolotl/core/trainers/trl.py
@@ -1,12 +1,9 @@
-"""Module for TRL PPO trainer"""
+"""Module for TRL RL trainers"""

-import torch
-from tqdm import tqdm
 from trl import (
    CPOTrainer,
    KTOTrainer,
    ORPOTrainer,
-    PPOTrainer,
    PRMTrainer,
    RewardTrainer,
 )
@@ -16,64 +13,6 @@ from axolotl.core.trainers.mixins.optimizer import OptimizerInitMixin, Optimizer
 from axolotl.core.trainers.mixins.scheduler import SchedulerMixin


-class TRLPPOTrainer(PPOTrainer):
-    """Wrapper for TRL PPO trainer to handle customizations"""
-
-    tag_names = ["axolotl", "ppo"]
-
-    def train(
-        self,
-        reward_pipe,
-        resume_from_checkpoint=None,  # pylint: disable=unused-argument
-    ):
-        generation_kwargs = {
-            "min_length": -1,
-            "top_k": 0.0,
-            "top_p": 1.0,
-            "do_sample": True,
-            "pad_token_id": self.tokenizer.eos_token_id,
-            "max_new_tokens": 32,
-        }
-        sent_kwargs = {
-            "return_all_scores": True,
-            "function_to_apply": "none",
-            "batch_size": 16,
-        }
-
-        for _, batch in tqdm(enumerate(self.dataloader)):
-            query_tensors = batch["input_ids"]
-
-            # generate model response
-            response_tensors, ref_response_tensors = self.generate(
-                query_tensors,
-                return_prompt=False,
-                generate_ref_response=True,
-                **generation_kwargs,
-            )
-            batch["response"] = self.tokenizer.batch_decode(response_tensors)
-            batch["ref_response"] = self.tokenizer.batch_decode(ref_response_tensors)
-
-            # Compute sentiment score
-            texts = [q + r for q, r in zip(batch["query"], batch["response"])]
-            pipe_outputs = reward_pipe(texts, **sent_kwargs)
-            rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
-            ref_texts = [q + r for q, r in zip(batch["query"], batch["ref_response"])]
-            ref_pipe_outputs = reward_pipe(ref_texts, **sent_kwargs)
-            ref_rewards = [
-                torch.tensor(output[1]["score"]) for output in ref_pipe_outputs
-            ]
-            batch["ref_rewards"] = ref_rewards
-
-            # Run PPO step
-            stats = self.step(query_tensors, response_tensors, rewards)
-            self.log_stats(
-                stats,
-                batch,
-                rewards,
-                columns_to_log=["query", "response", "ref_response", "ref_rewards"],
-            )
-
-
 class AxolotlORPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, ORPOTrainer
 ):
--- a/src/axolotl/loaders/adapter.py
+++ b/src/axolotl/loaders/adapter.py
@@ -122,9 +122,9 @@ def load_lora(
    rank = int(os.environ.get("LOCAL_RANK", 0))

    if (
-        cfg.fsdp
+        cfg.fsdp_config
        and cfg.adapter
-        and cfg.fsdp_config.fsdp_cpu_ram_efficient_loading
+        and cfg.fsdp_config.cpu_ram_efficient_loading
        and rank != 0
    ):
        setup_quantized_meta_for_peft(model)
@@ -152,9 +152,9 @@ def load_lora(
                "Exception caught during model.print_trainable_parameters(): %s", exc
            )
    elif (
-        cfg.fsdp
+        cfg.fsdp_config
        and cfg.adapter
-        and cfg.fsdp_config.fsdp_cpu_ram_efficient_loading
+        and cfg.fsdp_config.cpu_ram_efficient_loading
        and rank != 0
    ):
        setup_quantized_peft_meta_for_training(model)
--- a/src/axolotl/loaders/model.py
+++ b/src/axolotl/loaders/model.py
@@ -140,10 +140,15 @@ class ModelLoader:
        """Check if flash attention is installed."""
        return find_spec("flash_attn") is not None

-    @cached_property
-    def qlora_fsdp(self):
+    @property
+    def is_fsdp_enabled(self):
+        """Property that determines if FSDP is enabled."""
+        return self.cfg.fsdp_config is not None or self.cfg.fsdp is not None
+
+    @property
+    def is_qlora_and_fsdp_enabled(self):
        """Property that determines if FSDP with QLoRA is enabled."""
-        return self.cfg.fsdp and self.cfg.adapter == "qlora"
+        return self.is_fsdp_enabled and self.cfg.adapter == "qlora"

    def load(self) -> tuple[PreTrainedModel | PeftModelForCausalLM, PeftConfig | None]:
        """Load and prepare the model with all configurations and patches.
@@ -189,15 +194,15 @@ class ModelLoader:
        # Handle PeftModel if needed
        if (
            isinstance(self.model, (peft.PeftModel, peft.PeftModelForCausalLM))
-            and not self.qlora_fsdp
+            and not self.is_qlora_and_fsdp_enabled
        ):
            self.model = self.model.merge_and_unload()

        self._resize_token_embeddings()
        self._adjust_model_config()
-        self._log_memory_usage()
        self._configure_embedding_dtypes()
        self._configure_qat()
+        log_gpu_memory_usage(LOG, "Memory usage after model load", 0)

    def _resize_token_embeddings(self):
        """Resize token embeddings if needed."""
@@ -251,22 +256,13 @@ class ModelLoader:
        ):
            self.model.config.eos_token_id = self.tokenizer.eos_token_id

-    def _log_memory_usage(self):
-        """Log device memory usage after model load."""
-        if hasattr(self.model, "device") and self.model.device.type in (
-            "cuda",
-            "mps",
-            "npu",
-        ):
-            log_gpu_memory_usage(LOG, "after model load", self.model.device)
-
    def _configure_embedding_dtypes(self):
        """Configure embedding module dtypes."""
        # Get embedding modules
        embedding_modules = get_linear_embedding_layers(self.cfg.model_config_type)

        # Initial dtype conversion
-        if not self.cfg.fsdp:
+        if not self.is_fsdp_enabled:
            # We don't run this during FSDP because this will leave mixed and bfloat16
            # dtypes in the model which FSDP doesn't like
            if self.cfg.load_in_4bit and self.cfg.embeddings_skip_upcast:
@@ -282,7 +278,7 @@ class ModelLoader:
            self._set_z3_leaf_modules()

        # Apply gradient checkpointing if needed
-        needs_fa2_dtype = self.cfg.adapter or self.cfg.fsdp
+        needs_fa2_dtype = self.cfg.adapter or self.is_fsdp_enabled
        if self.cfg.adapter in ["lora", "qlora"]:
            needs_fa2_dtype = True
            if self.cfg.gradient_checkpointing:
@@ -298,10 +294,12 @@ class ModelLoader:
            # we need to convert them back to fp16/bf16 for flash-attn compatibility.
            (
                (needs_fa2_dtype or self.cfg.flash_attention or self.cfg.flex_attention)
-                and not self.qlora_fsdp
+                and not self.is_qlora_and_fsdp_enabled
+            )
+            or (
+                # CCE requires embedding layers to be in fp16/bf16 for backward pass
+                self.cfg.cut_cross_entropy
            )
-            # CCE requires embedding layers to be in fp16/bf16 for backward pass
-            or self.cfg.cut_cross_entropy
        )

        if should_convert:
@@ -357,7 +355,6 @@ class ModelLoader:
            and not (self.cfg.rl and self.cfg.load_in_4bit)
            and not skip_move_to_device
        ):
-            # TODO: validate this conditional
            self.model.to(f"{str(get_device_type())}:{self.cfg.local_rank}")

        if get_device_count() > 1 and int(os.getenv("WORLD_SIZE", "1")) == 1:
@@ -430,7 +427,17 @@ class ModelLoader:

        self.model_kwargs["torch_dtype"] = self.cfg.torch_dtype

-        if not is_deepspeed_zero3_enabled():
+        is_ds_zero3 = is_deepspeed_zero3_enabled()
+
+        # FSDP requires control over device placement, so don't set device_map when FSDP is enabled
+        if self.is_fsdp_enabled:
+            # For QLoRA + FSDP, we still need to pin device_map to this process's LOCAL_RANK device for proper initialization
+            if self.is_qlora_and_fsdp_enabled:
+                self.model_kwargs["device_map"] = {
+                    "": int(os.environ.get("LOCAL_RANK", 0))
+                }
+            # For other FSDP cases, don't set device_map at all
+        elif not is_ds_zero3:
            self.model_kwargs["device_map"] = device_map

            cur_device = get_device_type()
@@ -499,7 +506,7 @@ class ModelLoader:
                "bnb_4bit_quant_storage": torch.bfloat16,
            }
            if self.cfg.model_config_type in ["jamba", "qwen2_moe"] and not (
-                self.cfg.deepspeed or self.cfg.fsdp
+                self.cfg.deepspeed or self.is_fsdp_enabled
            ):
                # for some reason, this causes the loss to be off by an order of magnitude
                # but deepspeed needs this still in bfloat16
@@ -604,9 +611,21 @@ class ModelLoader:
    def _build_model(self) -> bool:
        """Load model, with load strategy depending on config."""
        skip_move_to_device = False
+        if self.is_fsdp_enabled:
+            if self.cfg.fsdp_config.cpu_ram_efficient_loading:
+                skip_move_to_device = True
+                # Don't delete device_map for QLoRA + FSDP - it was set correctly in _set_device_map
+                if (
+                    "device_map" in self.model_kwargs
+                    and not self.is_qlora_and_fsdp_enabled
+                ):
+                    del self.model_kwargs["device_map"]
+            elif self.is_qlora_and_fsdp_enabled:
+                skip_move_to_device = True
+
        if (
-            self.qlora_fsdp
-            and self.cfg.fsdp_config.fsdp_cpu_ram_efficient_loading
+            self.is_qlora_and_fsdp_enabled
+            and self.cfg.fsdp_config.cpu_ram_efficient_loading
            and (
                self.cfg.model_config_type == "dbrx"
                or self.cfg.qlora_sharded_model_loading
@@ -632,12 +651,6 @@ class ModelLoader:
            and not self.cfg.trust_remote_code
            and not self.cfg.gptq
        ):
-            # TODO: Do we need to open this up for all models?
-            if self.cfg.fsdp and self.cfg.fsdp_config.fsdp_cpu_ram_efficient_loading:
-                skip_move_to_device = True
-                if "device_map" in self.model_kwargs:
-                    del self.model_kwargs["device_map"]
-
            # Please don't remove underscore binding without reading the fn docstring.
            _ = self._configure_zero3_memory_efficient_loading()

@@ -691,33 +704,22 @@ class ModelLoader:
                    trust_remote_code=self.cfg.trust_remote_code or False,
                    **self.model_kwargs,
                )
+        elif self.cfg.gptq:
+            self.model = self.auto_model_loader.from_pretrained(
+                self.base_model,
+                config=self.model_config,
+                trust_remote_code=self.cfg.trust_remote_code or False,
+                **self.model_kwargs,
+            )
        else:
-            if self.cfg.gptq:
-                self.model = self.auto_model_loader.from_pretrained(
-                    self.base_model,
-                    config=self.model_config,
-                    trust_remote_code=self.cfg.trust_remote_code or False,
-                    **self.model_kwargs,
-                )
-            else:
-                if (
-                    self.cfg.fsdp
-                    and self.cfg.fsdp_config.fsdp_cpu_ram_efficient_loading
-                ):
-                    # disabling either of these two still leads to VRAM spike before setting back down
-                    skip_move_to_device = True
-                    if "device_map" in self.model_kwargs:
-                        del self.model_kwargs["device_map"]
-
-                # Please don't remove underscore binding without reading the fn docstring.
-                _ = self._configure_zero3_memory_efficient_loading()
-
-                self.model = self.auto_model_loader.from_pretrained(
-                    self.base_model,
-                    config=self.model_config,
-                    trust_remote_code=self.cfg.trust_remote_code or False,
-                    **self.model_kwargs,
-                )
+            # Please don't remove underscore binding without reading the fn docstring.
+            _ = self._configure_zero3_memory_efficient_loading()
+            self.model = self.auto_model_loader.from_pretrained(
+                self.base_model,
+                config=self.model_config,
+                trust_remote_code=self.cfg.trust_remote_code or False,
+                **self.model_kwargs,
+            )
        if is_deepspeed_zero3_enabled():
            skip_move_to_device = True

@@ -753,8 +755,8 @@ class ModelLoader:
            skip_prepare_model_for_kbit_training = True

        if (
-            self.qlora_fsdp
-            or (self.cfg.fsdp and self.cfg.fsdp_config.fsdp_cpu_ram_efficient_loading)
+            self.is_qlora_and_fsdp_enabled
+            or (self.is_fsdp_enabled and self.cfg.fsdp_config.cpu_ram_efficient_loading)
            or is_deepspeed_zero3_enabled()
        ):
            # Make sure everything is in the same dtype
--- a/src/axolotl/loaders/patch_manager.py
+++ b/src/axolotl/loaders/patch_manager.py
@@ -7,6 +7,7 @@ import importlib.util
 from functools import cached_property

 import addict
+import torch
 import transformers
 from transformers import PretrainedConfig, PreTrainedModel

@@ -93,10 +94,14 @@ class PatchManager:

    def _apply_fsdp_patches(self):
        """Apply patches for FSDP configurations."""
-        if self.cfg.fsdp_config and str(self.cfg.fsdp_config.fsdp_version) == "2":
+        if self.cfg.fsdp_config and str(self.cfg.fsdp_version) == "2":
            from axolotl.monkeypatch.accelerate.fsdp2 import patch_accelerate_fsdp2

            patch_accelerate_fsdp2()
+            if self.cfg.rl:
+                from axolotl.monkeypatch.trainer.trl import patch_trl_prepare_fsdp2
+
+                patch_trl_prepare_fsdp2()

        # if self.cfg.fsdp_config:
        #     # see transformers#39152
@@ -165,10 +170,25 @@ class PatchManager:
        """Apply patches for gradient checkpointing."""
        if self.cfg.gradient_checkpointing in ["unsloth", "offload"]:
            from axolotl.monkeypatch.gradient_checkpointing import (
+                CheckpointFunctionWithCPUOffload,
                hf_grad_checkpoint_offload_wrapper,
            )

-            transformers.modeling_utils.checkpoint = hf_grad_checkpoint_offload_wrapper
+            if (
+                self.cfg.gradient_checkpointing_kwargs
+                and "use_reentrant" in self.cfg.gradient_checkpointing_kwargs
+                and self.cfg.gradient_checkpointing_kwargs["use_reentrant"] is False
+            ):
+                transformers.modeling_utils.checkpoint = (
+                    hf_grad_checkpoint_offload_wrapper
+                )
+            else:
+                transformers.modeling_utils.checkpoint.CheckpointFunction = (
+                    CheckpointFunctionWithCPUOffload
+                )
+                torch.utils.checkpoint.CheckpointFunction = (
+                    CheckpointFunctionWithCPUOffload
+                )
        if self.cfg.gradient_checkpointing == "offload_disk":
            from axolotl.monkeypatch.gradient_checkpointing import (
                hf_grad_checkpoint_disk_offload_wrapper,
--- a/src/axolotl/loaders/utils.py
+++ b/src/axolotl/loaders/utils.py
@@ -195,9 +195,11 @@ def ensure_dtype(model: PreTrainedModel, dtype: torch.dtype = torch.bfloat16):
            bias_mismatch = module.bias.dtype != dtype

        if weight_mismatch:
-            print(f"Converting module {name}.weight: {module.weight.dtype} -> {dtype}")
+            LOG.debug(
+                f"Converting module {name}.weight: {module.weight.dtype} -> {dtype}"
+            )
        if bias_mismatch:
-            print(f"Converting module {name}.bias: {module.bias.dtype} -> {dtype}")
+            LOG.debug(f"Converting module {name}.bias: {module.bias.dtype} -> {dtype}")
        if weight_mismatch or bias_mismatch:
            module.to(dtype)

--- a/src/axolotl/monkeypatch/accelerate/fsdp2.py
+++ b/src/axolotl/monkeypatch/accelerate/fsdp2.py
@@ -2,102 +2,65 @@
 monkeypatch for accelerate fsdp2 fix when modifying ordereddict during interation, and saving full state dicts
 """

+import copy
+import functools
 import sys

 import torch
+from torch import nn

+from axolotl.utils.bench import log_gpu_memory_usage
 from axolotl.utils.logging import get_logger

 LOG = get_logger(__name__)


-def fsdp2_load_full_state_dict(accelerator, model: torch.nn.Module, full_sd: dict):
+def fsdp2_load_full_state_dict(
+    _accelerator, model: torch.nn.Module, full_sd: dict, offload_to_cpu: bool = False
+):
    """
    Loads the full state dict (could be only on rank 0) into the sharded model. This is done by broadcasting the
    parameters from rank 0 to all other ranks. This function modifies the model in-place.
-
    Args:
        accelerator (`Accelerator`): The accelerator instance
        model (`torch.nn.Module`):
            The model to load the state dict into, expected to be on meta device or a VRAM spike can occur
        full_sd (`dict`): The full state dict to load, can only be on rank 0
    """
-    import torch.distributed as dist
    from torch.distributed.tensor import distribute_tensor

-    # Model was previously copied to meta device
+    LOG.info("Broadcasting full state dict to all ranks...")
+    import time
+
+    start_time = time.time()
+
    meta_sharded_sd = model.state_dict()
    sharded_sd = {}
-
-    # Rank 0 distributes the full state dict to other ranks
-    def _infer_parameter_dtype(model, param_name, empty_param):
-        try:
-            old_param = model.get_parameter_or_buffer(param_name)
-        except AttributeError:
-            # Need this for LORA, as there some params are not *parameters* of sorts
-            base_param_name, local_param_name = param_name.rsplit(".", 1)
-            submodule = model.get_submodule(base_param_name)
-            old_param = getattr(submodule, local_param_name)
-
-        is_torch_e4m3fn_available = hasattr(torch, "float8_e4m3fn")
-        casting_dtype = None
-        is_param_float8_e4m3fn = (
-            is_torch_e4m3fn_available and empty_param.dtype == torch.float8_e4m3fn
-        )
-
-        if empty_param.dtype.is_floating_point and not is_param_float8_e4m3fn:
-            casting_dtype = old_param.dtype
-
-        return old_param is not None and old_param.is_contiguous(), casting_dtype
-
-    def _cast_and_contiguous(tensor, to_contiguous, dtype):
-        if dtype is not None:
-            tensor = tensor.to(dtype=dtype)
-        if to_contiguous:
-            tensor = tensor.contiguous()
-        return tensor
-
-    param_names = sorted(meta_sharded_sd.keys())
-
-    for param_name in param_names:
-        mesh = meta_sharded_sd[param_name].device_mesh
-        if accelerator.is_main_process:
-            full_param = full_sd[param_name].detach().cuda()
-            dist.broadcast(full_param, src=0, group=mesh.get_group())
-            sharded_tensor = distribute_tensor(
-                full_param, mesh, sharded_sd[param_name].placements
-            )
-            to_contiguous, casting_dtype = _infer_parameter_dtype(
-                model,
-                param_name,
-                full_param,
-            )
-            sharded_tensor = _cast_and_contiguous(
-                sharded_tensor, to_contiguous, casting_dtype
-            )
-            sharded_sd[param_name] = sharded_tensor
-        else:
-            full_tensor = torch.empty(
-                sharded_sd[param_name].size(),
-                device="cuda",
-                dtype=sharded_sd[param_name].dtype,
-            )
-            dist.broadcast(full_tensor, src=0, group=mesh.get_group())
-            sharded_tensor = distribute_tensor(
-                full_tensor, mesh, sharded_sd[param_name].placements
-            )
-            to_contiguous, casting_dtype = _infer_parameter_dtype(
-                model,
-                param_name,
+    for param_name, full_tensor in full_sd.items():
+        sharded_meta_param = meta_sharded_sd.get(param_name)
+        full_tensor = full_tensor.to(sharded_meta_param.dtype).to(torch.device("cuda"))
+        if hasattr(sharded_meta_param, "device_mesh"):
+            sharded_param = distribute_tensor(
                full_tensor,
+                sharded_meta_param.device_mesh,
+                sharded_meta_param.placements,
+                src_data_rank=0,
            )
-            sharded_tensor = _cast_and_contiguous(
-                sharded_tensor, to_contiguous, casting_dtype
-            )
-            sharded_sd[param_name] = sharded_tensor
+        else:
+            sharded_param = full_tensor

-    # we set `assign=True` because our params are on meta device
-    model.load_state_dict(sharded_sd, assign=True)
+        if offload_to_cpu:
+            sharded_param = sharded_param.cpu()
+
+        sharded_sd[param_name] = nn.Parameter(sharded_param)
+        del full_tensor
+        full_sd[param_name] = None
+    model.load_state_dict(sharded_sd, assign=True, strict=True)
+    end_time = time.time()
+    LOG.debug(
+        f"Time taken to load full state dict: {(end_time - start_time):.2f} seconds"
+    )
+    log_gpu_memory_usage(LOG, "Memory usage after broadcasting full state dict", 0)
    return model


@@ -191,17 +154,195 @@ def get_state_dict(self, model, unwrap=True):
    return state_dict


-def patch_accelerate_fsdp2():
-    import accelerate
-    from accelerate.utils import fsdp_utils
+def _process_lora_module_for_fsdp(module, fsdp2_kwargs):
+    """Helper function to process LoRA modules for FSDP2."""
+    from torch.distributed.fsdp import fully_shard

-    fsdp_utils.fsdp2_load_full_state_dict = fsdp2_load_full_state_dict
-    setattr(
-        sys.modules["accelerate.utils.fsdp_utils"],
-        "fsdp2_load_full_state_dict",
-        fsdp2_load_full_state_dict,
+    log_bias_dtype_mismatch = False
+
+    # Linear4Bit will keep its bias term in fp32. If the weight dtype is in bf16 we are not able to
+    # wrap this. Therefore we must ensure the bias has the same dtype as the weight
+    if module.base_layer.bias is not None:
+        if module.base_layer.weight.dtype != module.base_layer.bias.dtype:
+            log_bias_dtype_mismatch = True
+            module.base_layer.bias.data = module.base_layer.bias.data.to(
+                module.base_layer.weight.dtype
+            )
+
+    for active_adapter in module.active_adapters:
+        if module.lora_A:
+            fully_shard(module.lora_A[active_adapter], **fsdp2_kwargs)
+        if module.lora_B:
+            fully_shard(module.lora_B[active_adapter], **fsdp2_kwargs)
+        if module.lora_embedding_A:
+            fully_shard(module.lora_embedding_A[active_adapter], **fsdp2_kwargs)
+        if module.lora_embedding_B:
+            fully_shard(module.lora_embedding_B[active_adapter], **fsdp2_kwargs)
+        if module.lora_magnitude_vector:
+            fully_shard(module.lora_magnitude_vector[active_adapter], **fsdp2_kwargs)
+    return log_bias_dtype_mismatch
+
+
+def fsdp2_prepare_model(accelerator, model: torch.nn.Module) -> torch.nn.Module:
+    """Prepares the model for FSDP2 in-place. Also returns the model to avoid misuse of the original model.
+
+    Args:
+        accelerator (`Accelerator`): The accelerator instance
+        model (`torch.nn.Module`): The model to prepare
+
+    Returns:
+        `torch.nn.Module`: Prepared model
+    """
+    from accelerate.utils import get_module_children_bottom_up, is_compiled_module
+    from accelerate.utils.fsdp_utils import fsdp2_prepare_auto_wrap_policy
+    from accelerate.utils.modeling import get_non_persistent_buffers
+    from peft import PeftModel
+    from peft.tuners.lora import LoraLayer
+    from torch.distributed.fsdp import (
+        CPUOffloadPolicy,
+        FSDPModule,
+        MixedPrecisionPolicy,
+        fully_shard,
    )

+    is_type_fsdp = isinstance(model, FSDPModule) or (
+        is_compiled_module(model)
+        and isinstance(model._orig_mod, FSDPModule)  # pylint: disable=protected-access
+    )
+    if is_type_fsdp:
+        return model
+
+    fsdp2_plugin = accelerator.state.fsdp_plugin
+
+    original_sd = model.state_dict()
+
+    from torch.distributed.fsdp.wrap import (
+        size_based_auto_wrap_policy,
+        transformer_auto_wrap_policy,
+    )
+
+    # We need the `auto_wrap_policy` original type to create a custom policy function for sharding
+    # This is because `fully_shard` doesn't support old auto wrap policies, rather we have to imitate the behaviour
+    if fsdp2_plugin.auto_wrap_policy is transformer_auto_wrap_policy:
+        pass  # auto_wrap_policy_type = "transformer"
+    elif fsdp2_plugin.auto_wrap_policy is size_based_auto_wrap_policy:
+        pass  # auto_wrap_policy_type = "size"
+
+    # We set `auto_wrap_policy` to `functools.partial` to avoid creating it again
+    # This is because of `apply_activation_checkpointing`, which can reuse this function
+    fsdp2_plugin.set_auto_wrap_policy(model)
+
+    if fsdp2_plugin.activation_checkpointing:
+        from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+            CheckpointImpl,
+            apply_activation_checkpointing,
+            checkpoint_wrapper,
+        )
+
+        # Apply activation checkpointing before applying `fully_shard`
+        apply_activation_checkpointing(
+            model,
+            checkpoint_wrapper_fn=functools.partial(
+                checkpoint_wrapper,
+                checkpoint_impl=CheckpointImpl.NO_REENTRANT,
+            ),
+            auto_wrap_policy=fsdp2_plugin.auto_wrap_policy,
+        )
+
+    fsdp2_kwargs = {
+        "reshard_after_forward": fsdp2_plugin.reshard_after_forward,
+        "offload_policy": fsdp2_plugin.cpu_offload,
+        # `fully_shard` doesn't accept `None` in case of `MixedPrecisionPolicy`
+        "mp_policy": fsdp2_plugin.mixed_precision_policy or MixedPrecisionPolicy(),
+    }
+
+    model_has_params4bit = False
+    for _, param in model.named_parameters():
+        # this is a temporary fix: models loaded with bnb params cannot be moved from
+        # GPU to a meta device with FSDP2 because torch operations don't return the original class type
+        # bypassing the move to meta will still cause the VRAM spike, but at least the model will still load
+        if param.__class__.__name__ == "Params4bit":
+            model_has_params4bit = True
+            break
+
+    if fsdp2_plugin.cpu_ram_efficient_loading and not model_has_params4bit:
+        # Context: `fully_shard` moves the model to GPU if it was on CPU, however it can also be on `meta` and then it stays there even after `fully_shard`
+        # For this reason, we need to move the model to `meta` device, as then sharding happens on `meta` device
+        # If we kept the model on CPU (`cpu_ram_efficient_loading` has model be on CPU on all ranks, though non-main ranks only have `torch.empty`), `fully_shard` would move it to GPU
+        # Afterwards, when we call `fsdp2_load_full_state_dict`, creating the state_dict would result in briefly having two copies of model state_dict on the GPU -> VRAM spike
+
+        # We need to keep the original non-persistent buffers, as those MAY not be in the state_dict, resulting in them staying on meta device
+        # Also, these buffers aren't getting sharded by default
+        # We get the FQNs of all non-persistent buffers, to re-register them after
+        non_persistent_buffer_fqns = get_non_persistent_buffers(
+            model, recurse=True, fqns=True
+        )
+        original_non_persistent_buffers = copy.deepcopy(
+            {k: v for k, v in model.named_buffers() if k in non_persistent_buffer_fqns}
+        )
+        # We move the model to meta device, as then sharding happens on meta device
+        model = model.to(torch.device("meta"))
+        # We need to re-tie the weights, not exactly sure why, but if we don't do this, references to `lm_head/embed_tokens` stay hanging -> more VRAM usage
+        # We assume `transformers` models have a `tie_weights` method if they support it
+        if hasattr(model, "tie_weights"):
+            model.tie_weights()
+
+    is_peft_model = isinstance(model, PeftModel)
+
+    auto_wrap_policy = fsdp2_prepare_auto_wrap_policy(fsdp2_plugin, model)
+    log_bias_dtype_mismatch = False
+    if auto_wrap_policy is not None:
+        for module in get_module_children_bottom_up(model)[:-1]:
+            if is_peft_model and isinstance(module, LoraLayer):
+                module_log_bias_mismatch = _process_lora_module_for_fsdp(
+                    module, fsdp2_kwargs
+                )
+                log_bias_dtype_mismatch |= module_log_bias_mismatch
+            if auto_wrap_policy(module) and not isinstance(module, FSDPModule):
+                fully_shard(module, **fsdp2_kwargs)
+
+    fully_shard(model, **fsdp2_kwargs)
+
+    if log_bias_dtype_mismatch:
+        LOG.warning(
+            "Bias dtype mismatch detected in LoRA base linear layer. Bias parameters have been cast to weight dtype."
+        )
+
+    if fsdp2_plugin.cpu_ram_efficient_loading:
+        offload_to_cpu = isinstance(fsdp2_plugin.cpu_offload, CPUOffloadPolicy)
+        fsdp2_load_full_state_dict(
+            accelerator, model, original_sd, offload_to_cpu=offload_to_cpu
+        )
+
+    if fsdp2_plugin.cpu_ram_efficient_loading and not model_has_params4bit:
+        # We re-register the buffers, as they may not be in the state_dict
+        for fqn, buffer_tensor in original_non_persistent_buffers.items():
+            buffer_tensor = buffer_tensor.to(accelerator.device)
+
+            if "." in fqn:
+                parent_fqn, local_buffer_name = fqn.rsplit(".", 1)
+                parent_module = model.get_submodule(parent_fqn)
+            else:
+                local_buffer_name = fqn
+                parent_module = model
+
+            parent_module.register_buffer(
+                local_buffer_name, buffer_tensor, persistent=False
+            )
+
+        # We need to tie the weights again, as call to `load_full_state_dict` breaks the tie
+        # Needs to be called both here and above
+        # removing this call makes the model have a slightly different loss
+        # removing the call above leads to extra memory usage as explained in the comment above
+        if hasattr(model, "tie_weights"):
+            model.tie_weights()
+    return model
+
+
+def patch_accelerate_fsdp2():
+    import accelerate
+
+    accelerate.accelerator.fsdp2_prepare_model = fsdp2_prepare_model
    accelerate.Accelerator.get_state_dict = get_state_dict
    setattr(
        sys.modules["accelerate"],
--- a/src/axolotl/monkeypatch/attention/flex_attn.py
+++ b/src/axolotl/monkeypatch/attention/flex_attn.py
@@ -6,6 +6,10 @@ from typing import Optional, Tuple, Union
 import torch
 import transformers

+from axolotl.utils.logging import get_logger
+
+LOG = get_logger(__name__)
+

 def patch_flex_wrapper(**flex_attn_compile_kwargs):
    # TODO remove this patch when transformers#37285 is merged and in a release
@@ -46,10 +50,15 @@ def patch_flex_wrapper(**flex_attn_compile_kwargs):
                # cause errors. The suggested fix is to compile with "max-autotune-no-cudagraphs"
                # see https://github.com/pytorch/pytorch/issues/146260 for training
                self.training = training
+                LOG.info(
+                    "Compiling flex attention with kwargs: %s. This may take a while...",
+                    flex_attn_compile_kwargs,
+                )
                self._compiled_flex_attention = torch.compile(
                    flex_attention,
                    **flex_attn_compile_kwargs,
                )
+                LOG.info("Flex attention compiled successfully.")
                self._is_flex_compiled = True

        def __call__(self):
--- a/src/axolotl/monkeypatch/gradient_checkpointing/__init__.py
+++ b/src/axolotl/monkeypatch/gradient_checkpointing/__init__.py
@@ -5,7 +5,8 @@ from functools import partial

 from packaging import version

-from axolotl.monkeypatch.gradient_checkpointing.offload_cpu import (
+from axolotl.monkeypatch.gradient_checkpointing.offload_cpu import (  # noqa: F401
+    CheckpointFunctionWithCPUOffload,
    CPU_Offloaded_Gradient_Checkpointer,
 )
 from axolotl.monkeypatch.gradient_checkpointing.offload_disk import (
--- a/src/axolotl/monkeypatch/gradient_checkpointing/offload_cpu.py
+++ b/src/axolotl/monkeypatch/gradient_checkpointing/offload_cpu.py
@@ -13,8 +13,24 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
+import contextlib
+import inspect
+
 import torch
 from packaging import version
+from torch.utils.checkpoint import (
+    _get_autocast_kwargs,
+    _get_device_module,
+    _infer_device_type,
+    check_backward_validity,
+    detach_variable,
+    get_device_states,
+    set_device_states,
+)
+
+# support different pytorch versions
+has_device_type = "device_type" in inspect.signature(set_device_states).parameters

 torch_version = version.parse(torch.__version__)

@@ -60,3 +76,153 @@ class CPU_Offloaded_Gradient_Checkpointer(  # pylint: disable=invalid-name
        ) + (
            None,
        ) * len(ctx.args)
+
+
+# Copyright 2025 Snowflake Inc.
+# SPDX-License-Identifier: Apache-2.0
+# https://github.com/snowflakedb/ArcticTraining/blob/main/arctic_training/monkey_patches.py
+class CheckpointFunctionWithCPUOffload(torch.autograd.Function):
+    """
+    This is a torch/utils/checkpoint.py CheckpointFunction monkey patch that offloads the first tensor to cpu during forward and back to cuda during backward. This allows significant memory savings when using a very long seqlen. e.g. for llama 8b at 100k it's 24GB saved per gpu: `((100_000*4096)*2*32/2**30)`
+    In the case of a very long seqlen 100k+ the copying to/from cpu overhead is not big, because dense quadratic attention compute will dominate.
+    """
+
+    @staticmethod
+    def forward(ctx, run_function, preserve_rng_state, *args):
+        check_backward_validity(args)
+        ctx.run_function = run_function
+        ctx.preserve_rng_state = preserve_rng_state
+        # Accommodates the (remote) possibility that autocast is enabled for cpu AND gpu.
+        ctx.device_type = _infer_device_type(*args)
+        ctx.device_autocast_kwargs, ctx.cpu_autocast_kwargs = _get_autocast_kwargs(
+            ctx.device_type
+        )
+        if preserve_rng_state:
+            ctx.fwd_cpu_state = torch.get_rng_state()
+            # Don't eagerly initialize the cuda context by accident.
+            # (If the user intends that the context is initialized later, within their
+            # run_function, we SHOULD actually stash the cuda state here.  Unfortunately,
+            # we have no way to anticipate this will happen before we run the function.)
+            ctx.had_device_in_fwd = False
+            device_module = _get_device_module(ctx.device_type)
+            if getattr(device_module, "_initialized", False):
+                ctx.had_device_in_fwd = True
+                ctx.fwd_devices, ctx.fwd_device_states = get_device_states(*args)
+
+        # Save non-tensor inputs in ctx, keep a placeholder None for tensors
+        # to be filled out during the backward.
+        ctx.inputs = []
+        ctx.tensor_indices = []
+        tensor_inputs = []
+        # x = None
+        for i, arg in enumerate(args):
+            if torch.is_tensor(arg):
+                # cpu-offload
+                # we don't want the 2nd tensor - usually it's a shared 4D attn mask which is huge [seq,seq]
+                # upstream could accept a list of arg indices to offload
+                if i == 0:
+                    # print(f"{arg.shape=}")
+                    ctx.x_device = arg.device
+                    ctx.x_requires_grad = arg.requires_grad
+                    t = arg.detach().cpu()
+                else:
+                    t = arg
+                tensor_inputs.append(t)
+                ctx.tensor_indices.append(i)
+                ctx.inputs.append(None)
+            else:
+                ctx.inputs.append(arg)
+
+        ctx.save_for_backward(*tensor_inputs)
+
+        with torch.no_grad():
+            outputs = run_function(*args)
+
+        return outputs
+
+    @staticmethod
+    def backward(ctx, *args):
+        if (
+            not torch.autograd._is_checkpoint_valid()  # pylint: disable=protected-access
+        ):
+            raise RuntimeError(
+                "When use_reentrant=True, torch.utils.checkpoint is incompatible"
+                " with .grad() or passing an `inputs` parameter to .backward()."
+                " To resolve this error, you can either set use_reentrant=False,"
+                " or call .backward() without passing the `inputs` argument."
+            )
+        # Copy the list to avoid modifying original list.
+        inputs = list(ctx.inputs)
+        tensor_indices = ctx.tensor_indices
+        tensors = ctx.saved_tensors
+
+        # Fill in inputs with appropriate saved tensors.
+        for i, idx in enumerate(tensor_indices):
+            if i == 0:
+                t = (
+                    tensors[i]
+                    .to(ctx.x_device)
+                    .detach()
+                    .requires_grad_(ctx.x_requires_grad)
+                )
+            else:
+                t = tensors[i]
+            inputs[idx] = t
+
+        # Stash the surrounding rng state, and mimic the state that was
+        # present at this time during forward.  Restore the surrounding state
+        # when we're done.
+        rng_devices = []
+        if ctx.preserve_rng_state and ctx.had_device_in_fwd:
+            rng_devices = ctx.fwd_devices
+        with torch.random.fork_rng(
+            devices=rng_devices,
+            enabled=ctx.preserve_rng_state,
+            device_type=ctx.device_type,
+        ):
+            if ctx.preserve_rng_state:
+                torch.set_rng_state(ctx.fwd_cpu_state)
+                if ctx.had_device_in_fwd:
+                    if has_device_type:
+                        # newer pytorch (as early as 2.7)
+                        set_device_states(
+                            ctx.fwd_devices,
+                            ctx.fwd_device_states,
+                            device_type=ctx.device_type,
+                        )
+                    else:
+                        # older pytorch (at least 2.4)
+                        set_device_states(ctx.fwd_devices, ctx.fwd_device_states)
+            detached_inputs = detach_variable(tuple(inputs))
+
+            device_autocast_ctx = (
+                torch.amp.autocast(
+                    device_type=ctx.device_type, **ctx.device_autocast_kwargs
+                )
+                if torch.amp.is_autocast_available(ctx.device_type)
+                else contextlib.nullcontext()
+            )
+            with torch.enable_grad(), device_autocast_ctx, torch.amp.autocast("cpu", **ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
+                outputs = ctx.run_function(*detached_inputs)
+
+        if isinstance(outputs, torch.Tensor):
+            outputs = (outputs,)
+
+        # run backward() with only tensor that requires grad
+        outputs_with_grad = []
+        args_with_grad = []
+        for i in range(len(outputs)):  # pylint: disable=consider-using-enumerate
+            if torch.is_tensor(outputs[i]) and outputs[i].requires_grad:
+                outputs_with_grad.append(outputs[i])
+                args_with_grad.append(args[i])
+        if len(outputs_with_grad) == 0:
+            raise RuntimeError(
+                "none of output has requires_grad=True, this checkpoint() is not necessary"
+            )
+        torch.autograd.backward(outputs_with_grad, args_with_grad)
+        grads = tuple(
+            inp.grad if isinstance(inp, torch.Tensor) else None
+            for inp in detached_inputs
+        )
+
+        return (None, None) + grads
--- a/src/axolotl/monkeypatch/tiled_mlp.py
+++ b/src/axolotl/monkeypatch/tiled_mlp.py
@@ -1,6 +1,7 @@
 """Monkeypatch for Tiled MLP implementation"""

 import math
+import os

 import torch
 import torch.distributed as dist
@@ -29,15 +30,18 @@ def patch_tiled_mlp(model_type, use_original_mlp=False, cfg_num_shards=None):

            mlp_forward = torch.compile(generic_mlp_forward)

+        is_distributed = int(os.environ.get("WORLD_SIZE", 1)) > 1
+
        def tiled_mlp_forward(self, x):
            input_shape = x.shape
            seqlen = input_shape[-2]
            hidden = input_shape[-1]
            if cfg_num_shards is None:
                num_shards = math.ceil(seqlen / hidden)
-                num_shards_tensor = torch.tensor(num_shards, device=x.device)
-                dist.all_reduce(num_shards_tensor, op=dist.ReduceOp.MAX)
-                num_shards = num_shards_tensor.item()
+                if is_distributed:
+                    num_shards_tensor = torch.tensor(num_shards, device=x.device)
+                    dist.all_reduce(num_shards_tensor, op=dist.ReduceOp.MAX)
+                    num_shards = num_shards_tensor.item()
            else:
                num_shards = cfg_num_shards

--- a/src/axolotl/monkeypatch/trainer/trl.py
+++ b/src/axolotl/monkeypatch/trainer/trl.py
@@ -0,0 +1,13 @@
+"""Monkeypatch for TRL trainer FSDP preparation."""
+
+
+def prepare_fsdp(model, accelerator):
+    from axolotl.monkeypatch.accelerate.fsdp2 import fsdp2_prepare_model
+
+    return fsdp2_prepare_model(accelerator, model)
+
+
+def patch_trl_prepare_fsdp2():
+    import trl.models.utils
+
+    trl.models.utils.prepare_fsdp = prepare_fsdp
--- a/src/axolotl/prompt_strategies/dpo/user_defined.py
+++ b/src/axolotl/prompt_strategies/dpo/user_defined.py
@@ -33,7 +33,7 @@ def default(cfg, dataset_idx=0, **kwargs):  # pylint: disable=unused-argument
                system=sample[field_system], prompt=sample[field_prompt]
            )
        else:
-            sample["prompt"] = prompt_format.format(prompt=sample["prompt"])
+            sample["prompt"] = prompt_format.format(prompt=sample[field_prompt])
        sample["chosen"] = chosen_format.format(chosen=sample[field_chosen])
        sample["rejected"] = rejected_format.format(rejected=sample[field_rejected])
        return sample
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -15,7 +15,6 @@ from typing import Any, Dict

 import torch
 import transformers.modelcard
-from accelerate.utils import save_fsdp_model
 from datasets import Dataset
 from huggingface_hub.errors import OfflineModeIsEnabled
 from peft import PeftConfig, PeftModel
@@ -68,7 +67,7 @@ def setup_model_and_tokenizer(
            `None`), and processor (if multimodal, else `None`).
    """
    # Load tokenizer
-    LOG.debug(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
+    LOG.debug(f"Loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
    tokenizer = load_tokenizer(cfg)

    # Load processor for multimodal models if needed
@@ -76,11 +75,8 @@ def setup_model_and_tokenizer(
    if cfg.is_multimodal:
        processor = load_processor(cfg, tokenizer)

-    # Load the model and peft_config
-    msg = "loading model"
-    if cfg.adapter:
-        msg += " and peft_config..."
-    LOG.debug(msg)
+    # Load the model
+    LOG.debug("Loading model")

    model_loader = ModelLoader(cfg, tokenizer, processor=processor)
    model, peft_config = model_loader.load()
@@ -264,15 +260,6 @@ def save_trained_model(
            "QAT modules have been converted for PTQ. Please ensure you quantize "
            "your model weights with `axolotl quantize`."
        )
-
-    # Handle FSDP state dict type
-    state_dict_type = "FULL_STATE_DICT"
-    if trainer.is_fsdp_enabled and str(cfg.fsdp_config.fsdp_version) != "2":
-        if cfg.fsdp_final_state_dict_type:
-            state_dict_type = cfg.fsdp_final_state_dict_type
-        trainer.accelerator.state.fsdp_plugin.set_state_dict_type(state_dict_type)
-        LOG.info(f"Set FSDP state dict type to {state_dict_type} for saving.")
-
    # Handle ReLoRA early return case
    if cfg.relora_steps:
        if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit):
@@ -281,22 +268,19 @@ def save_trained_model(
            # final model weights have already been saved by `ReLoRACallback.on_train_end`
            return

-    if cfg.fsdp:
-        # TODO: do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
-        # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple
-        # processes attempt to write the same file
-        if (
-            state_dict_type == "SHARDED_STATE_DICT"
-            and cfg.fsdp_config.fsdp_state_dict_type == "SHARDED_STATE_DICT"
-        ):
-            save_fsdp_model(
-                trainer.accelerator.state.fsdp_plugin,
-                trainer.accelerator,
-                trainer.model,
-                cfg.output_dir,
+    if trainer.is_fsdp_enabled:
+        if cfg.fsdp_config or cfg.fsdp:
+            if cfg.fsdp_config.final_state_dict_type:
+                state_dict_type = cfg.fsdp_config.final_state_dict_type
+            else:
+                state_dict_type = cfg.fsdp_config.state_dict_type
+            trainer.accelerator.state.fsdp_plugin.set_state_dict_type(state_dict_type)
+        trainer.save_model(cfg.output_dir)
+        if state_dict_type == "SHARDED_STATE_DICT":
+            LOG.info(
+                "The final model was saved with a sharded state dict. Please ensure you merge "
+                "the sharded weights with `merge-sharded-fsdp-weights`."
            )
-        elif state_dict_type == "FULL_STATE_DICT":
-            trainer.save_model(cfg.output_dir)
    elif cfg.deepspeed and is_deepspeed_zero3_enabled():
        # Copied over from: https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading
        trainer.accelerator.wait_for_everyone()
--- a/src/axolotl/utils/bench.py
+++ b/src/axolotl/utils/bench.py
@@ -1,6 +1,7 @@
 """Benchmarking and measurement utilities"""

 import functools
+import logging

 import torch
 from transformers.utils.import_utils import is_torch_npu_available
@@ -91,21 +92,27 @@ def gpu_memory_usage_smi(device=0):
        return 0.0


-def log_gpu_memory_usage(log, msg, device):
-    cur_device = get_device_type()
+def log_gpu_memory_usage(
+    log: logging.Logger | logging.LoggerAdapter,
+    msg: str = "",
+    device: int | torch.device = 0,
+):
+    cur_device_type = str(get_device_type())
    if torch.backends.mps.is_available():
        usage, cache, misc = mps_memory_usage_all()
-    elif "npu" in str(cur_device) and is_torch_npu_available():
+    elif "npu" in cur_device_type and is_torch_npu_available():
        usage, cache, misc = npu_memory_usage_all(device)
-    else:
+    elif "gpu" in cur_device_type and torch.cuda.is_available():
        usage, cache, misc = gpu_memory_usage_all(device)
+    else:
+        return
    extras = []
    if cache > 0:
        extras.append(f"+{cache:.03f}GB cache")
    if misc > 0:
        extras.append(f"+{misc:.03f}GB misc")
+    msg = f"{cur_device_type} memory usage:" if not msg else msg
    log.info(
-        f"{str(cur_device)} memory usage {msg}: {usage:.03f}GB ({', '.join(extras)})",
+        f"{msg} {usage:.03f}GB ({', '.join(extras)})",
        stacklevel=2,
    )
-    return usage, cache, misc
--- a/src/axolotl/utils/config/init.py
+++ b/src/axolotl/utils/config/init.py
@@ -116,9 +116,10 @@ def normalize_config(cfg):
    ]
    choose_device(cfg)
    cfg.ddp = cfg.ddp if cfg.ddp is not None else cfg.world_size != 1
-    if cfg.ddp:
+    if cfg.world_size != 1:
        cfg.device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
-        cfg.batch_size = cfg.batch_size * cfg.world_size
+        if cfg.fsdp or cfg.fsdp_config or cfg.ddp:
+            cfg.batch_size = cfg.batch_size * cfg.world_size

    if not cfg.use_ray:
        # delay resolving dtype until on worker node when launching with ray
@@ -274,7 +275,7 @@ def validate_config(
    # Convert datasets to proper format if needed
    if cfg.get("datasets"):
        for idx, ds_cfg in enumerate(cfg["datasets"]):
-            if cfg.get("rl") == "dpo" and not isinstance(ds_cfg, DPODataset):
+            if cfg.get("rl") in ["dpo", "simpo"] and not isinstance(ds_cfg, DPODataset):
                cfg["datasets"][idx] = DPODataset(**ds_cfg)
            elif cfg.get("rl") == "kto" and not isinstance(ds_cfg, KTODataset):
                cfg["datasets"][idx] = KTODataset(**dict(ds_cfg))
@@ -313,3 +314,16 @@ def prepare_plugins(cfg):
        plugin_manager = PluginManager.get_instance()
        for plugin_name in cfg["plugins"]:
            plugin_manager.register(plugin_name)
+
+
+# TODO @SalmanMohammadi remove this function in 0.12
+def migrate_fsdp_config(cfg):
+    if cfg.get("fsdp_config"):
+        fsdp_config_keys = cfg.fsdp_config.keys()
+        if "fsdp_version" in fsdp_config_keys:
+            cfg.fsdp_version = cfg.fsdp_config.pop("fsdp_version")
+
+        for key in list(fsdp_config_keys):
+            if key.startswith("fsdp_") and key != "fsdp_version":
+                cfg.fsdp_config[key.replace("fsdp_", "", 1)] = cfg.fsdp_config[key]
+                del cfg.fsdp_config[key]
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -203,7 +203,9 @@ class AxolotlInputConfig(
        },
    )
    dataset_processes: int | None = Field(
-        default=min(int(os.environ.get("AXOLOTL_DATASET_PROCESSES", 32)), os.cpu_count()),  # type: ignore[type-var]
+        default=min(
+            int(os.environ.get("AXOLOTL_DATASET_PROCESSES", 32)), os.cpu_count()
+        ),  # type: ignore[type-var]
        json_schema_extra={
            "description": "The maximum number of processes to use while preprocessing your input dataset. This defaults to `os.cpu_count()` if not set."
        },
@@ -572,14 +574,24 @@ class AxolotlInputConfig(
        },
    )
    fsdp: list[str] | None = Field(
-        default=None, json_schema_extra={"description": "FSDP configuration"}
+        default=None,
+        json_schema_extra={"description": "FSDP configuration"},
+        deprecated="Configuring FSDP using `fsdp` is deprecated. Please use `fsdp_config` instead.",
    )
+    # TODO @SalmanMohammadi strongly type this as its own schema
    fsdp_config: dict[str, Any] | None = Field(
        default=None, json_schema_extra={"description": "FSDP configuration options"}
    )
+    fsdp_version: int | None = Field(
+        default=None,
+        json_schema_extra={"description": "FSDP version"},
+    )
    fsdp_final_state_dict_type: (
        Literal["FULL_STATE_DICT", "LOCAL_STATE_DICT", "SHARDED_STATE_DICT"] | None
-    ) = None
+    ) = Field(
+        default=None,
+        deprecated="Configuring FSDP final state dict type using `fsdp_final_state_dict_type` is deprecated. Please use `fsdp_config.final_state_dict_type` instead.",
+    )

    val_set_size: float | None = Field(
        default=0.0,
@@ -949,11 +961,9 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            or data.get("lora_o_kernel")
        ):
            capabilities = data.get("capabilities")
-            is_fsdp = data.get("fsdp") is not None
-            is_fsdp2 = (
-                data.get("fsdp_config") is not None
-                and str(data.get("fsdp_config").get("fsdp_version")) == "2"
-            )
+            is_fsdp = data.get("fsdp_config") is not None
+            is_fsdp2 = is_fsdp and str(data.get("fsdp_version")) == "2"
+
            if capabilities and capabilities.get("n_gpu", 0) > 1 and not is_fsdp2:
                if is_fsdp:
                    raise ValueError(
@@ -987,11 +997,8 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            # Check multi-GPU compatibility
            capabilities = data.get("capabilities")
            is_multi_gpu = capabilities and capabilities.get("n_gpu", 0) > 1
-            is_fsdp = data.get("fsdp") is not None
-            is_fsdp2 = (
-                data.get("fsdp_config") is not None
-                and str(data.get("fsdp_config").get("fsdp_version")) == "2"
-            )
+            is_fsdp = data.get("fsdp_config") is not None
+            is_fsdp2 = is_fsdp and str(data.get("fsdp_version")) == "2"

            if (
                not is_multi_gpu
@@ -1114,21 +1121,94 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):

            torch_version = str(torch.__version__).split("+", maxsplit=1)[0]

-        if (
-            data.get("fsdp")
-            and data.get("fsdp_config")
-            and str(data["fsdp_config"].get("fsdp_version")) == "2"
-        ):
-            if version.parse(torch_version) < version.parse("2.7.0"):
-                raise ValueError(
-                    "FSDP2 and QAT are not supported on torch version < 2.7.0"
-                )
-
        if version.parse(torch_version) < version.parse("2.6.0"):
            raise ValueError("QAT is not supported on torch version < 2.6.0")

        return data

+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp_torch_version(cls, data):
+        env_capabilities = data.get("env_capabilities", {})
+        torch_version = env_capabilities.get("torch_version")
+
+        if torch_version is None:
+            import torch
+
+            torch_version = str(torch.__version__).split("+", maxsplit=1)[0]
+
+        if data.get("fsdp_config") and str(data.get("fsdp_version")) == "2":
+            if version.parse(torch_version) < version.parse("2.7.0"):
+                raise ValueError("FSDP2 is not supported on torch version < 2.7.0")
+
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp_version(cls, data):
+        fsdp_config = data.get("fsdp_config", {})
+        if fsdp_config and str(data.get("fsdp_version")) != "2":
+            LOG.info(
+                "FSDP1 will be deprecated in an upcoming release of Axolotl. "
+                "We recommend that you use FSDP version 2 for better performance and compatibility. "
+                "For details on migrating your config, see "
+                "https://docs.axolotl.ai/docs/multi-gpu.html#sec-fsdp"
+            )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp2_base_model_quant_ram_efficient_loading(cls, data):
+        fsdp_config = data.get("fsdp_config")
+        if fsdp_config and str(data.get("fsdp_version")) == "2":
+            if fsdp_config.get("cpu_ram_efficient_loading") and (
+                data.get("load_in_8bit") or data.get("load_in_4bit")
+            ):
+                raise ValueError(
+                    "FSDP2 does not support load_in_8bit or load_in_4bit with cpu_ram_efficient_loading. Please do one of the following: use DeepSpeed, "
+                    "set fsdp_version to 1, or disable cpu_ram_efficient_loading."
+                )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp2_base_model_quant_dpo(cls, data):
+        if str(data.get("fsdp_version")) == "2" and data.get("rl") in [
+            RLType.DPO,
+            RLType.KTO,
+            RLType.ORPO,
+            RLType.IPO,
+        ]:
+            if data.get("load_in_8bit") or data.get("load_in_4bit"):
+                raise ValueError(
+                    "FSDP2 does not support load_in_8bit or load_in_4bit with DPO, KTO, ORPO, or IPO. Please use DeepSpeed or set `fsdp_version` to 1."
+                )
+
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp_version_in_fsdp_config(cls, data):
+        if fsdp_config := data.get("fsdp_config"):
+            if fsdp_config.get("fsdp_version"):
+                LOG.warning(
+                    "Configuring `fsdp_version` in `fsdp_config` is deprecated. "
+                    "Please configure `fsdp_version` as a top-level field."
+                )
+        return data
+
+    @model_validator(mode="before")
+    @classmethod
+    def check_fsdp_config_kwargs_prefix(cls, data):
+        if fsdp_config := data.get("fsdp_config"):
+            for key, _ in fsdp_config.items():
+                if key.startswith("fsdp_"):
+                    LOG.warning_once(
+                        "Configuring FSDP fields with the `fsdp_` prefix is deprecated. "
+                        "Please omit the `fsdp_` prefix from any fields in `fsdp_config`."
+                    )
+        return data
+
    @model_validator(mode="before")
    @classmethod
    def default_dataloader_opts(cls, data):
--- a/src/axolotl/utils/schemas/datasets.py
+++ b/src/axolotl/utils/schemas/datasets.py
@@ -34,12 +34,6 @@ class UserDefinedPrompterType(BaseModel):
        default=None,
        json_schema_extra={"description": "'no_input_format' cannot include {input}"},
    )
-    field: str | None = Field(
-        default=None,
-        json_schema_extra={
-            "description": "For `completion` datsets only, uses the provided field instead of `text` column"
-        },
-    )


 class SFTDataset(BaseModel):
@@ -104,7 +98,12 @@ class SFTDataset(BaseModel):
        default=None,
        json_schema_extra={"description": "defines the datatype when path is a file"},
    )
-    field: str | None = None
+    field: str | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "For `completion` datasets only, uses the provided field instead of `text` column"
+        },
+    )
    field_human: str | None = None
    field_model: str | None = None
    field_messages: str | None = Field(
--- a/src/axolotl/utils/schemas/validation.py
+++ b/src/axolotl/utils/schemas/validation.py
@@ -479,8 +479,14 @@ class TrainingValidationMixin:
    @model_validator(mode="before")
    @classmethod
    def check_tiled_mlp_deepspeed(cls, data):
-        if data.get("tiled_mlp", False) and not data.get("deepspeed"):
-            raise ValueError("tiled_mlp requires deepspeed ZeRO to be enabled")
+        capabilities = data.get("capabilities")
+        n_gpu = 0
+        if capabilities and capabilities.get("n_gpu", 0) >= 1:
+            n_gpu = capabilities.get("n_gpu", 0)
+        if data.get("tiled_mlp", False) and (n_gpu > 1 and not data.get("deepspeed")):
+            raise ValueError(
+                "tiled_mlp requires deepspeed ZeRO to be enabled for multi-gpu"
+            )
        return data


@@ -568,15 +574,6 @@ class LoRAValidationMixin:
            raise ValueError("Fused modules are not supported with LoRA/QLoRA")
        return self

-    @model_validator(mode="after")
-    def hint_lora_8bit(self):
-        loftq = (
-            self.peft and self.peft.loftq_config and self.peft.loftq_config.loftq_bits
-        )
-        if not self.load_in_8bit and self.adapter == "lora" and not loftq:
-            LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")
-        return self
-
    @model_validator(mode="before")
    @classmethod
    def warn_qlora_zero3_w_use_reentrant(cls, data):
@@ -780,7 +777,7 @@ class OptimizationValidationMixin:
    @classmethod
    def check_fsdp_sharded_state_dict_w_safetensors(cls, data):
        if (
-            data.get("fsdp")
+            data.get("fsdp_config")
            and data.get("save_safetensors")
            and data.get("fsdp_config")
            and data["fsdp_config"].get("fsdp_state_dict_type") == "SHARDED_STATE_DICT"
@@ -994,7 +991,7 @@ class ComplexValidationMixin:
            if self.adapter not in ("lora", "qlora"):
                raise ValueError("cfg.adapter must be lora or qlora to use ReLoRA")

-            if self.fsdp:
+            if self.fsdp or self.fsdp_config:
                raise ValueError("fsdp not supported with ReLoRA")

            if self.deepspeed:
--- a/src/axolotl/utils/trainer.py
+++ b/src/axolotl/utils/trainer.py
@@ -546,6 +546,15 @@ def setup_deepspeed_env(cfg, stage=None):
    # NOTE(djsaunde): The distribued state cannot be initialized prior to the
    # ACCELERATE_USE_DEEPSPEED assignment, but it must be initialized some time prior
    # to model load.
+    if int(os.environ.get("WORLD_SIZE", "1")) == 1:
+        os.environ["WORLD_SIZE"] = "1"  # force it in case not set
+        os.environ["LOCAL_RANK"] = "0"  # force it in case not set
+        os.environ["RANK"] = os.environ.get("LOCAL_RANK", "0")
+        import deepspeed.comm as dist
+
+        dist.init_distributed(
+            dist_backend="nccl", auto_mpi_discovery=False, dist_init_required=True
+        )
    init_distributed_state()

    # If we don't assign this, it doesn't actually get set in the accelerate weakref
@@ -554,37 +563,39 @@ def setup_deepspeed_env(cfg, stage=None):

 def setup_fsdp_envs(cfg):
    os.environ["ACCELERATE_USE_FSDP"] = "true"
-    if str(cfg.fsdp_config.fsdp_version) == "2":
+
+    # TODO @SalmanMohammadi remove FSDP1 args in 0.12
+    if str(cfg.fsdp_version) == "2":
        os.environ["FSDP_VERSION"] = "2"
-    if cfg.fsdp_config.fsdp_activation_checkpointing:
+    if cfg.fsdp_config.activation_checkpointing:
        os.environ["FSDP_ACTIVATION_CHECKPOINTING"] = "true"
-    if cfg.fsdp_config.fsdp_offload_params:
+    if cfg.fsdp_config.offload_params:
        os.environ["FSDP_OFFLOAD_PARAMS"] = "true"
-    if cfg.fsdp_config.fsdp_sync_module_states:
+    if cfg.fsdp_config.sync_module_states:
        os.environ["FSDP_SYNC_MODULE_STATES"] = "true"
-    if cfg.fsdp_config.fsdp_cpu_ram_efficient_loading:
+    if cfg.fsdp_config.cpu_ram_efficient_loading:
        os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "true"
-    if cfg.fsdp_config.fsdp_use_orig_params:
+    if cfg.fsdp_config.use_orig_params:
        os.environ["FSDP_USE_ORIG_PARAMS"] = "true"
-    if cfg.fsdp_config.fsdp_state_dict_type:
-        os.environ["FSDP_STATE_DICT_TYPE"] = cfg.fsdp_config.fsdp_state_dict_type
-    if cfg.fsdp_config.fsdp_auto_wrap_policy:
-        os.environ["FSDP_AUTO_WRAP_POLICY"] = cfg.fsdp_config.fsdp_auto_wrap_policy
-    if cfg.fsdp_config.fsdp_transformer_layer_cls_to_wrap:
+    if cfg.fsdp_config.state_dict_type:
+        os.environ["FSDP_STATE_DICT_TYPE"] = cfg.fsdp_config.state_dict_type
+    if cfg.fsdp_config.auto_wrap_policy:
+        os.environ["FSDP_AUTO_WRAP_POLICY"] = cfg.fsdp_config.auto_wrap_policy
+    if cfg.fsdp_config.transformer_layer_cls_to_wrap:
        os.environ["FSDP_TRANSFORMER_CLS_TO_WRAP"] = (
-            cfg.fsdp_config.fsdp_transformer_layer_cls_to_wrap
-        )
-    if cfg.fsdp_config.fsdp_reshard_after_forward is not None:
-        os.environ["FSDP_RESHARD_AFTER_FORWARD"] = (
-            "true" if cfg.fsdp_config.fsdp_reshard_after_forward else "false"
+            cfg.fsdp_config.transformer_layer_cls_to_wrap
        )
+    if cfg.fsdp_config.reshard_after_forward:
+        os.environ["FSDP_RESHARD_AFTER_FORWARD"] = "true"


 def prepare_optim_env(cfg):
    if not check_cuda_p2p_ib_support():
        if os.getenv("NCCL_P2P_DISABLE") is None:
            os.environ["NCCL_P2P_DISABLE"] = "1"
-    if cfg.fsdp:
+    # TODO @SalmanMohammadi remove the cfg.fsdp check in 0.12
+    if cfg.fsdp or cfg.fsdp_config:
+        cfg.fsdp = True if not cfg.fsdp else cfg.fsdp
        setup_fsdp_envs(cfg)
    elif cfg.deepspeed:
        stage = None
@@ -648,11 +659,7 @@ def setup_trainer(
    """
    from axolotl.core.builders import HFCausalTrainerBuilder, HFRLTrainerBuilder

-    if (
-        cfg.torch_compile
-        and cfg.fsdp_config
-        and str(cfg.fsdp_config.fsdp_version) == "2"
-    ):
+    if cfg.torch_compile and cfg.fsdp_config and cfg.fsdp_version == 2:
        patch_evaluation_loop_for_fsdp2()
    if cfg.rl:
        trainer_builder = HFRLTrainerBuilder(cfg, model, tokenizer, processor)
--- a/tests/e2e/multigpu/test_fsdp1.py
+++ b/tests/e2e/multigpu/test_fsdp1.py
@@ -0,0 +1,326 @@
+"""Test module for FSDP1 multi-GPU functionality."""
+
+# pylint: disable=duplicate-code
+
+import os
+from pathlib import Path
+
+import pytest
+import torch
+import yaml
+from accelerate.test_utils import execute_subprocess_async
+from tbparse import SummaryReader
+from transformers.testing_utils import get_torch_dist_unique_port
+
+from axolotl.utils.dict import DictDefault
+
+from tests.e2e.utils import most_recent_subdir
+
+AXOLOTL_ROOT = Path(__file__).parent.parent.parent.parent
+
+
+def verify_training_success(temp_dir):
+    """Verify that training completed successfully by checking artifacts and loss."""
+    output_path = Path(temp_dir)
+
+    model_files = list(output_path.glob("*.bin")) + list(
+        output_path.glob("*.safetensors")
+    )
+    assert len(model_files) > 0, "No model files found - training may have failed"
+
+    checkpoint_files = list(output_path.glob("checkpoint-*"))
+    assert (
+        len(checkpoint_files) > 0
+    ), "No checkpoint files found - training may have failed"
+
+    tb_log_path = most_recent_subdir(temp_dir + "/runs")
+    if tb_log_path:
+        event_files = sorted(os.listdir(tb_log_path))
+        if event_files:
+            event_file = os.path.join(tb_log_path, event_files[0])
+            reader = SummaryReader(event_file)
+            df = reader.scalars
+            train_loss_df = df[df.tag == "train/train_loss"]
+            if len(train_loss_df) > 0:
+                final_loss = train_loss_df.value.values[-1]
+                assert not torch.isnan(
+                    torch.tensor(final_loss)
+                ), f"Training loss is NaN: {final_loss}"
+
+
+class TestFSDP1:
+    """Test class for FSDP1 functionality."""
+
+    @pytest.mark.parametrize(
+        "fsdp_cpu_ram_efficient_loading",
+        [True, False],
+    )
+    def test_fft_sft(self, temp_dir, fsdp_cpu_ram_efficient_loading):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "tatsu-lab/alpaca",
+                        "type": "alpaca",
+                        "split": "train[:10%]",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": "1",
+                "fsdp_config": {
+                    "fsdp_offload_params": False,
+                    "fsdp_cpu_ram_efficient_loading": fsdp_cpu_ram_efficient_loading,
+                    "fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "fsdp_state_dict_type": "FULL_STATE_DICT",
+                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "fsdp_sharding_strategy": "FULL_SHARD",
+                    "fsdp_sync_module_states": True,
+                    "fsdp_use_orig_params": False,
+                },
+                "use_tensorboard": True,
+                "bf16": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @pytest.mark.parametrize(
+        "adapter_config",
+        [
+            {
+                "adapter": "lora",
+                "load_in_4bit": False,
+            },
+            {
+                "adapter": "qlora",
+                "load_in_4bit": True,
+            },
+        ],
+    )
+    def test_lora_sft(self, temp_dir, adapter_config):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "tatsu-lab/alpaca",
+                        "type": "alpaca",
+                        "split": "train[:10%]",
+                    },
+                ],
+                "adapter": adapter_config["adapter"],
+                "load_in_4bit": adapter_config["load_in_4bit"],
+                "lora_r": 8,
+                "lora_alpha": 16,
+                "lora_dropout": 0.05,
+                "lora_target_linear": True,
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": "1",
+                "fsdp_config": {
+                    "fsdp_offload_params": False,
+                    "fsdp_cpu_ram_efficient_loading": True,
+                    "fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "fsdp_state_dict_type": "FULL_STATE_DICT",
+                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "fsdp_sharding_strategy": "FULL_SHARD",
+                    "fsdp_sync_module_states": True,
+                    "fsdp_use_orig_params": False,
+                },
+                "use_tensorboard": True,
+                "bf16": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    def test_dpo_fft(self, temp_dir):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "rl": "dpo",
+                "chat_template": "chatml",
+                "datasets": [
+                    {
+                        "path": "Intel/orca_dpo_pairs",
+                        "split": "train",
+                        "type": "chatml.intel",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": "1",
+                "fsdp_config": {
+                    "fsdp_offload_params": False,
+                    "fsdp_cpu_ram_efficient_loading": True,
+                    "fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "fsdp_state_dict_type": "FULL_STATE_DICT",
+                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "fsdp_sharding_strategy": "FULL_SHARD",
+                    "fsdp_sync_module_states": True,
+                    "fsdp_use_orig_params": False,
+                },
+                "use_tensorboard": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @pytest.mark.parametrize(
+        "adapter_config",
+        [
+            {
+                "adapter": "lora",
+                "load_in_4bit": False,
+            },
+            {
+                "adapter": "qlora",
+                "load_in_4bit": True,
+            },
+        ],
+    )
+    def test_dpo_lora(self, temp_dir, adapter_config):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "load_in_4bit": adapter_config["load_in_4bit"],
+                "rl": "dpo",
+                "chat_template": "chatml",
+                "sequence_len": 2048,
+                "adapter": adapter_config["adapter"],
+                "lora_r": 8,
+                "lora_alpha": 16,
+                "lora_dropout": 0.05,
+                "lora_target_linear": True,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "Intel/orca_dpo_pairs",
+                        "split": "train",
+                        "type": "chatml.intel",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": "1",
+                "fsdp_config": {
+                    "fsdp_offload_params": False,
+                    "fsdp_cpu_ram_efficient_loading": True,
+                    "fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "fsdp_state_dict_type": "FULL_STATE_DICT",
+                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "fsdp_sharding_strategy": "FULL_SHARD",
+                    "fsdp_sync_module_states": True,
+                    "fsdp_use_orig_params": False,
+                },
+                "use_tensorboard": True,
+                "bf16": "auto",
+                "tf32": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
--- a/tests/e2e/multigpu/test_fsdp2.py
+++ b/tests/e2e/multigpu/test_fsdp2.py
@@ -0,0 +1,355 @@
+"""Test module for FSDP2 multi-GPU functionality."""
+
+# pylint: disable=duplicate-code
+
+import os
+from pathlib import Path
+
+import pytest
+import torch
+import yaml
+from accelerate.test_utils import execute_subprocess_async
+from tbparse import SummaryReader
+from transformers.testing_utils import get_torch_dist_unique_port
+
+from axolotl.utils.dict import DictDefault
+
+from tests.e2e.utils import most_recent_subdir, require_torch_2_7_0
+
+AXOLOTL_ROOT = Path(__file__).parent.parent.parent.parent
+
+
+def verify_training_success(temp_dir):
+    """Verify that training completed successfully by checking artifacts and loss."""
+    output_path = Path(temp_dir)
+
+    model_files = list(output_path.glob("*.bin")) + list(
+        output_path.glob("*.safetensors")
+    )
+    assert len(model_files) > 0, "No model files found - training may have failed"
+
+    checkpoint_files = list(output_path.glob("checkpoint-*"))
+    assert (
+        len(checkpoint_files) > 0
+    ), "No checkpoint files found - training may have failed"
+
+    tb_log_path = most_recent_subdir(temp_dir + "/runs")
+    if tb_log_path:
+        event_files = sorted(os.listdir(tb_log_path))
+        if event_files:
+            event_file = os.path.join(tb_log_path, event_files[0])
+            reader = SummaryReader(event_file)
+            df = reader.scalars
+            train_loss_df = df[df.tag == "train/train_loss"]
+            if len(train_loss_df) > 0:
+                final_loss = train_loss_df.value.values[-1]
+                assert not torch.isnan(
+                    torch.tensor(final_loss)
+                ), f"Training loss is NaN: {final_loss}"
+
+
+class TestFSDP2:
+    """Test class for FSDP2 functionality."""
+
+    @require_torch_2_7_0
+    @pytest.mark.parametrize(
+        "fsdp_cpu_ram_efficient_loading",
+        [True, False],
+    )
+    def test_fft_sft(self, temp_dir, fsdp_cpu_ram_efficient_loading):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "tatsu-lab/alpaca",
+                        "type": "alpaca",
+                        "split": "train[:10%]",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": 2,
+                "fsdp_config": {
+                    "offload_params": False,
+                    "cpu_ram_efficient_loading": fsdp_cpu_ram_efficient_loading,
+                    "transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "state_dict_type": "FULL_STATE_DICT",
+                    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "reshard_after_forward": True,
+                },
+                "use_tensorboard": True,
+                "bf16": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @require_torch_2_7_0
+    @pytest.mark.parametrize("peft_use_dora", [True, False])
+    def test_lora_sft(self, temp_dir, peft_use_dora):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "tatsu-lab/alpaca",
+                        "type": "alpaca",
+                        "split": "train[:10%]",
+                    },
+                ],
+                "peft_use_dora": peft_use_dora,
+                "adapter": "lora",
+                "lora_r": 8,
+                "lora_alpha": 16,
+                "lora_dropout": 0.05,
+                "lora_target_linear": True,
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": 2,
+                "fsdp_config": {
+                    "offload_params": False,
+                    "cpu_ram_efficient_loading": False,
+                    "transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "state_dict_type": "FULL_STATE_DICT",
+                    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "reshard_after_forward": True,
+                },
+                "use_tensorboard": True,
+                "bf16": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @require_torch_2_7_0
+    def test_qlora_sft(self, temp_dir):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "datasets": [
+                    {
+                        "path": "tatsu-lab/alpaca",
+                        "type": "alpaca",
+                        "split": "train[:10%]",
+                    },
+                ],
+                "load_in_4bit": True,
+                "adapter": "qlora",
+                "lora_r": 8,
+                "lora_alpha": 16,
+                "lora_dropout": 0.05,
+                "lora_target_linear": True,
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": 2,
+                "fsdp_config": {
+                    "offload_params": False,
+                    "cpu_ram_efficient_loading": False,
+                    "transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "state_dict_type": "FULL_STATE_DICT",
+                    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "reshard_after_forward": True,
+                },
+                "use_tensorboard": True,
+                "bf16": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @require_torch_2_7_0
+    def test_dpo_fft(self, temp_dir):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "val_set_size": 0.01,
+                "rl": "dpo",
+                "chat_template": "chatml",
+                "datasets": [
+                    {
+                        "path": "Intel/orca_dpo_pairs",
+                        "split": "train",
+                        "type": "chatml.intel",
+                    },
+                ],
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": 2,
+                "fsdp_config": {
+                    "offload_params": False,
+                    "cpu_ram_efficient_loading": False,
+                    "transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "state_dict_type": "FULL_STATE_DICT",
+                    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "reshard_after_forward": True,
+                },
+                "use_tensorboard": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
+
+    @require_torch_2_7_0
+    def test_dpo_lora(self, temp_dir):
+        cfg = DictDefault(
+            {
+                "base_model": "Qwen/Qwen2.5-0.5B",
+                "sequence_len": 2048,
+                "rl": "dpo",
+                "chat_template": "chatml",
+                "datasets": [
+                    {
+                        "path": "Intel/orca_dpo_pairs",
+                        "split": "train",
+                        "type": "chatml.intel",
+                    },
+                ],
+                "adapter": "lora",
+                "lora_r": 8,
+                "lora_alpha": 16,
+                "lora_dropout": 0.05,
+                "lora_target_linear": True,
+                "num_epochs": 1,
+                "max_steps": 2,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 1,
+                "output_dir": temp_dir,
+                "learning_rate": 0.00001,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "flash_attention": True,
+                "fsdp_version": 2,
+                "fsdp_config": {
+                    "offload_params": False,
+                    "cpu_ram_efficient_loading": False,
+                    "transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
+                    "state_dict_type": "FULL_STATE_DICT",
+                    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "reshard_after_forward": True,
+                },
+                "use_tensorboard": True,
+            }
+        )
+
+        # write cfg to yaml file
+        Path(temp_dir).mkdir(parents=True, exist_ok=True)
+        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+        execute_subprocess_async(
+            [
+                "axolotl",
+                "train",
+                str(Path(temp_dir) / "config.yaml"),
+                "--num-processes",
+                "2",
+                "--main-process-port",
+                f"{get_torch_dist_unique_port()}",
+            ]
+        )
+
+        verify_training_success(temp_dir)
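
Note on the launch step: every test in this file writes its config to `config.yaml` and then passes the same argument list to `execute_subprocess_async`. The sketch below shows one way that list could be built in a single place. `build_train_command` is a hypothetical helper, not part of axolotl or of this change, and the port is passed in as a plain argument (the tests obtain it from `get_torch_dist_unique_port()`) so the sketch needs only the standard library.

```python
from pathlib import Path


def build_train_command(config_dir, port, num_processes=2):
    """Hypothetical helper: build the argument list the tests hand to execute_subprocess_async."""
    return [
        "axolotl",
        "train",
        # The tests write the serialized config to <temp_dir>/config.yaml
        str(Path(config_dir) / "config.yaml"),
        "--num-processes",
        str(num_processes),
        "--main-process-port",
        str(port),
    ]
```

Under that assumption, a test body would reduce to `execute_subprocess_async(build_train_command(temp_dir, get_torch_dist_unique_port()))` followed by `verify_training_success(temp_dir)`.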
--- a/tests/e2e/multigpu/test_qwen2.py
+++ b/tests/e2e/multigpu/test_qwen2.py
@@ -1,93 +0,0 @@
-"""
-E2E tests for multigpu qwen2
-"""
-
-from pathlib import Path
-
-import pytest
-import yaml
-from accelerate.test_utils import execute_subprocess_async
-from transformers.testing_utils import get_torch_dist_unique_port
-
-from axolotl.utils.dict import DictDefault
-
-
-class TestMultiGPUQwen2:
-    """
-    Test case for Llama models using LoRA
-    """
-
-    @pytest.mark.parametrize("base_model", ["Qwen/Qwen2-0.5B", "Qwen/Qwen2.5-0.5B"])
-    def test_qlora_fsdp_dpo(self, base_model, temp_dir):
-        # pylint: disable=duplicate-code
-        cfg = DictDefault(
-            {
-                "base_model": base_model,
-                "load_in_4bit": True,
-                "rl": "dpo",
-                "chat_template": "chatml",
-                "sequence_len": 2048,
-                "adapter": "qlora",
-                "lora_r": 8,
-                "lora_alpha": 16,
-                "lora_dropout": 0.05,
-                "lora_target_linear": True,
-                "val_set_size": 0.01,
-                "datasets": [
-                    {
-                        "path": "Intel/orca_dpo_pairs",
-                        "split": "train",
-                        "type": "chatml.intel",
-                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "warmup_steps": 20,
-                "micro_batch_size": 2,
-                "gradient_accumulation_steps": 2,
-                "output_dir": temp_dir,
-                "dataset_prepared_path": temp_dir + "/last_run_prepared",
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "bf16": "auto",
-                "tf32": True,
-                # "gradient_checkpointing": True,
-                "gradient_checkpointing_kwargs": {
-                    "use_reentrant": False,
-                },
-                "fsdp": [
-                    "full_shard",
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_limit_all_gathers": True,
-                    "fsdp_offload_params": False,
-                    "fsdp_sync_module_states": True,
-                    "fsdp_use_orig_params": False,
-                    "fsdp_cpu_ram_efficient_loading": False,
-                    "fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
-                    "fsdp_state_dict_type": "FULL_STATE_DICT",
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                    "fsdp_sharding_strategy": "FULL_SHARD",
-                },
-            }
-        )
-
-        # write cfg to yaml file
-        Path(temp_dir).mkdir(parents=True, exist_ok=True)
-        with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
-            fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
-
-        execute_subprocess_async(
-            [
-                "axolotl",
-                "train",
-                str(Path(temp_dir) / "config.yaml"),
-                "--num-processes",
-                "2",
-                "--main-process-port",
-                f"{get_torch_dist_unique_port()}",
-            ]
-        )
--- a/tests/e2e/utils.py
+++ b/tests/e2e/utils.py
@@ -77,6 +77,18 @@ def require_torch_2_6_0(test_case):
    return unittest.skipUnless(is_min_2_6_0(), "test requires torch>=2.6.0")(test_case)


+def require_torch_2_7_0(test_case):
+    """
+    Decorator marking a test that requires torch >= 2.7.0
+    """
+
+    def is_min_2_7_0():
+        torch_version = version.parse(torch.__version__)
+        return torch_version >= version.parse("2.7.0")
+
+    return unittest.skipUnless(is_min_2_7_0(), "test requires torch>=2.7.0")(test_case)
+
+
 def require_torch_lt_2_6_0(test_case):
    """
    Decorator marking a test that requires torch < 2.6.0
--- a/tests/prompt_strategies/conftest.py
+++ b/tests/prompt_strategies/conftest.py
@@ -172,6 +172,14 @@ def fixture_devstral_tokenizer():
    return tokenizer


+@pytest.fixture(name="devstral_1_1_tokenizer")
+def fixture_devstral_1_1_tokenizer():
+    from axolotl.utils.mistral_tokenizer import HFMistralTokenizer
+
+    tokenizer = HFMistralTokenizer.from_pretrained("mistralai/Devstral-Small-2507")
+    return tokenizer
+
+
 @pytest.fixture(name="mistralv03_tokenizer_chat_template_jinja")
 def fixture_mistralv03_chat_template_jinja_w_system() -> str:
    return '{%- if messages[0]["role"] == "system" %}\n    {%- set system_message = messages[0]["content"] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n    {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n        {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}\n            {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}\n        {%- endif %}\n        {%- set ns.index = ns.index + 1 %}\n    {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if message["role"] == "user" %}\n        {%- if tools is not none and (message == user_messages[-1]) %}\n            {{- "[AVAILABLE_TOOLS] [" }}\n            {%- for tool in tools %}\n                {%- set tool = tool.function %}\n                {{- \'{"type": "function", "function": {\' }}\n                {%- for key, val in tool.items() if key != "return" %}\n                    {%- if val is string %}\n                        {{- \'"\' + key + \'": "\' + val + \'"\' }}\n                    {%- else %}\n                        {{- \'"\' + key + \'": \' + val|tojson }}\n                    {%- endif %}\n                    {%- if not loop.last %}\n                        {{- ", " }}\n                    {%- endif %}\n                {%- endfor %}\n                {{- "}}" }}\n                {%- if not loop.last %}\n                    {{- ", " }}\n                {%- else %}\n 
                   {{- "]" }}\n                {%- endif %}\n            {%- endfor %}\n            {{- "[/AVAILABLE_TOOLS]" }}\n            {%- endif %}\n        {%- if loop.first and system_message is defined %}\n            {{- "[INST] " + system_message + "\\n\\n" + message["content"] + "[/INST]" }}\n        {%- else %}\n            {{- "[INST] " + message["content"] + "[/INST]" }}\n        {%- endif %}\n    {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n        {{- "[TOOL_CALLS] [" }}\n        {%- for tool_call in message.tool_calls %}\n            {%- set out = tool_call.function|tojson %}\n            {{- out[:-1] }}\n            {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n                {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n            {%- endif %}\n            {{- \', "id": "\' + tool_call.id + \'"}\' }}\n            {%- if not loop.last %}\n                {{- ", " }}\n            {%- else %}\n                {{- "]" + eos_token }}\n            {%- endif %}\n        {%- endfor %}\n    {%- elif message["role"] == "assistant" %}\n        {{- " " + message["content"]|trim + eos_token}}\n    {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}\n        {%- if message.content is defined and message.content.content is defined %}\n            {%- set content = message.content.content %}\n        {%- else %}\n            {%- set content = message.content %}\n        {%- endif %}\n        {{- \'[TOOL_RESULTS] {"content": \' + content|string + ", " }}\n        {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n            {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}\n        {%- endif %}\n        {{- \'"call_id": "\' + message.tool_call_id + \'"}[/TOOL_RESULTS]\' }}\n    {%- else %}\n        {{- raise_exception("Only user and assistant roles are supported, with the 
exception of an initial optional system message!") }}\n    {%- endif %}\n{%- endfor %}\n'
--- a/tests/prompt_strategies/test_chat_templates_mistral.py
+++ b/tests/prompt_strategies/test_chat_templates_mistral.py
@@ -11,16 +11,18 @@ if TYPE_CHECKING:

 # fmt: off
 @pytest.mark.parametrize(
-    ("tokenizer_str", "assistant_toolcall_ids"),
+    ("tokenizer_str", "assistant_toolcall_ids", "tool_result_ids"),
    (
-        ("magistral_tokenizer", (9, 44627, 3684, 33, 19881, 1049, 1050, 1051, 1052, 1053, 32, 19227, 12856, 2811, 1032, 1049, 1054, 1044, 1429, 33319, 2811, 1032, 1050, 1125, 2)),
-        ("devstral_tokenizer", (9, 1091, 19227, 2391, 2811, 1429, 44627, 3684, 1897, 1429, 61906, 2811, 16753, 12856, 2811, 1032, 1049, 1054, 1044, 1429, 33319, 2811, 1032, 1050, 4179, 1429, 1327, 2811, 1429, 19881, 1049, 1050, 1051, 1052, 1053, 1034, 27028, 2)),
+        ("magistral_tokenizer", (9, 44627, 3684, 33, 19881, 1049, 1050, 1051, 1052, 1053, 32, 19227, 12856, 2811, 1032, 1049, 1054, 1044, 1429, 33319, 2811, 1032, 1050, 1125, 2), (7, 19881, 1049, 1050, 1051, 1052, 1053, 19, 1049, 1044, 1050, 8)),
+        ("devstral_tokenizer", (9, 1091, 19227, 2391, 2811, 1429, 44627, 3684, 1897, 1429, 61906, 2811, 16753, 12856, 2811, 1032, 1049, 1054, 1044, 1429, 33319, 2811, 1032, 1050, 4179, 1429, 1327, 2811, 1429, 19881, 1049, 1050, 1051, 1052, 1053, 1034, 27028, 2), (7, 19881, 1049, 1050, 1051, 1052, 1053, 19, 1049, 1044, 1050, 8)),
+        ("devstral_1_1_tokenizer", (9, 44627, 3684, 32, 19227, 12856, 2811, 1032, 1049, 1054, 1044, 1429, 33319, 2811, 1032, 1050, 1125, 2,), (7, 1049, 1044, 1050, 8)),
    )
 )
 # fmt: on
 def test_mistral_chat_template(
    tokenizer_str: str,
    assistant_toolcall_ids: tuple[int, ...],
+    tool_result_ids: tuple[int, ...],
    request: pytest.FixtureRequest,
 ):
    """Test chat template with the Magistral/Devstral tokenizer"""
@@ -238,7 +240,7 @@ def test_mistral_chat_template(
        5, 1091, 19227, 4994, 2811, 1429, 5165, 1897, 1429, 5165, 2811, 16753, 2391, 2811, 1429, 44627, 3684, 1897, 1429, 14653, 2811, 1429, 10639, 2130, 1261, 2951, 1307, 1747, 1278, 60092, 1307, 1261, 2782, 1455, 1584, 4289, 2224, 1261, 4265, 6139, 39249, 1429, 26204, 2811, 16753, 4994, 2811, 1429, 6371, 1897, 1429, 48649, 2811, 16753, 12856, 2811, 16753, 4994, 2811, 1429, 49039, 1897, 1429, 14653, 2811, 1429, 1784, 2782, 1317, 3081, 60092, 1307, 2613, 4179, 1429, 33319, 2811, 16753, 4994, 2811, 1429, 49039, 1897, 1429, 14653, 2811, 1429, 1784, 9229, 6139, 1394, 1278, 60092, 2613, 47579, 1429, 15760, 2811, 12161, 12856, 1897, 1429, 33319, 4964, 2821, 27028, 6,  # tool prompt
        3, 46634, 1044, 1710, 1636, 5628, 1639, 1261, 44433, 1307, 2606, 1317, 5388, 1420, 54191, 2424, 1286, 8967, 1063, 15621, 1044, 2549, 30305, 2196, 3560, 1044, 1321, 2606, 1710, 1362, 2016, 8605, 2015, 1317, 5524, 118931, 2036, 32951, 1063, 1362, 2933, 2269, 12106, 1408, 101987, 1044, 6939, 1044, 1321, 9216, 1455, 2084, 3180, 1278, 8967, 119141, 1689, 5935, 1033, 4,  # user
        *assistant_toolcall_ids,  # assistant tool calling
-        7, 19881, 1049, 1050, 1051, 1052, 1053, 19, 1049, 1044, 1050, 8,  # tool result
+        *tool_result_ids,  # tool result
        1784, 60092, 1307, 1032, 1049, 1054, 1395, 1032, 1049, 1321, 1032, 1050, 1046,  # assistant
        2  # eos
    ]
@@ -248,7 +250,7 @@ def test_mistral_chat_template(
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,  # tool prompt
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,  # user prompt
        *assistant_toolcall_ids,  # assistant tool calling
-        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,  # tool result
+        *([-100] * len(tool_result_ids)),  # tool result
        1784, 60092, 1307, 1032, 1049, 1054, 1395, 1032, 1049, 1321, 1032, 1050, 1046,  # assistant
        2  # eos
    ]
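
Note on the expected labels: the change above replaces a fixed run of twelve `-100` values with `[-100] * len(tool_result_ids)`, so the masked span tracks the length of each tokenizer's tool-result encoding. The following is a minimal sketch of that masking rule, not the chat-template strategy's actual code; `build_labels` and `IGNORE_INDEX` are illustrative names, and the assistant ids are truncated from the expected sequence above.

```python
IGNORE_INDEX = -100  # label value used for masked positions in the expected labels


def build_labels(segments):
    """Build labels from (token_ids, train_on) pairs given in conversation order."""
    labels = []
    for token_ids, train_on in segments:
        if train_on:
            # Trained segments keep their token ids as labels
            labels.extend(token_ids)
        else:
            # Masked segments contribute one IGNORE_INDEX per token
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return labels


# Tool-result ids for "devstral_1_1_tokenizer", taken from the parametrization
tool_result_ids = (7, 1049, 1044, 1050, 8)
# First three assistant ids from the expected sequence (truncated for brevity)
assistant_ids = (1784, 60092, 1307)

labels = build_labels([(tool_result_ids, False), (assistant_ids, True)])
```

Because the mask length is derived from the segment itself, the same expected-label expression holds for tokenizers that encode the tool result in 12 tokens and for those that encode it in 5.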
--- a/tests/test_normalize_config.py
+++ b/tests/test_normalize_config.py
@@ -5,7 +5,11 @@ Test classes for checking functionality of the cfg normalization
 import unittest
 from unittest.mock import patch

-from axolotl.utils.config import normalize_cfg_datasets, normalize_config
+from axolotl.utils.config import (
+    migrate_fsdp_config,
+    normalize_cfg_datasets,
+    normalize_config,
+)
 from axolotl.utils.dict import DictDefault


@@ -90,3 +94,104 @@ class NormalizeConfigTestCase(unittest.TestCase):

        self.assertTrue(cfg.bf16)
        self.assertFalse(cfg.fp16)
+
+    def test_migrate_fsdp_config(self):
+        """Test basic FSDP config migration with and without fsdp_version"""
+        cfg_with_version = DictDefault(
+            {
+                "fsdp_config": {
+                    "fsdp_version": 2,
+                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                    "fsdp_offload_params": False,
+                    "fsdp_cpu_ram_efficient_loading": True,
+                    "regular_param": "value",
+                }
+            }
+        )
+
+        migrate_fsdp_config(cfg_with_version)
+
+        self.assertEqual(cfg_with_version.fsdp_version, 2)
+        self.assertEqual(
+            cfg_with_version.fsdp_config.auto_wrap_policy, "TRANSFORMER_BASED_WRAP"
+        )
+        self.assertEqual(cfg_with_version.fsdp_config.offload_params, False)
+        self.assertEqual(cfg_with_version.fsdp_config.cpu_ram_efficient_loading, True)
+        self.assertEqual(cfg_with_version.fsdp_config.regular_param, "value")
+
+        self.assertNotIn("fsdp_auto_wrap_policy", cfg_with_version.fsdp_config)
+        self.assertNotIn("fsdp_offload_params", cfg_with_version.fsdp_config)
+        self.assertNotIn("fsdp_cpu_ram_efficient_loading", cfg_with_version.fsdp_config)
+        self.assertNotIn("fsdp_version", cfg_with_version.fsdp_config)
+        self.assertNotIn("version", cfg_with_version.fsdp_config)
+
+        cfg_without_version = DictDefault(
+            {
+                "fsdp_config": {
+                    "fsdp_auto_wrap_policy": "SIZE_BASED_WRAP",
+                    "fsdp_offload_params": True,
+                    "regular_param": "value",
+                }
+            }
+        )
+
+        migrate_fsdp_config(cfg_without_version)
+
+        self.assertNotIn("fsdp_version", cfg_without_version)
+        self.assertEqual(
+            cfg_without_version.fsdp_config.auto_wrap_policy, "SIZE_BASED_WRAP"
+        )
+        self.assertEqual(cfg_without_version.fsdp_config.offload_params, True)
+        self.assertEqual(cfg_without_version.fsdp_config.regular_param, "value")
+
+        self.assertNotIn("fsdp_auto_wrap_policy", cfg_without_version.fsdp_config)
+        self.assertNotIn("fsdp_offload_params", cfg_without_version.fsdp_config)
+
+    def test_migrate_fsdp_config_no_fsdp_config(self):
+        """Test that function doesn't crash when no fsdp_config is present"""
+        cfg = DictDefault({"some_other_config": "value"})
+
+        migrate_fsdp_config(cfg)
+
+        self.assertNotIn("fsdp_config", cfg)
+        self.assertNotIn("fsdp_version", cfg)
+        self.assertEqual(cfg.some_other_config, "value")
+
+    def test_migrate_fsdp_config_empty_fsdp_config(self):
+        """Test migration with empty fsdp_config"""
+        cfg = DictDefault({"fsdp_config": {}})
+
+        migrate_fsdp_config(cfg)
+
+        self.assertNotIn("fsdp_version", cfg)
+        self.assertEqual(cfg.fsdp_config, {})
+
+    def test_migrate_fsdp_config_mixed_keys(self):
+        """Test migration with a mix of fsdp_ and non-fsdp_ keys"""
+        cfg = DictDefault(
+            {
+                "fsdp_config": {
+                    "fsdp_version": 1,
+                    "fsdp_state_dict_type": "FULL_STATE_DICT",
+                    "mixed_precision_policy": "fp16",
+                    "activation_checkpointing": True,
+                    "fsdp_reshard_after_forward": False,
+                }
+            }
+        )
+
+        migrate_fsdp_config(cfg)
+
+        self.assertEqual(cfg.fsdp_version, 1)
+        self.assertEqual(cfg.fsdp_config.state_dict_type, "FULL_STATE_DICT")
+        self.assertEqual(cfg.fsdp_config.reshard_after_forward, False)
+        self.assertEqual(cfg.fsdp_config.mixed_precision_policy, "fp16")
+        self.assertEqual(cfg.fsdp_config.activation_checkpointing, True)
+
+        # Check original fsdp_ keys are removed
+        self.assertNotIn("fsdp_version", cfg.fsdp_config)
+        self.assertNotIn("fsdp_state_dict_type", cfg.fsdp_config)
+        self.assertNotIn("fsdp_reshard_after_forward", cfg.fsdp_config)
+
+        # Ensure no duplicate version key
+        self.assertNotIn("version", cfg.fsdp_config)
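The behaviour these tests expect can be illustrated with a small sketch on plain dicts. This is not axolotl's `migrate_fsdp_config`; the function name `migrate_fsdp_keys` is hypothetical and the logic is only inferred from the assertions above: keys prefixed with `fsdp_` lose the prefix, `fsdp_version` is hoisted to the top level, and unprefixed keys pass through unchanged.

```python
def migrate_fsdp_keys(cfg: dict) -> dict:
    """Hypothetical stand-in for the migration, inferred from the tests only."""
    fsdp_config = cfg.get("fsdp_config")
    if not fsdp_config:
        # Missing or empty fsdp_config: nothing to migrate.
        return cfg

    migrated = {}
    for key, value in fsdp_config.items():
        if key == "fsdp_version":
            # The version moves to the top level rather than becoming "version".
            cfg["fsdp_version"] = value
        elif key.startswith("fsdp_"):
            # Strip the legacy prefix, e.g. fsdp_offload_params -> offload_params.
            migrated[key[len("fsdp_"):]] = value
        else:
            # Keys that never had the prefix are kept as they are.
            migrated[key] = value

    cfg["fsdp_config"] = migrated
    return cfg


cfg = migrate_fsdp_keys(
    {
        "fsdp_config": {
            "fsdp_version": 1,
            "fsdp_state_dict_type": "FULL_STATE_DICT",
            "mixed_precision_policy": "fp16",
            "fsdp_reshard_after_forward": False,
        }
    }
)
```

With this input the sketch leaves `fsdp_version` at the top level and only unprefixed keys inside `fsdp_config`, which mirrors the mixed-keys test case.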
--- a/tests/test_train.py
+++ b/tests/test_train.py
@@ -0,0 +1,44 @@
+"""Test for batch size calculation for multi-gpu training."""
+
+import pytest
+
+from axolotl.utils.config import normalize_config, validate_config
+from axolotl.utils.dict import DictDefault
+
+
+@pytest.fixture(name="train_base_cfg")
+def fixture_train_base_cfg():
+    return DictDefault(
+        base_model="gpt2",
+        learning_rate=1e-3,
+        datasets=[
+            {
+                "path": "mhenrichsen/alpaca_2k_test",
+                "type": "alpaca",
+            },
+        ],
+        micro_batch_size=2,
+        gradient_accumulation_steps=4,
+        sequence_len=2048,
+        sample_packing=True,
+        num_epochs=1,
+    )
+
+
+class TestTrain:
+    """test class for train related tests"""
+
+    @pytest.mark.parametrize(
+        "world_size, expected_batch_size",
+        [
+            (1, 8),
+            (4, 32),
+        ],
+    )
+    def test_batch_size_ddp(
+        self, train_base_cfg, monkeypatch, world_size, expected_batch_size
+    ):
+        monkeypatch.setenv("WORLD_SIZE", str(world_size))
+        cfg = validate_config(train_base_cfg)
+        normalize_config(cfg)
+        assert cfg.batch_size == expected_batch_size
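The parametrized expectations in this test are consistent with the total batch size being the product of the per-device micro batch size, the gradient accumulation steps, and the world size. The sketch below only reproduces that arithmetic; the helper name `effective_batch_size` is hypothetical, and the formula is inferred from the expected values rather than taken from `normalize_config` itself.

```python
def effective_batch_size(
    micro_batch_size: int, gradient_accumulation_steps: int, world_size: int
) -> int:
    """Assumed relation: samples per optimizer step across all ranks."""
    return micro_batch_size * gradient_accumulation_steps * world_size


# Values from the fixture: micro_batch_size=2, gradient_accumulation_steps=4.
single_gpu = effective_batch_size(2, 4, world_size=1)
four_gpus = effective_batch_size(2, 4, world_size=4)
```

Under this assumption, one process gives 2 * 4 * 1 = 8 and four processes give 2 * 4 * 4 = 32, matching the `(1, 8)` and `(4, 32)` cases.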
Author	SHA1	Message	Date
Wing Lian	41664c7c4c	fix ddp for incorrect steps (#2915) * fix ddp for incorrect steps * add test	2025-07-14 07:51:16 -04:00
Wing Lian	9a8073e73d	Liquid Foundation Model 2 support (#2905) * LFM2 support * docs * packing seems to work * update install to force install in case already on dev version * default to use chunked cross entropy	2025-07-12 11:41:34 -04:00
Jiawei Liu	7fb8441e0e	fix: customized dataset with simpo (#2894) [skip ci]	2025-07-12 11:40:30 -04:00
NanoCode012	4dc5910e1c	feat(doc): re-add docker 2.7.0 tag back (#2902) [skip ci]	2025-07-12 11:40:01 -04:00
Wing Lian	fb7bc9250d	move unmaintained examples to archive (#2903) [skip ci]	2025-07-12 11:39:51 -04:00
salman	d6e4a611e5	FSDP1 -> FSDP2 (#2760) * FSDP2 args migration implementation This commit implements the migration to FSDP2 arguments including: - FSDP2 support with LoRA training - DPO integration with FSDP2 - Model loading fixes and refactoring - CPU offloading and PEFT handling - Test updates and CI improvements - Bug fixes for dtype errors and various edge cases	2025-07-12 15:18:01 +01:00
Ed Sealing	eb662557a7	Register Plugins in Ray Workers (#2901) [skip ci] * Access plugins in ray cluster * Add comment * chore: lint --------- Co-authored-by: Ed Sealing <ed.sealing@patapsco.ai> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-07-11 16:59:59 -04:00
salman	03b2a113fe	Update doc preview workflow to use sticky comments (#2873)	2025-07-11 14:08:35 +01:00
NanoCode012	9b95a625ab	feat: add devstral small 2507 (#2896) * feat: add devstral small 2507 * chore: update blog doc	2025-07-11 09:34:19 +07:00
Wing Lian	c370d0795c	[doc] Fix docs for text field mapping for completion datasets (#2890) * Fix docs for text field mapping for completion datasets * update another reference	2025-07-09 14:52:44 -04:00
Wing Lian	76aeb16156	tiled_mlp supports single gpu (#2891) * tiled_mlp supports single gpu * use checkpoint offloading for arctic training * patch torch checkpoint too * support for single gpu zero3 * add linkback to where it was copied from	2025-07-09 12:48:22 -04:00
Wing Lian	7c5ea0010f	bump dev version (#2889) [skip ci]	2025-07-09 09:43:42 -04:00