Feat: add Magistral Small 2509 and native mistral3 tokenizer support (#3165)

* feat: update mistral common
* feat: add mistral3processor
* fix: loading
* fix: cast pixel_values to fp32
* fix: image tensor conversion
* feat: add FA2 support for pixtral based models
* fix: update mistral small 3.1 to use native tokenizer
* fix: install tips
* fix: improve info on sample dataset files
* chore: move mistral configs into subfolders
* fix: remove unneeded patch
* fix: indent
* feat: add integration tests
* chore: move
* feat: add magistral 2509 docs and example
* fix: convert tensor to bool
* feat: expand tests
* chore: move tests
2025-09-18 15:42:20 +07:00
parent 4065bc14c6
commit 09959fac70
32 changed files with 757 additions and 39 deletions
--- a/examples/magistral/think/README.md
+++ b/examples/magistral/think/README.md
@@ -0,0 +1,73 @@
+# Magistral Small Thinking Fine-tuning
+
+This guide covers fine-tuning [Magistral Small 2507](https://huggingface.co/mistralai/Magistral-Small-2507) with thinking capabilities using Axolotl. The thinking model enables explicit Chain-of-Thought reasoning with separate thinking and response sections.
+
+## Prerequisites
+
+Before starting, ensure you have:
+- Installed Axolotl (see [main README](../README.md))
+
+## Getting Started
+
+Run the thinking model fine-tuning:
+
+```bash
+axolotl train magistral-small-think-qlora.yaml
+```
+
+This config uses about 19.1 GiB of VRAM.
+
+### Tips
+
+- The dataset uses the multi-content format, which supports `type: thinking` chunks. See [Dataset Format](#dataset-format) below.
+- Do not mix `content: str` and `content: list[dict]`; dataset loading fails if you do. Keep the format consistent.
+
+## Dataset Format
+
+The thinking model requires the multi-content dataset format, which supports an extra `type: thinking` chunk within system and assistant messages.
+
+Example format:
+
+```json
+{
+    "messages": [
+        {
+            "role": "system",
+            "content": [
+                { "type": "text", "text": "{SYSTEM_PROMPT}"}
+            ]
+        },
+        {
+            "role": "user",
+            "content": [
+                { "type": "text", "text": "Solve this step by step: What is 15% of 240?"}
+            ]
+        },
+        {
+            "role": "assistant",
+            "content": [
+                {
+                    "type": "thinking",
+                    "thinking": "I need to calculate 15% of 240. First, I'll convert 15% to decimal: 0.15. Then multiply: 0.15 × 240 = 36."
+                },
+                {
+                    "type": "text",
+                    "text": "To find 15% of 240, I'll multiply 240 by 0.15:\n\n240 × 0.15 = 36\n\nTherefore, 15% of 240 is 36."
+                }
+            ]
+        }
+    ]
+}
+```
+
+### Advanced Options
+
+The `thinking` section supports an optional `closed` parameter (default: `true`) that controls whether the closing `[/THINK]` tag is added:
+
+```json
+{
+    "type": "thinking",
+    "thinking": "Internal reasoning here...",
+    "closed": true
+}
+```
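Review note on the dataset format described in the README above: the two stated rules (list-form `content` throughout, and `thinking` chunks only inside system and assistant messages) can be checked before training starts. The following is a minimal sketch of such a pre-flight check. The function name `check_record`, the sample records, and the error messages are hypothetical. This is not Axolotl's or mistral-common's own validation, and it encodes only what the README states.

```python
# Roles that the README says may carry a `thinking` chunk.
THINKING_ROLES = {"system", "assistant"}


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one chat record (empty if none)."""
    problems = []
    for i, message in enumerate(record.get("messages", [])):
        role = message.get("role")
        content = message.get("content")
        # The README forbids mixing `content: str` with `content: list[dict]`,
        # so this sketch requires the list form in every message.
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a list of chunks")
            continue
        for j, chunk in enumerate(content):
            if not isinstance(chunk, dict):
                problems.append(f"message {i} chunk {j}: chunk must be a dict")
                continue
            kind = chunk.get("type")
            if kind == "text" and not isinstance(chunk.get("text"), str):
                problems.append(f"message {i} chunk {j}: missing 'text' string")
            elif kind == "thinking":
                if role not in THINKING_ROLES:
                    problems.append(
                        f"message {i} chunk {j}: thinking not allowed for role {role!r}"
                    )
                if not isinstance(chunk.get("thinking"), str):
                    problems.append(f"message {i} chunk {j}: missing 'thinking' string")
                # `closed` is optional; when present it should be a boolean.
                if not isinstance(chunk.get("closed", True), bool):
                    problems.append(f"message {i} chunk {j}: 'closed' must be a bool")
    return problems


good = {
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "Be concise."}]},
        {"role": "user", "content": [{"type": "text", "text": "What is 15% of 240?"}]},
        {
            "role": "assistant",
            "content": [
                {"type": "thinking", "thinking": "0.15 * 240 = 36."},
                {"type": "text", "text": "15% of 240 is 36."},
            ],
        },
    ]
}

bad = {
    "messages": [
        {"role": "user", "content": "plain string content"},
        {"role": "user", "content": [{"type": "thinking", "thinking": "hmm"}]},
    ]
}

print(check_record(good))
for problem in check_record(bad):
    print(problem)
```

Running a check like this over a JSONL file before `axolotl train` surfaces format mismatches early instead of during dataset loading. Chunk types other than `text` and `thinking` are deliberately ignored here, since the README does not say which others are accepted.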
--- a/examples/magistral/think/magistral-small-think-qlora.yaml
+++ b/examples/magistral/think/magistral-small-think-qlora.yaml
@@ -0,0 +1,67 @@
+base_model: mistralai/Magistral-Small-2507
+
+# Enable to use mistral-common tokenizer
+tokenizer_use_mistral_common: true
+
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+plugins:
+  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
+
+load_in_8bit: false
+load_in_4bit: true
+
+datasets:
+  - path: Nanobit/text-think-2k-test
+    type: chat_template
+
+dataset_prepared_path: last_run_prepared
+val_set_size: 0
+output_dir: ./outputs/lora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 2048
+sample_packing: true
+
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_linear: true
+lora_target_modules:
+  - gate_proj
+  - down_proj
+  - up_proj
+  - q_proj
+  - v_proj
+  - k_proj
+  - o_proj
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 1
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+bf16: auto
+tf32: false
+
+gradient_checkpointing: true
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 1
+saves_per_epoch: 1
+
+# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
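Review note on the config above: the batch and LoRA settings imply a few derived quantities that are worth knowing when adapting the file. The sketch below computes them from the values in `magistral-small-think-qlora.yaml`. It assumes a single GPU (the config does not state a device count) and the standard LoRA scaling of `lora_alpha / lora_r` (no rank-stabilized variant is enabled in the config). The function names are hypothetical.

```python
def sequences_per_optimizer_step(micro_batch_size: int,
                                 gradient_accumulation_steps: int,
                                 num_gpus: int = 1) -> int:
    """Number of (packed) sequences contributing to each optimizer update."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def max_tokens_per_optimizer_step(sequences: int, sequence_len: int) -> int:
    """Upper bound on tokens per update; packing fills each sequence up to sequence_len."""
    return sequences * sequence_len


def lora_scale(lora_alpha: int, lora_r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / lora_r


# Values copied from the config; num_gpus=1 is an assumption.
seqs = sequences_per_optimizer_step(micro_batch_size=2, gradient_accumulation_steps=4)
tokens = max_tokens_per_optimizer_step(seqs, sequence_len=2048)
scale = lora_scale(lora_alpha=16, lora_r=32)

print(seqs)    # 8
print(tokens)  # 16384
print(scale)   # 0.5
```

If `micro_batch_size` is lowered to fit a smaller GPU, raising `gradient_accumulation_steps` by the same factor keeps the number of sequences per update unchanged.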