# Finetune Mistral Small 4 with Axolotl
Mistral Small 4 is a 119B-parameter (6.5B active) multimodal mixture-of-experts (MoE) model from MistralAI that unifies instruct, reasoning, and coding capabilities in a single model. It is available on HuggingFace as Mistral-Small-4-119B-2603.
Thanks to the team at MistralAI for giving us early access to prepare for this release.
## Getting started
Note: Training this model requires the weights in BF16, which we will link to later. In the meantime, users interested in training can convert / descale the existing FP8 weights.
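For those converting ahead of the BF16 release, the descaling itself is mechanical. Below is a minimal, unofficial sketch: the `.weight_scale` key name and per-tensor scale granularity are assumptions, so inspect the actual FP8 checkpoint's tensor names before relying on it.

```python
# Minimal FP8 -> BF16 descaling sketch, NOT the official conversion script.
# Assumes each FP8 "*.weight" tensor has a matching "*.weight_scale" entry;
# verify key names and scale granularity against the released checkpoint.
import torch
from safetensors.torch import load_file, save_file

state = load_file("model.safetensors")
converted = {}
for name, tensor in state.items():
    if name.endswith(".weight_scale"):
        continue  # folded into its matching weight below
    scale_key = name.replace(".weight", ".weight_scale")
    if tensor.dtype == torch.float8_e4m3fn and scale_key in state:
        scale = state[scale_key].to(torch.float32)
        # upcast, re-apply the stored scale, then store in BF16
        converted[name] = (tensor.to(torch.float32) * scale).to(torch.bfloat16)
    else:
        converted[name] = tensor
save_file(converted, "model-bf16.safetensors")
```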
- Install Axolotl following the installation guide.
- Install Cut Cross Entropy to reduce training VRAM usage.
- Install transformers from main:

  ```bash
  pip install git+https://github.com/huggingface/transformers.git
  ```
- Run one of the example configs:

  ```bash
  # text-only
  axolotl train examples/mistral4/qlora-text.yml  # no experts ~69 GiB, experts ~93 GiB
  axolotl train examples/mistral4/fft-text.yml

  # text + vision
  # run: wget https://huggingface.co/datasets/Nanobit/text-vision-2k-test/resolve/main/African_elephant.jpg
  axolotl train examples/mistral4/qlora-vision.yml  # no experts ~68 GiB
  axolotl train examples/mistral4/fft-vision.yml
  ```
Note: The FFT configs are provided as a reference; please adjust hyperparameters as needed.
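For orientation, the options these configs turn on look roughly like the sketch below. This is illustrative only: the model id, LoRA settings, and batch sizes are assumptions, so treat the shipped `examples/mistral4/*.yml` files as the source of truth.

```yaml
# Illustrative sketch only -- see examples/mistral4/qlora-text.yml for the
# authoritative values. Model id and LoRA settings here are assumptions.
base_model: mistralai/Mistral-Small-4-119B-2603

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_linear: true  # also targeting expert blocks raises VRAM (~93 GiB vs ~69 GiB)

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true  # reduces loss-computation VRAM

datasets:
  - path: your/dataset
    type: chat_template

sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 4
gradient_checkpointing: true
```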
## Reasoning Effort

The chat template supports a `reasoning_effort` variable to control the model's reasoning depth:

- `"none"`: instruct mode (default)
- `"high"`: reasoning mode with explicit thinking steps
Pass it via `chat_template_kwargs` under your dataset config:

```yaml
datasets:
  - path: your/dataset
    type: chat_template
    chat_template_kwargs:
      reasoning_effort: high
```
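Outside of Axolotl, the same variable can be passed through `transformers`' `apply_chat_template`, which forwards extra keyword arguments into the template context. A quick sanity check (the repo id is assumed from the model card above):

```python
# Quick check that the chat template reacts to reasoning_effort.
# Extra kwargs to apply_chat_template are forwarded to the Jinja template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-4-119B-2603")  # assumed repo id
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="high",  # "none" (default) or "high"
)
print(prompt)
```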
## Thinking Support

The chat template supports a `thinking` content type in assistant messages for training on reasoning traces (rendered as `[THINK]...[/THINK]` blocks).

To use thinking datasets, add the `thinking` mapping via `message_property_mappings`:
```yaml
datasets:
  - path: your/thinking-dataset
    type: chat_template
    message_property_mappings:
      role: role
      content: content
      thinking: thinking
    chat_template_kwargs:
      reasoning_effort: high
```
See the Magistral thinking guide for dataset format details.
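For orientation, a single (hypothetical) row of such a dataset could look like the following; the exact schema is covered in the Magistral guide:

```json
{
  "messages": [
    {"role": "user", "content": "What is 17 * 24?"},
    {
      "role": "assistant",
      "thinking": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
      "content": "17 * 24 = 408."
    }
  ]
}
```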
## Tips

- Read more on how to load your own dataset in the docs.
- The text dataset format follows the OpenAI Messages format, as seen here.
- The vision model requires the multi-modal dataset format documented here.
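To make the two formats concrete, here are hypothetical JSONL rows for each (the `path` key for images is an assumption; check the linked multi-modal docs for the exact schema):

```jsonl
{"messages": [{"role": "user", "content": "Name a large land mammal."}, {"role": "assistant", "content": "The African elephant."}]}
{"messages": [{"role": "user", "content": [{"type": "image", "path": "African_elephant.jpg"}, {"type": "text", "text": "What animal is shown here?"}]}, {"role": "assistant", "content": "An African elephant."}]}
```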