Mistral Small 3.1/3.2 Fine-tuning
This guide covers fine-tuning Mistral Small 3.1 and Mistral Small 3.2 with vision capabilities using Axolotl.
Prerequisites
Before starting, ensure you have:
- Installed Axolotl (see Installation docs)
Getting Started
- Install the required vision lib:
  uv pip install 'mistral-common[opencv]==1.8.5'
- Download the example dataset image:
  wget https://huggingface.co/datasets/Nanobit/text-vision-2k-test/resolve/main/African_elephant.jpg
- Run the fine-tuning:
  axolotl train examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml
This config uses about 29.4 GiB VRAM.
Dataset Format
The vision model requires Axolotl's multi-modal dataset format, as documented in the Axolotl multimodal docs.
One exception: passing "image": PIL.Image is not supported. MistralTokenizer currently supports only path, url, and base64.
Example:
{
  "messages": [
    {"role": "system", "content": [{"type": "text", "text": "{SYSTEM_PROMPT}"}]},
    {"role": "user", "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image", "path": "path/to/image.jpg"}
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]}
  ]
}
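
Records in this shape can be generated programmatically. The following is a minimal sketch, not part of Axolotl: the helper names (`image_part`, `make_sample`, `write_jsonl`), the output file name `train.jsonl`, and the prompt and answer strings are all hypothetical placeholders. It assumes one JSON object per line is an acceptable on-disk layout for your dataset config, and it rejects the unsupported "image" key up front.

```python
import json

# Image source keys this guide lists as supported by MistralTokenizer.
SUPPORTED_IMAGE_KEYS = {"path", "url", "base64"}


def image_part(source: str, kind: str = "path") -> dict:
    """Build an image content part (hypothetical helper).

    Raises ValueError for any key outside SUPPORTED_IMAGE_KEYS,
    for example the unsupported "image" (PIL.Image) key.
    """
    if kind not in SUPPORTED_IMAGE_KEYS:
        raise ValueError(f"unsupported image source key: {kind!r}")
    return {"type": "image", kind: source}


def make_sample(system_prompt: str, question: str, image: dict, answer: str) -> dict:
    """Assemble one record in the messages layout shown above."""
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user", "content": [{"type": "text", "text": question}, image]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


def write_jsonl(samples: list, out_path: str) -> None:
    """Write one JSON object per line (assumed layout, see lead-in)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


# Placeholder strings; replace with your own data.
sample = make_sample(
    "{SYSTEM_PROMPT}",
    "What's in this image?",
    image_part("African_elephant.jpg"),  # file downloaded in Getting Started
    "...",
)

if __name__ == "__main__":
    write_jsonl([sample], "train.jsonl")  # hypothetical output file name
```

Using `json.dumps` rather than hand-written JSON avoids trailing-comma and quoting mistakes, and checking the image key at build time surfaces an unsupported source before training starts.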
Limitations
- Sample packing is not currently supported for multi-modal training.