Mistral Small 3.1/3.2 Fine-tuning

This guide covers fine-tuning Mistral Small 3.1 and Mistral Small 3.2 with vision capabilities using Axolotl.

Prerequisites

Before starting, ensure you have:

Installed Axolotl (see Installation docs)

Getting Started

Install the required vision library:

uv pip install 'mistral-common[opencv]==1.8.5'

Download the example dataset image:

wget https://huggingface.co/datasets/Nanobit/text-vision-2k-test/resolve/main/African_elephant.jpg

Run the fine-tuning:

axolotl train examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml

This config uses about 29.4 GiB of VRAM.

Dataset Format

The vision model requires the multi-modal dataset format, as documented here.

One exception: passing "image": PIL.Image is not supported. MistralTokenizer currently supports only path, url, and base64.

Example:

{
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "{SYSTEM_PROMPT}"}]},
        {"role": "user", "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image", "path": "path/to/image.jpg"}
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": "..."}]}
    ]
}
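
As a rough illustration of the format above, the following Python sketch builds one record and checks that every image entry uses one of the supported keys (path, url, base64) rather than a PIL.Image object. The helper names make_record, check_record, and write_jsonl are hypothetical and not part of Axolotl or mistral-common. Storing the dataset as JSON Lines and requiring exactly one source key per image entry are assumptions made for this example.

```python
import json

# Image source keys the text above lists as supported by MistralTokenizer.
# The "image" key (a PIL.Image object) is deliberately absent.
SUPPORTED_IMAGE_KEYS = {"path", "url", "base64"}


def make_record(question: str, answer: str, image_path: str) -> dict:
    """Build one multi-modal chat record that references an image by path."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image", "path": image_path},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


def check_record(record: dict) -> None:
    """Raise ValueError if an image entry uses an unsupported source key."""
    for message in record["messages"]:
        for part in message["content"]:
            if part.get("type") != "image":
                continue
            sources = set(part) - {"type"}
            # Assumption: each image entry carries exactly one source key.
            if len(sources) != 1 or not sources <= SUPPORTED_IMAGE_KEYS:
                raise ValueError(f"unsupported image entry keys: {sorted(part)}")


def write_jsonl(records: list, out_path: str) -> None:
    """Validate the records and write them one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for record in records:
            check_record(record)
            handle.write(json.dumps(record) + "\n")
```

For example, write_jsonl([make_record("What's in this image?", "An elephant.", "African_elephant.jpg")], "train.jsonl") would produce a one-line dataset file, while a record whose image entry uses the "image" key fails validation before anything reaches the tokenizer.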

Limitations

Sample packing is not currently supported for multi-modal training.
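
One way to catch this before launching a run is a small pre-flight check on the config file. The sketch below is only an illustration: enables_sample_packing is a hypothetical helper, and it scans the YAML text with a regular expression instead of parsing it, so it assumes sample_packing is written as a simple key on a line of its own.

```python
import re

# Matches an uncommented line that sets sample_packing to a truthy YAML value.
_PACKING_ON = re.compile(
    r"^[ \t]*sample_packing[ \t]*:[ \t]*(true|yes|on)[ \t]*(#.*)?$",
    re.IGNORECASE | re.MULTILINE,
)


def enables_sample_packing(config_text: str) -> bool:
    """Return True if the config text turns sample packing on."""
    return bool(_PACKING_ON.search(config_text))
```

Reading the config file into a string and passing it to this function before calling axolotl train gives an early warning if sample packing was left enabled.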