axolotl/docs/multimodal.qmd

---
title: MultiModal / Vision Language Models (BETA)
format:
  html:
    toc: true
    toc-depth: 3
---

## Supported Models

- [Mllama](#sec-mllama)
- [Llama4](#sec-llama4)
- [Pixtral](#sec-pixtral)
- [Llava-1.5](#sec-llava-15)
- [Mistral-Small-3.1](#sec-mistral-small-31)
- [Gemma-3](#sec-gemma-3)
- [Gemma-3n](#sec-gemma-3n)
- [Qwen2-VL](#sec-qwen2-vl)
- [Qwen2.5-VL](#sec-qwen25-vl)

## Usage

Multimodal support is limited and doesn't have full feature parity.

Here are the hyperparams you'll need to use to finetune a multimodal model.

```yaml
processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see in next section

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing lora, only finetune the Language model,
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Please see [examples](https://github.com/axolotl-ai/axolotl/tree/main/examples) folder for full configs.

::: {.callout-warning}
Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
:::

### Mllama {#sec-mllama}

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision
```

### Llama4 {#sec-llama4}

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4
```

### Pixtral {#sec-pixtral}

```yaml
base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral
```

### Llava-1.5 {#sec-llava-15}

```yaml
base_model: llava-hf/llava-1.5-7b-hf

chat_template: llava
```

### Mistral-Small-3.1 {#sec-mistral-small-31}

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

chat_template: mistral_v7_tekken
```

### Gemma-3 {#sec-gemma-3}

::: {.callout-tip}
The Gemma3-1B model is a text-only model, so please train as regular text model.
:::

For multi-modal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it

chat_template: gemma3
```

### Gemma-3n {#sec-gemma-3n}

::: {.callout-warning}
The model's initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.
:::

::: {.callout-tip}
Please make sure to install `timm` via `pip3 install timm==1.0.17`
:::

```yaml
base_model: google/gemma-3n-E2B-it

chat_template: gemma3n
```

### Qwen2-VL {#sec-qwen2-vl}

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl
```

### Qwen2.5-VL {#sec-qwen25-vl}

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```

## Dataset Format

For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Message format.

- A message is a list of `role` and `content`.
- `role` can be `system`, `user`, `assistant`, etc.
- `content` is a list of `type` and (`text`, `image`, `path`, `url`, `base64`, or `audio`).

### Image

::: {.callout-note}
For backwards compatibility:

- If the dataset has a `images` or `image` column of `list[Image]`, it will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` but no `image` key, it will be set the `image` key.
- If `content` is a string, it will be converted to a list with `type` as `text`.
:::

For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`

### Audio

For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:

- `"path": "/path/to/audio.mp3"`
- `"url": "https://example.com/audio.mp3"`
- `"audio": np.ndarray`

::: {.callout-tip}

You may need to install `librosa` via `pip3 install librosa==0.11.0`.

:::

### Example

Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
        {
            "role": "system",
            "content": [
              {"type": "text", "text": "You are a helpful assistant."}
              ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        },
        {
            "role": "assistant",
            "content": [
              {"type": "text", "text": "The image is a bee."}
            ]
        }
    ]
  }
]
```

## FAQ

1. `PIL.UnidentifiedImageError: cannot identify image file ...`

`PIL` could not retrieve the file at `url` using `requests`. Please check for typo. One alternative reason is that the request is blocked by the server.