---
title: MultiModal / Vision Language Models (BETA)
format:
  html:
    toc: true
    toc-depth: 3
---

## Supported Models

- [Mllama](#sec-mllama)
- [Llama4](#sec-llama4)
- [Pixtral](#sec-pixtral)
- [Llava-1.5](#sec-llava-15)
- [Mistral-Small-3.1](#sec-mistral-small-31)
- [Magistral-Small-2509](#sec-magistral-small-2509)
- [Voxtral](#sec-voxtral)
- [Gemma-3](#sec-gemma-3)
- [Gemma-3n](#sec-gemma-3n)
- [Qwen2-VL](#sec-qwen2-vl)
- [Qwen2.5-VL](#sec-qwen25-vl)
- [Qwen3-VL](#sec-qwen3-vl)
- [GLM-4.6V](#sec-glm-4-6v)
- [SmolVLM2](#sec-smolvlm2)
- [LFM2-VL](#sec-lfm2-vl)
- [Intern-VL](#sec-intern-vl)

## Usage

Multimodal support is limited and does not yet have full feature parity with text-only training.

The following hyperparameters are needed to fine-tune a multimodal model:

```yaml
processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the model-specific sections below; set only if one is specified there

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]

# (optional) if doing lora, only finetune the Language model,
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

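To see which layers the `lora_target_modules` pattern above would select, it can help to test the regex against module names before training. The following is a minimal sketch, not Axolotl's own code: it assumes the pattern is applied as a full match against each module name (which is how PEFT is understood to treat a string `target_modules`), and the candidate names are hypothetical placeholders. List the real names of your model with `model.named_modules()`.

```python
import re

# Pattern copied from the config above.
LORA_TARGET_MODULES = (
    r"model.language_model.layers.[\d]+."
    r"(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"
)

# Hypothetical module names, for illustration only.
CANDIDATES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.12.mlp.gate_proj",
    "model.language_model.layers.5.cross_attn.o_proj",
    "model.language_model.layers.3.input_layernorm",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
]


def selected_modules(pattern, names):
    """Return the names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    for name in selected_modules(LORA_TARGET_MODULES, CANDIDATES):
        print(name)
```

Under these assumptions only the three `language_model` projection layers are selected, so the vision tower stays frozen.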
See the [examples](https://github.com/axolotl-ai/axolotl/tree/main/examples) folder for full configs.

::: {.callout-tip}
Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
:::

::: {.callout-note}
We currently neither truncate nor drop samples based on `sequence_len`, because each architecture processes non-text tokens differently. We are looking for help on this.
:::

### Mllama {#sec-mllama}

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision
```

### Llama4 {#sec-llama4}

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4
```

### Pixtral {#sec-pixtral}

```yaml
base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral
```

### Llava-1.5 {#sec-llava-15}

```yaml
base_model: llava-hf/llava-1.5-7b-hf

chat_template: llava
```

### Mistral-Small-3.1 {#sec-mistral-small-31}

::: {.callout-tip}
Make sure to install the vision library via `pip install 'mistral-common[opencv]==1.8.5'`
:::

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
```

### Magistral-Small-2509 {#sec-magistral-small-2509}

::: {.callout-tip}
Make sure to install the vision library via `pip install 'mistral-common[opencv]==1.8.5'`
:::

```yaml
base_model: mistralai/Magistral-Small-2509
```

### Voxtral {#sec-voxtral}

::: {.callout-tip}
Make sure to install the audio libraries via `pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'`
:::

```yaml
base_model: mistralai/Voxtral-Mini-3B-2507

processor_type: VoxtralProcessor
```

### Gemma-3 {#sec-gemma-3}

::: {.callout-tip}
The Gemma3-1B model is text-only, so train it as a regular text model.
:::

For the multi-modal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it

chat_template: gemma3
```

### Gemma-3n {#sec-gemma-3n}

::: {.callout-warning}
The model's initial loss and grad norm will be very high. We suspect this is due to the Conv layers in the vision layers.
:::

::: {.callout-tip}
Make sure to install `timm` via `pip3 install timm==1.0.17`
:::

```yaml
base_model: google/gemma-3n-E2B-it

chat_template: gemma3n
```

### Qwen2-VL {#sec-qwen2-vl}

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl
```

### Qwen2.5-VL {#sec-qwen25-vl}

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```

### Qwen3-VL {#sec-qwen3-vl}

```yaml
base_model: Qwen/Qwen3-VL-4B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```

### GLM-4.6V {#sec-glm-4-6v}

Both GLM-4.6V (106B MoE) and GLM-4.6V-Flash (9B) are supported. Set `base_model` to one of the two:

```yaml
# GLM-4.6V (106B MoE version)
base_model: zai-org/GLM-4.6V

# OR GLM-4.6V-Flash (9B version)
# base_model: zai-org/GLM-4.6V-Flash
```

### SmolVLM2 {#sec-smolvlm2}

::: {.callout-tip}
Make sure to install `num2words` via `pip3 install num2words==0.5.14`
:::

```yaml
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
```

### LFM2-VL {#sec-lfm2-vl}

::: {.callout-warning}
Uninstall `causal-conv1d` via `pip3 uninstall -y causal-conv1d`
:::

```yaml
base_model: LiquidAI/LFM2-VL-450M
```

### Intern-VL {#sec-intern-vl}

::: {.callout-tip}
Make sure to install `timm` via `pip3 install timm==1.0.19`
:::

```yaml
base_model: OpenGVLab/InternVL3_5-8B
```

## Dataset Format

For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Message format.

- Each message is a dict with `role` and `content` keys.
- `role` can be `system`, `user`, `assistant`, etc.
- `content` is a list of dicts, each with a `type` key and a matching payload key (`text`, `image`, `path`, `url`, `base64`, or `audio`).

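The structure described above can be checked before training with a small validator. This is a minimal sketch under stated assumptions, not part of Axolotl: the function name and error strings are made up for illustration, and it checks the extended format only, so the legacy forms accepted for backwards compatibility (string `content`, image entries without a payload key) are reported as problems.

```python
# Payload keys taken from the list above.
PAYLOAD_KEYS = ("text", "image", "path", "url", "base64", "audio")


def validate_messages(messages):
    """Return a list of problems found in one sample's `messages` (empty if none)."""
    if not isinstance(messages, list):
        return ["messages must be a list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict) or "role" not in message or "content" not in message:
            problems.append(f"message {i}: needs `role` and `content`")
            continue
        content = message["content"]
        if not isinstance(content, list):
            problems.append(f"message {i}: `content` must be a list")
            continue
        for j, part in enumerate(content):
            if not isinstance(part, dict) or "type" not in part:
                problems.append(f"message {i}, part {j}: needs a `type` key")
            elif not any(key in part for key in PAYLOAD_KEYS):
                problems.append(f"message {i}, part {j}: no payload key")
    return problems


if __name__ == "__main__":
    sample = [
        {"role": "user", "content": [{"type": "text", "text": "Hello"}]},
        {"role": "assistant", "content": "Hi"},  # legacy string form
    ]
    print(validate_messages(sample))
```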
### Image

::: {.callout-note}
For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, it will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` entry without an `image` key, the `image` key will be set on that entry instead.
- If `content` is a string, it will be converted to a list with `type` as `text`.
:::

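The two backwards-compatibility rules can be pictured as a small normalization step. The sketch below is an illustration only and is not Axolotl's implementation: the function name is hypothetical, "the first `content` list" is read here as the content of the first message, and plain strings stand in for `PIL.Image` objects.

```python
def normalize_legacy_row(row):
    """Convert a legacy row into the extended format (illustrative sketch)."""
    messages = []
    for message in row["messages"]:
        content = message["content"]
        if isinstance(content, str):
            # Rule 2: a string becomes a single text entry.
            content = [{"type": "text", "text": content}]
        else:
            content = [dict(part) for part in content]  # shallow copy of each entry
        messages.append({**message, "content": content})

    # Rule 1: attach images from an `images` or `image` column.
    images = row.get("images") or row.get("image") or []
    for image in images:
        placeholder = next(
            (
                part
                for message in messages
                for part in message["content"]
                if part.get("type") == "image" and "image" not in part
            ),
            None,
        )
        if placeholder is not None:
            placeholder["image"] = image  # fill an existing {"type": "image"} entry
        elif messages:
            # Assumption: "first content list" means the first message's content.
            messages[0]["content"].append({"type": "image", "image": image})
    return {"messages": messages}


if __name__ == "__main__":
    legacy = {"messages": [{"role": "user", "content": "What is this?"}], "images": ["<image>"]}
    print(normalize_legacy_row(legacy))
```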
For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`

### Audio

For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:

- `"path": "/path/to/audio.mp3"`
- `"url": "https://example.com/audio.mp3"`
- `"audio": np.ndarray`

::: {.callout-tip}

You may need to install `librosa` via `pip3 install librosa==0.11.0`.

:::

### Video

::: {.callout-warning}

This is not well tested at the moment. We welcome contributors!

:::

For video loading, you can use the following keys within `content` alongside `"type": "video"`:

- `"path": "/path/to/video.mp4"`
- `"url": "https://example.com/video.mp4"`
- `"video": np.ndarray | list[PIL.Image.Image] | torch.Tensor` (or list of the aforementioned)

### Example

Here is an example of a multi-modal dataset:

```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```

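A dataset file in this shape can be generated programmatically. The following is a minimal sketch that builds samples with the `path` key for local images and writes them as a JSON list. The helper names, the default system prompt, and the file paths are hypothetical and chosen for illustration.

```python
import json


def make_sample(image_path, question, answer, system_prompt="You are a helpful assistant."):
    """Build one sample in the message format shown in the example."""
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            {
                "role": "user",
                "content": [
                    {"type": "image", "path": image_path},
                    {"type": "text", "text": question},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


def write_dataset(samples, out_path):
    """Write the samples as a JSON list."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(samples, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    samples = [make_sample("/path/to/image.jpg", "Describe this image in detail.", "The image is a bee.")]
    write_dataset(samples, "multimodal_train.json")
```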
## FAQ

1. `PIL.UnidentifiedImageError: cannot identify image file ...`

    `PIL` could not load the file retrieved from `url` using `requests`. Check the URL for typos. Another possible cause is that the server blocked the request.
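When debugging this error, it can help to look at the first bytes of what the server actually returned. The sketch below is a standalone diagnostic, not part of Axolotl: it compares the payload against a few well-known image signatures and flags responses that look like an HTML page, which often means the request was blocked or the URL is wrong. The function names are hypothetical, and the bytes can come from any HTTP client (for example `requests.get(url).content`).

```python
# Well-known magic numbers at the start of common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_format(data):
    """Return a format name if the bytes start like a known image, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def describe_payload(data):
    """Give a short human-readable guess about what was downloaded."""
    fmt = sniff_image_format(data)
    if fmt is not None:
        return f"looks like {fmt}"
    if not data:
        return "empty response"
    if data.lstrip()[:1] == b"<":
        return "looks like HTML/XML, possibly an error or block page"
    return "unrecognized content"


if __name__ == "__main__":
    print(describe_payload(b"<!DOCTYPE html><html>403 Forbidden</html>"))
```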