# Multimodal / Vision Language Models (BETA)

## Supported Models

See the per-model configuration sections below.

## Usage
Multimodal support is limited and does not yet have full feature parity with text-only training.

Here are the hyperparameters you'll need to fine-tune a multimodal model:

```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true

remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template: # see the per-model sections below

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing lora, only finetune the language model and
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Please see the examples folder for full configs.
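Before launching a run, you can sanity-check the `lora_target_modules` regex above against your model's module names. A minimal sketch using Python's `re` module — the module names below are hypothetical examples of the naming scheme, not taken from a specific checkpoint:

```python
import re

# The LoRA target-module pattern from the config above.
PATTERN = r'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# Hypothetical module names, for illustration only.
names = [
    "language_model.model.layers.0.self_attn.q_proj",  # matched -> LoRA applied
    "language_model.model.layers.17.mlp.gate_proj",    # matched -> LoRA applied
    "vision_tower.layers.0.self_attn.q_proj",          # not matched -> stays frozen
]

for name in names:
    matched = re.fullmatch(PATTERN, name) is not None
    print(f"{name}: {'train' if matched else 'frozen'}")
```

Anything outside `language_model` (e.g. the vision tower) fails the match and keeps its original frozen weights.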
> **Warning**
> Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
### Mllama

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision
```

### Llama4

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
chat_template: llama4
```

### Pixtral

```yaml
base_model: mistralai/Pixtral-12B-2409
chat_template: pixtral
```

### Llava-1.5

```yaml
base_model: llava-hf/llava-1.5-7b-hf
chat_template: llava
```

### Mistral-Small-3.1

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
chat_template: mistral_v7_tekken
```

### Gemma-3

> **Tip**
> The Gemma3-1B model is a text-only model, so please train it as a regular text model.

For the multimodal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it
chat_template: gemma3
```

### Qwen2-VL

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl
```

### Qwen2.5-VL

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```

## Dataset Format
For multimodal datasets, we adopt an extended chat_template format similar to OpenAI's Message format.

- A message is a list of `role` and `content`.
  - `role` can be `system`, `user`, `assistant`, etc.
  - `content` is a list of `type` and one of (`text`, `image`, `path`, `url`, `base64`).
> **Note**
> For backwards compatibility:
>
> - If the dataset has an `images` or `image` column of `list[Image]`, each image will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` entry without an `image` key, the `image` key will be set on that entry instead.
> - If `content` is a string, it will be converted to a list with `type` set to `text`.
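These backwards-compatibility rules can be sketched roughly as below. This is an illustrative reimplementation of the behavior described above, not Axolotl's actual code, and the function names are made up for this sketch:

```python
# Hypothetical sketch of the backwards-compatibility normalization rules.
def normalize_message_content(content):
    # Rule 2: a bare string becomes a single-element list of type "text".
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content

def attach_images(messages, images):
    # Rule 1: images from an `images`/`image` column go into the first
    # message's content. An existing {"type": "image"} entry that lacks an
    # `image` key is filled in; otherwise a new entry is appended.
    first = messages[0]["content"]
    for img in images:
        placeholder = next(
            (c for c in first if c.get("type") == "image" and "image" not in c),
            None,
        )
        if placeholder is not None:
            placeholder["image"] = img
        else:
            first.append({"type": "image", "image": img})
    return messages
```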
> **Tip**
> For image loading, you can use any of the following keys within `content` alongside `"type": "image"`:
>
> - `"path": "/path/to/image.jpg"`
> - `"url": "https://example.com/image.jpg"`
> - `"base64": "..."`
> - `"image": PIL.Image`
Here is an example of a multimodal dataset:

```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```
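If you build rows programmatically, the `base64` content key can carry the image bytes inline. A stdlib-only sketch of assembling one such row (the image bytes here are a placeholder stand-in, not a real image):

```python
import base64
import json

# Placeholder bytes standing in for a real image file's contents.
fake_image_bytes = b"\x89PNG\r\n\x1a\n not a real image"

row = {
    "messages": [
        {
            "role": "user",
            "content": [
                # Inline the image as base64 instead of a path/url.
                {"type": "image", "base64": base64.b64encode(fake_image_bytes).decode("ascii")},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ]
}

# One JSONL line, ready to write to a dataset file.
print(json.dumps(row))
```

In practice you would read `fake_image_bytes` from disk with `open(path, "rb").read()`.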