# MultiModal / Vision Language Models (BETA)
### Supported Models
- Mllama, i.e. Llama models with vision support (e.g. Llama 3.2 Vision)
### Usage
Currently, multimodal support is limited and doesn't have full feature parity. To finetune a multimodal Llama with LoRA, you'll need to use the following YAML in combination with the rest of the required hyperparameters.
```yaml
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
skip_prepare_dataset: true

chat_template: llama3_2_vision
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages
remove_unused_columns: false
sample_packing: false

# only finetune the language model, leave the vision model and vision tower frozen
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
```
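
As a rough illustration of "the rest of the required hyperparameters," a minimal LoRA setup combined with the snippet above might look like the sketch below. The values shown are placeholders for illustration, not tuned recommendations, and any other standard Axolotl training options (output directory, batch sizing, scheduler, etc.) can be added as usual.

```yaml
# Illustrative LoRA and training settings to combine with the multimodal
# snippet above. Values are placeholder examples, not recommendations.
adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05

sequence_len: 8192
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
optimizer: adamw_torch
learning_rate: 0.0002
bf16: true
```

Note that `remove_unused_columns: false` and `sample_packing: false` from the snippet above should be kept as-is; the current multimodal support relies on the extra dataset columns (e.g. images) and does not work with sample packing.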