Finetune SmolVLM2 with Axolotl

SmolVLM2 are a family of lightweight, open-source multimodal models from HuggingFace designed to analyze and understand video, image, and text content.

These models are built for efficiency, making them well-suited for on-device applications where computational resources are limited. Models are available in multiple sizes, including 2.2B, 500M, and 256M.

This guide shows how to fine-tune SmolVLM2 models with Axolotl.

Getting Started

Install Axolotl following the installation guide.

Here is an example of how to install from pip:

# Ensure you have a compatible version of Pytorch installed
# Option A: manage dependencies in your project
uv add 'axolotl>=0.12.0'
uv pip install flash-attn --no-build-isolation

# Option B: quick install
uv pip install 'axolotl>=0.12.0'
uv pip install flash-attn --no-build-isolation

Install an extra dependency:
```
uv pip install num2words==0.5.14
```

Run the finetuning example:

# LoRA SFT (1x48GB @ 6.8GiB)
axolotl train examples/smolvlm2/smolvlm2-2B-lora.yaml

TIPS

Dataset Format: For video finetuning, your dataset must be compatible with the multi-content Messages format. For more details, see our documentation on Multimodal Formats.
Dataset Loading: Read more on how to prepare and load your own datasets in our documentation.

2.1 KiB Raw Blame History

Finetune SmolVLM2 with Axolotl

Getting Started

TIPS

Optimization Guides

Related Resources

2.1 KiB

Raw Blame History