Files

NanoCode012 17fc747f99 fix: docker build failing (#3622 )

* fix: uv leftover docs

* fix: docker build failing

* chore: doc

* fix: remove old pytorch build

* fix: stop recommend flash-attn optional, let transformers pull

* fix: remove ring flash attention from image

* fix: quotes [skip ci]

* chore: naming [skip ci]

2026-04-24 14:23:09 +07:00

1.8 KiB

Raw Permalink Blame History

Finetune SmolVLM2 with Axolotl

SmolVLM2 are a family of lightweight, open-source multimodal models from HuggingFace designed to analyze and understand video, image, and text content.

These models are built for efficiency, making them well-suited for on-device applications where computational resources are limited. Models are available in multiple sizes, including 2.2B, 500M, and 256M.

This guide shows how to fine-tune SmolVLM2 models with Axolotl.

Getting Started

Install Axolotl following the installation guide.

Here is an example of how to install from pip:

# Ensure you have a compatible version of Pytorch installed
uv pip install --no-build-isolation 'axolotl>=0.16.1'

Install an extra dependency:
```
uv pip install num2words==0.5.14
```

Run the finetuning example:

# LoRA SFT (1x48GB @ 6.8GiB)
axolotl train examples/smolvlm2/smolvlm2-2B-lora.yaml

TIPS

Dataset Format: For video finetuning, your dataset must be compatible with the multi-content Messages format. For more details, see our documentation on Multimodal Formats.
Dataset Loading: Read more on how to prepare and load your own datasets in our documentation.

Optimization Guides

Please check the Optimizations doc.

1.8 KiB Raw Permalink Blame History

Finetune SmolVLM2 with Axolotl

Getting Started

TIPS

Optimization Guides

Related Resources

1.8 KiB

Raw Permalink Blame History