Built site for gh-pages

2025-08-15 14:58:20 +00:00
parent 950186348c
commit f1f9851422
4 changed files with 320 additions and 240 deletions
--- a/search.json
+++ b/search.json
@@ -563,7 +563,7 @@
    "href": "docs/multimodal.html",
    "title": "MultiModal / Vision Language Models (BETA)",
    "section": "",
-    "text": "Mllama\nLlama4\nPixtral\nLlava-1.5\nMistral-Small-3.1\nGemma-3\nGemma-3n\nQwen2-VL\nQwen2.5-VL",
+    "text": "Mllama\nLlama4\nPixtral\nLlava-1.5\nMistral-Small-3.1\nVoxtral\nGemma-3\nGemma-3n\nQwen2-VL\nQwen2.5-VL\nSmolVLM2\nLFM2-VL",
    "crumbs": [
      "How To Guides",
      "MultiModal / Vision Language Models (BETA)"
@@ -574,7 +574,7 @@
    "href": "docs/multimodal.html#supported-models",
    "title": "MultiModal / Vision Language Models (BETA)",
    "section": "",
-    "text": "Mllama\nLlama4\nPixtral\nLlava-1.5\nMistral-Small-3.1\nGemma-3\nGemma-3n\nQwen2-VL\nQwen2.5-VL",
+    "text": "Mllama\nLlama4\nPixtral\nLlava-1.5\nMistral-Small-3.1\nVoxtral\nGemma-3\nGemma-3n\nQwen2-VL\nQwen2.5-VL\nSmolVLM2\nLFM2-VL",
    "crumbs": [
      "How To Guides",
      "MultiModal / Vision Language Models (BETA)"
@@ -585,7 +585,7 @@
    "href": "docs/multimodal.html#usage",
    "title": "MultiModal / Vision Language Models (BETA)",
    "section": "Usage",
-    "text": "Usage\nMultimodal support is limited and doesn’t have full feature parity.\nHere are the hyperparams you’ll need to use to finetune a multimodal model.\nprocessor_type: AutoProcessor\n\nskip_prepare_dataset: true\nremove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training\nsample_packing: false  # not yet supported with multimodal\n\nchat_template:  # see in next section\n\n# example dataset\ndatasets:\n  - path: HuggingFaceH4/llava-instruct-mix-vsft\n    type: chat_template\n    split: train[:1%]\n    field_messages: messages\n\n# (optional) if doing lora, only finetune the Language model,\n# leave the vision model and vision tower frozen\n# load_in_8bit: true\nadapter: lora\nlora_target_modules: 'model.language_model.layers.[\\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'\n\n# (optional) if you want to resize images to a set size\nimage_size: 512\nimage_resize_algorithm: bilinear\nPlease see examples folder for full configs.\n\n\n\n\n\n\nWarning\n\n\n\nSome of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.\n\n\n\nMllama\nbase_model: meta-llama/Llama-3.2-11B-Vision-Instruct\n\nchat_template: llama3_2_vision\n\n\nLlama4\nbase_model: meta-llama/Llama-4-Scout-17B-16E-Instruct\n\nchat_template: llama4\n\n\nPixtral\nbase_model: mistralai/Pixtral-12B-2409\n\nchat_template: pixtral\n\n\nLlava-1.5\nbase_model: llava-hf/llava-1.5-7b-hf\n\nchat_template: llava\n\n\nMistral-Small-3.1\nbase_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503\n\nchat_template: mistral_v7_tekken\n\n\nGemma-3\n\n\n\n\n\n\nTip\n\n\n\nThe Gemma3-1B model is a text-only model, so please train as regular text model.\n\n\nFor multi-modal 4B/12B/27B models, use the following config:\nbase_model: google/gemma-3-4b-it\n\nchat_template: gemma3\n\n\nGemma-3n\n\n\n\n\n\n\nWarning\n\n\n\nThe model’s initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.\n\n\n\n\n\n\n\n\nTip\n\n\n\nPlease make sure to install timm via pip3 install timm==1.0.17\n\n\nbase_model: google/gemma-3n-E2B-it\n\nchat_template: gemma3n\n\n\nQwen2-VL\nbase_model: Qwen/Qwen2-VL-7B-Instruct\n\nchat_template: qwen2_vl\n\n\nQwen2.5-VL\nbase_model: Qwen/Qwen2.5-VL-7B-Instruct\n\nchat_template: qwen2_vl  # same as qwen2-vl",
+    "text": "Usage\nMultimodal support is limited and doesn’t have full feature parity.\nHere are the hyperparams you’ll need to use to finetune a multimodal model.\nprocessor_type: AutoProcessor\n\nskip_prepare_dataset: true\nremove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training\nsample_packing: false  # not yet supported with multimodal\n\nchat_template:  # see in next section if specified\n\n# example dataset\ndatasets:\n  - path: HuggingFaceH4/llava-instruct-mix-vsft\n    type: chat_template\n    split: train[:1%]\n    field_messages: messages\n\n# (optional) if doing lora, only finetune the Language model,\n# leave the vision model and vision tower frozen\n# load_in_8bit: true\nadapter: lora\nlora_target_modules: 'model.language_model.layers.[\\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'\n\n# (optional) if you want to resize images to a set size\nimage_size: 512\nimage_resize_algorithm: bilinear\nPlease see examples folder for full configs.\n\n\n\n\n\n\nWarning\n\n\n\nSome of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.\n\n\n\nMllama\nbase_model: meta-llama/Llama-3.2-11B-Vision-Instruct\n\nchat_template: llama3_2_vision\n\n\nLlama4\nbase_model: meta-llama/Llama-4-Scout-17B-16E-Instruct\n\nchat_template: llama4\n\n\nPixtral\nbase_model: mistralai/Pixtral-12B-2409\n\nchat_template: pixtral\n\n\nLlava-1.5\nbase_model: llava-hf/llava-1.5-7b-hf\n\nchat_template: llava\n\n\nMistral-Small-3.1\nbase_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503\n\nchat_template: mistral_v7_tekken\n\n\nVoxtral\n\n\n\n\n\n\nTip\n\n\n\nPlease make sure to install audio lib via pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'\n\n\nbase_model: mistralai/Voxtral-Mini-3B-2507\n\n\nGemma-3\n\n\n\n\n\n\nTip\n\n\n\nThe Gemma3-1B model is a text-only model, so please train as regular text model.\n\n\nFor multi-modal 4B/12B/27B models, use the following config:\nbase_model: google/gemma-3-4b-it\n\nchat_template: gemma3\n\n\nGemma-3n\n\n\n\n\n\n\nWarning\n\n\n\nThe model’s initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.\n\n\n\n\n\n\n\n\nTip\n\n\n\nPlease make sure to install timm via pip3 install timm==1.0.17\n\n\nbase_model: google/gemma-3n-E2B-it\n\nchat_template: gemma3n\n\n\nQwen2-VL\nbase_model: Qwen/Qwen2-VL-7B-Instruct\n\nchat_template: qwen2_vl\n\n\nQwen2.5-VL\nbase_model: Qwen/Qwen2.5-VL-7B-Instruct\n\nchat_template: qwen2_vl  # same as qwen2-vl\n\n\nSmolVLM2\n\n\n\n\n\n\nTip\n\n\n\nPlease make sure to install num2words via pip3 install num2words==0.5.14\n\n\nbase_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct\n\n\nLFM2-VL\n\n\n\n\n\n\nWarning\n\n\n\nPlease uninstall causal-conv1d via pip3 uninstall -y causal-conv1d\n\n\nbase_model: LiquidAI/LFM2-VL-450M",
    "crumbs": [
      "How To Guides",
      "MultiModal / Vision Language Models (BETA)"
@@ -596,7 +596,7 @@
    "href": "docs/multimodal.html#dataset-format",
    "title": "MultiModal / Vision Language Models (BETA)",
    "section": "Dataset Format",
-    "text": "Dataset Format\nFor multi-modal datasets, we adopt an extended chat_template format similar to OpenAI’s Message format.\n\nA message is a list of role and content.\nrole can be system, user, assistant, etc.\ncontent is a list of type and (text, image, path, url, base64, or audio).\n\n\nImage\n\n\n\n\n\n\nNote\n\n\n\nFor backwards compatibility:\n\nIf the dataset has a images or image column of list[Image], it will be appended to the first content list as {\"type\": \"image\", \"image\": ...}. However, if the content already has a {\"type\": \"image\"} but no image key, it will be set the image key.\nIf content is a string, it will be converted to a list with type as text.\n\n\n\nFor image loading, you can use the following keys within content alongside \"type\": \"image\":\n\n\"path\": \"/path/to/image.jpg\"\n\"url\": \"https://example.com/image.jpg\"\n\"base64\": \"...\"\n\"image\": PIL.Image\n\n\n\nAudio\nFor audio loading, you can use the following keys within content alongside \"type\": \"audio\":\n\n\"path\": \"/path/to/audio.mp3\"\n\"url\": \"https://example.com/audio.mp3\"\n\"audio\": np.ndarray\n\n\n\n\n\n\n\nTip\n\n\n\nYou may need to install librosa via pip3 install librosa==0.11.0.\n\n\n\n\nExample\nHere is an example of a multi-modal dataset:\n[\n  {\n    \"messages\": [\n        {\n            \"role\": \"system\",\n            \"content\": [\n              {\"type\": \"text\", \"text\": \"You are a helpful assistant.\"}\n              ]\n        },\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image\", \"url\": \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg\"},\n                {\"type\": \"text\", \"text\": \"Describe this image in detail.\"}\n            ]\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": [\n              {\"type\": \"text\", \"text\": \"The image is a bee.\"}\n            ]\n        }\n    ]\n  }\n]",
+    "text": "Dataset Format\nFor multi-modal datasets, we adopt an extended chat_template format similar to OpenAI’s Message format.\n\nA message is a list of role and content.\nrole can be system, user, assistant, etc.\ncontent is a list of type and (text, image, path, url, base64, or audio).\n\n\nImage\n\n\n\n\n\n\nNote\n\n\n\nFor backwards compatibility:\n\nIf the dataset has a images or image column of list[Image], it will be appended to the first content list as {\"type\": \"image\", \"image\": ...}. However, if the content already has a {\"type\": \"image\"} but no image key, it will be set the image key.\nIf content is a string, it will be converted to a list with type as text.\n\n\n\nFor image loading, you can use the following keys within content alongside \"type\": \"image\":\n\n\"path\": \"/path/to/image.jpg\"\n\"url\": \"https://example.com/image.jpg\"\n\"base64\": \"...\"\n\"image\": PIL.Image\n\n\n\nAudio\nFor audio loading, you can use the following keys within content alongside \"type\": \"audio\":\n\n\"path\": \"/path/to/audio.mp3\"\n\"url\": \"https://example.com/audio.mp3\"\n\"audio\": np.ndarray\n\n\n\n\n\n\n\nTip\n\n\n\nYou may need to install librosa via pip3 install librosa==0.11.0.\n\n\n\n\nVideo\n\n\n\n\n\n\nWarning\n\n\n\nThis is not well tested at the moment. We welcome contributors!\n\n\nFor video loading, you can use the following keys within content alongside \"type\": \"video\":\n\n\"path\": \"/path/to/video.mp4\"\n\"url\": \"https://example.com/video.mp4\"\n\"video\": np.ndarray | list[PIL.Image.Image] | torch.Tensor (or list of the aforementioned)\n\n\n\nExample\nHere is an example of a multi-modal dataset:\n[\n  {\n    \"messages\": [\n        {\n            \"role\": \"system\",\n            \"content\": [\n              {\"type\": \"text\", \"text\": \"You are a helpful assistant.\"}\n              ]\n        },\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image\", \"url\": \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg\"},\n                {\"type\": \"text\", \"text\": \"Describe this image in detail.\"}\n            ]\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": [\n              {\"type\": \"text\", \"text\": \"The image is a bee.\"}\n            ]\n        }\n    ]\n  }\n]",
    "crumbs": [
      "How To Guides",
      "MultiModal / Vision Language Models (BETA)"