diff --git a/docs/multimodal.qmd b/docs/multimodal.qmd
index ec51a8ec3..df12f6e68 100644
--- a/docs/multimodal.qmd
+++ b/docs/multimodal.qmd
@@ -132,7 +132,9 @@ For multi-modal datasets, we adopt an extended `chat_template` format similar to
 
 - A message is a list of `role` and `content`.
 - `role` can be `system`, `user`, `assistant`, etc.
-- `content` is a list of `type` and (`text` or `image` or `path` or `url` or `base64`).
+- `content` is a list of items, each with a `type` and one of `text`, `image`, `path`, `url`, `base64`, or `audio`.
+
+### Image
 
 ::: {.callout-note}
 For backwards compatibility:
@@ -141,14 +143,22 @@ For backwards compatibility:
 - If `content` is a string, it will be converted to a list with `type` as `text`.
 :::
 
-::: {.callout-tip}
 For image loading, you can use the following keys within `content` alongside `"type": "image"`:
 
 - `"path": "/path/to/image.jpg"`
 - `"url": "https://example.com/image.jpg"`
 - `"base64": "..."`
 - `"image": PIL.Image`
-:::
+
+### Audio
+
+For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:
+
+- `"path": "/path/to/audio.mp3"`
+- `"url": "https://example.com/audio.mp3"`
+- `"audio": np.ndarray`
+
+### Example
 
 Here is an example of a multi-modal dataset:
 ```json