axolotl/docs/attention.qmd

---
title: Attention
description: Supported attention modules in Axolotl
---

## SDP Attention

This is the default built-in attention in PyTorch.

```yaml
sdp_attention: true
```

For more details: [PyTorch docs](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)

## Flash Attention 2

Uses efficient kernels to compute attention.

```yaml
flash_attention: true
```

For more details: [Flash Attention](https://github.com/Dao-AILab/flash-attention/)

### Nvidia

Requirements: Ampere, Ada, or Hopper GPUs

Note: For Turing GPUs or lower, please use other attention methods.

```bash
pip install flash-attn --no-build-isolation
```

::: {.callout-tip}

If you get `undefined symbol` while training, ensure you installed PyTorch prior to Axolotl. Alternatively, try reinstall or downgrade a version.

:::

#### Flash Attention 3

Requirements: Hopper only and CUDA 12.8 (recommended)

```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper

python setup.py install
```

### AMD

Requirements: ROCm 6.0 and above.

See [Flash Attention AMD docs](https://github.com/Dao-AILab/flash-attention/tree/main?tab=readme-ov-file#amd-rocm-support).

## Flex Attention

A flexible PyTorch API for attention used in combination with `torch.compile`.

```yaml
flex_attention: true

# recommended
torch_compile: true
```

::: {.callout-note}

We recommend using latest stable version of PyTorch for best performance.

:::

For more details: [PyTorch docs](https://pytorch.org/blog/flexattention/)

## SageAttention

Attention kernels with QK Int8 and PV FP16 accumulator.

```yaml
sage_attention: true
```

Requirements: Ampere, Ada, or Hopper GPUs

```bash
pip install sageattention==2.2.0 --no-build-isolation
```

::: {.callout-warning}

Only LoRA/QLoRA recommended at the moment. We found loss drop to 0 for full finetuning. See [GitHub Issue](https://github.com/thu-ml/SageAttention/issues/198).

:::

For more details: [Sage Attention](https://github.com/thu-ml/SageAttention)

::: {.callout-note}

We do not support SageAttention 3 at the moment. If you are interested on adding this or improving SageAttention implementation, please make an Issue.

:::


## xFormers

```yaml
xformers_attention: true
```

::: {.callout-tip}

We recommend using with Turing GPUs or below (such as on Colab).

:::

For more details: [xFormers](https://github.com/facebookresearch/xformers)

## Shifted Sparse Attention

::: {.callout-warning}

We plan to deprecate this! If you use this feature, we recommend switching to methods above.

:::

Requirements: LLaMA model architecture

```yaml
flash_attention: true
s2_attention: true
```

::: {.callout-tip}

No sample packing support!

:::