Attention
SDP Attention
This is the default built-in attention in PyTorch.
sdp_attention: true
For more details: PyTorch docs
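For illustration, a hypothetical minimal config fragment enabling SDP attention might look like the following; the base_model value is only a placeholder.
# hypothetical config fragment; base_model is a placeholder
base_model: meta-llama/Llama-2-7b-hf
# use PyTorch's built-in scaled dot product attention
sdp_attention: true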
Flash Attention 2
Uses efficient fused kernels to compute exact attention with lower memory usage.
flash_attention: true
For more details: Flash Attention
Nvidia
Requirements: Ampere, Ada, or Hopper GPUs
Note: For Turing GPUs or older, please use one of the other attention methods.
pip install flash-attn --no-build-isolation
If you get an undefined symbol error while training, make sure you installed PyTorch before Axolotl. Alternatively, try reinstalling flash-attn or downgrading to an earlier version.
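As a sketch, flash attention is often paired with sample packing in the config; the sample_packing key is assumed here from Axolotl's general options and is not required.
# hypothetical config fragment
flash_attention: true
# sample packing is commonly enabled together with flash attention
sample_packing: true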
Flash Attention 3
Requirements: Hopper GPUs only, CUDA 12.8 (recommended)
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
AMD
Requirements: ROCm 6.0 and above.
Flex Attention
A flexible PyTorch attention API, used in combination with torch.compile.
flex_attention: true

# recommended
torch_compile: true
We recommend using the latest stable version of PyTorch for best performance.
For more details: PyTorch docs
SageAttention
Attention kernels with QK in INT8 and PV with an FP16 accumulator.
sage_attention: true
Requirements: Ampere, Ada, or Hopper GPUs
pip install sageattention==2.2.0 --no-build-isolation
Only LoRA/QLoRA is recommended at the moment. We found the loss drops to 0 with full fine-tuning. See GitHub Issue.
For more details: Sage Attention
We do not support SageAttention 3 at the moment. If you are interested in adding this or improving the SageAttention implementation, please open an Issue.
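Following the LoRA/QLoRA recommendation above, a hypothetical config fragment pairing SageAttention with QLoRA could look like this; adapter and load_in_4bit are standard Axolotl adapter options and the values shown are only illustrative.
# hypothetical config fragment
sage_attention: true
# prefer LoRA/QLoRA over full fine-tuning with SageAttention
adapter: qlora
load_in_4bit: true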
xFormers
Memory-efficient attention from the xFormers library.
xformers_attention: true
We recommend using this with Turing GPUs or older (such as on Colab).
For more details: xFormers
Shifted Sparse Attention
We plan to deprecate this! If you use this feature, we recommend switching to one of the methods above.
Requirements: LLaMA model architecture
flash_attention: true
s2_attention: true
No sample packing support!
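For completeness, a hypothetical config fragment using shifted sparse attention, with sample packing explicitly disabled since it is not supported; the sample_packing key is assumed from Axolotl's general options.
# hypothetical config fragment for a LLaMA-architecture model
flash_attention: true
s2_attention: true
# sample packing is not supported with shifted sparse attention
sample_packing: false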