From 3a01ba3a16b1517bae937a3b8169cf27e5480d0b Mon Sep 17 00:00:00 2001
From: Quarto GHA Workflow Runner
Date: Wed, 6 Aug 2025 13:53:34 +0000
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll                                      |    2 +-
 .../monkeypatch.llama_attn_hijack_flash.html   |  232 +---
 ...monkeypatch.mistral_attn_hijack_flash.html  |  230 +---
 docs/config-reference.html                     | 1069 +++++++++-------
 docs/custom_integrations.html                  |    2 +-
 .../colab-axolotl-example.html                 |    2 +-
 search.json                                    |   31 +-
 sitemap.xml                                    |  396 +++---
 8 files changed, 758 insertions(+), 1206 deletions(-)

diff --git a/.nojekyll b/.nojekyll
index 1cac6b7cb..fdea9e03f 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-441b49d6
\ No newline at end of file
+8501682d
\ No newline at end of file

diff --git a/docs/api/monkeypatch.llama_attn_hijack_flash.html b/docs/api/monkeypatch.llama_attn_hijack_flash.html
index 01db9d0c7..280d8c041 100644
--- a/docs/api/monkeypatch.llama_attn_hijack_flash.html
+++ b/docs/api/monkeypatch.llama_attn_hijack_flash.html

monkeypatch.llama_attn_hijack_flash

Flash attention monkey patch for the Llama model.

Classes

Name               Description
FusedAttention     Fused QKV attention layer for incrementally improved training efficiency.
LlamaDecoderLayer  Patched version of LlamaDecoderLayer to pass through the precalculated cu_seqlens.

FusedAttention

monkeypatch.llama_attn_hijack_flash.FusedAttention(config, q, k, v, o)

Fused QKV attention layer for incrementally improved training efficiency.
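
The fusion concatenates the separate query, key, and value projection weights into one linear layer, so a single matrix multiply replaces three. A minimal sketch of that idea, assuming the layer wraps the q_proj/k_proj/v_proj weights of a stock LlamaAttention module; the class name and shapes below are illustrative, not the library's exact code:

import torch
import torch.nn as nn

class FusedQKVSketch(nn.Module):
    """Illustrative stand-in for the fused layer; not the library's exact code."""

    def __init__(self, q_proj: nn.Linear, k_proj: nn.Linear, v_proj: nn.Linear):
        super().__init__()
        hidden = q_proj.in_features
        out_features = q_proj.out_features + k_proj.out_features + v_proj.out_features
        # One weight matrix holding q, k, and v stacked row-wise: a single GEMM
        # now does the work of three separate projections. (Llama projections
        # carry no bias, so biases are ignored here.)
        self.qkv = nn.Linear(hidden, out_features, bias=False)
        with torch.no_grad():
            self.qkv.weight.copy_(
                torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0)
            )
        self.splits = (q_proj.out_features, k_proj.out_features, v_proj.out_features)

    def forward(self, hidden_states: torch.Tensor):
        # Project once, then split the result back into q, k, v.
        return self.qkv(hidden_states).split(self.splits, dim=-1)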

LlamaDecoderLayer

monkeypatch.llama_attn_hijack_flash.LlamaDecoderLayer()

Patched version of LlamaDecoderLayer to pass through the precalculated cu_seqlens.

Methods

Name     Description
forward

forward

monkeypatch.llama_attn_hijack_flash.LlamaDecoderLayer.forward(
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
    padding_mask=None,
    cu_seqlens=None,
    max_seqlen=None,
)

Parameters

hidden_states (torch.FloatTensor, required):
    Input to the layer, of shape (batch, seq_len, embed_dim).
attention_mask (torch.FloatTensor, optional, default None):
    Attention mask of size (batch, 1, tgt_len, src_len), where padding elements are indicated by very large negative values.
output_attentions (bool, optional, default False):
    Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
use_cache (bool, optional, default False):
    If set to True, past_key_values key/value states are returned and can be used to speed up decoding (see past_key_values).
past_key_value (Tuple(torch.FloatTensor), optional, default None):
    Cached past key and value projection states.
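
The cu_seqlens and max_seqlen arguments this layer passes through describe a packed, unpadded batch: cumulative token boundaries plus the longest sequence length, as used by flash-attention's variable-length kernels. A small sketch of how such values are derived; the helper name is hypothetical, not part of the library:

import torch

def cu_seqlens_from_lengths(lengths):
    # e.g. [3, 5, 2] -> tensor([0, 3, 8, 10], dtype=torch.int32)
    boundaries = torch.cumsum(torch.tensor([0] + list(lengths)), dim=0)
    return boundaries.to(torch.int32)

# Three sequences packed back-to-back into one unpadded batch of 10 tokens.
cu_seqlens = cu_seqlens_from_lengths([3, 5, 2])
max_seqlen = 5  # length of the longest packed sequence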

Functions

@@ -642,123 +522,35 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); - - - - - - - -
flashattn_forwardInput shape: Batch x Time x Channel
flashattn_forward_with_s2attn Input shape: Batch x Time x Channel
generate_qkv
-
-

flashattn_forward

monkeypatch.llama_attn_hijack_flash.flashattn_forward(
    self,
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
    padding_mask=None,
    cu_seqlens=None,
    max_seqlen=None,
)

Input shape: Batch x Time x Channel

attention_mask: [bsz, q_len]
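
Because flashattn_forward is a module-level function whose first parameter is self, the patch can be applied by assigning it over the original method. A hedged sketch of that wiring, assuming the transformers Llama implementation is the target and using this page's module name as the import path (the real entry point in the library may differ):

import transformers.models.llama.modeling_llama as modeling_llama

from monkeypatch.llama_attn_hijack_flash import flashattn_forward

# Replace the stock attention forward so every LlamaAttention instance,
# existing and future, routes through the flash-attention implementation.
modeling_llama.LlamaAttention.forward = flashattn_forward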

flashattn_forward_with_s2attn

monkeypatch.llama_attn_hijack_flash.flashattn_forward_with_s2attn(
    self,
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
    padding_mask=None,
    cu_seqlens=None,
    max_seqlen=None,
)

Input shape: Batch x Time x Channel

From: https://github.com/dvlab-research/LongLoRA/blob/main/llama_attn_replace.py

attention_mask: [bsz, q_len]

cu_seqlens will be ignored if provided; max_seqlen will be ignored if provided.

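The s2attn suffix refers to LongLoRA's shifted sparse attention: tokens attend within fixed-size groups, and half the heads are shifted by half a group so information can flow across group boundaries. A minimal sketch of that grouping, assuming seq_len divides evenly by group_size; the function name and shapes are illustrative:

import torch

def s2attn_group(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    # qkv: (bsz, seq_len, num_heads, head_dim); seq_len must divide evenly
    # into groups of group_size tokens.
    bsz, seq_len, num_heads, head_dim = qkv.shape
    # Shift the second half of the heads by half a group along the sequence
    # axis so attention in those heads straddles group boundaries.
    qkv = torch.cat(
        [
            qkv[:, :, : num_heads // 2],
            qkv[:, :, num_heads // 2 :].roll(-group_size // 2, dims=1),
        ],
        dim=2,
    )
    # Fold the groups into the batch dimension: attention computed on the
    # result is local to each group of group_size tokens.
    return qkv.reshape(bsz * (seq_len // group_size), group_size, num_heads, head_dim)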

generate_qkv

monkeypatch.llama_attn_hijack_flash.generate_qkv(
    q,
    k,
    v,
    query_padding_mask=None,
    key_padding_mask=None,
    kvpacked=False,
    qkvpacked=False,
)

Parameters

q ((batch_size, seqlen_q, nheads, d), required)
k ((batch_size, seqlen_k, nheads_k, d), required)
v ((batch_size, seqlen_k, nheads_k, d), required)
query_padding_mask ((batch_size, seqlen), bool, optional, default None)
key_padding_mask ((batch_size, seqlen), bool, optional, default None)
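
A hedged usage sketch: build padded q/k/v tensors plus boolean padding masks and let generate_qkv derive the unpadded, variable-length layout consumed by flash-attention kernels. Shapes and dtypes are illustrative, and the exact structure of the returned tuple is not asserted here:

import torch

from monkeypatch.llama_attn_hijack_flash import generate_qkv

bsz, seqlen, nheads, d = 2, 16, 8, 64
q = torch.randn(bsz, seqlen, nheads, d, dtype=torch.float16)
k = torch.randn(bsz, seqlen, nheads, d, dtype=torch.float16)
v = torch.randn(bsz, seqlen, nheads, d, dtype=torch.float16)

# Boolean masks: True marks real tokens, False marks padding
# (the second sequence here is 9 tokens long).
lengths = torch.tensor([16, 9])
padding_mask = torch.arange(seqlen)[None, :] < lengths[:, None]

# The result bundles the unpadded tensors with the cu_seqlens/max_seqlen
# bookkeeping needed by the varlen kernels.
outputs = generate_qkv(
    q, k, v, query_padding_mask=padding_mask, key_padding_mask=padding_mask
)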