Sequence parallelism quick follow-ups; remove ModelCallback (#2450)

* guard return if ring attn already registered

* add docs link, bits in multi-gpu docs, remove save model callback (subsumed by HF trainers)

* configurable heads_k_stride from ring-flash-attn hf adapter
Dan Saunders
2025-03-31 09:13:42 -04:00
committed by GitHub
parent cf0c79d52e
commit 5410195e0b
10 changed files with 56 additions and 31 deletions

@@ -658,6 +658,9 @@ ddp_broadcast_buffers:
# subsequences, or set to 4 to split into four equal-sized subsequences.
# See https://axolotl-ai-cloud.github.io/axolotl/docs/sequence_parallelism.html for more details.
sequence_parallel_degree:
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
# Must evenly divide the number of KV heads in your model.
heads_k_stride: 1
# Path to torch distx for optim 'adamw_anyprecision'
torchdistx_path:
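
The divisibility rule in the `heads_k_stride` comment above can be checked before launching a run. The following is a minimal sketch, not code from this commit: the helper names `valid_heads_k_strides` and `check_heads_k_stride` are hypothetical, and the number of KV heads is assumed to be read by the caller from the model config (for many Hugging Face models this is `num_key_value_heads`, which is also an assumption here).

```python
def valid_heads_k_strides(num_kv_heads: int) -> list[int]:
    """Return every stride that evenly divides the number of KV heads."""
    if num_kv_heads < 1:
        raise ValueError("num_kv_heads must be a positive integer")
    return [s for s in range(1, num_kv_heads + 1) if num_kv_heads % s == 0]


def check_heads_k_stride(heads_k_stride: int, num_kv_heads: int) -> None:
    """Raise ValueError if the stride does not evenly divide the KV head count."""
    if heads_k_stride < 1 or num_kv_heads % heads_k_stride != 0:
        allowed = valid_heads_k_strides(num_kv_heads)
        raise ValueError(
            f"heads_k_stride={heads_k_stride} must evenly divide "
            f"num_kv_heads={num_kv_heads}; allowed values: {allowed}"
        )


if __name__ == "__main__":
    # Hypothetical model with 8 KV heads.
    print(valid_heads_k_strides(8))
    check_heads_k_stride(4, 8)  # passes silently
```

For a model with 8 KV heads, this accepts 1, 2, 4, or 8 and rejects a value such as 3. The default of 1 in the config always satisfies the rule.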