# Arctic Long Sequence Training (ALST)
Arctic Long Sequence Training (ALST) is a technique for training long-context models that combines several memory optimizations:
- TiledMLP: tiles the MLP computation over the sequence dimension so that only one tile's intermediate activations are live at a time, reducing memory usage
- Tiled loss: uses memory-efficient loss implementations such as Liger Kernel or Cut Cross Entropy to avoid materializing the full logit matrix
- Activation offloading: moves activations to CPU RAM during the forward pass and brings them back for the backward pass
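To make the TiledMLP idea concrete, here is a minimal conceptual sketch in PyTorch. The class name, shapes, and tile count are illustrative, not Axolotl's implementation; it pairs sequence-dimension tiling with per-tile checkpointing so each tile's large intermediate activation is discarded and recomputed during backward:

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

class TiledMLP(nn.Module):
    """Conceptual sketch: run an MLP over sequence tiles so the large
    intermediate activation exists for one tile at a time."""

    def __init__(self, hidden: int, intermediate: int, num_tiles: int = 4):
        super().__init__()
        self.up = nn.Linear(hidden, intermediate)
        self.down = nn.Linear(intermediate, hidden)
        self.num_tiles = num_tiles

    def _mlp(self, tile: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(tile)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> split along the sequence dimension
        tiles = x.chunk(self.num_tiles, dim=1)
        # Checkpointing each tile drops its (tile, intermediate) activation
        # after the forward pass; it is recomputed in the backward pass.
        outs = [cp.checkpoint(self._mlp, t, use_reentrant=False) for t in tiles]
        return torch.cat(outs, dim=1)
```

The memory saving comes from the checkpointing: without it, chunking alone would still keep every tile's intermediate activation alive for backward.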
For more information, see the ALST paper.
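Activation offloading can be sketched with PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager, which moves tensors saved for backward to CPU RAM and copies them back when gradients are computed. This is a minimal illustration of the mechanism, not Axolotl's offloading code path:

```python
import torch
import torch.nn as nn

# Tensors saved for backward inside this context live in CPU RAM
# instead of accelerator memory (set pin_memory=True on CUDA for
# faster, overlapped host<->device copies).
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(4, 64, requires_grad=True)
with torch.autograd.graph.save_on_cpu(pin_memory=False):
    loss = model(x).sum()
loss.backward()  # saved activations are restored to compute gradients
```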
## Usage
```yaml
tiled_mlp: true

# See Sequence Parallelism docs
# https://docs.axolotl.ai/docs/sequence_parallelism.html
context_parallel_size: int

plugins:
  # See Cut Cross Entropy docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  # or Liger Kernel docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#liger-kernels
  - axolotl.integrations.liger.LigerPlugin
  # ...
```
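The tiled-loss plugins above avoid materializing the full `(sequence, vocab)` logit matrix at once. The idea can be sketched in plain PyTorch; the function name and tiling scheme here are illustrative, while the actual Liger Kernel and Cut Cross Entropy integrations are fused GPU kernels:

```python
import torch
import torch.nn.functional as F

def tiled_cross_entropy(hidden: torch.Tensor,
                        lm_head_weight: torch.Tensor,
                        labels: torch.Tensor,
                        num_tiles: int = 4) -> torch.Tensor:
    """Conceptual sketch: project hidden states to vocab logits one
    sequence tile at a time, so only one tile's logits are live."""
    total = torch.zeros((), dtype=hidden.dtype)
    for h, y in zip(hidden.chunk(num_tiles, dim=0),
                    labels.chunk(num_tiles, dim=0)):
        logits = h @ lm_head_weight.t()  # (tile_len, vocab) for this tile only
        total = total + F.cross_entropy(logits, y, reduction="sum")
    return total / labels.numel()  # mean over all positions
```

Summing per-tile losses and dividing by the total token count reproduces the standard mean cross-entropy, so the tiling changes memory usage but not the computed loss.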