
Arctic Long Sequence Training (ALST)

Arctic Long Sequence Training (ALST) is a technique for training long-context models by combining several memory optimizations:

  • TiledMLP: tiles the MLP computation over the sequence dimension to reduce peak activation memory (see the sketch after this list)
  • Tiled loss: uses an optimized loss implementation such as Liger Kernel or Cut Cross Entropy to reduce memory usage
  • Activation offloading: offloads activations to CPU RAM to reduce GPU memory usage
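To make the tiling idea concrete, here is a minimal PyTorch sketch. This is an illustration only, not Axolotl's implementation: the class name, the num_shards parameter, and the dimensions are invented for the example.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TiledMLP(nn.Module):
    """Toy MLP that tiles computation over the sequence dimension.

    The MLP treats every token independently, so the sequence can be
    split into shards. Checkpointing each shard means only one shard's
    intermediate (hidden -> 4*hidden) activation is stored at a time;
    the rest are recomputed during the backward pass.
    """

    def __init__(self, hidden: int, num_shards: int = 4):
        super().__init__()
        self.up = nn.Linear(hidden, 4 * hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(4 * hidden, hidden)
        self.num_shards = num_shards

    def _mlp(self, shard: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(shard)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden); process seq_len in tiles
        shards = x.chunk(self.num_shards, dim=1)
        out = [checkpoint(self._mlp, s, use_reentrant=False) for s in shards]
        return torch.cat(out, dim=1)


# A 32k-token sequence is processed 8k tokens at a time.
mlp = TiledMLP(hidden=256, num_shards=4)
y = mlp(torch.randn(1, 32768, 256, requires_grad=True))
y.sum().backward()

Checkpointing each tile trades a second forward computation for a bounded activation footprint, which is the same trade ALST makes at long sequence lengths.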

For more information, see the ALST paper.

Usage

tiled_mlp: true
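
# Offload activations to CPU RAM (ALST's activation offloading component).
# The option name here is assumed; check the Axolotl config reference for
# your version.
activation_offloading: true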

# See Sequence Parallelism docs
# https://docs.axolotl.ai/docs/sequence_parallelism.html
context_parallel_size: int

plugins:
  # Use one of the following loss plugins:
  # See Cut Cross Entropy docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

  # or Liger Kernel docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#liger-kernels
  - axolotl.integrations.liger.LigerPlugin
  # ...
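
With these options in place, launch training as usual with the Axolotl CLI (the config filename below is a placeholder):

axolotl train config.yml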