# Arctic Long Sequence Training (ALST)

Arctic Long Sequence Training (ALST) is a technique for training long-context models by combining several memory optimizations:

- **TiledMLP**: Tile the MLP layers' computation over the sequence dimension so that intermediate activations are materialized for only one tile at a time
- **Tiled Loss**: Use optimized loss implementations such as Liger-Kernel or Cut Cross Entropy that avoid materializing the full logits tensor
- **Activation Offloading**: Offload activations to CPU RAM to reduce GPU memory usage
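To illustrate the TiledMLP idea, here is a minimal conceptual sketch (not axolotl's actual implementation): the sequence dimension is split into shards, and each shard is run through the MLP under activation checkpointing, so intermediate activations for a shard are recomputed in the backward pass rather than kept in memory for the whole sequence.

```python
import torch
from torch.utils.checkpoint import checkpoint


def tiled_mlp(mlp, hidden_states, num_shards=4):
    """Apply `mlp` to `hidden_states` one sequence shard at a time.

    hidden_states: (batch, seq_len, hidden_dim). Because the MLP acts on
    each position independently, chunking over the sequence dimension
    does not change the result, only peak activation memory.
    """
    shards = hidden_states.chunk(num_shards, dim=1)
    # Checkpoint each shard: its intermediate activations are freed after
    # the forward pass and recomputed during backward.
    outputs = [checkpoint(mlp, shard, use_reentrant=False) for shard in shards]
    return torch.cat(outputs, dim=1)
```

The output matches a plain `mlp(hidden_states)` call; the saving is in peak activation memory, which scales with the shard length instead of the full sequence length.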

For more information, see the ALST paper.