Arctic Long Sequence Training (ALST)
Arctic Long Sequence Training (ALST) is a technique for training long-context models that combines several memory optimizations:
- TiledMLP: tiles the MLP computation over the sequence dimension to reduce activation memory usage
- Tiled loss: uses optimized loss implementations such as Liger-Kernel or Cut Cross Entropy to reduce memory usage
- Activation offloading: offloads activations to CPU RAM to reduce GPU memory usage
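Because an MLP block acts position-wise, splitting the sequence into tiles and running the block on each tile produces the same result while only materializing the large intermediate activation for one tile at a time. The following is a minimal NumPy sketch of that idea under stated assumptions; the function names (`mlp`, `tiled_mlp`) and the ReLU stand-in activation are illustrative and not the repository's actual implementation:

```python
import numpy as np

def mlp(x, w1, w2):
    # Simple 2-layer MLP: expand, apply activation, contract.
    # The intermediate h has shape (seq_len, hidden), which dominates memory.
    h = np.maximum(x @ w1, 0.0)  # ReLU as a stand-in activation
    return h @ w2

def tiled_mlp(x, w1, w2, num_tiles=4):
    # Split the sequence axis into tiles and run the MLP on each tile.
    # Since the MLP is position-wise, the result is numerically identical,
    # but the intermediate activation is only (seq_len / num_tiles, hidden).
    tiles = np.array_split(x, num_tiles, axis=0)
    return np.concatenate([mlp(t, w1, w2) for t in tiles], axis=0)
```

In a real training setup the tiling must also interact correctly with autograd and sharded parameters (e.g. FSDP), which is what the repository's TiledMLP handles.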
For more information, see the ALST paper here.