# Arctic Long Sequence Training (ALST)

Arctic Long Sequence Training (ALST) is a technique for training long-context models using a combination of memory-saving optimizations:

- TiledMLP: Tiles MLP computations over the sequence dimension to reduce memory usage
- Tiled Loss: Uses memory-efficient loss implementations such as Liger Kernel or Cut Cross Entropy
- Activation Offloading: Offloads activations to CPU RAM to reduce memory usage

For more information, see the [ALST paper](https://www.arxiv.org/abs/2506.13996).

## Usage

```yaml
tiled_mlp: true

# See Sequence Parallelism docs
# https://docs.axolotl.ai/docs/sequence_parallelism.html
# Number of GPUs each sequence is split across
context_parallel_size: int

plugins:
  # See Cut Cross Entropy docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

  # or Liger Kernel docs
  # https://docs.axolotl.ai/docs/custom_integrations.html#liger-kernels
  - axolotl.integrations.liger.LigerPlugin
  # ...
```
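
Putting the pieces together, a combined config might look like the sketch below. This is illustrative rather than authoritative: the `activation_offloading` option name and the concrete `context_parallel_size` value are assumptions here, so verify the exact settings against the linked docs for your Axolotl version.

```yaml
# Illustrative ALST-style setup (assumed/example values; verify against the docs)
tiled_mlp: true

# Assumed option name for offloading activations to CPU RAM
activation_offloading: true

# Example: split each sequence across 4 GPUs
context_parallel_size: 4

plugins:
  # Pick one tiled-loss integration; Liger Kernel is shown here
  - axolotl.integrations.liger.LigerPlugin
```

Training is then launched as usual, e.g. `axolotl train config.yaml`.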