---
title: Optimizers
description: Configuring optimizers
---

## Overview

Axolotl supports all optimizers supported by [transformers OptimizerNames](https://github.com/huggingface/transformers/blob/51f94ea06d19a6308c61bbb4dc97c40aabd12bad/src/transformers/training_args.py#L142-L187).

Here is the list of optimizers supported by transformers as of `v4.54.0` (an example config follows the list):

- `adamw_torch`
- `adamw_torch_fused`
- `adamw_torch_xla`
- `adamw_torch_npu_fused`
- `adamw_apex_fused`
- `adafactor`
- `adamw_anyprecision`
- `adamw_torch_4bit`
- `adamw_torch_8bit`
- `ademamix`
- `sgd`
- `adagrad`
- `adamw_bnb_8bit`
- `adamw_8bit` (alias for `adamw_bnb_8bit`)
- `ademamix_8bit`
- `lion_8bit`
- `lion_32bit`
- `paged_adamw_32bit`
- `paged_adamw_8bit`
- `paged_ademamix_32bit`
- `paged_ademamix_8bit`
- `paged_lion_32bit`
- `paged_lion_8bit`
- `rmsprop`
- `rmsprop_bnb`
- `rmsprop_bnb_8bit`
- `rmsprop_bnb_32bit`
- `galore_adamw`
- `galore_adamw_8bit`
- `galore_adafactor`
- `galore_adamw_layerwise`
- `galore_adamw_8bit_layerwise`
- `galore_adafactor_layerwise`
- `lomo`
- `adalomo`
- `grokadamw`
- `schedule_free_radam`
- `schedule_free_adamw`
- `schedule_free_sgd`
- `apollo_adamw`
- `apollo_adamw_layerwise`
- `stable_adamw`
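
To use one of these, set it as the `optimizer` value in your config. A minimal sketch (the choice of `adamw_torch_fused` and the `learning_rate` value are only illustrative):

```yaml
# pick any optimizer name from the list above
optimizer: adamw_torch_fused
learning_rate: 0.0002
```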

## Custom Optimizers

Enable a custom optimizer by passing its name as a string to the `optimizer` argument. Each optimizer receives the beta and epsilon args; however, some accept additional args, which are detailed below.
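
As a minimal sketch of those shared arguments (the values shown are simply the common AdamW defaults, not requirements, and any of the optimizers below can be substituted):

```yaml
optimizer: optimi_adamw  # or any other custom optimizer listed below
# shared beta/epsilon args passed to the chosen optimizer
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-8
```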

### optimi_adamw

```yaml
optimizer: optimi_adamw
```

### ao_adamw_4bit

Deprecated: Please use `adamw_torch_4bit`.

### ao_adamw_8bit

Deprecated: Please use `adamw_torch_8bit`.

### ao_adamw_fp8

```yaml
optimizer: ao_adamw_fp8
```

### adopt_adamw

GitHub: [https://github.com/iShohei220/adopt](https://github.com/iShohei220/adopt)
Paper: [https://arxiv.org/abs/2411.02853](https://arxiv.org/abs/2411.02853)

```yaml
optimizer: adopt_adamw
```

### came_pytorch

GitHub: [https://github.com/yangluo7/CAME/tree/master](https://github.com/yangluo7/CAME/tree/master)
Paper: [https://arxiv.org/abs/2307.02047](https://arxiv.org/abs/2307.02047)

```yaml
optimizer: came_pytorch

# optional args (defaults below)
adam_beta1: 0.9
adam_beta2: 0.999
adam_beta3: 0.9999
adam_epsilon: 1e-30
adam_epsilon2: 1e-16
```

### muon

Blog: [https://kellerjordan.github.io/posts/muon/](https://kellerjordan.github.io/posts/muon/)
Paper: [https://arxiv.org/abs/2502.16982v1](https://arxiv.org/abs/2502.16982v1)

```yaml
optimizer: muon
```

### dion

Microsoft's Dion (DIstributed OrthoNormalization) optimizer is a scalable, communication-efficient orthonormalizing optimizer that uses low-rank approximations to reduce gradient communication.

GitHub: [https://github.com/microsoft/dion](https://github.com/microsoft/dion)
Paper: [https://arxiv.org/pdf/2504.05295](https://arxiv.org/pdf/2504.05295)
Note: The implementation requires PyTorch 2.7+ for DTensor support.

```yaml
optimizer: dion
dion_lr: 0.01
dion_momentum: 0.95
learning_rate: 0.00001 # learning rate for embeddings and parameters that fall back to AdamW
```