
# DBRX MoE

Currently, for LoRA, only the `q_proj`, `k_proj`, `v_proj`, `out_proj`, and `layer` Linear layers are trainable.

We are using the "converted" base models based on this issue, where the experts are fused as an `nn.Parameter` rather than an `nn.Linear` layer. However, the implementation is still a bit buggy, and attempting to train a LoRA adapter over the `w1`, `w2` and `v1` layers results in the trainer hanging.
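
For reference, here is a minimal sketch of the corresponding adapter settings in an axolotl YAML config. The key names are axolotl's standard LoRA options; the hyperparameter values are illustrative, not tuned for DBRX.

```yaml
# Illustrative LoRA adapter settings for DBRX (values are examples only).
adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - out_proj
# note: do not target the fused expert weights (w1, w2, v1) -
# training a LoRA adapter over them currently hangs the trainer
```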

## FSDP

We've tested using the `LnL-AI/dbrx-base-converted-v2` model as the base model for FSDP.

The high memory usage seen w/ FSDP is due to FSDP not supporting 8-bit optimizers.

- 16-bit LoRA w/ FSDP
  - w/o CPU Offload - 8x80GB uses ~80GiB/gpu
  - w/ CPU Offload - the `paged_adamw_8bit` optimizer errors because it is placed on CPU
- 8-bit LoRA w/ FSDP
- 4-bit QLoRA w/ FSDP - errors w/: `Error an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu`
- bf16 full finetune w/ FSDP, freezing all but the first 8 layers (8x80GB uses ~78GiB/gpu)
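
As a reference point, below is a hedged sketch of the FSDP-related options for the 16-bit LoRA run without CPU offload, using axolotl's standard FSDP passthrough keys; it is not a complete config. The wrapped block class name (`DbrxBlock`) is an assumption and should be verified against the converted model's remote code.

```yaml
# Illustrative FSDP settings for 16-bit LoRA w/o CPU offload (not a complete config).
base_model: LnL-AI/dbrx-base-converted-v2
trust_remote_code: true

optimizer: adamw_torch  # FSDP does not support 8-bit optimizers

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false  # CPU offload caused optimizer errors in testing (see list above)
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: DbrxBlock  # assumed class name; verify in the model code
  fsdp_activation_checkpointing: true
```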

## Deepspeed

WIP