# DBRX MoE
Currently, for LoRA, only the `q_proj`, `k_proj`, `v_proj`, `out_proj` and `layer` Linear layers are trainable.
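A minimal sketch of the corresponding LoRA section of an axolotl config, assuming axolotl's standard key names (the `lora_r`/`lora_alpha`/`lora_dropout` values are placeholders, not tuned recommendations):

```yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:  # only these Linear layers currently train cleanly for DBRX
  - q_proj
  - k_proj
  - v_proj
  - out_proj
  - layer
```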
We are using the "converted" base models based on this issue
where the Experts are fused as an nn.Parameter rather than a nn.Linear layer. However, the implementation
is still a bit buggy and attempting to train a LoRA adapter over those w1, w2 and v1 layers
results in the trainer hanging.
## FSDP
We've tested using the `LnL-AI/dbrx-base-converted-v2` model as the base model for FSDP.

The high memory usage seen w/ FSDP is due to FSDP not supporting 8-bit optimizers. Current status by configuration (a config sketch follows the list):
- 16-bit LoRA w/ FSDP
  - ✅ w/o CPU Offload - 8x80GB uses ~80GiB/gpu
  - ❌ w/ CPU Offload - `paged_adamw_8bit` optimizer errors from being on CPU
- ✅ 8-bit LoRA w/ FSDP
- ❌ 4-bit QLoRA w/ FSDP - errors w/: `Error an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu`
- ✅ bf16 full finetune w/ FSDP, freezing all but the first 8 layers (8x80GB uses ~78GiB/gpu)
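As a rough sketch, the FSDP-related portion of an axolotl config for the working 16-bit LoRA setup might look like the following. The key names follow axolotl's FSDP config conventions; `DbrxBlock` as the transformer layer class to wrap is an assumption about the converted model's block class and should be checked against the actual module names.

```yaml
base_model: LnL-AI/dbrx-base-converted-v2

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false             # CPU offload errors with paged_adamw_8bit (see above)
  fsdp_state_dict_type: FULL_STATE_DICT  # save a consolidated state dict
  fsdp_activation_checkpointing: true    # activation checkpointing keeps memory at ~80GiB/gpu
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: DbrxBlock  # assumption: DBRX decoder block class name
```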
## Deepspeed

WIP
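Deepspeed support is still in flux. One known wrinkle for MoE models under ZeRO-3 is that the fused expert modules need to be treated as ZeRO-3 "leaf" modules so their parameters are gathered as a unit rather than partitioned per-submodule. If you want to experiment regardless, an axolotl config would point at a ZeRO-3 config file; the path below assumes the configs shipped in axolotl's `deepspeed_configs/` directory:

```yaml
# hypothetical: reference a ZeRO-3 deepspeed config from the axolotl config
deepspeed: deepspeed_configs/zero3_bf16.json
```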