super nemo support (#3508)
* nemo support
* config
* rename, config
* nemotron packing
* config fix
* readme + configs
* gc compat bug
* config changes for qwen and pad token nemo
* patch nemotron_h weight renaming so it doesn't get reversed to embedding (singular noun) on checkpoint save
* lint
* revert qwen3.5 config changes, not needed in this pr
* lint
* Update examples/nemotron-h/120b-a12b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update examples/nemotron-h/nano-30b-a3b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* readme + validation
* lazy load comment
* Update examples/nemotron-h/120b-a12b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* val fix
* add nemo to multi packing

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
examples/nemotron-h/120b-a12b-qlora.yaml (new file)
@@ -0,0 +1,74 @@
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

# LoRA kernel patches are incompatible with this architecture — see README.
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

chat_template: tokenizer_default
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train[:20%]
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

val_set_size: 0.0
output_dir: ./outputs/out
dataset_prepared_path: last_run_prepared

sequence_len: 4096
sample_packing: true

use_cut_cross_entropy: true

load_in_4bit: true
quantize_moe_experts: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.0
lora_target_modules:
  # Attention projection layers (present in ~12 attention layers out of 88)
  - q_proj
  - k_proj
  - v_proj
  - o_proj
# To also train MoE expert weights, add them via lora_target_parameters
# (they are 3D nn.Parameter tensors, not nn.Linear — no gate_proj):
# lora_target_parameters:
#   - up_proj
#   - down_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
saves_per_epoch: 1
weight_decay: 0.0

special_tokens:
examples/nemotron-h/README.md (new file)
@@ -0,0 +1,48 @@
# Nemotron-H (nvidia/NVIDIA-Nemotron-3-*)

Hybrid Mamba2 / Attention / MoE architecture (`model_type: nemotron_h`).

| Model | Total params | Active params | Layers |
|---|---|---|---|
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 120B | ~12B | 88 |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | ~3B | — |
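The two example configs in this directory differ mainly in `base_model` (plus a couple of throughput settings such as `gradient_accumulation_steps`); switching checkpoints is a one-line change:

```yaml
# Super (120B) variant:
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
# Nano (30B) variant:
# base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```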
## Requirements

```bash
pip install mamba-ssm causal-conv1d  # fast Mamba2 CUDA kernels
```
## Architecture notes

- Three block types alternate through the stack: **Mamba2** (selective SSM), **Attention** (sparse), **MoE** (mixture-of-experts).
- Only ~12 out of 88 blocks are attention layers (120B variant).
- MLP activation is `relu2` via `mlp_hidden_act` (not the usual `hidden_act`).
## LoRA kernel patches

All three LoRA Triton kernel patches must be disabled:

```yaml
lora_qkv_kernel: false  # attention lives in NemotronHBlock.mixer, not layer.self_attn
lora_o_kernel: false    # same reason
lora_mlp_kernel: false  # relu2 (mlp_hidden_act) is not supported by lora_mlp_kernel
```
## MoE expert weights

NemotronH experts store `up_proj` and `down_proj` as 3D `nn.Parameter` tensors
(shape `[num_experts, out_dim, in_dim]`), **not** `nn.Linear` modules — there is no
`gate_proj`. To fine-tune them alongside attention, use `lora_target_parameters`
instead of `lora_target_modules`:

```yaml
lora_target_parameters:
  - up_proj
  - down_proj
```
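Put together, a sketch of a target section that trains both the attention projections and the expert tensors; all keys and names below mirror the example configs in this directory (including their commented-out `lora_target_parameters` block):

```yaml
# Attention projections are nn.Linear modules -> lora_target_modules
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
# Expert weights are 3D nn.Parameter tensors -> lora_target_parameters
lora_target_parameters:
  - up_proj
  - down_proj
```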
## Limitations

- **MoE Triton kernels**: `lora_mlp_kernel` is not supported for NemotronH's MoE expert layers. The expert weights are 3D `nn.Parameter` tensors (not `nn.Linear`), which the Triton kernel does not support. Keep `lora_mlp_kernel: false`.
- **Gradient checkpointing**: Only supported when `sample_packing: true`. Without sample packing, the upstream model sets `supports_gradient_checkpointing = False`; see the sketch below for the required pairing.
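A minimal sketch of that pairing, with the values used by the example configs in this directory:

```yaml
# Gradient checkpointing only works here together with sample packing
sample_packing: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
```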
examples/nemotron-h/nano-30b-a3b-qlora.yaml (new file)
@@ -0,0 +1,74 @@
# See examples/nemotron-h/README.md for architecture notes and requirements.
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# LoRA kernel patches are incompatible with this architecture — see README.
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

chat_template: tokenizer_default
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train[:20%]
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

val_set_size: 0.0
output_dir: ./outputs/out
dataset_prepared_path: last_run_prepared

sequence_len: 4096
sample_packing: true

use_cut_cross_entropy: true

load_in_4bit: true
quantize_moe_experts: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.0
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
# To also train MoE expert weights, add them via lora_target_parameters
# (they are 3D nn.Parameter tensors, not nn.Linear — no gate_proj):
# lora_target_parameters:
#   - up_proj
#   - down_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0

special_tokens: