# Nemotron-H (nvidia/NVIDIA-Nemotron-3-*)

Hybrid Mamba2 / Attention / MoE architecture (`model_type: nemotron_h`).

| Model | Total params | Active params | Layers |
|---|---|---|---|
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 120B | ~12B | 88 |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | ~3B | — |
## Requirements

```bash
pip install mamba-ssm causal-conv1d  # fast Mamba2 CUDA kernels
```
## Architecture notes

- Three block types across the layer stack: Mamba2 (selective SSM), Attention (sparse), and MoE (mixture-of-experts).
- Only ~12 out of 88 blocks are attention layers (120B variant).
- MLP activation is `relu2` via `mlp_hidden_act` (not the usual `hidden_act`).
## LoRA kernel patches

All three LoRA Triton kernel patches must be disabled:

```yaml
lora_qkv_kernel: false  # attention lives in NemotronHBlock.mixer, not layer.self_attn
lora_o_kernel: false    # same reason
lora_mlp_kernel: false  # relu2 (mlp_hidden_act) is not supported by lora_mlp_kernel
```
## MoE expert weights

NemotronH experts store `up_proj` and `down_proj` as 3D `nn.Parameter` tensors
(shape `[num_experts, out_dim, in_dim]`), not `nn.Linear` modules; there is no
`gate_proj`. To fine-tune them alongside attention, use `lora_target_parameters`
instead of `lora_target_modules`:

```yaml
lora_target_parameters:
  - up_proj
  - down_proj
```
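
Putting the pieces together, here is a hedged sketch of how these adapter settings might sit in a single config. The `lora_r` and `lora_alpha` values are illustrative placeholders, not tuned recommendations; see the shipped qlora example configs in this directory for reference values.

```yaml
# Hedged sketch of the adapter-related settings discussed above;
# lora_r / lora_alpha are illustrative placeholders, not tuned values.
adapter: qlora
lora_r: 32
lora_alpha: 16

# All three LoRA Triton kernel patches must stay disabled (see above).
lora_qkv_kernel: false
lora_o_kernel: false
lora_mlp_kernel: false

# Expert weights are 3D nn.Parameter tensors, so target them by parameter name.
lora_target_parameters:
  - up_proj
  - down_proj
```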
## Limitations

- MoE Triton kernels: `lora_mlp_kernel` is not supported for NemotronH's MoE expert layers. The expert weights are 3D `nn.Parameter` tensors (not `nn.Linear`), which the Triton kernel does not support. Keep `lora_mlp_kernel: false`.
- Gradient checkpointing: only supported when `sample_packing: true`. Without sample packing the upstream model marks `supports_gradient_checkpointing = False` (see the excerpt after this list).
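
Given that constraint, a minimal excerpt of the relevant training flags (both are standard axolotl config fields):

```yaml
# Minimal sketch: for this model, gradient checkpointing only works together
# with sample packing, so enable both (or leave both off).
sample_packing: true
gradient_checkpointing: true
```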