# Nemotron-H (nvidia/NVIDIA-Nemotron-3-*)

Hybrid Mamba2 / Attention / MoE architecture (`model_type: nemotron_h`).

| Model                                  | Total params | Active params | Layers |
| -------------------------------------- | ------------ | ------------- | ------ |
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 120B         | ~12B          | 88     |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16    | 30B          | ~3B           |        |
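For orientation, pointing an axolotl config at one of these checkpoints looks like the sketch below; the full hub ID is an assumption based on the `nvidia/NVIDIA-Nemotron-3-*` prefix above.

```yaml
# Assumed hub ID (nvidia/ prefix from the title); swap in the Super variant for the 120B model
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```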

## Requirements

```bash
pip install mamba-ssm causal-conv1d   # fast Mamba2 CUDA kernels
```

## Architecture notes

- Each layer is one of three block types: Mamba2 (selective SSM), Attention (sparse), or MoE (mixture-of-experts).
- Only ~12 out of 88 blocks are attention layers (120B variant).
- The MLP activation is `relu2`, set via `mlp_hidden_act` (not the usual `hidden_act`).

## LoRA kernel patches

All three LoRA Triton kernel patches must be disabled:

```yaml
lora_qkv_kernel: false   # attention lives in NemotronHBlock.mixer, not layer.self_attn
lora_o_kernel: false     # same reason
lora_mlp_kernel: false   # relu2 (mlp_hidden_act) is not supported by lora_mlp_kernel
```

## MoE expert weights

NemotronH experts store `up_proj` and `down_proj` as 3D `nn.Parameter` tensors (shape `[num_experts, out_dim, in_dim]`), not `nn.Linear` modules, and there is no `gate_proj`. To fine-tune them alongside attention, use `lora_target_parameters` instead of `lora_target_modules`:

```yaml
lora_target_parameters:
  - up_proj
  - down_proj
```
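Putting the two sections above together, the adapter-specific part of a config ends up looking roughly like the sketch below. `adapter: qlora` and `load_in_4bit` are standard axolotl fields included here as an assumption about the overall setup; the remaining keys are exactly the ones discussed above.

```yaml
# Hedged sketch of the adapter-specific settings for NemotronH
adapter: qlora
load_in_4bit: true        # assumption: a QLoRA setup

# Kernel patches are incompatible with this architecture (see "LoRA kernel patches")
lora_qkv_kernel: false
lora_o_kernel: false
lora_mlp_kernel: false

# Target the 3D expert parameters directly (see "MoE expert weights")
lora_target_parameters:
  - up_proj
  - down_proj
```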

## Limitations

- **MoE Triton kernels**: `lora_mlp_kernel` is not supported for NemotronH's MoE expert layers. The expert weights are 3D `nn.Parameter` tensors (not `nn.Linear`), which the Triton kernel does not support. Keep `lora_mlp_kernel: false`.
- **Gradient checkpointing**: only supported when `sample_packing: true`. Without sample packing the upstream model marks `supports_gradient_checkpointing = False` (see the snippet below).
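So if you want activation checkpointing, enable both options together; a minimal sketch using standard axolotl options:

```yaml
# Gradient checkpointing only takes effect when sample packing is enabled
sample_packing: true
gradient_checkpointing: true
```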