super nemo support (#3508)

* nemo support
* config
* rename, config
* nemotron packing
* config fix
* readme + configs
* gc compat bug
* config changes for qwen and pad token nemo
* patch nemotron_h weight renaming so it doesn't get reversed to embedding (singular noun) on checkpoint save
* lint
* revert qwen3.5 config changes, not needed in this pr
* lint
* Update examples/nemotron-h/120b-a12b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update examples/nemotron-h/nano-30b-a3b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* readme + validation
* lazy load comment
* Update examples/nemotron-h/120b-a12b-qlora.yaml
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* val fix
* add nemo to multi packing

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2026-03-31 03:42:50 +05:30
parent 00dee05fc6
commit bb622b83de
15 changed files with 651 additions and 7 deletions
--- a/examples/nemotron-h/120b-a12b-qlora.yaml
+++ b/examples/nemotron-h/120b-a12b-qlora.yaml
@@ -0,0 +1,74 @@
+base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
+
+# LoRA kernel patches are incompatible with this architecture — see README.
+lora_mlp_kernel: false
+lora_qkv_kernel: false
+lora_o_kernel: false
+
+chat_template: tokenizer_default
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train[:20%]
+    field_messages: conversations
+    message_property_mappings:
+      role: from
+      content: value
+
+val_set_size: 0.0
+output_dir: ./outputs/out
+dataset_prepared_path: last_run_prepared
+
+sequence_len: 4096
+sample_packing: true
+
+use_cut_cross_entropy: true
+
+load_in_4bit: true
+quantize_moe_experts: true
+adapter: qlora
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.0
+lora_target_modules:
+  # Attention projection layers (present in ~12 attention layers out of 88)
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+  # To also train MoE expert weights, add the separate top-level key
+  # lora_target_parameters (experts are 3D nn.Parameter tensors, not nn.Linear; no gate_proj):
+  #   lora_target_parameters:
+  #     - up_proj
+  #     - down_proj
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_torch_4bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+bf16: auto
+tf32: true
+
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 2
+saves_per_epoch: 1
+weight_decay: 0.0
+
+special_tokens:
--- a/examples/nemotron-h/README.md
+++ b/examples/nemotron-h/README.md
@@ -0,0 +1,48 @@
+# Nemotron-H (nvidia/NVIDIA-Nemotron-3-*)
+
+Hybrid Mamba2 / Attention / MoE architecture (`model_type: nemotron_h`).
+
+| Model | Total params | Active params | Layers |
+|---|---|---|---|
+| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 120B | ~12B | 88 |
+| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | ~3B | — |
+
+## Requirements
+
+```bash
+pip install mamba-ssm causal-conv1d   # fast Mamba2 CUDA kernels
+```
+
+## Architecture notes
+
+- Each layer is one of three block types: **Mamba2** (selective SSM), **Attention** (sparse), or **MoE** (mixture-of-experts).
+- Only ~12 out of 88 blocks are attention layers (120B variant).
+- MLP activation is `relu2` via `mlp_hidden_act` (not the usual `hidden_act`).
+
+## LoRA kernel patches
+
+All three LoRA Triton kernel patches must be disabled:
+
+```yaml
+lora_qkv_kernel: false   # attention lives in NemotronHBlock.mixer, not layer.self_attn
+lora_o_kernel: false     # same reason
+lora_mlp_kernel: false   # relu2 (mlp_hidden_act) is not supported by lora_mlp_kernel
+```
+
+## MoE expert weights
+
+NemotronH experts store `up_proj` and `down_proj` as 3D `nn.Parameter` tensors
+(shape `[num_experts, out_dim, in_dim]`), **not** `nn.Linear` modules — there is no
+`gate_proj`. To fine-tune them alongside attention, use `lora_target_parameters`
+instead of `lora_target_modules`:
+
+```yaml
+lora_target_parameters:
+  - up_proj
+  - down_proj
+```
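+
+As a sketch of how the two settings combine, the fragment below targets the attention projections through `lora_target_modules` and the expert tensors through `lora_target_parameters`. It reuses only keys and names from the example configs in this directory; whether the additional adapter weights fit in memory on a given setup is an assumption to verify.
+
+```yaml
+adapter: qlora
+lora_target_modules:        # nn.Linear attention projections
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+lora_target_parameters:     # 3D nn.Parameter expert tensors (separate top-level key)
+  - up_proj
+  - down_proj
+```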
+
+## Limitations
+
+- **MoE Triton kernels**: `lora_mlp_kernel` is not supported for NemotronH's MoE expert layers. The expert weights are 3D `nn.Parameter` tensors (not `nn.Linear`), which the Triton kernel does not support. Keep `lora_mlp_kernel: false`.
+- **Gradient checkpointing**: Only supported when `sample_packing: true`. Without sample packing the upstream model marks `supports_gradient_checkpointing = False`.
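+
+A minimal sketch of settings consistent with both limitations above, copied from the example configs in this directory rather than from a separate reference:
+
+```yaml
+sample_packing: true              # gradient checkpointing is only supported with packing
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+lora_mlp_kernel: false            # Triton MLP kernel does not handle the 3D expert tensors
+```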
--- a/examples/nemotron-h/nano-30b-a3b-qlora.yaml
+++ b/examples/nemotron-h/nano-30b-a3b-qlora.yaml
@@ -0,0 +1,74 @@
+# See examples/nemotron-h/README.md for architecture notes and requirements.
+base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+
+# LoRA kernel patches are incompatible with this architecture — see README.
+lora_mlp_kernel: false
+lora_qkv_kernel: false
+lora_o_kernel: false
+
+chat_template: tokenizer_default
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train[:20%]
+    field_messages: conversations
+    message_property_mappings:
+      role: from
+      content: value
+
+val_set_size: 0.0
+output_dir: ./outputs/out
+dataset_prepared_path: last_run_prepared
+
+sequence_len: 4096
+sample_packing: true
+
+use_cut_cross_entropy: true
+
+load_in_4bit: true
+quantize_moe_experts: true
+adapter: qlora
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.0
+lora_target_modules:
+  - q_proj
+  - k_proj
+  - v_proj
+  - o_proj
+  # To also train MoE expert weights, add the separate top-level key
+  # lora_target_parameters (experts are 3D nn.Parameter tensors, not nn.Linear; no gate_proj):
+  #   lora_target_parameters:
+  #     - up_proj
+  #     - down_proj
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 2
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_torch_4bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+bf16: auto
+tf32: true
+
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 4
+saves_per_epoch: 1
+weight_decay: 0.0
+
+special_tokens: