From ad0c825bcbede899f833476b229f515573f6787b Mon Sep 17 00:00:00 2001
From: lhl
Date: Tue, 28 Oct 2025 10:18:17 +0000
Subject: [PATCH] sample packing and telemetry docs

---
 src/axolotl/integrations/aux_free_router/README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/axolotl/integrations/aux_free_router/README.md b/src/axolotl/integrations/aux_free_router/README.md
index 1b77e49eb..a84ec17cd 100644
--- a/src/axolotl/integrations/aux_free_router/README.md
+++ b/src/axolotl/integrations/aux_free_router/README.md
@@ -39,3 +39,13 @@ Compatibility Notes
 - If you also enable Liger’s aux-loss paths, the plugin neutralizes aux loss when aux-free is on.
 - Telemetry: logs per-layer min/mean/max token loads, `|bias| max`, and bias sign flip fraction at the configured interval.
+- Sample packing: packed batches are compatible with aux-free routing. Because load counts are accumulated on-device per expert before reduction, packing tends to smooth token histograms and reduce bias oscillation. Keep `pad_to_sequence_len: true` when packing to preserve the target token budget per expert.
+
+Telemetry metrics
+- `moe_afb/l{idx}_load_min|mean|max`: token frequency per expert after reduction (0–1 range, sums to 1).
+- `moe_afb/l{idx}_bias_abs_max`: absolute maximum of the learned bias for the layer.
+- `moe_afb/l{idx}_bias_sign_flip_frac`: fraction of experts whose bias sign changed since the previous step (simple oscillation indicator).
+
+Usage tips
+- Leave `moe_afb_telemetry_interval` unset to log on the Trainer’s `logging_steps`. Increase the interval for large jobs to reduce log volume.
+- Compare aux-free vs. aux-loss load metrics by plotting the `load_*` series; aux-free typically tightens min/max spread without the auxiliary loss term.
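
Review note: the metric definitions added under "Telemetry metrics" can be illustrated with a short plain-Python sketch. This is not the plugin's implementation. The function name `afb_telemetry`, its arguments, and the choice to treat a zero bias as its own sign are assumptions made for illustration only; the metric key names are the ones listed in the patch.

```python
from typing import Dict, Sequence


def _sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def afb_telemetry(
    layer_idx: int,
    token_counts: Sequence[int],
    bias: Sequence[float],
    prev_bias: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical helper: compute the documented metrics for one layer.

    token_counts: tokens routed to each expert (already reduced across ranks).
    bias / prev_bias: per-expert bias at the current and previous step.
    """
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("token_counts must contain at least one routed token")
    if not (len(token_counts) == len(bias) == len(prev_bias)):
        raise ValueError("all inputs must have one entry per expert")

    # Per-expert token fraction: each value is in [0, 1] and they sum to 1.
    loads = [count / total for count in token_counts]

    # Count experts whose bias sign differs from the previous step.
    flips = sum(1 for b, p in zip(bias, prev_bias) if _sign(b) != _sign(p))

    prefix = f"moe_afb/l{layer_idx}_"
    return {
        prefix + "load_min": min(loads),
        # Because the fractions sum to 1, the mean equals 1 / num_experts.
        prefix + "load_mean": sum(loads) / len(loads),
        prefix + "load_max": max(loads),
        prefix + "bias_abs_max": max(abs(b) for b in bias),
        prefix + "bias_sign_flip_frac": flips / len(bias),
    }


metrics = afb_telemetry(
    layer_idx=0,
    token_counts=[10, 30, 40, 20],
    bias=[0.5, -0.2, 0.1, -0.3],
    prev_bias=[0.4, 0.1, 0.1, -0.1],
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example only the second expert's bias changes sign, so the flip fraction is one in four. The gap between `load_min` and `load_max` is the spread that the usage tip suggests comparing between aux-free and aux-loss runs.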