Misc fixes 20250130 (#2301)

* misc fixes for garbage collection and L40S w NCCL P2P

* patch bnb fix for triton check

* chore: lint

* change up import

* try patching differently

* remove patch for bnb fix for now

* more verbose checks and tweak train loss threshold
Wing Lian
2025-01-31 08:58:04 -05:00
committed by GitHub
parent 6f294c3d8d
commit cf17649ef3
5 changed files with 14 additions and 5 deletions


@@ -63,7 +63,7 @@ class TestProcessRewardSmolLM2(unittest.TestCase):
         train(cfg=cfg, dataset_meta=dataset_meta)
         check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss is too high"
+            temp_dir + "/runs", "train/train_loss", 2.7, "Train Loss (%s) is too high"
         )
         check_model_output_exists(temp_dir, cfg)
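
Note on the `%s` placeholder: the new message presumably lets the check report the observed loss when it fails, which fits the "more verbose checks" item in the commit message. The real `check_tensorboard` helper is not part of this hunk, so the following is only a minimal sketch of how such a check could interpolate the value. `check_scalar_below` and its parameter names are hypothetical, and it takes the scalar directly instead of reading TensorBoard event files.

```python
def check_scalar_below(value: float, lt_val: float, assertion_err: str) -> None:
    """Hypothetical stand-in for a threshold check on a logged scalar.

    Raises AssertionError if `value` is not below `lt_val`. If the message
    contains a "%s" placeholder, the observed value is interpolated into it;
    otherwise the message is used unchanged.
    """
    if value < lt_val:
        return
    if "%s" in assertion_err:
        raise AssertionError(assertion_err % value)
    raise AssertionError(assertion_err)


# Passes silently: 2.6 is below the relaxed 2.7 threshold.
check_scalar_below(2.6, 2.7, "Train Loss (%s) is too high")
```

With a message such as "Train Loss (%s) is too high", a failing run reports the actual loss instead of only stating that the threshold was exceeded, which makes flaky-threshold failures easier to triage.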