add e2e tests for checking functionality of resume from checkpoint (#865)

* use tensorboard to see if resume from checkpoint works

* make sure e2e test is either fp16 or bf16

* set max_steps and save limit so we have the checkpoint when testing resuming

* fix test parameters
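
The resume check described in the bullets above can be sketched roughly as follows. This is a minimal illustration and not the test added in this commit: the helper name `resumed_past_checkpoint`, the checkpoint step of 30, and the 40-step cap are all assumptions chosen for the example. The idea is that a run which truly resumed should only log steps after the checkpoint, while a run that silently restarted logs from the beginning again.

```python
from typing import Iterable


def resumed_past_checkpoint(logged_steps: Iterable[int], checkpoint_step: int) -> bool:
    """Return True if every logged step comes after the checkpoint step.

    Hypothetical helper: `logged_steps` stands in for the global steps read
    from the resumed run's training logs (for example, TensorBoard scalars).
    """
    steps = list(logged_steps)
    if not steps:
        # Nothing was logged, so the resumed run cannot be verified.
        return False
    return min(steps) > checkpoint_step


# Assumed values: a checkpoint saved at step 30 and a run capped at 40 steps.
resumed_steps = range(31, 41)  # a run that picked up after the checkpoint
fresh_steps = range(1, 41)     # a run that restarted from scratch

print(resumed_past_checkpoint(resumed_steps, 30))  # True
print(resumed_past_checkpoint(fresh_steps, 30))    # False
```

Capping the run with a step limit and keeping enough saved checkpoints, as the commit message notes, is what guarantees the checkpoint still exists on disk when the second run tries to resume from it.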
Wing Lian
2023-11-15 23:05:55 -05:00
committed by GitHub
parent 8a8d1c4023
commit b3a61e8ce2
4 changed files with 109 additions and 1 deletions

@@ -101,6 +101,7 @@ class TestLoraLlama(unittest.TestCase):
                 "learning_rate": 0.00001,
                 "optimizer": "adamw_torch",
                 "lr_scheduler": "cosine",
+                "bf16": True,
             }
         )
         normalize_config(cfg)