update doc and use P2P=LOC for brittle grpo test (#2649)

* update doc and skip brittle grpo test

* fix the path to run the multigpu tests

* increase timeout, use LOC instead of NVL

* typo

* use hf cache from s3 backed cloudfront

* mark grpo as flaky test due to vllm start
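
Note on "use LOC instead of NVL": this presumably refers to NCCL's `NCCL_P2P_LEVEL` environment variable, which is an assumption, since the commit message does not name the variable. In NCCL, `LOC` disables GPU peer-to-peer transfers entirely, while `NVL` allows them only between GPUs connected by NVLink. A minimal sketch of how a test could build such an environment for a subprocess follows; the helper name `env_with_p2p_level` is hypothetical and not part of this repository.

```python
import os


def env_with_p2p_level(level="LOC", base_env=None):
    """Return a copy of an environment mapping with NCCL_P2P_LEVEL set.

    Hypothetical helper for illustration only. `level` is passed through
    unchanged; "LOC" asks NCCL never to use P2P between GPUs.
    """
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_LEVEL"] = level
    return env


# Build an environment that could be passed to subprocess.run(..., env=env).
env = env_with_p2p_level("LOC", base_env={"PATH": "/usr/bin"})
print(env["NCCL_P2P_LEVEL"])
```

Passing the variable through a copied environment keeps the setting scoped to the one brittle test instead of changing NCCL behavior for the whole test session.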
Wing Lian
2025-05-12 14:17:25 -04:00
committed by GitHub
parent c7b6790614
commit f34eef546a
6 changed files with 131 additions and 110 deletions


@@ -6,7 +6,7 @@ from .single_gpu import GPU_CONFIG, VOLUME_CONFIG, app, cicd_image, run_cmd
 @app.function(
     image=cicd_image,
     gpu=GPU_CONFIG,
-    timeout=60 * 60,
+    timeout=90 * 60,  # 90 min
     cpu=8.0,
     memory=131072,
     volumes=VOLUME_CONFIG,