avc-phone-ai/bot.py
tocmo0nlord a521dc168e Fix GPU OOM: share one Whisper model across calls (was leaking per call)
Calls were dropping right after answer with "CUDA failed with error out of
memory". Cause: each call constructed a new HintedWhisperSTTService -> new
ctranslate2 WhisperModel on the GPU, and that VRAM was never released when the
call ended. Over ~13 calls the Python process grew to 9.7GB; with the pinned LLM
(6GB) the 16GB GPU filled (14 MiB free) and Whisper load failed on every call.

Fix: cache one WhisperModel per (model,device,compute) in _WHISPER_MODEL_CACHE
and reuse it across all calls; bake the fixed hotwords into the shared model's
transcribe() once (drops the racy per-call monkey-patch). VRAM now constant
(~6GB LLM + ~1.5GB Whisper). Verified: two instances share one model object;
GPU back to 6.0/16GB used after restart. Documented the VRAM budget.
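
A minimal sketch of the caching pattern described above, not the actual code in
bot.py. Only the name _WHISPER_MODEL_CACHE and the (model, device, compute) key
come from this commit; get_shared_whisper_model, _bake_hotwords, the lock, and
the loader argument are hypothetical names introduced for illustration. It also
assumes the model object accepts a `hotwords` keyword in transcribe() and allows
instance attribute assignment.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple

# One model per (model, device, compute_type), shared by every call.
_WHISPER_MODEL_CACHE: Dict[Tuple[str, str, str], Any] = {}
# Hypothetical lock so two calls answering at once cannot both load a model.
_WHISPER_MODEL_LOCK = threading.Lock()


def _bake_hotwords(model_obj: Any, hotwords: str) -> None:
    """Wrap transcribe() once so the fixed hotwords are the default.

    Assumption: transcribe() takes a `hotwords` keyword argument.
    """
    original = model_obj.transcribe

    def transcribe_with_hotwords(*args: Any, **kwargs: Any) -> Any:
        # An explicit hotwords= from the caller still wins.
        kwargs.setdefault("hotwords", hotwords)
        return original(*args, **kwargs)

    model_obj.transcribe = transcribe_with_hotwords


def get_shared_whisper_model(
    model: str,
    device: str,
    compute_type: str,
    loader: Callable[[str, str, str], Any],
    hotwords: Optional[str] = None,
) -> Any:
    """Return the cached model for this key, loading it at most once."""
    key = (model, device, compute_type)
    with _WHISPER_MODEL_LOCK:
        cached = _WHISPER_MODEL_CACHE.get(key)
        if cached is None:
            cached = loader(model, device, compute_type)
            if hotwords:
                # Done once at load time, under the lock, so no call
                # ever patches the shared model while another is using it.
                _bake_hotwords(cached, hotwords)
            _WHISPER_MODEL_CACHE[key] = cached
        return cached
```

With faster-whisper, the loader would presumably be something like
`lambda m, d, c: WhisperModel(m, device=d, compute_type=c)`. Passing the loader
in keeps the sketch runnable without a GPU and makes the "two instances share
one model object" check easy to repeat with a stand-in class.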

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 22:07:59 +00:00

38 KiB