Log/surface the reason, pin LLM warm for latency, doc insurance rule
- Reason visibility: the reason WAS extracted ("disintegrated eyes") but lived
  only in the Odoo description note. Add it to the post-call log line and to
  the Odoo lead title so it's visible at a glance.
- Latency: split the timing. Whisper STT takes ~0.1s, so the latency is
  LLM-side. The ~3s tail was cold model reloads after Ollama's keep-alive
  expired. server.py now warms and pins the model on startup (keep_alive=-1;
  ollama ps shows UNTIL=Forever), removing cold first-turn stalls. Whisper
  model size left alone (not the bottleneck).
- CLAUDE.md: add the insurance rule (never suggest/guess the plan) and a
  latency note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
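
The timing split described in the latency bullet can be reproduced with a small per-stage timer. This is a minimal sketch, not the code in server.py: `stage_timer`, the `timings` dict, and the stage names are hypothetical, and the `time.sleep` calls are placeholders for the real STT and LLM calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings, stage):
    """Record the wall-clock duration of one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


timings = {}
with stage_timer(timings, "stt"):
    time.sleep(0.01)  # placeholder for the Whisper transcription call
with stage_timer(timings, "llm"):
    time.sleep(0.05)  # placeholder for the LLM completion call

# The stage with the largest duration is the one worth optimizing first.
slowest = max(timings, key=timings.get)
print({k: round(v, 3) for k, v in timings.items()}, "slowest:", slowest)
```

Logging the per-stage numbers on every turn, rather than a single end-to-end figure, is what makes it possible to tell an STT problem from an LLM-side one.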
21
server.py
@@ -64,6 +64,27 @@ BUSY_MESSAGE = os.environ.get(
app = FastAPI()


@app.on_event("startup")
async def _warm_llm():
    """Pin the LLM in VRAM (keep_alive=-1) so the first turn of a call isn't a cold model
    reload. Cold reloads were adding ~3s of dead air to the first reply; latency is otherwise
    LLM-side (Whisper STT is ~0.1s). Best-effort; a failure here never blocks startup."""
    import httpx

    base = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434/v1").rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    model = os.environ.get("OLLAMA_MODEL", "activeblue-avc:latest")
    try:
        async with httpx.AsyncClient(timeout=120) as c:
            await c.post(f"{base}/api/generate",
                         json={"model": model, "prompt": "ok", "stream": False, "keep_alive": -1})
        logger.info(f"Warmed + pinned Ollama model {model} (keep_alive=-1)")
    except Exception as e:
        logger.warning(f"LLM warmup failed (first call may be slow): {e!r}")


# Live count of active /ws pipelines (the real GPU consumers), guarded by a lock.
_active_calls = 0
_active_lock = asyncio.Lock()
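
The URL handling and request body used by `_warm_llm` can be pulled into pure functions so they are checkable without a running Ollama instance. A minimal sketch under assumptions: `native_base`, `warm_payload`, and `is_loaded` are hypothetical helper names that do not appear in server.py, and the sample `/api/ps` payload is hand-written for illustration, containing only the `models[].name` field this sketch relies on.

```python
def native_base(url: str) -> str:
    """Strip a trailing OpenAI-compatible '/v1' so the native /api/* routes can be used."""
    base = url.rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return base


def warm_payload(model: str) -> dict:
    """Body for POST /api/generate; keep_alive=-1 asks Ollama to keep the model loaded."""
    return {"model": model, "prompt": "ok", "stream": False, "keep_alive": -1}


def is_loaded(ps_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed GET /api/ps response."""
    return any(m.get("name") == model for m in ps_payload.get("models", []))


# Hand-written sample, not captured from a real server.
sample_ps = {"models": [{"name": "activeblue-avc:latest"}]}

print(native_base("http://127.0.0.1:11434/v1/"))
print(warm_payload("activeblue-avc:latest"))
print(is_loaded(sample_ps, "activeblue-avc:latest"))
```

One caveat when reading the hunk above: httpx does not raise on non-2xx responses unless `raise_for_status()` is called on the response, so a request for a model that does not exist would still reach the "Warmed + pinned" log line. Checking the response status, or confirming the model is listed by `/api/ps`, would route that case to the warning branch instead.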