From 856f9c284d94e0ff2b22eebc43e40c3827697f5d Mon Sep 17 00:00:00 2001
From: tocmo0nlord
Date: Sat, 27 Jun 2026 14:41:01 +0000
Subject: [PATCH] Docs: Phase 1 change log + gate status

Document all post-revert Phase 1 changes (Whisper base->medium, lifespan
LLM warmup/pin keep_alive=-1, num_ctx 8192, call workflow, TTS digit/name
spelling, capture-and-defer dates, insurance never-suggest/guess,
broadened symptom reason capture, hang-up grace, office selection).
Mark gate items: AVC-side termination, AudioHeartbeat, zombie-free,
JSON visibility = done; capacity gating, 10-call consecutive run, and
latency re-measure = still need live testing.

Co-Authored-By: Claude Opus 4.8
---
 CLAUDE.md | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 16ff55f..180989c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -348,9 +348,10 @@ for rollback. Keep the live system prompt lean for the same reason.
 Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while
 transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads** —
 Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid
-a ~3s load. Fix: `server.py` fires a startup warmup that pins the model with `keep_alive=-1`
-(`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some later turns are 8B generation
-variance. Switching Whisper size would NOT help — it's not the bottleneck.
+a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with
+`keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some
+later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the
+bottleneck (STT model `medium` is for accuracy, not latency).
 
 ### Why Q4_K_M not Q8_0
 
@@ -391,20 +392,36 @@ Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.
 
 **Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not the caller.
 
-- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`medium`)
+- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
 - [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
 - [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
-- [ ] Verify `EndCallProcessor` termination in Twilio call logs (AVC side, not caller)
-- [ ] Verify `AudioHeartbeat` diagnostic logging active
-- [ ] Verify `MAX_CONCURRENT_CALLS` capacity gating works
+- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
+- [x] `AudioHeartbeat` diagnostic logging — active (`[audio-in]` ticks ~every 5s)
+- [ ] `MAX_CONCURRENT_CALLS` capacity gating — NOT yet tested (slot reserve/release works; the busy-reject path needs 3 concurrent calls)
 
-**Gate — all five must pass:**
-1. 10 consecutive test calls — zero silent non-responses
-2. Zero zombie pipeline instances after call ends (`ps`/`pgrep` — service runs as a bare
-   systemd/host process, not Docker)
-3. Call termination from AVC side confirmed in Twilio call logs
-4. JSON parse failure rate visible in logs — measurable not invisible
-5. Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
+**Gate — status:**
+1. ⏳ 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
+2. ✅ Zero zombie pipeline instances — single process, slots release to `0/2` each call (`ps`/`pgrep`; bare process, not Docker).
+3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
+4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
+5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.
+
+**Still needs live testing before Phase 1 is signed off:** capacity gating (3 concurrent calls), a clean 10-call consecutive run, and a latency re-measure now that the model is pinned.
+
+### Phase 1 — refinements since the revert
+
+Beyond the three reverted changes, the following hardening is live (see git history):
+
+- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
+- **LLM warmup/pin** — `server.py` `lifespan` handler pins the model with `keep_alive=-1` on startup so the first call turn isn't a cold reload (`ollama ps` → UNTIL = Forever).
+- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
+- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
+- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
+- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
+- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
+- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
+- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
+- **Office selection** — confirm the matching office; never offer/compare others.
 
 ### Phase 2 — Accuracy (RAG + validation)
 
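
Reviewer note (not part of the patch): the warmup/pin behavior described in the latency note can be sketched roughly as below. This is an illustration under assumptions, not the actual `server.py` code. It assumes a FastAPI/Starlette-style `lifespan` context manager and Ollama's default local endpoint. The helper names `build_warmup_payload`, `warm_and_pin`, and `make_lifespan` are hypothetical. The request shape (a `/api/generate` call with only `model` and `keep_alive: -1`, which loads the model and keeps it resident) follows Ollama's documented preload pattern.

```python
import asyncio
import json
import logging
import urllib.request
from contextlib import asynccontextmanager

log = logging.getLogger("warmup")

# Assumptions: default local Ollama endpoint; model name taken from the patch text.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "activeblue-avc:latest"


def build_warmup_payload(model: str) -> dict:
    """Body that loads the model without a prompt and keeps it resident (keep_alive=-1)."""
    return {"model": model, "keep_alive": -1}


def warm_and_pin(model: str = MODEL, url: str = OLLAMA_URL, timeout: float = 120.0) -> int:
    """Blocking POST to Ollama; returns the HTTP status code."""
    body = json.dumps(build_warmup_payload(model)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def make_lifespan(warm=warm_and_pin):
    """Build a lifespan handler; `warm` is injectable so it can be stubbed in tests."""

    @asynccontextmanager
    async def lifespan(app):
        try:
            # Run the blocking HTTP call off the event loop, once, at startup.
            await asyncio.to_thread(warm)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            # Hypothetical policy: start anyway; the first turn then pays the cold load.
            log.warning("LLM warmup failed: %s", exc)
        yield  # the application serves calls here

    return lifespan
```

In a FastAPI app this would be wired as `FastAPI(lifespan=make_lifespan())`. Whether startup should abort when the warmup fails is a design choice the patch does not state; the sketch logs and continues.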