Docs: Phase 1 change log + gate status

Document all post-revert Phase 1 changes (Whisper base->medium, lifespan LLM
warmup/pin keep_alive=-1, num_ctx 8192, call workflow, TTS digit/name spelling,
capture-and-defer dates, insurance never-suggest/guess, broadened symptom reason
capture, hang-up grace, office selection). Mark gate items: AVC-side termination,
AudioHeartbeat, zombie-free, JSON visibility = done; capacity gating, 10-call
consecutive run, and latency re-measure = still need live testing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tocmo0nlord
2026-06-27 14:41:01 +00:00
parent 550550975f
commit 856f9c284d


@@ -348,9 +348,10 @@ for rollback. Keep the live system prompt lean for the same reason.
Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while
transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads**:
Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid
a ~3s load. Fix: `server.py` fires a startup warmup that pins the model with `keep_alive=-1`
(`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some later turns are 8B generation
variance. Switching Whisper size would NOT help — it's not the bottleneck.
a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with
`keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some
later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the
bottleneck (STT model `medium` is for accuracy, not latency).
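
A minimal sketch of the warmup/pin step described above, not the actual `server.py` code. It assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and an ASGI framework that accepts an async context manager as its `lifespan` handler. The names `build_warmup_request` and `send` are hypothetical helpers. A generate request with no prompt loads the model, and `keep_alive=-1` asks Ollama to keep it resident.

```python
import asyncio
import json
import urllib.request
from contextlib import asynccontextmanager

# Assumptions: Ollama's default local endpoint; model name taken from OLLAMA_MODEL in .env.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "activeblue-avc:latest"


def build_warmup_request(model: str = MODEL, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a prompt-less generate request: it loads the model and pins it with keep_alive=-1."""
    payload = {"model": model, "keep_alive": -1}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> None:
    """Hypothetical sender: blocking POST; generous timeout because a cold load can take seconds."""
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()


@asynccontextmanager
async def lifespan(app, sender=send):
    # Runs once at startup, before any call is accepted, so the first turn is not a cold reload.
    await asyncio.to_thread(sender, build_warmup_request())
    yield
    # Nothing to undo on shutdown: the model stays pinned until Ollama itself unloads it.
```

After startup, `ollama ps` should list the model with UNTIL = Forever if the pin took effect.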
### Why Q4_K_M not Q8_0
@@ -391,20 +392,36 @@ Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.
**Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not
the caller.
- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`medium`)
- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
- [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
- [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
- [ ] Verify `EndCallProcessor` termination in Twilio call logs (AVC side, not caller)
- [ ] Verify `AudioHeartbeat` diagnostic logging active
- [ ] Verify `MAX_CONCURRENT_CALLS` capacity gating works
- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
- [x] `AudioHeartbeat` diagnostic logging active (`[audio-in]` ticks ~every 5s)
- [ ] `MAX_CONCURRENT_CALLS` capacity gating — NOT yet tested (slot reserve/release works; the busy-reject path needs 3 concurrent calls)
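
For reference while testing the busy-reject path, here is a minimal sketch of slot reserve/release under a concurrency cap. The class `CallSlots` and its methods are hypothetical and not taken from the codebase; the cap of 2 is inferred from the `0/2` slot readout.

```python
import threading


class CallSlots:
    """Hypothetical slot counter: reserve on call start, release on call end."""

    def __init__(self, max_calls: int = 2):
        self._max = max_calls
        self._active = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        """Return True and take a slot, or False when at capacity (the busy-reject path)."""
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Give a slot back; never drops below zero even if called twice."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    def status(self) -> str:
        """Readout in the same active/max form as the log line, e.g. '0/2'."""
        with self._lock:
            return f"{self._active}/{self._max}"
```

With a cap of 2, the third simultaneous `try_reserve()` is the first one that returns `False`, which is why the live test needs 3 concurrent calls.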
**Gate — all five must pass:**
1. 10 consecutive test calls — zero silent non-responses
2. Zero zombie pipeline instances after call ends (`ps`/`pgrep` — service runs as a bare
systemd/host process, not Docker)
3. Call termination from AVC side confirmed in Twilio call logs
4. JSON parse failure rate visible in logs — measurable not invisible
5. Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
**Gate — status:**
1. 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
2. Zero zombie pipeline instances — single process, slots release to `0/2` each call (`ps`/`pgrep`; bare process, not Docker).
3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.
**Still needs live testing before Phase 1 is signed off:** capacity gating (3 concurrent calls), a clean 10-call consecutive run, and a latency re-measure now that the model is pinned.
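
One way to re-measure gate item 5 on a fresh batch is sketched below. It assumes per-turn latencies (STT end-of-utterance to first TTS audio, in seconds) have already been collected into a list. The helper names are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one used for the earlier ~3.18s figure.

```python
import statistics


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def latency_gate(samples: list[float], limit_s: float = 3.0) -> dict:
    """Summarise a batch and report whether P95 is under the gate limit."""
    p95 = percentile_nearest_rank(samples, 95)
    return {
        "median": statistics.median(samples),
        "p95": p95,
        "passes": p95 < limit_s,
    }
```

Because P95 is driven by the slowest few turns, a small batch can fail on a single spike; a larger batch gives a more stable reading.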
### Phase 1 — refinements since the revert
Beyond the three reverted changes, the following hardening is live (see git history):
- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
- **LLM warmup/pin** — `server.py` `lifespan` handler pins the model with `keep_alive=-1` on startup so the first call turn isn't a cold reload (`ollama ps` → UNTIL = Forever).
- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
- **Office selection** — confirm the matching office; never offer/compare others.
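
As an illustration of the digit-by-digit reading in the TTS item above, here is a minimal sketch. It is not the `SpokenKokoroTTSService` implementation; `spell_digits` is a hypothetical helper that simply drops punctuation and speaks each digit as a word.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_digits(text: str) -> str:
    """Return the ASCII digits in text as space-separated words, ignoring everything else."""
    return " ".join(DIGIT_WORDS[ch] for ch in text if ch in DIGIT_WORDS)


# A phone number is read one digit at a time instead of as a large number.
print(spell_digits("(555) 010-2368"))
```

Feeding the TTS engine pre-spelled words avoids it reading a zip code or phone number as a single quantity.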
### Phase 2 — Accuracy (RAG + validation)