From 856f9c284d94e0ff2b22eebc43e40c3827697f5d Mon Sep 17 00:00:00 2001
From: tocmo0nlord <mr.garcia09@gmail.com>
Date: Sat, 27 Jun 2026 14:41:01 +0000
Subject: [PATCH] Docs: Phase 1 change log + gate status

Document all post-revert Phase 1 changes (Whisper base->medium, lifespan LLM
warmup/pin keep_alive=-1, num_ctx 8192, call workflow, TTS digit/name spelling,
capture-and-defer dates, insurance never-suggest/guess, broadened symptom reason
capture, hang-up grace, office selection). Mark gate items: AVC-side termination,
AudioHeartbeat, zombie-free, JSON visibility = done; capacity gating, 10-call
consecutive run, and latency re-measure = still need live testing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 CLAUDE.md | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 16ff55f..180989c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -348,9 +348,10 @@ for rollback. Keep the live system prompt lean for the same reason.
 Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while
 transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads** —
 Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid
-a ~3s load. Fix: `server.py` fires a startup warmup that pins the model with `keep_alive=-1`
-(`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some later turns are 8B generation
-variance. Switching Whisper size would NOT help — it's not the bottleneck.
+a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with
+`keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some
+later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the
+bottleneck (STT model `medium` is for accuracy, not latency).
 
 ### Why Q4_K_M not Q8_0
 
@@ -391,20 +392,36 @@ Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.
 **Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not
 the caller.
 
-- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`medium`)
+- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
 - [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
 - [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
-- [ ] Verify `EndCallProcessor` termination in Twilio call logs (AVC side, not caller)
-- [ ] Verify `AudioHeartbeat` diagnostic logging active
-- [ ] Verify `MAX_CONCURRENT_CALLS` capacity gating works
+- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
+- [x] `AudioHeartbeat` diagnostic logging — active (`[audio-in]` ticks ~every 5s)
+- [ ] `MAX_CONCURRENT_CALLS` capacity gating — NOT yet tested (slot reserve/release works; the busy-reject path needs 3 concurrent calls)
 
-**Gate — all five must pass:**
-1. 10 consecutive test calls — zero silent non-responses
-2. Zero zombie pipeline instances after call ends (`ps`/`pgrep` — service runs as a bare
-   systemd/host process, not Docker)
-3. Call termination from AVC side confirmed in Twilio call logs
-4. JSON parse failure rate visible in logs — measurable not invisible
-5. Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
+**Gate — status:**
+1. ⏳ 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
+2. ✅ Zero zombie pipeline instances — single process, slots release to `0/2` after each call (checked with `ps`/`pgrep`; bare process, not Docker).
+3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
+4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
+5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.
+
+**Still needs live testing before Phase 1 is signed off:** capacity gating (3 concurrent calls), a clean 10-call consecutive run, and a latency re-measure now that the model is pinned.
+
+### Phase 1 — refinements since the revert
+
+Beyond the three reverted changes, the following hardening is live (see git history):
+
+- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
+- **LLM warmup/pin** — `server.py` `lifespan` handler warms and pins the model with `keep_alive=-1` on startup so the first turn of a call doesn't pay a cold reload (`ollama ps` → UNTIL = Forever).
+- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
+- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
+- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
+- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
+- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
+- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
+- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
+- **Office selection** — confirm the matching office; never offer/compare others.
 
 ### Phase 2 — Accuracy (RAG + validation)