Docs: Phase 1 change log + gate status
Document all post-revert Phase 1 changes (Whisper base->medium, lifespan LLM warmup/pin keep_alive=-1, num_ctx 8192, call workflow, TTS digit/name spelling, capture-and-defer dates, insurance never-suggest/guess, broadened symptom reason capture, hang-up grace, office selection). Mark gate items: AVC-side termination, AudioHeartbeat, zombie-free, JSON visibility = done; capacity gating, 10-call consecutive run, and latency re-measure = still need live testing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
45
CLAUDE.md
@@ -348,9 +348,10 @@ for rollback. Keep the live system prompt lean for the same reason.
 Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while
 transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads** —
 Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid
-a ~3s load. Fix: `server.py` fires a startup warmup that pins the model with `keep_alive=-1`
-(`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some later turns are 8B generation
-variance. Switching Whisper size would NOT help — it's not the bottleneck.
+a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with
+`keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some
+later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the
+bottleneck (STT model `medium` is for accuracy, not latency).
 
 ### Why Q4_K_M not Q8_0
 
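The warm-and-pin step described in this hunk can be sketched roughly as follows. This is a minimal illustration and not the actual `server.py` code: the endpoint URL is the assumed default local Ollama address, and `build_warmup_payload`, `post_json`, and the injectable `send` parameter are hypothetical names introduced here. It relies on Ollama's documented behaviour that a generate request with no prompt loads the model, and that `keep_alive=-1` keeps it loaded indefinitely.

```python
import asyncio
import json
import urllib.request
from contextlib import asynccontextmanager

# Assumptions for illustration: default local Ollama endpoint, model name from the .env change.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "activeblue-avc:latest"


def build_warmup_payload(model: str) -> dict:
    """A generate request with no prompt only loads the model; keep_alive=-1 pins it."""
    return {"model": model, "keep_alive": -1}


def post_json(url: str, payload: dict, timeout: float = 120.0) -> None:
    """Blocking POST of a JSON body; the response content is read and discarded."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


@asynccontextmanager
async def lifespan(app, send=post_json):
    """Startup hook: warm and pin the model before the first call arrives."""
    try:
        # Run the blocking HTTP call in a worker thread so the event loop stays free.
        await asyncio.to_thread(send, OLLAMA_URL, build_warmup_payload(MODEL))
    except OSError as exc:
        # Hypothetical policy: log and keep serving; the first turn then pays the cold load.
        print(f"LLM warmup failed: {exc}")
    yield
```

With FastAPI, a handler of this shape would be passed as `FastAPI(lifespan=lifespan)`. Whether startup should abort or continue when the warmup fails is a design choice; the sketch continues and accepts one cold first turn.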
@@ -391,20 +392,36 @@ Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.
 **Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not
 the caller.
 
-- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`medium`)
+- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
 - [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
 - [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
-- [ ] Verify `EndCallProcessor` termination in Twilio call logs (AVC side, not caller)
-- [ ] Verify `AudioHeartbeat` diagnostic logging active
-- [ ] Verify `MAX_CONCURRENT_CALLS` capacity gating works
+- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
+- [x] `AudioHeartbeat` diagnostic logging — active (`[audio-in]` ticks ~every 5s)
+- [ ] `MAX_CONCURRENT_CALLS` capacity gating — NOT yet tested (slot reserve/release works; the busy-reject path needs 3 concurrent calls)
 
-**Gate — all five must pass:**
-1. 10 consecutive test calls — zero silent non-responses
-2. Zero zombie pipeline instances after call ends (`ps`/`pgrep` — service runs as a bare
-systemd/host process, not Docker)
-3. Call termination from AVC side confirmed in Twilio call logs
-4. JSON parse failure rate visible in logs — measurable not invisible
-5. Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
+**Gate — status:**
+1. ⏳ 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
+2. ✅ Zero zombie pipeline instances — single process, slots release to `0/2` each call (`ps`/`pgrep`; bare process, not Docker).
+3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
+4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
+5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.
+
+**Still needs live testing before Phase 1 is signed off:** capacity gating (3 concurrent calls), a clean 10-call consecutive run, and a latency re-measure now that the model is pinned.
+
+### Phase 1 — refinements since the revert
+
+Beyond the three reverted changes, the following hardening is live (see git history):
+
+- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
+- **LLM warmup/pin** — `server.py` `lifespan` handler pins the model with `keep_alive=-1` on startup so the first call turn isn't a cold reload (`ollama ps` → UNTIL = Forever).
+- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
+- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
+- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
+- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
+- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
+- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
+- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
+- **Office selection** — confirm the matching office; never offer/compare others.
 
 ### Phase 2 — Accuracy (RAG + validation)
 
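The untested busy-reject path of the `MAX_CONCURRENT_CALLS` gate can be illustrated with a small slot counter. This is a sketch under assumptions, not the project's implementation: `CallSlots` and its method names are hypothetical, and the cap of 2 is inferred from the `0/2` slot readout in the gate status.

```python
import threading


class CallSlots:
    """Counts active calls and refuses new ones once the cap is reached."""

    def __init__(self, max_calls: int) -> None:
        self._max = max_calls
        self._active = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        """Take a slot if one is free; return False so the caller can reject as busy."""
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Give a slot back when a call ends; never drops below zero."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    def status(self) -> str:
        """Readout in the same active/max form as the `0/2` log value."""
        with self._lock:
            return f"{self._active}/{self._max}"


# Three simultaneous calls against a cap of two: the third is refused.
slots = CallSlots(max_calls=2)
results = [slots.try_reserve() for _ in range(3)]
print(results, slots.status())
```

This is why the checklist item needs three concurrent calls: with a cap of two, the reject branch is only reached by the third reservation.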
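For the latency re-measure (gate item 5), the median and P95 could be computed from per-turn samples along these lines. The sample values below are invented for illustration and are not measurements from this system; `percentile_nearest_rank` is a hypothetical helper using the nearest-rank definition, which is one of several common percentile conventions.

```python
import statistics


def percentile_nearest_rank(samples, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding surprises.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


# Made-up latencies in seconds (STT end-of-utterance to first TTS audio), one per turn.
latencies = [0.21, 0.24, 0.26, 0.26, 0.25, 0.30, 0.26, 0.28, 0.27, 3.10]

median = statistics.median(latencies)
p95 = percentile_nearest_rank(latencies, 95)
gate_passes = p95 < 3.0
print(f"median={median:.2f}s p95={p95:.2f}s gate_passes={gate_passes}")
```

With a small batch, a single slow turn sets the P95, so a low median and a failing P95 can coexist. A larger fresh batch gives a more stable reading.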
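The digit-by-digit reading mentioned in the TTS refinement can be sketched as a text pre-processing step. How `SpokenKokoroTTSService` actually does this is not shown in the diff; `spell_digits` and the choice to emit number words are assumptions made for this example.

```python
# Hypothetical mapping; the real service may use a different spoken form.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_digits(text: str) -> str:
    """Return the digits in `text` as separate words, skipping punctuation and spaces."""
    return " ".join(DIGIT_WORDS[ch] for ch in text if ch in DIGIT_WORDS)


print(spell_digits("555-0100"))
print(spell_digits("90210"))
```

Spelling each digit as its own word keeps a TTS engine from reading a phone number or zip code as one large quantity.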