Fix re-asking: deterministic slot memory + user-turn merge + reason-loop prompt

Historical calls showed the 8B re-asking for name/reason/phone it already had ("I already gave you my full name", the "I want an appointment" -> "what brings you in?" loop) and VAD splitting one utterance into consecutive user turns. - callstate.py: CallStateGroomer between agg.user() and the LLM. After each agent turn (off the critical path) it extracts collected slots via one short JSON-mode Ollama pass, then before each generation injects an ALREADY COLLECTED / STILL NEEDED checklist into the system message and merges VAD-fragmented consecutive user messages. Callback-type calls get an explicit "no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama, off for anthropic). - bot.py prompt step 1: "I want an appointment" is the booking intent, not the reason - ask the visit reason once, never twice. - scripts/ab_replay.py: regression harness replaying the real failed calls. llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate). Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 23:49:39 +00:00
parent bae388420b
commit a47f4b423c
5 changed files with 445 additions and 2 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -83,7 +83,23 @@ audio while the bot is speaking (+`ECHO_TAIL_SECS`, default 0.5s) so echo never
 Trade-off: half-duplex — the caller can't barge in mid-utterance (fine for short replies).
 `HALF_DUPLEX=false` restores barge-in. Keep it on for telephony.

-**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends.
+**`CallStateGroomer` (`callstate.py`) — deterministic slot memory (2026-07-03).** Fixes the
+8B re-asking for things the caller already gave (name, reason, phone — seen repeatedly in the
+historical call logs: "Didn't you say you had my phone number?", "I already gave you my full
+name", the "I want an appointment"→"what brings you in?" loop). Sits between `agg.user()` and
+the LLM. Two jobs: (1) on upstream `BotStoppedSpeakingFrame` (agent finished; Ollama idle,
+caller talking) it runs a ~1.2s JSON-mode extraction over the transcript-so-far — OFF the
+latency-critical path, result applied next turn; (2) on downstream `LLMContextFrame` (right
+before generation) it synchronously merges VAD-fragmented consecutive user messages
+("Monday" / "3 p.m." → one turn) and injects an explicit checklist into the system message:
+`CALL STATE ... ALREADY COLLECTED (NEVER ask again): name=Carlos Garcia ... STILL NEEDED:
+insurance, preferred day/time`. It also carries call type (`callback` → "do NOT ask booking
+questions"). Verified via `scripts/ab_replay.py` (replays the real failed calls): llama3.1-8B
+raw = 3 failures, +CALL STATE = **0 failures**, chat latency 0.31s→0.55s med (system-message
+churn re-evals the prompt; acceptable, still ≪ the 3s gate). Env: `CALL_STATE_TRACKING`
+(default: on for ollama, off for anthropic — Claude tracks state fine on its own; extraction
+always runs on the local Ollama model). Qwen3-14B was A/B'd as an alternative and rejected
+for now: no better raw, ~3s/turn with state, needs `think:false` handling, ~11GB VRAM.
 Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model
 output, falls back to JSONL if Odoo is unreachable. Keep it.
 **Classifies `request_type`:** `appointment` (booking), `callback` (a non-booking request staff
@@ -495,6 +511,10 @@ Beyond the three reverted changes, the following hardening is live (see git hist
 - **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
 - **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
 - **Office selection** — confirm the matching office; never offer/compare others.
+- **Re-ask fix (2026-07-03)** — `CallStateGroomer` slot-state checklist + user-turn merge (see
+  component note above); prompt step 1 now says "I want an appointment" is intent not reason —
+  ask the visit reason ONCE, then move on. Regression harness: `scripts/ab_replay.py [--state]
+  <models...>` replays the historical failure scenarios and flags re-asks.

 ### Phase 2 — Accuracy (RAG + validation)