Fix re-asking: deterministic slot memory + user-turn merge + reason-loop prompt

Historical calls showed the 8B re-asking for name/reason/phone it already had
("I already gave you my full name", the "I want an appointment" -> "what brings
you in?" loop) and VAD splitting one utterance into consecutive user turns.

- callstate.py: CallStateGroomer between agg.user() and the LLM. After each
  agent turn (off the critical path) it extracts collected slots via one short
  JSON-mode Ollama pass, then before each generation injects an ALREADY
  COLLECTED / STILL NEEDED checklist into the system message and merges
  VAD-fragmented consecutive user messages. Callback-type calls get an explicit
  "no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama,
  off for anthropic).
- bot.py prompt step 1: "I want an appointment" is the booking intent, not the
  reason - ask the visit reason once, never twice.
- scripts/ab_replay.py: regression harness replaying the real failed calls.
  llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all
  scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate).
  Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
tocmo0nlord
2026-07-03 23:49:39 +00:00
parent bae388420b
commit a47f4b423c
5 changed files with 445 additions and 2 deletions

View File

@@ -83,7 +83,23 @@ audio while the bot is speaking (+`ECHO_TAIL_SECS`, default 0.5s) so echo never
Trade-off: half-duplex — the caller can't barge in mid-utterance (fine for short replies).
`HALF_DUPLEX=false` restores barge-in. Keep it on for telephony.
**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends.
**`CallStateGroomer` (`callstate.py`) — deterministic slot memory (2026-07-03).** Fixes the
8B re-asking for things the caller already gave (name, reason, phone — seen repeatedly in the
historical call logs: "Didn't you say you had my phone number?", "I already gave you my full
name", the "I want an appointment"→"what brings you in?" loop). Sits between `agg.user()` and
the LLM. Two jobs: (1) on upstream `BotStoppedSpeakingFrame` (agent finished; Ollama idle,
caller talking) it runs a ~1.2s JSON-mode extraction over the transcript-so-far — OFF the
latency-critical path, result applied next turn; (2) on downstream `LLMContextFrame` (right
before generation) it synchronously merges VAD-fragmented consecutive user messages
("Monday" / "3 p.m." → one turn) and injects an explicit checklist into the system message:
`CALL STATE ... ALREADY COLLECTED (NEVER ask again): name=Carlos Garcia ... STILL NEEDED:
insurance, preferred day/time`. It also carries call type (`callback` → "do NOT ask booking
questions"). Verified via `scripts/ab_replay.py` (replays the real failed calls): llama3.1-8B
raw = 3 failures, +CALL STATE = **0 failures**, chat latency 0.31s→0.55s med (system-message
churn re-evals the prompt; acceptable, still ≪ the 3s gate). Env: `CALL_STATE_TRACKING`
(default: on for ollama, off for anthropic — Claude tracks state fine on its own; extraction
always runs on the local Ollama model). Qwen3-14B was A/B'd as an alternative and rejected
for now: no better raw, ~3s/turn with state, needs `think:false` handling, ~11GB VRAM.
Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model
output, falls back to JSONL if Odoo is unreachable. Keep it.
**Classifies `request_type`:** `appointment` (booking), `callback` (a non-booking request staff
@@ -495,6 +511,10 @@ Beyond the three reverted changes, the following hardening is live (see git hist
- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
- **Office selection** — confirm the matching office; never offer/compare others.
- **Re-ask fix (2026-07-03)** — `CallStateGroomer` slot-state checklist + user-turn merge (see
component note above); prompt step 1 now says "I want an appointment" is intent not reason —
ask the visit reason ONCE, then move on. Regression harness: `scripts/ab_replay.py [--state]
<models...>` replays the historical failure scenarios and flags re-asks.
### Phase 2 — Accuracy (RAG + validation)