Fix re-asking: deterministic slot memory + user-turn merge + reason-loop prompt

Historical calls showed the 8B re-asking for name/reason/phone it already had ("I already gave you my full name", the "I want an appointment" -> "what brings you in?" loop) and VAD splitting one utterance into consecutive user turns. - callstate.py: CallStateGroomer between agg.user() and the LLM. After each agent turn (off the critical path) it extracts collected slots via one short JSON-mode Ollama pass, then before each generation injects an ALREADY COLLECTED / STILL NEEDED checklist into the system message and merges VAD-fragmented consecutive user messages. Callback-type calls get an explicit "no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama, off for anthropic). - bot.py prompt step 1: "I want an appointment" is the booking intent, not the reason - ask the visit reason once, never twice. - scripts/ab_replay.py: regression harness replaying the real failed calls. llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate). Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-03 23:49:39 +00:00
parent bae388420b
commit a47f4b423c
5 changed files with 445 additions and 2 deletions
--- a/bot.py
+++ b/bot.py
@@ -53,6 +53,7 @@ from pipecat.transports.websocket.fastapi import (
    FastAPIWebsocketTransport,
 )

+from callstate import CallStateGroomer
 from practice import practice_summary

 # ── Config (env-overridable) ─────────────────────────────────────────────────
@@ -120,6 +121,17 @@ ECHO_TAIL_SECS = float(os.environ.get("ECHO_TAIL_SECS", "0.25"))
 SILENCE_WATCHDOG = os.environ.get("SILENCE_WATCHDOG", "true").lower() not in ("false", "0", "no")
 SILENCE_REPROMPT_SECS = float(os.environ.get("SILENCE_REPROMPT_SECS", "7.0"))
 MAX_REPROMPTS = int(os.environ.get("MAX_REPROMPTS", "2"))
+# Deterministic slot-state tracking (callstate.py): after each agent turn, extract what the
+# caller already provided and inject an explicit ALREADY-COLLECTED / STILL-NEEDED checklist
+# into the system message, plus merge VAD-fragmented user turns. Fixes the 8B re-asking for
+# name/reason/phone it was already given. Extraction runs on the local Ollama model, so it
+# auto-disables for the anthropic provider (Claude tracks state fine on its own).
+_call_state_env = os.environ.get("CALL_STATE_TRACKING")
+CALL_STATE_TRACKING = (
+    _call_state_env.lower() in ("1", "true", "yes")
+    if _call_state_env is not None
+    else (LLM_PROVIDER == "ollama")
+)
 # Record each call to a stereo WAV (caller = left, agent = right) for review/debugging.
 RECORD_CALLS = os.environ.get("RECORD_CALLS", "true").lower() not in ("false", "0", "no")
 RECORDINGS_DIR = os.environ.get("RECORDINGS_DIR", os.path.join(HERE, "recordings"))
@@ -165,7 +177,11 @@ SYSTEM_PROMPT = (
    "THIS case — switch to taking a message; never force booking questions on a non-booking caller.\n"
    "  • A BOOKING (they want to schedule a visit) — work through these steps in order:\n"
    "  1. REASON FIRST — find out what they are calling about (the reason for the visit, or "
-    "their question). If it is only a question, answer it.\n"
+    "their question). If it is only a question, answer it. NOTE: 'I want an appointment' / 'I "
+    "need to make an appointment' is the booking INTENT, not the reason — never treat it as a "
+    "non-answer. Acknowledge it and ask ONCE what the visit is for, e.g. 'Happy to help — what "
+    "would you like to be seen for?'. If they just say 'an appointment' again or give no medical "
+    "reason, note it as a general visit and MOVE ON to location — NEVER ask the reason twice.\n"
    "  2. LOCATION — ask which city or area is most convenient, then confirm the matching "
    "office (see the office rule below).\n"
    "  3. CALLER INFO — get their FULL name (first and last; if they give only a first name, "
@@ -656,6 +672,11 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
        context_kwargs["tools"] = _build_tools()
    context = LLMContext(**context_kwargs)
    agg = LLMContextAggregatorPair(context)
+    # Deterministic slot memory: merges fragmented user turns + injects the live
+    # collected/needed checklist into the system message before each generation.
+    groomer = CallStateGroomer(
+        context, base_system=system_content, ollama_url=OLLAMA_URL, model=OLLAMA_MODEL,
+    ) if CALL_STATE_TRACKING else None
    # Deterministic phone-confirmation safety net: if the agent reaches a closing without
    # having read the caller-ID back, EndCallProcessor speaks this scripted line first.
    if caller_number:
@@ -687,6 +708,7 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
            vad,
            stt,
            agg.user(),
+            *( [groomer] if groomer else [] ),  # slot-state checklist + user-turn merge
            llm,
            endcall,
            *( [watchdog] if watchdog else [] ),  # re-prompt on caller silence