Fix re-asking: deterministic slot memory + user-turn merge + reason-loop prompt

Historical calls showed the 8B re-asking for name/reason/phone it already had
("I already gave you my full name", the "I want an appointment" -> "what brings
you in?" loop) and VAD splitting one utterance into consecutive user turns.

- callstate.py: CallStateGroomer between agg.user() and the LLM. After each
  agent turn (off the critical path) it extracts collected slots via one short
  JSON-mode Ollama pass, then before each generation injects an ALREADY
  COLLECTED / STILL NEEDED checklist into the system message and merges
  VAD-fragmented consecutive user messages. Callback-type calls get an explicit
  "no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama,
  off for anthropic).
- bot.py prompt step 1: "I want an appointment" is the booking intent, not the
  reason - ask the visit reason once, never twice.
- scripts/ab_replay.py: regression harness replaying the real failed calls.
  llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all
  scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate).
  Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
tocmo0nlord
2026-07-03 23:49:39 +00:00
parent bae388420b
commit a47f4b423c
5 changed files with 445 additions and 2 deletions

24
bot.py
View File

@@ -53,6 +53,7 @@ from pipecat.transports.websocket.fastapi import (
FastAPIWebsocketTransport,
)
from callstate import CallStateGroomer
from practice import practice_summary
# ── Config (env-overridable) ─────────────────────────────────────────────────
@@ -120,6 +121,17 @@ ECHO_TAIL_SECS = float(os.environ.get("ECHO_TAIL_SECS", "0.25"))
SILENCE_WATCHDOG = os.environ.get("SILENCE_WATCHDOG", "true").lower() not in ("false", "0", "no")
SILENCE_REPROMPT_SECS = float(os.environ.get("SILENCE_REPROMPT_SECS", "7.0"))
MAX_REPROMPTS = int(os.environ.get("MAX_REPROMPTS", "2"))
# Deterministic slot-state tracking (callstate.py): after each agent turn, extract what the
# caller already provided and inject an explicit ALREADY-COLLECTED / STILL-NEEDED checklist
# into the system message, plus merge VAD-fragmented user turns. Fixes the 8B re-asking for
# name/reason/phone it was already given. Extraction runs on the local Ollama model, so it
# auto-disables for the anthropic provider (Claude tracks state fine on its own).
_call_state_env = os.environ.get("CALL_STATE_TRACKING")
CALL_STATE_TRACKING = (
_call_state_env.lower() in ("1", "true", "yes")
if _call_state_env is not None
else (LLM_PROVIDER == "ollama")
)
# Record each call to a stereo WAV (caller = left, agent = right) for review/debugging.
RECORD_CALLS = os.environ.get("RECORD_CALLS", "true").lower() not in ("false", "0", "no")
RECORDINGS_DIR = os.environ.get("RECORDINGS_DIR", os.path.join(HERE, "recordings"))
@@ -165,7 +177,11 @@ SYSTEM_PROMPT = (
"THIS case — switch to taking a message; never force booking questions on a non-booking caller.\n"
" • A BOOKING (they want to schedule a visit) — work through these steps in order:\n"
" 1. REASON FIRST — find out what they are calling about (the reason for the visit, or "
"their question). If it is only a question, answer it.\n"
"their question). If it is only a question, answer it. NOTE: 'I want an appointment' / 'I "
"need to make an appointment' is the booking INTENT, not the reason — never treat it as a "
"non-answer. Acknowledge it and ask ONCE what the visit is for, e.g. 'Happy to help — what "
"would you like to be seen for?'. If they just say 'an appointment' again or give no medical "
"reason, note it as a general visit and MOVE ON to location — NEVER ask the reason twice.\n"
" 2. LOCATION — ask which city or area is most convenient, then confirm the matching "
"office (see the office rule below).\n"
" 3. CALLER INFO — get their FULL name (first and last; if they give only a first name, "
@@ -656,6 +672,11 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
context_kwargs["tools"] = _build_tools()
context = LLMContext(**context_kwargs)
agg = LLMContextAggregatorPair(context)
# Deterministic slot memory: merges fragmented user turns + injects the live
# collected/needed checklist into the system message before each generation.
groomer = CallStateGroomer(
context, base_system=system_content, ollama_url=OLLAMA_URL, model=OLLAMA_MODEL,
) if CALL_STATE_TRACKING else None
# Deterministic phone-confirmation safety net: if the agent reaches a closing without
# having read the caller-ID back, EndCallProcessor speaks this scripted line first.
if caller_number:
@@ -687,6 +708,7 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
vad,
stt,
agg.user(),
*( [groomer] if groomer else [] ), # slot-state checklist + user-turn merge
llm,
endcall,
*( [watchdog] if watchdog else [] ), # re-prompt on caller silence