Fix re-asking: deterministic slot memory + user-turn merge + reason-loop prompt

Historical calls showed the 8B re-asking for name/reason/phone it already had
("I already gave you my full name", the "I want an appointment" -> "what brings
you in?" loop) and VAD splitting one utterance into consecutive user turns.

- callstate.py: CallStateGroomer between agg.user() and the LLM. After each
  agent turn (off the critical path) it extracts collected slots via one short
  JSON-mode Ollama pass, then before each generation injects an ALREADY
  COLLECTED / STILL NEEDED checklist into the system message and merges
  VAD-fragmented consecutive user messages. Callback-type calls get an explicit
  "no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama,
  off for anthropic).
- bot.py prompt step 1: "I want an appointment" is the booking intent, not the
  reason - ask the visit reason once, never twice.
- scripts/ab_replay.py: regression harness replaying the real failed calls.
  llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all
  scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate).
  Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM.
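
The "off the critical path" extraction described above can be sketched as a background task whose result only becomes visible on the following turn. This is a minimal illustration, not the code in callstate.py: `LaggedState`, `fake_extractor`, and `demo` are hypothetical names, and the extractor is a stand-in for the real Ollama call.

```python
import asyncio


class LaggedState:
    """Holds the latest extracted state; extraction runs in the background."""

    def __init__(self, extractor):
        self._extractor = extractor  # hypothetical async callable: transcript -> dict
        self.state = None            # what the next generation would read
        self._task = None

    def kick_off(self, transcript):
        # Only one extraction in flight at a time, as in the commit description.
        if self._task is not None:
            return
        self._task = asyncio.create_task(self._extractor(list(transcript)))
        self._task.add_done_callback(self._on_done)

    def _on_done(self, task):
        self._task = None
        # Check cancelled() first: exception() raises on a cancelled task.
        if task.cancelled() or task.exception() is not None:
            return
        result = task.result()
        if result:
            self.state = result


async def fake_extractor(transcript):
    # Stand-in for the JSON-mode model call; assumed to take a short while.
    await asyncio.sleep(0.01)
    return {"turns_seen": len(transcript)}


async def demo():
    tracker = LaggedState(fake_extractor)
    transcript = ["assistant: How can I help?", "user: I want an appointment."]
    seen = [tracker.state]        # nothing extracted yet
    tracker.kick_off(transcript)  # agent finished speaking: start extraction
    await asyncio.sleep(0.05)     # caller is talking; extraction completes meanwhile
    seen.append(tracker.state)    # state is available for the next generation
    return seen
```

Calling `asyncio.run(demo())` shows the state going from empty to populated without the caller-facing path ever awaiting the extractor.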

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tocmo0nlord
2026-07-03 23:49:39 +00:00
parent bae388420b
commit a47f4b423c
5 changed files with 445 additions and 2 deletions


@@ -68,3 +68,7 @@ VAD_CONFIDENCE=0.5
VAD_MIN_VOLUME=0.15
VAD_START_SECS=0.1
VAD_STOP_SECS=0.5
# Deterministic slot memory (callstate.py): injects an ALREADY-COLLECTED / STILL-NEEDED
# checklist into the system prompt each turn + merges VAD-fragmented user turns, so the
# local 8B stops re-asking for name/reason/phone. Default: on for ollama, off for anthropic.
#CALL_STATE_TRACKING=true
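
The default described in the comment above (on for ollama, off for anthropic, explicit value wins) can be sketched as a small pure function. This mirrors the resolution logic shown in the bot.py hunk of this commit; `resolve_call_state_tracking` and `TRUTHY` are hypothetical names used only for illustration.

```python
from typing import Optional

# Values treated as "on", matching the tuple used in the bot.py hunk.
TRUTHY = ("1", "true", "yes")


def resolve_call_state_tracking(env_value: Optional[str], llm_provider: str) -> bool:
    """An explicit env value wins; otherwise enable only for the ollama provider."""
    if env_value is not None:
        return env_value.lower() in TRUTHY
    return llm_provider == "ollama"
```

One consequence worth knowing: any explicit value outside `1`/`true`/`yes` (for example `on`) resolves to disabled, even with the ollama provider.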


@@ -83,7 +83,23 @@ audio while the bot is speaking (+`ECHO_TAIL_SECS`, default 0.5s) so echo never
Trade-off: half-duplex — the caller can't barge in mid-utterance (fine for short replies).
`HALF_DUPLEX=false` restores barge-in. Keep it on for telephony.
**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends. **`CallStateGroomer` (`callstate.py`) — deterministic slot memory (2026-07-03).** Fixes the
8B re-asking for things the caller already gave (name, reason, phone — seen repeatedly in the
historical call logs: "Didn't you say you had my phone number?", "I already gave you my full
name", the "I want an appointment"→"what brings you in?" loop). Sits between `agg.user()` and
the LLM. Two jobs: (1) on upstream `BotStoppedSpeakingFrame` (agent finished; Ollama idle,
caller talking) it runs a ~1.2s JSON-mode extraction over the transcript-so-far — OFF the
latency-critical path, result applied next turn; (2) on downstream `LLMContextFrame` (right
before generation) it synchronously merges VAD-fragmented consecutive user messages
("Monday" / "3 p.m." → one turn) and injects an explicit checklist into the system message:
`CALL STATE ... ALREADY COLLECTED (NEVER ask again): name=Carlos Garcia ... STILL NEEDED:
insurance, preferred day/time`. It also carries call type (`callback` → "do NOT ask booking
questions"). Verified via `scripts/ab_replay.py` (replays the real failed calls): llama3.1-8B
raw = 3 failures, +CALL STATE = **0 failures**, chat latency 0.31s→0.55s med (system-message
churn re-evals the prompt; acceptable, still ≪ the 3s gate). Env: `CALL_STATE_TRACKING`
(default: on for ollama, off for anthropic — Claude tracks state fine on its own; extraction
always runs on the local Ollama model). Qwen3-14B was A/B'd as an alternative and rejected
for now: no better raw, ~3s/turn with state, needs `think:false` handling, ~11GB VRAM.
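
The user-turn merge described above ("Monday" / "3 p.m." becoming one turn) can be illustrated with a self-contained sketch. It follows the same idea as `merge_consecutive_user_messages` in `callstate.py` but is written independently here; `merge_user_fragments` and the sample `history` are hypothetical.

```python
def merge_user_fragments(messages):
    """Collapse back-to-back user messages into one turn; returns a new list."""
    merged = []
    for msg in messages:
        prev = merged[-1] if merged else None
        both_user_text = (
            prev is not None
            and prev.get("role") == "user"
            and msg.get("role") == "user"
            and isinstance(prev.get("content"), str)
            and isinstance(msg.get("content"), str)
        )
        if both_user_text:
            joined = (prev["content"].rstrip() + " " + msg["content"].lstrip()).strip()
            # Copy instead of mutating so the caller's message dicts stay intact.
            merged[-1] = {**prev, "content": joined}
        else:
            merged.append(msg)
    return merged


history = [
    {"role": "assistant", "content": "What day and time works best?"},
    {"role": "user", "content": "Monday"},
    {"role": "user", "content": "3 p.m."},
]
merged = merge_user_fragments(history)
```

After the call, `merged` holds two messages and the second one reads `Monday 3 p.m.`, while `history` is left unchanged.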
Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model
output, falls back to JSONL if Odoo is unreachable. Keep it.
**Classifies `request_type`:** `appointment` (booking), `callback` (a non-booking request staff
@@ -495,6 +511,10 @@ Beyond the three reverted changes, the following hardening is live (see git hist
- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
- **Office selection** — confirm the matching office; never offer/compare others.
- **Re-ask fix (2026-07-03)** — `CallStateGroomer` slot-state checklist + user-turn merge (see
component note above); prompt step 1 now says "I want an appointment" is intent not reason —
ask the visit reason ONCE, then move on. Regression harness: `scripts/ab_replay.py [--state]
<models...>` replays the historical failure scenarios and flags re-asks.
### Phase 2 — Accuracy (RAG + validation)

24
bot.py

@@ -53,6 +53,7 @@ from pipecat.transports.websocket.fastapi import (
    FastAPIWebsocketTransport,
)
from callstate import CallStateGroomer
from practice import practice_summary
# ── Config (env-overridable) ─────────────────────────────────────────────────
@@ -120,6 +121,17 @@ ECHO_TAIL_SECS = float(os.environ.get("ECHO_TAIL_SECS", "0.25"))
SILENCE_WATCHDOG = os.environ.get("SILENCE_WATCHDOG", "true").lower() not in ("false", "0", "no")
SILENCE_REPROMPT_SECS = float(os.environ.get("SILENCE_REPROMPT_SECS", "7.0"))
MAX_REPROMPTS = int(os.environ.get("MAX_REPROMPTS", "2"))
# Deterministic slot-state tracking (callstate.py): after each agent turn, extract what the
# caller already provided and inject an explicit ALREADY-COLLECTED / STILL-NEEDED checklist
# into the system message, plus merge VAD-fragmented user turns. Fixes the 8B re-asking for
# name/reason/phone it was already given. Extraction runs on the local Ollama model, so it
# auto-disables for the anthropic provider (Claude tracks state fine on its own).
_call_state_env = os.environ.get("CALL_STATE_TRACKING")
CALL_STATE_TRACKING = (
    _call_state_env.lower() in ("1", "true", "yes")
    if _call_state_env is not None
    else (LLM_PROVIDER == "ollama")
)
# Record each call to a stereo WAV (caller = left, agent = right) for review/debugging.
RECORD_CALLS = os.environ.get("RECORD_CALLS", "true").lower() not in ("false", "0", "no")
RECORDINGS_DIR = os.environ.get("RECORDINGS_DIR", os.path.join(HERE, "recordings"))
@@ -165,7 +177,11 @@ SYSTEM_PROMPT = (
    "THIS case — switch to taking a message; never force booking questions on a non-booking caller.\n"
    " • A BOOKING (they want to schedule a visit) — work through these steps in order:\n"
    " 1. REASON FIRST — find out what they are calling about (the reason for the visit, or "
    "their question). If it is only a question, answer it.\n" "their question). If it is only a question, answer it. NOTE: 'I want an appointment' / 'I "
    "need to make an appointment' is the booking INTENT, not the reason — never treat it as a "
    "non-answer. Acknowledge it and ask ONCE what the visit is for, e.g. 'Happy to help — what "
    "would you like to be seen for?'. If they just say 'an appointment' again or give no medical "
    "reason, note it as a general visit and MOVE ON to location — NEVER ask the reason twice.\n"
    " 2. LOCATION — ask which city or area is most convenient, then confirm the matching "
    "office (see the office rule below).\n"
    " 3. CALLER INFO — get their FULL name (first and last; if they give only a first name, "
@@ -656,6 +672,11 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
context_kwargs["tools"] = _build_tools()
context = LLMContext(**context_kwargs)
agg = LLMContextAggregatorPair(context)
# Deterministic slot memory: merges fragmented user turns + injects the live
# collected/needed checklist into the system message before each generation.
groomer = CallStateGroomer(
    context, base_system=system_content, ollama_url=OLLAMA_URL, model=OLLAMA_MODEL,
) if CALL_STATE_TRACKING else None
# Deterministic phone-confirmation safety net: if the agent reaches a closing without
# having read the caller-ID back, EndCallProcessor speaks this scripted line first.
if caller_number:
@@ -687,6 +708,7 @@ async def run_agent(transport, caller_number=None, call_sid=None, do_capture=Tru
vad,
stt,
agg.user(),
*( [groomer] if groomer else [] ), # slot-state checklist + user-turn merge
llm,
endcall,
*( [watchdog] if watchdog else [] ), # re-prompt on caller silence

232
callstate.py Normal file

@@ -0,0 +1,232 @@
"""In-call slot-state tracking — deterministic memory for a weak LLM.

The 8B keeps re-asking for things the caller already said (name, reason, phone) because
it has to *infer* call state from a long transcript under ~1,400 tokens of rules. This
module makes the state explicit instead: after each agent turn (while the caller is
talking — off the latency-critical path) it runs one short JSON-mode extraction over the
transcript, then injects a live checklist into the system message before the next
generation:

    CALL STATE ... ALREADY COLLECTED (never ask again): name=Carlos Garcia, ...
    STILL NEEDED: insurance, preferred day/time

Small models follow an explicit checklist at the end of the system prompt far more
reliably than they track slots from conversation history. Same philosophy as the
deterministic phone-confirm safety net in EndCallProcessor: scaffold around the model.

CallStateGroomer also merges consecutive user messages in the context (VAD splits one
utterance like "Monday" / "3 p.m." into two turns, which derails the 8B) — done
synchronously on LLMContextFrame, right before the LLM reads the context.
"""
import asyncio
import json

import httpx
from loguru import logger
from pipecat.frames.frames import BotStoppedSpeakingFrame, Frame, LLMContextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

# Short, in-call variant of the post-call extractor (extract.py): only what's needed to
# build the checklist, temperature 0, capped output. Runs on the local Ollama model.
_STATE_INSTRUCTIONS = (
    "You are tracking the state of a LIVE phone call between a caller and the receptionist "
    "of an optometry practice. From the transcript, extract only what the CALLER has clearly "
    "provided so far. Respond with ONLY a JSON object with these keys:\n"
    ' "call_type": "booking" (wants to schedule a visit), "callback" (wants something staff '
    "must check off-phone: order/frames/lens/prescription status, billing, account lookup, "
    'reach a person), "question" (just asking something), or "unknown"\n'
    ' "reason": string or null — for booking, why they want to be seen (visit type or eye '
    "problem); for callback, a one-line note of what they need. 'an appointment' alone is NOT "
    "a reason — use null.\n"
    ' "location": string or null — the office/city the caller wants\n'
    ' "patient_name": string or null — the caller\'s name as given (full or first-only)\n'
    ' "name_is_full": boolean — true only if it clearly has first AND last name\n'
    ' "insurance": string or null — the plan the caller named, exactly as said\n'
    ' "preferred_time": string or null — day/time in the caller\'s own words\n'
    "Use null unless the caller clearly stated it. Never invent values."
)

# Booking slots in the order the call script gathers them.
_BOOKING_ORDER = [
    ("reason", "reason for the visit"),
    ("location", "which office/city"),
    ("patient_name", "full name"),
    ("insurance", "insurance"),
    ("preferred_time", "preferred day and time"),
]


async def extract_call_state(messages, ollama_url, model, timeout=15):
    """One short JSON-mode pass over the transcript-so-far. Returns the state dict or None."""
    turns = [
        f"{m['role']}: {m['content']}"
        for m in messages
        if m.get("role") in ("user", "assistant")
        and isinstance(m.get("content"), str) and m["content"].strip()
    ]
    if not turns:
        return None
    # Accept either the native base URL or the OpenAI-compatible one ending in /v1.
    base = ollama_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    body_extra = {}
    if "qwen3" in model or "deepseek-r1" in model:
        body_extra["think"] = False  # thinking models emit non-JSON otherwise
    async with httpx.AsyncClient(timeout=timeout) as client:
        r = await client.post(
            f"{base}/api/chat",
            json={
                "model": model,
                "format": "json",
                "stream": False,
                "options": {"temperature": 0, "num_predict": 200},
                **body_extra,
                "messages": [
                    {"role": "system", "content": _STATE_INSTRUCTIONS},
                    {"role": "user", "content": "Transcript:\n" + "\n".join(turns)},
                ],
            },
        )
        r.raise_for_status()
    state = json.loads(r.json()["message"]["content"])
    # JSON mode can still yield a non-object (list, string); honor the "dict or None" contract.
    return state if isinstance(state, dict) else None


def build_state_block(state) -> str:
    """Render the extracted state as an explicit checklist for the system prompt.
    Returns "" when there's nothing worth injecting yet (first turns)."""
    if not state:
        return ""
    ctype = (state.get("call_type") or "unknown").strip().lower()
    got, needed = [], []
    for key, label in _BOOKING_ORDER:
        val = (state.get(key) or "").strip() if isinstance(state.get(key), str) else ""
        if key == "patient_name" and val and not state.get("name_is_full"):
            got.append(f"first name: {val}")
            needed.append("their LAST name (you have the first)")
            continue
        if val:
            got.append(f"{label}: {val}")
        else:
            needed.append(label)
    if ctype == "callback":
        lines = [
            "CALL STATE (auto-tracked from this conversation — trust it over your memory):",
            "- This is a NON-BOOKING call: the caller needs staff to handle something off the "
            "phone. Do NOT ask about insurance, office, or a preferred day/time.",
        ]
        if got:
            lines.append("- ALREADY COLLECTED — NEVER ask for these again: " + "; ".join(got))
        # Treat an empty or non-string name like a missing one, not only an explicit null.
        name = state.get("patient_name")
        if not (isinstance(name, str) and name.strip()):
            lines.append("- Still needed: their name. Then confirm the callback number and close.")
        else:
            lines.append("- You have what you need: confirm the callback number and close.")
        return "\n".join(lines)
    if ctype == "booking" and (got or needed):
        lines = ["CALL STATE (auto-tracked from this conversation — trust it over your memory):"]
        if got:
            lines.append("- ALREADY COLLECTED — NEVER ask for these again: " + "; ".join(got))
        if needed:
            lines.append("- STILL NEEDED — ask for the FIRST of these, one per turn: "
                         + ", ".join(needed))
            # The observed failure loop: caller says "an appointment", model keeps asking why.
            reason = state.get("reason")
            if not (isinstance(reason, str) and reason.strip()):
                lines.append("- No visit reason yet: if you have ALREADY asked what the visit "
                             "is for and they only said 'an appointment', do NOT ask again — "
                             "note it as a general visit and ask the next needed item instead.")
        else:
            lines.append("- All booking details collected: confirm the callback number, recap "
                         "as a REQUEST, ask if there's anything else, then close.")
        return "\n".join(lines)
    return ""  # question/unknown — nothing useful to inject


def merge_consecutive_user_messages(messages):
    """Collapse back-to-back user messages (VAD-fragmented utterances) into one turn.
    Returns a new list; non-string content (tool results) is left untouched."""
    out = []
    for m in messages:
        prev = out[-1] if out else None
        if (
            prev is not None
            and m.get("role") == "user" and prev.get("role") == "user"
            and isinstance(m.get("content"), str) and isinstance(prev.get("content"), str)
        ):
            prev = dict(prev)
            prev["content"] = (prev["content"].rstrip() + " " + m["content"].lstrip()).strip()
            out[-1] = prev
        else:
            out.append(m)
    return out


class CallStateGroomer(FrameProcessor):
    """Sits between the user aggregator and the LLM.

    Downstream LLMContextFrame (= a generation is about to start): synchronously groom the
    context — merge fragmented user turns, refresh the system message with the latest
    CALL STATE checklist.

    Upstream BotStoppedSpeakingFrame (= the agent finished a reply; Ollama is idle and the
    caller is about to talk): kick off the next state extraction in the background. Its
    result is applied on the *next* LLMContextFrame — one turn of lag, zero added latency.
    """

    def __init__(self, context, base_system: str, ollama_url: str, model: str):
        super().__init__()
        self._context = context
        self._base_system = base_system
        self._ollama_url = ollama_url
        self._model = model
        self._state = None
        self._task = None

    def _extract_done(self, task):
        self._task = None
        if task.cancelled():
            return
        exc = task.exception()
        if exc:
            logger.warning(f"CallState extraction failed: {exc}")
            return
        state = task.result()
        if state:
            self._state = state
            logger.info(f"CallState updated: {json.dumps(state, ensure_ascii=False)}")

    def _maybe_extract(self):
        if self._task is not None:  # one in flight at a time
            return
        messages = list(self._context.messages)
        if not any(m.get("role") == "user" for m in messages):
            return  # greeting only — nothing to extract yet
        self._task = asyncio.create_task(
            extract_call_state(messages, self._ollama_url, self._model)
        )
        self._task.add_done_callback(self._extract_done)

    def _groom_context(self):
        messages = merge_consecutive_user_messages(list(self._context.messages))
        block = build_state_block(self._state)
        for i, m in enumerate(messages):
            if m.get("role") == "system":
                content = self._base_system + ("\n\n" + block if block else "")
                if m.get("content") != content:
                    messages[i] = {**m, "content": content}
                break
        self._context.set_messages(messages)

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, LLMContextFrame) and direction == FrameDirection.DOWNSTREAM:
            try:
                self._groom_context()
            except Exception:
                logger.exception("CallState groom failed (continuing with raw context)")
        elif isinstance(frame, BotStoppedSpeakingFrame):
            self._maybe_extract()
        await self.push_frame(frame, direction)
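
To make the checklist logic in `build_state_block` easier to follow, here is a reduced, self-contained sketch of just the collected/needed split, including the first-name-only case. It is an illustration under simplified assumptions, not a replacement for the function above: `BOOKING_SLOTS`, `partition_slots`, `render_checklist`, and the sample `state` are hypothetical.

```python
# Slot keys and labels copied from the _BOOKING_ORDER list in callstate.py.
BOOKING_SLOTS = [
    ("reason", "reason for the visit"),
    ("location", "which office/city"),
    ("patient_name", "full name"),
    ("insurance", "insurance"),
    ("preferred_time", "preferred day and time"),
]


def partition_slots(state):
    """Split the extracted state into what is collected and what is still needed."""
    got, needed = [], []
    for key, label in BOOKING_SLOTS:
        raw = state.get(key)
        val = raw.strip() if isinstance(raw, str) else ""
        if key == "patient_name" and val and not state.get("name_is_full"):
            # A first name counts as collected, but the last name is still open.
            got.append(f"first name: {val}")
            needed.append("their LAST name (you have the first)")
        elif val:
            got.append(f"{label}: {val}")
        else:
            needed.append(label)
    return got, needed


def render_checklist(state):
    """Simplified two-line rendering; the real block adds call-type handling."""
    got, needed = partition_slots(state)
    lines = []
    if got:
        lines.append("ALREADY COLLECTED: " + "; ".join(got))
    if needed:
        lines.append("STILL NEEDED: " + ", ".join(needed))
    return "\n".join(lines)


state = {
    "call_type": "booking",
    "reason": "eye pain",
    "location": "Kendall",
    "patient_name": "Carlos",
    "name_is_full": False,
    "insurance": None,
    "preferred_time": None,
}
got, needed = partition_slots(state)
```

With this sample state, the first name lands in `got` while the last name, insurance, and preferred time end up in `needed`, which is the ordering the model is told to follow one item per turn.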

165
scripts/ab_replay.py Normal file

@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""A/B replay: re-run the historical problem scenarios against candidate LLMs.

Replays scripted caller turns (taken from real failed calls in the run logs) through the
production system prompt and checks each model for the observed failure modes: re-asking
the reason ("I want an appointment" loop), re-asking name/phone, and forcing booking
questions (insurance/day-time) on non-booking callers. Also reports per-turn latency.

Usage (inside the pipecat venv):
    python scripts/ab_replay.py activeblue-avc:latest qwen3:14b
    python scripts/ab_replay.py --state activeblue-avc:latest   # with CALL STATE injection

--state approximates the CallStateGroomer: before each reply it runs the callstate
extraction and injects the ALREADY COLLECTED / STILL NEEDED checklist. Differences from
the live pipeline: the extraction here is synchronous and already sees the newest caller
turn (in-call it lags one turn), fragmented user messages are not merged, and extraction
time is not included in the reported chat latency.
"""
import argparse
import asyncio
import re
import sys
import time
from pathlib import Path

import httpx

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from bot import SYSTEM_PROMPT  # noqa: E402 (import parses env + practice facts only)
from callstate import build_state_block, extract_call_state  # noqa: E402

OLLAMA = "http://127.0.0.1:11434"
CALLER_LINE = (
    "\n\nCALLER ID: the caller's number on file, written so you read it digit by digit, "
    "is: nine seven three, five seven three, one six seven one. Near the end, state it back "
    "and invite a correction only ('...; if that's not the best number, just let me know.') — "
    "do NOT ask a yes/no question or wait for a 'yes'. Only change it if they give a different "
    "number. Do not say it any earlier in the call."
)
GREETING = "Thank you for calling Advanced Vision Care, this is AVA. How can I help you today?"

# Failure-mode detectors: (label, regex counted across assistant turns, max allowed count)
ASK_REASON = re.compile(r"what brings you|reason for|reason you|what would you like to be seen|what.s the visit for|what seems to be", re.I)
ASK_NAME = re.compile(r"(full |your |the )name", re.I)
ASK_INSURANCE = re.compile(r"insurance", re.I)
# Asking FOR a number is the failure; the statement-form readback ("I have your number
# as ...; if that's not the best number, just let me know") is correct behavior.
ASK_PHONE_Q = re.compile(r"(what('| i)s|can I (get|have)|may I (get|have)|could I (get|have)|give me).{0,40}(phone|number)", re.I)
ASK_LOCATION = re.compile(r"(which|what).{0,30}(city|area|office|location)", re.I)

# Scenarios distilled from real failed calls (log refs in comments). A tuple within
# `turns` = VAD-fragmented utterance (two user messages, one reply) — from log.10 call#1.
SCENARIOS = [
    dict(
        name="reason-loop (avc_run.log call#1 / log.21 call#5)",
        turns=["I want an appointment.", "appointment", "Kendall",
               "Carlos Garcia", "Humana", ("Monday", "3 p.m."), "No, that's all, thank you."],
        checks=[("re-asked reason", ASK_REASON, 1),
                ("re-asked name", ASK_NAME, 1),
                ("re-asked location", ASK_LOCATION, 1),
                ("asked for phone (has caller-ID)", ASK_PHONE_Q, 0)],
    ),
    dict(
        name="glasses callback (log.23/24/25)",
        turns=["Hey, I'm a patient in Kendall and I need to know when my glasses are ready.",
               "Carlos Garcia", "That's what I'm asking — the status of my order.",
               "Yes, that's a good number.", "No, that's all."],
        checks=[("asked insurance on non-booking call", ASK_INSURANCE, 0),
                ("asked day/time on non-booking call",
                 re.compile(r"(what|which) day|day and time|preferred (day|time)", re.I), 0),
                ("re-asked name", ASK_NAME, 1)],
    ),
    dict(
        name="early-info booking (log.4 call#1: reason+city up front)",
        turns=["I'm having eye pain and I'm in Kendall, Florida.", "Yes please.",
               "Carlos Garcia", "Florida Blue Medicare", ("Monday", "5 p.m."),
               "No, that's everything."],
        checks=[("re-asked reason", ASK_REASON, 1),
                ("re-asked location", ASK_LOCATION, 0),  # was given in turn 1
                ("re-asked name", ASK_NAME, 1),
                ("asked for phone (has caller-ID)", ASK_PHONE_Q, 0)],
    ),
]


async def chat(client, model, messages, think_capable):
    body = {
        "model": model, "stream": False, "messages": messages,
        "options": {"temperature": 0.3, "num_predict": 160, "num_ctx": 8192},
    }
    if think_capable:
        body["think"] = False
    t0 = time.time()
    r = await client.post(f"{OLLAMA}/api/chat", json=body)
    r.raise_for_status()
    return r.json()["message"]["content"].strip(), time.time() - t0


async def run_scenario(client, model, sc, with_state):
    think_capable = "qwen3" in model or "deepseek-r1" in model
    base_system = SYSTEM_PROMPT + CALLER_LINE
    msgs = [{"role": "system", "content": base_system},
            {"role": "assistant", "content": GREETING}]
    lats, transcript = [], [("A", GREETING)]
    for turn in sc["turns"]:
        # A tuple is one utterance that VAD split into several user messages.
        frags = turn if isinstance(turn, tuple) else (turn,)
        for f in frags:
            msgs.append({"role": "user", "content": f})
            transcript.append(("C", f))
        if with_state:
            try:
                state = await extract_call_state(msgs, OLLAMA, model)
                block = build_state_block(state)
                msgs[0]["content"] = base_system + ("\n\n" + block if block else "")
            except Exception as e:
                print(f" (state extraction failed: {e})")
        reply, dt = await chat(client, model, msgs, think_capable)
        lats.append(dt)
        msgs.append({"role": "assistant", "content": reply})
        transcript.append(("A", reply))
        if "goodbye" in reply.lower():
            break
    return transcript, lats


def score(sc, transcript):
    replies = [t for r, t in transcript if r == "A"]
    fails = []
    for label, rx, max_ok in sc["checks"]:
        n = sum(1 for t in replies if rx.search(t))
        if n > max_ok:
            fails.append(f"{label} ({n}x, max {max_ok})")
    return fails


async def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("models", nargs="+")
    ap.add_argument("--state", action="store_true", help="inject CALL STATE checklist per turn")
    ap.add_argument("-v", "--verbose", action="store_true", help="print transcripts")
    args = ap.parse_args()
    async with httpx.AsyncClient(timeout=120) as client:
        results = {}
        for model in args.models:
            print(f"\n{'='*70}\nMODEL: {model}{' + CALL STATE' if args.state else ''}\n{'='*70}")
            total_fails, all_lats = 0, []
            for sc in SCENARIOS:
                transcript, lats = await run_scenario(client, model, sc, args.state)
                fails = score(sc, transcript)
                total_fails += len(fails)
                all_lats += lats
                mark = "PASS" if not fails else "FAIL: " + "; ".join(fails)
                print(f"\n--- {sc['name']} -> {mark}")
                if args.verbose or fails:
                    for r, t in transcript:
                        print(f" {r}: {t}")
            lat = sorted(all_lats)
            results[model] = (total_fails, lat[len(lat)//2], lat[-1])
            print(f"\n{model}: {total_fails} failure(s) | latency med={lat[len(lat)//2]:.2f}s max={lat[-1]:.2f}s")
    print(f"\n{'='*70}\nSUMMARY{' (+state)' if args.state else ''}")
    for m, (f, med, mx) in results.items():
        print(f" {m:35s} failures={f} lat med={med:.2f}s max={mx:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
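
The scoring rule used by the harness (count how many assistant replies match a detector, fail when the count exceeds the allowed maximum) can be tried in isolation without a running model. The sketch below is self-contained and uses a shortened version of the `ASK_REASON` pattern; `count_failures` and the sample `replies` are hypothetical and written only for illustration.

```python
import re

# Shortened detector; the harness above uses a longer alternation.
ASK_REASON = re.compile(r"what brings you|what would you like to be seen", re.I)


def count_failures(replies, checks):
    """Return one message per check whose match count exceeds its allowed maximum."""
    fails = []
    for label, pattern, max_ok in checks:
        hits = sum(1 for reply in replies if pattern.search(reply))
        if hits > max_ok:
            fails.append(f"{label} ({hits}x, max {max_ok})")
    return fails


replies = [
    "Happy to help. What brings you in today?",
    "I see. And what brings you in?",
    "Which city is most convenient for you?",
]
fails = count_failures(replies, [("re-asked reason", ASK_REASON, 1)])
```

Because each reply is counted at most once per detector, a maximum of 1 allows the first legitimate ask and flags only the repeat, which is how the re-ask checks above are configured.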