CLAUDE.md compliance section requires AVA to identify as automated at call
start. Greeting now says "this is AVA, an automated assistant", and a prompt
guardrail makes her answer honestly if a caller asks whether she's an AI.
Replay harness greeting kept in sync.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Historical calls showed the 8B re-asking for name/reason/phone it already had
("I already gave you my full name", the "I want an appointment" -> "what brings
you in?" loop) and VAD splitting one utterance into consecutive user turns.
- callstate.py: CallStateGroomer between agg.user() and the LLM. After each
agent turn (off the critical path) it extracts collected slots via one short
JSON-mode Ollama pass, then before each generation injects an ALREADY
COLLECTED / STILL NEEDED checklist into the system message and merges
VAD-fragmented consecutive user messages. Callback-type calls get an explicit
"no booking questions" line. CALL_STATE_TRACKING env (auto: on for ollama,
off for anthropic).
- bot.py prompt step 1: "I want an appointment" is the booking intent, not the
reason - ask the visit reason once, never twice.
- scripts/ab_replay.py: regression harness replaying the real failed calls.
llama3.1-8b raw = 3 failures; with CALL STATE = 0 failures across all
scenarios (chat latency 0.31s -> 0.55s median, well under the 3s gate).
Qwen3-14B A/B'd and rejected: no better raw, ~3s/turn, 11GB VRAM.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parse the per-call outcome (appointment / callback / none / skipped / incomplete)
from the new "Post-call <kind> saved" / "no actionable request" / "skipping card"
log lines. Adds a per-call type column, a "By outcome type" tally, and splits the
leads count into appointment + callback — so a mixed test batch is easy to verify.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Capacity gating verified deterministically: atomic _reserve_call_slot grants
exactly MAX_CONCURRENT_CALLS (2), refuses the 3rd, frees on hangup, and 10
simultaneous attempts grant only 2 (no race); /voice returns BUSY + Hangup at
cap. Marked the gate item done (end-to-end 3-phone test optional).
Add scripts/score_calls.py: grades recent calls from the server log against the
Phase 1 gate (turns, latency LLM->TTS, AVC-side hangup, leads, watchdog
re-prompts, errors) — for scoring the 10-call run once placed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>