A caller gave their insurance; AVA replied with a bare acknowledgment ("staff
will verify your coverage") and stopped, with no follow-up question. Both sides
then waited -> dead air (pipeline idle, no GPU/LLM activity, matching flat
memory/wattage). Caller had to break the silence with "what questions do you
have?". Root cause: the one-sentence brevity rule made AVA end a booking turn on
a dead-end statement.
Fix: prompt now requires, until the booking is complete, that every turn end
with the next question — acknowledgment + next question in the same turn (e.g.
insurance ack -> immediately ask day/time). Verified 4/4. Documented in CLAUDE.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# AVC Phone Agent — Project Specification

> Claude Code authoritative reference. All architecture, security, and build decisions live here.
> Repo: `git.activeblue.net/tocmo0nlord/avc-phone-ai`
> Last updated: 2026-06-25 | Active Blue LLC

---

## Project Overview

**Name:** AVC Phone Agent
**Owner:** Active Blue LLC
**Client:** Advanced Vision Care (AVC) — multi-location ophthalmology/optometry practice (FL + TX)
**Agent name:** AVA (Advanced Vision Assistant)
**Purpose:** Automated AI phone agent that answers patient calls, books tentative appointments
into Odoo CRM with call recordings and transcripts attached, and self-improves via
Claude-powered transcript monitoring and a fine-tuning feedback loop.

---

## Existing Codebase — What to Keep, What to Change

The previous build at `/home/tocmo0nlord/avc-phone/` is a working foundation.
**Do not rewrite what works.** Apply only the changes documented in this section.

### Files and their status

| File | Status | Action |
|------|--------|--------|
| `bot.py` | Keep as-is | Whisper STT retained (real-time). Deepgram evaluated and rejected — see Change 1 |
| `server.py` | Keep as-is | Twilio Auth Token retained. API Key swap evaluated and rejected — see Change 2 |
| `practice.py` | Keep as-is | No changes |
| `extract.py` | Keep as-is | No changes |
| `odoo_client.py` | Keep as-is | Already uses API key auth correctly |

### What is already solved — do not touch

**`EndCallProcessor` in `bot.py`** — AVC-side call termination is fully implemented.
It watches the LLM text stream for closing keywords ("goodbye"), waits for TTS to finish via
`BotStoppedSpeakingFrame`, pauses `HANGUP_DELAY_SECS` (default 4s) so the caller isn't
clipped, then pushes `EndTaskFrame` upstream. `TwilioFrameSerializer` with `auto_hang_up`
drops the carrier leg. Verified working in the Phase 1 gate (4/4 clean hang-ups).

It also **deterministically guarantees the callback number is confirmed** on booking calls:
the 8B reads the number back only ~half the time, so if a closing is reached on a booking
call (booking keyword seen) without the agent having spoken the number (`phone_marker` not
seen in its replies), the hang-up is suppressed and a scripted confirmation line
(`phone_confirm_line`, the caller-ID spelled out) is injected as a `TTSSpeakFrame` first.
The agent's own readback satisfies the gate, so there's no double-ask in the common case;
info-only calls (no booking keyword) are never asked for a number.

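
A minimal sketch of the gate decision described above, written as a pure function so it can be unit-tested without a pipeline. The name `needs_phone_confirmation` and its three boolean inputs are hypothetical; the real check lives inside `EndCallProcessor` and works on Pipecat frames.

```python
def needs_phone_confirmation(
    closing_reached: bool,
    booking_keyword_seen: bool,
    phone_marker_seen: bool,
) -> bool:
    """Return True when the hang-up should be held back so the scripted
    confirmation line can be spoken first (hypothetical helper)."""
    if not closing_reached:
        return False  # nothing to gate until a closing keyword is detected
    if not booking_keyword_seen:
        return False  # info-only call: never ask for a number
    # Booking call: confirm only if the agent has not already read the number back.
    return not phone_marker_seen
```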

**Mulaw 8kHz ↔ 16kHz conversion** — handled internally by `TwilioFrameSerializer`.
`PIPELINE_SAMPLE_RATE = 16000`, `WIRE_SAMPLE_RATE = 8000` are already set correctly.
No custom audio module needed.

**VAD tuned for telephony** — `confidence=0.5`, `min_volume=0.3` already loosened from
desktop defaults. These settings directly address the repeat-yourself problem on the
VAD side.

**Capacity gating** — `MAX_CONCURRENT_CALLS=2` with atomic slot reservation in
`server.py` prevents GPU thrashing. Keep it.

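
For illustration only, one way atomic slot reservation can be done in an async server is a counter guarded by an `asyncio.Lock`. The `CallSlots` class below is a hypothetical sketch, not the code in `server.py`; it shows the reserve/reject/release behaviour that the capacity gate relies on.

```python
import asyncio


class CallSlots:
    """Hypothetical slot counter: reserve-or-reject under a lock."""

    def __init__(self, max_calls: int = 2) -> None:
        self._max = max_calls
        self._active = 0
        self._lock = asyncio.Lock()

    async def try_reserve(self) -> bool:
        # Check and increment happen under one lock, so two callers
        # cannot both take the last free slot.
        async with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    async def release(self) -> None:
        async with self._lock:
            self._active = max(0, self._active - 1)

    @property
    def active(self) -> int:
        return self._active


async def _demo() -> tuple[list[bool], int]:
    slots = CallSlots(max_calls=2)
    # Third reservation is the busy-reject path.
    results = [await slots.try_reserve() for _ in range(3)]
    await slots.release()
    return results, slots.active


demo_results, demo_active = asyncio.run(_demo())
```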

**`AudioHeartbeat`** — diagnostic processor that distinguishes VAD failure from
transport stall. Keep it.

**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends.
Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model
output, falls back to JSONL if Odoo is unreachable. Keep it.

**Odoo integration (`odoo_client.py`)** — already uses `ODOO_API_KEY` for XML-RPC auth,
not password. Correct pattern. No changes.

**`SpokenKokoroTTSService` in `bot.py`** — number normalization for speech. Kokoro reads
raw digit strings as cardinals with symbols spoken aloud ("983-4969" → "nine hundred
eighty-three dash forty-nine sixty-nine"). This subclass normalizes the text in `run_tts`
(which receives the full sentence) so US phone numbers and 4–5 digit runs (street numbers,
zips) are spoken one digit at a time — country code dropped, no "dash"/parens; dates and
times left natural ("Monday the fifth", "three thirty"). It also respells the all-caps agent
name to `AGENT_NAME_SPOKEN` (Kokoro reads "AVA" as "A-V-A"; set to "Eva" so it says
"EE-vuh"). Deterministic, so it's robust to whatever the model emits. Keep it.
`tts_normalize()` holds the rules.

> Note: don't rely on the model to read raw digits — it mangles them (it emitted
> "197-three five seven three…" once). The caller-ID is injected into the prompt **already
> spelled out** so AVA just repeats clean words; `tts_normalize` is the backstop for any
> other numbers.

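
A rough sketch of the digit-by-digit idea, assuming simple regular expressions are enough. This is not the contents of `tts_normalize()`: the function names, the patterns, and the comma between digit groups (added here as a pause hint) are all assumptions. It also spells out any bare 4–5 digit run, including a year such as "2026", which the real rules may treat differently.

```python
import re

_DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]

# Optional +1 / 1 country code, optional area code, then 3 + 4 digits,
# with common separators (space, dot, dash, parentheses).
_PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(?(\d{3})\)?[\s.-]?)?(\d{3})[\s.-]?(\d{4})(?!\d)"
)
# Stand-alone 4-5 digit runs (street numbers, zips); skips clock-style "12:30".
_SHORT_RUN_RE = re.compile(r"(?<![\d:])\d{4,5}(?![\d:])")


def spell_digits(digits: str) -> str:
    """'983' -> 'nine eight three'."""
    return " ".join(_DIGIT_WORDS[int(d)] for d in digits)


def normalize_numbers(text: str) -> str:
    """Spell phone numbers and short digit runs one digit at a time."""

    def _phone(match: re.Match) -> str:
        # Country code is outside the capture groups, so it is dropped.
        groups = [g for g in match.groups() if g]
        return ", ".join(spell_digits(g) for g in groups)

    text = _PHONE_RE.sub(_phone, text)
    return _SHORT_RUN_RE.sub(lambda m: spell_digits(m.group()), text)
```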

---

## Change 1 — Real-time STT stays on Whisper (`bot.py`)

**Decision (2026-06-25): keep Whisper. Deepgram Nova-2 was evaluated and rejected.**

Deepgram Nova-2 was trialed to cut STT latency (Whisper buffers ~1-3s before the LLM
sees input). The swap was applied and then reverted — the project stays on local
faster-whisper. No external STT dependency, no per-minute STT cost, and no audio
leaving the box (HIPAA posture). Latency is instead managed via VAD tuning and the
`medium` model on the RTX 5080.

**Current `bot.py` STT (in place — do not change):**

```python
import os

from pipecat.services.whisper.stt import WhisperSTTService

WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "medium")      # tiny|base|small|medium
WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cuda")      # cuda for the 5080
WHISPER_COMPUTE = os.environ.get("WHISPER_COMPUTE", "float16")
WHISPER_HOTWORDS = os.environ.get("WHISPER_HOTWORDS", "...")   # domain vocab bias

# HintedWhisperSTTService wraps WhisperSTTService to inject faster-whisper `hotwords`
# (office cities + optometry terms) per call. Instantiated in run_agent():
stt = HintedWhisperSTTService(
    settings=WhisperSTTService.Settings(model=WHISPER_MODEL),
    device=WHISPER_DEVICE,
    compute_type=WHISPER_COMPUTE,
    hotwords=WHISPER_HOTWORDS,
)
```

**Note:** Whisper large-v3 also serves post-call transcription in Phase 3
(`recording/transcriber.py`). If real-time latency proves unacceptable in the Phase 1
gate, revisit a streaming STT then — but do not reintroduce the dependency speculatively.

---

## Change 2 — Twilio webhook auth stays on the Auth Token (`server.py`)

**Decision (2026-06-25): keep `TWILIO_AUTH_TOKEN`. The API Key swap was evaluated and rejected.**

A Standard API Key (scoped, revocable) was trialed in place of the account Auth Token,
but it **cannot do what this server needs**: Twilio signs inbound webhooks
(`X-Twilio-Signature`) with the account **Auth Token** — an API Key Secret cannot validate
that signature, so `TWILIO_VALIDATE=true` would reject every legitimate `POST /voice`
(403). The `TwilioFrameSerializer` auto-hang-up also expects the account/Auth-Token
credential pair. The swap was reverted.

**Credential model (in place):**

```
Twilio Account SID  (not secret on its own)
└── Auth Token      (TWILIO_AUTH_TOKEN — validates webhooks + REST/auto-hang-up)
```

Treat the Auth Token as a password: keep it only in `.env` (never committed), rotate on
any suspected leak / team departure / quarterly. If finer-grained scoping is ever
required, the correct design is a *hybrid* — Auth Token for `X-Twilio-Signature`
validation, an API Key (SK SID + Secret) only for outbound REST — not a wholesale swap.

**Current `server.py` (in place — do not change):**

```python
TWILIO_ACCOUNT_SID = os.environ.get("TWILIO_ACCOUNT_SID")
TWILIO_AUTH_TOKEN = os.environ.get("TWILIO_AUTH_TOKEN")

# _twilio_signature_ok(): HMAC-SHA1 keyed by the Auth Token (what Twilio signs with)
digest = hmac.new(TWILIO_AUTH_TOKEN.encode(), payload.encode("utf-8"), hashlib.sha1).digest()

# Validation gate + warning
if TWILIO_VALIDATE and TWILIO_AUTH_TOKEN:
    ...
elif not TWILIO_AUTH_TOKEN:
    logger.warning("/voice signature validation DISABLED (no TWILIO_AUTH_TOKEN set)")

# Serializer auto-hang-up uses the account SID + Auth Token pair
serializer = TwilioFrameSerializer(
    stream_sid=stream_sid,
    call_sid=call_sid,
    account_sid=TWILIO_ACCOUNT_SID,
    auth_token=TWILIO_AUTH_TOKEN,
)
```

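
To make the excerpt above concrete, here is a standalone sketch of the signature scheme Twilio documents for form-encoded webhooks: the full request URL, followed by each POST parameter name and value in key-sorted order, is signed with HMAC-SHA1 using the Auth Token and then base64-encoded. The function names are hypothetical and this is not the body of `_twilio_signature_ok()`. Note that the URL must be the public one Twilio called, which matters behind a reverse proxy.

```python
import base64
import hashlib
import hmac


def expected_twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the value Twilio would send in X-Twilio-Signature."""
    # URL + key1value1key2value2..., with keys sorted alphabetically.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def signature_ok(auth_token: str, url: str, params: dict[str, str], header_value: str) -> bool:
    """Constant-time comparison against the received header."""
    expected = expected_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_value)
```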

**Auth Token rotation procedure:**

1. Generate a new primary Auth Token in the Twilio console (use the secondary-token flow)
2. Update `TWILIO_AUTH_TOKEN` in `.env`
3. Restart the service — no rebuild needed
4. Verify one test call succeeds (signature validation + auto-hang-up both rely on it)
5. Retire the old token in the Twilio console

Rotate on: any suspected leak, any team member departure, quarterly as routine.

---

## Change 3 — `.env`

No swap. `.env` keeps `TWILIO_AUTH_TOKEN` and the Whisper STT vars; there is **no**
`TWILIO_API_KEY_*` or `DEEPGRAM_*` (those were trialed and removed with Changes 1/2).

**Full `.env` reference:**

```env
# Twilio — Auth Token validates webhooks + drives auto-hang-up. Never committed.
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=
TWILIO_PHONE_NUMBER=+1...
TWILIO_VALIDATE=true

# STT: Whisper (faster-whisper, real-time in-call; large-v3 also used post-call in Phase 3)
WHISPER_MODEL=medium
WHISPER_DEVICE=cuda
WHISPER_COMPUTE=float16

# LLM: Ollama
OLLAMA_URL=http://127.0.0.1:11434/v1
OLLAMA_MODEL=activeblue-avc:latest
LLM_PROVIDER=ollama
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=160

# Anthropic (optional LLM swap + monitoring + synthetic data)
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=claude-sonnet-4-6

# TTS: Kokoro
KOKORO_VOICE=af_heart
KOKORO_MODEL_DIR=/home/tocmo0nlord/pipecat-run/models

# Odoo
ODOO_URL=https://avc.activeblue.net
ODOO_DB=avc
ODOO_USER=
ODOO_API_KEY=
ODOO_TARGET=crm
ODOO_STAGE_ID=
ODOO_TEAM_ID=
ODOO_USER_ID=

# Server
PUBLIC_HOST=avc-phone.activeblue.net
PORT=8200
BIND_HOST=127.0.0.1
MAX_CONCURRENT_CALLS=2
STREAM_TOKEN=

# Call behaviour
AGENT_NAME=AVA
AGENT_NAME_SPOKEN=Eva        # how the name is pronounced in TTS (logs/Odoo keep AGENT_NAME)
HANGUP_DELAY_SECS=4.0        # grace pause after the goodbye before dropping the carrier leg
ENABLE_TOOLS=
VAD_CONFIDENCE=0.5
VAD_MIN_VOLUME=0.3
VAD_START_SECS=0.2
VAD_STOP_SECS=0.5

# Monitoring (Phase 4)
MONITORING_ENABLED=true
MONITORING_SCHEDULE=0 2 * * *

# A/B model routing (Phase 5 only)
AB_SPLIT_PERCENT=0
AB_MODEL_B=
```

---

## Call Workflow

AVA runs a directed script (system prompt in `bot.py`) — warm but direct, one short turn at a
time, leading the call rather than waiting on the caller. Fixed order:

1. **Reason first** — find out what they're calling about (visit reason, or just a question → answer it).
2. **Location** — ask city/area, confirm the matching office (don't offer others — see office rule).
3. **Caller info** — full name (ask last name if only a first is given), then **address the caller
   by name** from there on; insurance (log only); preferred day/time in their words.
4. **Verify phone** — near the end, state the caller-ID back in one line ("I have your number
   as <number> — is that the best number?"), no asking permission first; if not, use the number
   they give. Never raised earlier. **Backed by a deterministic safety net** — if the agent
   skips it, `EndCallProcessor` injects the confirmation before hang-up (see "already solved").
5. **Wrap up** — recap the booking **as a REQUEST** by name ("I've noted your request to come
   in…"), make clear staff will call to confirm, then ask **"Is there anything else I can help
   you with?"**

**Never claims a booking:** AVA must never say an appointment is "booked / scheduled / set /
confirmed" — everything is a request staff confirm on callback. **Insurance:** never say "we
accept/take" a plan (or invent one) — just note what the caller said; staff verify.

**Keep momentum (prevents mid-call silence):** until the booking is complete, every turn ends
with the next question. A bare acknowledgment with no question (e.g. just "staff will verify
your coverage") deadlocks the call — both sides wait, the pipeline goes idle (no GPU/LLM
activity), and the caller hears silence. So acknowledgment + next question go in the same turn.

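
A small sketch of how the momentum rule could be checked offline, for example against transcript fixtures. `ends_with_question` and `is_dead_end` are hypothetical helpers, not functions in `bot.py`; in the live system the rule is enforced by the prompt, and ending on a question mark is only a rough proxy for "asks the next question".

```python
def ends_with_question(reply: str) -> bool:
    """True if the reply's last sentence is a question (ignoring closing quotes)."""
    return reply.rstrip().rstrip("\"')").endswith("?")


def is_dead_end(reply: str, booking_complete: bool) -> bool:
    """Flag a turn that would leave both sides waiting.

    Before the booking is complete, every agent turn should end on a
    question; afterwards a plain statement is acceptable.
    """
    return not booking_complete and not ends_with_question(reply)
```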

**Closing is gated:** the word "Goodbye" ends the call (triggers `EndCallProcessor` → hang-up),
so it is never said in the same turn as confirming details and never before the anything-else
question — only after the caller says they need nothing more.

> Reliability: the script is prompt-driven on the local 8B (order followed well, not perfectly;
> it can re-ask a last name). The phone-confirmation step is the exception — it's now
> **guaranteed** by the deterministic `EndCallProcessor` safety net.

## Call Data Capture

What AVA collects on a booking call and how it's logged. Driven by the system prompt
(`bot.py`); persisted by the post-call extractor (`extract.py` → `practice.py` → Odoo lead).
Replies are kept to one short sentence.

### The six captured fields

| Field | In-call behavior | Logged as |
|-------|------------------|-----------|
| Full name | Asks for last name if only a first is given | `patient_name` / lead `contact_name` |
| Phone | Confirmed **near the end** (not led with); reads back the caller-ID — injected pre-spelled so it's said digit-by-digit — and if the caller declines, uses the number they give | `callback_number` (+ `phone_confirmed`) |
| Office / city | Asks city/area; when the caller names a place that matches an office, **confirms that office and moves on** — never offers/compares other offices or asks them to choose; names the nearest only if nothing matches | folded into `reason` prefix |
| Reason | Captured from the conversation | `reason` |
| Insurance | **Log only, never suggest or guess** — asks open-endedly (no plan names read out), captures only what the caller says, never fills in/completes/guesses the plan (asks to repeat if unclear), never says "we accept/take" a plan, never promises/confirms/denies coverage or treatment even for a listed plan; staff verify on callback | `insurance` (note: "log only — staff to verify") |
| Preferred day & time | **Capture & defer** — taken in the caller's own words; AVA does not compute or correct the date | `preferred_time` + best-effort resolved `YYYY-MM-DD` |

### Dates — capture & defer (do NOT compute in-call)

AVA takes the day/time in the caller's **own words** ("next Monday", "the fifth") and tells
them staff will confirm the exact date on the callback. It must NOT work out, state, or correct
the calendar date, and must never argue about what today's date is.

**Why:** an earlier build injected a 45-day calendar and had AVA validate/correct dates in-call.
A real call derailed badly — AVA argued about the date, parroted the canned example, and (with
the model's weak adherence) also hallucinated availability. Tested directly, the local 8B model
computed appointment dates wrong in essentially every attempt (~5/5), so stating a computed
date is a liability. The calendar injection and in-call validation were removed (commit on
2026-06-25).

The post-call extractor still gets today's date and records a **best-effort** `resolved_date`
for staff convenience — it is staff-verified on callback, not authoritative, and never spoken to
the caller. A deterministic (non-LLM) date resolver is the right hardening if accuracy is needed;
tracked as a later item. Also: never claim appointment availability or that a slot is open —
everything is a request staff confirm.

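
As a starting point for the deterministic resolver mentioned above, here is a minimal sketch that handles weekday names only. `resolve_weekday` is a hypothetical function, and its semantics are an assumption: it returns the next occurrence strictly after `today`, treats "next Monday" the same as "Monday", and returns `None` for anything else (such as "the fifth") so the phrase is left for staff.

```python
from datetime import date, timedelta
from typing import Optional

_WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def resolve_weekday(phrase: str, today: date) -> Optional[date]:
    """Map the first weekday name in `phrase` to a calendar date after `today`."""
    for word in phrase.lower().replace(",", " ").split():
        if word in _WEEKDAYS:
            target = _WEEKDAYS.index(word)          # Monday == 0, as in date.weekday()
            days_ahead = (target - today.weekday()) % 7
            if days_ahead == 0:
                days_ahead = 7                      # same weekday means a week out
            return today + timedelta(days=days_ahead)
    return None                                     # unresolved: leave for staff
```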

---

## Model Configuration

### Current production model: `activeblue-avc:latest`

| Property | Value | Notes |
|----------|-------|-------|
| Base | `llama3.1:8b-instruct-q4_K_M` | Llama 3.1 8B, Q4_K_M quantization |
| ID | `366a6cc15bb7` | Rebuilt clean 2026-06-23 |
| Size | 4.9GB | Down from 8.7GB Q8_0 |
| VRAM usage | ~4.5GB | Leaves 11.5GB headroom on RTX 5080 |
| Context | 8192 tokens | Raised from 4096 (2026-06-25) so long calls don't overflow mid-call — see note below |
| Temperature | 0.3 | Low — maximizes JSON schema compliance |
| Top-p | 0.9 | Standard |
| Adapter | None | 44-pair LoRA adapter discarded |

### Modelfile (rebuild reference)

```
FROM llama3.1:8b-instruct-q4_K_M

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
PARAMETER top_p 0.9

TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>
{{- end }}<|start_header_id|>assistant<|end_header_id|>
"
```

### Why num_ctx 8192 (was 4096) — fixes mid-call silence

Symptom: on longer calls AVA would go silent / stop replying partway through. Cause: the
system prompt + a growing multi-turn transcript exceeded the 4096-token window mid-call, so
Ollama truncated and re-evaluated the whole context every turn (cache miss) → multi-second
stalls = dead air. The capture changes made it worse by briefly injecting a 45-day calendar
(~600 tok/turn) — that injection was removed; raising num_ctx to 8192 gives long calls real
headroom (RTX 5080 has the VRAM). Rebuild keeps the previous model as `activeblue-avc:pre-ctx8k`
for rollback. Keep the live system prompt lean for the same reason.

### Latency note — model is pinned warm

Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while
transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads** —
Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid
a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with
`keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some
later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the
bottleneck (STT model `medium` is for accuracy, not latency).

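
A sketch of what a warm-and-pin call can look like, assuming Ollama's native `/api/generate` endpoint, which loads a model when sent a request with no prompt and keeps it resident when `keep_alive` is `-1`. The helper names are hypothetical and this is not the `lifespan` handler from `server.py`; it uses only the standard library so it can be run on its own. Note that the native endpoint sits at the host root, not under the `/v1` path used for the OpenAI-compatible API.

```python
import json
import urllib.request


def build_warmup_request(base_url: str, model: str) -> urllib.request.Request:
    """Build the POST that loads `model` and pins it in memory."""
    body = json.dumps({"model": model, "keep_alive": -1}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_model(base_url: str, model: str, timeout: float = 60.0) -> int:
    """Send the warm-up request and return the HTTP status code."""
    request = build_warmup_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```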
### Why Q4_K_M not Q8_0

Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused
inference latency spikes. Q4_K_M cuts weight VRAM to ~4.5GB with negligible quality
difference at 8B scale.

### Why no adapter

The 44-pair LoRA adapter was adding noise, not signal. Minimum viable dataset is 200+ pairs
per intent category. It will be rebuilt properly in Phase 5 with 500+ pairs in JSON output format.

### Ollama inventory (current)

```
activeblue-avc:latest          366a6cc15bb7   4.9GB   production
llama3.1:8b-instruct-q4_K_M    46e0c10c039e   4.9GB   base
nomic-embed-text:latest        0a109f422b47   274MB   embeddings
```

### Phase 5 training note

Axolotl pulls from HuggingFace in safetensors format, not Ollama GGUF:

```bash
# Phase 5 only — do not run now
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
# ~16GB on disk, separate from Ollama storage
```

---

## Build Phases

Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.

### Phase 1 — Reliable call loop

**Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not
the caller.

- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
- [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
- [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
- [x] `AudioHeartbeat` diagnostic logging — active (`[audio-in]` ticks ~every 5s)
- [ ] `MAX_CONCURRENT_CALLS` capacity gating — NOT yet tested (slot reserve/release works; the busy-reject path needs 3 concurrent calls)

**Gate — status:**

1. ⏳ 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
2. ✅ Zero zombie pipeline instances — single process, slots release to `0/2` each call (`ps`/`pgrep`; bare process, not Docker).
3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.

**Still needs live testing before Phase 1 is signed off:** capacity gating (3 concurrent calls), a clean 10-call consecutive run, and a latency re-measure now that the model is pinned.

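
For the latency re-measure, a small helper along these lines can turn a batch of per-turn timings into the two figures the gate quotes. It is a sketch: `latency_summary` and `gate_passes` are hypothetical names, the 3.0 s limit mirrors gate item 5, and the "inclusive" percentile method is one reasonable choice among several.

```python
import statistics


def latency_summary(samples_s: list[float]) -> dict[str, float]:
    """Return median and P95 (seconds) for a batch of per-turn latencies."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {"median": statistics.median(samples_s), "p95": cuts[94]}


def gate_passes(samples_s: list[float], p95_limit_s: float = 3.0) -> bool:
    """True when the measured P95 is strictly under the limit."""
    return latency_summary(samples_s)["p95"] < p95_limit_s
```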
### Phase 1 — refinements since the revert

Beyond the three reverted changes, the following hardening is live (see git history):

- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
- **LLM warmup/pin** — `server.py` `lifespan` handler pins the model with `keep_alive=-1` on startup so the first call turn isn't a cold reload (`ollama ps` → UNTIL = Forever).
- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
- **Office selection** — confirm the matching office; never offer/compare others.

### Phase 2 — Accuracy (RAG + validation)

- [ ] Populate `rag/data/*.jsonl` with real AVC data (human task — see RAG section)
- [ ] ChromaDB RAG retriever wired into pipeline
- [ ] Response validator: JSON schema + factual cross-check + PHI leak scan
- [ ] Keyword blocklist (uncertainty phrases → handoff)
- [ ] Intent classifier routing
- [ ] Turn counter: max 3 failed turns before forced handoff + termination

**Gate:** 20 manual test calls, zero hallucinations on AVC-specific facts

### Phase 3 — Booking

- [ ] Real-time calendar availability check (`odoo/calendar.py`)
- [ ] Whisper large-v3 post-call transcription (`recording/transcriber.py`)
- [ ] Recording + transcript attached to Odoo lead chatter
- [ ] Staff review flow confirmed in Odoo

**Gate:** Staff receives, reviews, and confirms a lead end-to-end

### Phase 4 — Monitoring

- [ ] Transcript index (`recordings/index.jsonl`)
- [ ] Claude monitoring job
- [ ] Dashboard: toggle, alert queue, one-click apply, playback, quality tagging

**Gate:** First monitoring run produces actionable suggestions

### Phase 5 — Fine-tuning

- [ ] Pull HuggingFace base (see model section)
- [ ] Synthetic data generation via Claude API in JSON output format
- [ ] Real call exporter using staff quality tags
- [ ] Axolotl QLoRA on RTX 5080
- [ ] Model registry + versioning + A/B routing

**Gate:** New model outperforms baseline over 50+ calls

---

## Repository Structure

```
avc-phone-ai/
├── CLAUDE.md                  ← this file
├── README.md
├── .env                       ← never committed
├── .env.example
├── .gitignore                 ← includes .env, recordings/, *.gguf
│
├── bot.py                     ← Pipecat pipeline (Phase 1 changes here)
├── server.py                  ← Twilio webhook server (Phase 1 changes here)
├── practice.py                ← AVC facts + Odoo persistence
├── extract.py                 ← post-call appointment extraction
├── odoo_client.py             ← Odoo XML-RPC client
│
├── rag/                       ← Phase 2
│   ├── store.py
│   ├── loader.py
│   ├── retriever.py
│   └── data/
│       ├── avc_locations.jsonl
│       ├── avc_providers.jsonl
│       ├── avc_services.jsonl
│       ├── avc_hours.jsonl
│       ├── avc_insurance.jsonl
│       └── avc_faqs.jsonl
│
├── recording/                 ← Phase 3
│   ├── transcriber.py         ← Whisper large-v3 post-call only
│   └── storage.py
│
├── monitoring/                ← Phase 4
│   ├── monitor.py
│   ├── analyzer.py
│   ├── diff_engine.py
│   ├── scheduler.py
│   └── dashboard/
│       ├── app.py
│       └── static/
│
├── training/                  ← Phase 5 stub
│   └── README.md
│
├── tests/
│   ├── test_bot.py
│   ├── test_server.py
│   ├── test_odoo_client.py
│   ├── test_extract.py
│   └── fixtures/
│       └── sample_transcripts.jsonl
│
├── scripts/
│   ├── deploy.sh
│   └── smoke_test.sh
│
├── avc-phone.service          ← existing systemd unit
└── traefik-avc-phone.yml      ← existing Traefik config
```

---

## Infrastructure

| Component | Host | Address | Notes |
|-----------|------|---------|-------|
| Pipecat pipeline | `miaai` | `10.10.1.221` | Python async, systemd |
| Ollama LLM | `miaai` | `http://127.0.0.1:11434/v1` | `activeblue-avc:latest` |
| ChromaDB (Phase 2) | `miaai` | `http://10.10.1.221:8001` | Docker volume |
| Twilio webhook | `miaai` | `https://avc-phone.activeblue.net` | Traefik + Let's Encrypt |
| Monitoring dashboard | `miaai` | `https://avc-monitor.activeblue.net` | internal only |
| Odoo CRM | — | `https://avc.activeblue.net` | XML-RPC, db: `avc` |
| Recordings | `miaai` | `/home/tocmo0nlord/avc-phone/recordings/` | local only |
| Gitea | — | `https://git.activeblue.net/tocmo0nlord/avc-phone-ai` | user: `tocmo0nlord` |

---

## RAG Store (Phase 2)

**Stack:** ChromaDB + `nomic-embed-text:latest` (already in Ollama)
**Collection:** `avc_knowledge`
**Retrieval:** Top-3 chunks per query on caller's current turn only

### JSONL record format

```json
{
  "id": "hours-kendall-weekday",
  "text": "The Kendall location is open Monday through Friday 8:00 AM to 5:00 PM.",
  "tags": ["hours", "kendall"],
  "last_updated": "2026-06-23"
}
```

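
A minimal sketch of validating one line of a `rag/data/*.jsonl` file against the record format above before it is loaded. `parse_record` and `REQUIRED_FIELDS` are hypothetical names and `rag/loader.py` may do this differently; the checks simply mirror the four fields shown.

```python
import json
from datetime import date

# Field name -> expected JSON type, taken from the record format above.
REQUIRED_FIELDS = {"id": str, "text": str, "tags": list, "last_updated": str}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and reject records that do not match the format."""
    record = json.loads(line)
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(key), expected_type):
            raise ValueError(f"{key!r} missing or not {expected_type.__name__}")
    # Raises ValueError if last_updated is not an ISO date (YYYY-MM-DD).
    date.fromisoformat(record["last_updated"])
    return record
```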
### Data files — populated before Phase 2, not before Phase 1

| File | Content |
|------|---------|
| `avc_locations.jsonl` | Address, phone, fax, parking per location |
| `avc_providers.jsonl` | Name, title, specialty, locations, languages |
| `avc_services.jsonl` | Exam types, procedures |
| `avc_hours.jsonl` | Hours per location, holiday closures, after-hours |
| `avc_insurance.jsonl` | Accepted plans per location |
| `avc_faqs.jsonl` | Approved Q&A pairs |

**Note:** `practice.py` already contains real AVC location and insurance data scraped
from `advancedvisioncareflorida.com`. Use it as the seed for the JSONL files rather
than starting from scratch.

---

## Claude Monitoring (Phase 4)

### What it analyzes

- Facts stated by AVA contradicting RAG store
- System prompt violations
- Calls that should have been handoffs
- High failed turn counts — model or prompt signal
- RAG gaps (AVA said "I don't have that" — should it be added?)
- Phrasing that caused caller confusion

### Output schema

```json
{
  "call_sid": "CA...",
  "severity": "high",
  "issue_type": "factual_error",
  "description": "AVA stated Kendall closes at 6pm. RAG store says 5pm.",
  "suggested_action": "rag_update",
  "suggested_change": {
    "file": "rag/data/avc_hours.jsonl",
    "record_id": "hours-kendall-weekday",
    "field": "text",
    "old": "...open until 6pm...",
    "new": "...open until 5pm..."
  }
}
```

`suggested_action`: `rag_update` | `prompt_change` | `blocklist_add` | `flag_for_review`

### Dashboard

FastAPI + HTML/JS at `https://avc-monitor.activeblue.net` (internal only).

| Feature | Description |
|---------|-------------|
| Enable/disable toggle | Pauses scheduler without redeployment |
| Alert queue | Suggestions sorted by severity |
| One-click apply | Applies change, commits via Gitea API to `avc-phone-ai` |
| Call playback | Audio + transcript side-by-side |
| Quality tagging | Staff tags calls from dashboard |
| Manual trigger | `POST /monitor/run` |

---

## Fine-Tuning Pipeline (Phase 5 — stub)

> Not scaffolded until Phase 4 complete and monitoring has run minimum two weeks.
> See `training/README.md` — populated at Phase 5 start.

- Synthetic data: Claude API generates Q&A in JSON output format — schema not style
- Real calls: staff-tagged `"good"` + corrected bad calls
- Target: 500+ pairs per intent before first Axolotl run
- QLoRA via Axolotl on RTX 5080, base: HuggingFace `meta-llama/Llama-3.1-8B-Instruct`
- Versioned Ollama models: `activeblue-avc:vN`
- A/B routing: promote when new version wins on booking + hallucination rate over 50+ calls

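
One possible shape for the A/B router, sketched under the assumption that `AB_SPLIT_PERCENT` is the share of calls sent to `AB_MODEL_B` and that assignment should be stable per call. `pick_model` is a hypothetical function; hashing the call SID is a design choice made here so that a given call always lands on the same model and results can be reproduced from logs.

```python
import hashlib


def pick_model(call_sid: str, split_percent: int, model_a: str, model_b: str) -> str:
    """Route a call to model A or B using a stable hash of its SID."""
    if not model_b or split_percent <= 0:
        return model_a  # routing disabled (the Phase 1-4 default)
    # First 4 bytes of SHA-256 give a uniform bucket in 0..99.
    digest = hashlib.sha256(call_sid.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return model_b if bucket < split_percent else model_a
```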

---

## HIPAA and Compliance

- AVA identifies as automated at call start — no exceptions
- No PHI in ChromaDB — practice information only
- Recordings on `miaai` only — no cloud storage
- Odoo API user: minimum permissions, not admin
- All endpoints HTTPS via Traefik
- `.env` never committed

---

## Deploy Script (`scripts/deploy.sh`)

```bash
#!/bin/bash
set -e
cd /home/tocmo0nlord/avc-phone
git pull origin main
pip install -r requirements.txt --quiet
systemctl restart avc-phone
systemctl status avc-phone --no-pager
echo "[deploy] Done."
```

---

## Development Conventions

- Python 3.13 (matches `miaai` miniconda environment)
- Async throughout — Pipecat is async-native
- `loguru` for all logging — already in use, keep consistent
- Structured log lines for all diagnostic events
- `python-dotenv` for local dev, env injection in prod
- Secrets never hardcoded
- Every module has `if __name__ == "__main__":` for isolated testing

---

## Key Dependencies (current)

```
pipecat-ai==1.3.0        # installed at /opt/miniconda3
faster-whisper           # real-time STT (already installed in pipecat-run venv)
kokoro-tts               # already installed
ollama                   # already installed
scipy / numpy            # already installed (pipecat deps)
chromadb                 # add for Phase 2
sentence-transformers    # add for Phase 2
anthropic                # for monitoring + optional LLM swap
openai-whisper           # large-v3 for post-call transcription (Phase 3)
fastapi / uvicorn        # already installed
loguru                   # already installed
httpx                    # already installed
```

---

## Open Items

- [ ] Confirm `TWILIO_AUTH_TOKEN` in `.env` is current (rotate if leaked/stale)
- [ ] Confirm `ODOO_STAGE_ID`, `ODOO_TEAM_ID`, `ODOO_USER_ID` from live `avc` db
- [ ] Confirm AVA voice — `af_heart` is current default, confirm with AVC before go-live
- [ ] Populate `rag/data/*.jsonl` before Phase 2 (seed from `practice.py` data)
- [ ] Define Odoo confirmed appointment flow: lead → opportunity → calendar event
- [ ] Staff training on monitoring dashboard quality tagging

---

*Active Blue LLC | git.activeblue.net/tocmo0nlord/avc-phone-ai*