# AVC Phone Agent — Project Specification

> Claude Code authoritative reference. All architecture, security, and build decisions live here.
> Repo: `git.activeblue.net/tocmo0nlord/avc-phone-ai`
> Last updated: 2026-06-25 | Active Blue LLC

---

## Project Overview

**Name:** AVC Phone Agent
**Owner:** Active Blue LLC
**Client:** Advanced Vision Care (AVC) — multi-location ophthalmology/optometry practice (FL + TX)
**Agent name:** AVA (Advanced Vision Assistant)
**Purpose:** Automated AI phone agent that answers patient calls, books tentative appointments into Odoo CRM with call recordings and transcripts attached, and self-improves via Claude-powered transcript monitoring and a fine-tuning feedback loop.

---

## Existing Codebase — What to Keep, What to Change

The previous build at `/home/tocmo0nlord/avc-phone/` is a working foundation. **Do not rewrite what works.** Apply only the changes documented in this section.

### Files and their status

| File | Status | Action |
|------|--------|--------|
| `bot.py` | Keep as-is | Whisper STT retained (real-time). Deepgram evaluated and rejected — see Change 1 |
| `server.py` | Keep as-is | Twilio Auth Token retained. API Key swap evaluated and rejected — see Change 2 |
| `practice.py` | Keep as-is | No changes |
| `extract.py` | Keep as-is | No changes |
| `odoo_client.py` | Keep as-is | Already uses API key auth correctly |

### What is already solved — do not touch

**`EndCallProcessor` in `bot.py`** — AVC-side call termination is fully implemented. Watches LLM text stream for closing keywords ("goodbye"), waits for TTS to finish via `BotStoppedSpeakingFrame`, pauses `HANGUP_DELAY_SECS` (default 4s) so the caller isn't clipped, then pushes `EndTaskFrame` upstream. `TwilioFrameSerializer` with `auto_hang_up` drops the carrier leg. Verified working in the Phase 1 gate (4/4 clean hang-ups).
It also **deterministically guarantees the callback number is confirmed** on booking calls: the 8B reads the number back only ~half the time, so if a closing is reached on a booking call (booking keyword seen) without the agent having spoken the number (`phone_marker` not seen in its replies), the hang-up is suppressed and a scripted confirmation line (`phone_confirm_line`, the caller-ID spelled out) is injected as a `TTSSpeakFrame` first. The agent's own readback satisfies the gate, so there's no double-ask in the common case; info-only calls (no booking keyword) are never asked for a number.

**Mulaw 8kHz ↔ 16kHz conversion** — handled internally by `TwilioFrameSerializer`. `PIPELINE_SAMPLE_RATE = 16000`, `WIRE_SAMPLE_RATE = 8000` are already set correctly. No custom audio module needed.

**VAD tuned for telephony** — `confidence=0.5`, `min_volume=0.15`, `start_secs=0.1` — kept sensitive so a quick/quiet "yes" isn't missed (a caller had to repeat it after the phone confirmation). This is safe **because `HalfDuplexGate` gates out the agent's echo while it speaks**, so sensitive VAD only listens hard during the caller's own turn and doesn't cause echo false-triggers. Addresses the repeat-yourself / missed-short-answer problem.

**Capacity gating** — `MAX_CONCURRENT_CALLS=2` with atomic slot reservation in `server.py` prevents GPU thrashing. Keep it.

**`AudioHeartbeat`** — diagnostic processor that distinguishes VAD failure from transport stall. Keep it.

**Call recording (`AudioBufferProcessor`)** — every call is saved to `recordings/_.wav` as **stereo** (caller = left, agent = right) for review/debugging. It sits at the end of the pipeline, so the caller channel is what the system *received* (post-`HalfDuplexGate`) — it does NOT capture caller audio that arrived while the agent was speaking (gated). `RECORD_CALLS=false` to disable. `recordings/` is gitignored.
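
The atomic slot reservation described under **Capacity gating** above can be illustrated with a minimal sketch. This is not the code in `server.py`: the class `CallSlots`, its method names, and the helper `simulate` are hypothetical, and the sketch assumes a single asyncio process where an `asyncio.Lock` is enough to make check-and-increment atomic.

```python
import asyncio


class CallSlots:
    """Hypothetical stand-in for the slot reservation in server.py."""

    def __init__(self, max_calls: int = 2) -> None:
        self.max_calls = max_calls
        self.active = 0
        self._lock = asyncio.Lock()

    async def try_reserve(self) -> bool:
        # The check and the increment happen under one lock, so two webhooks
        # arriving together cannot both claim the last free slot.
        async with self._lock:
            if self.active >= self.max_calls:
                return False
            self.active += 1
            return True

    async def release(self) -> None:
        # Called on hang-up; never lets the counter go below zero.
        async with self._lock:
            self.active = max(0, self.active - 1)


async def simulate(n_callers: int, max_calls: int = 2) -> list[bool]:
    """Fire n_callers reservation attempts concurrently and report who got in."""
    slots = CallSlots(max_calls)
    return await asyncio.gather(*(slots.try_reserve() for _ in range(n_callers)))


if __name__ == "__main__":
    granted = asyncio.run(simulate(10))
    print(f"{sum(granted)} of {len(granted)} callers were granted a slot")
```

A caller that fails `try_reserve()` would be answered with the busy message instead of starting a pipeline, which is what keeps the GPU from being oversubscribed.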

**`SilenceWatchdog` in `bot.py`** — if the caller goes silent after the agent finishes, it re-prompts ("are you still there?") after `SILENCE_REPROMPT_SECS` (7s), and after `MAX_REPROMPTS` closes gracefully. Backstop against dead air; `silence_secs` must stay > `HANGUP_DELAY_SECS`.

**`HalfDuplexGate` in `bot.py`** — fixes echo-induced mid-call silence. In this pipecat build interruptions are VAD-driven and always on (`PipelineParams.allow_interruptions` does NOT exist — it's silently ignored). On a phone line the agent's own TTS echoes back, the VAD reads it as the caller speaking (it produces NO transcript), and the broadcast interruption cancels the agent mid-reply → the caller hears silence. This gate sits BEFORE the VAD and withholds inbound audio while the bot is speaking (+`ECHO_TAIL_SECS`, default 0.5s) so echo never reaches the VAD. Trade-off: half-duplex — the caller can't barge in mid-utterance (fine for short replies). `HALF_DUPLEX=false` restores barge-in. Keep it on for telephony.

**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends. Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model output, falls back to JSONL if Odoo is unreachable. Keep it.

**Odoo integration (`odoo_client.py`)** — already uses `ODOO_API_KEY` for XML-RPC auth, not password. Correct pattern. No changes.

**`SpokenKokoroTTSService` in `bot.py`** — number normalization for speech. Kokoro reads raw digit strings as cardinals with symbols spoken aloud ("983-4969" → "nine hundred eighty-three dash forty-nine sixty-nine"). This subclass normalizes the text in `run_tts` (which receives the full sentence) so US phone numbers and 4–5 digit runs (street numbers, zips) are spoken one digit at a time — country code dropped, no "dash"/parens; dates and times left natural ("Monday the fifth", "three thirty").
It also respells the all-caps agent name to `AGENT_NAME_SPOKEN` (Kokoro reads "AVA" as "A-V-A"; set to "Eva" so it says "EE-vuh"). Deterministic, so it's robust to whatever the model emits. Keep it. `tts_normalize()` holds the rules.

> Note: don't rely on the model to read raw digits — it mangles them (it emitted
> "197-three five seven three…" once). The caller-ID is injected into the prompt **already
> spelled out** so AVA just repeats clean words; `tts_normalize` is the backstop for any
> other numbers.

---

## Change 1 — Real-time STT stays on Whisper (`bot.py`)

**Decision (2026-06-25): keep Whisper. Deepgram Nova-2 was evaluated and rejected.**

Deepgram Nova-2 was trialed to cut STT latency (Whisper buffers ~1-3s before the LLM sees input). The swap was applied and then reverted — the project stays on local faster-whisper. No external STT dependency, no per-minute STT cost, and no audio leaving the box (HIPAA posture). Latency is instead managed via VAD tuning and the `medium` model on the RTX 5080.

**Current `bot.py` STT (in place — do not change):**

```python
from pipecat.services.whisper.stt import WhisperSTTService

WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "medium")  # tiny|base|small|medium
WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cuda")  # cuda for the 5080
WHISPER_COMPUTE = os.environ.get("WHISPER_COMPUTE", "float16")
WHISPER_HOTWORDS = os.environ.get("WHISPER_HOTWORDS", "...")  # domain vocab bias

# HintedWhisperSTTService wraps WhisperSTTService to inject faster-whisper `hotwords`
# (office cities + optometry terms) per call. Instantiated in run_agent():
stt = HintedWhisperSTTService(
    settings=WhisperSTTService.Settings(model=WHISPER_MODEL),
    device=WHISPER_DEVICE,
    compute_type=WHISPER_COMPUTE,
    hotwords=WHISPER_HOTWORDS,
)
```

**Note:** Whisper large-v3 also serves post-call transcription in Phase 3 (`recording/transcriber.py`).
If real-time latency proves unacceptable in the Phase 1 gate, revisit a streaming STT then — but do not reintroduce the dependency speculatively.

---

## Change 2 — Twilio webhook auth stays on the Auth Token (`server.py`)

**Decision (2026-06-25): keep `TWILIO_AUTH_TOKEN`. The API Key swap was evaluated and rejected.**

A Standard API Key (scoped, revocable) was trialed in place of the account Auth Token, but it **cannot do what this server needs**: Twilio signs inbound webhooks (`X-Twilio-Signature`) with the account **Auth Token** — an API Key Secret cannot validate that signature, so `TWILIO_VALIDATE=true` would reject every legitimate `POST /voice` (403). The `TwilioFrameSerializer` auto-hang-up also expects the account/Auth-Token credential pair. The swap was reverted.

**Credential model (in place):**

```
Twilio Account SID  (not secret on its own)
└── Auth Token      (TWILIO_AUTH_TOKEN — validates webhooks + REST/auto-hang-up)
```

Treat the Auth Token as a password: keep it only in `.env` (never committed), rotate on any suspected leak / team departure / quarterly. If finer-grained scoping is ever required, the correct design is a *hybrid* — Auth Token for `X-Twilio-Signature` validation, an API Key (SK SID + Secret) only for outbound REST — not a wholesale swap.

**Current `server.py` (in place — do not change):**

```python
TWILIO_ACCOUNT_SID = os.environ.get("TWILIO_ACCOUNT_SID")
TWILIO_AUTH_TOKEN = os.environ.get("TWILIO_AUTH_TOKEN")

# _twilio_signature_ok(): HMAC-SHA1 keyed by the Auth Token (what Twilio signs with)
digest = hmac.new(TWILIO_AUTH_TOKEN.encode(), payload.encode("utf-8"), hashlib.sha1).digest()

# Validation gate + warning
if TWILIO_VALIDATE and TWILIO_AUTH_TOKEN:
    ...
elif not TWILIO_AUTH_TOKEN:
    logger.warning("/voice signature validation DISABLED (no TWILIO_AUTH_TOKEN set)")

# Serializer auto-hang-up uses the account SID + Auth Token pair
serializer = TwilioFrameSerializer(
    stream_sid=stream_sid,
    call_sid=call_sid,
    account_sid=TWILIO_ACCOUNT_SID,
    auth_token=TWILIO_AUTH_TOKEN,
)
```

**Auth Token rotation procedure:**

1. Generate a new primary Auth Token in the Twilio console (use the secondary-token flow)
2. Update `TWILIO_AUTH_TOKEN` in `.env`
3. Restart the service — no rebuild needed
4. Verify one test call succeeds (signature validation + auto-hang-up both rely on it)
5. Retire the old token in the Twilio console

Rotate on: any suspected leak, any team member departure, quarterly as routine.

---

## Change 3 — `.env`

No swap. `.env` keeps `TWILIO_AUTH_TOKEN` and the Whisper STT vars; there is **no** `TWILIO_API_KEY_*` or `DEEPGRAM_*` (those were trialed and removed with Changes 1/2).

**Full `.env` reference:**

```env
# Twilio — Auth Token validates webhooks + drives auto-hang-up. Never committed.
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=
TWILIO_PHONE_NUMBER=+1...
TWILIO_VALIDATE=true

# STT: Whisper (faster-whisper, real-time in-call; large-v3 also used post-call in Phase 3)
WHISPER_MODEL=medium
WHISPER_DEVICE=cuda
WHISPER_COMPUTE=float16

# LLM: Ollama
OLLAMA_URL=http://127.0.0.1:11434/v1
OLLAMA_MODEL=activeblue-avc:latest
LLM_PROVIDER=ollama
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=160

# Anthropic (optional LLM swap + monitoring + synthetic data)
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=claude-sonnet-4-6

# TTS: Kokoro
KOKORO_VOICE=af_heart
KOKORO_MODEL_DIR=/home/tocmo0nlord/pipecat-run/models

# Odoo
ODOO_URL=https://avc.activeblue.net
ODOO_DB=avc
ODOO_USER=
ODOO_API_KEY=
ODOO_TARGET=crm
ODOO_STAGE_ID=
ODOO_TEAM_ID=
ODOO_USER_ID=

# Server
PUBLIC_HOST=avc-phone.activeblue.net
PORT=8200
BIND_HOST=127.0.0.1
MAX_CONCURRENT_CALLS=2
STREAM_TOKEN=

# Call behaviour
AGENT_NAME=AVA
AGENT_NAME_SPOKEN=Eva  # how the name is pronounced in TTS (logs/Odoo keep AGENT_NAME)
HANGUP_DELAY_SECS=4.0  # grace pause after the goodbye before dropping the carrier leg
ENABLE_TOOLS=
VAD_CONFIDENCE=0.5
VAD_MIN_VOLUME=0.3
VAD_START_SECS=0.2
VAD_STOP_SECS=0.5

# Monitoring (Phase 4)
MONITORING_ENABLED=true
MONITORING_SCHEDULE=0 2 * * *

# A/B model routing (Phase 5 only)
AB_SPLIT_PERCENT=0
AB_MODEL_B=
```

---

## Call Workflow

AVA runs a directed script (system prompt in `bot.py`) — warm but direct, one short turn at a time, leading the call rather than waiting on the caller. Fixed order:

1. **Reason first** — find out what they're calling about (visit reason, or just a question → answer it).
2. **Location** — ask city/area, confirm the matching office (don't offer others — see office rule).
3. **Caller info** — full name (ask last name if only a first is given), then **address the caller by name** from there on; insurance (log only); preferred day/time in their words.
4. **Confirm phone (no "yes" needed)** — near the end, STATE the caller-ID back and invite a correction *only* ("I have your number as ; if that's not the best number, just let me know."), then flow on.
   **No yes/no question, no waiting** — depending on catching a "yes" right after a long utterance kept failing (echo/gate timing; verified via call recording — the caller's reply was received but VAD never registered it). Caller speaks only to correct it. Still backed by the deterministic `EndCallProcessor` safety net (also a "let me know if wrong" statement).
5. **Wrap up** — recap the booking **as a REQUEST** by name ("I've noted your request to come in…"), make clear staff will call to confirm, then ask **"Is there anything else I can help you with?"**

**Never claims a booking:** AVA must never say an appointment is "booked / scheduled / set / confirmed" — everything is a request staff confirm on callback.

**Insurance:** never say "we accept/take" a plan (or invent one) — just note what the caller said; staff verify.

**Keep momentum (prevents mid-call silence):** until the booking is complete, every turn ends with the next question. A bare acknowledgment with no question (e.g. just "staff will verify your coverage") deadlocks the call — both sides wait, the pipeline goes idle (no GPU/LLM activity), and the caller hears silence. So acknowledgment + next question go in the same turn.

**Closing is gated:** the word "Goodbye" ends the call (triggers `EndCallProcessor` → hang-up), so it is never said in the same turn as confirming details and never before the anything-else question — only after the caller says they need nothing more.

> Reliability: the script is prompt-driven on the local 8B (order followed well, not perfectly;
> it can re-ask a last name). The phone-confirmation step is the exception — it's now
> **guaranteed** by the deterministic `EndCallProcessor` safety net.

## Call Data Capture

What AVA collects on a booking call and how it's logged. Driven by the system prompt (`bot.py`); persisted by the post-call extractor (`extract.py` → `practice.py` → Odoo lead). Replies are kept to one short sentence.
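
The phone-confirmation safety net that the Call Workflow relies on can be sketched as a small decision function. This is an illustration of the rule as described in this document, not the actual `EndCallProcessor` code: `CallState`, `ClosingAction`, `decide`, and the keyword tuples are hypothetical names, and scanning the agent's own replies for the booking keyword is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical keyword lists; the real ones live in bot.py.
CLOSING_KEYWORDS = ("goodbye",)
BOOKING_KEYWORDS = ("appointment", "request to come in")


class ClosingAction(Enum):
    CONTINUE = "continue"            # no closing keyword in this reply
    CONFIRM_PHONE = "confirm_phone"  # suppress hang-up, speak the scripted line first
    HANG_UP = "hang_up"              # safe to end the call


@dataclass
class CallState:
    booking_seen: bool = False  # a booking keyword has appeared on this call
    phone_spoken: bool = False  # the spelled-out number has been read to the caller


def decide(state: CallState, agent_reply: str, phone_marker: str) -> ClosingAction:
    """Update the call state from one agent reply and pick the closing action."""
    text = agent_reply.lower()
    if any(k in text for k in BOOKING_KEYWORDS):
        state.booking_seen = True
    if phone_marker.lower() in text:
        state.phone_spoken = True  # the agent's own readback satisfies the gate
    if not any(k in text for k in CLOSING_KEYWORDS):
        return ClosingAction.CONTINUE
    if state.booking_seen and not state.phone_spoken:
        # Assumption: once the scripted line is injected the number counts as
        # spoken, so the next closing goes through.
        state.phone_spoken = True
        return ClosingAction.CONFIRM_PHONE
    return ClosingAction.HANG_UP


if __name__ == "__main__":
    state = CallState()
    marker = "seven eight six"  # hypothetical start of the spelled-out caller-ID
    for reply in ("I've noted your request to come in on Monday.",
                  "Thank you for calling. Goodbye.",
                  "Goodbye."):
        print(reply, "->", decide(state, reply, marker).value)
```

Because the decision depends only on flags set from text already seen, it behaves the same no matter how loosely the model follows the script, which is the point of making this step deterministic.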

### The six captured fields

| Field | In-call behavior | Logged as |
|-------|------------------|-----------|
| Full name | Asks for last name if only a first is given | `patient_name` / lead `contact_name` |
| Phone | Confirmed **near the end** (not led with); STATES the caller-ID back (injected pre-spelled, digit-by-digit) and invites a correction only — **no "yes" required**; uses a different number only if the caller gives one | `callback_number` (+ `phone_confirmed`) |
| Office / city | Asks city/area; when the caller names a place that matches an office, **confirms that office and moves on** — never offers/compares other offices or asks them to choose; names the nearest only if nothing matches | folded into `reason` prefix |
| Reason | Captured from the conversation | `reason` |
| Insurance | **Log only, never suggest or guess** — asks open-endedly (no plan names read out), captures only what the caller says, never fills in/completes/guesses the plan (asks to repeat if unclear), never says "we accept/take" a plan, never promises/confirms/denies coverage or treatment even for a listed plan; staff verify on callback | `insurance` (note: "log only — staff to verify") |
| Preferred day & time | **Capture & defer** — taken in the caller's own words; AVA does not compute or correct the date | `preferred_time` + best-effort resolved `YYYY-MM-DD` |

### Dates — capture & defer (do NOT compute in-call)

AVA takes the day/time in the caller's **own words** ("next Monday", "the fifth") and tells them staff will confirm the exact date on the callback. It must NOT work out, state, or correct the calendar date, and must never argue about what today's date is.

**Why:** an earlier build injected a 45-day calendar and had AVA validate/correct dates in-call. A real call derailed badly — AVA argued about the date, parroted the canned example, and (with the model's weak adherence) also hallucinated availability.
Tested directly, the local 8B model computes appointment dates wrong ~5/5 times, so stating a computed date is a liability. The calendar injection and in-call validation were removed (commit on 2026-06-25). The post-call extractor still gets today's date and records a **best-effort** `resolved_date` for staff convenience — it is staff-verified on callback, not authoritative, and never spoken to the caller. A deterministic (non-LLM) date resolver is the right hardening if accuracy is needed; tracked as a later item.

Also: never claim appointment availability or that a slot is open — everything is a request staff confirm.

---

## Model Configuration

### Current production model: `activeblue-avc:latest`

| Property | Value | Notes |
|----------|-------|-------|
| Base | `llama3.1:8b-instruct-q4_K_M` | Llama 3.1 8B, Q4_K_M quantization |
| ID | `366a6cc15bb7` | Rebuilt clean 2026-06-23 |
| Size | 4.9GB | Down from 8.7GB Q8_0 |
| VRAM usage | ~4.5GB | Leaves 11.5GB headroom on RTX 5080 |
| Context | 8192 tokens | Raised from 4096 (2026-06-25) so long calls don't overflow mid-call — see note below |
| Temperature | 0.3 | Low — maximizes JSON schema compliance |
| Top-p | 0.9 | Standard |
| Adapter | None | 44-pair LoRA adapter discarded |

### Modelfile (rebuild reference)

```
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
PARAMETER top_p 0.9
TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|> {{ .Content }}<|eot_id|> {{- end }}<|start_header_id|>assistant<|end_header_id|> "
```

### Why num_ctx 8192 (was 4096) — fixes mid-call silence

Symptom: on longer calls AVA would go silent / stop replying partway through.
Cause: the system prompt + a growing multi-turn transcript exceeded the 4096-token window mid-call, so Ollama truncated and re-evaluated the whole context every turn (cache miss) → multi-second stalls = dead air. The capture changes made it worse by briefly injecting a 45-day calendar (~600 tok/turn) — that injection was removed; raising num_ctx to 8192 gives long calls real headroom (RTX 5080 has the VRAM). Rebuild keeps the previous model as `activeblue-avc:pre-ctx8k` for rollback. Keep the live system prompt lean for the same reason.

### Latency note — model is pinned warm

Per-turn latency is **LLM-side**, not STT: Whisper runs ~0.1s (VAD-stop → transcript), while transcript → first TTS is ~0.26s median. The tail (P95 ~3s) came from **cold model reloads** — Ollama unloads after its keep-alive window, so the first reply of a call after an idle gap paid a ~3s load. Fix: `server.py` has a `lifespan` handler that warms + pins the model with `keep_alive=-1` on startup (`ollama ps` shows UNTIL = Forever). Residual ~3s spikes on some later turns are 8B generation variance. Switching Whisper size would NOT help — it's not the bottleneck (STT model `medium` is for accuracy, not latency).

### VRAM budget — shared Whisper model (fixes OOM)

GPU is 16GB. Budget: pinned LLM ~6GB (num_ctx 8192) + **one shared** Whisper `medium` ~1.5GB + overhead ≈ 8GB, leaving headroom. Critical: Whisper is loaded **once per process and reused across calls** (`_WHISPER_MODEL_CACHE` in `bot.py`). Loading a new `WhisperModel` per call leaks VRAM — ctranslate2 doesn't release it when the call ends, so models accumulated and the GPU OOM'd after ~6–8 calls (`CUDA failed with error out of memory`, every call dropping right after answer). Symptom to watch: `nvidia-smi` shows the python process growing call-over-call. Don't reintroduce per-call model loads.

### Why Q4_K_M not Q8_0

Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused inference latency spikes.
Q4_K_M cuts weight VRAM to ~4.5GB with negligible quality difference at 8B scale.

### Why no adapter

44-pair LoRA adapter was adding noise not signal. Minimum viable dataset is 200+ pairs per intent category. Rebuilt correctly in Phase 5 with 500+ pairs in JSON output format.

### Ollama inventory (current)

```
activeblue-avc:latest         366a6cc15bb7   4.9GB   production
llama3.1:8b-instruct-q4_K_M   46e0c10c039e   4.9GB   base
nomic-embed-text:latest       0a109f422b47   274MB   embeddings
```

### Phase 5 training note

Axolotl pulls from HuggingFace in safetensors format, not Ollama GGUF:

```bash
# Phase 5 only — do not run now
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
# ~16GB on disk, separate from Ollama storage
```

---

## Build Phases

Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.

### Phase 1 — Reliable call loop

**Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not the caller.

- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`base` → `medium`)
- [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
- [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
- [x] `EndCallProcessor` AVC-side termination — confirmed in call logs (closing → hang-up); Twilio shows status `completed`
- [x] `AudioHeartbeat` diagnostic logging — active (`[audio-in]` ticks ~every 5s)
- [x] `MAX_CONCURRENT_CALLS` capacity gating — logic verified (`scripts/score_calls.py` aside): atomic reserve grants exactly 2, refuses the 3rd, frees on hangup, 10 simultaneous → 2 granted; `/voice` returns `BUSY_MESSAGE` + `` at cap. End-to-end 3-live-phone test optional.

**Gate — status:**

1. ⏳ 10 consecutive calls, zero silent non-responses — zero *genuine* silent non-responses seen so far; no clean 10-in-a-row run after the latest fixes. **RE-TEST.**
2. ✅ Zero zombie pipeline instances — single process, slots release to `0/2` each call (`ps`/`pgrep`; bare process, not Docker).
3. ✅ AVC-side termination confirmed — logs (closing → hang-up) + Twilio call status `completed`.
4. ✅ JSON parse-failure rate visible — extractor logs every save/failure; 0% parse failures observed.
5. ⏳ Latency P95 < 3s — measured P95 ~3.18s (median 0.26s); cold-reload spikes removed by pinning the model warm. **RE-MEASURE** on a fresh batch.

**Still needs live testing before Phase 1 is signed off:** a clean 10-call consecutive run with normal (non-stress) input. Score it with `python scripts/score_calls.py` (reads the log; pairs with the stereo WAVs in `recordings/`). Latency P95 (LLM→TTS) is measuring ~0.4s on recent clean calls; capacity gating logic is verified.

### Phase 1 — refinements since the revert

Beyond the three reverted changes, the following hardening is live (see git history):

- **STT model** — Whisper default raised `base` → `medium` for telephony accuracy; latency impact negligible (STT ≈ 0.1s; see latency note).
- **LLM warmup/pin** — `server.py` `lifespan` handler pins the model with `keep_alive=-1` on startup so the first call turn isn't a cold reload (`ollama ps` → UNTIL = Forever).
- **Context window** — `num_ctx` 4096 → 8192 (fixes mid-call silence; see note above).
- **Call workflow** — directed script: reason → location → caller info (address by name) → verify phone (read back) → wrap-up "anything else?" before the gated "Goodbye". See Call Workflow.
- **TTS** — `SpokenKokoroTTSService` reads phone/street/zip digit-by-digit; agent name respelled via `AGENT_NAME_SPOKEN=Eva`; caller-ID injected pre-spelled so it isn't mangled.
- **Dates** — capture-and-defer (no in-call computation); post-call best-effort `resolved_date`.
- **Insurance** — log only; never suggest or guess a plan (don't read plan names from the list, never invent one not stated); capture only what the caller says.
- **Reason capture** — post-call extractor broadened to capture the eye problem/symptom as the reason (not just visit types); reason now shown in the log line and the Odoo lead title.
- **Hang-up** — `HANGUP_DELAY_SECS=4` grace pause before dropping the carrier leg.
- **Office selection** — confirm the matching office; never offer/compare others.

### Phase 2 — Accuracy (RAG + validation)

- [ ] Populate `rag/data/*.jsonl` with real AVC data (human task — see RAG section)
- [ ] ChromaDB RAG retriever wired into pipeline
- [ ] Response validator: JSON schema + factual cross-check + PHI leak scan
- [ ] Keyword blocklist (uncertainty phrases → handoff)
- [ ] Intent classifier routing
- [ ] Turn counter: max 3 failed turns before forced handoff + termination

**Gate:** 20 manual test calls, zero hallucinations on AVC-specific facts

### Phase 3 — Booking

- [ ] Real-time calendar availability check (`odoo/calendar.py`)
- [ ] Whisper large-v3 post-call transcription (`recording/transcriber.py`)
- [ ] Recording + transcript attached to Odoo lead chatter
- [ ] Staff review flow confirmed in Odoo

**Gate:** Staff receives, reviews, and confirms a lead end-to-end

### Phase 4 — Monitoring

- [ ] Transcript index (`recordings/index.jsonl`)
- [ ] Claude monitoring job
- [ ] Dashboard: toggle, alert queue, one-click apply, playback, quality tagging

**Gate:** First monitoring run produces actionable suggestions

### Phase 5 — Fine-tuning

- [ ] Pull HuggingFace base (see model section)
- [ ] Synthetic data generation via Claude API in JSON output format
- [ ] Real call exporter using staff quality tags
- [ ] Axolotl QLoRA on RTX 5080
- [ ] Model registry + versioning + A/B routing

**Gate:** New model outperforms baseline over 50+ calls

---

## Repository Structure

```
avc-phone-ai/
├── CLAUDE.md                    ← this file
├── README.md
├── .env                         ← never committed
├── .env.example
├── .gitignore                   ← includes .env, recordings/, *.gguf
│
├── bot.py                       ← Pipecat pipeline (Phase 1 changes here)
├── server.py                    ← Twilio webhook server (Phase 1 changes here)
├── practice.py                  ← AVC facts + Odoo persistence
├── extract.py                   ← post-call appointment extraction
├── odoo_client.py               ← Odoo XML-RPC client
│
├── rag/                         ← Phase 2
│   ├── store.py
│   ├── loader.py
│   ├── retriever.py
│   └── data/
│       ├── avc_locations.jsonl
│       ├── avc_providers.jsonl
│       ├── avc_services.jsonl
│       ├── avc_hours.jsonl
│       ├── avc_insurance.jsonl
│       └── avc_faqs.jsonl
│
├── recording/                   ← Phase 3
│   ├── transcriber.py           ← Whisper large-v3 post-call only
│   └── storage.py
│
├── monitoring/                  ← Phase 4
│   ├── monitor.py
│   ├── analyzer.py
│   ├── diff_engine.py
│   ├── scheduler.py
│   └── dashboard/
│       ├── app.py
│       └── static/
│
├── training/                    ← Phase 5 stub
│   └── README.md
│
├── tests/
│   ├── test_bot.py
│   ├── test_server.py
│   ├── test_odoo_client.py
│   ├── test_extract.py
│   └── fixtures/
│       └── sample_transcripts.jsonl
│
├── scripts/
│   ├── deploy.sh
│   └── smoke_test.sh
│
├── avc-phone.service            ← existing systemd unit
└── traefik-avc-phone.yml        ← existing Traefik config
```

---

## Infrastructure

| Component | Host | Address | Notes |
|-----------|------|---------|-------|
| Pipecat pipeline | `miaai` | `10.10.1.221` | Python async, systemd |
| Ollama LLM | `miaai` | `http://127.0.0.1:11434/v1` | `activeblue-avc:latest` |
| ChromaDB (Phase 2) | `miaai` | `http://10.10.1.221:8001` | Docker volume |
| Twilio webhook | `miaai` | `https://avc-phone.activeblue.net` | Traefik + Let's Encrypt |
| Monitoring dashboard | `miaai` | `https://avc-monitor.activeblue.net` | internal only |
| Odoo CRM | — | `https://avc.activeblue.net` | XML-RPC, db: `avc` |
| Recordings | `miaai` | `/home/tocmo0nlord/avc-phone/recordings/` | local only |
| Gitea | — | `https://git.activeblue.net/tocmo0nlord/avc-phone-ai` | user: `tocmo0nlord` |

---

## RAG Store (Phase 2)

**Stack:** ChromaDB + `nomic-embed-text:latest` (already in Ollama)
**Collection:** `avc_knowledge`
**Retrieval:** Top-3 chunks per query on caller's current turn only

### JSONL record format

```json
{
  "id": "hours-kendall-weekday",
  "text": "The Kendall location is open Monday through Friday 8:00 AM to 5:00 PM.",
  "tags": ["hours", "kendall"],
  "last_updated": "2026-06-23"
}
```

### Data files — populated before Phase 2, not before Phase 1

| File | Content |
|------|---------|
| `avc_locations.jsonl` | Address, phone, fax, parking per location |
| `avc_providers.jsonl` | Name, title, specialty, locations, languages |
| `avc_services.jsonl` | Exam types, procedures |
| `avc_hours.jsonl` | Hours per location, holiday closures, after-hours |
| `avc_insurance.jsonl` | Accepted plans per location |
| `avc_faqs.jsonl` | Approved Q&A pairs |

**Note:** `practice.py` already contains real AVC location and insurance data scraped from `advancedvisioncareflorida.com`. Use it as the seed for the JSONL files rather than starting from scratch.

---

## Claude Monitoring (Phase 4)

### What it analyzes

- Facts stated by AVA contradicting RAG store
- System prompt violations
- Calls that should have been handoffs
- High failed turn counts — model or prompt signal
- RAG gaps (AVA said "I don't have that" — should it be added?)
- Phrasing that caused caller confusion

### Output schema

```json
{
  "call_sid": "CA...",
  "severity": "high",
  "issue_type": "factual_error",
  "description": "AVA stated Kendall closes at 6pm. RAG store says 5pm.",
  "suggested_action": "rag_update",
  "suggested_change": {
    "file": "rag/data/avc_hours.jsonl",
    "record_id": "hours-kendall-weekday",
    "field": "text",
    "old": "...open until 6pm...",
    "new": "...open until 5pm..."
  }
}
```

`suggested_action`: `rag_update` | `prompt_change` | `blocklist_add` | `flag_for_review`

### Dashboard

FastAPI + HTML/JS at `https://avc-monitor.activeblue.net` (internal only).

| Feature | Description |
|---------|-------------|
| Enable/disable toggle | Pauses scheduler without redeployment |
| Alert queue | Suggestions sorted by severity |
| One-click apply | Applies change, commits via Gitea API to `avc-phone-ai` |
| Call playback | Audio + transcript side-by-side |
| Quality tagging | Staff tags calls from dashboard |
| Manual trigger | `POST /monitor/run` |

---

## Fine-Tuning Pipeline (Phase 5 — stub)

> Not scaffolded until Phase 4 complete and monitoring has run minimum two weeks.
> See `training/README.md` — populated at Phase 5 start.

- Synthetic data: Claude API generates Q&A in JSON output format — schema not style
- Real calls: staff-tagged `"good"` + corrected bad calls
- Target: 500+ pairs per intent before first Axolotl run
- QLoRA via Axolotl on RTX 5080, base: HuggingFace `meta-llama/Llama-3.1-8B-Instruct`
- Versioned Ollama models: `activeblue-avc:vN`
- A/B routing: promote when new version wins on booking + hallucination rate over 50+ calls

---

## HIPAA and Compliance

- AVA identifies as automated at call start — no exceptions
- No PHI in ChromaDB — practice information only
- Recordings on `miaai` only — no cloud storage
- Odoo API user: minimum permissions, not admin
- All endpoints HTTPS via Traefik
- `.env` never committed

---

## Deploy Script (`scripts/deploy.sh`)

```bash
#!/bin/bash
set -e
cd /home/tocmo0nlord/avc-phone
git pull origin main
pip install -r requirements.txt --quiet
systemctl restart avc-phone
systemctl status avc-phone --no-pager
echo "[deploy] Done."
```

---

## Development Conventions

- Python 3.13 (matches `miaai` miniconda environment)
- Async throughout — Pipecat is async-native
- `loguru` for all logging — already in use, keep consistent
- Structured log lines for all diagnostic events
- `python-dotenv` for local dev, env injection in prod
- Secrets never hardcoded
- Every module has `if __name__ == "__main__":` for isolated testing

---

## Key Dependencies (current)

```
pipecat-ai==1.3.0        # installed at /opt/miniconda3
faster-whisper           # real-time STT (already installed in pipecat-run venv)
kokoro-tts               # already installed
ollama                   # already installed
scipy / numpy            # already installed (pipecat deps)
chromadb                 # add for Phase 2
sentence-transformers    # add for Phase 2
anthropic                # for monitoring + optional LLM swap
openai-whisper           # large-v3 for post-call transcription (Phase 3)
fastapi / uvicorn        # already installed
loguru                   # already installed
httpx                    # already installed
```

---

## Open Items

- [ ] Confirm `TWILIO_AUTH_TOKEN` in `.env` is current (rotate if leaked/stale)
- [ ] Confirm `ODOO_STAGE_ID`, `ODOO_TEAM_ID`, `ODOO_USER_ID` from live `avc` db
- [ ] Confirm AVA voice — `af_heart` is current default, confirm with AVC before go-live
- [ ] Populate `rag/data/*.jsonl` before Phase 2 (seed from `practice.py` data)
- [ ] Define Odoo confirmed appointment flow: lead → opportunity → calendar event
- [ ] Staff training on monitoring dashboard quality tagging

---

*Active Blue LLC | git.activeblue.net/tocmo0nlord/avc-phone-ai*