avc-phone-ai/CLAUDE.md

# AVC Phone Agent — Project Specification
> Claude Code authoritative reference. All architecture, security, and build decisions live here.
> Repo: `git.activeblue.net/tocmo0nlord/avc-phone-ai`
> Last updated: 2026-06-25 | Active Blue LLC

---

## Project Overview

**Name:** AVC Phone Agent
**Owner:** Active Blue LLC
**Client:** Advanced Vision Care (AVC) — multi-location ophthalmology/optometry practice (FL + TX)
**Agent name:** AVA (Advanced Vision Assistant)
**Purpose:** Automated AI phone agent that answers patient calls, books tentative appointments
into Odoo CRM with call recordings and transcripts attached, and self-improves via
Claude-powered transcript monitoring and a fine-tuning feedback loop.

---

## Existing Codebase — What to Keep, What to Change

The previous build at `/home/tocmo0nlord/avc-phone/` is a working foundation.
**Do not rewrite what works.** Apply only the changes documented in Changes 1-3 below,
all of which currently resolve to keeping the existing code.

### Files and their status

| File | Status | Action |
|------|--------|--------|
| `bot.py` | Keep as-is | Whisper STT retained (real-time). Deepgram evaluated and rejected — see Change 1 |
| `server.py` | Keep as-is | Twilio Auth Token retained. API Key swap evaluated and rejected — see Change 2 |
| `practice.py` | Keep as-is | No changes |
| `extract.py` | Keep as-is | No changes |
| `odoo_client.py` | Keep as-is | Already uses API key auth correctly |

### What is already solved — do not touch

**`EndCallProcessor` in `bot.py`** — AVC-side call termination is fully implemented.
Watches LLM text stream for closing keywords ("goodbye"), waits for TTS to finish via
`BotStoppedSpeakingFrame`, then pushes `EndTaskFrame` upstream. `TwilioFrameSerializer`
with `auto_hang_up` drops the carrier leg. This is correct. Zero changes.
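
The termination sequence above can be sketched without Pipecat as a small state machine. This is an illustration of the logic only, not the code in `bot.py`: the class name `ClosingDetector`, its method names, and the keyword tuple are hypothetical stand-ins for what `EndCallProcessor` does with real frames.

```python
from dataclasses import dataclass, field

# Hypothetical keyword list; the real one lives in bot.py.
CLOSING_KEYWORDS = ("goodbye",)


@dataclass
class ClosingDetector:
    """Tracks whether the bot has said a closing phrase and finished speaking."""

    keywords: tuple = CLOSING_KEYWORDS
    _buffer: str = field(default="", init=False)
    closing_seen: bool = field(default=False, init=False)

    def on_llm_text(self, chunk: str) -> None:
        # LLM text arrives in small chunks, so match against the accumulated
        # text rather than each chunk in isolation.
        self._buffer += chunk.lower()
        if any(keyword in self._buffer for keyword in self.keywords):
            self.closing_seen = True

    def on_bot_stopped_speaking(self) -> bool:
        # True means the call should end: a closing phrase was generated and
        # its TTS playback has completed. The buffer resets for the next turn.
        should_end = self.closing_seen
        self._buffer = ""
        return should_end
```

In the real pipeline the `True` result corresponds to pushing `EndTaskFrame` upstream; waiting for the stopped-speaking event is what keeps the goodbye from being cut off.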

**Mulaw 8kHz ↔ 16kHz conversion** — handled internally by `TwilioFrameSerializer`.
`PIPELINE_SAMPLE_RATE = 16000`, `WIRE_SAMPLE_RATE = 8000` are already set correctly.
No custom audio module needed.

**VAD tuned for telephony** — `confidence=0.5`, `min_volume=0.3` already loosened from
desktop defaults. These settings address the VAD-side cause of callers having to
repeat themselves.

**Capacity gating** — `MAX_CONCURRENT_CALLS=2` with atomic slot reservation in
`server.py` prevents GPU thrashing. Keep it.
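
For readers unfamiliar with the pattern, atomic slot reservation can be sketched as a bounded counter whose check and increment happen under one lock. This is a minimal sketch, not the code in `server.py`; the class name `CallSlots` and its methods are hypothetical.

```python
import threading


class CallSlots:
    """Bounded counter for in-flight calls with check-and-reserve in one step."""

    def __init__(self, max_calls: int) -> None:
        self._max = max_calls
        self._active = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        # The check and the increment happen under one lock, so two
        # simultaneous webhooks cannot both claim the last slot.
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        # Intended to be called from a finally block when a call's pipeline exits.
        with self._lock:
            self._active = max(0, self._active - 1)

    @property
    def active(self) -> int:
        with self._lock:
            return self._active
```

With `MAX_CONCURRENT_CALLS=2`, a third simultaneous call would get `False` from `try_reserve()` and could be answered with a busy message instead of starting a pipeline.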

**`AudioHeartbeat`** — diagnostic processor that distinguishes VAD failure from
transport stall. Keep it.
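
One way to picture the distinction `AudioHeartbeat` draws is shown below. This is a hypothetical sketch: the counters, the `HeartbeatWindow` structure, and the label strings are assumptions for illustration and may not match how the real processor works.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWindow:
    """Counters collected over one reporting interval (hypothetical structure)."""

    audio_frames: int = 0   # inbound audio frames seen from the transport
    loud_frames: int = 0    # frames whose level exceeded the VAD volume threshold
    speech_events: int = 0  # speech-start events emitted by the VAD


def diagnose(window: HeartbeatWindow) -> str:
    # No audio at all: the problem is upstream of the VAD (transport stall).
    if window.audio_frames == 0:
        return "transport_stall"
    # Audio is flowing and loud enough, yet the VAD never fired.
    if window.loud_frames > 0 and window.speech_events == 0:
        return "vad_miss"
    return "ok"
```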

**Post-call extraction (`extract.py`)** — single JSON-mode completion after call ends.
Correctly uses `format: json`, uses verified Twilio caller-ID instead of trusting model
output, falls back to JSONL if Odoo is unreachable. Keep it.
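
The two safeguards described above (caller-ID override and JSONL fallback) can be sketched as follows. This is not the code in `extract.py`: the function name `persist_lead`, the `push_to_odoo` callable, the `phone` field name, and the choice of `OSError` as the "unreachable" signal are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def persist_lead(
    extracted: dict,
    verified_caller_id: str,
    push_to_odoo: Callable[[dict], int],
    fallback_path: Path,
) -> str:
    """Store one extracted lead, preferring Odoo and falling back to a JSONL file."""
    # The carrier-verified number always wins over whatever the model emitted.
    lead = {**extracted, "phone": verified_caller_id}
    try:
        push_to_odoo(lead)
        return "odoo"
    except OSError as exc:  # covers ConnectionError and TimeoutError
        # Odoo unreachable: append one JSON object per line so nothing is lost.
        record = {**lead, "_error": str(exc)}
        with fallback_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        return "jsonl"
```

Appending one object per line keeps the fallback file replayable: each line can later be re-submitted to Odoo independently.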

**Odoo integration (`odoo_client.py`)** — already uses `ODOO_API_KEY` for XML-RPC auth,
not password. Correct pattern. No changes.
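
As a reference for the pattern, Odoo's external XML-RPC API accepts an API key wherever a password is expected. The sketch below is not the contents of `odoo_client.py`: the function name `create_lead`, the `proxy_factory` parameter (added so the flow can be exercised without a live server), and the `crm.lead` target model are assumptions.

```python
import xmlrpc.client


def create_lead(url, db, user, api_key, values, proxy_factory=xmlrpc.client.ServerProxy):
    """Authenticate with an API key and create one CRM lead; returns the new record id."""
    common = proxy_factory(f"{url}/xmlrpc/2/common")
    # The API key is passed in the password position of authenticate().
    uid = common.authenticate(db, user, api_key, {})
    if not uid:
        raise PermissionError("Odoo rejected the credentials")
    models = proxy_factory(f"{url}/xmlrpc/2/object")
    # Every execute_kw call repeats db, uid, and the API key.
    return models.execute_kw(db, uid, api_key, "crm.lead", "create", [values])
```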

---

## Change 1 — Real-time STT stays on Whisper (`bot.py`)

**Decision (2026-06-25): keep Whisper. Deepgram Nova-2 was evaluated and rejected.**

Deepgram Nova-2 was trialed to cut STT latency (Whisper buffers ~1-3s before the LLM
sees input). The swap was applied and then reverted — the project stays on local
faster-whisper. No external STT dependency, no per-minute STT cost, and no audio
leaving the box (HIPAA posture). Latency is instead managed via VAD tuning and the
`medium` model on the RTX 5080.

**Current `bot.py` STT (in place — do not change):**
```python
import os

from pipecat.services.whisper.stt import WhisperSTTService

WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "medium")   # tiny|base|small|medium
WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cuda")   # cuda for the 5080
WHISPER_COMPUTE = os.environ.get("WHISPER_COMPUTE", "float16")
WHISPER_HOTWORDS = os.environ.get("WHISPER_HOTWORDS", "...")  # domain vocab bias

# HintedWhisperSTTService wraps WhisperSTTService to inject faster-whisper `hotwords`
# (office cities + optometry terms) per call. Instantiated in run_agent():
stt = HintedWhisperSTTService(
    settings=WhisperSTTService.Settings(model=WHISPER_MODEL),
    device=WHISPER_DEVICE,
    compute_type=WHISPER_COMPUTE,
    hotwords=WHISPER_HOTWORDS,
)
```

**Note:** Whisper large-v3 also serves post-call transcription in Phase 3
(`recording/transcriber.py`). If real-time latency proves unacceptable in the Phase 1
gate, revisit a streaming STT then — but do not reintroduce the dependency speculatively.

---

## Change 2 — Twilio webhook auth stays on the Auth Token (`server.py`)

**Decision (2026-06-25): keep `TWILIO_AUTH_TOKEN`. The API Key swap was evaluated and rejected.**

A Standard API Key (scoped, revocable) was trialed in place of the account Auth Token,
but it **cannot do what this server needs**: Twilio signs inbound webhooks
(`X-Twilio-Signature`) with the account **Auth Token** — an API Key Secret cannot validate
that signature, so `TWILIO_VALIDATE=true` would reject every legitimate `POST /voice`
(403). The `TwilioFrameSerializer` auto-hang-up also expects the account/Auth-Token
credential pair. The swap was reverted.

**Credential model (in place):**
```
Twilio Account SID          (not secret on its own)
└── Auth Token              (TWILIO_AUTH_TOKEN — validates webhooks + REST/auto-hang-up)
```

Treat the Auth Token as a password: keep it only in `.env` (never committed), rotate on
any suspected leak / team departure / quarterly. If finer-grained scoping is ever
required, the correct design is a *hybrid* — Auth Token for `X-Twilio-Signature`
validation, an API Key (SK SID + Secret) only for outbound REST — not a wholesale swap.

**Current `server.py` (in place — do not change):**

```python
TWILIO_ACCOUNT_SID = os.environ.get("TWILIO_ACCOUNT_SID")
TWILIO_AUTH_TOKEN = os.environ.get("TWILIO_AUTH_TOKEN")

# _twilio_signature_ok(): HMAC-SHA1 keyed by the Auth Token (what Twilio signs with)
digest = hmac.new(TWILIO_AUTH_TOKEN.encode(), payload.encode("utf-8"), hashlib.sha1).digest()

# Validation gate + warning
if TWILIO_VALIDATE and TWILIO_AUTH_TOKEN:
    ...
elif not TWILIO_AUTH_TOKEN:
    logger.warning("/voice signature validation DISABLED (no TWILIO_AUTH_TOKEN set)")

# Serializer auto-hang-up uses the account SID + Auth Token pair
serializer = TwilioFrameSerializer(
    stream_sid=stream_sid,
    call_sid=call_sid,
    account_sid=TWILIO_ACCOUNT_SID,
    auth_token=TWILIO_AUTH_TOKEN,
)
```
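
To make the rejection reason concrete, the sketch below follows Twilio's published signing scheme for form-encoded POST webhooks and shows that any key other than the Auth Token yields a different signature. The body of `_twilio_signature_ok()` is not shown in this document, so the function names `twilio_signature` and `signature_ok` here are hypothetical and the real implementation may differ in detail.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the value Twilio would send in X-Twilio-Signature for a form POST."""
    # Twilio's documented scheme: the full request URL, then each POST parameter
    # appended as name+value, sorted by name, with no separators.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def signature_ok(auth_token: str, url: str, params: dict[str, str], header: str) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(twilio_signature(auth_token, url, params), header)
```

Because the HMAC key is the Auth Token, validating with an API Key Secret produces a mismatch on every request, which is the 403 behavior described above.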

**Auth Token rotation procedure:**
1. Generate a new primary Auth Token in the Twilio console (use the secondary-token flow)
2. Update `TWILIO_AUTH_TOKEN` in `.env`
3. Restart the service — no rebuild needed
4. Verify one test call succeeds (signature validation + auto-hang-up both rely on it)
5. Retire the old token in the Twilio console

Rotate on: any suspected leak, any team member departure, quarterly as routine.

---

## Change 3 — `.env`

No swap. `.env` keeps `TWILIO_AUTH_TOKEN` and the Whisper STT vars; there is **no**
`TWILIO_API_KEY_*` or `DEEPGRAM_*` (those were trialed and removed with Changes 1/2).

**Full `.env` reference:**
```env
# Twilio — Auth Token validates webhooks + drives auto-hang-up. Never committed.
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=
TWILIO_PHONE_NUMBER=+1...
TWILIO_VALIDATE=true

# STT: Whisper (faster-whisper, real-time in-call; large-v3 also used post-call in Phase 3)
WHISPER_MODEL=medium
WHISPER_DEVICE=cuda
WHISPER_COMPUTE=float16

# LLM: Ollama
OLLAMA_URL=http://127.0.0.1:11434/v1
OLLAMA_MODEL=activeblue-avc:latest
LLM_PROVIDER=ollama
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=160

# Anthropic (optional LLM swap + monitoring + synthetic data)
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=claude-sonnet-4-6

# TTS: Kokoro
KOKORO_VOICE=af_heart
KOKORO_MODEL_DIR=/home/tocmo0nlord/pipecat-run/models

# Odoo
ODOO_URL=https://avc.activeblue.net
ODOO_DB=avc
ODOO_USER=
ODOO_API_KEY=
ODOO_TARGET=crm
ODOO_STAGE_ID=
ODOO_TEAM_ID=
ODOO_USER_ID=

# Server
PUBLIC_HOST=avc-phone.activeblue.net
PORT=8200
BIND_HOST=127.0.0.1
MAX_CONCURRENT_CALLS=2
STREAM_TOKEN=

# Call behaviour
AGENT_NAME=AVA
ENABLE_TOOLS=
VAD_CONFIDENCE=0.5
VAD_MIN_VOLUME=0.3
VAD_START_SECS=0.2
VAD_STOP_SECS=0.5

# Monitoring (Phase 4)
MONITORING_ENABLED=true
MONITORING_SCHEDULE=0 2 * * *

# A/B model routing (Phase 5 only)
AB_SPLIT_PERCENT=0
AB_MODEL_B=
```
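
A startup check along these lines could catch a stale or incomplete `.env` early. It is a sketch only: the function name `check_env` is hypothetical, and which variables count as mandatory is an assumption that should be aligned with what `server.py` and `bot.py` actually read.

```python
import os
from typing import Mapping

# Assumed to be mandatory for Phase 1; adjust to match the real code.
REQUIRED_VARS = ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "OLLAMA_URL", "OLLAMA_MODEL")
# Removed with Changes 1/2; their presence suggests a stale .env.
RETIRED_PREFIXES = ("TWILIO_API_KEY_", "DEEPGRAM_")


def check_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems; an empty list means the env looks sane."""
    problems = [f"missing or empty: {name}" for name in REQUIRED_VARS if not env.get(name)]
    problems += [
        f"retired variable still set: {name}"
        for name in env
        if name.startswith(RETIRED_PREFIXES)
    ]
    return problems
```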

---

## Model Configuration

### Current production model: `activeblue-avc:latest`

| Property | Value | Notes |
|----------|-------|-------|
| Base | `llama3.1:8b-instruct-q4_K_M` | Llama 3.1 8B, Q4_K_M quantization |
| ID | `366a6cc15bb7` | Rebuilt clean 2026-06-23 |
| Size | 4.9GB | Down from 8.7GB Q8_0 |
| VRAM usage | ~4.5GB | Leaves 11.5GB headroom on RTX 5080 |
| Context | 4096 tokens | Sufficient for any phone call |
| Temperature | 0.3 | Low — maximizes JSON schema compliance |
| Top-p | 0.9 | Standard |
| Adapter | None | 44-pair LoRA adapter discarded |

### Modelfile (rebuild reference)

```
FROM llama3.1:8b-instruct-q4_K_M

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
PARAMETER top_p 0.9

TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>
{{- end }}<|start_header_id|>assistant<|end_header_id|>
"
```

### Why Q4_K_M not Q8_0

Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused
inference latency spikes. Q4_K_M cuts weight VRAM to ~4.5GB with negligible quality
difference at 8B scale.

### Why no adapter

The 44-pair LoRA adapter added noise rather than signal. The minimum viable dataset is
200+ pairs per intent category. The adapter will be rebuilt in Phase 5 with 500+ pairs
in JSON output format.

### Ollama inventory (current)

```
activeblue-avc:latest          366a6cc15bb7    4.9GB    production
llama3.1:8b-instruct-q4_K_M    46e0c10c039e    4.9GB    base
nomic-embed-text:latest        0a109f422b47    274MB    embeddings
```

### Phase 5 training note

Axolotl pulls from HuggingFace in safetensors format, not Ollama GGUF:
```bash
# Phase 5 only — do not run now
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
# ~16GB on disk, separate from Ollama storage
```

---

## Build Phases

Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.

### Phase 1 — Reliable call loop

**Goal:** Every utterance gets a response. Zero silent failures. AVC hangs up — not
the caller.

- [x] Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (`medium`)
- [x] Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
- [x] Change 3: `.env` — Auth Token + Whisper vars; `OLLAMA_MODEL=activeblue-avc:latest`
- [ ] Verify `EndCallProcessor` termination in Twilio call logs (AVC side, not caller)
- [ ] Verify `AudioHeartbeat` diagnostic logging active
- [ ] Verify `MAX_CONCURRENT_CALLS` capacity gating works

**Gate — all five must pass:**
1. 10 consecutive test calls — zero silent non-responses
2. Zero zombie pipeline instances after call ends (`ps`/`pgrep` — service runs as a bare
   systemd/host process, not Docker)
3. Call termination from AVC side confirmed in Twilio call logs
4. JSON parse failure rate visible in logs — measurable not invisible
5. Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
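
Gate item 5 can be checked from logged per-turn latencies with a nearest-rank percentile. This is a minimal sketch, assuming latencies are collected in seconds from STT end-of-utterance to first TTS audio; the function names are hypothetical and no logging format is implied.

```python
def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank: ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_gate_passes(latencies_s: list[float], limit_s: float = 3.0) -> bool:
    # The gate requires P95 strictly under the limit.
    return p95(latencies_s) < limit_s
```

With only a handful of samples the P95 collapses to the maximum, so the gate is most meaningful when computed over every turn of the 10 test calls rather than one value per call.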

### Phase 2 — Accuracy (RAG + validation)

- [ ] Populate `rag/data/*.jsonl` with real AVC data (human task — see RAG section)
- [ ] ChromaDB RAG retriever wired into pipeline
- [ ] Response validator: JSON schema + factual cross-check + PHI leak scan
- [ ] Keyword blocklist (uncertainty phrases → handoff)
- [ ] Intent classifier routing
- [ ] Turn counter: max 3 failed turns before forced handoff + termination

**Gate:** 20 manual test calls, zero hallucinations on AVC-specific facts
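
The blocklist and turn-counter items could interact roughly as sketched below. Several details are assumptions: the phrases are placeholder examples (the approved blocklist is not defined in this document), a turn is counted as failed on invalid JSON or an uncertainty phrase, and the count does not reset after a good turn.

```python
from dataclasses import dataclass

# Hypothetical examples; the approved blocklist is not defined in this document.
UNCERTAINTY_PHRASES = ("i'm not sure", "i think", "probably")


def is_uncertain(reply: str) -> bool:
    lowered = reply.lower()
    return any(phrase in lowered for phrase in UNCERTAINTY_PHRASES)


@dataclass
class TurnCounter:
    max_failed: int = 3
    failed: int = 0

    def record(self, reply: str, valid_json: bool) -> str:
        """Classify one turn and return 'continue' or 'handoff'."""
        if not valid_json or is_uncertain(reply):
            self.failed += 1
        # The third failed turn forces the handoff.
        return "handoff" if self.failed >= self.max_failed else "continue"
```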

### Phase 3 — Booking

- [ ] Real-time calendar availability check (`odoo/calendar.py`)
- [ ] Whisper large-v3 post-call transcription (`recording/transcriber.py`)
- [ ] Recording + transcript attached to Odoo lead chatter
- [ ] Staff review flow confirmed in Odoo

**Gate:** Staff receives, reviews, and confirms a lead end-to-end

### Phase 4 — Monitoring

- [ ] Transcript index (`recordings/index.jsonl`)
- [ ] Claude monitoring job
- [ ] Dashboard: toggle, alert queue, one-click apply, playback, quality tagging

**Gate:** First monitoring run produces actionable suggestions

### Phase 5 — Fine-tuning

- [ ] Pull HuggingFace base (see model section)
- [ ] Synthetic data generation via Claude API in JSON output format
- [ ] Real call exporter using staff quality tags
- [ ] Axolotl QLoRA on RTX 5080
- [ ] Model registry + versioning + A/B routing

**Gate:** New model outperforms baseline over 50+ calls

---

## Repository Structure

```
avc-phone-ai/
├── CLAUDE.md                          ← this file
├── README.md
├── .env                               ← never committed
├── .env.example
├── .gitignore                         ← includes .env, recordings/, *.gguf
│
├── bot.py                             ← Pipecat pipeline (kept as-is, see Change 1)
├── server.py                          ← Twilio webhook server (kept as-is, see Change 2)
├── practice.py                        ← AVC facts + Odoo persistence
├── extract.py                         ← post-call appointment extraction
├── odoo_client.py                     ← Odoo XML-RPC client
│
├── rag/                               ← Phase 2
│   ├── store.py
│   ├── loader.py
│   ├── retriever.py
│   └── data/
│       ├── avc_locations.jsonl
│       ├── avc_providers.jsonl
│       ├── avc_services.jsonl
│       ├── avc_hours.jsonl
│       ├── avc_insurance.jsonl
│       └── avc_faqs.jsonl
│
├── recording/                         ← Phase 3
│   ├── transcriber.py                 ← Whisper large-v3 post-call only
│   └── storage.py
│
├── monitoring/                        ← Phase 4
│   ├── monitor.py
│   ├── analyzer.py
│   ├── diff_engine.py
│   ├── scheduler.py
│   └── dashboard/
│       ├── app.py
│       └── static/
│
├── training/                          ← Phase 5 stub
│   └── README.md
│
├── tests/
│   ├── test_bot.py
│   ├── test_server.py
│   ├── test_odoo_client.py
│   ├── test_extract.py
│   └── fixtures/
│       └── sample_transcripts.jsonl
│
├── scripts/
│   ├── deploy.sh
│   └── smoke_test.sh
│
├── avc-phone.service                  ← existing systemd unit
└── traefik-avc-phone.yml              ← existing Traefik config
```

---

## Infrastructure

| Component | Host | Address | Notes |
|-----------|------|---------|-------|
| Pipecat pipeline | `miaai` | `10.10.1.221` | Python async, systemd |
| Ollama LLM | `miaai` | `http://127.0.0.1:11434/v1` | `activeblue-avc:latest` |
| ChromaDB (Phase 2) | `miaai` | `http://10.10.1.221:8001` | Docker volume |
| Twilio webhook | `miaai` | `https://avc-phone.activeblue.net` | Traefik + Let's Encrypt |
| Monitoring dashboard | `miaai` | `https://avc-monitor.activeblue.net` | internal only |
| Odoo CRM | — | `https://avc.activeblue.net` | XML-RPC, db: `avc` |
| Recordings | `miaai` | `/home/tocmo0nlord/avc-phone/recordings/` | local only |
| Gitea | — | `https://git.activeblue.net/tocmo0nlord/avc-phone-ai` | user: `tocmo0nlord` |

---

## RAG Store (Phase 2)

**Stack:** ChromaDB + `nomic-embed-text:latest` (already in Ollama)
**Collection:** `avc_knowledge`
**Retrieval:** Top-3 chunks per query on caller's current turn only

### JSONL record format

```json
{
  "id": "hours-kendall-weekday",
  "text": "The Kendall location is open Monday through Friday 8:00 AM to 5:00 PM.",
  "tags": ["hours", "kendall"],
  "last_updated": "2026-06-23"
}
```
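
Since the data files are populated by hand, a small validator can catch malformed records before they reach ChromaDB. This is a sketch under assumptions: the function names are hypothetical, all four fields are treated as required, and duplicate ids are rejected, none of which is stated by the spec.

```python
import json
from datetime import date

REQUIRED_FIELDS = {"id": str, "text": str, "tags": list, "last_updated": str}


def parse_record(line: str) -> dict:
    """Parse and validate one line of a rag/data/*.jsonl file."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"{name!r} missing or not {expected.__name__}")
    # Raises ValueError if last_updated is not a valid ISO date.
    date.fromisoformat(record["last_updated"])
    return record


def load_records(lines) -> dict[str, dict]:
    """Return records keyed by id, skipping blank lines and rejecting duplicate ids."""
    records: dict[str, dict] = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = parse_record(line)
        if record["id"] in records:
            raise ValueError(f"line {number}: duplicate id {record['id']!r}")
        records[record["id"]] = record
    return records
```

Stable, unique ids matter later: the Phase 4 monitoring output refers to records by `record_id`.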

### Data files — populated before Phase 2, not before Phase 1

| File | Content |
|------|---------|
| `avc_locations.jsonl` | Address, phone, fax, parking per location |
| `avc_providers.jsonl` | Name, title, specialty, locations, languages |
| `avc_services.jsonl` | Exam types, procedures |
| `avc_hours.jsonl` | Hours per location, holiday closures, after-hours |
| `avc_insurance.jsonl` | Accepted plans per location |
| `avc_faqs.jsonl` | Approved Q&A pairs |

**Note:** `practice.py` already contains real AVC location and insurance data scraped
from `advancedvisioncareflorida.com`. Use it as the seed for the JSONL files rather
than starting from scratch.

---

## Claude Monitoring (Phase 4)

### What it analyzes

- Facts stated by AVA contradicting RAG store
- System prompt violations
- Calls that should have been handoffs
- High failed turn counts — model or prompt signal
- RAG gaps (AVA said "I don't have that" — should it be added?)
- Phrasing that caused caller confusion

### Output schema

```json
{
  "call_sid": "CA...",
  "severity": "high",
  "issue_type": "factual_error",
  "description": "AVA stated Kendall closes at 6pm. RAG store says 5pm.",
  "suggested_action": "rag_update",
  "suggested_change": {
    "file": "rag/data/avc_hours.jsonl",
    "record_id": "hours-kendall-weekday",
    "field": "text",
    "old": "...open until 6pm...",
    "new": "...open until 5pm..."
  }
}
```

`suggested_action`: `rag_update` | `prompt_change` | `blocklist_add` | `flag_for_review`
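
Before a suggestion reaches the alert queue it could be checked against the schema and ordered by severity, roughly as below. This is a sketch: the function names are hypothetical, and the `medium` and `low` severity levels are assumed, since only `high` appears in the example.

```python
REQUIRED_KEYS = ("call_sid", "severity", "issue_type", "description", "suggested_action")
ALLOWED_ACTIONS = {"rag_update", "prompt_change", "blocklist_add", "flag_for_review"}
# Only "high" appears in the schema example; the other levels are assumed.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def validate_suggestion(item: dict) -> dict:
    """Raise ValueError if a monitoring suggestion does not match the schema."""
    missing = [key for key in REQUIRED_KEYS if key not in item]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if item["suggested_action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown suggested_action: {item['suggested_action']!r}")
    if item["severity"] not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {item['severity']!r}")
    return item


def sort_queue(items: list[dict]) -> list[dict]:
    """Order validated suggestions for the alert queue, most severe first."""
    validated = [validate_suggestion(item) for item in items]
    return sorted(validated, key=lambda item: SEVERITY_ORDER[item["severity"]])
```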

### Dashboard

FastAPI + HTML/JS at `https://avc-monitor.activeblue.net` (internal only).

| Feature | Description |
|---------|-------------|
| Enable/disable toggle | Pauses scheduler without redeployment |
| Alert queue | Suggestions sorted by severity |
| One-click apply | Applies change, commits via Gitea API to `avc-phone-ai` |
| Call playback | Audio + transcript side-by-side |
| Quality tagging | Staff tags calls from dashboard |
| Manual trigger | `POST /monitor/run` |

---

## Fine-Tuning Pipeline (Phase 5 — stub)

> Not scaffolded until Phase 4 is complete and monitoring has run for at least two weeks.
> See `training/README.md` — populated at Phase 5 start.

- Synthetic data: Claude API generates Q&A in JSON output format (the aim is to teach the output schema, not a speaking style)
- Real calls: staff-tagged `"good"` + corrected bad calls
- Target: 500+ pairs per intent before first Axolotl run
- QLoRA via Axolotl on RTX 5080, base: HuggingFace `meta-llama/Llama-3.1-8B-Instruct`
- Versioned Ollama models: `activeblue-avc:vN`
- A/B routing: promote when new version wins on booking + hallucination rate over 50+ calls
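
One possible shape for the A/B routing step is a deterministic split keyed on the call SID, driven by `AB_SPLIT_PERCENT` and `AB_MODEL_B`. This is a sketch only: the function name `pick_model` and the hashing approach are assumptions, not a decided design.

```python
import hashlib


def pick_model(call_sid: str, split_percent: int, model_a: str, model_b: str) -> str:
    """Route a call to model B for roughly split_percent of call SIDs."""
    # With no candidate model or a zero split, everything stays on model A.
    if not model_b or split_percent <= 0:
        return model_a
    # Hash the SID so the same call always lands in the same bucket (0-99),
    # independent of process restarts (unlike Python's built-in hash()).
    bucket = int(hashlib.sha256(call_sid.encode("utf-8")).hexdigest(), 16) % 100
    return model_b if bucket < split_percent else model_a
```

Keying on the call SID keeps every turn of a call on one model and makes the assignment reproducible when comparing booking and hallucination rates afterwards.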

---

## HIPAA and Compliance

- AVA identifies as automated at call start — no exceptions
- No PHI in ChromaDB — practice information only
- Recordings on `miaai` only — no cloud storage
- Odoo API user: minimum permissions, not admin
- All endpoints HTTPS via Traefik
- `.env` never committed

---

## Deploy Script (`scripts/deploy.sh`)

```bash
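#!/bin/bash
# Pull the latest main, refresh dependencies, and restart the systemd unit.
set -euo pipefail   # abort on the first failing command or unset variable
cd /home/tocmo0nlord/avc-phone
git pull origin main
pip install -r requirements.txt --quiet
systemctl restart avc-phone
# Print unit state so a failed start is visible in the deploy output.
systemctl status avc-phone --no-pager
echo "[deploy] Done."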
```

---

## Development Conventions

- Python 3.13 (matches `miaai` miniconda environment)
- Async throughout — Pipecat is async-native
- `loguru` for all logging — already in use, keep consistent
- Structured log lines for all diagnostic events
- `python-dotenv` for local dev, env injection in prod
- Secrets never hardcoded
- Every module has `if __name__ == "__main__":` for isolated testing

---

## Key Dependencies (current)

```
pipecat-ai==1.3.0           # installed at /opt/miniconda3
faster-whisper              # real-time STT (already installed in pipecat-run venv)
kokoro-tts                  # already installed
ollama                      # already installed
scipy / numpy               # already installed (pipecat deps)
chromadb                    # add for Phase 2
sentence-transformers       # add for Phase 2
anthropic                   # for monitoring + optional LLM swap
openai-whisper              # large-v3 for post-call transcription (Phase 3)
fastapi / uvicorn           # already installed
loguru                      # already installed
httpx                       # already installed
```

---

## Open Items

- [ ] Confirm `TWILIO_AUTH_TOKEN` in `.env` is current (rotate if leaked/stale)
- [ ] Confirm `ODOO_STAGE_ID`, `ODOO_TEAM_ID`, `ODOO_USER_ID` from live `avc` db
- [ ] Confirm AVA voice — `af_heart` is current default, confirm with AVC before go-live
- [ ] Populate `rag/data/*.jsonl` before Phase 2 (seed from `practice.py` data)
- [ ] Define Odoo confirmed appointment flow: lead → opportunity → calendar event
- [ ] Staff training on monitoring dashboard quality tagging

---

*Active Blue LLC | git.activeblue.net/tocmo0nlord/avc-phone-ai*