
tocmo0nlord b31f685d91 Raise model num_ctx to 8192 to fix mid-call silence

Long calls overflowed the 4096-token window mid-conversation, forcing Ollama to
truncate + re-evaluate the full context each turn = multi-second stalls / dead
air. Rebuilt activeblue-avc:latest with num_ctx 8192 (rollback tag
activeblue-avc:pre-ctx8k). Combined with removing the 45-day calendar injection,
this keeps long calls well under the window. Doc: context row, Modelfile
reference, and a root-cause note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-25 03:53:41 +00:00


AVC Phone Agent — Project Specification

Claude Code authoritative reference. All architecture, security, and build decisions live here.
Repo: git.activeblue.net/tocmo0nlord/avc-phone-ai
Last updated: 2026-06-25 | Active Blue LLC

Project Overview

Name: AVC Phone Agent
Owner: Active Blue LLC
Client: Advanced Vision Care (AVC), a multi-location ophthalmology/optometry practice (FL + TX)
Agent name: AVA (Advanced Vision Assistant)
Purpose: Automated AI phone agent that answers patient calls, books tentative appointments into Odoo CRM with call recordings and transcripts attached, and self-improves via Claude-powered transcript monitoring and a fine-tuning feedback loop.

Existing Codebase — What to Keep, What to Change

The previous build at /home/tocmo0nlord/avc-phone/ is a working foundation. Do not rewrite what works. Apply only the changes documented in this section.

Files and their status

File	Status	Action
`bot.py`	Keep as-is	Whisper STT retained (real-time). Deepgram evaluated and rejected — see Change 1
`server.py`	Keep as-is	Twilio Auth Token retained. API Key swap evaluated and rejected — see Change 2
`practice.py`	Keep as-is	No changes
`extract.py`	Keep as-is	No changes
`odoo_client.py`	Keep as-is	Already uses API key auth correctly

What is already solved — do not touch

EndCallProcessor in bot.py — AVC-side call termination is fully implemented. Watches LLM text stream for closing keywords ("goodbye"), waits for TTS to finish via BotStoppedSpeakingFrame, pauses HANGUP_DELAY_SECS (default 4s) so the caller isn't clipped, then pushes EndTaskFrame upstream. TwilioFrameSerializer with auto_hang_up drops the carrier leg. Verified working in the Phase 1 gate (4/4 clean hang-ups).

Mulaw 8kHz ↔ 16kHz conversion — handled internally by TwilioFrameSerializer. PIPELINE_SAMPLE_RATE = 16000, WIRE_SAMPLE_RATE = 8000 are already set correctly. No custom audio module needed.

VAD tuned for telephony — confidence=0.5, min_volume=0.3 already loosened from desktop defaults. These settings directly address the repeat-yourself problem on the VAD side.

Capacity gating — MAX_CONCURRENT_CALLS=2 with atomic slot reservation in server.py prevents GPU thrashing. Keep it.
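The slot reservation described above can be sketched as follows. This is a minimal illustration, not the code in server.py: `CallSlots`, `try_reserve`, and `release` are hypothetical names, and the sketch assumes a single asyncio event loop, where a check-and-increment with no `await` in between cannot be interleaved with another coroutine.

```python
class CallSlots:
    """Hypothetical slot counter; not the actual server.py implementation."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active = 0

    def try_reserve(self) -> bool:
        # No await between the check and the increment, so on a single
        # asyncio event loop no other coroutine can run in between.
        if self.active >= self.limit:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        # Intended to be called from a finally block so a crashed
        # pipeline still frees its slot.
        self.active = max(0, self.active - 1)


slots = CallSlots(limit=2)  # mirrors MAX_CONCURRENT_CALLS=2
```

A webhook handler would call `try_reserve()` before answering and return a busy response when it yields `False`.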

AudioHeartbeat — diagnostic processor that distinguishes VAD failure from transport stall. Keep it.

Post-call extraction (extract.py) — single JSON-mode completion after call ends. Correctly uses format: json, uses verified Twilio caller-ID instead of trusting model output, falls back to JSONL if Odoo is unreachable. Keep it.
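A rough sketch of the two safeguards described above, assuming the extractor yields a plain dict: the verified caller-ID overrides whatever the model produced, and an Odoo failure falls back to a local JSONL queue. `build_lead`, `persist_lead`, and the `push_to_odoo` callable are hypothetical names, not the functions in extract.py.

```python
import json
from pathlib import Path
from typing import Callable


def build_lead(model_output: dict, verified_caller_id: str) -> dict:
    """Copy the model output but never trust its phone number."""
    lead = dict(model_output)
    lead["callback_number"] = verified_caller_id  # Twilio-verified value wins
    return lead


def persist_lead(lead: dict, push_to_odoo: Callable[[dict], object], fallback_path: Path) -> str:
    """Try Odoo first; on any failure append the lead to a local JSONL file."""
    try:
        push_to_odoo(lead)
        return "odoo"
    except Exception:
        # One JSON object per line so queued leads can be replayed later.
        fallback_path.parent.mkdir(parents=True, exist_ok=True)
        with fallback_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(lead, ensure_ascii=False) + "\n")
        return "jsonl"
```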

Odoo integration (odoo_client.py) — already uses ODOO_API_KEY for XML-RPC auth, not password. Correct pattern. No changes.

Change 1 — Real-time STT stays on Whisper (`bot.py`)

Decision (2026-06-25): keep Whisper. Deepgram Nova-2 was evaluated and rejected.

Deepgram Nova-2 was trialed to cut STT latency (Whisper buffers ~1-3s before the LLM sees input). The swap was applied and then reverted — the project stays on local faster-whisper. No external STT dependency, no per-minute STT cost, and no audio leaving the box (HIPAA posture). Latency is instead managed via VAD tuning and the medium model on the RTX 5080.

Current bot.py STT (in place — do not change):

import os

from pipecat.services.whisper.stt import WhisperSTTService

WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "medium")   # tiny|base|small|medium
WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cuda")   # cuda for the 5080
WHISPER_COMPUTE = os.environ.get("WHISPER_COMPUTE", "float16")
WHISPER_HOTWORDS = os.environ.get("WHISPER_HOTWORDS", "...")  # domain vocab bias

# HintedWhisperSTTService wraps WhisperSTTService to inject faster-whisper `hotwords`
# (office cities + optometry terms) per call. Instantiated in run_agent():
stt = HintedWhisperSTTService(
    settings=WhisperSTTService.Settings(model=WHISPER_MODEL),
    device=WHISPER_DEVICE,
    compute_type=WHISPER_COMPUTE,
    hotwords=WHISPER_HOTWORDS,
)

Note: Whisper large-v3 also serves post-call transcription in Phase 3 (recording/transcriber.py). If real-time latency proves unacceptable in the Phase 1 gate, revisit a streaming STT then — but do not reintroduce the dependency speculatively.

Change 2 — Twilio webhook auth stays on the Auth Token (`server.py`)

Decision (2026-06-25): keep TWILIO_AUTH_TOKEN. The API Key swap was evaluated and rejected.

A Standard API Key (scoped, revocable) was trialed in place of the account Auth Token, but it cannot do what this server needs: Twilio signs inbound webhooks (X-Twilio-Signature) with the account Auth Token — an API Key Secret cannot validate that signature, so TWILIO_VALIDATE=true would reject every legitimate POST /voice (403). The TwilioFrameSerializer auto-hang-up also expects the account/Auth-Token credential pair. The swap was reverted.

Credential model (in place):

Twilio Account SID          (not secret on its own)
└── Auth Token              (TWILIO_AUTH_TOKEN — validates webhooks + REST/auto-hang-up)

Treat the Auth Token as a password: keep it only in .env (never committed), and rotate it after any suspected leak, after any team member departure, and quarterly as routine. If finer-grained scoping is ever required, the correct design is a hybrid, not a wholesale swap: the Auth Token for X-Twilio-Signature validation, and an API Key (SK SID + Secret) only for outbound REST.

Current server.py (in place — do not change):

import hashlib
import hmac
import os

TWILIO_ACCOUNT_SID = os.environ.get("TWILIO_ACCOUNT_SID")
TWILIO_AUTH_TOKEN = os.environ.get("TWILIO_AUTH_TOKEN")

# _twilio_signature_ok(): HMAC-SHA1 keyed by the Auth Token (what Twilio signs with)
digest = hmac.new(TWILIO_AUTH_TOKEN.encode(), payload.encode("utf-8"), hashlib.sha1).digest()

# Validation gate + warning
if TWILIO_VALIDATE and TWILIO_AUTH_TOKEN:
    ...
elif not TWILIO_AUTH_TOKEN:
    logger.warning("/voice signature validation DISABLED (no TWILIO_AUTH_TOKEN set)")

# Serializer auto-hang-up uses the account SID + Auth Token pair
serializer = TwilioFrameSerializer(
    stream_sid=stream_sid,
    call_sid=call_sid,
    account_sid=TWILIO_ACCOUNT_SID,
    auth_token=TWILIO_AUTH_TOKEN,
)
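For reference, the signature check that the excerpt above abbreviates can be sketched end to end. This follows Twilio's documented scheme for form-encoded POSTs (full URL, then each parameter sorted by name and appended as name plus value, HMAC-SHA1 keyed by the Auth Token, Base64-encoded). The function names are illustrative and are not the ones in server.py.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the expected X-Twilio-Signature value for a form POST."""
    # Full request URL, then name + value for every POST parameter,
    # sorted by parameter name, with no separators.
    payload = url + "".join(name + params[name] for name in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def signature_ok(auth_token: str, url: str, params: dict[str, str], header: str) -> bool:
    expected = twilio_signature(auth_token, url, params)
    # Constant-time comparison; bytes avoid a TypeError on non-ASCII headers.
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

The URL must be the public one Twilio called (scheme, host, path, and query string), which matters behind a reverse proxy such as Traefik.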

Auth Token rotation procedure:

Generate a new primary Auth Token in the Twilio console (use the secondary-token flow)
Update TWILIO_AUTH_TOKEN in .env
Restart the service — no rebuild needed
Verify one test call succeeds (signature validation + auto-hang-up both rely on it)
Retire the old token in the Twilio console

Rotate on: any suspected leak, any team member departure, quarterly as routine.

Change 3 — `.env`

No swap. .env keeps TWILIO_AUTH_TOKEN and the Whisper STT vars; there is no TWILIO_API_KEY_* or DEEPGRAM_* (those were trialed and removed with Changes 1/2).

Full .env reference:

# Twilio — Auth Token validates webhooks + drives auto-hang-up. Never committed.
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=
TWILIO_PHONE_NUMBER=+1...
TWILIO_VALIDATE=true

# STT: Whisper (faster-whisper, real-time in-call; large-v3 also used post-call in Phase 3)
WHISPER_MODEL=medium
WHISPER_DEVICE=cuda
WHISPER_COMPUTE=float16

# LLM: Ollama
OLLAMA_URL=http://127.0.0.1:11434/v1
OLLAMA_MODEL=activeblue-avc:latest
LLM_PROVIDER=ollama
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=160

# Anthropic (optional LLM swap + monitoring + synthetic data)
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=claude-sonnet-4-6

# TTS: Kokoro
KOKORO_VOICE=af_heart
KOKORO_MODEL_DIR=/home/tocmo0nlord/pipecat-run/models

# Odoo
ODOO_URL=https://avc.activeblue.net
ODOO_DB=avc
ODOO_USER=
ODOO_API_KEY=
ODOO_TARGET=crm
ODOO_STAGE_ID=
ODOO_TEAM_ID=
ODOO_USER_ID=

# Server
PUBLIC_HOST=avc-phone.activeblue.net
PORT=8200
BIND_HOST=127.0.0.1
MAX_CONCURRENT_CALLS=2
STREAM_TOKEN=

# Call behaviour
AGENT_NAME=AVA
HANGUP_DELAY_SECS=4.0      # grace pause after the goodbye before dropping the carrier leg
ENABLE_TOOLS=
VAD_CONFIDENCE=0.5
VAD_MIN_VOLUME=0.3
VAD_START_SECS=0.2
VAD_STOP_SECS=0.5

# Monitoring (Phase 4)
MONITORING_ENABLED=true
MONITORING_SCHEDULE=0 2 * * *

# A/B model routing (Phase 5 only)
AB_SPLIT_PERCENT=0
AB_MODEL_B=

Call Data Capture

What AVA collects on a booking call and how it's logged. Driven by the system prompt (bot.py); persisted by the post-call extractor (extract.py → practice.py → Odoo lead). Replies are kept to one short sentence.

The six captured fields

Field	In-call behavior	Logged as
Full name	Asks for last name if only a first is given	`patient_name` / lead `contact_name`
Phone	Reads back the caller-ID number; if the caller declines, uses the number they give	`callback_number` (+ `phone_confirmed`)
Office / city	Asks city/area; never names an office unprompted	folded into `reason` prefix
Reason	Captured from the conversation	`reason`
Insurance	Log only — asks the plan, never promises/confirms/denies coverage or treatment (even a listed plan); staff verify on callback	`insurance` (note: "log only — staff to verify")
Preferred day & time	Capture & defer — taken in the caller's own words; AVA does not compute or correct the date	`preferred_time` + best-effort resolved `YYYY-MM-DD`

Dates — capture & defer (do NOT compute in-call)

AVA takes the day/time in the caller's own words ("next Monday", "the fifth") and tells them staff will confirm the exact date on the callback. It must NOT work out, state, or correct the calendar date, and must never argue about what today's date is.

Why: an earlier build injected a 45-day calendar and had AVA validate/correct dates in-call. A real call derailed badly: AVA argued about the date, parroted the canned example, and (with the model's weak adherence) also hallucinated availability. In direct testing, the local 8B model computed appointment dates wrong in roughly 5 of 5 attempts, so stating a computed date is a liability. The calendar injection and in-call validation were removed (commit on 2026-06-25).

The post-call extractor still gets today's date and records a best-effort resolved_date for staff convenience — it is staff-verified on callback, not authoritative, and never spoken to the caller. A deterministic (non-LLM) date resolver is the right hardening if accuracy is needed; tracked as a later item. Also: never claim appointment availability or that a slot is open — everything is a request staff confirm.
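As a starting point for the deterministic resolver mentioned above, here is a minimal sketch that handles only weekday names. It is not in the codebase. `resolve_weekday` is a hypothetical name, and the policy (a named weekday always means its first occurrence strictly after today) is an assumption staff may want to change. Anything it cannot resolve returns `None`, so the field stays empty rather than guessed.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_weekday(phrase: str, today: date) -> Optional[date]:
    """Return the first matching weekday strictly after `today`, else None."""
    for word in phrase.lower().replace(",", " ").split():
        if word in WEEKDAYS:
            # days_ahead is in 1..7: naming today's weekday means next week.
            days_ahead = (WEEKDAYS.index(word) - today.weekday() - 1) % 7 + 1
            return today + timedelta(days=days_ahead)
    return None  # unresolved: leave resolved_date empty for staff


print(resolve_weekday("next Monday", date(2026, 6, 25)))  # 2026-06-29
```

Ordinal phrases such as "the fifth" are deliberately left unresolved here; they need a month context that only the caller or staff can supply.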

Model Configuration

Current production model: `activeblue-avc:latest`

Property	Value	Notes
Base	`llama3.1:8b-instruct-q4_K_M`	Llama 3.1 8B, Q4_K_M quantization
ID	`366a6cc15bb7`	Rebuilt clean 2026-06-23
Size	4.9GB	Down from 8.7GB Q8_0
VRAM usage	~4.5GB	Leaves 11.5GB headroom on RTX 5080
Context	8192 tokens	Raised from 4096 (2026-06-25) so long calls don't overflow mid-call — see note below
Temperature	0.3	Low — maximizes JSON schema compliance
Top-p	0.9	Standard
Adapter	None	44-pair LoRA adapter discarded

Modelfile (rebuild reference)

FROM llama3.1:8b-instruct-q4_K_M

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
PARAMETER top_p 0.9

TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>
{{- end }}<|start_header_id|>assistant<|end_header_id|>
"

Why num_ctx 8192 (was 4096) — fixes mid-call silence

Symptom: on longer calls AVA would go silent or stop replying partway through.
Cause: the system prompt plus a growing multi-turn transcript exceeded the 4096-token window mid-call, so Ollama truncated and re-evaluated the whole context every turn (cache miss), producing multi-second stalls heard as dead air. The capture changes briefly made it worse by injecting a 45-day calendar (~600 tokens per turn).
Fix: that injection was removed, and raising num_ctx to 8192 gives long calls real headroom (the RTX 5080 has the VRAM). The rebuild keeps the previous model as activeblue-avc:pre-ctx8k for rollback. Keep the live system prompt lean for the same reason.
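To make the overflow condition concrete, a rough budget check might look like the sketch below. The 4-characters-per-token ratio is a crude assumption, not the Llama tokenizer, and `estimate_tokens` and `fits_window` are hypothetical helpers that do not exist in bot.py. A check like this could log a warning well before a call approaches the window.

```python
NUM_CTX = 8192        # matches PARAMETER num_ctx in the Modelfile
REPLY_RESERVE = 160   # LLM_MAX_TOKENS from .env
CHARS_PER_TOKEN = 4   # ASSUMPTION: rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN + 1


def fits_window(system_prompt: str, turns: list[str], num_ctx: int = NUM_CTX) -> bool:
    """True if prompt + transcript + reserved reply space fit in num_ctx."""
    used = estimate_tokens(system_prompt) + sum(estimate_tokens(t) for t in turns)
    return used + REPLY_RESERVE <= num_ctx


# A 4000-character prompt plus 40 turns of 400 characters each:
prompt, turns = "s" * 4000, ["t" * 400] * 40
print(fits_window(prompt, turns, num_ctx=4096))  # False
print(fits_window(prompt, turns, num_ctx=8192))  # True
```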

Why Q4_K_M not Q8_0

Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused inference latency spikes. Q4_K_M cuts weight VRAM to ~4.5GB with negligible quality difference at 8B scale.

Why no adapter

The 44-pair LoRA adapter was adding noise, not signal. The minimum viable dataset is 200+ pairs per intent category. The adapter will be rebuilt properly in Phase 5 with 500+ pairs in JSON output format.

Ollama inventory (current)

activeblue-avc:latest          366a6cc15bb7    4.9GB    production
llama3.1:8b-instruct-q4_K_M    46e0c10c039e    4.9GB    base
nomic-embed-text:latest        0a109f422b47    274MB    embeddings

Phase 5 training note

Axolotl pulls from HuggingFace in safetensors format, not Ollama GGUF:

# Phase 5 only — do not run now
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
# ~16GB on disk, separate from Ollama storage

Build Phases

Claude Code must not scaffold Phase N+1 until Phase N gate is marked complete.

Phase 1 — Reliable call loop

Goal: Every utterance gets a response. Zero silent failures. AVC hangs up — not the caller.

Change 1: STT — Deepgram evaluated, reverted; staying on Whisper (medium)
Change 2: Twilio auth — API Key evaluated, reverted; staying on Auth Token
Change 3: .env — Auth Token + Whisper vars; OLLAMA_MODEL=activeblue-avc:latest
Verify EndCallProcessor termination in Twilio call logs (AVC side, not caller)
Verify AudioHeartbeat diagnostic logging active
Verify MAX_CONCURRENT_CALLS capacity gating works

Gate — all five must pass:

10 consecutive test calls — zero silent non-responses
Zero zombie pipeline instances after call ends (ps/pgrep — service runs as a bare systemd/host process, not Docker)
Call termination from AVC side confirmed in Twilio call logs
JSON parse failure rate visible in logs — measurable not invisible
Response latency P95 under 3 seconds from STT end-of-utterance to first TTS audio
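The P95 gate can be evaluated from logged per-turn latencies with a nearest-rank percentile, sketched below. It assumes the latencies (seconds from STT end-of-utterance to first TTS audio) have already been collected into a list. `p95` and `gate_passes` are hypothetical helper names.

```python
def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    # 1-based rank = ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate_passes(latencies_s: list[float], limit_s: float = 3.0) -> bool:
    return p95(latencies_s) < limit_s


print(gate_passes([1.2] * 19 + [5.0]))      # True: one slow turn in 20 is tolerated
print(gate_passes([1.2] * 18 + [5.0] * 2))  # False: two slow turns in 20 fail the gate
```

With only 10 test calls the sample is small, so logging every turn rather than one value per call gives the percentile more to work with.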

Phase 2 — Accuracy (RAG + validation)

Populate rag/data/*.jsonl with real AVC data (human task — see RAG section)
ChromaDB RAG retriever wired into pipeline
Response validator: JSON schema + factual cross-check + PHI leak scan
Keyword blocklist (uncertainty phrases → handoff)
Intent classifier routing
Turn counter: max 3 failed turns before forced handoff + termination
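One possible shape for the keyword blocklist and turn counter listed above, as a sketch only. The phrases are placeholder examples rather than an approved list, `is_blocked` and `TurnCounter` are hypothetical names, and counting failures cumulatively (no reset after a good turn) is an assumption the spec does not settle.

```python
UNCERTAINTY_PHRASES = ("i think", "i'm not sure", "probably")  # placeholder examples
MAX_FAILED_TURNS = 3


def is_blocked(reply: str) -> bool:
    """True if the reply contains an uncertainty phrase and should hand off."""
    lowered = reply.lower()
    return any(phrase in lowered for phrase in UNCERTAINTY_PHRASES)


class TurnCounter:
    def __init__(self, max_failed: int = MAX_FAILED_TURNS) -> None:
        self.max_failed = max_failed
        self.failed = 0

    def record(self, turn_ok: bool) -> bool:
        """Record one turn; return True once a forced handoff is due."""
        if not turn_ok:
            self.failed += 1  # ASSUMPTION: cumulative, never reset mid-call
        return self.failed >= self.max_failed
```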

Gate: 20 manual test calls, zero hallucinations on AVC-specific facts

Phase 3 — Booking

Real-time calendar availability check (odoo/calendar.py)
Whisper large-v3 post-call transcription (recording/transcriber.py)
Recording + transcript attached to Odoo lead chatter
Staff review flow confirmed in Odoo

Gate: Staff receives, reviews, and confirms a lead end-to-end

Phase 4 — Monitoring

Transcript index (recordings/index.jsonl)
Claude monitoring job
Dashboard: toggle, alert queue, one-click apply, playback, quality tagging

Gate: First monitoring run produces actionable suggestions

Phase 5 — Fine-tuning

Pull HuggingFace base (see model section)
Synthetic data generation via Claude API in JSON output format
Real call exporter using staff quality tags
Axolotl QLoRA on RTX 5080
Model registry + versioning + A/B routing

Gate: New model outperforms baseline over 50+ calls

Repository Structure

avc-phone-ai/
├── CLAUDE.md                          ← this file
├── README.md
├── .env                               ← never committed
├── .env.example
├── .gitignore                         ← includes .env, recordings/, *.gguf
│
├── bot.py                             ← Pipecat pipeline (Phase 1 changes here)
├── server.py                          ← Twilio webhook server (Phase 1 changes here)
├── practice.py                        ← AVC facts + Odoo persistence
├── extract.py                         ← post-call appointment extraction
├── odoo_client.py                     ← Odoo XML-RPC client
│
├── rag/                               ← Phase 2
│   ├── store.py
│   ├── loader.py
│   ├── retriever.py
│   └── data/
│       ├── avc_locations.jsonl
│       ├── avc_providers.jsonl
│       ├── avc_services.jsonl
│       ├── avc_hours.jsonl
│       ├── avc_insurance.jsonl
│       └── avc_faqs.jsonl
│
├── recording/                         ← Phase 3
│   ├── transcriber.py                 ← Whisper large-v3 post-call only
│   └── storage.py
│
├── monitoring/                        ← Phase 4
│   ├── monitor.py
│   ├── analyzer.py
│   ├── diff_engine.py
│   ├── scheduler.py
│   └── dashboard/
│       ├── app.py
│       └── static/
│
├── training/                          ← Phase 5 stub
│   └── README.md
│
├── tests/
│   ├── test_bot.py
│   ├── test_server.py
│   ├── test_odoo_client.py
│   ├── test_extract.py
│   └── fixtures/
│       └── sample_transcripts.jsonl
│
├── scripts/
│   ├── deploy.sh
│   └── smoke_test.sh
│
├── avc-phone.service                  ← existing systemd unit
└── traefik-avc-phone.yml              ← existing Traefik config

Infrastructure

Component	Host	Address	Notes
Pipecat pipeline	`miaai`	`10.10.1.221`	Python async, systemd
Ollama LLM	`miaai`	`http://127.0.0.1:11434/v1`	`activeblue-avc:latest`
ChromaDB (Phase 2)	`miaai`	`http://10.10.1.221:8001`	Docker volume
Twilio webhook	`miaai`	`https://avc-phone.activeblue.net`	Traefik + Let's Encrypt
Monitoring dashboard	`miaai`	`https://avc-monitor.activeblue.net`	internal only
Odoo CRM	—	`https://avc.activeblue.net`	XML-RPC, db: `avc`
Recordings	`miaai`	`/home/tocmo0nlord/avc-phone/recordings/`	local only
Gitea	—	`https://git.activeblue.net/tocmo0nlord/avc-phone-ai`	user: `tocmo0nlord`

RAG Store (Phase 2)

Stack: ChromaDB + nomic-embed-text:latest (already in Ollama)
Collection: avc_knowledge
Retrieval: Top-3 chunks per query on caller's current turn only

JSONL record format

{
  "id": "hours-kendall-weekday",
  "text": "The Kendall location is open Monday through Friday 8:00 AM to 5:00 PM.",
  "tags": ["hours", "kendall"],
  "last_updated": "2026-06-23"
}
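The record above is shown pretty-printed; in a .jsonl file each record occupies a single line. A loader could validate records along these lines. This is a sketch under that assumption: `parse_records` is a hypothetical name and rag/loader.py does not exist yet.

```python
import json
from typing import Iterable, Iterator

REQUIRED = {"id": str, "text": str, "tags": list, "last_updated": str}


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield validated records; raise ValueError naming the offending line."""
    seen = set()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        for key, expected_type in REQUIRED.items():
            if not isinstance(rec.get(key), expected_type):
                raise ValueError(f"line {n}: missing or mistyped {key!r}")
        if rec["id"] in seen:
            raise ValueError(f"line {n}: duplicate id {rec['id']!r}")
        seen.add(rec["id"])
        yield rec
```

Rejecting duplicate ids up front keeps `record_id` references from the monitoring output unambiguous.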

Data files — populated before Phase 2, not before Phase 1

File	Content
`avc_locations.jsonl`	Address, phone, fax, parking per location
`avc_providers.jsonl`	Name, title, specialty, locations, languages
`avc_services.jsonl`	Exam types, procedures
`avc_hours.jsonl`	Hours per location, holiday closures, after-hours
`avc_insurance.jsonl`	Accepted plans per location
`avc_faqs.jsonl`	Approved Q&A pairs

Note: practice.py already contains real AVC location and insurance data scraped from advancedvisioncareflorida.com. Use it as the seed for the JSONL files rather than starting from scratch.

Claude Monitoring (Phase 4)

What it analyzes

Facts stated by AVA contradicting RAG store
System prompt violations
Calls that should have been handoffs
High failed turn counts — model or prompt signal
RAG gaps (AVA said "I don't have that" — should it be added?)
Phrasing that caused caller confusion

Output schema

{
  "call_sid": "CA...",
  "severity": "high",
  "issue_type": "factual_error",
  "description": "AVA stated Kendall closes at 6pm. RAG store says 5pm.",
  "suggested_action": "rag_update",
  "suggested_change": {
    "file": "rag/data/avc_hours.jsonl",
    "record_id": "hours-kendall-weekday",
    "field": "text",
    "old": "...open until 6pm...",
    "new": "...open until 5pm..."
  }
}

suggested_action: rag_update | prompt_change | blocklist_add | flag_for_review
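Before a suggestion reaches the dashboard it would be reasonable to validate it against the schema above and order the queue by severity. The sketch below assumes severity values beyond "high" ("medium", "low"), which the spec does not enumerate, and uses hypothetical helper names.

```python
ACTIONS = {"rag_update", "prompt_change", "blocklist_add", "flag_for_review"}
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}  # ASSUMPTION: only "high" appears in the spec
REQUIRED_KEYS = ("call_sid", "severity", "issue_type", "description", "suggested_action")


def validate(suggestion: dict) -> dict:
    """Return the suggestion unchanged, or raise ValueError if malformed."""
    missing = [k for k in REQUIRED_KEYS if k not in suggestion]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if suggestion["suggested_action"] not in ACTIONS:
        raise ValueError(f"unknown suggested_action: {suggestion['suggested_action']!r}")
    if suggestion["severity"] not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {suggestion['severity']!r}")
    return suggestion


def alert_queue(suggestions: list[dict]) -> list[dict]:
    """Validated suggestions, most severe first (stable within a severity)."""
    return sorted((validate(s) for s in suggestions), key=lambda s: SEVERITY_ORDER[s["severity"]])
```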

Dashboard

FastAPI + HTML/JS at https://avc-monitor.activeblue.net (internal only).

Feature	Description
Enable/disable toggle	Pauses scheduler without redeployment
Alert queue	Suggestions sorted by severity
One-click apply	Applies change, commits via Gitea API to `avc-phone-ai`
Call playback	Audio + transcript side-by-side
Quality tagging	Staff tags calls from dashboard
Manual trigger	`POST /monitor/run`

Fine-Tuning Pipeline (Phase 5 — stub)

Not scaffolded until Phase 4 is complete and monitoring has run for at least two weeks. See training/README.md, which is populated at Phase 5 start.

Synthetic data: Claude API generates Q&A in JSON output format — schema not style
Real calls: staff-tagged "good" + corrected bad calls
Target: 500+ pairs per intent before first Axolotl run
QLoRA via Axolotl on RTX 5080, base: HuggingFace meta-llama/Llama-3.1-8B-Instruct
Versioned Ollama models: activeblue-avc:vN
A/B routing: promote when new version wins on booking + hallucination rate over 50+ calls
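A/B routing driven by AB_SPLIT_PERCENT and AB_MODEL_B could be made deterministic per call by hashing the call SID, as in this sketch. `pick_model` is a hypothetical helper and hashing the SID is one possible design, not something the spec prescribes. Its benefit is that a given call always maps to the same model, which keeps logs and retries consistent.

```python
import hashlib


def pick_model(call_sid: str, split_percent: int, model_a: str, model_b: str) -> str:
    """Route roughly split_percent% of calls to model_b, the rest to model_a."""
    if not model_b or split_percent <= 0:
        return model_a  # AB_SPLIT_PERCENT=0 or empty AB_MODEL_B disables routing
    # Stable bucket in 0..99 derived from the call SID.
    bucket = int(hashlib.sha256(call_sid.encode("utf-8")).hexdigest(), 16) % 100
    return model_b if bucket < split_percent else model_a
```

With the defaults in .env (`AB_SPLIT_PERCENT=0`, empty `AB_MODEL_B`) every call goes to the production model.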

HIPAA and Compliance

AVA identifies as automated at call start — no exceptions
No PHI in ChromaDB — practice information only
Recordings on miaai only — no cloud storage
Odoo API user: minimum permissions, not admin
All endpoints HTTPS via Traefik
.env never committed

Deploy Script (`scripts/deploy.sh`)

#!/bin/bash
set -e
cd /home/tocmo0nlord/avc-phone
git pull origin main
pip install -r requirements.txt --quiet
systemctl restart avc-phone
systemctl status avc-phone --no-pager
echo "[deploy] Done."

Development Conventions

Python 3.13 (matches miaai miniconda environment)
Async throughout — Pipecat is async-native
loguru for all logging — already in use, keep consistent
Structured log lines for all diagnostic events
python-dotenv for local dev, env injection in prod
Secrets never hardcoded
Every module has if __name__ == "__main__": for isolated testing

Key Dependencies (current)

pipecat-ai==1.3.0           # installed at /opt/miniconda3
faster-whisper              # real-time STT (already installed in pipecat-run venv)
kokoro-tts                  # already installed
ollama                      # already installed
scipy / numpy               # already installed (pipecat deps)
chromadb                    # add for Phase 2
sentence-transformers       # add for Phase 2
anthropic                   # for monitoring + optional LLM swap
openai-whisper              # large-v3 for post-call transcription (Phase 3)
fastapi / uvicorn           # already installed
loguru                      # already installed
httpx                       # already installed

Open Items

Confirm TWILIO_AUTH_TOKEN in .env is current (rotate if leaked/stale)
Confirm ODOO_STAGE_ID, ODOO_TEAM_ID, ODOO_USER_ID from live avc db
Confirm AVA voice — af_heart is current default, confirm with AVC before go-live
Populate rag/data/*.jsonl before Phase 2 (seed from practice.py data)
Define Odoo confirmed appointment flow: lead → opportunity → calendar event
Staff training on monitoring dashboard quality tagging

Active Blue LLC | git.activeblue.net/tocmo0nlord/avc-phone-ai
