diff --git a/CLAUDE.md b/CLAUDE.md
index e23bc76..25d905c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -272,7 +272,7 @@ everything is a request staff confirm.
 | ID | `366a6cc15bb7` | Rebuilt clean 2026-06-23 |
 | Size | 4.9GB | Down from 8.7GB Q8_0 |
 | VRAM usage | ~4.5GB | Leaves 11.5GB headroom on RTX 5080 |
-| Context | 4096 tokens | Sufficient for any phone call |
+| Context | 8192 tokens | Raised from 4096 (2026-06-25) so long calls don't overflow mid-call — see note below |
 | Temperature | 0.3 | Low — maximizes JSON schema compliance |
 | Top-p | 0.9 | Standard |
 | Adapter | None | 44-pair LoRA adapter discarded |
@@ -285,7 +285,7 @@ FROM llama3.1:8b-instruct-q4_K_M
 PARAMETER stop "<|start_header_id|>"
 PARAMETER stop "<|end_header_id|>"
 PARAMETER stop "<|eot_id|>"
-PARAMETER num_ctx 4096
+PARAMETER num_ctx 8192
 PARAMETER temperature 0.3
 PARAMETER top_p 0.9
 
@@ -295,6 +295,16 @@ TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
 "
 ```
 
+### Why num_ctx 8192 (was 4096) — fixes mid-call silence
+
+Symptom: on longer calls AVA would go silent / stop replying partway through. Cause: the
+system prompt + a growing multi-turn transcript exceeded the 4096-token window mid-call, so
+Ollama truncated and re-evaluated the whole context every turn (cache miss) → multi-second
+stalls = dead air. The capture changes made it worse by briefly injecting a 45-day calendar
+(~600 tok/turn) — that injection was removed; raising num_ctx to 8192 gives long calls real
+headroom (RTX 5080 has the VRAM). Rebuild keeps the previous model as `activeblue-avc:pre-ctx8k`
+for rollback. Keep the live system prompt lean for the same reason.
+
 ### Why Q4_K_M not Q8_0
 
 Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused