Raise model num_ctx to 8192 to fix mid-call silence

Long calls overflowed the 4096-token window mid-conversation, forcing Ollama to truncate and re-evaluate the full context on every turn. The result was multi-second stalls (dead air).

Rebuilt activeblue-avc:latest with num_ctx 8192 (rollback tag activeblue-avc:pre-ctx8k). Combined with removing the 45-day calendar injection, this keeps long calls well under the window.

Docs: updated the Context row and the Modelfile reference, and added a root-cause note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
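
To illustrate why the 4096-token window ran out mid-call, here is a rough back-of-the-envelope sketch. Only the window sizes (4096, 8192) and the ~600 tokens/turn calendar injection come from this commit. The system-prompt size (1200 tokens) and per-turn transcript growth (150 tokens) are hypothetical placeholders, and the model assumes injected text accumulates in the transcript each turn, which may not match the real pipeline.

```python
def turns_until_overflow(num_ctx, system_tokens, tokens_per_turn, injected_per_turn=0):
    """Estimate how many full turns fit before the prompt exceeds num_ctx.

    Simplified model (an assumption, not measured behavior): the prompt is the
    system prompt plus a transcript that grows by a fixed amount every turn.
    """
    growth = tokens_per_turn + injected_per_turn
    if growth <= 0:
        raise ValueError("per-turn growth must be positive")
    available = num_ctx - system_tokens
    if available <= 0:
        return 0
    return available // growth


# Hypothetical sizes, chosen only for illustration.
SYSTEM_TOKENS = 1200
TOKENS_PER_TURN = 150
CALENDAR_TOKENS = 600  # ~600 tok/turn, from the root-cause note

old_window = turns_until_overflow(4096, SYSTEM_TOKENS, TOKENS_PER_TURN)
old_with_calendar = turns_until_overflow(
    4096, SYSTEM_TOKENS, TOKENS_PER_TURN, CALENDAR_TOKENS
)
new_window = turns_until_overflow(8192, SYSTEM_TOKENS, TOKENS_PER_TURN)

print(old_window, old_with_calendar, new_window)
```

Under these placeholder numbers the calendar injection shrinks the turn budget far more than the window size alone, which is consistent with removing the injection and raising num_ctx together. Real token counts should be measured before relying on any such estimate.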
2026-06-25 03:53:41 +00:00
parent 08d9db4f09
commit b31f685d91
1 changed file with 12 additions and 2 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -272,7 +272,7 @@ everything is a request staff confirm.
 | ID | `366a6cc15bb7` | Rebuilt clean 2026-06-23 |
 | Size | 4.9GB | Down from 8.7GB Q8_0 |
 | VRAM usage | ~4.5GB | Leaves 11.5GB headroom on RTX 5080 |
-| Context | 4096 tokens | Sufficient for any phone call |
+| Context | 8192 tokens | Raised from 4096 (2026-06-25) so long calls don't overflow mid-call — see note below |
 | Temperature | 0.3 | Low — maximizes JSON schema compliance |
 | Top-p | 0.9 | Standard |
 | Adapter | None | 44-pair LoRA adapter discarded |
@@ -285,7 +285,7 @@ FROM llama3.1:8b-instruct-q4_K_M
 PARAMETER stop "<|start_header_id|>"
 PARAMETER stop "<|end_header_id|>"
 PARAMETER stop "<|eot_id|>"
-PARAMETER num_ctx 4096
+PARAMETER num_ctx 8192
 PARAMETER temperature 0.3
 PARAMETER top_p 0.9

@@ -295,6 +295,16 @@ TEMPLATE "{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
 "
 ```

+### Why num_ctx 8192 (was 4096) — fixes mid-call silence
+
+Symptom: on longer calls AVA would go silent or stop replying partway through. Cause: the
+system prompt plus a growing multi-turn transcript exceeded the 4096-token window mid-call, so
+Ollama truncated and re-evaluated the whole context every turn (cache miss), causing
+multi-second stalls (dead air). The capture changes made it worse by briefly injecting a 45-day
+calendar (~600 tokens/turn). That injection was removed, and raising num_ctx to 8192 gives long
+calls real headroom (the RTX 5080 has the VRAM). The rebuild keeps the previous model as
+`activeblue-avc:pre-ctx8k` for rollback. Keep the live system prompt lean for the same reason.
+
 ### Why Q4_K_M not Q8_0

 Q8_0 consumed ~8.5GB VRAM for weights alone. Under telephony load this caused