expenses_agent: fix OCR '$→8' misread inflating receipt totals

Add _fix_ocr_dollar_as_8(), which strips a spurious leading '8' when it
sits at a word boundary before a non-zero digit + 1–3 more digits + .dd
(covers $10–$9999). Applied at the top of _extract_amount_from_text so
both the labeled-total pass and the max-scan pass benefit.

  845.00  → 45.00   ($45 misread as 845)
  885.00  → 85.00   ($85 misread as 885)
  8150.00 → 150.00  ($150 misread as 8150)
  85.00   → 85.00   UNCHANGED (real $85 correctly read)
  8.50    → 8.50    UNCHANGED (real $8.50 correctly read)

12 new tests covering fix cases, non-fix cases, and end-to-end
extraction (110 tests total, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:08:39 -04:00
parent aea2fa02b8
commit beac16a6a9
2 changed files with 73 additions and 1 deletions
--- a/agent_service/agents/expenses_agent.py
+++ b/agent_service/agents/expenses_agent.py
@@ -29,6 +29,27 @@ _TOTAL_RE = re.compile(
    re.IGNORECASE,
 )

+# OCR artefact: the '$' glyph is often misclassified as '8', turning
+# 'Total: $45.00' into 'Total: 845.00'.  We strip the spurious leading '8'
+# when it sits at a word boundary and is followed by a non-zero digit then
+# 1-3 more digits + two decimal places.  This covers the $10–$9999 range.
+#
+#   845.00  → 45.00   (was $45, OCR gave 845)
+#   885.00  → 85.00   (was $85, OCR gave 885)
+#   8150.00 → 150.00  (was $150, OCR gave 8150)
+#   85.00   → 85.00   UNCHANGED — real $85 correctly read
+#   8.50    → 8.50    UNCHANGED — real $8.50 correctly read
+#   12845.00→ 12845.00 UNCHANGED — digit before the 8 blocks lookbehind
+# Edge case: a real $8xx amount correctly read (e.g. 840.00) may be reduced
+# to $40; this is rare compared to the misread and obvious on human review.
+_OCR_DOLLAR_MISREAD_RE = re.compile(r'(?<!\d)8([1-9]\d{1,3}\.\d{2})\b')
+
+
+def _fix_ocr_dollar_as_8(text: str) -> str:
+    """Strip a spurious leading '8' that is an OCR misread of '$'."""
+    return _OCR_DOLLAR_MISREAD_RE.sub(r'\1', text)
+
+
 # Lines that should never be treated as the total — change given back,
 # tip added after the fact, etc.  Card-brand lines like "VISA USD$ 36.78"
 # are intentionally NOT listed here: the amount on those lines IS the charge.
@@ -109,6 +130,9 @@ def _extract_amount_from_text(text: str) -> float:
    if not text:
        return 0.0

+    # Normalise '$→8' OCR misread before any pattern matching.
+    text = _fix_ocr_dollar_as_8(text)
+
    # Pass 1: explicit label match — return the LARGEST labeled amount.
    # Using max() rather than the last positional match handles the common
    # OCR artefact where "Total\n$2.80" (garbled "Total Taxes") appears