Fix vendor mis-identification (McDonald's bias), MIA Parking amount, grayscale OCR fallback

- Remove "NeDonald's → McDonald's" from LLM vendor correction examples; the
  example was biasing the model to return McDonald's for any ambiguous receipt
  (Home Depot, Sergio's/HMSHost). Replace with neutral brand examples and add
  an explicit instruction not to substitute a brand name absent from the OCR text.
- Add `net\s*fee` to _TOTAL_RE so MIA Parking kiosk receipts ("net fee: 150.00 USD")
  are captured by Pass 1 rather than by the max-scan, which could pick a larger
  amount from another line.
- Add Step 5b grayscale fallback in receipt_parser: if all binarized PSM attempts
  yield < 20 chars, retry OCR on the pre-binarization grayscale image. Fixes
  dot-matrix and certain thermal-print fonts destroyed by the 160-threshold.
- Tests: 88 passing (test_net_fee_parking, test_vendor_prompt_does_not_contain_mcdonalds,
  test_vendor_prompt_instructs_not_to_guess_absent_brand).
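
A minimal sketch of why the `net\s*fee` label matters, under stated assumptions: the real `_TOTAL_RE` and the max-scan code are not shown in this commit view, so `TOTAL_RE`, `LEGACY_RE`, `extract_total`, and the sample receipt text below are hypothetical stand-ins, not the project's implementation.

```python
import re

# Hypothetical label-based pattern (assumed shape; the real _TOTAL_RE is not shown).
_AMOUNT = r'\s*[:\-]?\s*\$?\s*(\d[\d,]*\.\d{2})'
LEGACY_RE = re.compile(r'\b(?:grand\s*total|total|amount\s*due)\b' + _AMOUNT, re.I)
TOTAL_RE = re.compile(r'\b(?:grand\s*total|total|amount\s*due|net\s*fee)\b' + _AMOUNT, re.I)


def extract_total(text, pattern=TOTAL_RE):
    """Pass 1: labelled total. Otherwise fall back to the largest amount found."""
    match = pattern.search(text)
    if match:
        return float(match.group(1).replace(',', ''))
    # Max-scan: with no recognised label, pick the largest decimal amount.
    amounts = [float(a.replace(',', '')) for a in re.findall(r'\d[\d,]*\.\d{2}', text)]
    return max(amounts) if amounts else None


# Invented sample text in the style of a parking kiosk receipt.
receipt = "MIA PARKING\nGross fee: 175.00 USD\nDiscount: 25.00 USD\nnet fee: 150.00 USD\n"

with_label = extract_total(receipt)                # labelled match on "net fee"
without_label = extract_total(receipt, LEGACY_RE)  # falls through to max-scan
```

In this sketch the pattern that knows "net fee" returns the labelled amount, while the pattern without it falls through to the max-scan and returns the larger gross figure.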

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Carlos Garcia
2026-05-21 00:56:45 -04:00
parent ece811cccb
commit db06fede5f
3 changed files with 109 additions and 7 deletions

@@ -130,6 +130,7 @@ def _ocr_image_tesseract(data: bytes, filename: str) -> str:
        # ── Step 3: Grayscale + contrast ─────────────────────────────────────
        img = ImageOps.grayscale(img)
        img = ImageOps.autocontrast(img)
        img_gray = img  # save grayscale for fallback (before binarization)
        # ── Step 4: Sharpen then binarize ─────────────────────────────────────
        # Sharpen first so edges are crisp before thresholding.
@@ -152,6 +153,23 @@ def _ocr_image_tesseract(data: bytes, filename: str) -> str:
            except Exception:
                pass
        # ── Step 5b: Grayscale fallback ───────────────────────────────────────
        # Binarization at threshold 160 can destroy dot-matrix and certain
        # thermal-print fonts (e.g. parking kiosk receipts) where character
        # pixels are close to the threshold and get wiped to white. If every
        # binarized attempt failed, retry on the plain grayscale image;
        # Tesseract handles grey-level input reasonably well for these cases.
        for psm in (6, 4, 11):
            try:
                text = pytesseract.image_to_string(
                    img_gray, config=f'--oem 3 --psm {psm}').strip()
                if len(text) >= 20:
                    logger.debug('Tesseract grayscale fallback %s: psm=%d %d chars',
                                 filename, psm, len(text))
                    return text
            except Exception:
                pass
        logger.warning('Tesseract OCR %s: all PSM modes returned < 20 chars', filename)
        return ''
    except ImportError: