Carlos Garcia
1536d83376
Improve OCR preprocessing and amount extraction robustness
Image preprocessing (receipt_parser.py):
- Add ImageOps.exif_transpose() to apply EXIF orientation before OCR;
most phone photos store portrait shots with rotation metadata, and
without this Tesseract reads a rotated image and produces garbage
- Upscale images < 600px wide for better character recognition
- Raise binarization threshold 140→160 for faint thermal-print receipts
- Try PSM 6 (single text block) first, with PSM 4 and PSM 11 as
fallbacks; PSM 6 is better suited to single-column receipt layouts
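A minimal sketch of the preprocessing steps listed above, not the code in
receipt_parser.py. The names preprocess, ocr_with_fallback and run_ocr are
hypothetical, as is the rule of falling back when a mode returns empty text;
only the 600px width, the 160 threshold and the PSM order come from this
commit message. It assumes Pillow is installed.

```python
from PIL import Image, ImageOps

MIN_WIDTH = 600          # from the commit message: upscale narrower images
THRESHOLD = 160          # from the commit message: raised from 140
PSM_ORDER = (6, 4, 11)   # from the commit message: PSM 6 first, then fallbacks


def preprocess(img: Image.Image) -> Image.Image:
    """Orient, upscale and binarize an image before OCR (illustrative only)."""
    # Apply the EXIF orientation tag so the pixel data is upright.
    img = ImageOps.exif_transpose(img)
    # Upscale narrow images to MIN_WIDTH, preserving the aspect ratio.
    if img.width < MIN_WIDTH:
        scale = MIN_WIDTH / img.width
        img = img.resize((MIN_WIDTH, round(img.height * scale)), Image.LANCZOS)
    # Grayscale, then binarize: pixels brighter than THRESHOLD become white,
    # everything else (including faint gray print) becomes black.
    gray = img.convert("L")
    return gray.point(lambda p: 255 if p > THRESHOLD else 0)


def ocr_with_fallback(img, run_ocr, psm_order=PSM_ORDER):
    """Try each page segmentation mode in order; return the first non-empty text.

    run_ocr is a hypothetical callable (image, config_string) -> str. With
    pytesseract it could be:
        lambda im, cfg: pytesseract.image_to_string(im, config=cfg)
    Treating empty output as the fallback trigger is an assumption.
    """
    for psm in psm_order:
        text = run_ocr(img, f"--psm {psm}")
        if text.strip():
            return text
    return ""
```

With the threshold at 160, a gray level of 150 is mapped to black and so
survives as text; at the old value of 140 it would have been mapped to white.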
Amount extraction (expenses_agent.py):
- Add Pass 2 bottom-of-receipt line scan, used when the labeled Total:
regex fails; reads lines bottom-to-top in the last 50% of the text,
skipping change/tip lines. Handles the 'T0TAL' OCR misread and the
amount-on-next-line layout
- Add _SKIP_LINE_RE and _ANY_DOLLAR_RE module-level patterns
- Add 8 new tests covering garbled total, change-skip, USD suffix, etc.
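The two-pass extraction described above could look roughly like the sketch
below. The pattern names _SKIP_LINE_RE and _ANY_DOLLAR_RE come from this
commit message, but their bodies here are guesses, and _TOTAL_RE and
extract_amount are hypothetical names; the real patterns in expenses_agent.py
may differ.

```python
import re

# Pass 1 (hypothetical pattern): a labeled total with the amount on the same line.
_TOTAL_RE = re.compile(r"\btotal\b[ \t]*:?[ \t]*\$?[ \t]*(\d+\.\d{2})", re.IGNORECASE)
# Names from the commit message; the regex bodies are assumptions.
_SKIP_LINE_RE = re.compile(r"\b(?:change|tip)\b", re.IGNORECASE)
_ANY_DOLLAR_RE = re.compile(r"\$?\s*(\d+(?:,\d{3})*\.\d{2})")


def extract_amount(text):
    """Return the receipt total as a float, or None if nothing is found."""
    # Pass 1: look for an explicit "Total: 12.34" style label.
    match = _TOTAL_RE.search(text)
    if match:
        return float(match.group(1))

    # Pass 2: scan the last 50% of non-empty lines from the bottom up,
    # skipping change/tip lines, and take the first amount seen.
    lines = [line for line in text.splitlines() if line.strip()]
    tail = lines[len(lines) // 2:]
    for line in reversed(tail):
        if _SKIP_LINE_RE.search(line):
            continue
        match = _ANY_DOLLAR_RE.search(line)
        if match:
            return float(match.group(1).replace(",", ""))
    return None


garbled = "Coffee 3.50\nBagel 2.25\nT0TAL 5.75\nTIP 1.00\nCHANGE 3.25"
next_line = "Latte 4.00\nMuffin 3.00\nTotal:\n7.00 USD"
print(extract_amount(garbled))    # 5.75
print(extract_amount(next_line))  # 7.0
```

In both samples Pass 1 misses (the label is misread as 'T0TAL' in the first,
and the amount sits on the line after the label in the second), so the
bottom-up scan supplies the result.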
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 23:33:38 -04:00