fix(expenses): LAYAL CAFE $2.80 bug, United Airlines rotation & date

LAYAL CAFE ($2.80 instead of $42.90):
- Add (?!\s*tax) lookahead to _TOTAL_RE so "Total Taxes $2.80" is never
  confused with the receipt total when OCR drops the "Taxes" word
- Change Pass 1 from matches[-1] to max() so the largest labeled amount
  always wins, regardless of line order in the OCR output
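
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the committed code: `TOTAL_RE` is a simplified, hypothetical stand-in for `_TOTAL_RE` (the real pattern is not shown in this commit view), and `pick_total` and the sample receipt text are invented for the example.

```python
import re
from decimal import Decimal

# Hypothetical, simplified stand-in for _TOTAL_RE.
# (?!\s*tax) rejects "Total Taxes ..."; \b keeps "Subtotal" from matching.
TOTAL_RE = re.compile(
    r"\btotal(?!\s*tax)[^\d$\n]*\$?[ \t]*(\d+(?:\.\d{2})?)",
    re.IGNORECASE,
)


def pick_total(ocr_text):
    """Return the largest labeled total found in the OCR text, or None."""
    amounts = [Decimal(a) for a in TOTAL_RE.findall(ocr_text)]
    # max() rather than amounts[-1]: OCR line order is not reliable.
    return max(amounts) if amounts else None


# Invented sample: the real total is not the last labeled line.
receipt = (
    "Subtotal $40.10\n"
    "Total $42.90\n"
    "Total Taxes $2.80\n"
    "Total Savings $1.00\n"
)
print(pick_total(receipt))  # 42.90
```

With `amounts[-1]` the sample would yield 1.00; with `max()` it yields 42.90, and the lookahead keeps the $2.80 tax line out of the candidates entirely.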

United Airlines (parsed as Subway, $0, wrong date):
- Add OSD-based rotation correction in receipt_parser.py: after the EXIF
  transpose, ask Tesseract's orientation and script detection (OSD,
  --psm 0) which angle to rotate by. This covers receipts photographed
  lying sideways, where EXIF metadata cannot help
- Add month-name date patterns (DD MON YYYY / MON DD YYYY) to
  _extract_date_from_text for airline/hotel receipts that print dates
  like "05 MAY 2026" instead of "05/07/26"
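
Both fixes above can be illustrated with pure-Python helpers. This is a hedged sketch, not the committed code: every function name here is hypothetical, and the date regexes are a guess at what "DD MON YYYY / MON DD YYYY" might look like. One assumption deserves a check against a known sideways image: the sketch treats the OSD `Rotate` value as a clockwise correction, whereas Pillow's `Image.rotate()` turns counter-clockwise, so the angle is mirrored before use.

```python
import re
from datetime import date

# Tesseract OSD output contains a line such as "Rotate: 90".
_ROTATE_RE = re.compile(r"Rotate:\s*(\d+)")


def osd_rotation(osd_text):
    """Return the 'Rotate' value from OSD output in degrees, or 0 if absent."""
    m = _ROTATE_RE.search(osd_text)
    return int(m.group(1)) % 360 if m else 0


def pillow_angle(rotate_cw):
    """Mirror a clockwise correction into a counter-clockwise angle.

    Assumption: 'Rotate' is clockwise, Image.rotate() is counter-clockwise.
    """
    return (360 - rotate_cw) % 360


_MONTHS = {name: num for num, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}
_MON = "|".join(_MONTHS)

# Hypothetical patterns: "05 MAY 2026" and "May 7, 2026".
_DD_MON_YYYY = re.compile(
    rf"\b(\d{{1,2}})\s+({_MON})[A-Z]*\.?,?\s+(\d{{4}})\b", re.IGNORECASE)
_MON_DD_YYYY = re.compile(
    rf"\b({_MON})[A-Z]*\.?\s+(\d{{1,2}}),?\s+(\d{{4}})\b", re.IGNORECASE)


def parse_month_name_date(text):
    """Return the first month-name date in text as a date, or None."""
    m = _DD_MON_YYYY.search(text)
    if m:
        day, mon, year = m.group(1), m.group(2), m.group(3)
    else:
        m = _MON_DD_YYYY.search(text)
        if not m:
            return None
        mon, day, year = m.group(1), m.group(2), m.group(3)
    try:
        return date(int(year), _MONTHS[mon.upper()], int(day))
    except ValueError:
        return None  # e.g. "31 FEB 2026"
```

Under the stated assumption, the rotation step would read `img.rotate(pillow_angle(osd_rotation(osd)), expand=True)`; for 0° and 180° the mirroring makes no difference, so only 90° and 270° receipts are affected.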

85 tests, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Carlos Garcia
2026-05-21 00:46:08 -04:00
parent ce57d19528
commit ece811cccb
3 changed files with 90 additions and 8 deletions

@@ -100,6 +100,23 @@ def _ocr_image_tesseract(data: bytes, filename: str) -> str:
     except Exception:
         pass # exif_transpose requires Pillow >= 6.0
 
+    # ── Step 1b: Content-based rotation correction ───────────────────────
+    # EXIF transpose (Step 1) only corrects for phone-tilt metadata.
+    # If the receipt was physically laid sideways in the frame (e.g. a
+    # landscape receipt photographed with the phone upright), the pixels
+    # are genuinely rotated and EXIF can't help. Ask Tesseract's OSD
+    # engine to detect the text orientation and rotate to correct it.
+    try:
+        osd = pytesseract.image_to_osd(img, config='--psm 0')
+        _am = re.search(r'Rotate:\s*(\d+)', osd)
+        if _am:
+            _angle = int(_am.group(1))
+            if _angle:
+                img = img.rotate(_angle, expand=True)
+                logger.debug('OSD: rotated %s by %d°', filename, _angle)
+    except Exception:
+        pass # OSD unavailable or not enough text — proceed without correction
+
     # ── Step 2: Resize to working width (1800px) ──────────────────────────
     max_w = 1800
     if img.width > max_w: