Improve OCR preprocessing and amount extraction robustness

Image preprocessing (receipt_parser.py):
- Add ImageOps.exif_transpose(): fixes portrait photos stored with EXIF
  rotation metadata (most phone photos); without this, Tesseract reads a
  rotated image and produces garbage
- Upscale images < 600px wide for better character recognition
- Raise binarization threshold 140→160 for faint thermal-print receipts
- Try PSM 6 (single text block) first, with PSM 4 and PSM 11 as
  fallbacks; PSM 6 is better suited to single-column receipt layout

Amount extraction (expenses_agent.py):
- Add Pass 2 bottom-of-receipt line scan, used when the labeled Total:
  regex fails; reads lines bottom-to-top in the last 50% of text,
  skipping change/tip lines; handles 'T0TAL' OCR misread and
  amount-on-next-line layout
- Add _SKIP_LINE_RE and _ANY_DOLLAR_RE module-level patterns
- 8 new tests covering garbled total, change-skip, USD suffix, etc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
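
The preprocessing choices described in the message can be illustrated with a small dependency-free sketch. The helper names (upscaled_size, binarize, ocr_with_psm_fallback), the strict "greater than" comparison, and the "first non-empty result wins" rule are assumptions made for illustration; the commit does not show the code in receipt_parser.py.

```python
from typing import Callable, Iterable, List, Tuple


def upscaled_size(width: int, height: int, min_width: int = 600) -> Tuple[int, int]:
    """Return the size an image would be resized to so it is at least min_width wide."""
    if width >= min_width:
        return (width, height)
    scale = min_width / width
    return (min_width, round(height * scale))


def binarize(pixels: Iterable[int], threshold: int = 160) -> List[int]:
    """Map grayscale values to black (0) or white (255).

    Assumption: values strictly above the threshold become white.
    Faint thermal print is light gray, so a higher threshold keeps it as ink.
    """
    return [255 if p > threshold else 0 for p in pixels]


def ocr_with_psm_fallback(
    image: object,
    run_ocr: Callable[[object, int], str],
    psm_order: Iterable[int] = (6, 4, 11),
) -> str:
    """Try each page segmentation mode in order; return the first non-empty text.

    Assumption: "non-empty after stripping whitespace" is the success test.
    """
    for psm in psm_order:
        text = run_ocr(image, psm)
        if text.strip():
            return text
    return ""


# A faint stroke at gray level 150 is lost at threshold 140 but kept at 160.
print(binarize([100, 150, 200], threshold=140))  # [0, 255, 255]
print(binarize([100, 150, 200], threshold=160))  # [0, 0, 255]
```

With pytesseract, run_ocr could plausibly be a wrapper such as lambda img, psm: pytesseract.image_to_string(img, config=f"--psm {psm}"), and the thresholding would typically be applied to a Pillow image with Image.point() rather than to a plain list.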
@@ -458,6 +458,27 @@ class TestExtractAmount:
     def test_comma_in_amount(self):
         assert _extract_amount_from_text('Grand Total: $1,234.56') == 1234.56
 
+    def test_bottom_scan_garbled_total(self):
+        # OCR garbled "TOTAL" — bottom-scan fallback should find the amount
+        text = 'Burger 5.99\nFries 2.50\nT0TAL 8.49'
+        assert _extract_amount_from_text(text) == 8.49
+
+    def test_bottom_scan_skips_change(self):
+        # Should return the total (8.49), not the change (1.51)
+        text = 'TOTAL 8.49\nCash 10.00\nChange 1.51'
+        assert _extract_amount_from_text(text) == 8.49
+
+    def test_bottom_scan_amount_on_own_line(self):
+        # Amount printed on a separate line below the label
+        text = 'Items 5.00\nTax 0.50\nTotal\n5.50'
+        assert _extract_amount_from_text(text) == 5.50
+
+    def test_amount_due_with_usd_suffix(self):
+        # PDF text may include "USD" after the number — regex should still work
+        # via the bottom scan since the labeled-total regex won't match "USD"
+        text = 'Total Charged: $198.40 USD'
+        assert _extract_amount_from_text(text) == 198.40
+
 
 class TestExtractDate:
     def test_iso_format(self):
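
For readers without access to expenses_agent.py, here is a minimal sketch of what a bottom-of-receipt scan like the one these tests exercise might look like. The commit names _SKIP_LINE_RE and _ANY_DOLLAR_RE but does not show their definitions, so both patterns below are guesses, and bottom_scan_amount is a hypothetical function name. It covers only the Pass 2 fallback, not the labeled-total regex that runs first.

```python
import re
from typing import Optional

# Assumed definitions: the real patterns are not shown in this commit.
_SKIP_LINE_RE = re.compile(r"\b(?:change|tip)\b", re.IGNORECASE)
_ANY_DOLLAR_RE = re.compile(r"(\d[\d,]*\.\d{2})(?!\d)")


def bottom_scan_amount(text: str) -> Optional[float]:
    """Scan the last half of the non-empty lines bottom-to-top for an amount."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    start = len(lines) // 2  # index where the last 50% of lines begins
    for line in reversed(lines[start:]):
        if _SKIP_LINE_RE.search(line):
            continue  # change/tip lines are not the receipt total
        matches = _ANY_DOLLAR_RE.findall(line)
        if matches:
            # Take the right-most amount on the line and drop thousands separators.
            return float(matches[-1].replace(",", ""))
    return None


print(bottom_scan_amount("Burger 5.99\nFries 2.50\nT0TAL 8.49"))  # 8.49
print(bottom_scan_amount("Items 5.00\nTax 0.50\nTotal\n5.50"))    # 5.5
```

Because the scan ignores labels entirely, a misread such as 'T0TAL' or a trailing 'USD' does not affect it. The trade-off is that it relies on the skip pattern to avoid returning a non-total amount printed near the bottom of the receipt.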