Fix correctness bugs in scanner and reset endpoint

- Defer Takeout sidecar enrichment until after indexing so its UPDATE
  statements actually match rows. Previously it ran first and silently
  no-op'd on the very first scan because no files existed in the DB yet.

- Preserve user review decisions across incremental and regroup rescans.
  The grouping phase wipes duplicate_groups/duplicate_members, which
  also wiped reviewed=1 / is_keeper flags. The scan now snapshots reviewed
  groups by (method, frozenset of member file_ids) before the wipe and
  re-applies the decisions to any regrouped group whose member set is
  unchanged.

- Replace 2-hex-char phash bucketing with multi-index pigeonhole
  (16 nibble buckets per hash). At threshold=10, the previous bucketing
  missed any near-duplicate pair that differed in the first byte, since
  the two hashes landed in different buckets and were never compared.
  Two 64-bit hashes within 10 bits of each other differ in at most 10 of
  their 16 nibbles, so they always share at least one nibble bucket.
  Also cache imagehash.hex_to_hash() per phash and deduplicate pair
  comparisons.

- Rewrite _suggested_keeper_by_resolution: the previous implementation had
  a dead inner score() function, and its lambda was missing the date
  tie-breaker (left as a TODO comment). It now picks the file with the
  most pixels, breaking ties by file size and then by oldest
  exif_datetime.

- Filter phash candidates to length(phash)=16 to skip malformed hashes
  rather than relying on the silent except in the comparison loop.

- Reject /api/scan/reset while a scan is running. Resetting mid-scan
  wiped tables the running scan thread was still writing to.

- Clear stale 'redundant' file status (not just 'keeper') when a file
  no longer appears in any group after regrouping.
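
The review-preservation step described above can be sketched as follows. This is a minimal illustration, not the code from this commit: groups are modelled as plain dicts with hypothetical "method", "members", "reviewed" and "keeper" fields, whereas the real scanner stores them in the duplicate_groups/duplicate_members tables.

```python
def snapshot_reviewed(groups):
    """Map (method, frozenset of member ids) -> keeper id for reviewed groups.

    The dict layout is an assumption made for illustration only.
    """
    return {
        (g["method"], frozenset(g["members"])): g["keeper"]
        for g in groups
        if g["reviewed"]
    }


def restore_reviewed(new_groups, snapshot):
    """Re-apply saved decisions to groups whose member set is unchanged."""
    for g in new_groups:
        key = (g["method"], frozenset(g["members"]))
        if key in snapshot:
            g["reviewed"] = True
            g["keeper"] = snapshot[key]
    return new_groups
```

Keying on a frozenset makes the match independent of member order, and a group that gained or lost a member simply fails the lookup and comes back unreviewed.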
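
The nibble-bucket candidate search described above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, and it compares hashes with integer XOR rather than imagehash objects so that it runs without third-party packages. The pigeonhole guarantee holds for any threshold below 16 bits.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a_hex, b_hex):
    """Bit-level Hamming distance between two equal-length hex strings."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


def near_duplicate_pairs(phashes, threshold=10):
    """Return the set of (id_a, id_b), id_a < id_b, within `threshold` bits.

    `phashes` maps a file id to a 16-hex-char hash (hypothetical layout).
    """
    # Index every hash under each of its 16 (position, nibble) keys.
    buckets = defaultdict(list)
    for file_id, h in phashes.items():
        if len(h) != 16:  # skip malformed hashes up front
            continue
        for pos, nibble in enumerate(h):
            buckets[(pos, nibble)].append(file_id)

    seen = set()     # pairs already compared, so each is checked once
    matches = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            if hamming(phashes[a], phashes[b]) <= threshold:
                matches.add((a, b))
    return matches
```

A pair that differs only in its first byte still shares the other 14 nibble buckets, so it is compared, which is the case the old 2-hex-char prefix bucketing missed.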
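
The keeper ordering described above (most pixels, then file size, then oldest exif_datetime) can be expressed as a single sort key. The sketch below is an assumption-laden illustration: pick_keeper and the dict fields are hypothetical, larger file size is assumed to win the tie, and files without an EXIF date are assumed to sort last.

```python
from datetime import datetime


def pick_keeper(files):
    """Return the preferred file from a duplicate group."""
    def sort_key(f):
        # Missing EXIF dates sort after every real date (an assumption).
        taken = f.get("exif_datetime") or datetime.max
        # Negate pixel count and size so that min() prefers larger values.
        return (-(f["width"] * f["height"]), -f["file_size"], taken)

    return min(files, key=sort_key)
```

Putting all three criteria in one tuple keeps the tie-breakers in a single place instead of in a separate scoring helper.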
Carlos
2026-04-24 00:42:13 -04:00
parent 356f922940
commit 3001be3a92
2 changed files with 121 additions and 33 deletions


@@ -223,6 +223,10 @@ def scan_resume():
 def scan_reset(confirm: str = Query("")):
     if confirm != "RESET":
         raise HTTPException(400, "Pass ?confirm=RESET to confirm")
+    if sc.scan_state["status"] == "running":
+        raise HTTPException(
+            400, "A scan is currently running — pause it before resetting"
+        )
     con = get_db()
     cur = con.cursor()
     cur.execute("DELETE FROM duplicate_members")