Carlos 3001be3a92 Fix correctness bugs in scanner and reset endpoint
- Defer Takeout sidecar enrichment until after indexing so its UPDATE
  statements actually match rows. Previously it ran first and silently
  no-op'd on the very first scan because no files existed in the DB yet.

- Preserve user review decisions across incremental and regroup rescans.
  The grouping phase wipes duplicate_groups/duplicate_members, which
  also wiped reviewed=1 / is_keeper flags. Now snapshots reviewed groups
  by (method, frozenset of member file_ids) before the wipe and re-applies
  them to any post-regrouping group whose member set is unchanged.

- Replace 2-hex-char phash bucketing with multi-index pigeonhole
  (16 nibble buckets per hash). At threshold=10, the previous bucketing
  missed any near-duplicate pair that differed in the first byte, since
  they landed in different buckets and were never compared. Caches
  imagehash.hex_to_hash() per phash and dedups pair comparisons.

- Rewrite _suggested_keeper_by_resolution: previous implementation had
  a dead inner score() function and the lambda was missing the date
  tie-breaker (left as a TODO comment). Now picks largest pixels, ties
  by file size, then by oldest exif_datetime.

- Filter phash candidates to length(phash)=16 to skip malformed hashes
  rather than relying on the silent except in the comparison loop.

- Reject /api/scan/reset while a scan is running. Resetting mid-scan
  wiped tables the running scan thread was still writing to.

- Also clear stale 'redundant' file status (not just 'keeper') when
  a file no longer appears in any group after regrouping.
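
The keeper rule from the commit message above (largest pixel count, ties broken by file size, then by oldest exif_datetime) can be sketched as a single sort key. This is an illustration only, not the repository's _suggested_keeper_by_resolution: the Candidate class, its field names, and the choice to rank files without an EXIF date last are all assumptions, and the datetimes are assumed to be naive.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical record; the real schema may use different column names.
    file_id: int
    width: int
    height: int
    file_size: int
    exif_datetime: Optional[datetime]


def suggested_keeper(candidates):
    """Pick largest pixel count, then largest file size, then oldest EXIF date."""
    def sort_key(c):
        # Assumption: a missing EXIF date loses the final tie-break.
        taken = c.exif_datetime or datetime.max
        # Negate the "bigger is better" fields so min() prefers them.
        return (-(c.width * c.height), -c.file_size, taken)

    return min(candidates, key=sort_key)


members = [
    Candidate(1, 4000, 3000, 5_000_000, datetime(2020, 5, 1)),
    Candidate(2, 4000, 3000, 5_000_000, datetime(2019, 5, 1)),
    Candidate(3, 1920, 1080, 9_000_000, None),
]
keeper = suggested_keeper(members)  # file 2: same pixels and size as file 1, but older
```

Putting all three criteria in one tuple keeps the tie-breakers visible in one place, which is the failure the commit describes (a tie-breaker left as a TODO).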
2026-04-24 00:42:13 -04:00
2026-04-04 23:55:42 -04:00

Duplicate Finder

A self-hosted Docker web app that scans a photo/video library, detects duplicates using four methods, and lets you review them in a gallery UI. No files are ever moved, renamed, or deleted — all decisions are recorded in SQLite only.

Quick start

# 1. Edit docker-compose.yml — set your photos volume path
# 2. Build and run
docker compose up -d --build
# 3. Open http://localhost:8765
# 4. Enter folder path in UI and click Scan

Volume mounts

Container path    Purpose
/photos           Your photo library (mounted read-only)
/data             SQLite database persistence

Edit docker-compose.yml to point these at your NAS paths.

Detection methods

Method                    Color    Description
SHA-256                   Blue     Byte-identical files
Perceptual hash           Purple   Visually similar photos (Hamming distance ≤ 10)
EXIF timestamp + device   Amber    Same camera, same moment
File size + dimensions    Gray     Same file size and resolution (low confidence)
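
The perceptual-hash method compares 16-hex-character hashes at a Hamming threshold of 10. The commit message at the top of this page describes bucketing each hash by its 16 nibbles so that near-duplicates are always compared. A minimal sketch of that idea follows. It is not the project's code: the function names are hypothetical, and it uses plain integer XOR for the Hamming distance where the project uses imagehash.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Bit-level Hamming distance between two hex strings of equal length."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicate_pairs(phashes, threshold=10):
    """phashes maps file_id -> 16-hex-char hash. Returns (id_a, id_b, distance)."""
    if not 0 <= threshold < 16:
        # With 16 nibbles, at most `threshold` nibbles can differ, so at least
        # one nibble matches only while threshold < 16.
        raise ValueError("pigeonhole guarantee needs 0 <= threshold < 16")

    # Skip hashes of the wrong length (non-hex characters would still raise).
    valid = {fid: h.lower() for fid, h in phashes.items() if len(h) == 16}

    # One bucket per (position, nibble value); each hash lands in 16 buckets.
    buckets = defaultdict(list)
    for fid, h in valid.items():
        for pos, nibble in enumerate(h):
            buckets[(pos, nibble)].append(fid)

    seen = set()      # dedup: a pair can share several buckets
    matches = []
    for ids in buckets.values():
        for a, b in combinations(ids, 2):
            pair = (a, b) if a < b else (b, a)
            if pair in seen:
                continue
            seen.add(pair)
            dist = hamming(valid[pair[0]], valid[pair[1]])
            if dist <= threshold:
                matches.append((pair[0], pair[1], dist))
    return sorted(matches)


hashes = {
    1: "ffffffffffffffff",
    2: "00ffffffffffffff",  # differs from file 1 only in the first byte
    3: "0000000000000000",
}
pairs = near_duplicate_pairs(hashes)
```

Files 1 and 2 differ in all 8 bits of the first byte, so a bucket keyed on the first two hex characters would never compare them. Here they share 14 nibble buckets and are reported at distance 8.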

Scan modes

Mode              Description
Incremental       Only re-hashes changed/new files. Prior decisions preserved.
New files only    Indexes newly added files. Existing decisions untouched.
Rebuild groups    Re-runs detection on existing index. No re-hashing.
Full reset        Wipes everything and scans from scratch.
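
The commit message at the top of this page explains how decisions survive a rescan: reviewed groups are snapshotted by (method, frozenset of member file_ids) before the group tables are wiped, then re-applied to any new group with the same member set. A rough sketch of that snapshot/re-apply step, using plain dicts in place of the duplicate_groups and duplicate_members tables (the dict keys and function names are assumptions):

```python
def snapshot_reviewed(groups):
    """Remember the keeper of every reviewed group, keyed by method + member set."""
    return {
        (g["method"], frozenset(g["member_ids"])): g["keeper_id"]
        for g in groups
        if g["reviewed"]
    }


def reapply_reviewed(new_groups, snapshot):
    """Restore the decision on any rebuilt group whose member set is unchanged."""
    for g in new_groups:
        key = (g["method"], frozenset(g["member_ids"]))
        if key in snapshot:
            g["reviewed"] = True
            g["keeper_id"] = snapshot[key]
    return new_groups


old_groups = [
    {"method": "sha256", "member_ids": [1, 2], "reviewed": True, "keeper_id": 1},
    {"method": "phash", "member_ids": [3, 4], "reviewed": True, "keeper_id": 3},
]
saved = snapshot_reviewed(old_groups)

# After regrouping: the first group is unchanged (member order differs),
# the second gained a member and so needs a fresh review.
rebuilt = [
    {"method": "sha256", "member_ids": [2, 1], "reviewed": False, "keeper_id": None},
    {"method": "phash", "member_ids": [3, 4, 5], "reviewed": False, "keeper_id": None},
]
reapply_reviewed(rebuilt, saved)
```

A frozenset key ignores member order and is hashable, so an unchanged group matches even if its rows come back in a different order, while a group that gained or lost a member deliberately does not.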

Google Takeout

The scanner automatically detects Google Takeout folder structures and reads .json sidecar files to restore correct capture timestamps and original filenames. Takeout files are flagged in the UI.
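
As a rough illustration of sidecar parsing, the sketch below reads the two fields a Takeout .json sidecar typically carries for this purpose: title (the original filename) and photoTakenTime.timestamp (epoch seconds, stored as a string). Which fields this scanner actually reads is an assumption, and read_takeout_sidecar is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone


def read_takeout_sidecar(text):
    """Extract original filename and capture time from sidecar JSON text."""
    meta = json.loads(text)
    ts = meta.get("photoTakenTime", {}).get("timestamp")
    # The timestamp is epoch seconds in UTC; tolerate sidecars that lack it.
    taken = datetime.fromtimestamp(int(ts), tz=timezone.utc) if ts is not None else None
    return {"original_filename": meta.get("title"), "taken_at": taken}


sample = '{"title": "IMG_0001.jpg", "photoTakenTime": {"timestamp": "1577836800"}}'
info = read_takeout_sidecar(sample)
# info["taken_at"] is 2020-01-01 00:00:00 UTC
```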

What "redundant" means

Marking a file redundant only writes to the database. Nothing is moved, renamed, or deleted. This tool produces a decision record only. A separate tool handles file actions.
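
In other words, a "redundant" decision amounts to a single row update. The sketch below shows the idea against an in-memory SQLite database. The files table, its columns, and mark_redundant are assumptions for illustration; only the 'redundant' status value comes from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real database has more tables and columns.
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT, status TEXT)")
conn.execute("INSERT INTO files (path, status) VALUES ('/photos/a.jpg', NULL)")
conn.commit()


def mark_redundant(conn, file_id):
    """Record the decision in the database; the file on disk is never touched."""
    cur = conn.execute(
        "UPDATE files SET status = 'redundant' WHERE id = ?", (file_id,)
    )
    conn.commit()
    # rowcount is 0 when no row matched, so callers can detect a silent no-op.
    return cur.rowcount


updated = mark_redundant(conn, 1)
missing = mark_redundant(conn, 99)
status = conn.execute("SELECT status FROM files WHERE id = 1").fetchone()[0]
```

Checking the returned row count is a cheap guard against updates that match nothing, since SQLite does not treat a zero-row UPDATE as an error.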

Tech stack

  • Python 3.12, FastAPI, Uvicorn
  • SQLite (stdlib sqlite3)
  • Pillow, imagehash, pillow-heif
  • Vanilla JS single-page frontend
  • Docker / docker-compose