Foundation for the move/quarantine pipeline. Lets users register one or
more remote SFTP destinations through the API, store credentials at rest
under /data/sftp/{id}.{password|key} (mode 0600), and verify connectivity
and write access via a test endpoint.
Endpoints:
GET /api/sftp/destinations
POST /api/sftp/destinations — create
PUT /api/sftp/destinations/{id} — update
DELETE /api/sftp/destinations/{id}
POST /api/sftp/destinations/{id}/test — connect, stat base_path, mkdir probe
POST /api/sftp/keypair — generate ED25519 keypair
Host keys are pinned per destination on first connect (trust on first
use, TOFU); later connections that present a different key are rejected.
Adds paramiko to requirements.
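A minimal sketch of the pin-then-reject rule, kept free of paramiko so it runs on the standard library alone. The names `check_host_key`, `HostKeyMismatch`, and the in-memory `pins` dict are hypothetical; the actual code presumably persists the pin with the destination record.

```python
import base64
import hashlib


class HostKeyMismatch(Exception):
    """Raised when a destination presents a key other than the pinned one."""


def fingerprint(raw_key: bytes) -> str:
    """SHA-256 fingerprint in the OpenSSH style (base64, padding stripped)."""
    digest = hashlib.sha256(raw_key).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def check_host_key(pins: dict, dest_id: str, raw_key: bytes) -> bool:
    """Pin on first connect; reject any later mismatch.

    Returns True when the key was newly pinned and False when it matched
    an existing pin. `pins` maps destination id -> fingerprint (assumed shape).
    """
    seen = fingerprint(raw_key)
    pinned = pins.get(dest_id)
    if pinned is None:
        pins[dest_id] = seen  # first use: trust and remember
        return True
    if pinned != seen:
        raise HostKeyMismatch(f"{dest_id}: expected {pinned}, got {seen}")
    return False
```

With paramiko, the raw key bytes could plausibly come from `transport.get_remote_server_key().asbytes()` after the transport has connected.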
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both _folder_priority and _path_penalty were scanning the entire path
string including the basename. A file named 'mytrashed_pic.jpg' in
/photos/MobileBackup/ would falsely match the 'trash' token.
Now only directory segments are checked, so the filename itself never
influences keeper selection.
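The fix can be illustrated with a small sketch that drops the basename before matching. The helper names are hypothetical, and substring matching against each segment is an assumption drawn from the 'trash' example above.

```python
from pathlib import PurePosixPath


def directory_segments(path: str) -> list[str]:
    """Lower-cased directory names of a path, excluding the basename."""
    parts = PurePosixPath(path).parent.parts
    return [p.lower() for p in parts if p != "/"]


def any_dir_contains(path: str, token: str) -> bool:
    """True if some directory segment contains the token (assumed rule)."""
    token = token.lower()
    return any(token in seg for seg in directory_segments(path))
```

For '/photos/MobileBackup/mytrashed_pic.jpg' the segments are 'photos' and 'mobilebackup', so the 'trash' token no longer matches.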
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a path-penalty score that downranks files in folders named Trashed,
Dups, Backup, Copy, Old, Archive, plus a penalty for repeated path segments
(e.g. Desktop/Desktop/Files) and very deep paths. Also captures and uses
file mtime as a tiebreaker — older files are usually the originals.
Applied to all four detection passes (sha256, phash, exif, filesize+dim)
and to auto-resolve-exact.
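A rough sketch of how such a score and tiebreaker could fit together. The folder names come from the description above; the weights, the depth limit, exact-name matching, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

# Folder names are from the commit text; every number below is hypothetical.
PENALTY_NAMES = {"trashed", "dups", "backup", "copy", "old", "archive"}
NAME_PENALTY = 10
REPEAT_PENALTY = 5
DEPTH_LIMIT = 8
DEPTH_PENALTY = 1


def path_penalty(path: str) -> int:
    """Higher means a less likely keeper. Only directories are scored."""
    dirs = [p.lower() for p in PurePosixPath(path).parent.parts if p != "/"]
    score = NAME_PENALTY * sum(1 for d in dirs if d in PENALTY_NAMES)
    # Repeated segments such as Desktop/Desktop/Files.
    score += REPEAT_PENALTY * sum(n - 1 for n in Counter(dirs).values())
    # Very deep paths.
    score += DEPTH_PENALTY * max(0, len(dirs) - DEPTH_LIMIT)
    return score


def pick_keeper(files: list[tuple[str, float]]) -> str:
    """files holds (path, mtime); lowest penalty wins, older mtime breaks ties."""
    return min(files, key=lambda f: (path_penalty(f[0]), f[1]))[0]
```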
New file_mtime column with idempotent migration.
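One common way to make such a migration idempotent, sketched here under the assumption that the store is SQLite and the table is called `files` (neither is stated above):

```python
import sqlite3


def add_file_mtime_column(conn: sqlite3.Connection) -> bool:
    """Add files.file_mtime if it is missing; safe to run on every startup.

    Returns True when the column was added, False when it already existed.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(files)")}
    if "file_mtime" in columns:
        return False
    conn.execute("ALTER TABLE files ADD COLUMN file_mtime REAL")
    conn.commit()
    return True
```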
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures every review action (keeper, redundant, skip, unreview, auto-resolve,
rescan-restore) with sha256 at decision time so a downstream tool can detect
stale decisions before touching disk.
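A sketch of how a downstream tool might use the recorded hash. The `Decision` shape and the helper names are hypothetical; only the idea of storing sha256 at decision time comes from the text above.

```python
import hashlib
import time
from dataclasses import dataclass, field


def sha256_of(path: str) -> str:
    """Hash the file in 1 MiB chunks so large media is never fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


@dataclass
class Decision:
    path: str
    action: str   # e.g. "keeper", "redundant", "skip"
    sha256: str   # content hash at decision time
    decided_at: float = field(default_factory=time.time)


def record(path: str, action: str) -> Decision:
    """Capture the action together with the file's current content hash."""
    return Decision(path=path, action=action, sha256=sha256_of(path))


def is_stale(decision: Decision) -> bool:
    """True if the file is gone or its content changed since the decision."""
    try:
        return sha256_of(decision.path) != decision.sha256
    except FileNotFoundError:
        return True
```

A mover would call `is_stale()` immediately before acting and skip any decision that no longer matches what is on disk.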
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Defer Takeout sidecar enrichment until after indexing so its UPDATE
  statements actually match rows. Previously it ran first and silently
  no-op'd on the very first scan because no files existed in the DB yet.
- Preserve user review decisions across incremental and regroup rescans.
  The grouping phase wipes duplicate_groups/duplicate_members, which
  also wiped reviewed=1 / is_keeper flags. Now snapshots reviewed groups
  by (method, frozenset of member file_ids) before the wipe and re-applies
  them to any post-regrouping group whose member set is unchanged.
- Replace 2-hex-char phash bucketing with multi-index pigeonhole
  (16 nibble buckets per hash). At threshold=10, the previous bucketing
  missed any near-duplicate pair that differed in the first byte, since
  they landed in different buckets and were never compared. Caches
  imagehash.hex_to_hash() per phash and dedups pair comparisons.
- Rewrite _suggested_keeper_by_resolution: previous implementation had
  a dead inner score() function and the lambda was missing the date
  tie-breaker (left as a TODO comment). Now picks largest pixels, ties
  by file size, then by oldest exif_datetime.
- Filter phash candidates to length(phash)=16 to skip malformed hashes
  rather than relying on the silent except in the comparison loop.
- Reject /api/scan/reset while a scan is running. Resetting mid-scan
  wiped tables the running scan thread was still writing to.
- Also clears stale 'redundant' file status (not just 'keeper') when
  a file no longer appears in any group after regrouping.
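The pigeonhole argument behind the new phash bucketing: a 64-bit hash has 16 nibbles, and two hashes within 10 bits of each other can differ in at most 10 of them, so at least 6 nibbles agree in both position and value. Filing every hash under all 16 (position, nibble) keys therefore guarantees that any qualifying pair shares a bucket. The sketch below uses plain integer XOR for the distance rather than imagehash, and the function names and dict-based input are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Bit distance between two 16-hex-char (64-bit) hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicates(hashes: dict[int, str], threshold: int = 10) -> dict:
    """Return {(id_a, id_b): distance} for pairs within the threshold."""
    if threshold >= 16:
        raise ValueError("nibble pigeonhole only holds for threshold < 16")
    buckets = defaultdict(list)
    for file_id, hex_hash in hashes.items():
        if len(hex_hash) != 16:
            continue  # skip malformed hashes up front
        for pos, nibble in enumerate(hex_hash.lower()):
            buckets[(pos, nibble)].append(file_id)

    seen = set()   # pairs already compared, so each is measured once
    results = {}
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            distance = hamming(hashes[a], hashes[b])
            if distance <= threshold:
                results[(a, b)] = distance
    return results
```

A pair such as 0000000000000000 and ff00000000000000 differs only in the first byte, which is exactly the case the old prefix bucketing never compared; here the two hashes still meet in the buckets for positions 2 through 15.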
Instead of walking the whole tree before indexing starts, workers now
receive files the instant os.walk yields them. The thread pool is opened before the
walk starts; each discovered file is submitted immediately. Completed
futures are drained after each directory to keep memory flat.
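The submit-while-walking pattern can be sketched as follows. `index_file` is a stand-in for the real per-file work, and `discover_and_index` is a hypothetical name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def index_file(path: str) -> tuple[str, int]:
    """Stand-in for the real per-file work (hashing, EXIF, and so on)."""
    return path, os.path.getsize(path)


def discover_and_index(root: str, workers: int = 8) -> list[tuple[str, int]]:
    results = []
    pending = []
    # The pool is open before the walk starts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                pending.append(pool.submit(index_file, path))
            # After each directory, collect finished futures so the
            # pending list does not grow with the size of the tree.
            still_running = []
            for fut in pending:
                if fut.done():
                    results.append(fut.result())
                else:
                    still_running.append(fut)
            pending = still_running
        # Walk finished: wait for whatever is left.
        for fut in pending:
            results.append(fut.result())
    return results
```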
Progress message shows:
"Discovering & indexing (8w): 1,234 — 5,678 found so far"
then once walk finishes:
"Indexing (8w): 8,000 / 9,100"
UI: merged Discovery + Indexing into a single "Discover + Index" phase pill.
Indeterminate progress bar stays on until total file count is known.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single-threaded indexing loop with ThreadPoolExecutor.
Default workers = min(cpu_count*2, 16), tunable via DUPFINDER_WORKERS
env var. Pre-loads all existing DB records in one query instead of
N per-file queries. Progress message shows worker count and live
done/total count. Skipped files bulk-stamped in batches of 500.
On an 8-core machine over NAS: ~4-8x faster indexing phase.
On NVMe: up to 16x faster with 16 workers.
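A sketch of the worker-count default and the 500-row batching. Falling back to the default when DUPFINDER_WORKERS is not a positive integer is an assumption, as are the function names.

```python
import os


def worker_count(env=os.environ) -> int:
    """min(cpu_count * 2, 16) unless DUPFINDER_WORKERS overrides it."""
    override = env.get("DUPFINDER_WORKERS", "").strip()
    if override.isdigit() and int(override) > 0:
        return int(override)
    # os.cpu_count() may return None, so treat that as a single core.
    return min((os.cpu_count() or 1) * 2, 16)


def batches(items: list, size: int = 500):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each slice from `batches()` would then feed one bulk UPDATE (for example via `executemany`) instead of one statement per skipped file.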
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GPU:
- Switch Dockerfile base to pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
- Add gpu_hasher.py: batched 2D DCT on GPU via PyTorch matrix multiply,
  256 images/batch, produces imagehash-compatible 64-bit hex hashes,
  auto-falls back to CPU when CUDA is unavailable
- Replace per-image phash loop in scanner.py with phasher.hash_files()
- docker-compose.yml: add nvidia GPU device reservation
Hang fix:
- takeout.is_takeout_folder() now caps at 50 directories (it previously
  walked the entire tree, which blocked for minutes on 65k+ file libraries)
- Add "Not a Takeout folder" status message so takeout phase is never silent
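The directory cap can be sketched with a bounded os.walk. What actually marks a folder as Takeout is not stated above, so the `.json` sidecar check here is a hypothetical placeholder passed in as a parameter.

```python
import os


def has_json_sidecar(filenames: list[str]) -> bool:
    """Hypothetical marker: any .json file in the directory."""
    return any(name.lower().endswith(".json") for name in filenames)


def is_takeout_folder(root: str, max_dirs: int = 50, marker=has_json_sidecar) -> bool:
    """Inspect at most `max_dirs` directories, then give up."""
    walker = os.walk(root)
    for visited, (_dirpath, _dirnames, filenames) in enumerate(walker, start=1):
        if marker(filenames):
            return True
        if visited >= max_dirs:
            break  # os.walk is lazy, so nothing further is scanned
    return False
```

Because os.walk is a generator, breaking out of the loop stops the traversal immediately rather than after the whole tree has been listed.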
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scanner now updates its status message every 250 files during os.walk so the UI
shows a live count. Progress bar switches to an indeterminate animated
pulse during discovery and takeout phases (no known total yet), then
reverts to a normal percentage bar once indexing begins.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New GET /api/browse endpoint lists subdirectories at any path.
UI gets a folder icon button next to each path input that opens
a browsable directory tree modal. Escape or Cancel closes it;
clicking a folder navigates into it; Select confirms the choice.
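A framework-agnostic sketch of the listing behind such an endpoint. The response shape, the function name, and the choice to hide dot-folders and skip symlinked directories are assumptions.

```python
import os


def list_subdirectories(path: str) -> dict:
    """Resolved path, its parent, and the sorted names of child folders."""
    base = os.path.realpath(path)
    with os.scandir(base) as entries:
        names = sorted(
            e.name for e in entries
            if e.is_dir(follow_symlinks=False) and not e.name.startswith(".")
        )
    return {
        "path": base,
        "parent": os.path.dirname(base),  # lets the modal offer an "up" link
        "directories": names,
    }
```

A real handler would also want to restrict `path` to the mounted library roots before listing it.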
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>