duplicate-finder

Author	SHA1	Message	Date
Carlos	293355b724	SFTP: switch to Transport-based connection (fixes Synology 'Channel closed') paramiko's SSHClient.open_sftp() allocates an exec channel before the SFTP subsystem request, which Synology DSM closes immediately with 'Channel closed'. Manual sftp(1) and WinSCP avoid this by going straight to the SFTP subsystem on a fresh channel. Replaced SSHClient with direct paramiko.Transport + SFTPClient.from_transport, matching the OpenSSH/WinSCP flow. Larger flow-control windows (128 MB) too since Synology has been observed to bail mid-handshake with the default 1 MB. test_connection_verbose now reports per-step status (connect+auth, open_sftp, listdir /, stat base_path, write probe). API returns the steps array so the UI can show exactly which step failed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 21:43:56 -04:00
Carlos	7436b23db3	Stage 2 #1 : SFTP destinations CRUD + connection test Foundation for the move/quarantine pipeline. Lets users register one or more remote SFTP destinations through the API, store credentials at rest under /data/sftp/{id}.{password|key} (mode 600), and verify connectivity + write access via a test endpoint. Endpoints: GET /api/sftp/destinations POST /api/sftp/destinations — create PUT /api/sftp/destinations/{id} — update DELETE /api/sftp/destinations/{id} POST /api/sftp/destinations/{id}/test — connect, stat base_path, mkdir probe POST /api/sftp/keypair — generate ED25519 keypair Host keys pinned per-destination on first connect (TOFU); subsequent mismatches are rejected. paramiko added to requirements. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 20:04:42 -04:00
Carlos	8b0fee0055	Folder priority + path penalty: match folder segments only, not filenames Both _folder_priority and _path_penalty were scanning the entire path string including the basename. A file named 'mytrashed_pic.jpg' in /photos/MobileBackup/ would falsely match the 'trash' token. Now only directory segments are checked; filename never influences keeper selection beyond its actual path location. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 18:48:30 -04:00
Carlos	759288b37e	Pre-generate all thumbnails up-front, not on scroll After every scan, automatically kick off a background thread that generates a JPEG thumbnail for every file in a duplicate group and caches it locally at /data/thumbs/. Idempotent — already-cached files are skipped. New endpoints: POST /api/thumbs/generate — start pre-gen for all files POST /api/thumbs/generate?only_in_groups=true — only dup-group files GET /api/thumbs/status — progress (total/done/skipped/failed) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 16:33:19 -04:00
Carlos	4c21e9fa1c	Add workstation-local thumbnail cache + HEIC support Thumbnails (256px JPEG, q80) generated on first request and cached at /data/thumbs/<shard>/<file_id>.jpg — i.e. on the workstation's local SSD, not the NAS. Subsequent requests serve straight from cache, never re-fetching from /photos. HEIC/HEIF decoded via pillow-heif so iPhone photos finally render. Videos cached as a single ffmpeg-extracted frame, not regenerated each request. New DELETE /api/thumb/cache endpoint to wipe it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 16:29:29 -04:00
Carlos	81b38cb5bb	CSV export: path column now contains directory only Filename was duplicated in both columns; trimmed the basename off path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 16:03:34 -04:00
Carlos	6827c5965f	Lowest priority (11) for Google Photos / Takeout / backup folders Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 15:53:15 -04:00
Carlos	399a80cb70	Add explicit folder-priority ranking for keeper selection #recycle (10) ranks worst, MobileBackup (1) best, default 2. Folder priority dominates resolution + path-penalty; mtime stays as final tiebreak. Override via /data/folder_priority.json (cached per process). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 15:52:04 -04:00
Carlos	d95bf69be0	Fix CSV export crash on filenames with embedded newlines Use QUOTE_ALL + sanitise NUL/CR/LF in path/filename/exif fields. Default csv dialect rejected fields containing line terminators with 'need to escape, but no escapechar set'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 13:17:11 -04:00
Carlos	14c6012808	Smarter keeper selection: folder-name + mtime signals Adds a path-penalty score that downranks files in folders named Trashed, Dups, Backup, Copy, Old, Archive, plus a penalty for repeated path segments (e.g. Desktop/Desktop/Files) and very deep paths. Also captures and uses file mtime as a tiebreaker — older files are usually the originals. Applied to all four detection passes (sha256, phash, exif, filesize+dim) and to auto-resolve-exact. New file_mtime column with idempotent migration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 10:56:52 -04:00
Carlos	6a4134762c	Add decisions audit log for future move/delete tool Captures every review action (keeper, redundant, skip, unreview, auto-resolve, rescan-restore) with sha256 at decision time so a downstream tool can detect stale decisions before touching disk. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 01:40:54 -04:00
Carlos	3001be3a92	Fix correctness bugs in scanner and reset endpoint - Defer Takeout sidecar enrichment until after indexing so its UPDATE statements actually match rows. Previously it ran first and silently no-op'd on the very first scan because no files existed in the DB yet. - Preserve user review decisions across incremental and regroup rescans. The grouping phase wipes duplicate_groups/duplicate_members, which also wiped reviewed=1 / is_keeper flags. Now snapshots reviewed groups by (method, frozenset of member file_ids) before the wipe and re-applies them to any post-regrouping group whose member set is unchanged. - Replace 2-hex-char phash bucketing with multi-index pigeonhole (16 nibble buckets per hash). At threshold=10, the previous bucketing missed any near-duplicate pair that differed in the first byte, since they landed in different buckets and were never compared. Caches imagehash.hex_to_hash() per phash and dedups pair comparisons. - Rewrite _suggested_keeper_by_resolution: previous implementation had a dead inner score() function and the lambda was missing the date tie-breaker (left as a TODO comment). Now picks largest pixels, ties by file size, then by oldest exif_datetime. - Filter phash candidates to length(phash)=16 to skip malformed hashes rather than relying on the silent except in the comparison loop. - Reject /api/scan/reset while a scan is running. Resetting mid-scan wiped tables the running scan thread was still writing to. - Also clears stale 'redundant' file status (not just 'keeper') when a file no longer appears in any group after regrouping.	2026-04-24 00:42:13 -04:00
tocmo	356f922940	feat: replace Cancel with Pause/Resume — survives server restarts - scanner.py: replace cancel_requested with pause_requested throughout; pause during walk drains in-flight futures gracefully then saves state; phash phase processes in 500-image chunks with pause check between each; _save_pause_state() persists files_indexed/phashes_done/last_phase to DB; init_db() already detects killed-mid-scan (running→paused) on startup - main.py: add POST /api/scan/pause and POST /api/scan/resume endpoints; /api/scan/cancel kept as alias; scan_status now returns folder_path, files_indexed, phashes_done; scan_reset clears all new fields - index.html: "Cancel" → "⏸ Pause" button; new #paused-area banner shows folder, files indexed, phashes done with "▶ Resume" and "Full reset" buttons; updateScanUI handles paused status; pauseScan()/resumeScan() JS functions added; chip gains .paused amber style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 02:11:00 -04:00
tocmo	a6748de6e0	Pipeline discovery and indexing — workers start immediately Instead of walk-everything-first then index, workers now receive files the instant os.walk yields them. The thread pool is open before the walk starts; each discovered file is submitted immediately. Completed futures are drained after each directory to keep memory flat. Progress message shows: "Discovering & indexing (8w): 1,234 — 5,678 found so far" then once walk finishes: "Indexing (8w): 8,000 / 9,100" UI: merged Discovery + Indexing into a single "Discover + Index" phase pill. Indeterminate progress bar stays on until total file count is known. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 01:54:43 -04:00
tocmo	fef364162c	Parallel SHA-256 indexing with thread pool Replace single-threaded indexing loop with ThreadPoolExecutor. Default workers = min(cpu_count*2, 16), tunable via DUPFINDER_WORKERS env var. Pre-loads all existing DB records in one query instead of N per-file queries. Progress message shows worker count and live done/total count. Skipped files bulk-stamped in batches of 500. On an 8-core machine over NAS: ~4-8x faster indexing phase. On NVMe: up to 16x faster with 16 workers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 01:48:30 -04:00
tocmo	c110a8e4f9	GPU-accelerated phash + fix discovery/takeout hang GPU: - Switch Dockerfile base to pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime - Add gpu_hasher.py: batched 2D DCT on GPU via PyTorch matrix multiply, 256 images/batch, produces imagehash-compatible 64-bit hex hashes, auto-falls back to CPU when CUDA unavailable - Replace per-image phash loop in scanner.py with phasher.hash_files() - docker-compose.yml: add nvidia GPU device reservation Hang fix: - takeout.is_takeout_folder() now caps at 50 directories (was walking entire tree — blocked for minutes on 65k+ file libraries) - Add "Not a Takeout folder" status message so takeout phase is never silent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 01:37:28 -04:00
tocmo	b519e065cb	Fix discovery phase appearing frozen Scanner now updates message every 250 files during os.walk so the UI shows a live count. Progress bar switches to an indeterminate animated pulse during discovery and takeout phases (no known total yet), then reverts to a normal percentage bar once indexing begins. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 01:25:41 -04:00
tocmo	6e7bb241ad	Add .gitignore, remove pycache and db from tracking	2026-04-04 23:55:53 -04:00
tocmo	c19825c523	Add server-side folder picker New GET /api/browse endpoint lists subdirectories at any path. UI gets a folder icon button next to each path input that opens a browsable directory tree modal. Escape or Cancel closes it, clicking a folder navigates into it, Select confirms the choice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 23:55:42 -04:00
tocmo	868da9016d	Initial implementation of duplicate finder Full project per spec: FastAPI backend, 4-method duplicate detection (SHA-256, phash, EXIF, filesize), Google Takeout pre-processor, 4 scan modes, and dark-theme vanilla JS gallery frontend. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 23:42:58 -04:00

20 Commits
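
Commit 3001be3a92 replaces 2-hex-char phash bucketing with a multi-index pigeonhole scheme (16 nibble buckets per hash). The idea can be sketched as follows. This is only an illustration, not the repository's code: the function names and the `hashes` mapping are hypothetical. A 64-bit phash has 16 hex nibbles, so two hashes that differ in at most 10 bits differ in at most 10 nibbles and must agree on at least one (position, nibble) pair. Indexing every hash under all 16 of its pairs therefore guarantees that such near-duplicates meet in at least one bucket.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Bit distance between two hex-encoded hashes of equal length."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicate_pairs(hashes, threshold=10):
    """Return {(id_a, id_b)} with Hamming distance <= threshold.

    `hashes` maps a file id to a 16-hex-char phash (hypothetical shape).
    The pigeonhole guarantee only holds for threshold < 16.
    """
    buckets = defaultdict(list)
    for file_id, h in hashes.items():
        if len(h) != 16:  # skip malformed hashes up front
            continue
        for pos, nibble in enumerate(h):
            buckets[(pos, nibble)].append(file_id)

    seen = set()   # pairs already compared, so each pair is checked once
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            if hamming(hashes[a], hashes[b]) <= threshold:
                pairs.add((a, b))
    return pairs
```

With bucketing on the first two hex characters only, the hashes `ff00000000000000` and `0000000000000000` (8 bits apart) would never be compared; here they share 14 buckets and are reported as a pair.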
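
Commit 8b0fee0055 restricts folder matching to directory segments so that a filename such as 'mytrashed_pic.jpg' no longer triggers the 'trash' token. A minimal sketch of that check is below. The token tuple and the helper names are assumptions made for illustration, and substring matching within a segment is assumed from the wording of the commit message; the real `_folder_priority` and `_path_penalty` may differ.

```python
from pathlib import PurePosixPath

# Hypothetical token list for illustration only.
BAD_TOKENS = ("trash", "#recycle")


def dir_segments(path: str) -> list:
    """Lower-cased directory segments of `path`, excluding the basename."""
    parts = PurePosixPath(path).parent.parts
    return [p.lower() for p in parts if p != "/"]


def matches_bad_folder(path: str, tokens=BAD_TOKENS) -> bool:
    """True if any directory segment contains one of the tokens."""
    return any(tok in seg for seg in dir_segments(path) for tok in tokens)
```

Because the basename is dropped before matching, `/photos/MobileBackup/mytrashed_pic.jpg` is not flagged, while `/photos/Trashed/pic.jpg` is.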
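
Commit 3001be3a92 also describes preserving review decisions across regrouping by snapshotting reviewed groups under the key (method, frozenset of member file_ids) and re-applying them to any new group with an unchanged member set. The sketch below shows that keying on plain dictionaries. The dictionary fields (`method`, `member_ids`, `reviewed`, `keeper_id`) are hypothetical stand-ins for the real database rows.

```python
def snapshot_reviewed(groups):
    """Map (method, frozenset(member ids)) -> keeper id for reviewed groups."""
    return {
        (g["method"], frozenset(g["member_ids"])): g["keeper_id"]
        for g in groups
        if g["reviewed"]
    }


def reapply_reviews(snapshot, new_groups):
    """Restore review state on regrouped groups whose member set is unchanged."""
    for g in new_groups:
        key = (g["method"], frozenset(g["member_ids"]))
        if key in snapshot:
            g["reviewed"] = True
            g["keeper_id"] = snapshot[key]
    return new_groups
```

Using a frozenset makes the key independent of member order, and a group that gained or lost a member after the rescan simply fails the lookup and stays unreviewed.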
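
The keeper ordering described in commit 3001be3a92 (largest pixel count, ties by file size, then by oldest exif_datetime) maps naturally onto a tuple sort key. The following is a sketch under assumptions: the `Candidate` dataclass is hypothetical, datetimes are assumed to be naive, and files without an EXIF date are assumed to lose the final tie-break, which the commit message does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    file_id: int
    width: int
    height: int
    file_size: int
    exif_datetime: Optional[datetime] = None


def suggested_keeper(candidates):
    """Pick the candidate with most pixels, then largest size, then oldest EXIF date."""
    def key(c):
        # Assumption: a missing EXIF date sorts after every real date.
        taken = c.exif_datetime or datetime.max
        return (-(c.width * c.height), -c.file_size, taken)

    return min(candidates, key=key)
```

Negating the pixel count and file size lets a single `min` call express "largest first, oldest first" without a custom comparator.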
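
Commit d95bf69be0 fixes the CSV export by using QUOTE_ALL and sanitising NUL/CR/LF in text fields. A minimal sketch of that approach with the standard `csv` module is shown below. The column names, the `export_rows` helper, and the choice to replace CR/LF with a space (rather than dropping them) are assumptions for illustration.

```python
import csv
import io

# NUL is dropped; CR and LF become spaces (replacement choice is an assumption).
_CONTROL = str.maketrans({"\x00": "", "\r": " ", "\n": " "})


def clean(value) -> str:
    """Stringify a field and neutralise NUL/CR/LF characters."""
    return "" if value is None else str(value).translate(_CONTROL)


def export_rows(rows) -> str:
    """Render (path, filename) tuples as CSV text with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["path", "filename"])
    for path, filename in rows:
        writer.writerow([clean(path), clean(filename)])
    return buf.getvalue()
```

Quoting every field keeps the output shape uniform, and stripping line terminators first means each record stays on one physical line for tools that read the file line by line.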