Commit Graph

14 Commits

Author SHA1 Message Date
Carlos
7436b23db3 Stage 2 #1: SFTP destinations CRUD + connection test
Foundation for the move/quarantine pipeline. Lets users register one or
more remote SFTP destinations through the API, store credentials at rest
under /data/sftp/{id}.{password|key} (mode 600), and verify connectivity
+ write access via a test endpoint.

Endpoints:
  GET    /api/sftp/destinations
  POST   /api/sftp/destinations             — create
  PUT    /api/sftp/destinations/{id}        — update
  DELETE /api/sftp/destinations/{id}
  POST   /api/sftp/destinations/{id}/test   — connect, stat base_path, mkdir probe
  POST   /api/sftp/keypair                  — generate ED25519 keypair

Host keys pinned per-destination on first connect (TOFU); subsequent
mismatches are rejected. paramiko added to requirements.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 20:04:42 -04:00
Carlos
8b0fee0055 Folder priority + path penalty: match folder segments only, not filenames
Both _folder_priority and _path_penalty were scanning the entire path
string including the basename. A file named 'mytrashed_pic.jpg' in
/photos/MobileBackup/ would falsely match the 'trash' token.

Now only directory segments are checked; filename never influences keeper
selection beyond its actual path location.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 18:48:30 -04:00
Carlos
6827c5965f Lowest priority (11) for Google Photos / Takeout / backup folders
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 15:53:15 -04:00
Carlos
399a80cb70 Add explicit folder-priority ranking for keeper selection
#recycle (10) ranks worst, MobileBackup (1) best, default 2.
Folder priority dominates resolution + path-penalty; mtime stays as final
tiebreak. Override via /data/folder_priority.json (cached per process).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 15:52:04 -04:00
Carlos
14c6012808 Smarter keeper selection: folder-name + mtime signals
Adds a path-penalty score that downranks files in folders named Trashed,
Dups, Backup, Copy, Old, Archive, plus a penalty for repeated path segments
(e.g. Desktop/Desktop/Files) and very deep paths. Also captures and uses
file mtime as a tiebreaker — older files are usually the originals.

Applied to all four detection passes (sha256, phash, exif, filesize+dim)
and to auto-resolve-exact.

New file_mtime column with idempotent migration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 10:56:52 -04:00
Carlos
6a4134762c Add decisions audit log for future move/delete tool
Captures every review action (keeper, redundant, skip, unreview, auto-resolve,
rescan-restore) with sha256 at decision time so a downstream tool can detect
stale decisions before touching disk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 01:40:54 -04:00
Carlos
3001be3a92 Fix correctness bugs in scanner and reset endpoint
- Defer Takeout sidecar enrichment until after indexing so its UPDATE
  statements actually match rows. Previously it ran first and silently
  no-op'd on the very first scan because no files existed in the DB yet.

- Preserve user review decisions across incremental and regroup rescans.
  The grouping phase wipes duplicate_groups/duplicate_members, which
  also wiped reviewed=1 / is_keeper flags. Now snapshots reviewed groups
  by (method, frozenset of member file_ids) before the wipe and re-applies
  them to any post-regrouping group whose member set is unchanged.

- Replace 2-hex-char phash bucketing with multi-index pigeonhole
  (16 nibble buckets per hash). At threshold=10, the previous bucketing
  missed any near-duplicate pair that differed in the first byte, since
  they landed in different buckets and were never compared. Caches
  imagehash.hex_to_hash() per phash and dedups pair comparisons.

- Rewrite _suggested_keeper_by_resolution: previous implementation had
  a dead inner score() function and the lambda was missing the date
  tie-breaker (left as a TODO comment). Now picks largest pixels, ties
  by file size, then by oldest exif_datetime.

- Filter phash candidates to length(phash)=16 to skip malformed hashes
  rather than relying on the silent except in the comparison loop.

- Reject /api/scan/reset while a scan is running. Resetting mid-scan
  wiped tables the running scan thread was still writing to.

- Also clears stale 'redundant' file status (not just 'keeper') when
  a file no longer appears in any group after regrouping.
2026-04-24 00:42:13 -04:00
tocmo
356f922940 feat: replace Cancel with Pause/Resume — survives server restarts
- scanner.py: replace cancel_requested with pause_requested throughout;
  pause during walk drains in-flight futures gracefully then saves state;
  phash phase processes in 500-image chunks with pause check between each;
  _save_pause_state() persists files_indexed/phashes_done/last_phase to DB;
  init_db() already detects killed-mid-scan (running→paused) on startup

- main.py: add POST /api/scan/pause and POST /api/scan/resume endpoints;
  /api/scan/cancel kept as alias; scan_status now returns folder_path,
  files_indexed, phashes_done; scan_reset clears all new fields

- index.html: "Cancel" → "⏸ Pause" button; new #paused-area banner shows
  folder, files indexed, phashes done with "▶ Resume" and "Full reset"
  buttons; updateScanUI handles paused status; pauseScan()/resumeScan()
  JS functions added; chip gains .paused amber style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:11:00 -04:00
tocmo
a6748de6e0 Pipeline discovery and indexing — workers start immediately
Instead of walk-everything-first then index, workers now receive files
the instant os.walk yields them. The thread pool is open before the
walk starts; each discovered file is submitted immediately. Completed
futures are drained after each directory to keep memory flat.

Progress message shows:
  "Discovering & indexing (8w): 1,234 — 5,678 found so far"
  then once walk finishes:
  "Indexing (8w): 8,000 / 9,100"

UI: merged Discovery + Indexing into a single "Discover + Index" phase pill.
Indeterminate progress bar stays on until total file count is known.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:54:43 -04:00
tocmo
fef364162c Parallel SHA-256 indexing with thread pool
Replace single-threaded indexing loop with ThreadPoolExecutor.
Default workers = min(cpu_count*2, 16), tunable via DUPFINDER_WORKERS
env var. Pre-loads all existing DB records in one query instead of
N per-file queries. Progress message shows worker count and live
done/total count. Skipped files bulk-stamped in batches of 500.

On an 8-core machine over NAS: ~4-8x faster indexing phase.
On NVMe: up to 16x faster with 16 workers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:48:30 -04:00
tocmo
c110a8e4f9 GPU-accelerated phash + fix discovery/takeout hang
GPU:
- Switch Dockerfile base to pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
- Add gpu_hasher.py: batched 2D DCT on GPU via PyTorch matrix multiply,
  256 images/batch, produces imagehash-compatible 64-bit hex hashes,
  auto-falls back to CPU when CUDA unavailable
- Replace per-image phash loop in scanner.py with phasher.hash_files()
- docker-compose.yml: add nvidia GPU device reservation

Hang fix:
- takeout.is_takeout_folder() now caps at 50 directories (was walking
  entire tree — blocked for minutes on 65k+ file libraries)
- Add "Not a Takeout folder" status message so takeout phase is never silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:37:28 -04:00
tocmo
b519e065cb Fix discovery phase appearing frozen
Scanner now updates message every 250 files during os.walk so the UI
shows a live count. Progress bar switches to an indeterminate animated
pulse during discovery and takeout phases (no known total yet), then
reverts to a normal percentage bar once indexing begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:25:41 -04:00
tocmo
c19825c523 Add server-side folder picker
New GET /api/browse endpoint lists subdirectories at any path.
UI gets a folder icon button next to each path input that opens
a browsable directory tree modal. Escape or Cancel closes it,
clicking a folder navigates into it, Select confirms the choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:55:42 -04:00
tocmo
868da9016d Initial implementation of duplicate finder
Full project per spec: FastAPI backend, 4-method duplicate detection
(SHA-256, phash, EXIF, filesize), Google Takeout pre-processor,
4 scan modes, and dark-theme vanilla JS gallery frontend.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:42:58 -04:00