Compare commits

29 Commits

Author SHA1 Message Date
Carlos
293355b724 SFTP: switch to Transport-based connection (fixes Synology 'Channel closed')
paramiko's SSHClient.open_sftp() allocates an exec channel before the
SFTP subsystem request, which Synology DSM closes immediately with
'Channel closed'. Manual sftp(1) and WinSCP avoid this by going straight
to the SFTP subsystem on a fresh channel.

Replaced SSHClient with direct paramiko.Transport + SFTPClient.from_transport,
matching the OpenSSH/WinSCP flow. Larger flow-control windows (128 MB) too
since Synology has been observed to bail mid-handshake with the default 1 MB.

test_connection_verbose now reports per-step status (connect+auth,
open_sftp, listdir /, stat base_path, write probe). API returns the
steps array so the UI can show exactly which step failed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 21:43:56 -04:00
Carlos
a7b023c193 Stage 2 #4: Destinations management UI
Adds 'Destinations' sidebar entry + view + add/edit/delete/test modal.
Generate-keypair button shows the public key for the user to paste into
the remote authorized_keys.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 20:29:22 -04:00
Carlos
7436b23db3 Stage 2 #1: SFTP destinations CRUD + connection test
Foundation for the move/quarantine pipeline. Lets users register one or
more remote SFTP destinations through the API, store credentials at rest
under /data/sftp/{id}.{password|key} (mode 600), and verify connectivity
+ write access via a test endpoint.

Endpoints:
  GET    /api/sftp/destinations
  POST   /api/sftp/destinations             — create
  PUT    /api/sftp/destinations/{id}        — update
  DELETE /api/sftp/destinations/{id}
  POST   /api/sftp/destinations/{id}/test   — connect, stat base_path, mkdir probe
  POST   /api/sftp/keypair                  — generate ED25519 keypair

Host keys pinned per-destination on first connect (TOFU); subsequent
mismatches are rejected. paramiko added to requirements.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 20:04:42 -04:00
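The trust-on-first-use pinning described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON pin file, the `check_host_key` helper, and the `HostKeyMismatch` exception are hypothetical names, and the real code presumably obtains the key bytes from paramiko rather than taking them as an argument.

```python
import hashlib
import json
from pathlib import Path


class HostKeyMismatch(Exception):
    """Raised when a server presents a key that differs from the pinned one."""


def fingerprint(key_bytes: bytes) -> str:
    # SHA-256 hex digest of the raw public key blob.
    return hashlib.sha256(key_bytes).hexdigest()


def check_host_key(pin_file: Path, dest_id: str, key_bytes: bytes) -> str:
    """Pin on first contact, accept a matching key, reject anything else.

    pin_file and dest_id are hypothetical: one JSON file mapping a
    destination id to the fingerprint seen on the first connection.
    """
    pins = json.loads(pin_file.read_text()) if pin_file.exists() else {}
    seen = fingerprint(key_bytes)
    known = pins.get(dest_id)
    if known is None:
        # First connection: trust the key and remember it.
        pins[dest_id] = seen
        pin_file.write_text(json.dumps(pins))
        return "pinned"
    if known != seen:
        raise HostKeyMismatch(f"host key changed for destination {dest_id}")
    return "match"
```

Storing only a fingerprint keeps the pin file small; a real implementation might store the full key so it can be handed back to the SSH library.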
Carlos
8b0fee0055 Folder priority + path penalty: match folder segments only, not filenames
Both _folder_priority and _path_penalty were scanning the entire path
string including the basename. A file named 'mytrashed_pic.jpg' in
/photos/MobileBackup/ would falsely match the 'trash' token.

Now only directory segments are checked; filename never influences keeper
selection beyond its actual path location.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 18:48:30 -04:00
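A minimal sketch of the directory-only matching this commit describes. The helper names and the substring-style token match are assumptions for illustration; the actual `_folder_priority` and `_path_penalty` code is not shown on this page.

```python
from pathlib import PurePosixPath


def folder_segments(path: str) -> list[str]:
    # Drop the basename, then drop the root "/" so only folder names remain.
    parts = PurePosixPath(path).parent.parts
    return [seg.lower() for seg in parts if seg != "/"]


def matches_token(path: str, tokens: tuple[str, ...]) -> bool:
    # Assumed matching rule: a token matches if it occurs inside any folder name.
    return any(tok in seg for seg in folder_segments(path) for tok in tokens)
```

With this shape, `/photos/MobileBackup/mytrashed_pic.jpg` no longer matches `trash`, while `/photos/Trashed/pic.jpg` still does.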
Carlos
3128ddc593 Fix 'failed to load group' on click
The detail-panel insertion logic mixed parent contexts: it called
grid.parentNode.insertBefore() but used a child-of-grid as the reference
node. insertBefore requires the reference node to be a child of the
target parent — it threw 'node is not a child of this node' on every
click.

Replaced the inter-row positioning with simple insert-after-grid. Same
visual outcome since panel.scrollIntoView() handles user focus.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 18:29:32 -04:00
Carlos
759288b37e Pre-generate all thumbnails up-front, not on scroll
After every scan, automatically kick off a background thread that
generates a JPEG thumbnail for every file in a duplicate group and
caches it locally at /data/thumbs/. Idempotent — already-cached files
are skipped.

New endpoints:
  POST /api/thumbs/generate            — start pre-gen for all files
  POST /api/thumbs/generate?only_in_groups=true  — only dup-group files
  GET  /api/thumbs/status              — progress (total/done/skipped/failed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 16:33:19 -04:00
Carlos
4c21e9fa1c Add workstation-local thumbnail cache + HEIC support
Thumbnails (256px JPEG, q80) generated on first request and cached at
/data/thumbs/<shard>/<file_id>.jpg — i.e. on the workstation's local SSD,
not the NAS. Subsequent requests serve straight from cache, never
re-fetching from /photos.

HEIC/HEIF decoded via pillow-heif so iPhone photos finally render.
Videos cached as a single ffmpeg-extracted frame, not regenerated each
request. New DELETE /api/thumb/cache endpoint to wipe it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 16:29:29 -04:00
Carlos
81b38cb5bb CSV export: path column now contains directory only
Filename was duplicated in both columns; trimmed the basename off path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 16:03:34 -04:00
Carlos
6827c5965f Lowest priority (11) for Google Photos / Takeout / backup folders
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 15:53:15 -04:00
Carlos
399a80cb70 Add explicit folder-priority ranking for keeper selection
#recycle (10) ranks worst, MobileBackup (1) best, default 2.
Folder priority dominates resolution + path-penalty; mtime stays as final
tiebreak. Override via /data/folder_priority.json (cached per process).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 15:52:04 -04:00
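One way to read the ordering above is as a tuple sort key. The sketch below is an assumption-laden illustration: the `Candidate` fields, the relative order of resolution and path penalty, and the rule that the worst-ranked folder in a path wins are all hypothetical, since the commit only states that folder priority comes first and mtime last.

```python
from dataclasses import dataclass

# Values taken from the commit message; lower is better.
FOLDER_PRIORITY = {"mobilebackup": 1, "#recycle": 10}
DEFAULT_PRIORITY = 2


@dataclass
class Candidate:
    path: str
    pixels: int        # width * height
    path_penalty: int  # hypothetical penalty score, lower is better
    mtime: float       # seconds since epoch


def folder_priority(path: str) -> int:
    # Only folder names are inspected; the basename is dropped.
    segments = [s.lower() for s in path.split("/")[:-1] if s]
    ranks = [FOLDER_PRIORITY[s] for s in segments if s in FOLDER_PRIORITY]
    # Assumption: the worst-ranked folder in the path decides.
    return max(ranks) if ranks else DEFAULT_PRIORITY


def keeper_key(c: Candidate) -> tuple:
    # Folder priority dominates; mtime (older first) is the final tiebreak.
    return (folder_priority(c.path), -c.pixels, c.path_penalty, c.mtime)


def pick_keeper(candidates: list[Candidate]) -> Candidate:
    return min(candidates, key=keeper_key)
```

Because tuples compare element by element, a better folder rank always wins regardless of resolution, which is what "dominates" means in the message above.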
Carlos
d95bf69be0 Fix CSV export crash on filenames with embedded newlines
Use QUOTE_ALL + sanitise NUL/CR/LF in path/filename/exif fields. Default
csv dialect rejected fields containing line terminators with 'need to
escape, but no escapechar set'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 13:17:11 -04:00
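The fix can be sketched with the standard `csv` module. The `sanitise` and `rows_to_csv` helpers are hypothetical names, and replacing CR/LF with a space is an assumed choice; the commit only says those characters are sanitised.

```python
import csv
import io

# NUL is dropped; CR and LF become spaces so each record stays on one line.
_CONTROL = str.maketrans({"\x00": None, "\r": " ", "\n": " "})


def sanitise(value) -> str:
    return str(value).translate(_CONTROL)


def rows_to_csv(rows) -> str:
    buf = io.StringIO()
    # QUOTE_ALL wraps every field in quotes, so no escapechar is needed.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([sanitise(v) for v in row])
    return buf.getvalue()
```

Sanitising in addition to quoting keeps the export readable in tools that split on raw newlines before parsing quotes.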
Carlos
14c6012808 Smarter keeper selection: folder-name + mtime signals
Adds a path-penalty score that downranks files in folders named Trashed,
Dups, Backup, Copy, Old, Archive, plus a penalty for repeated path segments
(e.g. Desktop/Desktop/Files) and very deep paths. Also captures and uses
file mtime as a tiebreaker — older files are usually the originals.

Applied to all four detection passes (sha256, phash, exif, filesize+dim)
and to auto-resolve-exact.

New file_mtime column with idempotent migration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 10:56:52 -04:00
Carlos
4d57b0af74 Bump package to 1.0.2
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 01:41:50 -04:00
Carlos
6a4134762c Add decisions audit log for future move/delete tool
Captures every review action (keeper, redundant, skip, unreview, auto-resolve,
rescan-restore) with sha256 at decision time so a downstream tool can detect
stale decisions before touching disk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 01:40:54 -04:00
Carlos
79ab0dbb05 Fix stale Gitea token in build-deb.sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 01:20:53 -04:00
Carlos
077fbd7e8f Fix .deb source staging — preserve app/ subdir for Dockerfile
build-deb.sh used 'cp -r app/ source/' which renames app to source
when source doesn't yet exist, dropping the app/ wrapper that the
Dockerfile's COPY app/ /app/ depends on. The 2>/dev/null || true
on the cp lines hid the resulting failures, so the .deb shipped a
broken /opt/dupfinder/source/ that build-from-source could not use.

Pre-create the source dir and copy each item to its explicit
destination path. Bump package version to 1.0.1.

Also rework dupfinder-setup.sh's image-prep step: prefer a local
image, then a quiet registry pull, then build from the bundled
source. Removes the loud registry-not-found error that scared users
when the (unpublished) tocmo0nlord/dupfinder image wasn't on Docker
Hub.
2026-04-24 01:05:53 -04:00
Carlos
76e89a7313 Fix .deb install path and Gitea upload auth
The .deb install instructions in the README pointed at a URL that
doesn't exist — Gitea exposes the Debian registry as an apt repo, not
as plain file downloads. Switched the README to the apt-repo flow
(add a sources.list line, then apt install).

Also fixed build-deb.sh: Gitea's Debian package endpoint returns
HTTP 405 for token-bearer auth; it requires HTTP basic auth (user +
token-as-password) and the literal /upload suffix on the URL.

Package built and pushed to the registry — apt install works now.
2026-04-24 00:55:11 -04:00
Carlos
90790b648d Rewrite README install instructions for end users
Lay out the three install paths (Windows installer, .deb package, manual
docker compose) with concrete numbered steps and a 'pick your method'
table at the top so users don't have to read past their own platform.
Add a using-it walkthrough, a scan-mode explanation, and a short
troubleshooting section.
2026-04-24 00:48:20 -04:00
Carlos
3001be3a92 Fix correctness bugs in scanner and reset endpoint
- Defer Takeout sidecar enrichment until after indexing so its UPDATE
  statements actually match rows. Previously it ran first and silently
  no-op'd on the very first scan because no files existed in the DB yet.

- Preserve user review decisions across incremental and regroup rescans.
  The grouping phase wipes duplicate_groups/duplicate_members, which
  also wiped reviewed=1 / is_keeper flags. Now snapshots reviewed groups
  by (method, frozenset of member file_ids) before the wipe and re-applies
  them to any post-regrouping group whose member set is unchanged.

- Replace 2-hex-char phash bucketing with multi-index pigeonhole
  (16 nibble buckets per hash). At threshold=10, the previous bucketing
  missed any near-duplicate pair that differed in the first byte, since
  they landed in different buckets and were never compared. Caches
  imagehash.hex_to_hash() per phash and dedups pair comparisons.

- Rewrite _suggested_keeper_by_resolution: previous implementation had
  a dead inner score() function and the lambda was missing the date
  tie-breaker (left as a TODO comment). Now picks largest pixels, ties
  by file size, then by oldest exif_datetime.

- Filter phash candidates to length(phash)=16 to skip malformed hashes
  rather than relying on the silent except in the comparison loop.

- Reject /api/scan/reset while a scan is running. Resetting mid-scan
  wiped tables the running scan thread was still writing to.

- Also clears stale 'redundant' file status (not just 'keeper') when
  a file no longer appears in any group after regrouping.
2026-04-24 00:42:13 -04:00
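The multi-index pigeonhole idea from the phash bullet above can be sketched as follows. A 64-bit hash is 16 hex nibbles; if two hashes differ in at most 10 bits, at most 10 nibbles differ, so at least 6 nibbles are identical at the same position. Bucketing every hash under each `(position, nibble)` pair therefore guarantees that any pair within the threshold shares a bucket. Function names here are hypothetical, not the scanner's own.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    # Bit-level Hamming distance between two hex-encoded hashes.
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicate_pairs(hashes: dict[int, str], threshold: int = 10):
    """Return sorted (id_a, id_b) pairs whose hashes are within threshold bits."""
    buckets = defaultdict(list)
    for file_id, h in hashes.items():
        if len(h) != 16:
            continue  # skip malformed hashes up front
        for pos, nibble in enumerate(h):
            buckets[(pos, nibble)].append(file_id)

    seen = set()   # dedup: each pair is compared at most once
    pairs = []
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            if hamming(hashes[a], hashes[b]) <= threshold:
                pairs.append((a, b))
    return sorted(pairs)
```

The guarantee holds for any threshold below 16 bits; the cost is that each hash lands in 16 buckets, which is why deduplicating pair comparisons matters.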
tocmo
356f922940 feat: replace Cancel with Pause/Resume — survives server restarts
- scanner.py: replace cancel_requested with pause_requested throughout;
  pause during walk drains in-flight futures gracefully then saves state;
  phash phase processes in 500-image chunks with pause check between each;
  _save_pause_state() persists files_indexed/phashes_done/last_phase to DB;
  init_db() already detects killed-mid-scan (running→paused) on startup

- main.py: add POST /api/scan/pause and POST /api/scan/resume endpoints;
  /api/scan/cancel kept as alias; scan_status now returns folder_path,
  files_indexed, phashes_done; scan_reset clears all new fields

- index.html: "Cancel" → "⏸ Pause" button; new #paused-area banner shows
  folder, files indexed, phashes done with "▶ Resume" and "Full reset"
  buttons; updateScanUI handles paused status; pauseScan()/resumeScan()
  JS functions added; chip gains .paused amber style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:11:00 -04:00
tocmo
f37bd76fed Fix GPU setup: check and install nvidia-container-toolkit
dupfinder-setup.sh now verifies nvidia-container-toolkit is present
when a GPU is detected. If missing, prints install instructions and
offers to install it automatically (adds NVIDIA repo, installs toolkit,
configures Docker runtime, restarts Docker).

Without this toolkit Docker silently falls back to CPU even when a
GPU is present and the compose file has the device reservation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:57:51 -04:00
tocmo
a6748de6e0 Pipeline discovery and indexing — workers start immediately
Instead of walk-everything-first then index, workers now receive files
the instant os.walk yields them. The thread pool is open before the
walk starts; each discovered file is submitted immediately. Completed
futures are drained after each directory to keep memory flat.

Progress message shows:
  "Discovering & indexing (8w): 1,234 — 5,678 found so far"
  then once walk finishes:
  "Indexing (8w): 8,000 / 9,100"

UI: merged Discovery + Indexing into a single "Discover + Index" phase pill.
Indeterminate progress bar stays on until total file count is known.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:54:43 -04:00
tocmo
fef364162c Parallel SHA-256 indexing with thread pool
Replace single-threaded indexing loop with ThreadPoolExecutor.
Default workers = min(cpu_count*2, 16), tunable via DUPFINDER_WORKERS
env var. Pre-loads all existing DB records in one query instead of
N per-file queries. Progress message shows worker count and live
done/total count. Skipped files bulk-stamped in batches of 500.

On an 8-core machine over NAS: ~4-8x faster indexing phase.
On NVMe: up to 16x faster with 16 workers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:48:30 -04:00
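A minimal sketch of the thread-pool indexing described here, using only the standard library. The function names are hypothetical and the error handling (skipping unreadable files) is an assumption; only the worker-count formula and the `DUPFINDER_WORKERS` variable come from the commit message.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def worker_count() -> int:
    # Default from the commit message: min(cpu_count * 2, 16), env-overridable.
    default = min((os.cpu_count() or 1) * 2, 16)
    return max(1, int(os.environ.get("DUPFINDER_WORKERS", default)))


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large videos never load fully into memory.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def hash_all(paths: list[str]) -> dict[str, str]:
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=worker_count()) as pool:
        futures = {pool.submit(sha256_file, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except OSError:
                continue  # unreadable file: skip it (assumed policy)
    return results
```

Threads suit this workload because file reads block on I/O and `hashlib` releases the GIL while hashing large buffers.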
tocmo
f9164b4fa0 Add Debian package and Gitea APT repository support
debian/control, postinst, prerm, postrm — standard dpkg package lifecycle
debian/files/opt/dupfinder/dupfinder-setup.sh — interactive setup:
  checks Docker, detects NVIDIA GPU, prompts for photos/data paths,
  writes docker-compose.override.yml with GPU reservation if present,
  pulls image from registry (builds from source as fallback)
debian/files/usr/local/bin/dupfinder — CLI wrapper:
  setup / start / stop / restart / status / logs / open / update
debian/files/etc/systemd/system/dupfinder.service — systemd unit,
  guards against starting before setup has run
debian/build-deb.sh — builds .deb and uploads to Gitea package registry;
  prints the exact apt sources.list line on success

Install on any Debian/Ubuntu machine:
  echo "deb [trusted=yes] http://192.168.1.64:3000/api/packages/tocmo0nlord/debian bookworm main" \
    | sudo tee /etc/apt/sources.list.d/dupfinder.list
  sudo apt update && sudo apt install dupfinder
  sudo dupfinder setup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:42:45 -04:00
tocmo
c110a8e4f9 GPU-accelerated phash + fix discovery/takeout hang
GPU:
- Switch Dockerfile base to pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
- Add gpu_hasher.py: batched 2D DCT on GPU via PyTorch matrix multiply,
  256 images/batch, produces imagehash-compatible 64-bit hex hashes,
  auto-falls back to CPU when CUDA unavailable
- Replace per-image phash loop in scanner.py with phasher.hash_files()
- docker-compose.yml: add nvidia GPU device reservation

Hang fix:
- takeout.is_takeout_folder() now caps at 50 directories (was walking
  entire tree — blocked for minutes on 65k+ file libraries)
- Add "Not a Takeout folder" status message so takeout phase is never silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:37:28 -04:00
tocmo
1d46b9945d Add portable flash-drive installer
- build-release.ps1: builds Docker image, saves to tar, bundles
  everything into dist\ ready to copy to a flash drive
- installer/install.ps1: checks WSL2, Docker Desktop, loads image
  (or builds from source as fallback), prompts for photo/data paths,
  writes docker-compose.override.yml, starts container, creates
  desktop shortcut
- installer/uninstall.ps1: stops container, optionally removes image
  and data, removes shortcut and app directory
- installer/dupfinder-start-stop.ps1: start/stop/restart/open helper
  copied to target machine during install; desktop shortcut uses -Action open
  which polls until the app is responsive before launching browser

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:32:32 -04:00
tocmo
b519e065cb Fix discovery phase appearing frozen
Scanner now updates message every 250 files during os.walk so the UI
shows a live count. Progress bar switches to an indeterminate animated
pulse during discovery and takeout phases (no known total yet), then
reverts to a normal percentage bar once indexing begins.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:25:41 -04:00
tocmo
6e7bb241ad Add .gitignore, remove pycache and db from tracking
2026-04-04 23:55:53 -04:00
tocmo
c19825c523 Add server-side folder picker
New GET /api/browse endpoint lists subdirectories at any path.
UI gets a folder icon button next to each path input that opens
a browsable directory tree modal. Escape or Cancel closes it,
clicking a folder navigates into it, Select confirms the choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:55:42 -04:00
25 changed files with 3321 additions and 262 deletions

11
.claude/launch.json Normal file

@@ -0,0 +1,11 @@
{
  "version": "0.0.1",
  "configurations": [
    {
      "name": "dup-finder-api",
      "runtimeExecutable": "uvicorn",
      "runtimeArgs": ["main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"],
      "port": 8000
    }
  ]
}

6
.gitignore vendored Normal file

@@ -0,0 +1,6 @@
__pycache__/
*.pyc
*.pyo
data/
*.db
.env


@@ -1,7 +1,9 @@
FROM python:3.12-slim
# PyTorch + CUDA 12.1 base — matches Ubuntu 22.04 with NVIDIA driver 525+
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
RUN apt-get update && apt-get install -y \
libheif-dev libjpeg-dev libpng-dev libtiff-dev libwebp-dev exiftool \
libheif-dev libjpeg-dev libpng-dev libtiff-dev libwebp-dev \
libgl1 libglib2.0-0 exiftool ffmpeg \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app

170
README.md

@@ -1,56 +1,170 @@
# Duplicate Finder
A self-hosted Docker web app that scans a photo/video library, detects duplicates using four methods, and lets you review them in a gallery UI. **No files are ever moved, renamed, or deleted**: all decisions are recorded in SQLite only.
Self-hosted web app that scans your photo and video library, finds duplicates four different ways, and lets you review them in a browser. **It never moves, renames, or deletes anything**: every decision is recorded in a SQLite database. A separate tool (coming later) will act on those decisions.
## Quick start
> Once installed, open **http://localhost:8765** in any browser to use it.
---
## Pick your install method
| You have… | Use this |
|---|---|
| **Windows 10/11** | [Windows installer](#windows-1011) (one PowerShell command) |
| **Debian / Ubuntu / Proxmox LXC** | [.deb package](#debian--ubuntu--proxmox) (`apt install`) |
| **Anything else with Docker** | [Docker Compose](#manual-docker-compose) (manual) |
All three installs end up running the same Docker container.
---
### Windows 10/11
**What you need:** Docker Desktop (the installer will check for it and offer to download).
1. Download the latest release zip from the Gitea **Releases** page and extract it anywhere.
2. Right-click `installer\install.ps1` and choose **Run with PowerShell** (or open an elevated PowerShell and run it).
3. When prompted, type the path to your photos folder (e.g. `D:\Photos`) and a folder for the database (default is fine).
4. The installer starts the container and puts a **DupFinder** shortcut on your desktop.
**Day-to-day use:** double-click the desktop shortcut, or browse to http://localhost:8765.
**Uninstall:** run `installer\uninstall.ps1` as administrator.
---
### Debian / Ubuntu / Proxmox
**What you need:** Docker Engine. If you don't have it: `curl -fsSL https://get.docker.com | sh`.
```bash
# 1. Edit docker-compose.yml — set your photos volume path
# 2. Build and run
docker compose up -d --build
# 3. Open http://localhost:8765
# 4. Enter folder path in UI and click Scan
# 1. Add the Gitea apt repo
echo "deb [trusted=yes] http://192.168.1.64:3000/api/packages/tocmo0nlord/debian bookworm main" \
| sudo tee /etc/apt/sources.list.d/dupfinder.list
# 2. Install
sudo apt update
sudo apt install dupfinder
# 3. Run first-time setup (asks for photos path + data path)
sudo dupfinder setup
# 4. Start it
sudo dupfinder start
```
## Volume mounts
> The repo says `bookworm` (Debian 12). For Ubuntu/other distros the package still works — the codename in the URL is just how Gitea organizes the registry.
| Container path | Purpose |
> **One-shot install without the apt repo:**
> ```bash
> curl -u tocmo0nlord:<your-token> -O \
> http://192.168.1.64:3000/api/packages/tocmo0nlord/debian/pool/bookworm/main/dupfinder_1.0.0_amd64.deb
> sudo apt install ./dupfinder_1.0.0_amd64.deb
> ```
**Manage the service:**
| Command | What it does |
|---|---|
| `/photos` | Your photo library — mounted **read-only** |
| `/data` | SQLite database persistence |
| `sudo dupfinder start` | Start the container |
| `sudo dupfinder stop` | Stop the container |
| `sudo dupfinder restart` | Restart |
| `sudo dupfinder status` | Show systemd status |
| `sudo dupfinder logs` | Tail the logs |
| `dupfinder open` | Open in your default browser |
Edit `docker-compose.yml` to point these at your NAS paths.
The service auto-starts on boot via systemd (`dupfinder.service`).
## Detection methods
**Uninstall:** `sudo apt remove dupfinder` (your photos and database are left untouched).
| Method | Color | Description |
---
### Manual Docker Compose
For NAS appliances (Synology, Unraid, TrueNAS), Mac, or any host where you'd rather wire it up yourself.
1. Clone the repo:
```bash
git clone http://192.168.1.64:3000/tocmo0nlord/duplicate-finder.git
cd duplicate-finder
```
2. Open `docker-compose.yml` and change the two volume paths under `dup-finder:`:
```yaml
volumes:
- /your/photos/path:/photos:ro # ← your photo library (read-only)
- /your/data/path:/data # ← where the SQLite DB lives
```
3. Build and start:
```bash
docker compose up -d --build
```
4. Open http://localhost:8765.
To stop: `docker compose down`. To update later: `git pull && docker compose up -d --build`.
> **GPU acceleration (optional):** the compose file requests an NVIDIA GPU for faster perceptual hashing. If you don't have one, delete the `deploy.resources.reservations.devices` block — the app falls back to CPU automatically.
---
## Using it
1. Open http://localhost:8765.
2. Click **Browse** and pick the folder you want to scan (it's relative to the container — usually just `/photos`).
3. Pick a scan mode (see below) and click **Scan**.
4. When it finishes, review the duplicate groups. Each group shows the suggested keeper highlighted; click any other photo to pick it instead, or **Keep all** to skip the group.
5. When you're done, click **Download CSV** to export all decisions.
### Scan modes
| Mode | When to use |
|---|---|
| **Incremental** *(default)* | Day-to-day rescans. Re-hashes only changed/new files. Past review decisions are preserved. |
| **New files only** | Fastest option. Indexes only files added since the last scan. |
| **Rebuild groups** | Re-runs duplicate detection on the existing index without re-hashing. |
| **Full reset** | Wipes the entire index and starts from scratch. |
### Detection methods
| Method | UI color | What it catches |
|---|---|---|
| SHA-256 | Blue | Byte-identical files |
| Perceptual hash | Purple | Visually similar photos (hamming ≤ 10) |
| EXIF timestamp + device | Amber | Same camera, same moment |
| File size + dimensions | Gray | Same size and resolution (low confidence) |
| **SHA-256** | Blue | Byte-identical files |
| **Perceptual hash** | Purple | Visually similar photos (hamming ≤ 10) |
| **EXIF timestamp + device** | Amber | Same camera, same moment |
| **File size + dimensions** | Gray | Same size and resolution (low confidence) |
## Scan modes
### Google Takeout
| Mode | Description |
|---|---|
| Incremental | Only re-hashes changed/new files. Prior decisions preserved. |
| New files only | Indexes newly added files. Existing decisions untouched. |
| Rebuild groups | Re-runs detection on existing index. No re-hashing. |
| Full reset | Wipes everything and scans from scratch. |
Point it at a Google Photos Takeout export and it auto-detects the structure, reads the `.json` sidecars, and restores the correct capture timestamps and original filenames. Takeout files get a flag in the UI.
## Google Takeout
---
The scanner automatically detects Google Takeout folder structures and reads `.json` sidecar files to restore correct capture timestamps and original filenames. Takeout files are flagged in the UI.
## Troubleshooting
**The page won't load at http://localhost:8765**
Check the container is up: `docker ps | grep dup-finder`. If not, see the logs: `docker compose logs dup-finder` (or `sudo dupfinder logs` on Debian).
**"Permission denied" reading photos**
The `/photos` mount is read-only by design, but the container still needs read access. Make sure your user (or the docker daemon) can read the folder you mounted.
**Scan is stuck on "phash"**
Perceptual hashing is the slowest phase — large libraries (>50k photos) on CPU can take hours. Add an NVIDIA GPU and the `deploy.resources` block in compose to get a 10-50× speedup.
**I marked the wrong file as keeper**
Open the group again and click **Unreview**, then re-decide.
---
## What "redundant" means
Marking a file redundant **only writes to the database**. Nothing is moved, renamed, or deleted. This tool produces a decision record only. A separate tool handles file actions.
When you mark a file redundant, **only the database is updated**. Nothing on disk changes. This tool produces a decision record. A future companion tool will use that record to actually move or delete files.
---
## Tech stack
- Python 3.12, FastAPI, Uvicorn
- SQLite (stdlib `sqlite3`)
- Pillow, imagehash, pillow-heif
- PyTorch + CUDA for batched perceptual hashing
- Vanilla JS single-page frontend
- Docker / docker-compose

162
app/gpu_hasher.py Normal file

@@ -0,0 +1,162 @@
"""
GPU-accelerated perceptual hashing via PyTorch + CUDA.

Implements the same pHash algorithm as the `imagehash` library (DCT-II,
8×8 low-frequency block, 64-bit hash) so hashes produced here are
directly comparable with any existing imagehash-generated hashes in the DB.

Falls back to CPU if CUDA is not available — no code changes needed.
"""

import logging
import math
from pathlib import Path

import numpy as np
import torch
from PIL import Image, UnidentifiedImageError

try:
    from pillow_heif import register_heif_opener
    register_heif_opener()
except ImportError:
    pass

log = logging.getLogger(__name__)

# Must match imagehash defaults: hash_size=8, highfreq_factor=4
HASH_SIZE = 8
IMG_SIZE = HASH_SIZE * 4  # 32
BATCH_SIZE = 256  # images per GPU batch; lower if VRAM is tight


class GpuPhasher:
    """
    Batched perceptual hasher. Uses CUDA when available, CPU otherwise.

    The DCT is implemented as two matrix multiplications:
        DCT2D(X) = D @ X @ Dᵀ
    where D is the precomputed orthonormal DCT-II matrix of size IMG_SIZE.
    This runs entirely on-GPU for the full batch.
    """

    def __init__(self, batch_size: int = BATCH_SIZE):
        self.batch_size = batch_size
        if torch.cuda.is_available():
            self.device = torch.device("cuda")
            dev_name = torch.cuda.get_device_name(0)
            log.info("GpuPhasher: using CUDA device — %s", dev_name)
        else:
            self.device = torch.device("cpu")
            log.info("GpuPhasher: CUDA not available, using CPU")

        # Precompute orthonormal DCT-II matrix (IMG_SIZE × IMG_SIZE)
        self._dct = self._build_dct_matrix(IMG_SIZE).to(self.device)

    # ── DCT matrix ────────────────────────────────────────────────────────────

    @staticmethod
    def _build_dct_matrix(n: int) -> torch.Tensor:
        """Orthonormal DCT-II matrix of size n×n."""
        k = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # (n, 1)
        i = torch.arange(n, dtype=torch.float32).unsqueeze(0)  # (1, n)
        mat = torch.cos(math.pi * k * (2.0 * i + 1.0) / (2.0 * n))  # (n, n)
        mat[0] *= 1.0 / math.sqrt(n)
        mat[1:] *= math.sqrt(2.0 / n)
        return mat  # (n, n)

    # ── Image loading ─────────────────────────────────────────────────────────

    @staticmethod
    def _load_image(path: str) -> np.ndarray | None:
        """Load image → greyscale float32 numpy array of shape (IMG_SIZE, IMG_SIZE)."""
        try:
            img = (
                Image.open(path)
                .convert("L")
                .resize((IMG_SIZE, IMG_SIZE), Image.Resampling.LANCZOS)
            )
            return np.asarray(img, dtype=np.float32)
        except (UnidentifiedImageError, OSError, Exception):
            return None

    # ── Core GPU batch ────────────────────────────────────────────────────────

    def _phash_batch(self, arrays: list[np.ndarray]) -> list[str]:
        """
        Compute pHash for a list of (IMG_SIZE, IMG_SIZE) float32 numpy arrays.
        Returns a list of 16-char hex strings (64-bit hashes).
        """
        # Stack into GPU tensor (B, H, W)
        batch = torch.from_numpy(np.stack(arrays)).to(self.device)  # (B, 32, 32)

        # 2D DCT: D @ X @ Dᵀ
        dct2d = self._dct @ batch @ self._dct.T  # (B, 32, 32)

        # Keep only top-left HASH_SIZE × HASH_SIZE block
        low = dct2d[:, :HASH_SIZE, :HASH_SIZE]  # (B, 8, 8)
        flat = low.reshape(low.shape[0], -1)  # (B, 64)

        # Each bit: is value > row mean?
        means = flat.mean(dim=1, keepdim=True)
        bits = (flat > means).cpu().numpy()  # (B, 64) bool

        # Pack bits → bytes → hex (matches imagehash's __str__ format)
        return [np.packbits(b).tobytes().hex() for b in bits]

    # ── Public API ────────────────────────────────────────────────────────────

    def hash_files(
        self,
        paths: list[str],
        progress_cb=None,
    ) -> dict[str, str]:
        """
        Compute pHash for every path in `paths`.

        Returns {path: hex_hash_string}. Paths that fail to open are omitted.
        progress_cb(n_done: int) is called after each batch.
        """
        results: dict[str, str] = {}
        done = 0

        for i in range(0, len(paths), self.batch_size):
            chunk = paths[i : i + self.batch_size]
            arrays: list[np.ndarray] = []
            valid: list[str] = []

            for p in chunk:
                arr = self._load_image(p)
                if arr is not None:
                    arrays.append(arr)
                    valid.append(p)

            if arrays:
                try:
                    hashes = self._phash_batch(arrays)
                    results.update(zip(valid, hashes))
                except Exception as exc:
                    log.warning("GPU batch failed (%s); skipping batch", exc)

            done += len(chunk)
            if progress_cb:
                progress_cb(done)

        return results

    @property
    def using_gpu(self) -> bool:
        return self.device.type == "cuda"


# ── Module-level singleton (created once, reused across scan phases) ──────────
_phasher: GpuPhasher | None = None


def get_phasher() -> GpuPhasher:
    global _phasher
    if _phasher is None:
        _phasher = GpuPhasher()
    return _phasher
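
The class docstring above relies on D being orthonormal, so that DCT2D(X) = D @ X @ Dᵀ is a proper 2D DCT-II. A quick way to sanity-check that claim without a GPU is to rebuild the same matrix in plain Python and confirm that D·Dᵀ is close to the identity. The helpers below are a standalone illustration with hypothetical names, not part of `gpu_hasher.py`.

```python
import math


def build_dct_matrix(n: int) -> list[list[float]]:
    # Same formula as _build_dct_matrix: row 0 scaled by sqrt(1/n),
    # every other row by sqrt(2/n).
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n)]
        )
    return rows


def max_orthonormality_error(d: list[list[float]]) -> float:
    # Largest absolute deviation of D @ D^T from the identity matrix.
    n = len(d)
    worst = 0.0
    for a in range(n):
        for b in range(n):
            dot = sum(d[a][i] * d[b][i] for i in range(n))
            target = 1.0 if a == b else 0.0
            worst = max(worst, abs(dot - target))
    return worst
```

Calling `max_orthonormality_error(build_dct_matrix(32))` should return a value near floating-point rounding error if the scaling factors are right.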


@@ -20,11 +20,25 @@ from fastapi.templating import Jinja2Templates
from pydantic import BaseModel
import scanner as sc
import sftp as sftp_mod
app = FastAPI(title="Duplicate Finder")
templates = Jinja2Templates(directory="/app/templates")
app.mount("/static", StaticFiles(directory="/app/static"), name="static")
# Resolve paths relative to this file so it works both in Docker and locally
_BASE = Path(__file__).parent
_TEMPLATES_DIR = (
str(_BASE / "templates") if (_BASE / "templates").exists()
else str(_BASE.parent / "templates") if (_BASE.parent / "templates").exists()
else "/app/templates"
)
_STATIC_DIR = str(_BASE / "static")
_STATIC_DIR = _STATIC_DIR if Path(_STATIC_DIR).exists() else "/app/static"
# Ensure static dir exists
Path(_STATIC_DIR).mkdir(parents=True, exist_ok=True)
templates = Jinja2Templates(directory=_TEMPLATES_DIR)
app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")
METHOD_META = {
"sha256": {"color": "#378ADD", "label": "Exact copy"},
@@ -92,19 +106,27 @@ def scan_start(body: ScanStartBody):
sc.scan_state.update(
scan_id=scan_id,
status="running",
phase="discovery",
phase="takeout",
progress=0,
total=0,
message="Starting...",
cancel_requested=False,
pause_requested=False,
files_indexed=0,
phashes_done=0,
folder_path=body.folder_path,
stats={},
)
thread = threading.Thread(
target=sc.run_scan,
args=(body.folder_path, scan_id, mode),
daemon=True,
)
def _scan_then_thumbs():
try:
sc.run_scan(body.folder_path, scan_id, mode)
finally:
# Kick off thumbnail pre-generation immediately when scan ends.
# Limited to files actually in duplicate groups — that's the gallery
# view and the only place thumbs are looked at.
_start_thumb_thread(only_in_groups=True)
thread = threading.Thread(target=_scan_then_thumbs, daemon=True)
thread.start()
return {"scan_id": scan_id}
@@ -133,28 +155,84 @@ def scan_status():
con.close()
return {
"scan_id": state["scan_id"],
"status": state["status"],
"phase": state["phase"],
"progress": state["progress"],
"total": state["total"],
"message": state["message"],
"stats": stats,
"scan_id": state["scan_id"],
"status": state["status"],
"phase": state["phase"],
"progress": state["progress"],
"total": state["total"],
"message": state["message"],
"folder_path": state.get("folder_path"),
"files_indexed": state.get("files_indexed", 0),
"phashes_done": state.get("phashes_done", 0),
"stats": stats,
}
@app.post("/api/scan/cancel")
def scan_cancel():
@app.post("/api/scan/pause")
def scan_pause():
if sc.scan_state["status"] != "running":
raise HTTPException(400, "No scan is currently running")
sc.scan_state["cancel_requested"] = True
sc.scan_state["pause_requested"] = True
return {"success": True}
# Keep /cancel as an alias so any lingering clients still work
@app.post("/api/scan/cancel")
def scan_cancel():
return scan_pause()
@app.post("/api/scan/resume")
def scan_resume():
if sc.scan_state["status"] != "paused":
raise HTTPException(400, "No paused scan to resume")
folder_path = sc.scan_state.get("folder_path")
if not folder_path:
raise HTTPException(400, "No folder path saved — please start a new scan")
con = get_db()
cur = con.cursor()
cur.execute(
"INSERT INTO scans (folder_path, status) VALUES (?, 'running')",
(folder_path,),
)
scan_id = cur.lastrowid
con.commit()
con.close()
sc.scan_state.update(
scan_id=scan_id,
status="running",
phase="takeout",
progress=0,
total=0,
message="Resuming scan...",
pause_requested=False,
files_indexed=0,
phashes_done=0,
folder_path=folder_path,
stats={},
)
thread = threading.Thread(
target=sc.run_scan,
args=(folder_path, scan_id, "incremental"),
daemon=True,
)
thread.start()
return {"scan_id": scan_id}
@app.delete("/api/scan/reset")
def scan_reset(confirm: str = Query("")):
if confirm != "RESET":
raise HTTPException(400, "Pass ?confirm=RESET to confirm")
if sc.scan_state["status"] == "running":
raise HTTPException(
400, "A scan is currently running — pause it before resetting"
)
con = get_db()
cur = con.cursor()
cur.execute("DELETE FROM duplicate_members")
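Review note: `/api/scan/pause` above only sets `pause_requested`; the scan thread has to notice the flag itself. How `run_scan` does that is not visible in this part of the diff, so the following is only a hypothetical sketch of the cooperative pattern. The names `state`, `index_files` and `on_file` are illustrative and not taken from the codebase.

```python
# Hypothetical stand-in for sc.scan_state, trimmed to the fields used here.
state = {"status": "running", "pause_requested": False, "files_indexed": 0}


def index_files(paths, on_file=None):
    """Process paths one by one, stopping cleanly when a pause is requested."""
    for p in paths:
        # Check the flag between units of work so a pause never interrupts
        # a file halfway through.
        if state["pause_requested"]:
            state["status"] = "paused"
            return
        state["files_indexed"] += 1
        if on_file is not None:
            on_file(p)
    state["status"] = "complete"


def request_pause_after_two(_path):
    # Simulates the pause endpoint firing while the scan is in progress.
    if state["files_indexed"] == 2:
        state["pause_requested"] = True


index_files(["a.jpg", "b.jpg", "c.jpg", "d.jpg"], on_file=request_pause_after_two)
```

Under these assumptions the loop stops after two files and leaves `files_indexed` in place, which is the kind of progress a resume would need to pick up from.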
@@ -165,7 +243,9 @@ def scan_reset(confirm: str = Query("")):
con.close()
sc.scan_state.update(
scan_id=None, status="idle", phase="idle",
progress=0, total=0, message="", stats={},
progress=0, total=0, message="",
pause_requested=False, files_indexed=0,
phashes_done=0, folder_path=None, stats={},
)
return {"success": True}
@@ -337,6 +417,7 @@ def decide(group_id: int, body: DecideBody):
)
status = "keeper" if is_k else "redundant"
cur.execute("UPDATE files SET status=? WHERE id=?", (status, fid))
sc.log_decision(cur, fid, group_id, status, "manual")
cur.execute("UPDATE duplicate_groups SET reviewed=1 WHERE id=?", (group_id,))
con.commit()
@@ -351,6 +432,9 @@ def skip_group(group_id: int):
cur.execute("SELECT id FROM duplicate_groups WHERE id=?", (group_id,))
if not cur.fetchone():
raise HTTPException(404, "Group not found")
cur.execute("SELECT file_id FROM duplicate_members WHERE group_id=?", (group_id,))
for r in cur.fetchall():
sc.log_decision(cur, r["file_id"], group_id, "skip", "manual")
cur.execute("UPDATE duplicate_groups SET reviewed=1 WHERE id=?", (group_id,))
con.commit()
con.close()
@@ -371,6 +455,7 @@ def keep_all(group_id: int):
(group_id, r["file_id"]),
)
cur.execute("UPDATE files SET status='keeper' WHERE id=?", (r["file_id"],))
sc.log_decision(cur, r["file_id"], group_id, "keeper", "keep-all")
cur.execute("UPDATE duplicate_groups SET reviewed=1 WHERE id=?", (group_id,))
con.commit()
con.close()
@@ -391,6 +476,7 @@ def unreview_group(group_id: int):
(group_id, r["file_id"]),
)
cur.execute("UPDATE files SET status='pending' WHERE id=?", (r["file_id"],))
sc.log_decision(cur, r["file_id"], group_id, "unreview", "manual")
cur.execute("UPDATE duplicate_groups SET reviewed=0 WHERE id=?", (group_id,))
con.commit()
con.close()
@@ -410,7 +496,8 @@ def auto_resolve_exact():
for gid in groups:
cur.execute("""
SELECT f.id, f.width, f.height, f.file_size, f.exif_datetime
SELECT f.id, f.path, f.width, f.height, f.file_size,
f.exif_datetime, f.file_mtime
FROM duplicate_members dm
JOIN files f ON f.id = dm.file_id
WHERE dm.group_id = ?
@@ -430,6 +517,11 @@ def auto_resolve_exact():
"UPDATE files SET status=? WHERE id=?",
("keeper" if is_k else "redundant", m["id"]),
)
sc.log_decision(
cur, m["id"], gid,
"keeper" if is_k else "redundant",
"auto-resolve-exact",
)
cur.execute("UPDATE duplicate_groups SET reviewed=1 WHERE id=?", (gid,))
resolved += 1
@@ -449,6 +541,59 @@ VIDEO_PLACEHOLDER_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" h
VIDEO_EXT = {".mp4", ".mov", ".avi", ".mkv", ".m4v", ".3gp", ".wmv", ".mts", ".m2ts"}
THUMB_CACHE_DIR = "/data/thumbs"
THUMB_MAX = 256 # square bounding box; preserves aspect
def _thumb_cache_path(file_id: int) -> str:
"""Sharded cache path so no directory holds more than ~1000 files."""
shard = file_id // 1000
d = os.path.join(THUMB_CACHE_DIR, str(shard))
os.makedirs(d, exist_ok=True)
return os.path.join(d, f"{file_id}.jpg")
def _generate_thumb(src_path: str, dest_path: str, ext: str) -> bool:
"""Generate a 256px JPEG thumbnail at dest_path. Returns True on success."""
try:
if ext in VIDEO_EXT:
# ffmpeg first frame, scaled to fit
result = subprocess.run(
[
"ffmpeg", "-y", "-i", src_path,
"-vframes", "1",
"-vf", f"scale='min({THUMB_MAX},iw)':'-1'",
"-q:v", "5",
dest_path,
],
capture_output=True, timeout=20,
)
return result.returncode == 0 and os.path.getsize(dest_path) > 0
# Image branch — Pillow handles JPEG/PNG/GIF/WebP/TIFF/BMP natively;
# pillow-heif registers HEIC/HEIF as a Pillow-readable format.
from PIL import Image, ImageOps
try:
import pillow_heif # noqa: F401 (registers HEIF opener)
pillow_heif.register_heif_opener()
except Exception:
pass
with Image.open(src_path) as im:
im = ImageOps.exif_transpose(im) # respect EXIF rotation
im.thumbnail((THUMB_MAX, THUMB_MAX))
if im.mode not in ("RGB", "L"):
im = im.convert("RGB")
im.save(dest_path, "JPEG", quality=80, optimize=True)
return True
except Exception:
# Cleanup partial write
try:
if os.path.exists(dest_path):
os.unlink(dest_path)
except Exception:
pass
return False
@app.get("/api/thumb/{file_id}")
def get_thumb(file_id: int):
con = get_db()
@@ -460,32 +605,134 @@ def get_thumb(file_id: int):
if not row:
raise HTTPException(404, "File not found")
path = row["path"]
ext = (row["extension"] or "").lower()
cached = _thumb_cache_path(file_id)
if not os.path.isfile(path):
# Cache hit — serve the local JPEG, never touches the NAS
if os.path.isfile(cached) and os.path.getsize(cached) > 0:
return FileResponse(cached, media_type="image/jpeg")
src = row["path"]
if not os.path.isfile(src):
raise HTTPException(404, "File not on disk")
if ext in VIDEO_EXT:
# Try ffmpeg for first frame
try:
result = subprocess.run(
[
"ffmpeg", "-i", path,
"-vframes", "1", "-f", "image2", "-vcodec", "mjpeg",
"pipe:1",
],
capture_output=True, timeout=10,
)
if result.returncode == 0 and result.stdout:
return Response(content=result.stdout, media_type="image/jpeg")
except Exception:
pass
return Response(content=VIDEO_PLACEHOLDER_SVG, media_type="image/svg+xml")
if _generate_thumb(src, cached, ext):
return FileResponse(cached, media_type="image/jpeg")
# Serve photo directly
# Final fallback: video placeholder for videos, original file for photos
if ext in VIDEO_EXT:
return Response(content=VIDEO_PLACEHOLDER_SVG, media_type="image/svg+xml")
mime = row["mime_type"] or "application/octet-stream"
return FileResponse(path, media_type=mime)
return FileResponse(src, media_type=mime)
@app.delete("/api/thumb/cache")
def clear_thumb_cache():
"""Wipe the thumbnail cache. Safe to call any time — they regenerate on demand."""
import shutil
if os.path.isdir(THUMB_CACHE_DIR):
shutil.rmtree(THUMB_CACHE_DIR, ignore_errors=True)
return {"cleared": True}
# ── Bulk thumbnail pre-generation ────────────────────────────────────────────
thumb_state: dict = {
"status": "idle", # idle | running | done | error
"total": 0,
"done": 0,
"skipped": 0, # already cached
"failed": 0,
"current": "",
"started_at": None,
"completed_at": None,
}
_thumb_thread_lock = threading.Lock()
def _generate_all_thumbs(only_in_groups: bool = False):
"""Generate any missing thumbnail for every file, or only for files in
duplicate groups when only_in_groups is True.
Runs in a background thread. Idempotent — already-cached files are
counted as skipped, not regenerated.
"""
import time
from datetime import datetime
thumb_state.update(
status="running", total=0, done=0, skipped=0, failed=0,
current="", started_at=datetime.utcnow().isoformat() + "Z",
completed_at=None,
)
try:
con = get_db()
cur = con.cursor()
if only_in_groups:
cur.execute("""
SELECT DISTINCT f.id, f.path, f.extension
FROM files f
JOIN duplicate_members dm ON dm.file_id = f.id
""")
else:
cur.execute("SELECT id, path, extension FROM files")
files = cur.fetchall()
con.close()
thumb_state["total"] = len(files)
for r in files:
fid = r["id"]
path = r["path"]
ext = (r["extension"] or "").lower()
cached = _thumb_cache_path(fid)
thumb_state["current"] = path or ""
if os.path.isfile(cached) and os.path.getsize(cached) > 0:
thumb_state["skipped"] += 1
elif not path or not os.path.isfile(path):
thumb_state["failed"] += 1
elif _generate_thumb(path, cached, ext):
thumb_state["done"] += 1
else:
thumb_state["failed"] += 1
# Yield occasionally so the API stays responsive
if (thumb_state["done"] + thumb_state["skipped"] + thumb_state["failed"]) % 50 == 0:
time.sleep(0)
thumb_state["status"] = "done"
thumb_state["completed_at"] = datetime.utcnow().isoformat() + "Z"
thumb_state["current"] = ""
except Exception as e:
thumb_state["status"] = "error"
thumb_state["current"] = f"error: {e}"
def _start_thumb_thread(only_in_groups: bool = False) -> bool:
"""Start the background generator if not already running. Returns True if started."""
with _thumb_thread_lock:
if thumb_state["status"] == "running":
return False
t = threading.Thread(
target=_generate_all_thumbs,
args=(only_in_groups,),
daemon=True,
)
t.start()
return True
@app.post("/api/thumbs/generate")
def generate_thumbs(only_in_groups: bool = Query(False)):
"""Pre-generate thumbnails for every file (or only files in a duplicate group)."""
if not _start_thumb_thread(only_in_groups):
raise HTTPException(409, "Thumbnail generation already in progress")
return {"status": "started"}
@app.get("/api/thumbs/status")
def thumbs_status():
return dict(thumb_state)
@app.get("/api/files/{file_id}")
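Review note: in the hunk above, `_start_thumb_thread` checks `thumb_state["status"]` under the lock, but the status only becomes `"running"` inside `_generate_all_thumbs`, after the thread has started. Two near-simultaneous calls could in principle both pass the check. A minimal sketch of one way to close that window, with hypothetical names (`state`, `start_once`, `worker`) standing in for the real ones:

```python
import threading

state = {"status": "idle"}
_lock = threading.Lock()


def start_once(worker) -> bool:
    """Start worker in a daemon thread unless one is already marked running."""
    with _lock:
        if state["status"] == "running":
            return False
        # Flip the flag while still holding the lock, so a second caller
        # cannot slip in before the worker thread gets scheduled.
        state["status"] = "running"
    threading.Thread(target=worker, daemon=True).start()
    return True


release = threading.Event()


def worker():
    # Block until released so the "running" window is observable.
    release.wait(timeout=5)
    state["status"] = "done"


first = start_once(worker)
second = start_once(worker)  # rejected: status is already "running"
release.set()
```

This is a sketch of the pattern, not a drop-in patch; the real worker would still reset its counters when it begins.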
@@ -502,6 +749,32 @@ def get_file_meta(file_id: int):
# ── Stats ─────────────────────────────────────────────────────────────────────
@app.get("/api/browse")
def browse(path: str = Query("/")):
"""List subdirectories at the given path for the folder picker."""
try:
p = Path(path).resolve()
except Exception:
raise HTTPException(400, "Invalid path")
if not p.exists() or not p.is_dir():
raise HTTPException(404, "Path not found")
dirs = []
try:
for entry in sorted(p.iterdir()):
if entry.is_dir() and not entry.name.startswith("."):
dirs.append(entry.name)
except PermissionError:
pass
parent = str(p.parent) if p != p.parent else None
return {
"current": str(p),
"parent": parent,
"dirs": dirs,
}
@app.get("/api/stats")
def get_stats():
con = get_db()
@@ -577,17 +850,30 @@ def export_csv():
con.close()
output = io.StringIO()
writer = csv.writer(output)
# QUOTE_ALL + explicit lineterminator handles paths/filenames containing
# embedded \r, \n, quotes, or NULs — which the default dialect refuses.
writer = csv.writer(output, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow([
"group_id", "method", "file_id", "path", "filename",
"size", "width", "height", "exif_date", "device",
"is_keeper", "is_redundant", "reviewed",
])
def _clean(v):
# Strip NULs (csv writer rejects them) and normalise embedded line breaks
if isinstance(v, str):
return v.replace("\x00", "").replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
return v
for r in rows:
# path column = directory only; filename has the basename already
full = r["path"] or ""
dir_only = full.rsplit("/", 1)[0] if "/" in full else ""
writer.writerow([
r["group_id"], r["method"], r["file_id"],
r["path"], r["filename"], r["file_size"],
r["width"], r["height"], r["exif_datetime"], r["exif_device"],
_clean(dir_only), _clean(r["filename"]), r["file_size"],
r["width"], r["height"], _clean(r["exif_datetime"]),
_clean(r["exif_device"]),
r["is_keeper"], r["is_redundant"], r["reviewed"],
])
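Review note: the export change above combines three things: `QUOTE_ALL`, an explicit `"\n"` line terminator, and a per-cell clean-up that strips NULs and flattens line breaks. A small standalone sketch of that combination follows; `clean` and `split_dir` are illustrative names that mirror `_clean` and the `rsplit` expression, and the sample path is made up.

```python
import csv
import io


def clean(v):
    """Strip NULs and flatten embedded line breaks in string cells."""
    if isinstance(v, str):
        return (v.replace("\x00", "")
                 .replace("\r\n", " ").replace("\r", " ").replace("\n", " "))
    return v


def split_dir(full: str) -> str:
    """Directory part of a POSIX-style path, or '' when there is no slash."""
    return full.rsplit("/", 1)[0] if "/" in full else ""


out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["path", "filename", "size"])
writer.writerow([
    clean(split_dir('/nas/pho\ntos/a "b".jpg')),  # newline inside a folder name
    clean('a "b"\x00.jpg'),                       # quotes and a NUL in the name
    123,
])
text = out.getvalue()
```

With `QUOTE_ALL` every field is quoted, including the numeric size, and embedded quotes are doubled, so each record stays on a single physical line.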
@@ -597,3 +883,171 @@ def export_csv():
media_type="text/csv",
headers={"Content-Disposition": "attachment; filename=dup-finder-export.csv"},
)
# ── SFTP destinations ────────────────────────────────────────────────────────
class SFTPDestBody(BaseModel):
name: str
host: str
port: int = 22
username: str
auth_method: str # 'password' | 'key'
base_path: str
mirror_structure: bool = True
# Either password (for password auth) or private_key (for key auth).
# Optional on update — omit to leave existing credential untouched.
password: Optional[str] = None
private_key: Optional[str] = None
def _dest_row_to_dict(row) -> dict:
return {
"id": row["id"],
"name": row["name"],
"host": row["host"],
"port": row["port"],
"username": row["username"],
"auth_method": row["auth_method"],
"base_path": row["base_path"],
"mirror_structure": bool(row["mirror_structure"]),
"enabled": bool(row["enabled"]),
"created_at": row["created_at"],
"last_tested_at": row["last_tested_at"],
"last_test_result": row["last_test_result"],
"has_credentials": sftp_mod.has_credentials(row["id"], row["auth_method"]),
}
@app.get("/api/sftp/destinations")
def list_destinations():
con = get_db()
cur = con.cursor()
cur.execute("SELECT * FROM sftp_destinations ORDER BY name")
out = [_dest_row_to_dict(r) for r in cur.fetchall()]
con.close()
return out
@app.post("/api/sftp/destinations", status_code=201)
def create_destination(body: SFTPDestBody):
if body.auth_method not in ("password", "key"):
raise HTTPException(400, "auth_method must be 'password' or 'key'")
if body.auth_method == "password" and not body.password:
raise HTTPException(400, "password required for password auth")
if body.auth_method == "key" and not body.private_key:
raise HTTPException(400, "private_key required for key auth")
con = get_db()
cur = con.cursor()
try:
cur.execute("""
INSERT INTO sftp_destinations
(name, host, port, username, auth_method, base_path, mirror_structure)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", (body.name, body.host, body.port, body.username,
body.auth_method, body.base_path, 1 if body.mirror_structure else 0))
dest_id = cur.lastrowid
con.commit()
except sqlite3.IntegrityError:
con.close()
raise HTTPException(409, f"Destination name already in use: {body.name}")
if body.auth_method == "password":
sftp_mod.write_password(dest_id, body.password)
else:
sftp_mod.write_private_key(dest_id, body.private_key)
cur.execute("SELECT * FROM sftp_destinations WHERE id=?", (dest_id,))
out = _dest_row_to_dict(cur.fetchone())
con.close()
return out
@app.put("/api/sftp/destinations/{dest_id}")
def update_destination(dest_id: int, body: SFTPDestBody):
con = get_db()
cur = con.cursor()
cur.execute("SELECT * FROM sftp_destinations WHERE id=?", (dest_id,))
row = cur.fetchone()
if not row:
con.close()
raise HTTPException(404, "Destination not found")
cur.execute("""
UPDATE sftp_destinations
SET name=?, host=?, port=?, username=?, auth_method=?,
base_path=?, mirror_structure=?
WHERE id=?
""", (body.name, body.host, body.port, body.username,
body.auth_method, body.base_path,
1 if body.mirror_structure else 0, dest_id))
# If auth method changed, drop old creds
if row["auth_method"] != body.auth_method:
sftp_mod.delete_credentials(dest_id)
if body.auth_method == "password" and body.password:
sftp_mod.write_password(dest_id, body.password)
elif body.auth_method == "key" and body.private_key:
sftp_mod.write_private_key(dest_id, body.private_key)
con.commit()
cur.execute("SELECT * FROM sftp_destinations WHERE id=?", (dest_id,))
out = _dest_row_to_dict(cur.fetchone())
con.close()
return out
@app.delete("/api/sftp/destinations/{dest_id}", status_code=204)
def delete_destination(dest_id: int):
con = get_db()
cur = con.cursor()
cur.execute("DELETE FROM sftp_destinations WHERE id=?", (dest_id,))
if cur.rowcount == 0:
con.close()
raise HTTPException(404, "Destination not found")
con.commit()
con.close()
sftp_mod.delete_credentials(dest_id)
return Response(status_code=204)
@app.post("/api/sftp/destinations/{dest_id}/test")
def test_destination(dest_id: int):
con = get_db()
cur = con.cursor()
cur.execute("SELECT * FROM sftp_destinations WHERE id=?", (dest_id,))
row = cur.fetchone()
if not row:
con.close()
raise HTTPException(404, "Destination not found")
dest = _dest_row_to_dict(row)
if not dest["has_credentials"]:
con.close()
raise HTTPException(400, "No credentials stored for this destination")
ok, message, steps = sftp_mod.test_connection_verbose(dest)
cur.execute("""
UPDATE sftp_destinations
SET last_tested_at=CURRENT_TIMESTAMP, last_test_result=?
WHERE id=?
""", ("ok" if ok else message, dest_id))
con.commit()
cur.execute("SELECT * FROM sftp_destinations WHERE id=?", (dest_id,))
out = _dest_row_to_dict(cur.fetchone())
con.close()
return {"ok": ok, "message": message, "steps": steps, "destination": out}
@app.post("/api/sftp/keypair")
def generate_keypair():
"""Generate a fresh ED25519 keypair. Returns the private + public halves;
the caller is expected to paste the private key into a destination's
private_key field on create/update."""
private_pem, public_openssh, fingerprint = sftp_mod.generate_keypair()
return {
"private_key": private_pem,
"public_key": public_openssh,
"fingerprint": fingerprint,
}


@@ -7,6 +7,8 @@ import mimetypes
import os
import sqlite3
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from pathlib import Path
@@ -20,6 +22,7 @@ except ImportError:
pass
from takeout import is_takeout_folder, process_takeout
from gpu_hasher import get_phasher
PHOTO_EXT = {
@@ -35,18 +38,23 @@ VIDEO_EXT = {
SUPPORTED_EXT = PHOTO_EXT | VIDEO_EXT
DB_PATH = "/data/dupfinder.db"
_DATA_DIR = Path("/data") if Path("/data").exists() else Path(__file__).parent.parent / "data"
_DATA_DIR.mkdir(parents=True, exist_ok=True)
DB_PATH = str(_DATA_DIR / "dupfinder.db")
# Shared scan state (updated by background thread, read by status endpoint)
scan_state = {
"scan_id": None,
"status": "idle", # idle | running | complete | error | cancelled
"phase": "idle", # discovery | takeout | indexing | phash | grouping | done
"progress": 0,
"total": 0,
"message": "",
"cancel_requested": False,
"stats": {},
"scan_id": None,
"status": "idle", # idle|running|paused|complete|error
"phase": "idle", # takeout|indexing|phash|grouping|done
"progress": 0,
"total": 0,
"message": "",
"folder_path": None, # persists so resume knows where to continue
"pause_requested": False,
"files_indexed": 0, # cumulative across phases
"phashes_done": 0,
"stats": {},
}
@@ -60,6 +68,22 @@ def get_db() -> sqlite3.Connection:
return con
def log_decision(cur, file_id: int, group_id: int | None, action: str, reason: str):
"""Append a row to the decisions audit log.
Captures the file's sha256 at decision time so a future move/delete tool
can detect when a file has changed since the user reviewed it.
"""
cur.execute("SELECT sha256 FROM files WHERE id=?", (file_id,))
row = cur.fetchone()
sha = row["sha256"] if row else None
cur.execute(
"INSERT INTO decisions (file_id, group_id, action, reason, sha256_at_decision) "
"VALUES (?, ?, ?, ?, ?)",
(file_id, group_id, action, reason, sha),
)
def init_db():
con = get_db()
cur = con.cursor()
@@ -77,6 +101,7 @@ def init_db():
exif_device TEXT,
width INTEGER,
height INTEGER,
file_mtime TEXT,
is_takeout INTEGER DEFAULT 0,
is_edited INTEGER DEFAULT 0,
takeout_json TEXT,
@@ -87,12 +112,15 @@ def init_db():
);
CREATE TABLE IF NOT EXISTS scans (
id INTEGER PRIMARY KEY AUTOINCREMENT,
folder_path TEXT NOT NULL,
started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP,
total_files INTEGER DEFAULT 0,
status TEXT DEFAULT 'running'
id INTEGER PRIMARY KEY AUTOINCREMENT,
folder_path TEXT NOT NULL,
started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP,
total_files INTEGER DEFAULT 0,
files_indexed INTEGER DEFAULT 0,
phashes_done INTEGER DEFAULT 0,
last_phase TEXT DEFAULT 'indexing',
status TEXT DEFAULT 'running'
);
CREATE TABLE IF NOT EXISTS duplicate_groups (
@@ -111,13 +139,90 @@ def init_db():
suggested INTEGER DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_sha256 ON files(sha256);
CREATE INDEX IF NOT EXISTS idx_phash ON files(phash);
CREATE INDEX IF NOT EXISTS idx_exif_dt ON files(exif_datetime, exif_device);
CREATE INDEX IF NOT EXISTS idx_size_dim ON files(file_size, width, height);
CREATE INDEX IF NOT EXISTS idx_status ON files(status);
CREATE TABLE IF NOT EXISTS sftp_destinations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL,
host TEXT NOT NULL,
port INTEGER NOT NULL DEFAULT 22,
username TEXT NOT NULL,
auth_method TEXT NOT NULL, -- 'password' | 'key'
base_path TEXT NOT NULL,
mirror_structure INTEGER NOT NULL DEFAULT 1,
enabled INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_tested_at TIMESTAMP,
last_test_result TEXT
);
CREATE TABLE IF NOT EXISTS decisions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
file_id INTEGER NOT NULL,
group_id INTEGER,
action TEXT NOT NULL,
reason TEXT,
sha256_at_decision TEXT,
decided_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (file_id) REFERENCES files(id) ON DELETE CASCADE,
FOREIGN KEY (group_id) REFERENCES duplicate_groups(id) ON DELETE SET NULL
);
CREATE INDEX IF NOT EXISTS idx_sha256 ON files(sha256);
CREATE INDEX IF NOT EXISTS idx_phash ON files(phash);
CREATE INDEX IF NOT EXISTS idx_exif_dt ON files(exif_datetime, exif_device);
CREATE INDEX IF NOT EXISTS idx_size_dim ON files(file_size, width, height);
CREATE INDEX IF NOT EXISTS idx_status ON files(status);
CREATE INDEX IF NOT EXISTS idx_decisions_file ON decisions(file_id);
CREATE INDEX IF NOT EXISTS idx_decisions_group ON decisions(group_id);
""")
# Migration: add new columns to scans if upgrading from older schema
for col, defn in [
("files_indexed", "INTEGER DEFAULT 0"),
("phashes_done", "INTEGER DEFAULT 0"),
("last_phase", "TEXT DEFAULT 'indexing'"),
]:
try:
cur.execute(f"ALTER TABLE scans ADD COLUMN {col} {defn}")
except Exception:
pass # column already exists
# Migration: file_mtime added in v1.0.3 for keeper-selection scoring
try:
cur.execute("ALTER TABLE files ADD COLUMN file_mtime TEXT")
except Exception:
pass
con.commit()
# ── Detect interrupted scans from previous run ────────────────────────────
# Any scan left as 'running' means the server was killed mid-scan.
# Mark them 'paused' so the UI offers a resume button.
cur.execute("""
UPDATE scans SET status = 'paused'
WHERE status = 'running'
""")
con.commit()
# Restore scan_state if there's a paused scan
cur.execute("""
SELECT id, folder_path, files_indexed, phashes_done, last_phase
FROM scans WHERE status = 'paused'
ORDER BY started_at DESC LIMIT 1
""")
row = cur.fetchone()
if row:
scan_state.update(
scan_id=row["id"],
status="paused",
phase=row["last_phase"] or "indexing",
folder_path=row["folder_path"],
files_indexed=row["files_indexed"] or 0,
phashes_done=row["phashes_done"] or 0,
message=(
f"Paused — {row['files_indexed']:,} files indexed, "
f"{row['phashes_done']:,} phashes done"
),
)
con.close()
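Review note: the migrations above add columns inside `try/except Exception: pass`, which also hides failures unrelated to "column already exists". An alternative, sketched here under the assumption that plain `sqlite3` is used as in `get_db`, is to ask `PRAGMA table_info` first. `add_column_if_missing` is a hypothetical helper, not something in the codebase.

```python
import sqlite3


def add_column_if_missing(con, table: str, column: str, definition: str) -> bool:
    """Add a column only when PRAGMA table_info does not already list it."""
    # table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    con.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    return True


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, folder_path TEXT NOT NULL)")
con.execute("INSERT INTO scans (folder_path) VALUES ('/photos')")

added_first = add_column_if_missing(con, "scans", "files_indexed", "INTEGER DEFAULT 0")
added_again = add_column_if_missing(con, "scans", "files_indexed", "INTEGER DEFAULT 0")
value = con.execute("SELECT files_indexed FROM scans").fetchone()[0]
```

The table and column names are interpolated into SQL, so this only suits trusted, hard-coded identifiers like the ones in the migration list.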
@@ -215,6 +320,7 @@ def extract_file(path: str) -> dict:
"exif_device": None,
"width": None,
"height": None,
"file_mtime": _mtime_str(path),
}
try:
@@ -273,21 +379,149 @@ class UnionFind:
# ── Detection passes ──────────────────────────────────────────────────────────
def _suggested_keeper_by_resolution(members: list[dict]) -> int:
"""Return file_id of highest resolution member; tie-break by size then oldest date."""
def score(m):
w = m["width"] or 0
h = m["height"] or 0
size = m["file_size"] or 0
dt = m["exif_datetime"] or "9999"
return (w * h, size, dt)
# Explicit folder-priority ranking. Lower number = higher priority (preferred
# keeper). Higher number = more likely to be marked redundant. Tokens match
# case-insensitively as substrings of each directory segment; the filename is
# ignored. When a path matches multiple tokens the WORST (highest) number wins,
# so /photos/#recycle/MobileBackup/foo.jpg ranks as #recycle (10), not
# MobileBackup (1).
#
# Override by writing /data/folder_priority.json. The file replaces the default
# list entirely and is read once per process, so restart after editing it:
# {"priorities": {"my_folder": 5, "trash": 10}, "default": 2}
_FOLDER_PRIORITY_DEFAULTS = (
("google photos", 11),
("googlephotos", 11),
("google_photos", 11),
("google-photos", 11),
("takeout", 11),
("google takeout", 11),
("googletakeout", 11),
("google backup", 11),
("googlebackup", 11),
("google_backup", 11),
("#recycle", 10),
("photoprism", 9),
("photoprizm", 8),
("photolibrary", 7),
("albumsbackup", 6),
("organized", 5),
("moved", 4),
("random", 3),
("mobilebackup", 1),
)
_FOLDER_PRIORITY_DEFAULT_BUCKET = 2 # "anything else"
best = max(members, key=lambda m: (
(m["width"] or 0) * (m["height"] or 0),
m["file_size"] or 0,
# older date = better; invert by negating epoch or use str comparison inverted
))
return best["id"]
_folder_priority_cache: tuple[tuple[tuple[str, int], ...], int] | None = None
def _load_folder_priority() -> tuple[tuple[tuple[str, int], ...], int]:
"""Load folder priority list from /data/folder_priority.json if present,
else fall back to defaults. Cached after first call per process."""
global _folder_priority_cache
if _folder_priority_cache is not None:
return _folder_priority_cache
entries: tuple[tuple[str, int], ...] = _FOLDER_PRIORITY_DEFAULTS
default_bucket = _FOLDER_PRIORITY_DEFAULT_BUCKET
try:
import json
path = "/data/folder_priority.json"
if os.path.exists(path):
with open(path) as f:
data = json.load(f)
entries = tuple(
(k.lower(), int(v))
for k, v in (data.get("priorities") or {}).items()
)
default_bucket = int(data.get("default", default_bucket))
except Exception:
pass
_folder_priority_cache = (entries, default_bucket)
return _folder_priority_cache
def _folder_priority(path: str) -> int:
"""Return the worst (highest) priority bucket matching any DIRECTORY segment
of this path, or default. Filename basename is intentionally excluded —
only folder names influence priority."""
entries, default_bucket = _load_folder_priority()
if not path:
return default_bucket
# Split on /, drop empty segments, drop the last (filename basename).
segments = [s.lower() for s in path.split("/") if s]
if len(segments) <= 1:
return default_bucket # no parent folder
dir_segments = segments[:-1]
worst: int | None = None
for seg in dir_segments:
for token, prio in entries:
if token in seg and (worst is None or prio > worst):
worst = prio
return worst if worst is not None else default_bucket
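# Generic copy/backup signal, applied after explicit folder priority and
# resolution as a tiebreaker. A token matches a folder segment when it equals
# the segment, appears as a whitespace-separated word, is joined to other text
# by an underscore, or is a prefix or suffix of the segment.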
_DUP_FOLDER_TOKENS = (
"trash", "trashed", "dup", "dups", "duplicate", "duplicates",
"backup", "backups", "copy", "copies", "old", "archive", "archived",
)
def _path_penalty(path: str) -> int:
"""Higher = worse keeper candidate. Penalises FOLDERS (not filenames) that
look like copies/backups, plus repeated segments and very deep paths."""
if not path:
return 0
segments = [s for s in path.split("/") if s]
if not segments:
return 0
# Folder segments only — exclude filename basename
dir_segments = segments[:-1]
score = 0
for seg in dir_segments:
low = seg.lower()
for tok in _DUP_FOLDER_TOKENS:
if (tok in low.split() or tok == low
or f"_{tok}" in low or f"{tok}_" in low
or low.startswith(tok) or low.endswith(tok)):
score += 100
break
# Repeated folder segments like "Desktop/Desktop/Files" suggest a nested backup
seen: set[str] = set()
for seg in dir_segments:
low = seg.lower()
if low in seen:
score += 30
seen.add(low)
# Slight penalty for very deep paths (originals tend to live shallower)
score += max(0, len(dir_segments) - 6) * 5
return score
def _suggested_keeper_by_resolution(members: list[dict]) -> int:
"""Return file_id of best keeper.
Ranking keys, in order of precedence:
1. Lowest folder priority bucket (explicit list, e.g. #recycle = worst)
2. Highest pixel count (tie → largest file_size)
3. Lowest path penalty (Trashed/, Dups/, Backup/, deep nesting)
4. Earliest mtime (originals are usually older than their copies)
5. Earliest exif_datetime
"""
def res_size(m):
# Negate for descending sort with min()
return (-(m["width"] or 0) * (m["height"] or 0), -(m["file_size"] or 0))
def rank(m):
path = m.get("path") or ""
return (
_folder_priority(path),
res_size(m),
_path_penalty(path),
m.get("file_mtime") or "9999",
m.get("exif_datetime") or "9999-99-99T99:99:99",
)
return min(members, key=rank)["id"]
def _suggested_keeper_oldest(members: list[dict]) -> int:
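Review note: the new keeper ranking above puts the folder bucket ahead of resolution, which changes outcomes compared with the old resolution-first scoring. A simplified sketch makes that ordering visible. It uses a trimmed, illustrative priority list and leaves out `_path_penalty` and the EXIF tiebreak; `PRIORITIES`, `folder_priority` and `rank` are stand-ins, not the real definitions.

```python
PRIORITIES = (("#recycle", 10), ("mobilebackup", 1))  # trimmed, illustrative list
DEFAULT_BUCKET = 2


def folder_priority(path: str) -> int:
    """Worst (highest) bucket matched by any directory segment; basename ignored."""
    dirs = [s.lower() for s in path.split("/") if s][:-1]
    hits = [prio for seg in dirs for token, prio in PRIORITIES if token in seg]
    return max(hits) if hits else DEFAULT_BUCKET


def rank(m: dict) -> tuple:
    # Lower tuple wins under min(): bucket first, then larger pixel count
    # and file size (both negated), then earlier mtime.
    pixels = (m["width"] or 0) * (m["height"] or 0)
    return (folder_priority(m["path"]), -pixels, -(m["file_size"] or 0),
            m.get("file_mtime") or "9999")


members = [
    {"id": 1, "path": "/photos/#recycle/MobileBackup/a.jpg",
     "width": 4000, "height": 3000, "file_size": 5_000_000, "file_mtime": "2020-01-01"},
    {"id": 2, "path": "/photos/MobileBackup/a.jpg",
     "width": 4000, "height": 3000, "file_size": 5_000_000, "file_mtime": "2021-01-01"},
    {"id": 3, "path": "/photos/misc/a.jpg",
     "width": 8000, "height": 6000, "file_size": 9_000_000, "file_mtime": "2019-01-01"},
]
keeper = min(members, key=rank)["id"]
```

Here the MobileBackup copy (bucket 1) is chosen even though the copy under `misc` has four times the pixels, because the bucket is compared first. Reviewers may want to confirm that this is the intended trade-off for near-duplicate (phash) groups, where resolutions can differ.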
@@ -309,7 +543,7 @@ def _run_sha256_pass(con: sqlite3.Connection, scan_id: int):
for row in rows:
sha = row["sha256"]
cur.execute("""
SELECT id, width, height, file_size, exif_datetime
SELECT id, path, width, height, file_size, exif_datetime, file_mtime
FROM files WHERE sha256 = ?
""", (sha,))
members = [dict(r) for r in cur.fetchall()]
@@ -332,9 +566,11 @@ def _run_phash_pass(con: sqlite3.Connection, scan_id: int):
cur = con.cursor()
# Exclude files already in sha256 groups
cur.execute("""
SELECT f.id, f.phash, f.width, f.height, f.file_size, f.exif_datetime
SELECT f.id, f.path, f.phash, f.width, f.height, f.file_size,
f.exif_datetime, f.file_mtime
FROM files f
WHERE f.phash IS NOT NULL
AND length(f.phash) = 16
AND f.extension NOT IN (
'.mp4','.mov','.avi','.mkv','.m4v','.3gp','.wmv','.mts','.m2ts'
)
@@ -349,25 +585,43 @@ def _run_phash_pass(con: sqlite3.Connection, scan_id: int):
if len(rows) < 2:
return
# Bucket by first 2 hex chars to reduce O(n²) comparisons
buckets: dict[str, list[dict]] = {}
THRESHOLD = 10
# Multi-index pigeonhole: split each 64-bit phash into 16 nibble positions.
# If two hashes differ by ≤K bits, at most K nibbles can change, so at least
# 16-K positions are identical. With K = THRESHOLD = 10 that leaves at least
# 6 shared (position, nibble) buckets, so every candidate pair is compared.
# Catches pairs the previous 2-hex-prefix bucketing missed.
buckets: dict[tuple[int, str], list[dict]] = {}
for r in rows:
key = r["phash"][:2]
buckets.setdefault(key, []).append(r)
for i, ch in enumerate(r["phash"]):
buckets.setdefault((i, ch), []).append(r)
uf = UnionFind()
# Ensure all IDs are registered
for r in rows:
uf.find(r["id"])
THRESHOLD = 10
hash_cache: dict[str, "imagehash.ImageHash"] = {}
def _h(s: str):
h = hash_cache.get(s)
if h is None:
h = imagehash.hex_to_hash(s)
hash_cache[s] = h
return h
seen_pairs: set[tuple[int, int]] = set()
for bucket in buckets.values():
if len(bucket) < 2:
continue
for i in range(len(bucket)):
for j in range(i + 1, len(bucket)):
a, b = bucket[i], bucket[j]
pair = (a["id"], b["id"]) if a["id"] < b["id"] else (b["id"], a["id"])
if pair in seen_pairs:
continue
seen_pairs.add(pair)
try:
dist = imagehash.hex_to_hash(a["phash"]) - imagehash.hex_to_hash(b["phash"])
if dist <= THRESHOLD:
if _h(a["phash"]) - _h(b["phash"]) <= THRESHOLD:
uf.union(a["id"], b["id"])
except Exception:
pass
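Review note: the (position, nibble) bucketing in the hunk above can be tried in isolation. The sketch below assumes 16-hex-digit hashes and uses plain integer XOR for the bit distance instead of `imagehash`; `hamming` and `candidate_pairs` are hypothetical helper names.

```python
from itertools import combinations

THRESHOLD = 10  # same bit-distance cutoff as the hunk above


def hamming(a: str, b: str) -> int:
    """Bit distance between two equal-length hex strings."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def candidate_pairs(hashes: dict[int, str]) -> set[tuple[int, int]]:
    """Pairs of ids that share at least one (position, nibble) bucket."""
    buckets: dict[tuple[int, str], list[int]] = {}
    for fid, hx in hashes.items():
        for pos, ch in enumerate(hx):
            buckets.setdefault((pos, ch), []).append(fid)
    pairs: set[tuple[int, int]] = set()
    for ids in buckets.values():
        # sorted() keeps each pair in (low, high) order, so the set dedupes
        # pairs that meet in more than one bucket.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


hashes = {
    1: "ffffffffffffffff",
    2: "0fffffffffffffff",  # 4 bits from id 1, but a different 2-hex prefix
    3: "0000000000000000",  # 64 bits from id 1, shares no nibble with it
}
pairs = candidate_pairs(hashes)
close = {p for p in pairs if hamming(hashes[p[0]], hashes[p[1]]) <= THRESHOLD}
```

Ids 1 and 2 differ only in the first nibble, so prefix bucketing on the first two hex characters would never have compared them, while the per-position buckets do. The cost is more candidate pairs per hash, which is why the hunk adds `seen_pairs` and `hash_cache`.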
@@ -410,7 +664,7 @@ def _run_exif_pass(con: sqlite3.Connection, scan_id: int):
for row in rows:
dt, dev = row["exif_datetime"], row["exif_device"]
cur.execute("""
SELECT id, width, height, file_size, exif_datetime
SELECT id, path, width, height, file_size, exif_datetime, file_mtime
FROM files
WHERE exif_datetime = ? AND exif_device = ?
""", (dt, dev))
@@ -449,12 +703,13 @@ def _run_filesize_pass(con: sqlite3.Connection, scan_id: int):
for row in rows:
fs, w, h = row["file_size"], row["width"], row["height"]
cur.execute("""
SELECT id, width, height, file_size, exif_datetime
SELECT id, path, width, height, file_size, exif_datetime, file_mtime
FROM files
WHERE file_size = ? AND width = ? AND height = ?
""", (fs, w, h))
members = [dict(r) for r in cur.fetchall()]
keeper_id = _suggested_keeper_oldest(members)
# Filesize+dim is the weakest signal — folder/mtime tiebreak helps a lot here
keeper_id = _suggested_keeper_by_resolution(members)
method_value = f"{fs}::{w}x{h}"
cur.execute(
"INSERT INTO duplicate_groups (method, method_value) VALUES ('filesize', ?)",
@@ -468,38 +723,31 @@ def _run_filesize_pass(con: sqlite3.Connection, scan_id: int):
)
# ── Pause helpers ────────────────────────────────────────────────────────────
def _save_pause_state(cur, scan_id: int, phase: str,
files_indexed: int, phashes_done: int):
"""Persist pause progress so the scan survives a server restart."""
cur.execute("""
UPDATE scans SET
status = 'paused',
last_phase = ?,
files_indexed = ?,
phashes_done = ?
WHERE id = ?
""", (phase, files_indexed, phashes_done, scan_id))
# ── Main scan entry point ─────────────────────────────────────────────────────
def run_scan(folder_path: str, scan_id: int, mode: str = "incremental"):
"""Main scan function — runs in background thread."""
global scan_state
scan_state["folder_path"] = folder_path # persist so resume knows where to continue
con = get_db()
cur = con.cursor()
try:
# ── Phase: discovery ──────────────────────────────────────────────
scan_state.update(phase="discovery", progress=0, total=0,
message="Discovering files...")
all_files = []
for root, dirs, files in os.walk(folder_path):
dirs[:] = [d for d in dirs if not d.startswith(".")]
for fname in files:
if fname.endswith(".json"):
continue
ext = Path(fname).suffix.lower()
if ext in SUPPORTED_EXT:
all_files.append(os.path.join(root, fname))
scan_state["total"] = len(all_files)
scan_state["message"] = f"Found {len(all_files):,} files"
if scan_state["cancel_requested"]:
_mark_scan(cur, scan_id, "cancelled")
con.commit()
scan_state["status"] = "cancelled"
return
# ── Mode: full reset ──────────────────────────────────────────────
if mode == "full_reset":
cur.execute("DELETE FROM duplicate_members")
@@ -507,100 +755,231 @@ def run_scan(folder_path: str, scan_id: int, mode: str = "incremental"):
cur.execute("DELETE FROM files")
con.commit()
# ── Phase: takeout pre-processing ─────────────────────────────────
scan_state.update(phase="takeout", message="Checking for Google Takeout structure...")
if is_takeout_folder(folder_path):
scan_state["message"] = "Processing Google Takeout sidecars..."
process_takeout(folder_path, DB_PATH)
# ── Phase: takeout detection (sidecar processing deferred until after
# indexing — sidecars enrich existing DB rows, so files must be there). ─
scan_state.update(phase="takeout",
message="Checking for Google Takeout structure...")
is_takeout = is_takeout_folder(folder_path)
scan_state["message"] = (
"Takeout detected — sidecars will be processed after indexing"
if is_takeout else "Not a Takeout folder — skipping"
)
if scan_state["cancel_requested"]:
_mark_scan(cur, scan_id, "cancelled")
if scan_state["pause_requested"]:
_save_pause_state(cur, scan_id, "takeout", 0, 0)
con.commit()
scan_state["status"] = "cancelled"
scan_state.update(
status="paused", pause_requested=False,
message="Paused during Takeout check",
)
return
# ── Phase: indexing ───────────────────────────────────────────────
scan_state.update(phase="indexing", progress=0,
message="Indexing files (SHA-256 + EXIF + dimensions)...")
# ── Phases: discovery + indexing (pipelined) ──────────────────────
# Workers start hashing files the instant they are discovered —
# no waiting for the full directory walk to finish first.
#
# Workers: 2× CPU count, capped at 16. Tune via DUPFINDER_WORKERS.
N_WORKERS = int(os.environ.get(
"DUPFINDER_WORKERS",
min(max((os.cpu_count() or 4) * 2, 4), 16)
))
scan_state.update(
phase="indexing", progress=0, total=0,
message=f"Scanning — discovering & indexing in parallel ({N_WORKERS} workers)..."
)
for i, path in enumerate(all_files):
if scan_state["cancel_requested"]:
_mark_scan(cur, scan_id, "cancelled")
con.commit()
scan_state["status"] = "cancelled"
return
# Pre-load existing DB records once (avoids per-file queries)
cur.execute("SELECT path, id, file_size FROM files")
existing_db: dict[str, dict] = {
row["path"]: {"id": row["id"], "file_size": row["file_size"]}
for row in cur.fetchall()
}
scan_state["progress"] = i + 1
scan_state["message"] = f"Indexing: {Path(path).name}"
# Check existing record
cur.execute("SELECT id, file_size, updated_at FROM files WHERE path = ?", (path,))
existing = cur.fetchone()
# Shared counters (updated from multiple threads)
_lock = threading.Lock()
_discovered = [0] # total files found by walker so far
_done = [0] # files fully indexed (skipped + processed)
_walk_done = [False]
_pause_at_end = False # set True when pause requested mid-walk
all_files: list[str] = []
to_skip: list[str] = []
changed_ids: list[int] = []
def _index_file(path: str) -> dict | None:
try:
current_size = os.path.getsize(path)
except OSError:
continue
return extract_file(path)
except Exception:
return None
if existing and mode in ("incremental", "new_files"):
if mode == "new_files":
# Skip entirely — don't re-hash existing files
cur.execute("UPDATE files SET scan_id = ? WHERE path = ?", (scan_id, path))
continue
# Incremental: skip if size unchanged (use size as proxy for change)
if existing["file_size"] == current_size:
cur.execute("UPDATE files SET scan_id = ? WHERE path = ?", (scan_id, path))
continue
# File changed — re-hash, clear group memberships
def _write_result(path: str, record: dict | None, existing: dict | None):
"""Write one file result to DB. Called on main thread only."""
if record is None:
cur.execute(
"DELETE FROM duplicate_members WHERE file_id = ?", (existing["id"],)
)
try:
record = extract_file(path)
except Exception as e:
cur.execute(
"INSERT OR IGNORE INTO files (path, filename, extension, scan_id, status) "
"INSERT OR IGNORE INTO files "
" (path, filename, extension, scan_id, status) "
"VALUES (?, ?, ?, ?, 'error')",
(path, Path(path).name, Path(path).suffix.lower(), scan_id),
)
cur.execute(
"UPDATE files SET status='error', scan_id=?, updated_at=CURRENT_TIMESTAMP "
"WHERE path=?",
"UPDATE files SET status='error', scan_id=?, "
" updated_at=CURRENT_TIMESTAMP WHERE path=?",
(scan_id, path),
)
con.commit()
continue
record["scan_id"] = scan_id
if existing:
cur.execute("""
UPDATE files SET
filename=:filename, extension=:extension, file_size=:file_size,
mime_type=:mime_type, sha256=:sha256,
exif_datetime=:exif_datetime, exif_device=:exif_device,
width=:width, height=:height, scan_id=:scan_id,
status='pending', updated_at=CURRENT_TIMESTAMP
WHERE path=:path
""", record)
else:
cur.execute("""
INSERT OR IGNORE INTO files
(path, filename, extension, file_size, mime_type, sha256,
exif_datetime, exif_device, width, height, scan_id, status)
VALUES
(:path, :filename, :extension, :file_size, :mime_type, :sha256,
:exif_datetime, :exif_device, :width, :height, :scan_id, 'pending')
""", record)
record["scan_id"] = scan_id
if existing:
cur.execute("""
UPDATE files SET
filename=:filename, extension=:extension,
file_size=:file_size, mime_type=:mime_type,
sha256=:sha256, exif_datetime=:exif_datetime,
exif_device=:exif_device, width=:width,
height=:height, file_mtime=:file_mtime,
scan_id=:scan_id,
status='pending', updated_at=CURRENT_TIMESTAMP
WHERE path=:path
""", record)
else:
cur.execute("""
INSERT OR IGNORE INTO files
(path, filename, extension, file_size, mime_type,
sha256, exif_datetime, exif_device, width,
height, file_mtime, scan_id, status)
VALUES
(:path, :filename, :extension, :file_size,
:mime_type, :sha256, :exif_datetime,
:exif_device, :width, :height, :file_mtime,
:scan_id, 'pending')
""", record)
if (i + 1) % 100 == 0:
con.commit()
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
pending: dict = {} # future → (path, existing)
def _drain(limit: int = 50):
"""Collect up to `limit` completed futures and write to DB."""
done_futures = [f for f in list(pending) if f.done()][:limit]
for f in done_futures:
path, existing = pending.pop(f)
_write_result(path, f.result(), existing)
with _lock:
_done[0] += 1
d = _done[0]
disc = _discovered[0]
walking = not _walk_done[0]
scan_state["progress"] = d
scan_state["total"] = disc
scan_state["message"] = (
f"{'Discovering & i' if walking else 'I'}ndexing "
f"({N_WORKERS}w): {d:,}"
+ (f" / {disc:,}" if not walking else f" ({disc:,} found so far)")
)
if done_futures and _done[0] % 200 == 0:
con.commit()
# ── Walk + submit ─────────────────────────────────────────────
for root, dirs, files in os.walk(folder_path):
dirs[:] = [d for d in dirs if not d.startswith(".")]
if scan_state["pause_requested"]:
_pause_at_end = True
break # stop walking; in-flight futures drain normally
for fname in files:
if fname.endswith(".json"):
continue
ext = Path(fname).suffix.lower()
if ext not in SUPPORTED_EXT:
continue
path = os.path.join(root, fname)
all_files.append(path)
with _lock:
_discovered[0] += 1
existing = existing_db.get(path)
try:
current_size = os.path.getsize(path)
except OSError:
continue
# Skip unchanged files
if existing and mode in ("incremental", "new_files"):
if mode == "new_files" or existing["file_size"] == current_size:
to_skip.append(path)
with _lock:
_done[0] += 1
continue
changed_ids.append(existing["id"])
# Submit to thread pool immediately
future = pool.submit(_index_file, path)
pending[future] = (path, existing)
# Drain completed results regularly to avoid memory buildup
if len(pending) >= N_WORKERS * 4:
_drain(N_WORKERS * 2)
# Drain after each directory
_drain(20)
_walk_done[0] = True
# ── Bulk-stamp skipped files ──────────────────────────────────
for chunk_start in range(0, len(to_skip), 500):
chunk = to_skip[chunk_start : chunk_start + 500]
cur.executemany(
"UPDATE files SET scan_id = ? WHERE path = ?",
[(scan_id, p) for p in chunk],
)
for fid in changed_ids:
cur.execute(
"DELETE FROM duplicate_members WHERE file_id = ?", (fid,)
)
con.commit()
# ── Wait for remaining futures ────────────────────────────────
scan_state["total"] = len(all_files)
for future in as_completed(pending):
path, existing = pending[future]
_write_result(path, future.result(), existing)
with _lock:
_done[0] += 1
d = _done[0]
scan_state["progress"] = d
scan_state["message"] = (
f"Indexing ({N_WORKERS}w): {d:,} / {len(all_files):,}"
)
if d % 200 == 0:
con.commit()
con.commit()
# ── Pause checkpoint: after indexing ──────────────────────────────
scan_state["files_indexed"] = _done[0]
if _pause_at_end:
_save_pause_state(cur, scan_id, "indexing", _done[0], 0)
con.commit()
scan_state.update(
status="paused", pause_requested=False,
message=f"Paused — {_done[0]:,} files indexed",
)
return
# ── Takeout sidecar enrichment (now that files exist in DB) ───────
if is_takeout:
scan_state.update(phase="takeout",
message="Processing Google Takeout sidecars...")
try:
enriched = process_takeout(folder_path, DB_PATH)
scan_state["message"] = f"Takeout: enriched {enriched:,} files"
except Exception as exc:
scan_state["message"] = f"Takeout enrichment failed: {exc}"
# ── Phase: phash ──────────────────────────────────────────────────
phasher = get_phasher()
hw_label = "GPU" if phasher.using_gpu else "CPU"
scan_state.update(phase="phash", progress=0,
message="Computing perceptual hashes...")
message=f"Computing perceptual hashes ({hw_label})...")
cur.execute("""
SELECT id, path FROM files
@@ -613,27 +992,77 @@ def run_scan(folder_path: str, scan_id: int, mode: str = "incremental"):
photo_rows = cur.fetchall()
scan_state["total"] = len(photo_rows)
for i, row in enumerate(photo_rows):
if scan_state["cancel_requested"]:
_mark_scan(cur, scan_id, "cancelled")
con.commit()
scan_state["status"] = "cancelled"
return
if photo_rows:
path_to_id = {row["path"]: row["id"] for row in photo_rows}
all_paths = list(path_to_id.keys())
scan_state["progress"] = i + 1
scan_state["message"] = f"Phash: {Path(row['path']).name}"
ph = _phash(row["path"])
if ph:
cur.execute("UPDATE files SET phash=? WHERE id=?", (ph, row["id"]))
if (i + 1) % 200 == 0:
# Process in chunks so pause requests are honoured between batches
PHASH_CHUNK = 500
phashes_written = 0
for chunk_start in range(0, len(all_paths), PHASH_CHUNK):
if scan_state["pause_requested"]:
_save_pause_state(
cur, scan_id, "phash",
scan_state["files_indexed"], phashes_written,
)
con.commit()
scan_state.update(
status="paused", pause_requested=False,
phashes_done=phashes_written,
message=(
f"Paused — {phashes_written:,} / {len(all_paths):,} "
"perceptual hashes computed"
),
)
return
chunk = all_paths[chunk_start : chunk_start + PHASH_CHUNK]
chunk_results = phasher.hash_files(chunk, progress_cb=None)
for path, ph in chunk_results.items():
fid = path_to_id.get(path)
if fid and ph:
cur.execute(
"UPDATE files SET phash=? WHERE id=?", (ph, fid)
)
con.commit()
phashes_written += len(chunk)
scan_state["phashes_done"] = phashes_written
scan_state["progress"] = phashes_written
scan_state["message"] = (
f"Phash ({hw_label}): {phashes_written:,} / {len(all_paths):,}"
)
con.commit()
# ── Phase: grouping ───────────────────────────────────────────────
scan_state.update(phase="grouping", progress=0, total=4,
message="Running duplicate detection...")
# Snapshot reviewed groups so we can re-apply decisions to any
# post-regrouping group whose member-set is unchanged.
prior_reviewed: dict[tuple[str, frozenset], int | None] = {}
if mode in ("incremental", "regroup"):
cur.execute("""
SELECT dg.id, dg.method, dm.file_id, dm.is_keeper
FROM duplicate_groups dg
JOIN duplicate_members dm ON dm.group_id = dg.id
WHERE dg.reviewed = 1
""")
snap: dict[int, dict] = {}
for r in cur.fetchall():
g = snap.setdefault(
r["id"],
{"method": r["method"], "members": set(), "keeper": None},
)
g["members"].add(r["file_id"])
if r["is_keeper"]:
g["keeper"] = r["file_id"]
for g in snap.values():
prior_reviewed[(g["method"], frozenset(g["members"]))] = g["keeper"]
if mode in ("incremental", "full_reset", "regroup"):
cur.execute("DELETE FROM duplicate_members")
cur.execute("DELETE FROM duplicate_groups")
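Note on the snapshot above: keying each reviewed group by `(method, frozenset(members))` means a decision only survives regrouping when both the detection method and the exact member set are unchanged. A minimal sketch of that matching follows. The tuple and dict shapes are hypothetical simplifications of the SQLite rows used in the diff.

```python
def snapshot_reviewed(rows):
    """rows: (group_id, method, file_id, is_keeper) tuples for reviewed groups.

    Returns {(method, frozenset(member_ids)): keeper_id_or_None}.
    """
    snap = {}
    for gid, method, file_id, is_keeper in rows:
        g = snap.setdefault(
            gid, {"method": method, "members": set(), "keeper": None}
        )
        g["members"].add(file_id)
        if is_keeper:
            g["keeper"] = file_id
    return {
        (g["method"], frozenset(g["members"])): g["keeper"]
        for g in snap.values()
    }


def restorable(prior, new_groups):
    """new_groups: {group_id: (method, member_ids)}.

    Returns {new_group_id: keeper_id} for groups whose method and member set
    both match a previously reviewed group.
    """
    out = {}
    for gid, (method, members) in new_groups.items():
        key = (method, frozenset(members))
        if key in prior:
            out[gid] = prior[key]
    return out
```

A group that gains or loses a single member produces a different `frozenset` and is therefore left for fresh review, which is the conservative choice.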
@@ -669,15 +1098,56 @@ def run_scan(folder_path: str, scan_id: int, mode: str = "incremental"):
scan_state["progress"] = 4
con.commit()
# ── Restore keeper statuses for mode=incremental ──────────────────
# ── Re-apply prior review decisions where membership unchanged ────
if prior_reviewed:
cur.execute("""
SELECT dg.id, dg.method, dm.file_id
FROM duplicate_groups dg
JOIN duplicate_members dm ON dm.group_id = dg.id
""")
new_groups: dict[int, dict] = {}
for r in cur.fetchall():
g = new_groups.setdefault(
r["id"], {"method": r["method"], "members": set()}
)
g["members"].add(r["file_id"])
restored = 0
for gid, g in new_groups.items():
key = (g["method"], frozenset(g["members"]))
if key not in prior_reviewed:
continue
keeper = prior_reviewed[key]
cur.execute(
"UPDATE duplicate_groups SET reviewed=1 WHERE id=?", (gid,)
)
for fid in g["members"]:
is_k = 1 if fid == keeper else 0
cur.execute(
"UPDATE duplicate_members "
"SET is_keeper=?, suggested=? "
"WHERE group_id=? AND file_id=?",
(is_k, is_k, gid, fid),
)
cur.execute(
"UPDATE files SET status=? WHERE id=?",
("keeper" if is_k else "redundant", fid),
)
log_decision(
cur, fid, gid,
"keeper" if is_k else "redundant",
"rescan-restore",
)
restored += 1
con.commit()
scan_state["message"] = f"Restored {restored:,} prior review decisions"
# Reset orphaned keeper status for files no longer in any group
if mode == "incremental":
# If a previously marked keeper no longer appears in any group, reset to pending
cur.execute("""
UPDATE files SET status='pending'
WHERE status='keeper'
AND id NOT IN (
SELECT file_id FROM duplicate_members WHERE is_keeper=1
)
WHERE status IN ('keeper', 'redundant')
AND id NOT IN (SELECT file_id FROM duplicate_members)
""")
con.commit()
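Note on the pipelined discovery + indexing phase of `run_scan` in this file: the pattern is "submit while walking, collect on the calling thread". The sketch below shows that shape in isolation. `index_tree`, `work`, and the extension tuple are hypothetical names, and a plain dict stands in for the database writes.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def index_tree(root, work, n_workers=4, exts=(".jpg", ".png")):
    """Walk `root`, submitting each matching file to `work` as it is found.

    Results are collected on the calling thread only, so a single-threaded
    sink (such as a SQLite cursor) can consume them safely.
    """
    results = {}
    pending = {}  # future -> path

    def drain(limit):
        # Non-blocking: only futures that have already finished are taken.
        done = [f for f in list(pending) if f.done()][:limit]
        for f in done:
            results[pending.pop(f)] = f.result()

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for dirpath, dirs, files in os.walk(root):
            dirs[:] = [d for d in dirs if not d.startswith(".")]  # skip hidden dirs
            for name in files:
                if os.path.splitext(name)[1].lower() not in exts:
                    continue
                path = os.path.join(dirpath, name)
                pending[pool.submit(work, path)] = path
                if len(pending) >= n_workers * 4:
                    drain(n_workers * 2)
        # Walk finished: block until every remaining future completes.
        for f in as_completed(list(pending)):
            results[pending.pop(f)] = f.result()
    return results
```

Because `drain` never blocks, the `n_workers * 4` threshold is a soft cap: if the walker outpaces the workers, `pending` can still grow beyond it.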

257
app/sftp.py Normal file

@@ -0,0 +1,257 @@
"""
SFTP destination management — connection helpers and credential storage.
Credentials live at /data/sftp/{id}.password (mode 600) or /data/sftp/{id}.key
(also mode 600). Public host keys are pinned at /data/sftp/{id}.host_keys after
the first successful connection (TOFU); subsequent connections fail loudly if
the host key changes.
"""
import io
import os
import stat
import errno
from contextlib import contextmanager
from typing import Optional
import paramiko
CRED_DIR = "/data/sftp"
# ── Credential storage ───────────────────────────────────────────────────────
def _ensure_cred_dir() -> None:
os.makedirs(CRED_DIR, mode=0o700, exist_ok=True)
def _password_path(dest_id: int) -> str:
return os.path.join(CRED_DIR, f"{dest_id}.password")
def _key_path(dest_id: int) -> str:
return os.path.join(CRED_DIR, f"{dest_id}.key")
def _host_keys_path(dest_id: int) -> str:
return os.path.join(CRED_DIR, f"{dest_id}.host_keys")
def write_password(dest_id: int, password: str) -> None:
_ensure_cred_dir()
p = _password_path(dest_id)
with open(p, "w") as f:
f.write(password)
os.chmod(p, 0o600)
def write_private_key(dest_id: int, key_text: str) -> None:
_ensure_cred_dir()
p = _key_path(dest_id)
with open(p, "w") as f:
f.write(key_text if key_text.endswith("\n") else key_text + "\n")
os.chmod(p, 0o600)
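Note on the two writers above: `open(p, "w")` followed by `os.chmod` leaves a short window in which the file exists with the process's default permissions. One possible alternative, sketched below under the assumption of a POSIX host, creates the file with mode 0600 from the start. `write_secret` is a hypothetical helper, not part of this module.

```python
import os


def write_secret(path: str, text: str) -> None:
    """Create or truncate `path` so it is never readable by group/other."""
    # The mode argument applies only when the file is newly created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        # Tighten a pre-existing file before any secret bytes are written.
        os.fchmod(f.fileno(), 0o600)
        f.write(text)
```

The same helper would cover both the password and the private key, with the trailing-newline handling for keys applied by the caller.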
def delete_credentials(dest_id: int) -> None:
"""Best-effort cleanup of all stored secrets for a destination."""
for p in (_password_path(dest_id), _key_path(dest_id), _host_keys_path(dest_id)):
try:
if os.path.exists(p):
os.unlink(p)
except Exception:
pass
def has_credentials(dest_id: int, auth_method: str) -> bool:
if auth_method == "password":
return os.path.isfile(_password_path(dest_id))
if auth_method == "key":
return os.path.isfile(_key_path(dest_id))
return False
# ── Keypair generation ──────────────────────────────────────────────────────
def generate_keypair() -> tuple[str, str, str]:
"""Generate an ED25519 keypair. Returns (private_pem, public_openssh, fingerprint)."""
key = paramiko.Ed25519Key.generate()
priv_buf = io.StringIO()
key.write_private_key(priv_buf)
private_pem = priv_buf.getvalue()
public_openssh = f"{key.get_name()} {key.get_base64()} dupfinder@miaai"
fingerprint = key.fingerprint # SHA-256:base64
return private_pem, public_openssh, fingerprint
# ── Connection ──────────────────────────────────────────────────────────────
def _open_transport(dest: dict, timeout: int = 15) -> paramiko.Transport:
"""Open and authenticate a Transport directly.
Bypasses SSHClient. Mirrors how OpenSSH/WinSCP invoke the SFTP subsystem
without first allocating an exec channel — works around a "Channel closed"
issue Synology DSM throws at SSHClient.open_sftp() but not at direct
SFTPClient.from_transport().
"""
import socket
sock = socket.create_connection(
(dest["host"], int(dest.get("port") or 22)),
timeout=timeout,
)
transport = paramiko.Transport(sock)
# Generous flow-control windows — Synology sometimes closes mid-handshake
# if the client's window is small.
transport.default_window_size = 2 ** 27 # 128 MB
transport.default_max_packet_size = 2 ** 19 # 512 KB
transport.banner_timeout = timeout
transport.start_client(timeout=timeout)
# Host-key pin (TOFU) — mirror SSHClient behaviour against our pinned file.
hk_path = _host_keys_path(dest["id"])
server_key = transport.get_remote_server_key()
if os.path.isfile(hk_path):
host_keys = paramiko.HostKeys()
host_keys.load(hk_path)
if not host_keys.check(dest["host"], server_key):
transport.close()
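# Report the pinned key as the expected one, not the key the server just presented.
pinned = (host_keys.lookup(dest["host"]) or {}).get(server_key.get_name())
raise paramiko.BadHostKeyException(dest["host"], server_key, pinned or server_key)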
else:
_ensure_cred_dir()
host_keys = paramiko.HostKeys()
host_keys.add(dest["host"], server_key.get_name(), server_key)
host_keys.save(hk_path)
if dest["auth_method"] == "password":
with open(_password_path(dest["id"])) as f:
transport.auth_password(dest["username"], f.read())
elif dest["auth_method"] == "key":
try:
pkey = paramiko.Ed25519Key.from_private_key_file(_key_path(dest["id"]))
except paramiko.SSHException:
pkey = paramiko.RSAKey.from_private_key_file(_key_path(dest["id"]))
transport.auth_publickey(dest["username"], pkey)
else:
transport.close()
raise ValueError(f"Unknown auth_method: {dest['auth_method']}")
return transport
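Note on the host-key pin in `_open_transport`: trust-on-first-use reduces to three outcomes: pin on first contact, accept on match, reject on change. The sketch below shows that decision with the standard library only. It uses a JSON pin file and SHA-256 digests instead of `paramiko.HostKeys`, and every name in it is hypothetical.

```python
import hashlib
import json
import os


class HostKeyMismatch(Exception):
    """Raised when a host presents a key different from the pinned one."""


def fingerprint(key_blob: bytes) -> str:
    """Hex SHA-256 digest of the raw public key bytes."""
    return hashlib.sha256(key_blob).hexdigest()


def check_or_pin(pin_file: str, host: str, key_blob: bytes) -> str:
    """Return 'pinned' on first contact, 'match' afterwards; raise on change."""
    pins = {}
    if os.path.isfile(pin_file):
        with open(pin_file) as f:
            pins = json.load(f)

    seen = fingerprint(key_blob)
    expected = pins.get(host)
    if expected is None:
        # First contact: trust and remember this key.
        pins[host] = seen
        with open(pin_file, "w") as f:
            json.dump(pins, f)
        return "pinned"
    if expected != seen:
        # Carry both values so the caller can show what changed.
        raise HostKeyMismatch(f"{host}: expected {expected}, got {seen}")
    return "match"
```

Reporting both the expected and the presented fingerprint makes a legitimate key rotation easier to tell apart from an interception.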
@contextmanager
def open_sftp(dest: dict, timeout: int = 15):
"""Open an SFTP session against the given destination dict.
`dest` must contain: id, host, port, username, auth_method.
Yields a paramiko.SFTPClient. Raises on any failure.
"""
transport = _open_transport(dest, timeout=timeout)
try:
sftp = paramiko.SFTPClient.from_transport(transport)
try:
yield sftp
finally:
try:
sftp.close()
except Exception:
pass
finally:
try:
transport.close()
except Exception:
pass
def test_connection(dest: dict) -> tuple[bool, str]:
ok, msg, _steps = test_connection_verbose(dest)
return ok, msg
def test_connection_verbose(dest: dict) -> tuple[bool, str, list[dict]]:
"""Run each handshake step in isolation and report exactly which one died."""
steps: list[dict] = []
transport = None
sftp = None
try:
try:
transport = _open_transport(dest, timeout=15)
steps.append({
"step": "connect+auth", "ok": True,
"detail": f"active={transport.is_active()} remote={transport.remote_version}",
})
except paramiko.AuthenticationException as e:
steps.append({"step": "connect+auth", "ok": False, "detail": f"auth failed: {e}"})
return False, "Authentication failed", steps
except FileNotFoundError:
steps.append({"step": "connect+auth", "ok": False, "detail": "no stored credentials"})
return False, "No stored credentials for this destination", steps
except Exception as e:
steps.append({"step": "connect+auth", "ok": False, "detail": f"{type(e).__name__}: {e}"})
return False, f"Connection failed: {e}", steps
try:
sftp = paramiko.SFTPClient.from_transport(transport)
steps.append({"step": "open_sftp", "ok": True, "detail": "subsystem opened"})
except Exception as e:
steps.append({"step": "open_sftp", "ok": False, "detail": f"{type(e).__name__}: {e}"})
return False, f"SFTP subsystem refused: {e}", steps
try:
entries = sftp.listdir("/")
steps.append({"step": "listdir_/", "ok": True, "detail": f"entries: {entries[:10]}"})
except Exception as e:
steps.append({"step": "listdir_/", "ok": False, "detail": f"{type(e).__name__}: {e}"})
return False, f"listdir / failed: {e}", steps
try:
sftp.stat(dest["base_path"])
steps.append({"step": "stat_base_path", "ok": True, "detail": dest["base_path"]})
except FileNotFoundError:
steps.append({"step": "stat_base_path", "ok": False, "detail": "FileNotFoundError"})
return False, (
f"Base path does not exist (or not visible from this user): "
f"{dest['base_path']}. Synology sometimes chroots SFTP users to "
f"their home — try a path under /volume1/homes/{dest['username']}/ instead."
), steps
except Exception as e:
steps.append({"step": "stat_base_path", "ok": False, "detail": f"{type(e).__name__}: {e}"})
return False, f"stat {dest['base_path']} failed: {e}", steps
probe = f"{dest['base_path'].rstrip('/')}/.dupfinder_probe"
try:
sftp.mkdir(probe)
sftp.rmdir(probe)
steps.append({"step": "write_probe", "ok": True, "detail": probe})
except Exception as e:
steps.append({"step": "write_probe", "ok": False, "detail": f"{type(e).__name__}: {e}"})
return False, f"Connected, but {dest['base_path']} not writable: {e}", steps
return True, "ok", steps
finally:
try:
if sftp:
sftp.close()
except Exception:
pass
try:
if transport:
transport.close()
except Exception:
pass
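Note on `test_connection_verbose`: each stage follows the same shape, namely run one step, record `{"step", "ok", "detail"}`, and stop at the first failure. A generic sketch of that shape is below. `run_steps` is a hypothetical helper, and unlike the real function it does not tailor the message per exception type.

```python
def run_steps(steps):
    """steps: list of (name, callable). Stops at the first failing step.

    Returns (ok, message, report), where report lists every step attempted.
    """
    report = []
    for name, fn in steps:
        try:
            detail = fn()
            report.append({"step": name, "ok": True, "detail": str(detail)})
        except Exception as e:
            report.append(
                {"step": name, "ok": False, "detail": f"{type(e).__name__}: {e}"}
            )
            return False, f"{name} failed: {e}", report
    return True, "ok", report
```

Steps after the failing one are never attempted, so the last entry in the report is always the one that needs attention.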
# ── Path helpers ────────────────────────────────────────────────────────────
def remote_path_for(source_path: str, dest: dict, photos_root: str = "/photos") -> str:
"""Compute the remote destination path for a given source file.
If mirror_structure is true, preserves the path under photos_root.
Otherwise, lands flat in base_path with the source basename.
"""
base = dest["base_path"].rstrip("/")
if dest.get("mirror_structure", 1):
rel = os.path.relpath(source_path, photos_root)
# On Windows os.path.relpath uses backslashes; force forward
rel = rel.replace("\\", "/")
return f"{base}/{rel}"
return f"{base}/{os.path.basename(source_path)}"
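Note on `remote_path_for`: the mapping is easiest to check with concrete inputs. The restatement below uses `posixpath` so that it behaves the same on any platform. It is an illustrative copy, not the module's code, and the paths are made up.

```python
import posixpath


def remote_path_for(source_path, dest, photos_root="/photos"):
    """Illustrative restatement of the helper above using POSIX path rules."""
    base = dest["base_path"].rstrip("/")
    if dest.get("mirror_structure", 1):
        rel = posixpath.relpath(source_path, photos_root)
        return f"{base}/{rel}"
    return f"{base}/{posixpath.basename(source_path)}"


mirrored = remote_path_for("/photos/2021/trip/a.jpg", {"base_path": "/backup/"})
flat = remote_path_for(
    "/photos/2021/trip/a.jpg", {"base_path": "/backup", "mirror_structure": 0}
)
outside = remote_path_for("/tmp/a.jpg", {"base_path": "/backup"})
```

Here `mirrored` is `/backup/2021/trip/a.jpg` and `flat` is `/backup/a.jpg`. A source outside `photos_root` yields `..` segments (`/backup/../tmp/a.jpg`), so callers may want to reject such inputs before uploading.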


@@ -50,14 +50,19 @@ def is_takeout_folder(folder_path: str) -> bool:
adjacent media files. If we find at least 5 such pairs, call it Takeout.
"""
count = 0
dirs_checked = 0
MAX_DIRS = 50 # sample at most 50 directories — fast on any library size
for root, dirs, files in os.walk(folder_path):
# Skip hidden dirs
dirs[:] = [d for d in dirs if not d.startswith(".")]
dirs_checked += 1
if dirs_checked > MAX_DIRS:
break
file_set = set(files)
for f in files:
if not f.endswith(".json"):
continue
# Check if a media file exists that this could be a sidecar for
base = f[:-5] # strip .json
if base in file_set:
count += 1
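Note on the sampling change in this hunk: a sidecar "pair" is a `NAME.json` file whose `NAME` also exists in the same directory, and the walk now stops after a fixed number of directories. The sketch below works on plain filename lists so it can be checked without a filesystem. The function names are hypothetical, and the early return once the threshold is reached is an assumption, since the rest of the function is not shown.

```python
def count_sidecar_pairs(files):
    """Count `X.json` entries whose media file `X` is in the same listing."""
    file_set = set(files)
    return sum(1 for f in files if f.endswith(".json") and f[:-5] in file_set)


def looks_like_takeout(listings, min_pairs=5, max_dirs=50):
    """listings: iterable of per-directory filename lists, in walk order."""
    count = 0
    for n, files in enumerate(listings, start=1):
        if n > max_dirs:
            break  # sample cap reached
        count += count_sidecar_pairs(files)
        if count >= min_pairs:
            return True
    return False
```

A standalone file such as `notes.json` does not count, because stripping `.json` leaves `notes`, which is not a file in the listing.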

153
build-release.ps1 Normal file

@@ -0,0 +1,153 @@
#Requires -Version 5.1
<#
.SYNOPSIS
Builds the DupFinder flash-drive installer bundle.
.DESCRIPTION
1. Builds the Docker image
2. Saves it to dist\image\dupfinder.tar
3. Copies all installer scripts and source into dist\
Run this from the repo root before copying dist\ to a flash drive.
.EXAMPLE
.\build-release.ps1
.\build-release.ps1 -SkipBuild # Skip docker build (reuse existing image)
#>
param(
[switch]$SkipBuild,
[string]$ImageName = "dupfinder",
[string]$ImageTag = "latest"
)
Set-StrictMode -Version Latest
$ErrorActionPreference = "Stop"
$RepoRoot = $PSScriptRoot
$DistDir = Join-Path $RepoRoot "dist"
$ImageFull = "${ImageName}:${ImageTag}"
function Write-Step([string]$msg) {
Write-Host "`n==> $msg" -ForegroundColor Cyan
}
function Write-OK([string]$msg) {
Write-Host " [OK] $msg" -ForegroundColor Green
}
function Write-Fail([string]$msg) {
Write-Host " [!!] $msg" -ForegroundColor Red
}
# ── Check Docker is running ───────────────────────────────────────────────────
Write-Step "Checking Docker..."
docker info 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
Write-Fail "Docker is not running. Start Docker Desktop and try again."
exit 1
}
Write-OK "Docker is running"
# ── Build image ───────────────────────────────────────────────────────────────
if (-not $SkipBuild) {
Write-Step "Building Docker image ($ImageFull)..."
docker build -t $ImageFull --progress=plain $RepoRoot
if ($LASTEXITCODE -ne 0) { Write-Fail "Docker build failed."; exit 1 }
Write-OK "Image built: $ImageFull"
} else {
Write-Step "Skipping build (-SkipBuild). Checking image exists..."
$exists = docker images $ImageFull --format "{{.ID}}" 2>$null
if (-not $exists) {
Write-Fail "Image $ImageFull not found locally. Remove -SkipBuild to build it."
exit 1
}
Write-OK "Image found: $ImageFull"
}
# ── Clean dist\ ──────────────────────────────────────────────────────────────
Write-Step "Preparing dist\ directory..."
if (Test-Path $DistDir) {
Remove-Item $DistDir -Recurse -Force
}
New-Item -ItemType Directory -Path $DistDir | Out-Null
New-Item -ItemType Directory -Path "$DistDir\image" | Out-Null
New-Item -ItemType Directory -Path "$DistDir\source" | Out-Null
New-Item -ItemType Directory -Path "$DistDir\assets" | Out-Null
Write-OK "dist\ ready"
# ── Save Docker image ─────────────────────────────────────────────────────────
Write-Step "Saving Docker image to dist\image\dupfinder.tar (this may take a minute)..."
docker save -o "$DistDir\image\dupfinder.tar" $ImageFull
if ($LASTEXITCODE -ne 0) { Write-Fail "docker save failed."; exit 1 }
$tarSize = [math]::Round((Get-Item "$DistDir\image\dupfinder.tar").Length / 1MB, 1)
Write-OK "Image saved (${tarSize} MB)"
# ── Copy installer scripts ────────────────────────────────────────────────────
Write-Step "Copying installer scripts..."
Copy-Item "$RepoRoot\installer\install.ps1" "$DistDir\install.ps1"
Copy-Item "$RepoRoot\installer\uninstall.ps1" "$DistDir\uninstall.ps1"
Copy-Item "$RepoRoot\installer\dupfinder-start-stop.ps1" "$DistDir\dupfinder-start-stop.ps1"
Copy-Item "$RepoRoot\docker-compose.yml" "$DistDir\docker-compose.yml"
# install.bat launcher (no-click PS1 execution for non-technical users)
@'
@echo off
echo Starting DupFinder installer...
PowerShell -ExecutionPolicy Bypass -File "%~dp0install.ps1"
pause
'@ | Set-Content "$DistDir\INSTALL.bat" -Encoding ASCII
Write-OK "Scripts copied"
# ── Copy source (fallback build) ──────────────────────────────────────────────
Write-Step "Copying source files (offline build fallback)..."
$excludeDirs = @('dist', '__pycache__', 'data', '.git', '.claude', 'installer')
$excludeFiles = @('*.db', '*.db-shm', '*.db-wal', '*.pyc', '*.pyo')
Get-ChildItem $RepoRoot -Recurse | Where-Object {
$item = $_
$skip = $false
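# Match excluded names as whole path segments so e.g. 'data' does not also exclude 'metadata.py'
foreach ($d in $excludeDirs) { if ($item.FullName -match ('[\\/]' + [regex]::Escape($d) + '([\\/]|$)')) { $skip = $true } }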
foreach ($f in $excludeFiles) { if ($item.Name -like $f) { $skip = $true } }
-not $skip
} | ForEach-Object {
$rel = $_.FullName.Substring($RepoRoot.Length + 1)
$dst = Join-Path "$DistDir\source" $rel
if ($_.PSIsContainer) {
New-Item -ItemType Directory -Path $dst -Force | Out-Null
} else {
$dstDir = Split-Path $dst -Parent
if (-not (Test-Path $dstDir)) { New-Item -ItemType Directory -Path $dstDir -Force | Out-Null }
Copy-Item $_.FullName $dst -Force
}
}
Write-OK "Source copied"
# ── README for flash drive ────────────────────────────────────────────────────
@"
DupFinder Installer
===================
Requirements:
- Windows 10/11 (64-bit)
- Docker Desktop for Windows (if not installed, the installer will guide you)
To install:
1. Right-click INSTALL.bat -> "Run as administrator"
OR
Open PowerShell as Administrator and run:
PowerShell -ExecutionPolicy Bypass -File install.ps1
2. Follow the prompts (photos path, data path)
3. A "DupFinder" shortcut will appear on the desktop when done.
To uninstall:
Run uninstall.ps1 as Administrator.
Built: $(Get-Date -Format 'yyyy-MM-dd HH:mm')
Image: $ImageFull
"@ | Set-Content "$DistDir\README.txt" -Encoding UTF8
# ── Summary ───────────────────────────────────────────────────────────────────
$totalMB = [math]::Round((Get-ChildItem $DistDir -Recurse | Measure-Object -Property Length -Sum).Sum / 1MB, 1)
Write-Host ""
Write-Host "============================================" -ForegroundColor Green
Write-Host " Build complete! dist\ is ${totalMB} MB total" -ForegroundColor Green
Write-Host " Copy the dist\ folder to your flash drive." -ForegroundColor Green
Write-Host "============================================" -ForegroundColor Green

145
debian/build-deb.sh vendored Normal file

@@ -0,0 +1,145 @@
#!/bin/bash
# Build dupfinder.deb and upload it to the Gitea package registry.
# Run this on the NAS / any Linux machine with dpkg-deb and curl installed.
#
# Usage:
# ./debian/build-deb.sh
# ./debian/build-deb.sh --no-upload # build only, skip Gitea upload
set -e
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
DEBIAN_DIR="$REPO_ROOT/debian"
BUILD_DIR="$REPO_ROOT/build/deb"
# ── Config ────────────────────────────────────────────────────────────────────
PKG_NAME="dupfinder"
PKG_VERSION="1.1.2"
PKG_ARCH="amd64"
DEB_FILE="${PKG_NAME}_${PKG_VERSION}_${PKG_ARCH}.deb"
GITEA_URL="http://192.168.1.64:3000"
GITEA_OWNER="tocmo0nlord"
GITEA_TOKEN="${GITEA_TOKEN:-}"  # supply via the environment; never commit a real token
DISTRO="bookworm"
COMPONENT="main"
NO_UPLOAD=false
[[ "${1}" == "--no-upload" ]] && NO_UPLOAD=true
# ── Helpers ───────────────────────────────────────────────────────────────────
info() { echo -e "\033[0;36m==> $*\033[0m"; }
ok() { echo -e "\033[0;32m OK $*\033[0m"; }
fail() { echo -e "\033[0;31m !! $*\033[0m"; exit 1; }
# ── Check dependencies ────────────────────────────────────────────────────────
command -v dpkg-deb &>/dev/null || fail "dpkg-deb not found. Run: sudo apt install dpkg-dev"
command -v curl &>/dev/null || fail "curl not found. Run: sudo apt install curl"
# ── Prepare staging area ──────────────────────────────────────────────────────
info "Preparing build directory..."
PKG_STAGE="$BUILD_DIR/${PKG_NAME}_${PKG_VERSION}_${PKG_ARCH}"
rm -rf "$PKG_STAGE"
mkdir -p "$PKG_STAGE/DEBIAN"
# ── Copy DEBIAN control files ─────────────────────────────────────────────────
info "Copying control files..."
cp "$DEBIAN_DIR/control" "$PKG_STAGE/DEBIAN/control"
cp "$DEBIAN_DIR/postinst" "$PKG_STAGE/DEBIAN/postinst"
cp "$DEBIAN_DIR/prerm" "$PKG_STAGE/DEBIAN/prerm"
cp "$DEBIAN_DIR/postrm" "$PKG_STAGE/DEBIAN/postrm"
# Inject current version into control file
sed -i "s/^Version:.*/Version: $PKG_VERSION/" "$PKG_STAGE/DEBIAN/control"
# Fix permissions — maintainer scripts must be executable
chmod 755 "$PKG_STAGE/DEBIAN/postinst" \
"$PKG_STAGE/DEBIAN/prerm" \
"$PKG_STAGE/DEBIAN/postrm"
# ── Copy payload files ────────────────────────────────────────────────────────
info "Copying payload files..."
cp -r "$DEBIAN_DIR/files/." "$PKG_STAGE/"
# Copy the docker-compose.yml from repo root into the package
mkdir -p "$PKG_STAGE/opt/dupfinder"
cp "$REPO_ROOT/docker-compose.yml" "$PKG_STAGE/opt/dupfinder/docker-compose.yml"
# Copy source as fallback build path. Preserve the app/ subdirectory layout
# so the Dockerfile's `COPY app/ /app/` resolves correctly when building from
# this staged source dir.
SRC_STAGE="$PKG_STAGE/opt/dupfinder/source"
mkdir -p "$SRC_STAGE"
cp -r "$REPO_ROOT/app" "$SRC_STAGE/app"
cp -r "$REPO_ROOT/templates" "$SRC_STAGE/templates"
cp "$REPO_ROOT/Dockerfile" "$SRC_STAGE/Dockerfile"
cp "$REPO_ROOT/requirements.txt" "$SRC_STAGE/requirements.txt"
# ── Fix file permissions ──────────────────────────────────────────────────────
find "$PKG_STAGE" -type f -name "*.sh" -exec chmod 755 {} \;
chmod 755 "$PKG_STAGE/usr/local/bin/dupfinder" 2>/dev/null || true
# Directories must be 755, files 644 (except executables)
find "$PKG_STAGE" -type d -exec chmod 755 {} \;
find "$PKG_STAGE" -type f ! -name "*.sh" \
! -path "*/DEBIAN/*" \
! -name "dupfinder" \
-exec chmod 644 {} \;
ok "Staging area ready: $PKG_STAGE"
# ── Build .deb ────────────────────────────────────────────────────────────────
info "Building $DEB_FILE ..."
mkdir -p "$BUILD_DIR"
dpkg-deb --build --root-owner-group "$PKG_STAGE" "$BUILD_DIR/$DEB_FILE"
DEB_SIZE=$(du -sh "$BUILD_DIR/$DEB_FILE" | cut -f1)
ok "Built: $BUILD_DIR/$DEB_FILE ($DEB_SIZE)"
# ── Upload to Gitea ───────────────────────────────────────────────────────────
if [[ "$NO_UPLOAD" == "true" ]]; then
echo ""
echo "Skipping upload (--no-upload). File is at:"
echo " $BUILD_DIR/$DEB_FILE"
exit 0
fi
info "Uploading to Gitea package registry..."
# Gitea's Debian registry requires HTTP basic auth (user + token-as-password)
# and the literal /upload endpoint — token-bearer auth returns 405.
UPLOAD_URL="$GITEA_URL/api/packages/$GITEA_OWNER/debian/pool/$DISTRO/$COMPONENT/upload"
HTTP_STATUS=$(curl -s -o /tmp/gitea_upload_response.txt -w "%{http_code}" \
-u "$GITEA_OWNER:$GITEA_TOKEN" \
--upload-file "$BUILD_DIR/$DEB_FILE" \
"$UPLOAD_URL")
if [[ "$HTTP_STATUS" == "201" || "$HTTP_STATUS" == "200" ]]; then
ok "Uploaded successfully (HTTP $HTTP_STATUS)"
elif [[ "$HTTP_STATUS" == "409" ]]; then
echo " Package version $PKG_VERSION already exists in registry."
echo " Bump PKG_VERSION in this script to publish a new version."
else
echo " Upload failed (HTTP $HTTP_STATUS):"
cat /tmp/gitea_upload_response.txt
exit 1
fi
# ── Print install instructions ────────────────────────────────────────────────
echo ""
echo "╔══════════════════════════════════════════════════════════════════╗"
echo "║ Package published! Install on any Ubuntu/Debian machine with: ║"
echo "╠══════════════════════════════════════════════════════════════════╣"
echo "║ ║"
echo "║ 1. Add the repo: ║"
echo "║ echo \"deb [trusted=yes] \\ ║"
echo "$GITEA_URL/api/packages/$GITEA_OWNER/debian \\ ║"
echo "$DISTRO $COMPONENT\" \\ ║"
echo "║ | sudo tee /etc/apt/sources.list.d/dupfinder.list ║"
echo "║ ║"
echo "║ 2. Install: ║"
echo "║ sudo apt update && sudo apt install dupfinder ║"
echo "║ ║"
echo "║ 3. Configure: ║"
echo "║ sudo dupfinder setup ║"
echo "║ ║"
echo "╚══════════════════════════════════════════════════════════════════╝"
echo ""
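
Review note: the upload step above branches on the HTTP status (200/201, 409, anything else) and needs a token for basic auth. A minimal sketch of those two decisions as standalone functions follows. The names `classify_upload_status` and `require_token` are hypothetical and do not exist in the script; the outcome words are illustrative only.

```shell
# Hypothetical helper: map the HTTP status of the registry upload to an outcome word.
# The status codes mirror the branches in the build script above.
classify_upload_status() {
  case "$1" in
    200|201) echo "uploaded" ;;
    409)     echo "exists" ;;
    *)       echo "failed" ;;
  esac
}

# Hypothetical guard: fail early when no token was supplied via the environment.
require_token() {
  if [ -z "${GITEA_TOKEN:-}" ]; then
    echo "GITEA_TOKEN is not set" >&2
    return 1
  fi
}
```

Under these assumptions the script could call `require_token || exit 1` just before the curl invocation, so a `--no-upload` build would still work without a token.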

15
debian/control vendored Normal file

@@ -0,0 +1,15 @@
Package: dupfinder
Version: 1.0.0
Architecture: amd64
Maintainer: tocmo0nlord
Depends: docker.io | docker-ce, docker-compose-plugin | docker-compose
Recommends: nvidia-container-toolkit
Section: utils
Priority: optional
Description: Self-hosted duplicate photo and video finder
 DupFinder scans a photo/video library using four detection methods:
 exact hash (SHA-256), visual similarity (perceptual hash), EXIF
 timestamp matching, and file-size/dimension matching. All decisions
 are stored in SQLite — no files are ever moved or deleted.
 GPU acceleration via NVIDIA CUDA is supported automatically.
Homepage: http://192.168.1.64:3000/tocmo0nlord/duplicate-finder
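
Review note: the build script hardcodes the package name and architecture and rewrites `Version:` with sed, so the output file name and the staged control file can drift apart. A minimal sketch of deriving the name from the control file instead, assuming the three fields each appear once at the start of a line. `control_field` and `deb_filename` are hypothetical helpers, not part of the script.

```shell
# Hypothetical helper: print the value of one field from a Debian control file.
# $1 = control file, $2 = field name (e.g. Version)
control_field() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1
}

# Hypothetical helper: print "<Package>_<Version>_<Architecture>.deb" for a control file.
deb_filename() {
  printf '%s_%s_%s.deb\n' \
    "$(control_field "$1" Package)" \
    "$(control_field "$1" Version)" \
    "$(control_field "$1" Architecture)"
}
```

Comparing `deb_filename "$PKG_STAGE/DEBIAN/control"` with `$DEB_FILE` before calling `dpkg-deb` would be one way to catch a mismatch.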

@@ -0,0 +1,33 @@
[Unit]
Description=DupFinder Duplicate Photo Scanner
Documentation=http://192.168.1.64:3000/tocmo0nlord/duplicate-finder
After=docker.service network-online.target
Requires=docker.service
Wants=network-online.target
[Service]
Type=simple
Restart=on-failure
RestartSec=10
EnvironmentFile=-/etc/dupfinder.conf
WorkingDirectory=/opt/dupfinder
ExecStart=/usr/bin/docker compose \
-f /opt/dupfinder/docker-compose.yml \
-f /opt/dupfinder/docker-compose.override.yml \
up --no-build --remove-orphans
ExecStop=/usr/bin/docker compose \
-f /opt/dupfinder/docker-compose.yml \
-f /opt/dupfinder/docker-compose.override.yml \
down
# Don't start if override hasn't been created yet (setup not run)
ExecStartPre=/bin/test -f /opt/dupfinder/docker-compose.override.yml
StandardOutput=journal
StandardError=journal
SyslogIdentifier=dupfinder
[Install]
WantedBy=multi-user.target

@@ -0,0 +1,202 @@
#!/bin/bash
# DupFinder first-time setup — configure paths, pull image, write override
set -e
CONF_FILE="/etc/dupfinder.conf"
COMPOSE_DIR="/opt/dupfinder"
OVERRIDE_YML="$COMPOSE_DIR/docker-compose.override.yml"
IMAGE_NAME="tocmo0nlord/dupfinder:latest"
DATA_DIR="/var/lib/dupfinder/data"
APP_PORT=8765
RED='\033[0;31m'; GREEN='\033[0;32m'; CYAN='\033[0;36m'; NC='\033[0m'
info() { echo -e "${CYAN}==> $*${NC}"; }
ok() { echo -e "${GREEN} OK $*${NC}"; }
err() { echo -e "${RED} !! $*${NC}"; }
# ── Root check ────────────────────────────────────────────────────────────────
if [[ $EUID -ne 0 ]]; then
err "Please run as root: sudo dupfinder setup"
exit 1
fi
echo ""
echo " ╔══════════════════════════════════════╗"
echo " ║ DupFinder Setup ║"
echo " ╚══════════════════════════════════════╝"
echo ""
# ── Load existing config as defaults ─────────────────────────────────────────
[[ -f "$CONF_FILE" ]] && source "$CONF_FILE"
: "${PHOTOS_PATH:=/mnt/photos}"
: "${DATA_PATH:=$DATA_DIR}"
: "${APP_PORT:=8765}"
# ── Check Docker ──────────────────────────────────────────────────────────────
info "Checking Docker..."
if ! command -v docker &>/dev/null; then
err "Docker is not installed."
echo " Install with: curl -fsSL https://get.docker.com | sh"
exit 1
fi
if ! docker info &>/dev/null; then
err "Docker daemon is not running."
echo " Start with: sudo systemctl start docker"
exit 1
fi
ok "Docker is running"
# ── Check docker compose ──────────────────────────────────────────────────────
if ! docker compose version &>/dev/null; then
err "docker compose (V2 plugin) not found. Update Docker or install docker-compose-plugin."
exit 1
fi
ok "docker compose V2 available"
# ── Check NVIDIA GPU + container toolkit ─────────────────────────────────────
info "Checking GPU..."
GPU_AVAILABLE="" # discard any stale value sourced from $CONF_FILE so a re-run re-detects the GPU
if command -v nvidia-smi &>/dev/null && nvidia-smi &>/dev/null; then
GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1)
ok "NVIDIA GPU detected: $GPU_NAME"
# nvidia-container-toolkit is required for Docker GPU passthrough
if ! command -v nvidia-ctk &>/dev/null && ! dpkg -l nvidia-container-toolkit &>/dev/null; then
echo ""
echo " nvidia-container-toolkit is not installed."
echo " Without it Docker cannot pass the GPU to the container."
echo " Install with:"
echo ""
echo " curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg"
echo " curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list"
echo " sudo apt update && sudo apt install -y nvidia-container-toolkit"
echo " sudo nvidia-ctk runtime configure --runtime=docker"
echo " sudo systemctl restart docker"
echo ""
read -rp " Install nvidia-container-toolkit now? (Y/n): " INST_CTK
if [[ "$INST_CTK" != "n" && "$INST_CTK" != "N" ]]; then
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update -qq && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
ok "nvidia-container-toolkit installed and Docker restarted"
else
echo " Skipping — GPU will not be available in Docker. You can re-run setup later."
GPU_AVAILABLE=false
fi
else
ok "nvidia-container-toolkit is present"
fi
[[ "$GPU_AVAILABLE" != "false" ]] && GPU_AVAILABLE=true
else
echo " No NVIDIA GPU detected — will use CPU for perceptual hashing"
GPU_AVAILABLE=false
fi
# ── Photos path ───────────────────────────────────────────────────────────────
echo ""
info "Photos library path (mounted read-only):"
echo " Current: $PHOTOS_PATH"
read -rp " Path [Enter to keep]: " INPUT
INPUT="${INPUT%\"}" ; INPUT="${INPUT#\"}" # strip quotes
[[ -n "$INPUT" ]] && PHOTOS_PATH="$INPUT"
if [[ ! -d "$PHOTOS_PATH" ]]; then
err "Path not found: $PHOTOS_PATH"
echo " Create it or mount your drive first, then re-run setup."
exit 1
fi
ok "Photos: $PHOTOS_PATH"
# ── Data path ─────────────────────────────────────────────────────────────────
echo ""
info "Database storage path:"
echo " Current: $DATA_PATH"
read -rp " Path [Enter to keep]: " INPUT
INPUT="${INPUT%\"}" ; INPUT="${INPUT#\"}"
[[ -n "$INPUT" ]] && DATA_PATH="$INPUT"
mkdir -p "$DATA_PATH"
ok "Data: $DATA_PATH"
# ── Port ──────────────────────────────────────────────────────────────────────
echo ""
read -rp " Web port [$APP_PORT]: " INPUT
[[ -n "$INPUT" ]] && APP_PORT="$INPUT"
ok "Port: $APP_PORT"
# ── Build (or pull) Docker image ──────────────────────────────────────────────
# The .deb ships the full source tree, so building locally is the default.
# Registry pull is tried only as a quick path if the image happens to be
# published; failures are silent.
echo ""
info "Preparing Docker image ($IMAGE_NAME)..."
if docker image inspect "$IMAGE_NAME" >/dev/null 2>&1; then
ok "Image already present locally"
elif docker pull "$IMAGE_NAME" >/dev/null 2>&1; then
ok "Image pulled from registry"
elif [[ -f "$COMPOSE_DIR/source/Dockerfile" ]]; then
echo " Building image from bundled source (one-time, ~5-10 min)..."
docker build -t "$IMAGE_NAME" "$COMPOSE_DIR/source"
ok "Image built from source"
else
err "No image available and no source bundled. Reinstall the .deb."
exit 1
fi
# ── Write config + override ───────────────────────────────────────────────────
info "Writing configuration..."
cat > "$CONF_FILE" <<EOF
PHOTOS_PATH=$PHOTOS_PATH
DATA_PATH=$DATA_PATH
APP_PORT=$APP_PORT
GPU_AVAILABLE=$GPU_AVAILABLE
EOF
chmod 600 "$CONF_FILE"
# Docker requires forward slashes
PHOTOS_DOCKER="${PHOTOS_PATH//\\//}"
DATA_DOCKER="${DATA_PATH//\\//}"
cat > "$OVERRIDE_YML" <<EOF
services:
  dup-finder:
    image: $IMAGE_NAME
    ports:
      - "${APP_PORT}:8000"
    volumes:
      - "$PHOTOS_DOCKER:/photos:ro"
      - "$DATA_DOCKER:/data"
# Add GPU reservation if available
if [[ "$GPU_AVAILABLE" == "true" ]]; then
cat >> "$OVERRIDE_YML" <<EOF
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF
fi
ok "Config saved to $CONF_FILE"
# ── Start service ─────────────────────────────────────────────────────────────
info "Starting DupFinder..."
systemctl daemon-reload
systemctl enable --now dupfinder.service
ok "Service started"
echo ""
echo -e "${GREEN} ╔══════════════════════════════════════════╗${NC}"
echo -e "${GREEN} ║ DupFinder is running! ║${NC}"
echo -e "${GREEN} ║ Open: http://localhost:$APP_PORT${NC}"
echo -e "${GREEN} ╚══════════════════════════════════════════╝${NC}"
echo ""
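
Review note: the setup script accepts whatever is typed at the "Web port" prompt and writes it straight into the compose override. A minimal sketch of a check that could run before the value is used, assuming only TCP ports 1 to 65535 are acceptable. `valid_port` is a hypothetical name, not part of the script.

```shell
# Hypothetical validator: succeeds only for a decimal integer in 1..65535.
valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  # Cap the length first so very long digit strings never reach the numeric tests.
  [ "${#1}" -le 5 ] && [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}
```

One possible use is `valid_port "$INPUT" || { err "Invalid port: $INPUT"; exit 1; }` right after the prompt.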

81
debian/files/usr/local/bin/dupfinder vendored Normal file

@@ -0,0 +1,81 @@
#!/bin/bash
# DupFinder CLI wrapper
CONF_FILE="/etc/dupfinder.conf"
COMPOSE_DIR="/opt/dupfinder"
COMPOSE_YML="$COMPOSE_DIR/docker-compose.yml"
OVERRIDE_YML="$COMPOSE_DIR/docker-compose.override.yml"
[[ -f "$CONF_FILE" ]] && source "$CONF_FILE"
: "${APP_PORT:=8765}"
_compose() {
docker compose -f "$COMPOSE_YML" -f "$OVERRIDE_YML" "$@"
}
_require_conf() {
if [[ ! -f "$CONF_FILE" ]]; then
echo "DupFinder is not configured. Run: sudo dupfinder setup"
exit 1
fi
}
case "${1:-help}" in
setup)
exec bash /opt/dupfinder/dupfinder-setup.sh
;;
start)
_require_conf
sudo systemctl start dupfinder.service
echo "DupFinder started — http://localhost:$APP_PORT"
;;
stop)
sudo systemctl stop dupfinder.service
echo "DupFinder stopped."
;;
restart)
_require_conf
sudo systemctl restart dupfinder.service
echo "DupFinder restarted — http://localhost:$APP_PORT"
;;
status)
systemctl status dupfinder.service --no-pager
;;
logs)
_compose logs -f --tail=100
;;
open)
_require_conf
# Wait for service to be ready then open browser
for i in $(seq 1 15); do
curl -sf "http://localhost:$APP_PORT/" -o /dev/null && break
sleep 1
done
xdg-open "http://localhost:$APP_PORT" 2>/dev/null || \
echo "Open in browser: http://localhost:$APP_PORT"
;;
update)
_require_conf
echo "Pulling latest image..."
docker pull tocmo0nlord/dupfinder:latest || { echo "Pull failed; service left unchanged." >&2; exit 1; }
sudo systemctl restart dupfinder.service
echo "Updated and restarted."
;;
uninstall)
echo "To fully remove DupFinder: sudo apt remove dupfinder"
echo "To also remove data: sudo apt purge dupfinder"
;;
help|--help|-h|*)
echo "Usage: dupfinder <command>"
echo ""
echo "Commands:"
echo " setup Configure photos path, data path, pull image"
echo " start Start the service"
echo " stop Stop the service"
echo " restart Restart the service"
echo " status Show systemd service status"
echo " logs Tail container logs"
echo " open Open in browser"
echo " update Pull latest image and restart"
echo " uninstall Show removal instructions"
;;
esac
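
Review note: the `open` command polls the web port up to 15 times with a one-second sleep. A minimal sketch of the same pattern as a reusable function, assuming a hypothetical helper named `retry` that takes an attempt count, a delay in seconds, and the command to run:

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0                        # stop as soon as the command succeeds
    n=$((n + 1))
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1                                  # every attempt failed
}
```

With this helper the polling loop could be written as `retry 15 1 curl -sf "http://localhost:$APP_PORT/" -o /dev/null`, and the return code would tell the caller whether the service ever answered.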

@@ -0,0 +1,12 @@
[Desktop Entry]
Type=Application
Version=1.0
Name=DupFinder
GenericName=Duplicate Photo Finder
Comment=Find and review duplicate photos and videos
Exec=dupfinder open
Icon=dupfinder
Terminal=false
Categories=Graphics;Photography;Utility;
Keywords=duplicate;dedup;dedupe;photos;videos;
StartupNotify=false

27
debian/postinst vendored Normal file

@@ -0,0 +1,27 @@
#!/bin/bash
set -e
# Create data directory with correct permissions
mkdir -p /var/lib/dupfinder/data
chmod 755 /var/lib/dupfinder
# Reload systemd and enable service (don't start yet — needs user config first)
systemctl daemon-reload
systemctl enable dupfinder.service 2>/dev/null || true
echo ""
echo "╔══════════════════════════════════════════════════╗"
echo "║ DupFinder installed successfully! ║"
echo "╠══════════════════════════════════════════════════╣"
echo "║ ║"
echo "║ Run setup to configure your photos path: ║"
echo "║ ║"
echo "║ sudo dupfinder setup ║"
echo "║ ║"
echo "║ After setup, manage with: ║"
echo "║ sudo systemctl start dupfinder ║"
echo "║ sudo systemctl stop dupfinder ║"
echo "║ dupfinder status ║"
echo "║ ║"
echo "╚══════════════════════════════════════════════════╝"
echo ""

13
debian/postrm vendored Normal file

@@ -0,0 +1,13 @@
#!/bin/bash
set -e
case "$1" in
purge)
# Remove data only on purge (not regular remove)
rm -rf /var/lib/dupfinder
rm -f /etc/dupfinder.conf
systemctl daemon-reload 2>/dev/null || true
;;
remove|upgrade|failed-upgrade|abort-install|abort-upgrade|disappear)
systemctl daemon-reload 2>/dev/null || true
;;
esac

5
debian/prerm vendored Normal file

@@ -0,0 +1,5 @@
#!/bin/bash
set -e
# Stop and disable service before removal
systemctl stop dupfinder.service 2>/dev/null || true
systemctl disable dupfinder.service 2>/dev/null || true
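
Review note: dpkg runs `prerm` with `remove` on removal and with `upgrade` on upgrades, and the script above stops and disables the service in both cases. A minimal sketch of one possible split, where an upgrade only stops the service. `prerm_plan` is a hypothetical function that just prints the intended actions so the decision can be inspected without touching systemd.

```shell
# Hypothetical sketch: decide what prerm should do from its first argument.
# Prints a space-separated list of actions; prints an empty line for unknown arguments.
prerm_plan() {
  case "$1" in
    remove|deconfigure)     echo "stop disable" ;;
    upgrade|failed-upgrade) echo "stop" ;;
    *)                      echo "" ;;
  esac
}
```

Under this arrangement the package would keep its enabled state across upgrades; whether `postinst` should then restart the service is a separate decision.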

@@ -13,5 +13,10 @@ services:
     deploy:
       resources:
         limits:
-          cpus: "2.0"
-          memory: 2G
+          cpus: "4.0"
+          memory: 4G
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]

@@ -0,0 +1,71 @@
#Requires -Version 5.1
<#
.SYNOPSIS
Start, stop, restart DupFinder, or open it in the browser.
.PARAMETER Action
start | stop | restart | open (default: open)
.EXAMPLE
.\dupfinder-start-stop.ps1 -Action open
.\dupfinder-start-stop.ps1 -Action stop
#>
param(
[ValidateSet("start","stop","restart","open")]
[string]$Action = "open"
)
$ConfigFile = "C:\ProgramData\DupFinder\dupfinder.conf"
if (-not (Test-Path $ConfigFile)) {
Write-Error "DupFinder is not installed. Run install.ps1 first."
exit 1
}
# Read config
$conf = @{}
Get-Content $ConfigFile | ForEach-Object {
if ($_ -match '^(.+?)=(.+)$') { $conf[$Matches[1]] = $Matches[2] }
}
$ComposeDir = $conf["COMPOSE_DIR"]
$AppPort = $conf["APP_PORT"]
$ComposeYml = "$ComposeDir\docker-compose.yml"
$OverrideYml = "$ComposeDir\docker-compose.override.yml"
$Url = "http://localhost:$AppPort"
function Invoke-Compose([string]$cmd) {
& docker compose -f $ComposeYml -f $OverrideYml $cmd.Split(" ")
}
switch ($Action) {
"start" {
Write-Host "Starting DupFinder..."
Invoke-Compose "up -d --pull never"
}
"stop" {
Write-Host "Stopping DupFinder..."
Invoke-Compose "stop"
}
"restart" {
Write-Host "Restarting DupFinder..."
Invoke-Compose "restart"
}
"open" {
# Ensure container is running
$running = docker ps --filter "name=dup-finder" --format "{{.Names}}" 2>$null
if (-not $running) {
Write-Host "Starting DupFinder..."
Invoke-Compose "up -d --pull never"
}
# Poll until responsive (up to 15s)
$tries = 0
while ($tries -lt 15) {
try {
$r = Invoke-WebRequest -Uri $Url -UseBasicParsing -TimeoutSec 1 -ErrorAction Stop
break
} catch { }
Start-Sleep 1
$tries++
}
Start-Process $Url
}
}
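
Review note: the PowerShell helper above reads `dupfinder.conf` by matching `KEY=VALUE` lines, while the Linux wrapper sources the file as shell code. A minimal shell counterpart of the line-matching approach, assuming the same flat `KEY=VALUE` format and keys made of plain letters, digits, and underscores. `conf_get` is a hypothetical helper.

```shell
# Hypothetical helper: print the value of KEY from a KEY=VALUE file without sourcing it.
# $1 = config file, $2 = key. If the key appears more than once, the last one wins.
conf_get() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}
```

For example, `APP_PORT=$(conf_get /etc/dupfinder.conf APP_PORT)` would read the port without executing anything else that happens to be in the file.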

276
installer/install.ps1 Normal file

@@ -0,0 +1,276 @@
#Requires -Version 5.1
<#
.SYNOPSIS
Installs DupFinder on this workstation.
.DESCRIPTION
- Verifies Docker Desktop is installed and running
- Loads the pre-built Docker image (or builds from source as fallback)
- Prompts for photos library path and data storage path
- Writes a docker-compose.override.yml
- Starts the container
- Creates a desktop shortcut
.PARAMETER ForceReload
Re-load the Docker image even if it's already present locally.
.EXAMPLE
PowerShell -ExecutionPolicy Bypass -File install.ps1
#>
param(
[switch]$ForceReload
)
Set-StrictMode -Version Latest
$ErrorActionPreference = "Stop"
$ScriptDir = $PSScriptRoot
$AppDir = "C:\ProgramData\DupFinder"
$ConfigFile = "$AppDir\dupfinder.conf"
$OverrideYml = "$AppDir\docker-compose.override.yml"
$ComposeYml = "$AppDir\docker-compose.yml"
$ImageName = "dupfinder:latest"
$TarPath = "$ScriptDir\image\dupfinder.tar"
$SourcePath = "$ScriptDir\source"
$AppPort = 8765
function Write-Step([string]$msg) { Write-Host "`n==> $msg" -ForegroundColor Cyan }
function Write-OK([string]$msg) { Write-Host " OK $msg" -ForegroundColor Green }
function Write-Warn([string]$msg) { Write-Host " !! $msg" -ForegroundColor Yellow }
function Write-Fail([string]$msg) { Write-Host "`n[FAIL] $msg" -ForegroundColor Red }
function Pause-Continue {
Write-Host "`nPress Enter to continue..." -NoNewline
$null = Read-Host
}
# ── 1. Admin check ────────────────────────────────────────────────────────────
$principal = [Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()
if (-not $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)) {
Write-Fail "This script must be run as Administrator."
Write-Host "Right-click install.ps1 and choose 'Run as administrator', or use:"
Write-Host " PowerShell -ExecutionPolicy Bypass -File `"$PSCommandPath`""
exit 1
}
Write-Host ""
Write-Host " ====================================" -ForegroundColor Magenta
Write-Host " DupFinder Installer" -ForegroundColor Magenta
Write-Host " ====================================" -ForegroundColor Magenta
Write-Host ""
# ── 2. WSL2 check ─────────────────────────────────────────────────────────────
Write-Step "Checking WSL2..."
$wslOut = & wsl --status 2>&1
if ($LASTEXITCODE -ne 0) {
Write-Warn "WSL2 is not installed or needs updating."
Write-Host " Installing WSL2 (requires internet + possible reboot)..."
& wsl --install --no-distribution 2>&1 | Out-Null
Write-Warn "A reboot may be required. After rebooting, re-run this installer."
Pause-Continue
exit 0
}
Write-OK "WSL2 is present"
# ── 3. Docker detection ───────────────────────────────────────────────────────
Write-Step "Checking Docker Desktop..."
$dockerExe = $null
$dockerCmd = Get-Command docker -ErrorAction SilentlyContinue
if ($dockerCmd) {
$dockerExe = $dockerCmd.Source
} else {
# Check known install locations
$candidates = @(
"$env:ProgramFiles\Docker\Docker\resources\bin\docker.exe",
"$env:LOCALAPPDATA\Programs\Docker\Docker\resources\bin\docker.exe"
)
foreach ($c in $candidates) {
if (Test-Path $c) { $dockerExe = $c; break }
}
}
if (-not $dockerExe) {
Write-Warn "Docker Desktop is not installed."
$bundledInstaller = "$ScriptDir\assets\DockerDesktopInstaller.exe"
if (Test-Path $bundledInstaller) {
Write-Host " Found bundled Docker Desktop installer. Installing..."
Start-Process -Wait $bundledInstaller -ArgumentList "install --quiet --accept-license"
Write-Warn "Docker Desktop was installed. A reboot may be required."
Write-Host " After rebooting, re-run this installer."
Pause-Continue
exit 0
} else {
Write-Host " Opening Docker Desktop download page in your browser..."
Start-Process "https://www.docker.com/products/docker-desktop/"
Write-Host " Install Docker Desktop, then re-run this script."
Pause-Continue
exit 0
}
}
Write-OK "Docker executable found: $dockerExe"
# ── 4. Ensure Docker daemon is running ────────────────────────────────────────
Write-Step "Waiting for Docker daemon..."
$maxWait = 60
$waited = 0
while ($waited -lt $maxWait) {
$info = docker info 2>&1
if ($LASTEXITCODE -eq 0) { break }
if ($waited -eq 0) {
# Try to start Docker Desktop
$desktopExe = "$env:ProgramFiles\Docker\Docker\Docker Desktop.exe"
if (Test-Path $desktopExe) {
Write-Host " Starting Docker Desktop..."
Start-Process $desktopExe
}
Write-Host " Waiting for Docker to become ready (up to ${maxWait}s)..."
Write-Host " If a Docker Desktop setup window appeared, please complete it."
}
Start-Sleep 3
$waited += 3
Write-Host " ... ${waited}s" -NoNewline
}
Write-Host ""
docker info 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
Write-Fail "Docker did not start within ${maxWait}s. Please start Docker Desktop manually and re-run."
exit 1
}
Write-OK "Docker daemon is running"
# ── 5. docker compose V2 check ────────────────────────────────────────────────
docker compose version 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
Write-Fail "docker compose (V2 plugin) not found. Please update Docker Desktop to 4.0+ and re-run."
exit 1
}
Write-OK "docker compose V2 available"
# ── 6. Load / build image ─────────────────────────────────────────────────────
Write-Step "Preparing DupFinder Docker image..."
$existingImage = (docker images $ImageName --format "{{.ID}}" 2>$null)
if ($existingImage -and -not $ForceReload) {
Write-OK "Image already loaded (use -ForceReload to replace it)"
} elseif (Test-Path $TarPath) {
Write-Host " Loading image from $TarPath ..."
docker load -i $TarPath
if ($LASTEXITCODE -ne 0) { Write-Fail "docker load failed."; exit 1 }
Write-OK "Image loaded from tar"
} elseif (Test-Path "$SourcePath\Dockerfile") {
Write-Warn "No pre-built image found. Building from source (requires internet)..."
docker build -t $ImageName $SourcePath
if ($LASTEXITCODE -ne 0) { Write-Fail "docker build failed."; exit 1 }
Write-OK "Image built from source"
} else {
Write-Fail "No image tar and no source Dockerfile found. Bundle may be incomplete."
exit 1
}
# ── 7. Collect paths from user ────────────────────────────────────────────────
Write-Step "Configuration"
Write-Host ""
# Load existing config as defaults if re-running
$defaultPhotos = "C:\Photos"
$defaultData = "C:\ProgramData\DupFinder\data"
if (Test-Path $ConfigFile) {
Get-Content $ConfigFile | ForEach-Object {
if ($_ -match '^PHOTOS_PATH=(.+)$') { $defaultPhotos = $Matches[1] }
if ($_ -match '^DATA_PATH=(.+)$') { $defaultData = $Matches[1] }
}
}
# Photos path
do {
Write-Host " Photos library path (read-only mount):"
Write-Host " Default: $defaultPhotos"
$answer = (Read-Host " Path").Trim().Trim('"') # avoid $input, which is a PowerShell automatic variable
if ([string]::IsNullOrWhiteSpace($answer)) { $answer = $defaultPhotos }
$PhotosPath = $answer
if (-not (Test-Path $PhotosPath -PathType Container)) {
Write-Warn "Path not found: $PhotosPath (try again)"
}
} while (-not (Test-Path $PhotosPath -PathType Container))
Write-OK "Photos: $PhotosPath"
# Data path
Write-Host ""
Write-Host " Database / data storage path:"
Write-Host " Default: $defaultData"
$answer = (Read-Host " Path").Trim().Trim('"') # avoid $input, which is a PowerShell automatic variable
if ([string]::IsNullOrWhiteSpace($answer)) { $answer = $defaultData }
$DataPath = $answer
New-Item -ItemType Directory -Force -Path $DataPath | Out-Null
Write-OK "Data: $DataPath"
# ── 8. Port conflict check ────────────────────────────────────────────────────
$portInUse = Test-NetConnection localhost -Port $AppPort `
-InformationLevel Quiet -WarningAction SilentlyContinue 2>$null
if ($portInUse) {
Write-Warn "Port $AppPort is already in use. DupFinder may already be running."
}
# ── 9. Write config and compose override ─────────────────────────────────────
Write-Step "Writing configuration..."
New-Item -ItemType Directory -Force -Path $AppDir | Out-Null
# Docker requires forward slashes
$PhotosDocker = $PhotosPath -replace '\\', '/' -replace '//', '/'
$DataDocker = $DataPath -replace '\\', '/' -replace '//', '/'
@"
PHOTOS_PATH=$PhotosPath
DATA_PATH=$DataPath
APP_PORT=$AppPort
COMPOSE_DIR=$AppDir
"@ | Set-Content $ConfigFile -Encoding UTF8
@"
services:
  dup-finder:
    ports:
      - "${AppPort}:8000"
    volumes:
      - "${PhotosDocker}:/photos:ro"
      - "${DataDocker}:/data"
"@ | Set-Content $OverrideYml -Encoding UTF8
# Copy base compose file
Copy-Item "$ScriptDir\docker-compose.yml" $ComposeYml -Force
# Copy start-stop helper
Copy-Item "$ScriptDir\dupfinder-start-stop.ps1" "$AppDir\dupfinder-start-stop.ps1" -Force
Write-OK "Config written to $AppDir"
# ── 10. Start container ───────────────────────────────────────────────────────
Write-Step "Starting DupFinder container..."
docker compose -f $ComposeYml -f $OverrideYml up -d --pull never
if ($LASTEXITCODE -ne 0) { Write-Fail "docker compose up failed."; exit 1 }
Write-OK "Container started"
# ── 11. Create desktop shortcut ───────────────────────────────────────────────
Write-Step "Creating desktop shortcut..."
$ShortcutPath = "$env:PUBLIC\Desktop\DupFinder.lnk"
$WshShell = New-Object -ComObject WScript.Shell
$Shortcut = $WshShell.CreateShortcut($ShortcutPath)
$Shortcut.TargetPath = "powershell.exe"
$Shortcut.Arguments = "-ExecutionPolicy Bypass -WindowStyle Hidden -File `"$AppDir\dupfinder-start-stop.ps1`" -Action open"
$Shortcut.Description = "Open DupFinder Duplicate Photo Scanner"
$Shortcut.WindowStyle = 7 # Minimized — hides the PS window
$Shortcut.Save()
Write-OK "Shortcut created: $ShortcutPath"
# ── 12. Done ──────────────────────────────────────────────────────────────────
Write-Host ""
Write-Host " ============================================" -ForegroundColor Green
Write-Host " DupFinder installed successfully!" -ForegroundColor Green
Write-Host " Open: http://localhost:$AppPort" -ForegroundColor Green
Write-Host " Or double-click DupFinder on your desktop." -ForegroundColor Green
Write-Host " ============================================" -ForegroundColor Green
Write-Host ""
# Open browser
$open = Read-Host "Open DupFinder in browser now? (Y/n)"
if ($open -ne 'n' -and $open -ne 'N') {
Start-Process "http://localhost:$AppPort"
}

76
installer/uninstall.ps1 Normal file

@@ -0,0 +1,76 @@
#Requires -Version 5.1
<#
.SYNOPSIS
Uninstalls DupFinder from this workstation.
#>
$principal = [Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()
if (-not $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)) {
Write-Error "This script must be run as Administrator."
exit 1
}
$ConfigFile = "C:\ProgramData\DupFinder\dupfinder.conf"
$AppDir = "C:\ProgramData\DupFinder"
$ShortcutPath = "$env:PUBLIC\Desktop\DupFinder.lnk"
Write-Host ""
Write-Host " DupFinder Uninstaller" -ForegroundColor Magenta
Write-Host ""
if (-not (Test-Path $ConfigFile)) {
Write-Warning "DupFinder config not found. It may not be installed, or was already removed."
exit 0
}
# Read config
$conf = @{}
Get-Content $ConfigFile | ForEach-Object {
if ($_ -match '^(.+?)=(.+)$') { $conf[$Matches[1]] = $Matches[2] }
}
$ComposeDir = $conf["COMPOSE_DIR"]
$DataPath = $conf["DATA_PATH"]
$ComposeYml = "$ComposeDir\docker-compose.yml"
$OverrideYml = "$ComposeDir\docker-compose.override.yml"
# ── Stop and remove container ─────────────────────────────────────────────────
Write-Host "Stopping and removing container..."
docker compose -f $ComposeYml -f $OverrideYml down 2>$null
Write-Host " Done."
# ── Remove Docker image? ──────────────────────────────────────────────────────
$rmImage = Read-Host "Remove the DupFinder Docker image? Frees ~300-600 MB (Y/n)"
if ($rmImage -ne 'n' -and $rmImage -ne 'N') {
docker rmi dupfinder:latest 2>$null
Write-Host " Image removed."
}
# ── Remove data directory? ────────────────────────────────────────────────────
Write-Host ""
Write-Host "Data directory: $DataPath"
Write-Host "This contains the scan database and all decisions."
$rmData = Read-Host "Remove data directory? This CANNOT be undone. (y/N)"
if ($rmData -eq 'y' -or $rmData -eq 'Y') {
if (Test-Path $DataPath) {
Remove-Item $DataPath -Recurse -Force
Write-Host " Data directory removed."
}
} else {
Write-Host " Data directory kept at: $DataPath"
}
# ── Remove shortcut ───────────────────────────────────────────────────────────
if (Test-Path $ShortcutPath) {
Remove-Item $ShortcutPath -Force
Write-Host "Desktop shortcut removed."
}
# ── Remove app directory ──────────────────────────────────────────────────────
if (Test-Path $AppDir) {
Remove-Item $AppDir -Recurse -Force
Write-Host "App directory removed: $AppDir"
}
Write-Host ""
Write-Host " DupFinder has been uninstalled." -ForegroundColor Green
Write-Host ""

@@ -1,3 +1,7 @@
# torch + torchvision come pre-installed in the pytorch/pytorch base image
# (torchvision needed for image transforms)
torchvision==0.18.1
fastapi==0.115.6
uvicorn==0.32.1
Pillow==11.0.0
@@ -5,3 +9,5 @@ imagehash==4.3.1
pillow-heif==0.21.0
jinja2==3.1.4
aiofiles==24.1.0
numpy==1.26.4
paramiko==3.5.0

@@ -61,6 +61,7 @@
#scan-chip.complete { border-color: var(--success); color: var(--success); }
#scan-chip.error { border-color: var(--danger); color: var(--danger); }
#scan-chip.cancelled { border-color: var(--warning); color: var(--warning); }
#scan-chip.paused { border-color: var(--warning); color: var(--warning); }
#topbar-stats { margin-left: auto; display: flex; gap: 20px; font-size: 12px; color: var(--text-dim); }
#topbar-stats span b { color: var(--text); }
@@ -212,6 +213,14 @@
background: var(--accent);
transition: width .3s;
}
.progress-bar-fill.indeterminate {
width: 40% !important;
animation: indeterminate 1.4s ease-in-out infinite;
}
@keyframes indeterminate {
0% { transform: translateX(-100%); }
100% { transform: translateX(300%); }
}
.progress-msg { font-size: 12px; color: var(--text-dim); }
.phase-pills {
display: flex;
@@ -234,6 +243,20 @@
/* ── Rescan buttons ── */
#rescan-area { display: none; margin-top: 16px; }
#rescan-area.show { display: block; }
#paused-area { display: none; margin-top: 16px; }
#paused-area.show { display: block; }
.pause-banner {
display: flex; align-items: flex-start; gap: 12px;
background: rgba(226,164,58,.1);
border: 1px solid rgba(226,164,58,.35);
border-radius: var(--radius);
padding: 12px 14px;
margin-bottom: 10px;
}
.pause-icon { font-size: 22px; line-height: 1; }
.pause-title { font-weight: 600; color: var(--warning); margin-bottom: 4px; }
.pause-details { font-size: 12px; color: var(--text-dim); line-height: 1.6; }
.rescan-info { font-size: 12px; color: var(--text-dim); margin-bottom: 10px; }
.rescan-buttons {
display: flex;
@@ -513,6 +536,83 @@
#export-view tr:hover td { background: rgba(255,255,255,.02); }
/* ── Confirm dialog ── */
/* ── Folder picker ── */
#picker-overlay {
position: fixed; inset: 0;
background: rgba(0,0,0,.75);
display: none;
align-items: center;
justify-content: center;
z-index: 110;
}
#picker-overlay.show { display: flex; }
#picker-box {
background: var(--surface);
border: 1px solid var(--border);
border-radius: var(--radius);
width: 520px;
max-width: 95vw;
display: flex;
flex-direction: column;
max-height: 70vh;
}
#picker-header {
display: flex;
align-items: center;
gap: 10px;
padding: 14px 16px;
border-bottom: 1px solid var(--border);
flex-shrink: 0;
}
#picker-header h3 { font-size: 14px; flex: 1; }
#picker-path {
padding: 8px 16px;
font-family: monospace;
font-size: 12px;
color: var(--text-dim);
background: var(--surface2);
border-bottom: 1px solid var(--border);
flex-shrink: 0;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
}
#picker-list {
overflow-y: auto;
flex: 1;
padding: 6px 0;
}
.picker-row {
display: flex;
align-items: center;
gap: 10px;
padding: 7px 16px;
cursor: pointer;
font-size: 13px;
transition: background .1s;
}
.picker-row:hover { background: var(--surface2); }
.picker-row .icon { color: var(--warning); font-size: 15px; flex-shrink: 0; }
.picker-row.up-row .icon { color: var(--text-dim); }
#picker-footer {
padding: 12px 16px;
border-top: 1px solid var(--border);
display: flex;
gap: 8px;
align-items: center;
flex-shrink: 0;
}
#picker-selected-path {
flex: 1;
font-family: monospace;
font-size: 12px;
color: var(--text);
background: var(--bg);
border: 1px solid var(--border);
border-radius: var(--radius);
padding: 6px 10px;
}
#confirm-overlay {
position: fixed; inset: 0;
background: rgba(0,0,0,.7);
@@ -625,6 +725,10 @@
<div class="nav-item" data-view="export">
&#8659; Export
</div>
<div class="nav-sep"></div>
<div class="nav-item" data-view="destinations">
&#8593; Destinations
</div>
</nav>
<!-- Main -->
@@ -660,6 +764,7 @@
<div id="first-scan-ui">
<div class="input-row">
<input type="text" id="folder-input" placeholder="/photos/MyLibrary" value="/photos">
<button class="btn-secondary" onclick="openPicker('folder-input')" title="Browse folders">&#128193;</button>
<button class="btn-primary" id="start-scan-btn" onclick="startScan('incremental')">Start Scan</button>
</div>
</div>
@@ -673,14 +778,27 @@
<div class="progress-bar-fill" id="progress-fill" style="width:0%"></div>
</div>
<div class="phase-pills">
<span class="phase-pill" data-phase="discovery">Discovery</span>
<span class="phase-pill" data-phase="takeout">Takeout</span>
<span class="phase-pill" data-phase="indexing">Indexing</span>
<span class="phase-pill" data-phase="indexing">Discover + Index</span>
<span class="phase-pill" data-phase="phash">Phash</span>
<span class="phase-pill" data-phase="grouping">Grouping</span>
</div>
<div class="mt8">
<button class="btn-secondary btn-sm" onclick="cancelScan()">Cancel</button>
<button class="btn-secondary btn-sm" onclick="pauseScan()">&#9646;&#9646; Pause</button>
</div>
</div>
<div id="paused-area">
<div class="pause-banner">
<div class="pause-icon">&#9646;&#9646;</div>
<div class="pause-info">
<div class="pause-title">Scan paused</div>
<div id="pause-details" class="pause-details"></div>
</div>
</div>
<div style="display:flex;gap:8px;flex-wrap:wrap;">
<button class="btn-primary btn-sm" onclick="resumeScan()">&#9654; Resume</button>
<button class="btn-danger btn-sm" onclick="confirmFullReset()">Full reset &#9888;</button>
</div>
</div>
@@ -688,6 +806,7 @@
<div class="rescan-info" id="rescan-info-text"></div>
<div class="input-row" style="margin-bottom:10px;">
<input type="text" id="rescan-folder-input" placeholder="/photos">
<button class="btn-secondary" onclick="openPicker('rescan-folder-input')" title="Browse folders">&#128193;</button>
</div>
<div class="rescan-buttons">
<div class="rescan-btn-group">
@@ -783,9 +902,108 @@
<div id="export-table-wrap"></div>
</div>
<!-- Destinations -->
<div id="view-destinations" class="view">
<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:16px;">
<div>
<h2 style="margin:0;">SFTP Destinations</h2>
<div class="text-dim" style="font-size:12px;margin-top:4px;">
Remote locations that duplicates can be moved to. The move pipeline picks one of these per job.
</div>
</div>
<button class="btn-primary" onclick="openDestModal()">+ Add destination</button>
</div>
<div id="dest-list"></div>
</div>
</main>
</div>
<!-- Destination modal -->
<div id="dest-overlay" style="display:none;position:fixed;inset:0;background:rgba(0,0,0,.6);z-index:200;align-items:center;justify-content:center;">
<div style="background:var(--panel);border:1px solid var(--border);border-radius:8px;width:560px;max-width:90vw;max-height:90vh;overflow:auto;padding:24px;">
<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:16px;">
<h3 id="dest-modal-title" style="margin:0;">Add destination</h3>
<button class="btn-secondary btn-sm" onclick="closeDestModal()">&#10005;</button>
</div>
<input type="hidden" id="dest-id">
<div style="display:grid;grid-template-columns:1fr 1fr;gap:12px;">
<label style="grid-column:span 2;">
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Name (display only)</div>
<input id="dest-name" type="text" placeholder="remote-quarantine" style="width:100%;">
</label>
<label>
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Host</div>
<input id="dest-host" type="text" placeholder="192.168.1.x" style="width:100%;">
</label>
<label>
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Port</div>
<input id="dest-port" type="number" value="22" style="width:100%;">
</label>
<label>
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Username</div>
<input id="dest-user" type="text" style="width:100%;">
</label>
<label>
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Auth method</div>
<select id="dest-auth" onchange="updateAuthFields()" style="width:100%;">
<option value="key">SSH key</option>
<option value="password">Password</option>
</select>
</label>
<label style="grid-column:span 2;">
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Base path on remote (where files land)</div>
<input id="dest-basepath" type="text" placeholder="/volume1/dupfinder-quarantine" style="width:100%;">
</label>
<label style="grid-column:span 2;display:flex;gap:6px;align-items:center;">
<input id="dest-mirror" type="checkbox" checked>
<span>Mirror source folder structure under base path</span>
</label>
<!-- Password auth field -->
<div id="dest-password-wrap" style="grid-column:span 2;display:none;">
<div class="text-dim" style="font-size:11px;margin-bottom:4px;">Password (leave blank when editing to keep existing)</div>
<input id="dest-password" type="password" style="width:100%;">
</div>
<!-- Key auth fields -->
<div id="dest-key-wrap" style="grid-column:span 2;">
<div style="display:flex;gap:8px;margin-bottom:6px;">
<button type="button" class="btn-secondary btn-sm" onclick="generateKeypair()">Generate new ED25519 keypair</button>
<span class="text-dim" style="font-size:11px;align-self:center;">or paste existing private key below</span>
</div>
<div id="dest-pubkey-wrap" style="display:none;background:var(--panel-2);padding:8px;border-radius:4px;margin-bottom:8px;font-size:11px;">
<div style="font-weight:600;margin-bottom:4px;">Add this public key to the remote ~/.ssh/authorized_keys:</div>
<code id="dest-pubkey" style="display:block;word-break:break-all;user-select:all;"></code>
</div>
<textarea id="dest-privkey" rows="6" placeholder="-----BEGIN OPENSSH PRIVATE KEY-----&#10;...&#10;-----END OPENSSH PRIVATE KEY-----" style="width:100%;font-family:monospace;font-size:11px;"></textarea>
<div class="text-dim" style="font-size:11px;margin-top:4px;">Leave blank when editing to keep existing key.</div>
</div>
</div>
<div style="display:flex;gap:10px;justify-content:flex-end;margin-top:20px;">
<button class="btn-secondary" onclick="closeDestModal()">Cancel</button>
<button class="btn-primary" onclick="saveDest()">Save</button>
</div>
</div>
</div>
<!-- Folder picker -->
<div id="picker-overlay">
<div id="picker-box">
<div id="picker-header">
<h3>Browse for folder</h3>
<button class="btn-secondary btn-sm" onclick="closePicker()">&#10005;</button>
</div>
<div id="picker-path">/</div>
<div id="picker-list"></div>
<div id="picker-footer">
<input type="text" id="picker-selected-path" placeholder="selected path">
<button class="btn-primary btn-sm" onclick="confirmPicker()">Select</button>
<button class="btn-secondary btn-sm" onclick="closePicker()">Cancel</button>
</div>
</div>
</div>
<!-- Confirm dialog -->
<div id="confirm-overlay">
<div id="confirm-box">
@@ -881,6 +1099,7 @@ function switchView(view) {
if (view === 'gallery') loadGallery(true);
if (view === 'reviewed') loadReviewed();
if (view === 'export') loadExport();
if (view === 'destinations') loadDestinations();
}
// ── Stats + topbar refresh ────────────────────────────────────────────────────
@@ -921,7 +1140,7 @@ async function refreshStats() {
// ── Scan polling ──────────────────────────────────────────────────────────────
let scanPoller = null;
const PHASES = ['discovery','takeout','indexing','phash','grouping'];
const PHASES = ['takeout','indexing','phash','grouping'];
function startPoller() {
if (scanPoller) return;
@@ -955,14 +1174,21 @@ function updateScanUI(s) {
chip.classList.add(s.status);
const isRunning = s.status === 'running';
const isPaused = s.status === 'paused';
el('progress-area').classList.toggle('show', isRunning);
el('first-scan-ui').style.display = (s.scan_id || isRunning) ? 'none' : '';
el('rescan-area').classList.toggle('show', !isRunning && !!s.scan_id);
el('paused-area').classList.toggle('show', isPaused);
el('first-scan-ui').style.display = (s.scan_id || isRunning || isPaused) ? 'none' : '';
el('rescan-area').classList.toggle('show', !isRunning && !isPaused && !!s.scan_id);
if (isRunning) {
el('progress-msg').textContent = s.message || '';
const pct = s.total > 0 ? Math.round((s.progress / s.total) * 100) : 0;
el('progress-fill').style.width = pct + '%';
const indeterminate = s.phase === 'takeout' || s.total === 0;
const fill = el('progress-fill');
fill.classList.toggle('indeterminate', indeterminate);
if (!indeterminate) {
const pct = Math.round((s.progress / s.total) * 100);
fill.style.width = pct + '%';
}
el('progress-count').textContent = s.total > 0 ? `${fmt(s.progress)} / ${fmt(s.total)}` : '';
const phaseIdx = PHASES.indexOf(s.phase);
@@ -973,7 +1199,16 @@ function updateScanUI(s) {
});
}
if (s.scan_id && !isRunning) {
if (isPaused) {
const parts = [];
if (s.folder_path) parts.push(`Folder: ${s.folder_path}`);
if (s.files_indexed) parts.push(`${fmt(s.files_indexed)} files indexed`);
if (s.phashes_done) parts.push(`${fmt(s.phashes_done)} phashes computed`);
if (s.message) parts.push(s.message);
el('pause-details').textContent = parts.join(' · ') || 'Progress saved';
}
if (s.scan_id && !isRunning && !isPaused) {
// populate rescan folder from last scan
el('rescan-folder-input').value = el('folder-input').value || '/photos';
}
@@ -1006,11 +1241,20 @@ async function startScan(mode) {
}
}
async function cancelScan() {
async function pauseScan() {
try {
await api('POST', '/api/scan/cancel');
showToast('Cancelling scan...');
} catch(e) {}
await api('POST', '/api/scan/pause');
showToast('Pausing scan — finishing in-flight work...');
} catch(e) { showToast('Error: ' + e.message, 3000); }
}
async function resumeScan() {
try {
await api('POST', '/api/scan/resume');
state.scanStatus = 'running';
showToast('Resuming scan...');
startPoller();
} catch(e) { showToast('Error: ' + e.message, 4000); }
}
function confirmFullReset() {
@@ -1131,19 +1375,13 @@ async function openGroup(groupId, cellEl) {
state.activeGroupData = g;
renderDetail(g);
// Insert detail panel after the row containing the clicked cell
// Position the detail panel directly after the grid (in the grid's parent).
// An earlier version tried to thread it between grid rows, but that mixed
// parent contexts and threw "node is not a child of this node".
const panel = el('detail-panel');
const grid = el('gallery-grid');
if (cellEl) {
// find row end
const cellRect = cellEl.getBoundingClientRect();
const gridRect = grid.getBoundingClientRect();
const cells = Array.from(grid.children).filter(c => c.classList.contains('gallery-cell'));
const cols = Math.round(grid.offsetWidth / (cellEl.offsetWidth + 12));
const cellIdx = cells.indexOf(cellEl);
const rowEnd = Math.min(Math.ceil((cellIdx + 1) / cols) * cols, cells.length);
const afterCell = cells[rowEnd - 1];
grid.parentNode.insertBefore(panel, afterCell.nextSibling || el('load-more-wrap'));
if (panel.parentNode !== grid.parentNode || panel.previousSibling !== grid) {
grid.parentNode.insertBefore(panel, grid.nextSibling);
}
panel.classList.add('show');
panel.scrollIntoView({ behavior: 'smooth', block: 'nearest' });
@@ -1363,10 +1601,72 @@ async function loadExport() {
}
}
// ── Folder picker ─────────────────────────────────────────────────────────────
let _pickerTargetId = null;
async function openPicker(inputId) {
_pickerTargetId = inputId;
const currentVal = el(inputId).value.trim() || '/';
el('picker-overlay').classList.add('show');
await pickerNavigate(currentVal);
}
function closePicker() {
el('picker-overlay').classList.remove('show');
_pickerTargetId = null;
}
function confirmPicker() {
const path = el('picker-selected-path').value.trim();
if (path && _pickerTargetId) {
el(_pickerTargetId).value = path;
}
closePicker();
}
async function pickerNavigate(path) {
try {
const data = await api('GET', `/api/browse?path=${encodeURIComponent(path)}`);
el('picker-path').textContent = data.current;
el('picker-selected-path').value = data.current;
const list = el('picker-list');
list.innerHTML = '';
// Up button
if (data.parent) {
const row = document.createElement('div');
row.className = 'picker-row up-row';
row.innerHTML = `<span class="icon">&#8593;</span> <span>..</span>`;
row.onclick = () => pickerNavigate(data.parent);
list.appendChild(row);
}
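if (data.dirs.length === 0) {
// Use insertAdjacentHTML rather than innerHTML +=, which re-parses the list and drops the up-row's onclick handler.
list.insertAdjacentHTML('beforeend', `<div class="picker-row text-dim" style="cursor:default">No subfolders</div>`);
}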
data.dirs.forEach(name => {
const row = document.createElement('div');
row.className = 'picker-row';
const fullPath = data.current.replace(/\\/g, '/').replace(/\/$/, '') + '/' + name;
row.innerHTML = `<span class="icon">&#128193;</span> <span>${escapeHtml(name)}</span>`;
row.onclick = () => {
el('picker-selected-path').value = fullPath;
pickerNavigate(fullPath);
};
list.appendChild(row);
});
} catch(e) {
el('picker-list').innerHTML = `<div class="picker-row text-dim">Cannot open this path.</div>`;
}
}
// ── Keyboard shortcuts ────────────────────────────────────────────────────────
document.addEventListener('keydown', e => {
if (e.key === 'Escape') {
if (el('confirm-overlay').classList.contains('show')) closeConfirm();
if (el('picker-overlay').classList.contains('show')) closePicker();
else if (el('confirm-overlay').classList.contains('show')) closeConfirm();
else closeDetail();
}
});
@@ -1378,6 +1678,7 @@ async function init() {
try {
const s = await api('GET', '/api/scan/status');
updateScanUI(s);
state.scanStatus = s.status;
if (s.status === 'running') startPoller();
} catch(e) {}
}
@@ -1387,6 +1688,163 @@ init();
setInterval(() => {
if (state.scanStatus !== 'running') refreshStats();
}, 30000);
// ── SFTP destinations ───────────────────────────────────────────────────────
async function loadDestinations() {
const list = el('dest-list');
list.innerHTML = '<div class="text-dim">Loading...</div>';
try {
const dests = await api('GET', '/api/sftp/destinations');
if (!dests.length) {
list.innerHTML = '<div class="text-dim">No destinations yet. Click "Add destination" above.</div>';
return;
}
list.innerHTML = '';
dests.forEach(d => {
const card = document.createElement('div');
card.style.cssText = 'border:1px solid var(--border);border-radius:6px;padding:14px;margin-bottom:10px;background:var(--panel);';
const statusIcon = d.last_test_result === 'ok' ? '✓' : (d.last_test_result ? '✗' : '?');
const statusColor = d.last_test_result === 'ok' ? '#3fb950' : (d.last_test_result ? '#f85149' : '#888');
card.innerHTML = `
<div style="display:flex;justify-content:space-between;align-items:flex-start;">
<div style="flex:1;">
<div style="font-weight:600;font-size:14px;margin-bottom:4px;">
<span style="color:${statusColor};margin-right:6px;">${statusIcon}</span>${escapeHtml(d.name)}
</div>
<div class="text-dim" style="font-size:12px;">
${escapeHtml(d.username)}@${escapeHtml(d.host)}:${d.port}${escapeHtml(d.base_path)}
</div>
<div class="text-dim" style="font-size:11px;margin-top:4px;">
auth: ${escapeHtml(d.auth_method)}${d.mirror_structure ? ' · mirrors structure' : ' · flat'}
${d.last_tested_at ? ` · last tested ${escapeHtml(d.last_tested_at)}` : ''}
</div>
${d.last_test_result && d.last_test_result !== 'ok'
? `<div style="font-size:11px;color:#f85149;margin-top:4px;">${escapeHtml(d.last_test_result)}</div>`
: ''}
</div>
<div style="display:flex;gap:6px;">
<button class="btn-secondary btn-sm" onclick="testDest(${d.id})">Test</button>
<button class="btn-secondary btn-sm" onclick="editDest(${d.id})">Edit</button>
<button class="btn-secondary btn-sm" onclick="deleteDest(${d.id})">Delete</button>
</div>
</div>
`;
list.appendChild(card);
});
} catch (e) {
list.innerHTML = `<div style="color:#f85149;">Failed to load: ${escapeHtml(e.message)}</div>`;
}
}
// Escape the five HTML-significant characters (& < > " ') so server-supplied
// strings can be interpolated into innerHTML without being parsed as markup.
function escapeHtml(s) {
return String(s ?? '').replace(/[&<>"']/g, c => ({'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;',"'":'&#39;'})[c]);
}
function openDestModal(dest) {
el('dest-overlay').style.display = 'flex';
el('dest-modal-title').textContent = dest ? 'Edit destination' : 'Add destination';
el('dest-id').value = dest ? dest.id : '';
el('dest-name').value = dest ? dest.name : '';
el('dest-host').value = dest ? dest.host : '';
el('dest-port').value = dest ? dest.port : 22;
el('dest-user').value = dest ? dest.username : '';
el('dest-auth').value = dest ? dest.auth_method : 'key';
el('dest-basepath').value = dest ? dest.base_path : '';
el('dest-mirror').checked = dest ? dest.mirror_structure : true;
el('dest-password').value = '';
el('dest-privkey').value = '';
el('dest-pubkey-wrap').style.display = 'none';
updateAuthFields();
}
function closeDestModal() {
el('dest-overlay').style.display = 'none';
}
function updateAuthFields() {
const method = el('dest-auth').value;
el('dest-password-wrap').style.display = method === 'password' ? 'block' : 'none';
el('dest-key-wrap').style.display = method === 'key' ? 'block' : 'none';
}
async function generateKeypair() {
try {
const r = await api('POST', '/api/sftp/keypair');
el('dest-privkey').value = r.private_key;
el('dest-pubkey').textContent = r.public_key;
el('dest-pubkey-wrap').style.display = 'block';
} catch (e) {
showToast('Failed to generate keypair: ' + e.message);
}
}
async function saveDest() {
const id = el('dest-id').value;
const body = {
name: el('dest-name').value.trim(),
host: el('dest-host').value.trim(),
port: parseInt(el('dest-port').value, 10) || 22,
username: el('dest-user').value.trim(),
auth_method: el('dest-auth').value,
base_path: el('dest-basepath').value.trim(),
mirror_structure: el('dest-mirror').checked,
};
if (!body.name || !body.host || !body.username || !body.base_path) {
showToast('Name, host, username, and base path are required');
return;
}
if (body.auth_method === 'password') {
const pw = el('dest-password').value;
if (pw) body.password = pw;
else if (!id) { showToast('Password is required for new destinations'); return; }
} else {
const pk = el('dest-privkey').value.trim();
if (pk) body.private_key = pk;
else if (!id) { showToast('Private key is required for new destinations'); return; }
}
try {
if (id) await api('PUT', `/api/sftp/destinations/${id}`, body);
else await api('POST', '/api/sftp/destinations', body);
closeDestModal();
showToast('Saved');
loadDestinations();
} catch (e) {
showToast('Save failed: ' + e.message);
}
}
async function editDest(id) {
try {
const dests = await api('GET', '/api/sftp/destinations');
const d = dests.find(x => x.id === id);
if (d) openDestModal(d);
} catch (e) {
showToast('Failed: ' + e.message);
}
}
async function deleteDest(id) {
if (!confirm('Delete this destination? Stored credentials will also be removed.')) return;
try {
await api('DELETE', `/api/sftp/destinations/${id}`);
showToast('Deleted');
loadDestinations();
} catch (e) {
showToast('Delete failed: ' + e.message);
}
}
async function testDest(id) {
showToast('Testing...');
try {
const r = await api('POST', `/api/sftp/destinations/${id}/test`);
showToast(r.ok ? 'Connection OK' : 'Failed: ' + r.message);
loadDestinations();
} catch (e) {
showToast('Test failed: ' + e.message);
}
}
</script>
</body>
</html>