5 Commits

Author SHA1 Message Date
Carlos Garcia
608bb51943 fix: replace dead sitemap with crawl-based URL discovery
The Odoo 18 sitemap.xml returns 404. The fallback URL list also failed
because urljoin(BASE_URL, "/applications/...") drops the /documentation/18.0
path: an absolute-path argument replaces the base URL's entire path component.
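
The urljoin behaviour described above can be reproduced in a few lines. This is a minimal sketch: the BASE_URL value and the page path are assumptions chosen for illustration, not values taken from the repository.

```python
from urllib.parse import urljoin

# Assumed value for illustration; the repository's actual BASE_URL is not shown.
BASE_URL = "https://www.odoo.com/documentation/18.0"
# Hypothetical page path.
PATH = "/applications/sales.html"

# A reference that starts with "/" replaces the base URL's whole path component,
# so the /documentation/18.0 prefix is lost.
joined = urljoin(BASE_URL, PATH)

# Plain concatenation keeps the prefix.
concatenated = BASE_URL.rstrip("/") + PATH

print(joined)
print(concatenated)
```

Concatenation is only safe here because every path in the fallback list is known to start with a single slash; for arbitrary input a stricter check would be needed.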

Changes:
- Add discover_urls_by_crawl(): fetches each module index page and
  collects all internal links; replaces the sitemap as the primary source
- crawl() now chains: sitemap → crawl discovery → hardcoded fallback
- Fix fallback_urls() to use BASE_URL + path (not urljoin) and trim
  the list to known-good pages
- Keep crawl discovery rate-limited (0.5s between module seeds)
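
The chain and the rate limit listed above might look roughly like the sketch below. The helper names (first_non_empty, discover_from_seeds) and the injected fetch_links callable are hypothetical stand-ins, not the functions in this commit; only the ordering of sources and the 0.5s delay come from the message.

```python
import time
from typing import Callable, Iterable, List


def discover_from_seeds(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect unique links from each seed page, pausing between seeds."""
    seen = set()
    found = []
    for index, seed in enumerate(seeds):
        if index > 0:
            sleep(delay)  # rate limit: wait between module seeds
        for link in fetch_links(seed):
            if link not in seen:
                seen.add(link)
                found.append(link)
    return found


def first_non_empty(sources: Iterable[Callable[[], Iterable[str]]]) -> List[str]:
    """Try each URL source in order and return the first non-empty result."""
    for source in sources:
        try:
            urls = list(source())
        except Exception:
            # A failing source (for example a 404 sitemap) falls through to the next.
            continue
        if urls:
            return urls
    return []
```

Passing the sources as callables keeps the order (sitemap, crawl discovery, hardcoded fallback) explicit and means later sources are never fetched when an earlier one succeeds.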

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 13:05:40 -04:00
Carlos Garcia
3d94c4eb25 fix: use list-form command with literal block to avoid sh syntax error
A YAML folded scalar (>) preserves newlines on more-indented continuation
lines, so the shell received && at the start of its own line, which is a
syntax error in dash. Switched to the list form [/bin/sh, -c, |script|] so
Docker passes the script verbatim to sh -c without double-wrapping.
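
The shell-level failure can be reproduced without Docker or YAML. The sketch below assumes a POSIX /bin/sh is available and uses placeholder commands (true, echo), not the commands from the actual compose file.

```python
import subprocess


def run_sh(script: str) -> subprocess.CompletedProcess:
    """Run a script through /bin/sh -c, capturing stdout and stderr as text."""
    return subprocess.run(["/bin/sh", "-c", script], capture_output=True, text=True)


# What the shell sees when a newline survives before "&&":
# the operator starts a new command, which is a syntax error.
broken = "true\n&& echo ok"

# A newline after "&&" is allowed, so this form parses and runs.
fixed = "true &&\necho ok"

broken_result = run_sh(broken)
fixed_result = run_sh(fixed)

print("broken exit code:", broken_result.returncode)
print("fixed exit code:", fixed_result.returncode)
```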

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 12:20:08 -04:00
Carlos Garcia
de051fb2e7 fix: remove qdrant healthcheck, use wait-loop in our own containers
qdrant/qdrant:v1.9.0 does not ship curl or wget, so CMD healthchecks
always exit 127 (command not found) and the container is immediately marked
unhealthy regardless of whether Qdrant is actually running.

Fix: remove the healthcheck from the qdrant service entirely. Instead,
rag-api and indexer now loop on `curl http://qdrant:6333/` (curl is
installed in our own python:3.11-slim image via the Dockerfile) before
starting the main process. Also removes the obsolete `version` key.
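
The commit implements the wait-loop with curl in the container command. As an illustration of the same idea only, here is a Python sketch that polls an HTTP endpoint until it answers; the function names, attempt count, and delay are assumptions, not part of the repository.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False


# Example use inside a container entrypoint (not executed here):
#   ready = wait_until(lambda: http_ok("http://qdrant:6333/"))
```

Bounding the number of attempts, rather than looping forever, lets the container exit with an error if Qdrant never comes up.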

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 12:16:48 -04:00
Carlos Garcia
8fbf574634 fix: browser UA for scraper, Qdrant healthcheck endpoint
Scraper was using a bot User-Agent that triggered Cloudflare bot
detection, which returned challenge pages with fewer than 100 characters
of content. Switched to a standard Chrome UA with Accept headers.

Qdrant healthcheck used /healthz which does not exist in v1.9.0.
Changed to GET / which is always available. Added start_period: 30s
so the check does not fire before Qdrant has time to initialise.
Added --debug flag to scraper for future extraction diagnostics.
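
A request with browser-style headers, plus a check for the short challenge pages mentioned above, could be sketched as follows. The header values are example strings and the helper names are hypothetical; the 100-character threshold is the figure from the commit message.

```python
import urllib.request

# Example browser-like headers; the exact strings used by the scraper are not shown.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Pages with less extracted text than this are treated as suspect.
MIN_CONTENT_CHARS = 100


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def looks_like_challenge(extracted_text: str) -> bool:
    """Flag pages whose extracted text is too short to be real documentation."""
    return len(extracted_text.strip()) < MIN_CONTENT_CHARS
```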

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 11:57:34 -04:00
Carlos Garcia
7fb1573bac Initial commit: Odoo 18 RAG stack
Scraper, indexer, and FastAPI query service for Retrieval-Augmented
Generation over Odoo 18 documentation. Uses Qdrant + Ollama (nomic-embed-text
+ llama3.1). Integrates with ActiveBlue PeerBus agent interface.
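
For readers unfamiliar with the retrieve-then-generate flow, here is a toy sketch. It deliberately swaps in word-count vectors for the nomic-embed-text embeddings and an in-memory list for Qdrant, and it stops at building the prompt instead of calling llama3.1; every name in it is hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (stand-in for a vector search)."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Assemble the prompt that would be sent to the language model."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```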

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 11:25:55 -04:00