odootrain/scraper
Carlos Garcia 608bb51943 fix: replace dead sitemap with crawl-based URL discovery
The Odoo 18 sitemap.xml returns 404. The fallback URL list also failed
because urljoin(BASE_URL, "/applications/...") drops the
/documentation/18.0 prefix: an absolute-path argument to urljoin replaces
the base URL's entire path component.
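
A minimal sketch of the urljoin behavior described above. The BASE_URL
value and the example page path are assumptions made for illustration;
they are not taken from the scraper's code.

```python
from urllib.parse import urljoin

# Assumed value for illustration; the real BASE_URL lives in the scraper.
BASE_URL = "https://www.odoo.com/documentation/18.0"
# Hypothetical page path, used only to show the behavior.
PATH = "/applications/sales.html"

# An absolute-path reference replaces the base URL's whole path,
# so the /documentation/18.0 prefix is lost.
joined = urljoin(BASE_URL, PATH)

# Plain concatenation keeps the prefix, which is what the fix relies on.
concatenated = BASE_URL + PATH

print(joined)        # https://www.odoo.com/applications/sales.html
print(concatenated)  # https://www.odoo.com/documentation/18.0/applications/sales.html
```

urljoin follows the standard rules for resolving a reference against a
base URL, so this is expected behavior rather than a library bug.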

Changes:
- Add discover_urls_by_crawl(): fetches each module index page and
  collects all internal links; replaces the sitemap as the primary source
- crawl() now chains: sitemap → crawl discovery → hardcoded fallback
- Fix fallback_urls() to use BASE_URL + path (not urljoin) and trim
  the list to known-good pages
- Keep crawl discovery rate-limited (0.5s between module seeds)
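
The chaining and rate limiting listed above could be sketched roughly as
follows. This is not the scraper's actual code: discover_with_fallbacks,
crawl_seeds, and the fetch_links callback are hypothetical names, and the
network fetch is injected so the sketch runs without HTTP access.

```python
import time
from typing import Callable, Iterable, List


def discover_with_fallbacks(sources: Iterable[Callable[[], List[str]]]) -> List[str]:
    """Return the URL list from the first source that yields any URLs."""
    for source in sources:
        try:
            urls = source()
        except Exception:
            # A failing source (e.g. a 404 sitemap) falls through to the next.
            urls = []
        if urls:
            return urls
    return []


def crawl_seeds(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect de-duplicated links from each seed page, pausing between seeds."""
    seen = {}  # dict preserves insertion order, so links stay in discovery order
    for index, seed in enumerate(seeds):
        if index > 0:
            sleep(delay)  # rate limit: wait before every seed after the first
        for link in fetch_links(seed):
            seen.setdefault(link, None)
    return list(seen)
```

In this sketch the sources would be passed in priority order (sitemap,
crawl discovery, hardcoded fallback), so a later source is only consulted
when every earlier one raises or returns nothing.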

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 13:05:40 -04:00