odootrain/scraper
Carlos Garcia bc054cd478 fix: strip Sphinx pilcrow artifacts from extracted text
Sphinx generates headerlink anchors (the paragraph symbol ¶) next to
each heading. These appear as Â¶ in the output due to a UTF-8/Latin-1
decode mismatch in BeautifulSoup. Fix by removing .headerlink elements
in NOISE_SELECTORS and stripping any residual Â¶/¶ in clean_text().
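
A minimal sketch of the two-part fix described above, not the actual scraper code. Only the names NOISE_SELECTORS and clean_text() come from this commit message; the selector string, the strip_noise() helper, the regex, and the whitespace handling are assumptions made for illustration. strip_noise() expects a parsed BeautifulSoup object and uses its select() and decompose() methods.

```python
import re

# Assumption: the scraper keeps a list of CSS selectors for elements to drop.
# "a.headerlink" matches the anchor Sphinx places next to each heading.
NOISE_SELECTORS = ["a.headerlink"]

# Matches a bare pilcrow (U+00B6) and its mojibake form, U+00C2 followed by
# U+00B6, which is what the UTF-8 bytes C2 B6 look like when read as Latin-1.
_PILCROW_RE = re.compile("\u00c2?\u00b6")


def strip_noise(soup):
    """Remove every element matched by NOISE_SELECTORS (hypothetical helper)."""
    for selector in NOISE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    return soup


def clean_text(text):
    """Strip residual pilcrow artifacts, then collapse runs of spaces and tabs."""
    text = _PILCROW_RE.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    # Reproduce the decode mismatch: UTF-8 bytes read back as Latin-1.
    garbled = "Installation\u00b6".encode("utf-8").decode("latin-1")
    print(repr(garbled))
    print(repr(clean_text(garbled)))
```

Removing the anchors at the DOM level handles the common case, and the regex in clean_text() is a fallback for pilcrows that reach the text through some other path.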

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 14:03:30 -04:00