fix: strip Sphinx pilcrow artifacts from extracted text
Sphinx generates headerlink anchors (the paragraph symbol ¶) next to each heading. These appear as Â¶ in the output due to a UTF-8/Latin-1 decode mismatch in BeautifulSoup. Fix by adding .headerlink to NOISE_SELECTORS so the anchors are removed, and by stripping any residual ¶/Â¶ in clean_text().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -78,6 +78,7 @@ NOISE_SELECTORS = [
".rst-footer-buttons", "#edit-on-github",
"[role='navigation']", ".breadcrumbs",
".sidebar", ".sphinxsidebar",
".headerlink", # Sphinx ¶ permalink anchors
"script", "style",
]
@@ -248,6 +249,8 @@ def clean_text(soup: BeautifulSoup) -> tuple:
raw = "\n".join(lines)
clean = re.sub(r"\n{3,}", "\n\n", raw).strip()
# Strip residual Sphinx pilcrow characters (¶ and its mis-decoded form Â¶)
clean = re.sub(r"Â¶|¶", "", clean).strip()
return clean, headings
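
The decode mismatch described in the commit message can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: `strip_pilcrows` is a hypothetical helper that mirrors only the regex added to `clean_text()`, and the sample headings are made up for illustration.

```python
import re

PILCROW = "\u00b6"  # the pilcrow character Sphinx uses for headerlinks

# "¶" is two bytes in UTF-8 (0xC2 0xB6). Decoding those bytes as Latin-1
# turns each byte into its own character, producing "Â¶".
mojibake = PILCROW.encode("utf-8").decode("latin-1")


def strip_pilcrows(text: str) -> str:
    """Hypothetical helper: remove "¶" and its mis-decoded form "Â¶"."""
    return re.sub(r"Â¶|¶", "", text).strip()


if __name__ == "__main__":
    print(repr(mojibake))
    print(strip_pilcrows("Installation" + mojibake))
    print(strip_pilcrows("Usage" + PILCROW))
```

Listing the two-character form first in the alternation is a readability choice here; since the second alternative would remove the stray ¶ anyway, the mis-decoded form has to be matched explicitly only so that the leading Â is removed with it.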