fix: strip Sphinx pilcrow artifacts from extracted text
Sphinx generates a headerlink anchor (the pilcrow, ¶) next to each heading. These appear as Â¶ in the output due to a UTF-8/Latin-1 decode mismatch in BeautifulSoup: the two UTF-8 bytes of ¶ (0xC2 0xB6) are each read as a separate Latin-1 character. Fix by removing .headerlink elements via NOISE_SELECTORS and stripping any residual ¶/Â¶ in clean_text().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -78,6 +78,7 @@ NOISE_SELECTORS = [
     ".rst-footer-buttons", "#edit-on-github",
     "[role='navigation']", ".breadcrumbs",
     ".sidebar", ".sphinxsidebar",
+    ".headerlink", # Sphinx ¶ permalink anchors
     "script", "style",
 ]
 
@@ -248,6 +249,8 @@ def clean_text(soup: BeautifulSoup) -> tuple:
 
     raw = "\n".join(lines)
     clean = re.sub(r"\n{3,}", "\n\n", raw).strip()
+    # Strip residual Sphinx pilcrow characters (¶ and its mis-decoded form Â¶)
+    clean = re.sub(r"Â¶|¶", "", clean).strip()
     return clean, headings
 
 
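The mis-decoding named in the commit message can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: `mis_decode` and `strip_pilcrows` are hypothetical helper names, and the regular expression only mirrors the substitution added to `clean_text()`.

```python
import re

PILCROW = "\u00b6"  # the pilcrow character Sphinx uses for header links


def mis_decode(text: str) -> str:
    """Encode as UTF-8, then decode as Latin-1 (hypothetical helper that
    reproduces the decode mismatch described in the commit message)."""
    return text.encode("utf-8").decode("latin-1")


def strip_pilcrows(text: str) -> str:
    """Remove the pilcrow and its two-character mis-decoded form
    (hypothetical helper mirroring the re.sub call in the diff)."""
    return re.sub("\u00c2\u00b6|\u00b6", "", text).strip()


heading = "Installation" + PILCROW
garbled = mis_decode(heading)

# The single pilcrow becomes two characters: U+00C2 followed by U+00B6.
print([hex(ord(ch)) for ch in garbled[-2:]])

# Both the clean and the garbled heading reduce to the same text.
print(strip_pilcrows(heading))
print(strip_pilcrows(garbled))
```

Either order of the two alternatives strips both forms, because the regex engine tries every alternative at each position before moving on, so the two-character form is still matched as a unit.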