Initial commit: Odoo 18 RAG stack

Scraper, indexer, and FastAPI query service for Retrieval-Augmented Generation over Odoo 18 documentation. Uses Qdrant + Ollama (nomic-embed-text + llama3.1). Integrates with ActiveBlue PeerBus agent interface. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 11:25:55 -04:00
commit 7fb1573bac
10 changed files with 1295 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,18 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.venv/
+venv/
+.env
+
+# Data (scraped docs — too large for git, regenerate with scraper)
+data/raw/
+data/*.jsonl
+
+# Docker
+.docker/
+
+# OS
+.DS_Store
+Thumbs.db
--- a/16
+++ b/16
@@ -0,0 +1,16 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libxml2 libxslt1.1 curl \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY scraper/ ./scraper/
+COPY indexer/ ./indexer/
+COPY api/     ./api/
+
+CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
--- a/README.md
+++ b/README.md
@@ -0,0 +1,127 @@
+# odoo18-rag
+
+Retrieval-Augmented Generation over the full Odoo 18 documentation.
+Built for the ActiveBlue AI agent stack.
+
+## Stack
+
+| Component | What it does |
+|---|---|
+| `scraper/` | Crawls odoo.com/documentation/18.0, outputs clean JSONL |
+| `indexer/` | Chunks pages, embeds with `nomic-embed-text`, loads Qdrant |
+| `api/` | FastAPI — `/ask`, `/ask/stream`, `/agent/ask`, `/health` |
+| Qdrant | Vector database (Docker) |
+| Ollama @ `miaai:11434` | Embeddings + generation (local, HIPAA-safe) |
+
+## Quick start
+
+```bash
+# 1. Pull the embedding model on miaai
+ollama pull nomic-embed-text
+
+# 2. Start Qdrant + RAG API
+docker compose up -d
+
+# 3. Scrape the docs (~800 pages, ~20 min)
+docker compose run --rm scraper
+
+# 4. Index into Qdrant (~30-40 min)
+docker compose run --rm indexer
+
+# 5. Test
+curl http://localhost:8000/health
+curl -X POST http://localhost:8000/ask \
+  -H "Content-Type: application/json" \
+  -d '{"question": "How do I run a payroll batch in Odoo 18?"}'
+```
+
+## Endpoints
+
+| Method | Path | Description |
+|---|---|---|
+| GET | `/health` | Qdrant + Ollama connectivity |
+| GET | `/stats` | Vector count, models in use |
+| GET | `/modules` | List indexed Odoo modules |
+| POST | `/ask` | Blocking answer + sources |
+| POST | `/ask/stream` | SSE token stream |
+| POST | `/agent/ask` | ActiveBlue PeerBus integration |
+
+### Ask with module filter
+
+```bash
+curl -X POST http://localhost:8000/ask \
+  -H "Content-Type: application/json" \
+  -d '{"question": "How do reordering rules work?", "module": "inventory"}'
+```
+
+### Streaming
+
+```bash
+curl -N -X POST http://localhost:8000/ask/stream \
+  -H "Content-Type: application/json" \
+  -d '{"question": "Explain the Quote-to-Cash workflow"}'
+```
+
+## Agent integration
+
+```python
+from api.odoo_rag_agent import OdooRagAgent
+
+agent = OdooRagAgent(rag_url="http://localhost:8000")
+
+# Generic
+result = await agent.ask("How do I configure NACHA payments?")
+
+# Module-scoped
+result = await agent.ask_payroll("How do I generate a payslip batch?")
+result = await agent.ask_accounting("What is the chart of accounts?")
+result = await agent.ask_inventory("How does MTO work?")
+
+# Streaming
+async for token in agent.ask_stream("Explain the CRM pipeline"):
+    print(token, end="", flush=True)
+
+# PeerBus
+response = await agent.handle_peer_message({
+    "action": "ask",
+    "payload": {"question": "How do I set up taxes?", "module": "accounting"},
+    "request_id": "req-001"
+})
+```
+
+## Re-indexing
+
+Odoo releases doc updates regularly. Re-index to stay current:
+
+```bash
+docker compose run --rm scraper
+docker compose run --rm indexer python /app/indexer/indexer.py --reset
+```
+
+Or add a monthly cron on the host:
+
+```cron
+0 3 1 * * cd /opt/odoo18-rag && docker compose run --rm scraper && docker compose run --rm indexer python /app/indexer/indexer.py --reset
+```
+
+## Scraper options
+
+```bash
+# Single module only
+docker compose run --rm scraper python /app/scraper/scraper.py --module accounting
+
+# Quick test (first 50 pages)
+docker compose run --rm scraper python /app/scraper/scraper.py --limit 50
+```
+
+## Environment variables
+
+All configurable via `docker-compose.yml` environment section:
+
+| Variable | Default | Description |
+|---|---|---|
+| `OLLAMA_URL` | `http://miaai:11434` | Ollama endpoint |
+| `QDRANT_URL` | `http://qdrant:6333` | Qdrant endpoint |
+| `EMBED_MODEL` | `nomic-embed-text` | Embedding model |
+| `GEN_MODEL` | `llama3.1` | Generation model |
+| `COLLECTION_NAME` | `odoo18_docs` | Qdrant collection |
--- a/api/main.py
+++ b/api/main.py
@@ -0,0 +1,316 @@
+#!/usr/bin/env python3
+"""
+Odoo 18 RAG Query API
+======================
+FastAPI service — embeds the question, retrieves top-K chunks from Qdrant,
+builds a prompt, and streams or returns the answer from Ollama.
+
+Endpoints:
+    POST /ask           blocking answer + sources
+    POST /ask/stream    Server-Sent Events token stream
+    POST /agent/ask     ActiveBlue AI agent integration
+    GET  /health        connectivity check
+    GET  /modules       list indexed modules
+    GET  /stats         collection stats
+
+Run:
+    uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
+"""
+
+import json
+import logging
+import os
+
+import httpx
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import StreamingResponse
+from pydantic import BaseModel, Field
+from qdrant_client import QdrantClient
+from qdrant_client.models import Filter, FieldCondition, MatchValue
+from typing import AsyncIterator
+
+logging.basicConfig(level=logging.INFO)
+log = logging.getLogger("odoo18_rag")
+
+OLLAMA_URL      = os.getenv("OLLAMA_URL",      "http://miaai:11434")
+QDRANT_URL      = os.getenv("QDRANT_URL",      "http://qdrant:6333")
+EMBED_MODEL     = os.getenv("EMBED_MODEL",     "nomic-embed-text")
+GEN_MODEL       = os.getenv("GEN_MODEL",       "llama3.1")
+COLLECTION_NAME = os.getenv("COLLECTION_NAME", "odoo18_docs")
+TOP_K           = 6
+MAX_CONTEXT     = 4000
+
+SYSTEM_PROMPT = """\
+You are an expert Odoo 18 consultant for ActiveBlue LLC, an MSP serving \
+medical and dental practices. You have deep knowledge of all Odoo 18 modules: \
+Finance, Accounting, Inventory, Manufacturing, Purchase, Sales, CRM, HR, \
+Payroll, eCommerce, Helpdesk, Project, and more.
+
+Answer questions clearly and concisely using the provided documentation context. \
+Use numbered steps when explaining procedures. Always mention the Odoo menu path \
+when explaining navigation. If the context doesn't cover the question fully, say \
+so and answer from general knowledge.\
+"""
+
+
+# ── Models ────────────────────────────────────────────────────────────────────
+
+class AskRequest(BaseModel):
+    question:    str   = Field(..., min_length=5, max_length=2000)
+    module:      str   | None = Field(None, description="Filter to one Odoo module")
+    model:       str   | None = Field(None, description="Override the LLM model")
+    top_k:       int   = Field(TOP_K, ge=1, le=20)
+    temperature: float = Field(0.3, ge=0.0, le=1.0)
+
+
+class Source(BaseModel):
+    url:     str
+    title:   str
+    module:  str
+    section: str
+
+
+class AskResponse(BaseModel):
+    answer:   str
+    sources:  list[Source]
+    model:    str
+    question: str
+
+
+# ── App ───────────────────────────────────────────────────────────────────────
+
+app = FastAPI(
+    title="Odoo 18 RAG API",
+    description="Retrieval-Augmented Generation over Odoo 18 documentation",
+    version="1.0.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+qdrant = QdrantClient(url=QDRANT_URL)
+
+
+# ── Helpers ───────────────────────────────────────────────────────────────────
+
+async def embed_query(text: str) -> list:
+    async with httpx.AsyncClient(timeout=30) as client:
+        resp = await client.post(
+            f"{OLLAMA_URL}/api/embed",
+            json={"model": EMBED_MODEL, "input": [text]},
+        )
+        resp.raise_for_status()
+        embeddings = resp.json().get("embeddings", [])
+        if not embeddings:
+            raise HTTPException(500, "Empty embedding response from Ollama")
+        return embeddings[0]
+
+
+def retrieve(vector: list, top_k: int, module: str | None) -> list:
+    query_filter = None
+    if module:
+        query_filter = Filter(
+            must=[FieldCondition(key="module", match=MatchValue(value=module))]
+        )
+    results = qdrant.search(
+        collection_name=COLLECTION_NAME,
+        query_vector=vector,
+        limit=top_k,
+        query_filter=query_filter,
+        with_payload=True,
+    )
+    return [hit.payload for hit in results]
+
+
+def build_prompt(question: str, chunks: list) -> str:
+    context_parts = []
+    char_count = 0
+    for i, chunk in enumerate(chunks, 1):
+        block = (
+            f"[Source {i}: {chunk.get('title', '')} | {chunk.get('section', '')}]\n"
+            f"{chunk.get('text', '')}\n"
+            f"URL: {chunk.get('url', '')}\n"
+        )
+        if char_count + len(block) > MAX_CONTEXT:
+            break
+        context_parts.append(block)
+        char_count += len(block)
+
+    return (
+        f"{SYSTEM_PROMPT}\n\n"
+        f"## Relevant documentation\n\n"
+        f"{'---'.join(context_parts)}\n\n"
+        f"---\n\n"
+        f"## Question\n\n{question}\n\n"
+        f"## Answer\n"
+    )
+
+
+def dedupe_sources(chunks: list) -> list[Source]:
+    seen = set()
+    sources = []
+    for chunk in chunks:
+        url = chunk.get("url", "")
+        if url not in seen:
+            seen.add(url)
+            sources.append(Source(
+                url=url,
+                title=chunk.get("title", ""),
+                module=chunk.get("module", ""),
+                section=chunk.get("section", ""),
+            ))
+    return sources
+
+
+async def generate_blocking(prompt: str, model: str, temperature: float) -> str:
+    async with httpx.AsyncClient(timeout=120) as client:
+        resp = await client.post(
+            f"{OLLAMA_URL}/api/generate",
+            json={
+                "model": model,
+                "prompt": prompt,
+                "stream": False,
+                "options": {"temperature": temperature, "num_ctx": 8192},
+            },
+        )
+        resp.raise_for_status()
+        return resp.json().get("response", "").strip()
+
+
+async def generate_stream(prompt: str, model: str, temperature: float) -> AsyncIterator[str]:
+    async with httpx.AsyncClient(timeout=120) as client:
+        async with client.stream(
+            "POST",
+            f"{OLLAMA_URL}/api/generate",
+            json={
+                "model": model,
+                "prompt": prompt,
+                "stream": True,
+                "options": {"temperature": temperature, "num_ctx": 8192},
+            },
+        ) as resp:
+            async for line in resp.aiter_lines():
+                if line.strip():
+                    try:
+                        data = json.loads(line)
+                        token = data.get("response", "")
+                        if token:
+                            yield token
+                        if data.get("done"):
+                            break
+                    except json.JSONDecodeError:
+                        continue
+
+
+# ── Endpoints ─────────────────────────────────────────────────────────────────
+
+@app.get("/health")
+async def health():
+    status = {"api": "ok", "qdrant": "unknown", "ollama": "unknown"}
+    try:
+        info = qdrant.get_collection(COLLECTION_NAME)
+        status["qdrant"] = f"ok ({info.points_count} vectors)"
+    except Exception as e:
+        status["qdrant"] = f"error: {e}"
+    try:
+        async with httpx.AsyncClient(timeout=5) as client:
+            resp = await client.get(f"{OLLAMA_URL}/api/tags")
+            models = [m["name"] for m in resp.json().get("models", [])]
+            status["ollama"] = f"ok ({len(models)} models)"
+    except Exception as e:
+        status["ollama"] = f"error: {e}"
+    return status
+
+
+@app.get("/modules")
+async def list_modules():
+    try:
+        result = qdrant.scroll(collection_name=COLLECTION_NAME, limit=1000, with_payload=["module"])
+        modules = sorted(set(p.payload.get("module", "general") for p in result[0]))
+        return {"modules": modules}
+    except Exception as e:
+        raise HTTPException(500, str(e))
+
+
+@app.get("/stats")
+async def stats():
+    try:
+        info = qdrant.get_collection(COLLECTION_NAME)
+        return {
+            "collection":   COLLECTION_NAME,
+            "vectors":      info.points_count,
+            "vector_size":  768,
+            "embed_model":  EMBED_MODEL,
+            "gen_model":    GEN_MODEL,
+        }
+    except Exception as e:
+        raise HTTPException(500, str(e))
+
+
+@app.post("/ask", response_model=AskResponse)
+async def ask(req: AskRequest):
+    model = req.model or GEN_MODEL
+
+    try:
+        vector = await embed_query(req.question)
+    except Exception as e:
+        raise HTTPException(500, f"Embedding failed: {e}")
+
+    chunks = retrieve(vector, req.top_k, req.module)
+    if not chunks:
+        raise HTTPException(404, "No relevant documentation found.")
+
+    prompt = build_prompt(req.question, chunks)
+
+    try:
+        answer = await generate_blocking(prompt, model, req.temperature)
+    except Exception as e:
+        raise HTTPException(500, f"Generation failed: {e}")
+
+    return AskResponse(
+        answer=answer,
+        sources=dedupe_sources(chunks),
+        model=model,
+        question=req.question,
+    )
+
+
+@app.post("/ask/stream")
+async def ask_stream(req: AskRequest):
+    model = req.model or GEN_MODEL
+
+    try:
+        vector = await embed_query(req.question)
+    except Exception as e:
+        raise HTTPException(500, f"Embedding failed: {e}")
+
+    chunks = retrieve(vector, req.top_k, req.module)
+    if not chunks:
+        raise HTTPException(404, "No relevant documentation found.")
+
+    prompt = build_prompt(req.question, chunks)
+    sources = [s.model_dump() for s in dedupe_sources(chunks)]
+
+    async def sse():
+        async for token in generate_stream(prompt, model, req.temperature):
+            yield f"data: {json.dumps({'type': 'token', 'content': token})}\n\n"
+        yield f"data: {json.dumps({'type': 'sources', 'sources': sources})}\n\n"
+        yield "data: [DONE]\n\n"
+
+    return StreamingResponse(sse(), media_type="text/event-stream")
+
+
+@app.post("/agent/ask")
+async def agent_ask(req: AskRequest):
+    """ActiveBlue AI agent endpoint — compatible with PeerBus message format."""
+    result = await ask(req)
+    return {
+        "answer":         result.answer,
+        "sources":        [s.url for s in result.sources],
+        "module_context": req.module,
+        "model_used":     result.model,
+    }
--- a/api/odoo_rag_agent.py
+++ b/api/odoo_rag_agent.py
@@ -0,0 +1,147 @@
+"""
+ActiveBlue AI Agent — Odoo 18 RAG Specialist
+=============================================
+Drop-in specialist agent for the ActiveBlue AI system.
+Implements the PeerBus interface defined in ACTIVEBLUE_AI_SPEC.md.
+
+Usage:
+    from api.odoo_rag_agent import OdooRagAgent
+
+    agent = OdooRagAgent(rag_url="http://localhost:8000")
+    result = await agent.ask("How do I run a payroll batch?")
+    print(result["answer"])
+"""
+
+import json
+import httpx
+import logging
+from typing import AsyncIterator
+
+log = logging.getLogger(__name__)
+
+
+class OdooRagAgent:
+    name        = "odoo18_rag"
+    description = "Answers Odoo 18 questions using RAG over official documentation"
+    capabilities = [
+        "odoo_how_to",
+        "odoo_configuration",
+        "odoo_troubleshooting",
+        "odoo_workflow",
+    ]
+    privacy_mode = "local"   # uses local Ollama — HIPAA safe
+
+    def __init__(
+        self,
+        rag_url: str = "http://localhost:8000",
+        timeout: int = 120,
+        default_model: str | None = None,
+    ):
+        self.rag_url       = rag_url.rstrip("/")
+        self.timeout       = timeout
+        self.default_model = default_model
+
+    async def ask(
+        self,
+        question: str,
+        module: str | None = None,
+        top_k: int = 6,
+        temperature: float = 0.3,
+    ) -> dict:
+        payload = {"question": question, "top_k": top_k, "temperature": temperature}
+        if module:
+            payload["module"] = module
+        if self.default_model:
+            payload["model"] = self.default_model
+
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            resp = await client.post(f"{self.rag_url}/ask", json=payload)
+            resp.raise_for_status()
+            return resp.json()
+
+    async def ask_stream(
+        self,
+        question: str,
+        module: str | None = None,
+        top_k: int = 6,
+        temperature: float = 0.3,
+    ) -> AsyncIterator[str]:
+        payload = {"question": question, "top_k": top_k, "temperature": temperature}
+        if module:
+            payload["module"] = module
+
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            async with client.stream("POST", f"{self.rag_url}/ask/stream", json=payload) as resp:
+                async for line in resp.aiter_lines():
+                    if line.startswith("data: "):
+                        data_str = line[6:]
+                        if data_str == "[DONE]":
+                            break
+                        try:
+                            data = json.loads(data_str)
+                            if data.get("type") == "token":
+                                yield data["content"]
+                            elif data.get("type") == "sources":
+                                yield json.dumps(data)
+                        except json.JSONDecodeError:
+                            continue
+
+    async def handle_peer_message(self, message: dict) -> dict:
+        """PeerBus message handler for the ActiveBlue Master AI."""
+        action  = message.get("action")
+        payload = message.get("payload", {})
+        req_id  = message.get("request_id")
+
+        if action == "ask":
+            result = await self.ask(
+                question    = payload.get("question", ""),
+                module      = payload.get("module"),
+                top_k       = payload.get("top_k", 6),
+                temperature = payload.get("temperature", 0.3),
+            )
+            return {"request_id": req_id, "agent": self.name, "status": "ok", "result": result}
+
+        elif action == "capabilities":
+            return {
+                "request_id":  req_id,
+                "agent":       self.name,
+                "capabilities": self.capabilities,
+                "description": self.description,
+                "privacy_mode": self.privacy_mode,
+            }
+
+        elif action == "health":
+            return await self.health()
+
+        return {"request_id": req_id, "agent": self.name, "status": "error", "error": f"Unknown action: {action}"}
+
+    async def health(self) -> dict:
+        try:
+            async with httpx.AsyncClient(timeout=5) as client:
+                resp = await client.get(f"{self.rag_url}/health")
+                return {"agent": self.name, "status": "ok", "rag": resp.json()}
+        except Exception as e:
+            return {"agent": self.name, "status": "error", "error": str(e)}
+
+    # ── Module convenience wrappers ───────────────────────────────────────────
+
+    async def ask_accounting(self, question: str) -> dict:
+        return await self.ask(question, module="accounting")
+
+    async def ask_payroll(self, question: str) -> dict:
+        return await self.ask(question, module="payroll")
+
+    async def ask_inventory(self, question: str) -> dict:
+        return await self.ask(question, module="inventory")
+
+    async def ask_crm(self, question: str) -> dict:
+        return await self.ask(question, module="crm")
+
+    async def ask_hr(self, question: str) -> dict:
+        return await self.ask(question, module="employees")
+
+    async def ask_manufacturing(self, question: str) -> dict:
+        return await self.ask(question, module="manufacturing")
+
+    async def ask_helpdesk(self, question: str) -> dict:
+        return await self.ask(question, module="helpdesk")
--- a/data/.gitkeep
+++ b/data/.gitkeep
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -0,0 +1,102 @@
+version: "3.9"
+
+# ─── Odoo 18 RAG Stack ────────────────────────────────────────────────────────
+# rag-api:8000  ──► qdrant:6333   (internal docker network)
+# rag-api       ──► miaai:11434   (direct outbound to Ollama)
+#
+# Usage:
+#   docker compose up -d                  # start Qdrant + RAG API
+#   docker compose logs -f rag-api        # follow API logs
+#   docker compose run --rm scraper       # scrape Odoo 18 docs
+#   docker compose run --rm indexer       # embed + load into Qdrant
+#   docker compose run --rm indexer python /app/indexer/indexer.py --reset
+
+services:
+
+  qdrant:
+    image: qdrant/qdrant:v1.9.0
+    container_name: odoo18-qdrant
+    restart: unless-stopped
+    volumes:
+      - qdrant_storage:/qdrant/storage
+    ports:
+      - "6333:6333"
+      - "6334:6334"
+    environment:
+      QDRANT__SERVICE__HTTP_PORT: 6333
+      QDRANT__SERVICE__GRPC_PORT: 6334
+      QDRANT__LOG_LEVEL: INFO
+    networks:
+      - rag_net
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  rag-api:
+    build: .
+    container_name: odoo18-rag-api
+    restart: unless-stopped
+    depends_on:
+      qdrant:
+        condition: service_healthy
+    ports:
+      - "8000:8000"
+    environment:
+      OLLAMA_URL: "http://miaai:11434"
+      QDRANT_URL: "http://qdrant:6333"
+      COLLECTION_NAME: "odoo18_docs"
+      EMBED_MODEL: "nomic-embed-text"
+      GEN_MODEL: "llama3.1"
+      LOG_LEVEL: "INFO"
+    volumes:
+      - ./data:/app/data
+    extra_hosts:
+      - "miaai:192.168.2.9"
+    networks:
+      - rag_net
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  scraper:
+    build: .
+    container_name: odoo18-scraper
+    profiles: ["scraper"]
+    command: python /app/scraper/scraper.py
+    volumes:
+      - ./data:/app/data
+    networks:
+      - rag_net
+    environment:
+      PYTHONUNBUFFERED: "1"
+
+  indexer:
+    build: .
+    container_name: odoo18-indexer
+    profiles: ["indexer"]
+    command: python /app/indexer/indexer.py
+    depends_on:
+      qdrant:
+        condition: service_healthy
+    volumes:
+      - ./data:/app/data
+    extra_hosts:
+      - "miaai:192.168.2.9"
+    networks:
+      - rag_net
+    environment:
+      OLLAMA_URL: "http://miaai:11434"
+      QDRANT_URL: "http://qdrant:6333"
+      PYTHONUNBUFFERED: "1"
+
+networks:
+  rag_net:
+    driver: bridge
+
+volumes:
+  qdrant_storage:
+    name: odoo18_qdrant_storage
--- a/indexer/indexer.py
+++ b/indexer/indexer.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+"""
+Odoo 18 RAG Indexer
+====================
+Reads scraped pages, chunks them, embeds with nomic-embed-text via Ollama,
+and upserts into Qdrant.
+
+Usage:
+    python indexer.py               # index everything
+    python indexer.py --reset       # drop collection and re-index
+    python indexer.py --module accounting
+
+Requires:
+    - Qdrant running:  docker compose up -d qdrant
+    - Ollama with model pulled:  ollama pull nomic-embed-text
+"""
+
+import json
+import logging
+import argparse
+import hashlib
+import time
+import os
+from pathlib import Path
+from dataclasses import dataclass
+
+import requests
+from qdrant_client import QdrantClient
+from qdrant_client.models import (
+    Distance, VectorParams, PointStruct,
+    Filter, FieldCondition, MatchValue,
+)
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s  %(message)s")
+log = logging.getLogger(__name__)
+
+OLLAMA_URL      = os.getenv("OLLAMA_URL", "http://miaai:11434")
+QDRANT_URL      = os.getenv("QDRANT_URL", "http://localhost:6333")
+EMBED_MODEL     = os.getenv("EMBED_MODEL", "nomic-embed-text")
+COLLECTION_NAME = os.getenv("COLLECTION_NAME", "odoo18_docs")
+VECTOR_SIZE     = 768
+RAW_DATA_FILE   = Path("../data/raw/odoo18_docs_raw.jsonl")
+BATCH_SIZE      = 32
+CHUNK_SIZE      = 512
+CHUNK_OVERLAP   = 64
+UPSERT_BATCH    = 100
+
+
+@dataclass
+class Chunk:
+    chunk_id: str
+    doc_id: str
+    url: str
+    title: str
+    module: str
+    section: str
+    headings: list
+    text: str
+    chunk_index: int
+
+
+def split_text(text: str) -> list:
+    target_words = int(CHUNK_SIZE * 0.75)
+    overlap_words = int(CHUNK_OVERLAP * 0.75)
+
+    sentences = []
+    current = []
+    for word in text.split():
+        current.append(word)
+        if word.endswith((".", "?", "!", ":\n", "\n\n")):
+            sentences.append(" ".join(current))
+            current = []
+    if current:
+        sentences.append(" ".join(current))
+
+    chunks = []
+    buffer_words = []
+    buffer_count = 0
+
+    for sentence in sentences:
+        s_words = sentence.split()
+        s_count = len(s_words)
+        if buffer_count + s_count > target_words and buffer_words:
+            chunks.append(" ".join(buffer_words))
+            overlap_slice = buffer_words[-overlap_words:] if overlap_words else []
+            buffer_words = overlap_slice + s_words
+            buffer_count = len(buffer_words)
+        else:
+            buffer_words.extend(s_words)
+            buffer_count += s_count
+
+    if buffer_words:
+        chunks.append(" ".join(buffer_words))
+
+    return [c for c in chunks if len(c.strip()) > 80]
+
+
+def chunk_page(page: dict) -> list:
+    text_chunks = split_text(page["text"])
+    if not text_chunks:
+        return []
+
+    chunks = []
+    for idx, text in enumerate(text_chunks):
+        chunk_id = hashlib.sha256(f"{page['doc_id']}_{idx}".encode()).hexdigest()[:20]
+        chunks.append(Chunk(
+            chunk_id=chunk_id,
+            doc_id=page["doc_id"],
+            url=page["url"],
+            title=page["title"],
+            module=page.get("module", "general"),
+            section=page.get("section", ""),
+            headings=page.get("headings", []),
+            text=text,
+            chunk_index=idx,
+        ))
+    return chunks
+
+
+def embed_batch(texts: list) -> list:
+    resp = requests.post(
+        f"{OLLAMA_URL}/api/embed",
+        json={"model": EMBED_MODEL, "input": texts},
+        timeout=120,
+    )
+    resp.raise_for_status()
+    embeddings = resp.json().get("embeddings", [])
+    if len(embeddings) != len(texts):
+        raise ValueError(f"Expected {len(texts)} embeddings, got {len(embeddings)}")
+    return embeddings
+
+
+def check_ollama() -> bool:
+    try:
+        resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
+        models = [m["name"] for m in resp.json().get("models", [])]
+        if not any(EMBED_MODEL in m for m in models):
+            log.error(f"Model '{EMBED_MODEL}' not found. Run: ollama pull {EMBED_MODEL}")
+            return False
+        log.info(f"Ollama OK at {OLLAMA_URL} — model {EMBED_MODEL} ready")
+        return True
+    except Exception as e:
+        log.error(f"Ollama unreachable at {OLLAMA_URL}: {e}")
+        return False
+
+
+def setup_collection(client: QdrantClient, reset: bool = False):
+    exists = client.collection_exists(COLLECTION_NAME)
+
+    if exists and reset:
+        log.info(f"Dropping collection '{COLLECTION_NAME}'...")
+        client.delete_collection(COLLECTION_NAME)
+        exists = False
+
+    if not exists:
+        log.info(f"Creating collection '{COLLECTION_NAME}' (dim={VECTOR_SIZE})...")
+        client.create_collection(
+            collection_name=COLLECTION_NAME,
+            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
+        )
+        client.create_payload_index(COLLECTION_NAME, field_name="module", field_schema="keyword")
+        client.create_payload_index(COLLECTION_NAME, field_name="url",    field_schema="keyword")
+    else:
+        info = client.get_collection(COLLECTION_NAME)
+        log.info(f"Collection '{COLLECTION_NAME}' exists ({info.points_count} points)")
+
+
+def upsert_chunks(client: QdrantClient, chunks: list, vectors: list):
+    points = []
+    for chunk, vector in zip(chunks, vectors):
+        points.append(PointStruct(
+            id=int(chunk.chunk_id[:8], 16),
+            vector=vector,
+            payload={
+                "chunk_id":    chunk.chunk_id,
+                "doc_id":      chunk.doc_id,
+                "url":         chunk.url,
+                "title":       chunk.title,
+                "module":      chunk.module,
+                "section":     chunk.section,
+                "headings":    chunk.headings,
+                "text":        chunk.text,
+                "chunk_index": chunk.chunk_index,
+            },
+        ))
+    for i in range(0, len(points), UPSERT_BATCH):
+        client.upsert(collection_name=COLLECTION_NAME, points=points[i:i + UPSERT_BATCH])
+
+
+def index(module_filter: str | None = None, reset: bool = False):
+    if not check_ollama():
+        raise SystemExit(1)
+
+    if not RAW_DATA_FILE.exists():
+        raise FileNotFoundError(
+            f"Raw data not found: {RAW_DATA_FILE}\n"
+            f"Run the scraper first: docker compose run --rm scraper"
+        )
+
+    client = QdrantClient(url=QDRANT_URL)
+    setup_collection(client, reset=reset)
+
+    pages = []
+    with open(RAW_DATA_FILE, encoding="utf-8") as f:
+        for line in f:
+            page = json.loads(line.strip())
+            if module_filter and page.get("module") != module_filter:
+                continue
+            pages.append(page)
+    log.info(f"Loaded {len(pages)} pages")
+
+    all_chunks = []
+    for page in pages:
+        all_chunks.extend(chunk_page(page))
+    log.info(f"Created {len(all_chunks)} chunks")
+
+    total = len(all_chunks)
+    embedded = failed = 0
+
+    for i in range(0, total, BATCH_SIZE):
+        batch = all_chunks[i:i + BATCH_SIZE]
+        try:
+            vectors = embed_batch([c.text for c in batch])
+            upsert_chunks(client, batch, vectors)
+            embedded += len(batch)
+            log.info(f"Progress: {embedded}/{total} ({embedded/total*100:.0f}%)")
+        except Exception as e:
+            log.error(f"Batch {i//BATCH_SIZE} failed: {e}")
+            failed += len(batch)
+            time.sleep(2)
+
+    info = client.get_collection(COLLECTION_NAME)
+    log.info(
+        f"\n✅ Done. Embedded: {embedded}, Failed: {failed}\n"
+        f"   Total vectors in Qdrant: {info.points_count}"
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Odoo 18 RAG indexer")
+    parser.add_argument("--module", help="Index only one module")
+    parser.add_argument("--reset", action="store_true", help="Drop and recreate collection")
+    args = parser.parse_args()
+    index(module_filter=args.module, reset=args.reset)
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,9 @@
+fastapi==0.111.0
+uvicorn[standard]==0.29.0
+httpx==0.27.0
+pydantic==2.7.0
+qdrant-client==1.9.0
+requests==2.31.0
+beautifulsoup4==4.12.3
+lxml==5.2.1
+python-dotenv==1.0.1
--- a/scraper/scraper.py
+++ b/scraper/scraper.py
@@ -0,0 +1,316 @@
+#!/usr/bin/env python3
+"""
+Odoo 18 Documentation Scraper
+==============================
+Crawls the Odoo 18 docs sitemap, extracts clean text from each page,
+and saves structured JSON ready for the indexer.
+
+Usage:
+    python scraper.py                      # full crawl
+    python scraper.py --module accounting  # single module
+    python scraper.py --limit 50           # test run
+
+Output: ../data/raw/odoo18_docs_raw.jsonl
+"""
+
+import json
+import time
+import re
+import argparse
+import hashlib
+import logging
+from pathlib import Path
+from urllib.parse import urljoin
+from dataclasses import dataclass, asdict
+
+import requests
+from bs4 import BeautifulSoup
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s  %(message)s")
+log = logging.getLogger(__name__)
+
+BASE_URL        = "https://www.odoo.com/documentation/18.0"
+SITEMAP_URL     = f"{BASE_URL}/sitemap.xml"
+OUTPUT_DIR      = Path("../data/raw")
+OUTPUT_FILE     = OUTPUT_DIR / "odoo18_docs_raw.jsonl"
+DELAY_SECONDS   = 1.2
+MAX_RETRIES     = 3
+REQUEST_TIMEOUT = 20
+
+HEADERS = {
+    "User-Agent": (
+        "Mozilla/5.0 (compatible; ActiveBlue-RAG-Indexer/1.0; "
+        "+https://activeblue.net)"
+    ),
+}
+
+MODULE_PATHS = {
+    "accounting":    "/applications/finance/accounting",
+    "invoicing":     "/applications/finance",
+    "inventory":     "/applications/inventory_and_mrp/inventory",
+    "purchase":      "/applications/inventory_and_mrp/purchase",
+    "manufacturing": "/applications/inventory_and_mrp/manufacturing",
+    "sales":         "/applications/sales/sales",
+    "crm":           "/applications/sales/crm",
+    "employees":     "/applications/hr/employees",
+    "payroll":       "/applications/hr/payroll",
+    "timesheets":    "/applications/services/timesheets",
+    "project":       "/applications/services/project",
+    "helpdesk":      "/applications/services/helpdesk",
+    "ecommerce":     "/applications/websites/ecommerce",
+    "website":       "/applications/websites/website",
+    "marketing":     "/applications/marketing",
+    "pos":           "/applications/sales/point_of_sale",
+    "quality":       "/applications/inventory_and_mrp/quality",
+    "maintenance":   "/applications/inventory_and_mrp/maintenance",
+    "fleet":         "/applications/hr/fleet",
+    "discuss":       "/applications/productivity/discuss",
+    "studio":        "/applications/studio",
+    "general":       "/applications/general",
+    "install":       "/administration",
+}
+
+NOISE_SELECTORS = [
+    "nav", "footer", "header", ".toctree-wrapper",
+    ".wy-nav-side", ".wy-menu", ".wy-side-nav-search",
+    ".rst-footer-buttons", "#edit-on-github",
+    "[role='navigation']", ".breadcrumbs",
+    ".sidebar", ".sphinxsidebar",
+    "script", "style",
+]
+
+
+@dataclass
+class DocPage:
+    url: str
+    title: str
+    module: str
+    section: str
+    text: str
+    headings: list
+    doc_id: str
+
+
+def fetch_sitemap_urls(sitemap_url: str, module_filter: str | None) -> list:
+    log.info(f"Fetching sitemap: {sitemap_url}")
+    resp = requests.get(sitemap_url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
+    resp.raise_for_status()
+    soup = BeautifulSoup(resp.text, "xml")
+    all_urls = [loc.text.strip() for loc in soup.find_all("loc")]
+    urls = [
+        u for u in all_urls
+        if "/18.0/" in u or "/documentation/18.0" in u
+        if not any(f"/{lang}/" in u for lang in ["fr", "de", "es", "pt", "nl", "zh"])
+    ]
+    if module_filter:
+        path = MODULE_PATHS.get(module_filter)
+        if not path:
+            raise ValueError(f"Unknown module '{module_filter}'. Choose from: {', '.join(MODULE_PATHS)}")
+        urls = [u for u in urls if path in u]
+        log.info(f"Module filter '{module_filter}': {len(urls)} pages")
+    else:
+        log.info(f"Total pages: {len(urls)}")
+    return urls
+
+
+def fallback_urls() -> list:
+    """Curated fallback list if sitemap is unavailable."""
+    paths = [
+        "/applications/finance/accounting.html",
+        "/applications/finance/accounting/customer_invoices.html",
+        "/applications/finance/accounting/customer_invoices/overview.html",
+        "/applications/finance/accounting/vendor_bills.html",
+        "/applications/finance/accounting/get_started/chart_of_accounts.html",
+        "/applications/finance/accounting/get_started/cheat_sheet.html",
+        "/applications/finance/accounting/get_started/multi_currency.html",
+        "/applications/finance/accounting/reporting/budget.html",
+        "/applications/finance/accounting/reporting/analytic_accounting.html",
+        "/applications/finance/accounting/bank.html",
+        "/applications/finance/accounting/taxes.html",
+        "/applications/finance/accounting/reporting.html",
+        "/applications/finance/expenses.html",
+        "/applications/finance/expenses/reinvoice_expenses.html",
+        "/applications/finance/payment_providers.html",
+        "/applications/finance.html",
+        "/applications/sales.html",
+        "/applications/sales/sales.html",
+        "/applications/sales/crm.html",
+        "/applications/sales/crm/pipeline.html",
+        "/applications/sales/crm/acquire_leads/email_manual.html",
+        "/applications/sales/crm/pipeline/manage_sales_teams.html",
+        "/applications/sales/crm/optimize/utilize_activities.html",
+        "/applications/inventory_and_mrp/inventory.html",
+        "/applications/inventory_and_mrp/inventory/warehouses_storage/replenishment.html",
+        "/applications/inventory_and_mrp/inventory/warehouses_storage/replenishment/mto.html",
+        "/applications/inventory_and_mrp/inventory/warehouses_storage/replenishment/reordering_rules.html",
+        "/applications/inventory_and_mrp/inventory/shipping_receiving/daily_operations.html",
+        "/applications/inventory_and_mrp/purchase.html",
+        "/applications/inventory_and_mrp/purchase/manage_deals/rfq.html",
+        "/applications/inventory_and_mrp/purchase/manage_deals/manage.html",
+        "/applications/inventory_and_mrp/purchase/manage_deals/blanket_orders.html",
+        "/applications/inventory_and_mrp/purchase/manage_deals/calls_for_tenders.html",
+        "/applications/inventory_and_mrp/manufacturing.html",
+        "/applications/inventory_and_mrp/manufacturing/workflows.html",
+        "/applications/inventory_and_mrp/manufacturing/workflows/use_mps.html",
+        "/applications/inventory_and_mrp/manufacturing/workflows/manufacturing_backorders.html",
+        "/applications/inventory_and_mrp/manufacturing/subcontracting.html",
+        "/applications/inventory_and_mrp/manufacturing/advanced_configuration/kit_shipping.html",
+        "/applications/hr.html",
+        "/applications/hr/employees.html",
+        "/applications/hr/employees/new_employee.html",
+        "/applications/hr/payroll.html",
+        "/applications/hr/payroll/contracts.html",
+        "/applications/hr/payroll/payslips.html",
+        "/applications/hr/payroll/batches.html",
+        "/applications/websites/ecommerce.html",
+        "/applications/websites/ecommerce/products.html",
+        "/applications/websites/ecommerce/checkout_payment_shipping/checkout.html",
+        "/applications/websites/ecommerce/checkout_payment_shipping/payments.html",
+        "/applications/websites/ecommerce/customer_accounts.html",
+        "/applications/services/helpdesk.html",
+        "/applications/services/helpdesk/advanced/after_sales.html",
+        "/applications/services/project.html",
+        "/applications/finance/fiscal_localizations/united_states.html",
+        "/applications.html",
+        "/applications/general.html",
+    ]
+    return [urljoin(BASE_URL, p) for p in paths]
+
+
+def infer_module(url: str) -> str:
+    for module, path in MODULE_PATHS.items():
+        if path.lstrip("/") in url:
+            return module
+    return "general"
+
+
+def extract_section(soup: BeautifulSoup) -> str:
+    bc = soup.select(".breadcrumbs a, .wy-breadcrumbs a, nav[aria-label='breadcrumb'] a")
+    if bc:
+        return " > ".join(a.get_text(strip=True) for a in bc if a.get_text(strip=True))
+    h1 = soup.find("h1")
+    return h1.get_text(strip=True) if h1 else "Odoo 18 Docs"
+
+
+def clean_text(soup: BeautifulSoup) -> tuple:
+    for sel in NOISE_SELECTORS:
+        for el in soup.select(sel):
+            el.decompose()
+
+    content = (
+        soup.find("div", {"class": "document"})
+        or soup.find("article")
+        or soup.find("main")
+        or soup.find("div", {"role": "main"})
+        or soup.find("body")
+    )
+    if not content:
+        return "", []
+
+    headings = []
+    lines = []
+
+    for el in content.descendants:
+        if not hasattr(el, "name"):
+            continue
+        if el.name in ("h1", "h2", "h3", "h4"):
+            text = el.get_text(strip=True)
+            if text:
+                prefix = "#" * int(el.name[1])
+                lines.append(f"\n{prefix} {text}\n")
+                if el.name in ("h2", "h3"):
+                    headings.append(text)
+        elif el.name == "p":
+            text = el.get_text(separator=" ", strip=True)
+            if text and len(text) > 20:
+                lines.append(text)
+        elif el.name == "li":
+            text = el.get_text(separator=" ", strip=True)
+            if text and len(text) > 5:
+                lines.append(f"- {text}")
+        elif el.name == "code":
+            text = el.get_text(strip=True)
+            if text:
+                lines.append(f"`{text}`")
+
+    raw = "\n".join(lines)
+    clean = re.sub(r"\n{3,}", "\n\n", raw).strip()
+    return clean, headings
+
+
+def fetch_page(url: str) -> DocPage | None:
+    for attempt in range(MAX_RETRIES):
+        try:
+            resp = requests.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
+            if resp.status_code == 404:
+                log.warning(f"404: {url}")
+                return None
+            resp.raise_for_status()
+
+            soup = BeautifulSoup(resp.text, "html.parser")
+            title_tag = soup.find("title")
+            title = title_tag.get_text(strip=True) if title_tag else url
+            title = re.sub(r"\s*—\s*Odoo.*", "", title).strip()
+
+            text, headings = clean_text(soup)
+            if len(text) < 100:
+                return None
+
+            return DocPage(
+                url=url,
+                title=title,
+                module=infer_module(url),
+                section=extract_section(soup),
+                text=text,
+                headings=headings,
+                doc_id=hashlib.sha256(url.encode()).hexdigest()[:16],
+            )
+        except requests.RequestException as e:
+            if attempt < MAX_RETRIES - 1:
+                wait = 2 ** attempt
+                log.warning(f"Retry {attempt+1} for {url}: {e} (wait {wait}s)")
+                time.sleep(wait)
+            else:
+                log.error(f"Failed: {url}: {e}")
+                return None
+
+
+def crawl(module: str | None = None, limit: int | None = None):
+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+    try:
+        urls = fetch_sitemap_urls(SITEMAP_URL, module)
+    except Exception as e:
+        log.warning(f"Sitemap unavailable ({e}), using fallback list")
+        urls = fallback_urls()
+        if module:
+            path = MODULE_PATHS.get(module, "")
+            urls = [u for u in urls if path.lstrip("/") in u]
+
+    if limit:
+        urls = urls[:limit]
+
+    log.info(f"Crawling {len(urls)} pages...")
+    written = skipped = 0
+
+    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
+        for i, url in enumerate(urls, 1):
+            log.info(f"[{i}/{len(urls)}] {url}")
+            page = fetch_page(url)
+            if page:
+                f.write(json.dumps(asdict(page), ensure_ascii=False) + "\n")
+                written += 1
+            else:
+                skipped += 1
+            time.sleep(DELAY_SECONDS)
+
+    log.info(f"\n✅ Done. Written: {written}, Skipped: {skipped}")
+    log.info(f"   Output: {OUTPUT_FILE}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Odoo 18 docs scraper")
+    parser.add_argument("--module", help=f"Filter to one module: {', '.join(MODULE_PATHS)}")
+    parser.add_argument("--limit", type=int, help="Max pages (for testing)")
+    args = parser.parse_args()
+    crawl(module=args.module, limit=args.limit)