jieluo1024 committed · Commit 00b2f48 · 1 Parent(s): b8ca6f2

feat: update XGBoost weights URL and add Gradio demo

- Fix XGBoost router weights download URL to use GitHub raw links
- Add timeout and fallback URLs for model download
- Add Gradio demo interface (demo/app.py, demo/pipeline.py)
- Add app.py entry point for HuggingFace Spaces
- Add requirements.txt for dependencies

.gitignore CHANGED
@@ -7,6 +7,8 @@ __pycache__/
.eggs/
build/
dist/
+ .cursor/
+ scripts/

# uv / virtualenv
.venv/
@@ -38,3 +40,8 @@ models/
.idea/
.vscode/
*.swp
+
+ # Gradio / HF Spaces runtime artifacts
+ flagged/
+ gradio_cached_examples/
+ .gradio/
README.md CHANGED
@@ -1,8 +1,25 @@
+ ---
+ title: PDFSystem MNBVC Demo
+ emoji: 📄
+ colorFrom: green
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ short_description: FinePDFs-style PDF pipeline demo for MNBVC
+ ---
+
# pdfsys-mnbvc

PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.

+ > **Try it:** `python app.py` locally, or deploy to Hugging Face Spaces with one click
+ > — the YAML header above is all the Space config needed. See [`demo/README.md`](demo/README.md)
+ > for both paths.
+
## Current status: MVP closed loop ✅

The first end-to-end path — **Router → MuPDF parser → OCR quality scorer** — is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).
@@ -221,7 +238,64 @@ A companion `.summary.json` file is also written with aggregate statistics.

## Docs

- - `docs/PRD.md` — full PRD with resource budgets and roadmap.
+ - [`docs/PRD.md`](docs/PRD.md) — full PRD with resource budgets and architectural rationale (the "what & why").
+ - [`docs/ROADMAP.md`](docs/ROADMAP.md) — prioritised implementation plan with work estimates and acceptance criteria (the "how & when").
+ - [`CONTRIBUTING.md`](CONTRIBUTING.md) — naming, parity rules, commit scopes.
+ - [`demo/README.md`](demo/README.md) — Gradio demo + Hugging Face Spaces deploy guide.
+
+ ## Collaborating with Cursor
+
+ This repo ships a full set of [Cursor project rules](https://docs.cursor.com/context/rules) under `.cursor/rules/`. They give the AI agent the same mental model senior contributors have — including the non-obvious bits (FinePDFs feature parity, the `pdfsys-core` zero-dep rule, Gradio UI/logic separation) that a new collaborator would otherwise trip over.
+
+ ### Quick start
+
+ ```bash
+ # One-shot bootstrap: checks python/uv, syncs workspace, downloads router weights.
+ bash scripts/setup_cursor.sh
+ ```
+
+ Then open the repo in Cursor (≥ 0.50, which supports `.cursor/rules/*.mdc`). The always-on rules activate immediately; file-specific rules attach as you open matching files.
+
+ ### Active rules
+
+ | Rule | Scope | What it enforces |
+ |------|-------|------------------|
+ | `00-project-context.mdc` | always | Project goals, tech stack, must-read docs, explicit non-goals. |
+ | `01-architecture-invariants.mdc` | always | 7 load-bearing invariants (zero-dep core, stateless processing, normalized bbox, etc.). |
+ | `02-commit-workflow.mdc` | always | Conventional commits with package-scoped names; pre-commit checklist. |
+ | `03-doc-sync.mdc` | always | Doc-sync mapping table: which code change forces which doc update. Cursor proactively scans after edits. |
+ | `10-python-standards.mdc` | `**/*.py` | Type hints, frozen dataclasses, lazy imports for heavy deps. |
+ | `20-core-contracts.mdc` | `packages/pdfsys-core/**` | Zero external deps; no I/O; schema-change ripple rules. |
+ | `21-router-parity.mdc` | `packages/pdfsys-router/**` | FinePDFs 124-feature parity is sacred; how to verify it. |
+ | `22-parser-backends.mdc` | `packages/pdfsys-parser-*/**` | All three backends must emit identical `ExtractedDoc`. |
+ | `23-bench-scorer.mdc` | `packages/pdfsys-bench/**` | torch/transformers lazy-load; bf16 default; the loop never raises. |
+ | `30-gradio-demo.mdc` | `demo/**,app.py` | UI layer has no business logic; callbacks never raise; lazy singletons. |
+
+ ### Recommended Cursor workflow
+
+ 1. **Before touching `pdfsys-core`** — read `20-core-contracts.mdc`. The AI will refuse to add third-party deps here and will surface schema-ripple questions.
+ 2. **Before touching `feature_extractor.py`** — `21-router-parity.mdc` kicks in; the AI will suggest running the parity check before you commit.
+ 3. **When building a new parser backend** — `22-parser-backends.mdc` walks through the 6-step addition procedure and refuses partial implementations.
+ 4. **When writing demo UI** — `30-gradio-demo.mdc` rejects `import pymupdf` in `demo/app.py` (it belongs in `demo/pipeline.py`).
+
+ ### Authoring new rules
+
+ Rules live in `.cursor/rules/*.mdc`. Format:
+
+ ```yaml
+ ---
+ description: Short description shown in the rule picker
+ globs: packages/<pkg>/**/*.py  # omit for always-on rules
+ alwaysApply: false             # true = always loaded
+ ---
+
+ # Rule Title
+
+ - Bullet rule 1 (with ✅/❌ example)
+ - Bullet rule 2
+ ```
+
+ Keep each rule under 100 lines, one concern per file. See existing rules for patterns.

## License

app.py ADDED
@@ -0,0 +1,25 @@
+ """Hugging Face Spaces entry point.
+
+ HF Spaces looks for ``app.py`` at the repo root. We just import the
+ actual app from ``demo/`` so the demo code stays tucked away and the
+ root stays uncluttered.
+ """
+
+ from __future__ import annotations
+
+ import sys
+ from pathlib import Path
+
+ _DEMO_DIR = Path(__file__).resolve().parent / "demo"
+ sys.path.insert(0, str(_DEMO_DIR))
+
+ from app import demo  # noqa: E402,F401 — re-exported for HF Spaces
+
+ if __name__ == "__main__":
+     import os
+
+     demo.queue(max_size=8).launch(
+         server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
+         server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
+         show_api=False,
+     )
demo/README.md ADDED
@@ -0,0 +1,102 @@
+ # pdfsys-mnbvc · Gradio Demo
+
+ A small, self-contained Gradio app that runs the **actually-implemented** MVP
+ path of the pdfsys-mnbvc pipeline on a single PDF you upload.
+
+ It exercises the same three components the bench harness does:
+
+ 1. **Stage-A XGBoost router** (`pdfsys_router.Router`) — 124 PyMuPDF features → `ocr_prob` → one of `mupdf / pipeline / vlm / deferred`.
+ 2. **MuPDF fast path** (`pdfsys_parser_mupdf.extract_doc`) — runs only when the router picks `mupdf`. Emits `Segment[]` with normalized bboxes plus a merged Markdown blob.
+ 3. **ModernBERT OCR quality scorer** (`pdfsys_bench.quality.OcrQualityScorer`) — optional and heavy; gated behind a checkbox.
+
+ PIPELINE / VLM / DEFERRED backends are currently stubs in the repo, so the
+ demo surfaces the router decision and skips extraction for them.
+
+ ## UI
+
+ ```
+ ┌─────────────────┬──────────────────────────────────────────────────┐
+ │ upload PDF      │  Summary · backend · P(ocr) · pages · timing     │
+ │ threshold       ├──────────────────────────────────────────────────┤
+ │ ☐ quality       │  [ Page preview │ Markdown │ Segments │          │
+ │ [Run Pipeline]  │    Router features │ Raw JSON ]                  │
+ │                 │                                                  │
+ │ pipeline        │  Page preview draws extracted bboxes (color =    │
+ │ diagram         │  chosen backend) directly on the first page.     │
+ └─────────────────┴──────────────────────────────────────────────────┘
+ ```
+
+ ## Run locally
+
+ ```bash
+ # option A — full workspace install (recommended)
+ uv sync                                   # installs all packages + deps
+ python -m pdfsys_router.download_weights  # one-time: XGBoost weights (257 KB)
+ python app.py                             # http://localhost:7860
+
+ # option B — plain pip (matches HF Spaces)
+ pip install -r requirements.txt
+ python -m pdfsys_router.download_weights
+ python app.py
+ ```
+
+ The first run of the quality scorer pulls `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`
+ (~800 MB) from the HF Hub. Set `HF_HOME=/path/to/cache` to control where it lands.
+
+ ## Deploy to Hugging Face Spaces
+
+ The root `README.md` already contains the required [Spaces YAML config](https://huggingface.co/docs/hub/spaces-config-reference):
+
+ ```yaml
+ ---
+ title: PDFSystem MNBVC Demo
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ license: apache-2.0
+ ---
+ ```
+
+ ### Option 1 · One-click from GitHub (recommended)
+
+ 1. Push this repo to GitHub.
+ 2. Go to <https://huggingface.co/new-space>.
+ 3. Pick the **Gradio** SDK; **CPU basic** hardware is enough for the MVP loop.
+ 4. In **Files** → **Create Space from an existing GitHub repo**, paste the repo URL.
+
+ HF Spaces will clone the whole repo, read the YAML header in the root
+ `README.md`, install `requirements.txt`, and launch `app.py`. The router's
+ XGBoost weights are downloaded automatically on the first request (~257 KB,
+ cached inside the Space container).
+
+ ### Option 2 · Manual push
+
+ ```bash
+ git clone https://huggingface.co/spaces/<you>/pdfsys-mnbvc-demo
+ cd pdfsys-mnbvc-demo
+ # copy repo contents into this dir (the four workspace packages must come
+ # along — they are installed editable by requirements.txt)
+ cp -r /path/to/pdfsystem_mnbvc/{app.py,requirements.txt,README.md,packages,demo} .
+ git add . && git commit -m "Initial deploy" && git push
+ ```
+
+ ### Resource notes (HF Spaces free tier: CPU, 16 GB RAM)
+
+ - Router: ~50–100 ms per PDF; effectively free.
+ - MuPDF extraction: ~10 ms per page.
+ - Quality scorer (ModernBERT-large): ~3–5 s per PDF at bf16; fits in RAM.
+   Disabled by default in the UI. **Keep it off** unless you want to wait.
+ - GPU Spaces aren't required; the MVP path is CPU-only. A GPU Space becomes
+   useful once the Pipeline / VLM parsers land.
+
+ ## Files
+
+ | Path | Role |
+ | ---- | ---- |
+ | `demo/app.py` | Gradio `Blocks` definition + event handlers. |
+ | `demo/pipeline.py` | Pure-Python wrapper around `Router` + `extract_doc` + `OcrQualityScorer`. Rendering helpers live here too. |
+ | `app.py` (repo root) | Thin HF Spaces entry point; imports `demo.app`. |
+ | `requirements.txt` (repo root) | Pin-friendly deps for `pip install -r`. Installs the four workspace packages in editable mode. |
+
+ The demo imports the real pipeline modules — if you change `pdfsys-router`
+ or `pdfsys-parser-mupdf`, the demo picks it up on the next launch.
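
The "normalized bboxes" that `Segment[]` carries (shown as `bbox_norm` in the Segments tab) can be sketched in isolation. A minimal illustration, not the repo's actual helper; the US-Letter page size in points is just an example:

```python
def normalize_bbox(rect: tuple[float, float, float, float],
                   page_w: float, page_h: float) -> tuple[float, ...]:
    """Map an absolute PDF-space rect (x0, y0, x1, y1) to [0, 1] coordinates."""
    x0, y0, x1, y1 = rect
    return (round(x0 / page_w, 4), round(y0 / page_h, 4),
            round(x1 / page_w, 4), round(y1 / page_h, 4))

# A 1-inch margin text block on a US-Letter page (612 x 792 pt):
print(normalize_bbox((72, 72, 540, 720), 612, 792))
# → (0.1176, 0.0909, 0.8824, 0.9091)
```

Normalizing by page size is what lets the demo draw the same boxes on any render resolution of the page preview.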
demo/app.py ADDED
@@ -0,0 +1,377 @@
+ """Gradio demo for the pdfsys-mnbvc MVP pipeline.
+
+ What this demonstrates (matching the code that actually exists in the
+ repo today, not the aspirational PRD):
+
+ * Stage-A XGBoost router — decides text-ok vs needs-ocr from 124
+   PyMuPDF-derived features.
+ * MuPDF fast path — extracts Markdown-ready segments when the router
+   picks ``Backend.MUPDF``. Overlaid on the first page as colored bboxes.
+ * ModernBERT OCR quality scorer — optional, heavy (~800 MB download,
+   3–5 s per doc on CPU). Off by default to keep the demo snappy.
+
+ PIPELINE / VLM / DEFERRED backends are surfaced through the router
+ decision but are still stubs in ``packages/pdfsys-parser-*``; the UI
+ just reports the routing choice in that case and skips extraction.
+
+ Runs locally (``python demo/app.py``) and as a Hugging Face Space (see
+ the repo-root ``README.md`` frontmatter and ``demo/README.md``).
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import sys
+ import tempfile
+ import traceback
+ from pathlib import Path
+
+ import gradio as gr
+
+ # Allow ``python demo/app.py`` without installing the workspace by falling
+ # back to the in-tree sources. When running under HF Spaces / uv sync the
+ # packages are already on sys.path and these inserts become no-ops.
+ _REPO_ROOT = Path(__file__).resolve().parent.parent
+ for pkg in ("pdfsys-core", "pdfsys-router", "pdfsys-parser-mupdf", "pdfsys-bench"):
+     src = _REPO_ROOT / "packages" / pkg / "src"
+     if src.is_dir() and str(src) not in sys.path:
+         sys.path.insert(0, str(src))
+
+ from pipeline import (  # noqa: E402 — must come after sys.path surgery
+     PipelineResult,
+     pick_curated_features,
+     render_first_page_with_bboxes,
+     run_pipeline,
+ )
+
+
+ # ------------------------------------------------------------------ constants
+
+ DESCRIPTION = """\
+ # PDFSystem-MNBVC · Pipeline Demo
+
+ **FinePDFs-inspired PB-scale PDF → pretraining-data pipeline**, adapted
+ for the Chinese MNBVC corpus. This demo shows the MVP closed loop that
+ is actually implemented in the repo today:
+
+ **Router (XGBoost, 124 features)** → **MuPDF fast path** → **OCR Quality Scorer (ModernBERT)**
+
+ The router decides whether a PDF is cheap to parse with PyMuPDF alone,
+ or whether it needs to go to the (still-stubbed) OCR / VLM backends.
+ Roughly 90% of a typical PDF corpus takes the green fast-path lane.
+ """
+
+ PIPELINE_DIAGRAM_MD = """\
+ ### Pipeline
+
+ ```
+             ┌────────────────┐
+  PDF ──────►│    Stage-A     │  XGBoost · ~10 ms/PDF
+             │     Router     │  124 PyMuPDF features
+             └────────┬───────┘
+                      │ ocr_prob
+        ┌─────────────┼─────────────┐
+        ▼             ▼             ▼
+     MUPDF        PIPELINE    VLM / DEFERRED
+   (text-ok)     (OCR, stub)   (VLM, stub)
+
+  PyMuPDF blocks ─► Markdown + Segments (with bboxes)
+
+  ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
+ ```
+
+ **Backend color legend on page preview**
+
+ - 🟢 `mupdf` — text-ok fast path (implemented)
+ - 🟠 `pipeline` — OCR lane (stub, routing only)
+ - 🟣 `vlm` — VLM lane (stub, routing only)
+ - ⚪ `deferred` — held back until VLM workers online
+ """
+
+
+ def _safe(val, default=""):
+     """Coerce NaN / None for Gradio components that don't like them."""
+     if val is None:
+         return default
+     try:
+         import math
+
+         if isinstance(val, float) and math.isnan(val):
+             return default
+     except Exception:
+         pass
+     return val
+
+
+ # ------------------------------------------------------------------ handlers
+
+
+ def process_pdf(
+     pdf_file: str | None,
+     run_quality: bool,
+     ocr_threshold: float,
+     progress: gr.Progress = gr.Progress(),
+ ):
+     """Main Gradio callback. Returns one value per output component."""
+     empty_segments = [[0, 0, "-", "-", 0, ""]]
+     empty_features = [["(no PDF uploaded)", ""]]
+     empty_summary = "Upload a PDF to get started."
+
+     if not pdf_file:
+         return (
+             empty_summary,
+             "", 0.0, 0, "", 0.0,
+             None,
+             "_No markdown yet._",
+             empty_segments,
+             empty_features,
+             {},
+         )
+
+     pdf_path = Path(pdf_file)
+
+     try:
+         progress(0.1, desc="Routing (XGBoost)…")
+         result: PipelineResult = run_pipeline(
+             pdf_path,
+             run_quality=run_quality,
+             ocr_threshold=ocr_threshold,
+         )
+
+         progress(0.7, desc="Rendering first page…")
+         preview = render_first_page_with_bboxes(pdf_path, result, page_index=0)
+
+     except Exception as e:  # noqa: BLE001
+         tb = traceback.format_exc()
+         err_json = {"error": str(e), "traceback": tb.splitlines()[-6:]}
+         return (
+             f"**Failed:** `{e}`",
+             "", 0.0, 0, "", 0.0,
+             None,
+             f"```\n{tb}\n```",
+             empty_segments,
+             empty_features,
+             err_json,
+         )
+
+     # ------------------------------------------------------------- summary
+     lines = [
+         f"**File:** `{pdf_path.name}` ({pdf_path.stat().st_size / 1024:.1f} KB)",
+         f"**Routed to:** `{result.backend}` &nbsp;·&nbsp; "
+         f"P(ocr) = **{result.ocr_prob:.3f}** &nbsp;·&nbsp; {result.num_pages} page(s)",
+     ]
+     flags = []
+     if result.is_form:
+         flags.append("is_form")
+     if result.is_encrypted:
+         flags.append("encrypted")
+     if result.needs_password:
+         flags.append("password-protected")
+     if result.garbled_text_ratio > 0.01:
+         flags.append(f"garbled_text_ratio={result.garbled_text_ratio:.2%}")
+     if flags:
+         lines.append("**Flags:** " + ", ".join(f"`{f}`" for f in flags))
+     if result.router_error:
+         lines.append(f"**Router error:** `{result.router_error}`")
+     if result.extract_error:
+         lines.append(f"**Extract error:** `{result.extract_error}`")
+     if result.quality_error:
+         lines.append(f"**Quality error:** `{result.quality_error}`")
+
+     if result.backend == "mupdf" and not result.extract_error:
+         stats = result.extract_stats
+         lines.append(
+             f"**Extracted:** {stats.get('segment_count', 0)} segments, "
+             f"{stats.get('char_count', 0):,} chars "
+             f"(pages {stats.get('pages_extracted', 0)}/{stats.get('page_count', 0)})"
+         )
+     else:
+         lines.append(
+             "_MuPDF extraction skipped — backend is not `mupdf`. "
+             "PIPELINE/VLM backends are still stubs in this repo._"
+         )
+
+     if result.quality_score is not None:
+         lines.append(
+             f"**OCR quality:** **{result.quality_score:.2f}** / 3.0 "
+             f"({result.quality_num_tokens} tokens, `{result.quality_model}`)"
+         )
+
+     lines.append(
+         f"**Timing (ms):** router **{result.wall_ms_router:.0f}** · "
+         f"extract **{result.wall_ms_extract:.0f}** · "
+         f"quality **{result.wall_ms_quality:.0f}**"
+     )
+     summary_md = "\n\n".join(lines)
+
+     # ------------------------------------------------------------- markdown
+     md_text = result.markdown.strip() or "_No markdown — this PDF was not routed to MuPDF._"
+     if len(md_text) > 20_000:
+         md_text = md_text[:20_000] + "\n\n…\n\n**[truncated for UI — full Markdown in the JSON tab]**"
+
+     # ------------------------------------------------------------- segments
+     seg_rows = [
+         [s["index"], s["page"], s["type"], str(s["bbox_norm"]), s["chars"], s["preview"]]
+         for s in result.segments
+     ] or empty_segments
+
+     # ------------------------------------------------------------- features
+     feat_rows = pick_curated_features(result.router_features) or empty_features
+
+     # ------------------------------------------------------------- raw JSON
+     raw = result.to_record()
+     raw["router_features_full"] = result.router_features
+     raw["segments_full"] = result.segments
+
+     return (
+         summary_md,
+         result.backend,
+         float(result.ocr_prob) if result.ocr_prob == result.ocr_prob else 0.0,
+         int(result.num_pages),
+         ("-" if result.quality_score is None else f"{result.quality_score:.2f} / 3.0"),
+         float(result.wall_ms_router + result.wall_ms_extract + result.wall_ms_quality),
+         preview,
+         md_text,
+         seg_rows,
+         feat_rows,
+         raw,
+     )
+
+
+ # ---------------------------------------------------------------------- UI
+
+ CSS = """
+ .small-num input { font-weight: 600; font-size: 1.1rem; }
+ footer { display: none !important; }
+ """
+
+
+ def build_demo() -> gr.Blocks:
+     # theme/css belong to the Blocks constructor, not launch().
+     with gr.Blocks(
+         title="PDFSystem-MNBVC Demo",
+         theme=gr.themes.Soft(primary_hue="emerald"),
+         css=CSS,
+     ) as demo:
+         gr.Markdown(DESCRIPTION)
+
+         with gr.Row():
+             # -------------------- left column: controls + diagram
+             with gr.Column(scale=1, min_width=320):
+                 pdf_input = gr.File(
+                     label="Upload a PDF",
+                     file_types=[".pdf"],
+                     type="filepath",
+                 )
+                 with gr.Accordion("Options", open=True):
+                     ocr_threshold = gr.Slider(
+                         0.0, 1.0, value=0.5, step=0.05,
+                         label="OCR probability threshold",
+                         info="ocr_prob ≥ threshold ⇒ route off the MuPDF fast path",
+                     )
+                     run_quality = gr.Checkbox(
+                         label="Run ModernBERT quality scorer",
+                         value=False,
+                         info="~3–5 s on CPU. First run downloads ~800 MB.",
+                     )
+                 run_btn = gr.Button("Run Pipeline", variant="primary", size="lg")
+                 gr.Markdown(PIPELINE_DIAGRAM_MD)
+
+             # -------------------- right column: outputs
+             with gr.Column(scale=2, min_width=520):
+                 summary_md = gr.Markdown(
+                     "Upload a PDF and click **Run Pipeline**.",
+                     label="Summary",
+                 )
+
+                 with gr.Row():
+                     backend_out = gr.Textbox(
+                         label="Backend", interactive=False, elem_classes=["small-num"]
+                     )
+                     ocr_prob_out = gr.Number(
+                         label="P(OCR)", interactive=False, precision=3,
+                         elem_classes=["small-num"],
+                     )
+                     pages_out = gr.Number(
+                         label="Pages", interactive=False,
+                         elem_classes=["small-num"],
+                     )
+                     quality_out = gr.Textbox(
+                         label="Quality", interactive=False,
+                         elem_classes=["small-num"],
+                     )
+                     wall_ms_out = gr.Number(
+                         label="Total ms", interactive=False, precision=0,
+                         elem_classes=["small-num"],
+                     )
+
+                 with gr.Tabs():
+                     with gr.Tab("Page preview"):
+                         preview_img = gr.Image(
+                             label="First page with extracted bboxes",
+                             type="pil",
+                             interactive=False,
+                             height=720,
+                         )
+                     with gr.Tab("Markdown"):
+                         md_out = gr.Markdown()
+                     with gr.Tab("Segments"):
+                         seg_df = gr.Dataframe(
+                             headers=["idx", "page", "type", "bbox_norm", "chars", "preview"],
+                             datatype=["number", "number", "str", "str", "number", "str"],
+                             wrap=True,
+                             label="Extracted segments (one row per block)",
+                         )
+                     with gr.Tab("Router features"):
+                         feat_df = gr.Dataframe(
+                             headers=["feature", "value"],
+                             datatype=["str", "str"],
+                             label="Curated subset (full 124-dim vector in Raw JSON)",
+                         )
+                     with gr.Tab("Raw JSON"):
+                         raw_json = gr.JSON(label="All pipeline outputs")
+
+         # ----------------------------------------------------------- wiring
+         outputs = [
+             summary_md,
+             backend_out, ocr_prob_out, pages_out, quality_out, wall_ms_out,
+             preview_img,
+             md_out,
+             seg_df,
+             feat_df,
+             raw_json,
+         ]
+         run_btn.click(
+             process_pdf,
+             inputs=[pdf_input, run_quality, ocr_threshold],
+             outputs=outputs,
+         )
+         # Auto-run on file upload (with quality off for snappiness).
+         pdf_input.upload(
+             lambda f, t: process_pdf(f, False, t),
+             inputs=[pdf_input, ocr_threshold],
+             outputs=outputs,
+         )
+
+         gr.Markdown(
+             "---\n"
+             "Repo: [pdfsystem_mnbvc](https://github.com/) · "
+             "Architecture: [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) · "
+             "Router weights: FinePDFs upstream (Apache-2.0) · "
+             "Quality model: `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`"
+         )
+
+     return demo
+
+
+ demo = build_demo()
+
+
+ if __name__ == "__main__":
+     # Sensible defaults for both local dev and HF Spaces.
+     server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
+     server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
+     demo.queue(max_size=8).launch(
+         server_name=server_name,
+         server_port=server_port,
+     )
+ )
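One idiom worth a note: `process_pdf` above and `run_pipeline` below both use `x == x` as a dependency-free NaN guard. It works because NaN is the only float value that compares unequal to itself:

```python
import math

nan = float("nan")
assert nan != nan             # NaN never equals itself (IEEE 754)
assert (nan == nan) is False  # so `x == x` is False only for NaN
assert math.isnan(nan)        # the explicit stdlib spelling of the same check

x = 0.42
assert x == x                 # any ordinary float passes the guard
```

Using the comparison instead of `math.isnan` saves an import inside hot callbacks; either spelling would do.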
demo/pipeline.py ADDED
@@ -0,0 +1,311 @@
+ """End-to-end wiring used by the Gradio demo.
+
+ Wraps the three production-path components in one callable:
+
+     Router (Stage-A XGBoost)
+       └─► Backend.MUPDF  → pdfsys_parser_mupdf.extract_doc
+       └─► anything else  → not extracted (Pipeline/VLM/Deferred are
+                            still stubs in this repo; we surface the
+                            router decision and stop).
+
+ Kept deliberately Gradio-free so the same code is unit-testable and
+ reusable from notebooks. ``app.py`` only imports :func:`run_pipeline`
+ and :func:`render_first_page_with_bboxes`.
+ """
+
+ from __future__ import annotations
+
+ import io
+ import time
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any
+
+ import pymupdf
+ from PIL import Image, ImageDraw
+
+
+ # ------------------------------------------------------------------ singletons
+
+ _ROUTER: Any = None
+ _SCORER: Any = None
+
+
+ def _ensure_router_weights() -> None:
+     """Make sure the XGBoost weights are on disk. No-op if already present."""
+     from pdfsys_router.download_weights import download, target_path
+
+     if not target_path().is_file():
+         download()
+
+
+ def get_router(ocr_threshold: float = 0.5):
+     """Lazy-load the singleton Router. Weights download on first call."""
+     global _ROUTER
+     _ensure_router_weights()
+     from pdfsys_router import Router
+
+     if _ROUTER is None or abs(_ROUTER.ocr_threshold - ocr_threshold) > 1e-9:
+         _ROUTER = Router(ocr_threshold=ocr_threshold)
+     return _ROUTER
+
+
+ def get_scorer():
+     """Lazy-load the singleton ModernBERT quality scorer (~800 MB download)."""
+     global _SCORER
+     if _SCORER is None:
+         from pdfsys_bench.quality import OcrQualityScorer
+
+         _SCORER = OcrQualityScorer()
+     return _SCORER
+
+
+ # ------------------------------------------------------------------ data class
+
+
+ @dataclass(slots=True)
+ class PipelineResult:
+     """Everything the UI needs in one flat object."""
+
+     # Router
+     backend: str
+     ocr_prob: float
+     num_pages: int
+     is_form: bool
+     garbled_text_ratio: float
+     is_encrypted: bool
+     needs_password: bool
+     router_error: str | None
+     router_features: dict[str, Any] = field(default_factory=dict)
+
+     # Extract (only when backend == mupdf)
+     sha256: str | None = None
+     segments: list[dict[str, Any]] = field(default_factory=list)
+     markdown: str = ""
+     extract_stats: dict[str, Any] = field(default_factory=dict)
+     extract_error: str | None = None
+
+     # Quality
+     quality_score: float | None = None
+     quality_num_tokens: int | None = None
+     quality_model: str | None = None
+     quality_error: str | None = None
+
+     # Wall times (ms)
+     wall_ms_router: float = 0.0
+     wall_ms_extract: float = 0.0
+     wall_ms_quality: float = 0.0
+
+     def to_record(self) -> dict[str, Any]:
+         """Flat JSON-friendly dict for the raw output tab."""
+         return {
+             "backend": self.backend,
+             "ocr_prob": self.ocr_prob,
+             "num_pages": self.num_pages,
+             "is_form": self.is_form,
+             "garbled_text_ratio": self.garbled_text_ratio,
+             "is_encrypted": self.is_encrypted,
+             "needs_password": self.needs_password,
+             "router_error": self.router_error,
+             "sha256": self.sha256,
+             "num_segments": len(self.segments),
+             "markdown_chars": len(self.markdown),
+             "extract_stats": self.extract_stats,
+             "extract_error": self.extract_error,
+             "quality_score": self.quality_score,
+             "quality_num_tokens": self.quality_num_tokens,
+             "quality_model": self.quality_model,
+             "quality_error": self.quality_error,
+             "wall_ms_router": round(self.wall_ms_router, 1),
+             "wall_ms_extract": round(self.wall_ms_extract, 1),
+             "wall_ms_quality": round(self.wall_ms_quality, 1),
+         }
+
+
+ # -------------------------------------------------------------------- helpers
+
+
+ def _segment_to_row(seg: Any) -> dict[str, Any]:
+     """Flatten a :class:`pdfsys_core.Segment` for the UI table."""
+     bbox = seg.bbox
+     bbox_tuple = None if bbox is None else (
+         round(bbox.x0, 4),
+         round(bbox.y0, 4),
+         round(bbox.x1, 4),
+         round(bbox.y1, 4),
+     )
+     return {
+         "index": seg.index,
+         "page": seg.page_index,
+         "type": seg.type.value,
+         "bbox_norm": bbox_tuple,
+         "chars": len(seg.content),
+         "preview": seg.content[:120].replace("\n", " "),
+     }
+
+
+ # ------------------------------------------------------------------ core entry
+
+
+ def run_pipeline(
+     pdf_path: str | Path,
+     *,
+     run_quality: bool = False,
+     ocr_threshold: float = 0.5,
+ ) -> PipelineResult:
+     """Route the PDF, extract if text-ok, optionally score quality.
+
+     Never raises on malformed input — all failure modes surface via the
+     ``*_error`` fields so the UI can present them uniformly.
+     """
+     pdf_path = Path(pdf_path)
+     if not pdf_path.is_file():
+         raise FileNotFoundError(f"PDF not found: {pdf_path}")
+
+     # -- Stage-A router -------------------------------------------------------
+     router = get_router(ocr_threshold=ocr_threshold)
+     t0 = time.perf_counter()
+     decision = router.classify(pdf_path)
+     t1 = time.perf_counter()
+
+     result = PipelineResult(
+         backend=decision.backend.value,
+         ocr_prob=float(decision.ocr_prob) if decision.ocr_prob == decision.ocr_prob else float("nan"),
+         num_pages=decision.num_pages,
+         is_form=decision.is_form,
+         garbled_text_ratio=decision.garbled_text_ratio,
+         is_encrypted=decision.is_encrypted,
+         needs_password=decision.needs_password,
+         router_error=decision.error,
+         router_features=dict(decision.features or {}),
+         wall_ms_router=(t1 - t0) * 1000.0,
+     )
+
+     # -- MuPDF extraction (only for text-ok path) -----------------------------
+     from pdfsys_core import Backend
+     from pdfsys_parser_mupdf import extract_doc
+
+     if decision.backend == Backend.MUPDF and decision.error is None:
+         try:
+             t2 = time.perf_counter()
+             extracted = extract_doc(pdf_path)
+             t3 = time.perf_counter()
+             result.sha256 = extracted.sha256
+             result.segments = [_segment_to_row(s) for s in extracted.segments]
+             result.markdown = extracted.markdown
+             result.extract_stats = dict(extracted.stats)
+             result.wall_ms_extract = (t3 - t2) * 1000.0
+         except Exception as e:  # noqa: BLE001 — surface to UI
+             result.extract_error = f"{type(e).__name__}: {e}"
+
+     # -- Quality scoring (optional, heavy) ------------------------------------
+     if run_quality and result.markdown:
+         try:
+             scorer = get_scorer()
+             t4 = time.perf_counter()
+             q = scorer.score(result.markdown)
+             t5 = time.perf_counter()
+             result.quality_score = q.score
+             result.quality_num_tokens = q.num_tokens
+             result.quality_model = q.model
+             result.wall_ms_quality = (t5 - t4) * 1000.0
+         except Exception as e:  # noqa: BLE001
+             result.quality_error = f"{type(e).__name__}: {e}"
+
+     return result
+
+
+ # ----------------------------------------------------------------- rendering
+
+
+ _BACKEND_COLOR = {
+     "mupdf": (39, 174, 96),       # green — text-ok fast path
+     "pipeline": (243, 156, 18),   # orange — OCR pipeline (stub)
+     "vlm": (155, 89, 182),        # purple — VLM (stub)
+     "deferred": (127, 140, 141),  # gray — held back
+ }
+
+
+ def render_first_page_with_bboxes(
+     pdf_path: str | Path,
+     result: PipelineResult,
+     page_index: int = 0,
+ target_max_side: int = 1100,
234
+ ) -> Image.Image | None:
235
+ """Render ``page_index`` of the PDF and overlay MuPDF segment bboxes.
236
+
237
+ Falls back to ``None`` on any failure (corrupted / encrypted / etc.).
238
+ """
239
+ pdf_path = Path(pdf_path)
240
+ try:
241
+ doc = pymupdf.open(str(pdf_path))
242
+ except Exception:
243
+ return None
244
+
245
+ try:
246
+ if len(doc) == 0 or page_index >= len(doc):
247
+ return None
248
+ page = doc[page_index]
249
+ rect = page.rect
250
+ # Scale so the longest side ~= target_max_side (for UI readability).
251
+ zoom = max(1.0, target_max_side / max(rect.width, rect.height))
252
+ pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom), alpha=False)
253
+ img = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
254
+ except Exception:
255
+ return None
256
+ finally:
257
+ doc.close()
258
+
259
+ # Overlay segment bboxes for the selected page only.
260
+ color = _BACKEND_COLOR.get(result.backend, (52, 152, 219))
261
+ draw = ImageDraw.Draw(img, "RGBA")
262
+ w, h = img.size
263
+
264
+ drawn = 0
265
+ for seg in result.segments:
266
+ if seg["page"] != page_index or seg["bbox_norm"] is None:
267
+ continue
268
+ x0, y0, x1, y1 = seg["bbox_norm"]
269
+ box = (int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h))
270
+ # Semi-transparent fill + solid outline.
271
+ draw.rectangle(box, fill=(*color, 28), outline=(*color, 220), width=2)
272
+ # Small index badge.
273
+ label = str(seg["index"])
274
+ tx, ty = box[0] + 2, box[1] + 2
275
+ draw.rectangle((tx, ty, tx + 6 + 7 * len(label), ty + 16), fill=(*color, 220))
276
+ draw.text((tx + 3, ty + 1), label, fill=(255, 255, 255))
277
+ drawn += 1
278
+
279
+ return img
280
+
281
+
282
+ def pick_curated_features(features: dict[str, Any]) -> list[list[Any]]:
283
+ """Select a small, meaningful subset of the 124-feature vector for display.
284
+
285
+ The full vector goes into the raw JSON tab; this is the "at a glance"
286
+ view. Ordered by importance / interpretability, not by XGBoost column
287
+ order.
288
+ """
289
+ keys_in_order = [
290
+ "num_pages_successfully_sampled",
291
+ "garbled_text_ratio",
292
+ "is_form",
293
+ "creator_or_producer_is_known_scanner",
294
+ "num_unique_image_xrefs",
295
+ "num_junk_image_xrefs",
296
+ "page_level_char_counts_page1",
297
+ "page_level_unique_font_counts_page1",
298
+ "page_level_text_area_ratios_page1",
299
+ "page_level_image_counts_page1",
300
+ "page_level_bitmap_proportions_page1",
301
+ "page_level_vector_graphics_obj_count_page1",
302
+ "page_level_hidden_char_counts_page1",
303
+ ]
304
+ rows: list[list[Any]] = []
305
+ for k in keys_in_order:
306
+ if k in features:
307
+ v = features[k]
308
+ if isinstance(v, float):
309
+ v = round(v, 4)
310
+ rows.append([k, v])
311
+ return rows
docs/ROADMAP.md ADDED
@@ -0,0 +1,807 @@
# pdfsys-mnbvc · Roadmap

> Optimisation plan and implementation schedule · v0.1 · 2026-04-17
>
> This document turns the goals described in [`PRD.md`](./PRD.md) into an actionable task pool **with priorities, effort estimates, and acceptance criteria**. The PRD answers "what are we building"; the ROADMAP answers "in what order, how, and how do we verify it when done".

---

## 0 · Summary

**In one sentence**: the design docs and architectural skeleton are first-rate; the engineering infrastructure is badly lacking; only 1.5 of the 6 stages have landed.

**Sprint plan**: a 2-week "make it collaborative" sprint (P0) is the precondition for all later work, followed by 4 weeks of performance and reliability hardening (P1), then 10–16 weeks to close the 6-stage loop (P2). P3 is PB-scale operation and ecosystem work, a long-running background track.

---

## 1 · Status scorecard

| Dimension | Status | Score |
|---|---|---|
| Design doc (PRD) | 441 lines, trade-offs clearly argued | 9/10 |
| Package architecture | 7 workspace packages, sensible boundaries | 8/10 |
| Core contracts (`pdfsys-core`) | frozen dataclasses + zero deps + atomic writes | 9/10 |
| MVP closed loop (Router→MuPDF→Scorer) | runs on OmniDocBench-100 | 7/10 |
| **Tests** | **zero test files, zero CI** | **0/10** |
| **Dependency management** | no lock file, no upper bounds | 2/10 |
| **Observability** | no logging, no metrics | 2/10 |
| Implementation completeness | 2,180 lines; 4 of 7 packages are stubs | 3/10 |
| Demo & contributor experience | polished Gradio demo + Cursor rules | 8/10 |

**Key risk**: in the current state a single person can hack forward, but **any collaboration beyond 3 people will immediately spin out of control**: there are no tests protecting parity, no CI, and no lock file, so the first dependency upgrade will poison the router.

---

## 2 · Optimisation at a glance

```
┌──────────────────────────────────────────────────────────────────┐
│ P0  Engineering foundations (2 weeks, blocks everything else)    │
│   ├─ 1.1 Test framework       pytest + key unit tests            │
│   ├─ 1.2 Code quality         ruff + mypy + pre-commit           │
│   ├─ 1.3 GitHub Actions CI                                       │
│   ├─ 1.4 uv.lock in the repo + dependency upper bounds           │
│   └─ 1.5 Parity harness (router regression gate)                 │
├──────────────────────────────────────────────────────────────────┤
│ P1  Performance & reliability (4 weeks)                          │
│   ├─ 2.1 Router hot-path optimisation (49 ms → 10 ms)            │
│   ├─ 2.2 Quality scorer batched inference                        │
│   ├─ 2.3 structlog logging                                       │
│   ├─ 2.4 Prometheus metrics export                               │
│   └─ 2.5 Error taxonomy + quarantine bucket                      │
├──────────────────────────────────────────────────────────────────┤
│ P2  Feature completion (8-12 weeks, per the PRD roadmap)         │
│   ├─ 3.1 Layout analyser (PP-DocLayoutV3 ONNX INT8)              │
│   ├─ 3.2 Pipeline parser (RapidOCR, simple layouts)              │
│   ├─ 3.3 Stage-B router (layout-cache driven)                    │
│   ├─ 3.4 VLM parser (MinerU 2.5 + LMDeploy)                      │
│   ├─ 3.5 Stage-3 post-processing                                 │
│   ├─ 3.6 Stage-4 quality / PII / MinHash dedup                   │
│   └─ 3.7 Stage-5 Parquet packaging                               │
├──────────────────────────────────────────────────────────────────┤
│ P3  Scale & ecosystem (3-6 months)                               │
│   ├─ 4.1 datatrove orchestration integration                     │
│   ├─ 4.2 Slurm / K8s runner                                      │
│   ├─ 4.3 Object-storage backends (S3 / OSS / MinIO)              │
│   ├─ 4.4 Chinese EduScore training                               │
│   └─ 4.5 Vertical-text classics LoRA                             │
└──────────────────────────────────────────────────────────────────┘
```

---

## 3 · P0 engineering foundations (Weeks 1-2)

### 3.1 Test framework · pytest

**Goal**: within 2 weeks, coverage of ≥ 90% for `pdfsys-core` and ≥ 60% each for `pdfsys-router` and `pdfsys-parser-mupdf`.

**Why first**: all 7 invariants in `.cursor/rules/01-architecture-invariants.mdc` (BBox normalisation, frozen dataclasses, atomic writes, schema isomorphism, etc.) **can be verified by unit tests**. Without tests, "don't violate the invariants" is an empty slogan.

**Deliverable layout**:

```
tests/
├── conftest.py                         # shared fixtures
├── fixtures/pdfs/                      # 5-10 PDFs across types (< 100 KB/file, committed)
├── unit/
│   ├── core/
│   │   ├── test_bbox.py                # BBox bounds, conversions, invalid values
│   │   ├── test_serde.py               # to_dict/from_dict roundtrip
│   │   ├── test_cache.py               # LayoutCache atomic write + crash recovery
│   │   └── test_types.py               # Backend / RegionType enum stability
│   ├── router/
│   │   ├── test_classifier_smoke.py    # classify() never raises on malformed input
│   │   ├── test_feature_shape.py       # output must have 124 columns, names locked
│   │   └── test_error_taxonomy.py      # encrypted/corrupt/empty error classes
│   ├── parser_mupdf/
│   │   ├── test_extract_basic.py       # paragraph extraction from a normal PDF
│   │   ├── test_bbox_normalized.py     # all bboxes ∈ [0, 1]
│   │   └── test_corrupted_pdf.py       # broken PDFs don't crash
│   └── bench/
│       └── test_loop_never_raises.py   # bad PDF in, JSONL row out
├── contract/
│   ├── test_extracted_doc_schema.py    # all parser outputs are isomorphic
│   └── test_cursor_rules_valid.py      # .mdc frontmatter is valid
└── integration/
    └── test_bench_smoke.py             # python -m pdfsys_bench --limit 3
```

**Key examples**:

```python
# tests/unit/core/test_bbox.py
import pytest
from pdfsys_core import BBox

class TestBBoxInvariants:
    @pytest.mark.parametrize("x0,y0,x1,y1", [
        (-0.1, 0, 0.5, 0.5),  # negative coordinate
        (0, 0, 1.1, 0.5),     # exceeds 1
        (0.5, 0, 0.3, 0.5),   # x1 < x0
        (0, 0, 0, 0),         # zero area
    ])
    def test_rejects_invalid(self, x0, y0, x1, y1):
        with pytest.raises(ValueError):
            BBox(x0=x0, y0=y0, x1=x1, y1=y1)

    def test_to_pixels_roundtrip(self):
        box = BBox(0.1, 0.2, 0.9, 0.8)
        assert box.to_pixels(1000, 500) == (100, 100, 900, 400)
```

```python
# tests/unit/router/test_feature_shape.py
EXPECTED_COLUMNS = 124

def test_feature_vector_has_124_columns(sample_pdf):
    router = Router()
    decision = router.classify(sample_pdf)
    assert not decision.error
    assert len(decision.features) == EXPECTED_COLUMNS, (
        f"Feature vector drifted from 124 to {len(decision.features)}. "
        "If intentional, retrain XGBoost weights."
    )
```

**Steps**:

1. `uv add --group dev pytest pytest-cov pytest-xdist hypothesis`
2. Add `[tool.pytest.ini_options]` and `[tool.coverage.run]` to the root `pyproject.toml`
3. Provide `sample_pdf` / `encrypted_pdf` / `corrupted_pdf` fixtures in `conftest.py`
4. Write the tests in the order of the tree above (one sub-directory per day)
5. Add a `Makefile` or `scripts/test.sh`: `uv run pytest -n auto tests/`

**Acceptance**: CI runs the full suite in < 2 minutes; all three packages hit their coverage targets.

**Effort**: 1 person · 10 days

---

### 3.2 Code quality · ruff + mypy + pre-commit

**Goal**: zero ruff errors, zero mypy errors in `pdfsys-core`, and automatic blocking before commit.

**Additions to the root `pyproject.toml`**:

```toml
[tool.ruff]
target-version = "py311"
line-length = 100
src = ["packages/pdfsys-core/src", "packages/pdfsys-router/src",
       "packages/pdfsys-parser-mupdf/src", "packages/pdfsys-bench/src",
       "demo"]

[tool.ruff.lint]
select = ["E", "F", "W", "I", "B", "UP", "SIM", "PLC0415", "BLE001", "RET", "ARG"]
ignore = ["E501"]
per-file-ignores = { "packages/pdfsys-bench/**" = ["BLE001"] }

[tool.mypy]
python_version = "3.11"
strict = true
exclude = ["^packages/pdfsys-parser-(pipeline|vlm)/", "^packages/pdfsys-layout-analyser/"]

[[tool.mypy.overrides]]
module = ["pymupdf.*", "xgboost.*", "gradio.*"]
ignore_missing_imports = true
```

**`.pre-commit-config.yaml`**:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
        files: ^packages/pdfsys-core/
  - repo: local
    hooks:
      - id: no-committed-weights
        name: Reject committed model weights
        entry: bash -c '! git diff --cached --name-only | grep -E "\.(ubj|safetensors|pt|bin)$"'
        language: system
        pass_filenames: false
      - id: validate-cursor-rules
        name: Validate .cursor/rules YAML frontmatter
        entry: python scripts/validate_rules.py
        language: system
        files: ^\.cursor/rules/.*\.mdc$
```

**Steps**:

1. `uv add --group dev ruff mypy pre-commit`
2. Write the two configs above
3. Fix existing issues with `uv run ruff check --fix .` + `uv run ruff format .`
4. Run `uv run mypy packages/pdfsys-core` until it reports zero errors
5. `pre-commit install`, appended to `scripts/setup_cursor.sh`
6. Implement the `scripts/validate_rules.py` mentioned in `03-doc-sync.mdc`

**Acceptance**: `pre-commit run --all-files` is fully green.

**Effort**: 1 person · 3 days

---

### 3.3 GitHub Actions CI

**`.github/workflows/ci.yml`**:

```yaml
name: CI
on:
  pull_request:
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
        with: { version: "0.4.x", enable-cache: true }
      - run: uv sync --frozen
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      - run: uv run mypy packages/pdfsys-core

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
        with: { python-version: "${{ matrix.python }}" }
      - run: uv sync --frozen
      - run: uv run python -m pdfsys_router.download_weights
      - run: uv run pytest -n auto --cov --cov-report=xml tests/
      - uses: codecov/codecov-action@v4
        if: matrix.python == '3.11'

  parity:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      # The github context exposes no changed-files list; gate on a git diff.
      - id: gate
        run: |
          echo "changed=$(git diff --name-only "origin/${{ github.base_ref }}"...HEAD \
            | grep -c feature_extractor.py || true)" >> "$GITHUB_OUTPUT"
      - uses: astral-sh/setup-uv@v3
        if: steps.gate.outputs.changed != '0'
      - if: steps.gate.outputs.changed != '0'
        run: |
          uv sync --frozen
          uv run python -m pdfsys_router.download_weights
          bash scripts/check_parity.sh origin/main HEAD
```

**Steps**:

1. Write the workflow above
2. Optional: a `.github/workflows/preview-hf-space.yml` that deploys a preview Space per PR
3. GitHub Settings → Branches: mark `main` protected and require the CI checks

**Acceptance**: three green ✅ within 3 minutes of opening a PR.

**Effort**: 1 person · 1 day

---

### 3.4 uv.lock in the repo + dependency upper bounds

**Current pain points**:
- `.gitignore:14` excludes `uv.lock` (an anti-pattern: lock files must be committed)
- every dependency has only a lower bound: `pymupdf>=1.24` would silently pull in a 2.0 release the day it ships

**Fix**:

1. Remove `uv.lock` from `.gitignore`
2. Add upper bounds to every dependency (conservative policy: major+1):

```toml
# packages/pdfsys-router/pyproject.toml
dependencies = [
    "pdfsys-core",
    "pymupdf>=1.24,<2.0",
    "xgboost>=2.0,<3.0",
    "scikit-learn>=1.3,<2.0",
    "pandas>=2.0,<3.0",
    "numpy>=1.26,<3.0",
]
```

3. `uv lock && git add uv.lock`
4. CI uses `uv sync --frozen` (see §3.3)

**Effort**: 0.5 day

---

### 3.5 Parity Harness

**Background**: `.cursor/rules/21-router-parity.mdc` already describes the parity verification flow, but **there is no executable script**.

**`scripts/check_parity.sh`**:

```bash
#!/usr/bin/env bash
# Verify router ocr_prob drift between two refs.
# Usage: bash scripts/check_parity.sh <baseline_ref> <candidate_ref>
set -euo pipefail

BASELINE="${1:-origin/main}"
CANDIDATE="${2:-HEAD}"
SAMPLE_DIR="${PARITY_SAMPLE_DIR:-tests/fixtures/pdfs}"
EPSILON="${PARITY_EPSILON:-1e-6}"
WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT

run_bench() {
  local ref="$1" out="$2"
  git worktree add "$WORK_DIR/$ref" "$ref"
  (cd "$WORK_DIR/$ref" && uv sync --frozen --quiet \
    && uv run python -m pdfsys_router.download_weights >/dev/null \
    && uv run python -m pdfsys_bench --pdf-dir "$SAMPLE_DIR" --out "$out" --no-quality)
  git worktree remove --force "$WORK_DIR/$ref"
}

run_bench "$BASELINE" "$WORK_DIR/baseline.jsonl"
run_bench "$CANDIDATE" "$WORK_DIR/candidate.jsonl"

uv run python scripts/parity_diff.py \
  "$WORK_DIR/baseline.jsonl" "$WORK_DIR/candidate.jsonl" \
  --epsilon "$EPSILON"
```

**`scripts/parity_diff.py`**: takes the two JSONL files, compares `ocr_prob` PDF by PDF, and exits non-zero when the drift exceeds the threshold.

**Effort**: 1 day

---

## 4 · P1 performance & reliability (Weeks 3-6)

### 4.1 Router hot-path optimisation

**Today**: 49 ms/PDF (PRD target ≤ 10 ms). Over a 1 PB corpus that is 10+ wasted CPU hours.

**Optimisation targets** (profile first, change second; the P0 tests must be in place):

#### (a) Drop the pandas DataFrame construction

```python
# ❌ today (packages/pdfsys-router/src/pdfsys_router/xgb_model.py)
df = pd.DataFrame([features])
names = getattr(self.model, "feature_names_in_", None)
if names is not None:
    df = df.reindex(columns=list(names), fill_value=0)
probs = self.model.predict_proba(df)

# ✅ optimised: cache the column order + feed a numpy array
class XgbRouterModel:
    def __init__(self, path):
        self._feature_order: list[str] | None = None

    def predict_proba(self, features: dict[str, float]) -> float:
        if self._feature_order is None:
            self._feature_order = list(self.model.feature_names_in_)
        arr = np.fromiter(
            (features.get(k, 0.0) for k in self._feature_order),
            dtype=np.float32, count=len(self._feature_order),
        ).reshape(1, -1)
        return float(self.model.predict_proba(arr)[0, 1])
```

Estimate: ~15 ms → ~2 ms.

#### (b) De-duplicate PyMuPDF text reads

`_get_garbled_text_per_page` calls `get_text()` on every page, and `compute_features_per_chunk` then reads the sampled pages a second time, so the same page is read twice.
Fix: cache a `page → text` dict while reading the sampled-page texts, and reuse it. Estimate: ~25 ms → ~12 ms.
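One possible shape for that cache (a sketch; `PageTextCache` is a hypothetical helper, and the real call sites would be `_get_garbled_text_per_page` and the chunk feature pass):

```python
class PageTextCache:
    """Read each sampled page's text at most once per document.

    Works with any doc object whose pages expose get_text(), as
    PyMuPDF pages do.
    """

    def __init__(self, doc):
        self._doc = doc
        self._texts: dict[int, str] = {}

    def text(self, page_index: int) -> str:
        # First access hits the PDF; later accesses hit the dict.
        if page_index not in self._texts:
            self._texts[page_index] = self._doc[page_index].get_text()
        return self._texts[page_index]
```

Both feature passes then take the cache instead of the raw document, which halves the text-extraction cost without changing any feature value.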

#### (c) Early return

Hard errors like `is_encrypted` / `needs_pass` / `len(doc) == 0` should short-circuit before feature extraction.

**Acceptance**: the parity harness confirms `|diff(ocr_prob)| < 1e-6`; p50 ≤ 10 ms on OmniDocBench-100.

**Effort**: 2-3 days

---

### 4.2 Quality scorer batched inference

**Today**: 3.6 s per document; 100k documents ≈ 100 hours.

**Change**: turn `OcrQualityScorer.score_many` from a loop into a real batch:

```python
def score_many(self, texts: list[str], batch_size: int = 8) -> list[QualityScore]:
    self._ensure_loaded()
    torch = self._torch
    results: list[QualityScore] = []
    for i in range(0, len(texts), batch_size):
        batch = [t[:self.max_chars] or " " for t in texts[i:i + batch_size]]
        enc = self._tokenizer(
            batch, return_tensors="pt", truncation=True,
            max_length=self.max_tokens, padding=True,
        ).to(self._device)
        with torch.inference_mode():
            logits = self._model(**enc).logits.squeeze(-1)
        for j, text in enumerate(batch):
            score = max(0.0, min(3.0, float(logits[j].item())))
            results.append(QualityScore(
                score=score,
                num_chars=len(text),
                num_tokens=int(enc["attention_mask"][j].sum()),
                model=self.model_name,
            ))
    return results
```

**Companion change**: rework `pdfsys_bench.loop.run_loop` into "extract everything first → batch-score → fan back out to JSONL", preserving output order.

**Acceptance**: batch=8 achieves ≥ 3× the throughput of batch=1; per-sample score difference < 1e-3.

**Effort**: 3 days

---

### 4.3 structlog logging

**Today**: 12 `print(...)` call sites across the repo; no levels, no structure.

**Approach**: adopt `structlog` in every package except `pdfsys-core` (core stays zero-dependency):

```python
# packages/pdfsys-router/src/pdfsys_router/_log.py
import structlog
log = structlog.get_logger("pdfsys.router")

# usage:
log.info("classified", backend=decision.backend.value,
         ocr_prob=decision.ocr_prob, pdf=str(path),
         num_pages=decision.num_pages)
```

Production uses `JSONRenderer()` (easy for Grafana/ELK to ingest); dev uses `ConsoleRenderer()`.

**Effort**: 2 days

---

### 4.4 Prometheus metrics

**Minimal implementation**:

```python
# packages/pdfsys-bench/src/pdfsys_bench/_metrics.py
from prometheus_client import Counter, Histogram, start_http_server

router_decisions = Counter("pdfsys_router_decisions_total",
                           "Router decisions by backend", ["backend"])
router_latency = Histogram("pdfsys_router_duration_seconds",
                           "Router classification latency",
                           buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
extract_failures = Counter("pdfsys_extract_failures_total",
                           "Extraction failures", ["backend", "error_class"])

def enable_metrics_endpoint(port: int = 9000) -> None:
    start_http_server(port)
```

The `pdfsys-bench` CLI gains a `--metrics-port` flag.

**Effort**: 2 days

---

### 4.5 Error taxonomy + quarantine bucket

**Today**: failures are recorded as free-form strings like `extract_error: "classify_failed: X"`, which cannot be aggregated.

**Approach**: add `errors.py` to `pdfsys-core`:

```python
from enum import Enum

class ErrorClass(str, Enum):
    OPEN_FAILED = "open_failed"
    ENCRYPTED = "encrypted"
    EMPTY = "empty"
    CORRUPTED_STREAM = "corrupted_stream"
    FEATURE_EXTRACTION_FAILED = "feature_extraction_failed"
    MODEL_INFERENCE_FAILED = "model_inference_failed"
    OOM = "oom"
    UNKNOWN = "unknown"
```

`RouterDecision.error_class: ErrorClass` replaces the free-form string; the bench aggregates counts per class.

Quarantine bucket: `out/quarantine/<error_class>/<sha256>.json` keeps a record of each failure (path + error + full feature vector, **never the PDF itself**) for offline analysis.

**Effort**: 3 days

---

## 5 · P2 feature completion (Weeks 7-16)

### Dependency DAG

```
Layout Analyser (3.1) ──┬──► Pipeline Parser (3.2) ──┐
                        │                            │
                        └──► VLM Parser (3.4) ───────┼──► Stage-3 (3.5) ──► Stage-4 (3.6) ──► Stage-5 (3.7)
                                                     │
              ┌──► Stage-B Router (3.3) ─────────────┘
              │
   (reads LayoutCache)
```

### 5.1 Layout Analyser · P2-1

**Choice**: PP-DocLayoutV3 ONNX INT8 (~50 ms/page on CPU); docling-layout-heron can be plugged in later.

**Deliverables**:

```
packages/pdfsys-layout-analyser/src/pdfsys_layout_analyser/
├── __init__.py
├── analyser.py               # LayoutAnalyser main class
├── runners/
│   ├── pp_doclayoutv3.py     # ONNX runtime driver
│   └── heuristic.py          # bbox column-count clustering fallback
├── render.py                 # PDF page → PNG (adjustable DPI)
└── postprocess.py            # reading order + cross-column merging
```

**API**:

```python
class LayoutAnalyser:
    def __init__(self, config: LayoutConfig = LayoutConfig()): ...
    def analyse(self, pdf_path: str | Path) -> LayoutDocument: ...
    def analyse_with_cache(
        self, pdf_path: str | Path, cache: LayoutCache
    ) -> LayoutDocument: ...  # idempotent
```

**Acceptance**:
- mAP ≥ 0.85 on OmniDocBench-100
- CPU INT8 throughput ≥ 20 pages/s/core
- `LayoutDocument` roundtrips cleanly through `LayoutCache.save/load`
- empty / encrypted / corrupted PDFs never crash

**Effort**: 1 person · 10 days

---

### 5.2 Pipeline Parser · P2-2

**Choice**: RapidOCR (PaddleOCR's ONNX forward pass, no Paddle dependency).

**Deliverables**:

```
packages/pdfsys-parser-pipeline/src/pdfsys_parser_pipeline/
├── extract.py            # extract_doc / extract_doc_bytes
├── ocr_engine.py         # RapidOCR wrapper (lazy load)
├── region_processor.py   # dispatch by RegionType
├── image_cropper.py      # bbox → image crop
└── markdown_emitter.py   # region + OCR → Segment
```

**Core logic**:

```python
def extract_doc(pdf_path, *, layout_cache: LayoutCache) -> ExtractedDoc:
    layout = layout_cache.load_or_compute(pdf_path, analyser)
    segments = []
    for page in layout.pages:
        for region in page.regions:
            img = crop_region_from_pdf(pdf_path, page.index, region.bbox)
            text = ocr_engine.recognise(img, region.type)
            segments.append(Segment(
                index=len(segments),
                backend=Backend.PIPELINE,
                page_index=page.index,
                type=region.type,
                content=text,
                bbox=region.bbox,
                source_region_id=region.region_id,
            ))
    return ExtractedDoc(
        sha256=sha256_of_file(pdf_path),
        backend=Backend.PIPELINE,
        segments=tuple(segments),
        markdown=merge_segments_to_markdown(tuple(segments)),
        stats={"page_count": len(layout.pages)},
    )
```

**Acceptance**:
- Chinese character F1 ≥ 0.90 on the OmniDocBench scanned subset
- output schema isomorphic to `parser-mupdf` (guarded by `tests/contract/test_extracted_doc_schema.py`)
- CPU throughput ≥ 5 pages/s/core

**Effort**: 1 person · 12 days

---

### 5.3 Stage-B Router · P2-3

Make the current 4-line `decider.py` stub real:

```python
def decide_complex_vs_simple(
    layout: LayoutDocument, config: RouterConfig
) -> Backend:
    if not config.vlm_enabled:
        return Backend.PIPELINE
    if layout.has_complex_content:
        return Backend.VLM
    return Backend.PIPELINE
```

`Router._route()`: when `ocr_prob ≥ threshold`, first check the `LayoutCache`; on a hit, call `decide_complex_vs_simple`; on a miss, return `DEFERRED`.

**Effort**: 2 days

---

### 5.4 VLM Parser · P2-4

**Choice** (PRD §4.4): in production, MinerU 2.5-Pro 1.2B served via LMDeploy.

**Deliverables**:

```
packages/pdfsys-parser-vlm/src/pdfsys_parser_vlm/
├── extract.py
├── engines/
│   ├── mineru.py          # LMDeploy wrapper
│   └── paddleocr_vl.py    # alternative engine
├── batching.py            # dynamic batching
├── rendering.py           # high-DPI page rendering
└── fallback.py            # retry with smaller batch on OOM
```

**Key constraints**:
- model resident in the worker (lazy-loaded singleton)
- `max_batch_size=16, max_seq=8192` (PRD §4.4)
- over-long pages: a single page > 8192 tokens is split into two chunks by bbox clustering
- per-page OOM retries with a smaller batch ≤ 2 times, then goes to quarantine (see §4.5)

**Effort**: 1 person · 15 days (including getting LMDeploy working)

---

### 5.5 Stage-3 post-processing

Split out as a new package `packages/pdfsys-postproc/`:

```
├── reading_order.py    # cross-page merging, re-attach footnotes, fix interleaved double columns
├── paragraph_merge.py  # un-wrap folded lines + Chinese sentence segmentation
├── formula_norm.py     # KaTeX syntax check; on failure convert to image placeholder
├── table_norm.py       # HTML↔Markdown dual format, row/column validation
└── unicode_norm.py     # NFC + full/half-width unification + zero-width char cleanup
```
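A minimal sketch of what `unicode_norm.py` could do (the exact character sets below are assumptions, not the final spec):

```python
import unicodedata

# Zero-width characters commonly left behind by PDF extraction.
_ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalise(text: str) -> str:
    """NFC + full→half-width ASCII + zero-width character cleanup."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in _ZERO_WIDTH:
            continue
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:   # full-width ASCII block
            ch = chr(code - 0xFEE0)
        elif code == 0x3000:           # ideographic space
            ch = " "
        out.append(ch)
    return "".join(out)
```

Width folding is done by hand rather than via NFKC so that CJK compatibility characters outside the ASCII block are left untouched.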

**Effort**: 1 person · 10 days

---

### 5.6 Stage-4 quality / PII / MinHash dedup

Split out as `packages/pdfsys-quality/`, reusing datatrove's MinHash block (PRD §4.6.5):

```
├── lang_id.py       # GlotLID paragraph-level language ID
├── heuristic.py     # repeated n-grams, non-CJK ratio, line-length variance
├── edu_score.py     # Chinese EduScore (fastText → DeBERTa-v3-tiny)
├── pii.py           # regex + NER backstop
└── dedup/
    ├── exact.py     # md5 exact content dedup
    └── minhash.py   # datatrove MinHash LSH wrapper
```
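The `exact.py` half is trivial; a sketch (the `"text"` field name is an assumption):

```python
import hashlib


def content_key(text: str) -> str:
    """Stable md5 hex digest over the normalised document text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct text payload."""
    seen: set[str] = set()
    kept: list[dict] = []
    for doc in docs:
        key = content_key(doc["text"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

The MinHash half cannot be sketched this compactly: near-duplicate clustering across shards needs the global shuffle noted in the effort estimate, which is exactly why the datatrove wrapper is reused.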

**Effort**: 2 people · 3 weeks (cross-shard MinHash needs a global shuffle; this is the hardest part)

---

### 5.7 Stage-5 Parquet packaging

Split out as `packages/pdfsys-output/`:
- Parquet shards of ~1 GB each, zstd-compressed
- bucketed paths: `v1/lang=zh/source=arxiv/qb=high/shard-NNNNN.parquet`
- JSONL mirror + Markdown sample archive (0.1% per shard)
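The bucketed path convention fits in one helper (a sketch; `NNNNN` is read here as a 5-digit zero-padded shard index, and the actual write would go through pyarrow with zstd compression):

```python
def shard_path(
    lang: str,
    source: str,
    quality_bucket: str,
    shard_index: int,
    version: str = "v1",
) -> str:
    """Bucketed output path, e.g. v1/lang=zh/source=arxiv/qb=high/shard-00042.parquet."""
    return (
        f"{version}/lang={lang}/source={source}/"
        f"qb={quality_bucket}/shard-{shard_index:05d}.parquet"
    )
```

Using `key=value` path segments keeps the layout Hive-partition compatible, so downstream readers can filter by `lang` / `source` / `qb` without opening shards.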

**Effort**: 1 person · 5 days

---

## 6 · P3 scale & ecosystem (3-6 months)

| Item | Description | Effort |
|---|---|---|
| **datatrove integration** | wrap the existing stages as `datatrove.Block`s; native Slurm backend | 2-3 weeks |
| **Slurm / K8s runner** | new `pdfsys-runner` package with shard checkpoints + backpressure | 3-4 weeks |
| **Object-storage backends** | abstract an `FSBackend` protocol in `pdfsys-core`; support `file://` / `s3://` / `oss://` / `minio://` | 1-2 weeks |
| **Chinese EduScore training** | fastText → DeBERTa-v3-tiny classifier + data labelling | 4-6 weeks (incl. labelling) |
| **Vertical-text classics LoRA** | targeted LoRA fine-tune of MinerU 2.5 | 4-6 weeks (GPU-heavy) |

---

## 7 · Milestone timeline

| Milestone | Week | Marker |
|---|---|---|
| **M1 · Collaborative** | 2 | CI green; coverage targets met; lock file committed; parity harness gating |
| **M2 · Production-grade core** | 6 | router p50 ≤ 10 ms; scorer at 3× throughput; unified logs + metrics; aggregatable errors |
| **M3 · 6 stages connected** | 16 | a 10 GB dataset runs end to end; all three backends share an isomorphic schema |
| **M4 · PB-ready** | 24 | datatrove + Slurm runner; object-storage backends; TCO estimate committed |
| **M5 · v0.1 dataset** | 32 | first publishable TB-scale dataset + evaluation report |

---

## 8 · Quick Wins · startable within two weeks

If we can only pick the five highest-ROI items to start immediately:

1. **Write 15 unit tests for core / router / parser-mupdf** · 2 days · turns the invariants into machine-verifiable checks
2. **Configure ruff + pre-commit** · 0.5 day · sets a quality floor for new PRs
3. **Write `.github/workflows/ci.yml`** · 0.5 day · moves feedback from "at review time" to "at push time"
4. **Commit `uv.lock` + add dependency upper bounds** · 0.5 day · dependencies stop changing under us
5. **`scripts/check_parity.sh` + 10 sample PDFs in fixtures** · 2 days · router changes gate themselves automatically

Total: **5-6 working days** for every precondition of real collaboration. Strongly recommended as the first sprint.

---

## 9 · Risks and things we will not do

### Temptations to resist

- ❌ **Do not touch the stub implementations before P0**: without tests and the parity harness, every feature added accrues interest on technical debt
- ❌ **Do not replace PyMuPDF**: its engineering maturity for Chinese documents is first-tier; switching to pdfminer/PyPDF2 is an instant regression
- ❌ **Do not introduce LangChain / LlamaIndex**: this is a data-processing pipeline, not a RAG application
- ❌ **Do not bring pydantic into `pdfsys-core`**: the existing `dataclass(frozen=True, slots=True)` + `serde.py` is enough, and pydantic would break the zero-dependency invariant

### Long-term risks and responses

| Risk | Response |
|---|---|
| MinerU 2.5 licence changes in a new release | keep PaddleOCR-VL as a hot spare; make `pdfsys-parser-vlm` an engine abstraction |
| PyMuPDF AGPL restrictions | evaluate pikepdf / pdfplumber as fallbacks (low priority) |
| Runaway PB-scale object-storage cost | write a `scripts/tco.py` estimator during P0 |
| Insufficient Chinese PII recall | NER model as backstop; keep an audit table for after-the-fact remediation |

---

## 10 · Tracking progress

- **Short term (P0-P1)**: GitHub Projects / Milestones. One issue per sub-item, each with acceptance criteria.
- **Medium term (P2)**: open a tracking issue per stage to aggregate its sub-PRs; update `CHANGELOG.md` per SemVer.
- **Long term (P3)**: revisit the PRD §10 P0/P1/P2/P3 roadmap monthly and iterate this document as v0.N.

Progress status is maintained in the root `README.md` §What's implemented table; per the mapping in `.cursor/rules/03-doc-sync.mdc`, any stage flipping from ❌ to ✅ must update that table.

---

## Appendix · Totals at a glance

| Phase | Duration | Core deliverables | People |
|---|---|---|---|
| **P0 engineering foundations** | 2 weeks | pytest + ruff + CI + lock + parity | 1 |
| **P1 performance/reliability** | 4 weeks | router 5×, scorer 3×, logs/metrics | 1-2 |
| **P2 feature completion** | 10-12 weeks | 6-stage closed loop | 2-3 |
| **P3 scale** | 3-6 months | datatrove + Slurm + PB-scale runs | 3-4 |

From zero to "PB-ready" is roughly 24 weeks and 20-30 person-weeks in total. That matches the PRD §6 resource budget of "100 × A100 + 32 CPU nodes, ~2 months wall clock": **build the toolchain first, then plug in the big compute**.
packages/pdfsys-router/src/pdfsys_router/download_weights.py CHANGED
@@ -12,39 +12,51 @@ Usage::
 
 from __future__ import annotations
 
+import socket
 import sys
+import urllib.error
 import urllib.request
 from pathlib import Path
 
-# media.githubusercontent.com serves the actual LFS payload directly,
-# bypassing the pointer file that raw.githubusercontent.com returns.
-WEIGHTS_URL = (
-    "https://media.githubusercontent.com/media/huggingface/finepdfs/main/"
-    "blocks/predictor/xgb.ubj"
-)
+# GitHub raw download URLs for the XGBoost router weights, tried in order.
+WEIGHTS_URLS = [
+    "https://github.com/huggingface/finepdfs/raw/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+    "https://raw.githubusercontent.com/huggingface/finepdfs/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+]
 
 
 def target_path() -> Path:
     return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
 
 
-def download(force: bool = False) -> Path:
+def download(force: bool = False, timeout: int = 30) -> Path:
     dst = target_path()
     if dst.exists() and not force:
         print(f"[download_weights] already present: {dst}")
         return dst
     dst.parent.mkdir(parents=True, exist_ok=True)
-    print(f"[download_weights] fetching {WEIGHTS_URL}")
-    with urllib.request.urlopen(WEIGHTS_URL) as r:  # noqa: S310 — pinned URL
-        data = r.read()
-    if len(data) < 10_000:
-        raise RuntimeError(
-            f"downloaded blob is suspiciously small ({len(data)} bytes) — "
-            "likely an LFS pointer, not the binary"
-        )
-    dst.write_bytes(data)
-    print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
-    return dst
+
+    last_error = None
+    for url in WEIGHTS_URLS:
+        print(f"[download_weights] fetching {url}")
+        try:
+            # Bound each attempt so a hung mirror cannot stall the script.
+            with urllib.request.urlopen(url, timeout=timeout) as r:  # noqa: S310 — pinned URL
+                data = r.read()
+            if len(data) < 10_000:
+                raise RuntimeError(
+                    f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+                    "likely an LFS pointer, not the binary"
+                )
+            dst.write_bytes(data)
+            print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+            return dst
+        except (urllib.error.URLError, socket.timeout) as e:
+            last_error = e
+            print(f"[download_weights] failed for {url}: {e}")
+            continue
+
+    raise RuntimeError(f"Failed to download weights from all URLs: {last_error}")
 
 
 if __name__ == "__main__":
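The mirror-fallback loop added above can be exercised offline by injecting the fetch step as a callable. The sketch below mirrors that control flow; the name `fetch_first` and the fake fetcher are illustrative, not part of the package API.

```python
# Sketch of the try-each-mirror logic from download_weights.py, with the HTTP
# fetch injected so the fallback behavior can be tested without a network.
from typing import Callable, Sequence


def fetch_first(urls: Sequence[str], fetch: Callable[[str], bytes],
                min_size: int = 10_000) -> bytes:
    """Try each URL in order; return the first payload that looks like a real binary."""
    last_error: Exception | None = None
    for url in urls:
        try:
            data = fetch(url)
        except OSError as e:  # urllib.error.URLError subclasses OSError
            last_error = e
            continue  # network failure: fall through to the next mirror
        if len(data) < min_size:
            # A Git LFS pointer file is ~130 bytes; the real model is far larger.
            raise RuntimeError(f"suspiciously small blob from {url}: {len(data)} bytes")
        return data
    raise RuntimeError(f"all URLs failed: {last_error}")
```

Routing network failures to the next mirror while failing fast on a too-small payload keeps a bad LFS pointer from being silently written over the model file.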
requirements.txt ADDED
@@ -0,0 +1,21 @@
+# Hugging Face Spaces installs from this file.
+# Note: Local workspace packages (pdfsys-*) are loaded via sys.path in demo/app.py
+# and do not need editable installation in HF Spaces.
+
+# --- Python 3.13 compatibility (audioop removed) --------------------------
+audioop-lts
+
+# --- CPU-only torch (HF Spaces free tier is CPU) --------------------------
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch>=2.1,<3.0
+
+# --- Third-party runtime deps -------------------------------------------
+gradio==5.12.0
+huggingface-hub>=0.26,<0.29
+pymupdf>=1.24
+xgboost>=2.0
+scikit-learn>=1.3
+pandas>=2.0
+numpy>=1.26
+transformers>=4.44
+pillow>=10.0