jieluo1024 committed · Commit 00b2f48 · 1 Parent(s): b8ca6f2

feat: update XGBoost weights URL and add Gradio demo

- Fix XGBoost router weights download URL to use GitHub raw links
- Add timeout and fallback URLs for model download
- Add Gradio demo interface (demo/app.py, demo/pipeline.py)
- Add app.py entry point for HuggingFace Spaces
- Add requirements.txt for dependencies

.gitignore CHANGED
@@ -7,6 +7,8 @@ __pycache__/
.eggs/
build/
dist/
+ .cursor/
+ scripts/

# uv / virtualenv
.venv/
@@ -38,3 +40,8 @@ models/
.idea/
.vscode/
*.swp
+
+ # Gradio / HF Spaces runtime artifacts
+ flagged/
+ gradio_cached_examples/
+ .gradio/
README.md CHANGED
@@ -1,8 +1,25 @@
+ ---
+ title: PDFSystem MNBVC Demo
+ emoji: 📄
+ colorFrom: green
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ short_description: FinePDFs-style PDF pipeline demo for MNBVC
+ ---
+
# pdfsys-mnbvc

PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.

+ > **Try it:** `python app.py` locally, or deploy to Hugging Face Spaces with one click
+ > — the YAML header above is all the Space config needed. See [`demo/README.md`](demo/README.md)
+ > for both paths.
+
## Current status: MVP closed loop ✅

The first end-to-end path — **Router → MuPDF parser → OCR quality scorer** — is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).
@@ -221,7 +238,64 @@ A companion `.summary.json` file is also written with aggregate statistics.

## Docs

- - `docs/PRD.md` — full PRD with resource budgets and roadmap.
+ - [`docs/PRD.md`](docs/PRD.md) — full PRD with resource budgets and architectural rationale (the "what & why").
+ - [`docs/ROADMAP.md`](docs/ROADMAP.md) — prioritised implementation plan with work estimates and acceptance criteria (the "how & when").
+ - [`CONTRIBUTING.md`](CONTRIBUTING.md) — naming, parity rules, commit scopes.
+ - [`demo/README.md`](demo/README.md) — Gradio demo + Hugging Face Spaces deploy guide.
+
+ ## Collaborating with Cursor
+
+ This repo ships a full set of [Cursor project rules](https://docs.cursor.com/context/rules) under `.cursor/rules/`. They give the AI agent the same mental model senior contributors have — including the non-obvious bits (FinePDFs feature parity, the `pdfsys-core` zero-dep rule, Gradio UI/logic separation) that a new collaborator would otherwise trip over.
+
+ ### Quick start
+
+ ```bash
+ # One-shot bootstrap: checks python/uv, syncs workspace, downloads router weights.
+ bash scripts/setup_cursor.sh
+ ```
+
+ Then open the repo in Cursor (≥ 0.50, which supports `.cursor/rules/*.mdc`). The always-on rules activate immediately; file-specific rules attach as you open matching files.
+
+ ### Active rules
+
+ | Rule | Scope | What it enforces |
+ |------|-------|------------------|
+ | `00-project-context.mdc` | always | Project goals, tech stack, must-read docs, explicit non-goals. |
+ | `01-architecture-invariants.mdc` | always | 7 load-bearing invariants (zero-dep core, stateless processing, normalized bbox, etc.). |
+ | `02-commit-workflow.mdc` | always | Conventional commits with package-scoped names; pre-commit checklist. |
+ | `03-doc-sync.mdc` | always | Doc-sync mapping table: which code change forces which doc update. Cursor proactively scans after edits. |
+ | `10-python-standards.mdc` | `**/*.py` | Type hints, frozen dataclasses, lazy imports for heavy deps. |
+ | `20-core-contracts.mdc` | `packages/pdfsys-core/**` | Zero external deps; no I/O; schema-change ripple rules. |
+ | `21-router-parity.mdc` | `packages/pdfsys-router/**` | FinePDFs 124-feature parity is sacred; how to verify it. |
+ | `22-parser-backends.mdc` | `packages/pdfsys-parser-*/**` | All three backends must emit identical `ExtractedDoc`. |
+ | `23-bench-scorer.mdc` | `packages/pdfsys-bench/**` | torch/transformers lazy-load; bf16 default; the loop never raises. |
+ | `30-gradio-demo.mdc` | `demo/**,app.py` | UI layer has no business logic; callbacks never raise; lazy singletons. |
+
+ ### Recommended Cursor workflow
+
+ 1. **Before touching `pdfsys-core`** — read `20-core-contracts.mdc`. The AI will refuse to add third-party deps here and will surface schema-ripple questions.
+ 2. **Before touching `feature_extractor.py`** — `21-router-parity.mdc` kicks in; the AI will suggest running the parity check before you commit.
+ 3. **When building a new parser backend** — `22-parser-backends.mdc` walks through the 6-step addition procedure and refuses partial implementations.
+ 4. **When writing demo UI** — `30-gradio-demo.mdc` rejects `import pymupdf` in `demo/app.py` (it belongs in `demo/pipeline.py`).
+
+ ### Authoring new rules
+
+ Rules live in `.cursor/rules/*.mdc`. Format:
+
+ ```yaml
+ ---
+ description: Short description shown in the rule picker
+ globs: packages/<pkg>/**/*.py  # omit for always-on rules
+ alwaysApply: false             # true = always loaded
+ ---
+
+ # Rule Title
+
+ - Bullet rule 1 (with ✅/❌ example)
+ - Bullet rule 2
+ ```
+
+ Keep each rule under 100 lines, one concern per file. See existing rules for patterns.

## License

app.py ADDED
@@ -0,0 +1,25 @@
+ """Hugging Face Spaces entry point.
+
+ HF Spaces looks for ``app.py`` at the repo root. We just import the
+ actual app from ``demo/`` so the demo code stays tucked away and the
+ root stays uncluttered.
+ """
+
+ from __future__ import annotations
+
+ import sys
+ from pathlib import Path
+
+ _DEMO_DIR = Path(__file__).resolve().parent / "demo"
+ sys.path.insert(0, str(_DEMO_DIR))
+
+ from app import demo  # noqa: E402,F401 — re-exported for HF Spaces
+
+ if __name__ == "__main__":
+     import os
+
+     demo.queue(max_size=8).launch(
+         server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
+         server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
+         show_api=False,
+     )
demo/README.md ADDED
@@ -0,0 +1,102 @@
+ # pdfsys-mnbvc · Gradio Demo
+
+ A small, self-contained Gradio app that runs the **actually-implemented** MVP
+ path of the pdfsys-mnbvc pipeline on a single PDF you upload.
+
+ It exercises the same three components the bench harness does:
+
+ 1. **Stage-A XGBoost router** (`pdfsys_router.Router`) — 124 PyMuPDF features → `ocr_prob` → one of `mupdf / pipeline / vlm / deferred`.
+ 2. **MuPDF fast path** (`pdfsys_parser_mupdf.extract_doc`) — runs only when the router picks `mupdf`. Emits `Segment[]` with normalized bboxes plus a merged Markdown blob.
+ 3. **ModernBERT OCR quality scorer** (`pdfsys_bench.quality.OcrQualityScorer`) — optional and heavy; gated behind a checkbox.
+
+ PIPELINE / VLM / DEFERRED backends are currently stubs in the repo, so the
+ demo surfaces the router decision and skips extraction for them.
+
+ ## UI
+
+ ```
+ ┌─────────────────┬──────────────────────────────────────────────────┐
+ │ upload PDF      │  Summary · backend · P(ocr) · pages · timing     │
+ │ threshold       ├──────────────────────────────────────────────────┤
+ │ ☐ quality       │  [ Page preview │ Markdown │ Segments │          │
+ │ [Run Pipeline]  │    Router features │ Raw JSON ]                  │
+ │                 │                                                  │
+ │ pipeline        │  Page preview draws extracted bboxes (color =    │
+ │ diagram         │  chosen backend) directly on the first page.     │
+ └─────────────────┴──────────────────────────────────────────────────┘
+ ```
+
+ ## Run locally
+
+ ```bash
+ # option A — full workspace install (recommended)
+ uv sync                                   # installs all packages + deps
+ python -m pdfsys_router.download_weights  # one-time: XGBoost weights (257 KB)
+ python app.py                             # http://localhost:7860
+
+ # option B — plain pip (matches HF Spaces)
+ pip install -r requirements.txt
+ python -m pdfsys_router.download_weights
+ python app.py
+ ```
+
+ The first run of the quality scorer pulls `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`
+ (~800 MB) from the HF Hub. Set `HF_HOME=/path/to/cache` to control where it lands.
+
+ ## Deploy to Hugging Face Spaces
+
+ The root `README.md` already contains the required [Spaces YAML config](https://huggingface.co/docs/hub/spaces-config-reference):
+
+ ```yaml
+ ---
+ title: PDFSystem MNBVC Demo
+ sdk: gradio
+ sdk_version: 5.12.0
+ app_file: app.py
+ license: apache-2.0
+ ---
+ ```
+
+ ### Option 1 · One-click from GitHub (recommended)
+
+ 1. Push this repo to GitHub.
+ 2. Go to <https://huggingface.co/new-space>.
+ 3. Pick the **Gradio** SDK; **CPU basic** hardware is enough for the MVP loop.
+ 4. In **Files** → **Create Space from an existing GitHub repo**, paste the repo URL.
+
+ HF Spaces will clone the whole repo, read the YAML header in the root
+ `README.md`, install `requirements.txt`, and launch `app.py`. The router's
+ XGBoost weights are downloaded automatically on the first request (~257 KB,
+ cached inside the Space container).
+
+ ### Option 2 · Manual push
+
+ ```bash
+ git clone https://huggingface.co/spaces/<you>/pdfsys-mnbvc-demo
+ cd pdfsys-mnbvc-demo
+ # copy repo contents into this dir (the four workspace packages must come
+ # along — they are installed editable by requirements.txt)
+ cp -r /path/to/pdfsystem_mnbvc/{app.py,requirements.txt,README.md,packages,demo} .
+ git add . && git commit -m "Initial deploy" && git push
+ ```
+
+ ### Resource notes (HF Spaces free tier: CPU, 16 GB RAM)
+
+ - Router: ~50–100 ms per PDF; effectively free.
+ - MuPDF extraction: ~10 ms per page.
+ - Quality scorer (ModernBERT-large): ~3–5 s per PDF at bf16; fits in RAM.
+   Disabled by default in the UI. **Keep it off** unless you want to wait.
+ - GPU Spaces aren't required; the MVP path is CPU-only. A GPU Space becomes
+   useful once the Pipeline / VLM parsers land.
+
+ ## Files
+
+ | Path | Role |
+ | ---- | ---- |
+ | `demo/app.py` | Gradio `Blocks` definition + event handlers. |
+ | `demo/pipeline.py` | Pure-Python wrapper around `Router` + `extract_doc` + `OcrQualityScorer`. Rendering helpers live here too. |
+ | `app.py` (repo root) | Thin HF Spaces entry point; imports `demo.app`. |
+ | `requirements.txt` (repo root) | Pin-friendly deps for `pip install -r`. Installs the four workspace packages in editable mode. |
+
+ The demo imports the real pipeline modules — if you change `pdfsys-router`
+ or `pdfsys-parser-mupdf`, the demo picks it up on the next launch.
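
The "normalized bboxes" that `Segment[]` carries (shown as `bbox_norm` in the Segments tab) can be sketched in isolation. A minimal illustration, not the repo's actual helper; the US-Letter page size in points is just an example:

```python
def normalize_bbox(rect: tuple[float, float, float, float],
                   page_w: float, page_h: float) -> tuple[float, ...]:
    """Map an absolute PDF-space rect (x0, y0, x1, y1) to [0, 1] coordinates."""
    x0, y0, x1, y1 = rect
    return (round(x0 / page_w, 4), round(y0 / page_h, 4),
            round(x1 / page_w, 4), round(y1 / page_h, 4))

# A 1-inch margin text block on a US-Letter page (612 x 792 pt):
print(normalize_bbox((72, 72, 540, 720), 612, 792))
# → (0.1176, 0.0909, 0.8824, 0.9091)
```

Normalizing by page size is what lets the demo draw the same boxes on any render resolution of the page preview.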
demo/app.py ADDED
@@ -0,0 +1,377 @@
+ """Gradio demo for the pdfsys-mnbvc MVP pipeline.
+
+ What this demonstrates (matching the code that actually exists in the
+ repo today, not the aspirational PRD):
+
+ * Stage-A XGBoost router — decides text-ok vs needs-ocr from 124
+   PyMuPDF-derived features.
+ * MuPDF fast path — extracts Markdown-ready segments when the router
+   picks ``Backend.MUPDF``. Overlaid on the first page as colored bboxes.
+ * ModernBERT OCR quality scorer — optional, heavy (~800 MB download,
+   3–5 s per doc on CPU). Off by default to keep the demo snappy.
+
+ PIPELINE / VLM / DEFERRED backends are surfaced through the router
+ decision but are still stubs in ``packages/pdfsys-parser-*``; the UI
+ just reports the routing choice in that case and skips extraction.
+
+ Runs locally (``python demo/app.py``) and as a Hugging Face Space (see
+ the repo-root ``README.md`` frontmatter and ``demo/README.md``).
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import sys
+ import tempfile
+ import traceback
+ from pathlib import Path
+
+ import gradio as gr
+
+ # Allow ``python demo/app.py`` without installing the workspace by falling
+ # back to the in-tree sources. When running under HF Spaces / uv sync the
+ # packages are already on sys.path and these inserts become no-ops.
+ _REPO_ROOT = Path(__file__).resolve().parent.parent
+ for pkg in ("pdfsys-core", "pdfsys-router", "pdfsys-parser-mupdf", "pdfsys-bench"):
+     src = _REPO_ROOT / "packages" / pkg / "src"
+     if src.is_dir() and str(src) not in sys.path:
+         sys.path.insert(0, str(src))
+
+ from pipeline import (  # noqa: E402 — must come after sys.path surgery
+     PipelineResult,
+     pick_curated_features,
+     render_first_page_with_bboxes,
+     run_pipeline,
+ )
+
+
+ # ------------------------------------------------------------------ constants
+
+ DESCRIPTION = """\
+ # PDFSystem-MNBVC · Pipeline Demo
+
+ **FinePDFs-inspired PB-scale PDF → pretraining-data pipeline**, adapted
+ for the Chinese MNBVC corpus. This demo shows the MVP closed loop that
+ is actually implemented in the repo today:
+
+ **Router (XGBoost, 124 features)** → **MuPDF fast path** → **OCR Quality Scorer (ModernBERT)**
+
+ The router decides whether a PDF is cheap to parse with PyMuPDF alone,
+ or whether it needs to go to the (still-stubbed) OCR / VLM backends.
+ Roughly 90% of a typical PDF corpus takes the green fast-path lane.
+ """
+
+ PIPELINE_DIAGRAM_MD = """\
+ ### Pipeline
+
+ ```
+             ┌────────────────┐
+  PDF ──────►│    Stage-A     │  XGBoost · ~10 ms/PDF
+             │     Router     │  124 PyMuPDF features
+             └────────┬───────┘
+                      │ ocr_prob
+        ┌─────────────┼─────────────┐
+        ▼             ▼             ▼
+     MUPDF        PIPELINE    VLM / DEFERRED
+   (text-ok)     (OCR, stub)   (VLM, stub)
+
+  PyMuPDF blocks ─► Markdown + Segments (with bboxes)
+
+  ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
+ ```
+
+ **Backend color legend on page preview**
+
+ - 🟢 `mupdf` — text-ok fast path (implemented)
+ - 🟠 `pipeline` — OCR lane (stub, routing only)
+ - 🟣 `vlm` — VLM lane (stub, routing only)
+ - ⚪ `deferred` — held back until VLM workers online
+ """
+
+
+ def _safe(val, default=""):
+     """Coerce NaN / None for Gradio components that don't like them."""
+     if val is None:
+         return default
+     try:
+         import math
+
+         if isinstance(val, float) and math.isnan(val):
+             return default
+     except Exception:
+         pass
+     return val
+
+
+ # ------------------------------------------------------------------ handlers
+
+
+ def process_pdf(
+     pdf_file: str | None,
+     run_quality: bool,
+     ocr_threshold: float,
+     progress: gr.Progress = gr.Progress(),
+ ):
+     """Main Gradio callback. Returns one value per output component."""
+     empty_segments = [[0, 0, "-", "-", 0, ""]]
+     empty_features = [["(no PDF uploaded)", ""]]
+     empty_summary = "Upload a PDF to get started."
+
+     if not pdf_file:
+         return (
+             empty_summary,
+             "", 0.0, 0, "", 0.0,
+             None,
+             "_No markdown yet._",
+             empty_segments,
+             empty_features,
+             {},
+         )
+
+     pdf_path = Path(pdf_file)
+
+     try:
+         progress(0.1, desc="Routing (XGBoost)…")
+         result: PipelineResult = run_pipeline(
+             pdf_path,
+             run_quality=run_quality,
+             ocr_threshold=ocr_threshold,
+         )
+
+         progress(0.7, desc="Rendering first page…")
+         preview = render_first_page_with_bboxes(pdf_path, result, page_index=0)
+
+     except Exception as e:  # noqa: BLE001
+         tb = traceback.format_exc()
+         err_json = {"error": str(e), "traceback": tb.splitlines()[-6:]}
+         return (
+             f"**Failed:** `{e}`",
+             "", 0.0, 0, "", 0.0,
+             None,
+             f"```\n{tb}\n```",
+             empty_segments,
+             empty_features,
+             err_json,
+         )
+
+     # ------------------------------------------------------------- summary
+     lines = [
+         f"**File:** `{pdf_path.name}` ({pdf_path.stat().st_size / 1024:.1f} KB)",
+         f"**Routed to:** `{result.backend}` &nbsp;·&nbsp; "
+         f"P(ocr) = **{result.ocr_prob:.3f}** &nbsp;·&nbsp; {result.num_pages} page(s)",
+     ]
+     flags = []
+     if result.is_form:
+         flags.append("is_form")
+     if result.is_encrypted:
+         flags.append("encrypted")
+     if result.needs_password:
+         flags.append("password-protected")
+     if result.garbled_text_ratio > 0.01:
+         flags.append(f"garbled_text_ratio={result.garbled_text_ratio:.2%}")
+     if flags:
+         lines.append("**Flags:** " + ", ".join(f"`{f}`" for f in flags))
+     if result.router_error:
+         lines.append(f"**Router error:** `{result.router_error}`")
+     if result.extract_error:
+         lines.append(f"**Extract error:** `{result.extract_error}`")
+     if result.quality_error:
+         lines.append(f"**Quality error:** `{result.quality_error}`")
+
+     if result.backend == "mupdf" and not result.extract_error:
+         stats = result.extract_stats
+         lines.append(
+             f"**Extracted:** {stats.get('segment_count', 0)} segments, "
+             f"{stats.get('char_count', 0):,} chars "
+             f"(pages {stats.get('pages_extracted', 0)}/{stats.get('page_count', 0)})"
+         )
+     else:
+         lines.append(
+             "_MuPDF extraction skipped — backend is not `mupdf`. "
+             "PIPELINE/VLM backends are still stubs in this repo._"
+         )
+
+     if result.quality_score is not None:
+         lines.append(
+             f"**OCR quality:** **{result.quality_score:.2f}** / 3.0 "
+             f"({result.quality_num_tokens} tokens, `{result.quality_model}`)"
+         )
+
+     lines.append(
+         f"**Timing (ms):** router **{result.wall_ms_router:.0f}** · "
+         f"extract **{result.wall_ms_extract:.0f}** · "
+         f"quality **{result.wall_ms_quality:.0f}**"
+     )
+     summary_md = "\n\n".join(lines)
+
+     # ------------------------------------------------------------- markdown
+     md_text = result.markdown.strip() or "_No markdown — this PDF was not routed to MuPDF._"
+     if len(md_text) > 20_000:
+         md_text = md_text[:20_000] + "\n\n…\n\n**[truncated for UI — full Markdown in the JSON tab]**"
+
+     # ------------------------------------------------------------- segments
+     seg_rows = [
+         [s["index"], s["page"], s["type"], str(s["bbox_norm"]), s["chars"], s["preview"]]
+         for s in result.segments
+     ] or empty_segments
+
+     # ------------------------------------------------------------- features
+     feat_rows = pick_curated_features(result.router_features) or empty_features
+
+     # ------------------------------------------------------------- raw JSON
+     raw = result.to_record()
+     raw["router_features_full"] = result.router_features
+     raw["segments_full"] = result.segments
+
+     return (
+         summary_md,
+         result.backend,
+         float(result.ocr_prob) if result.ocr_prob == result.ocr_prob else 0.0,
+         int(result.num_pages),
+         ("-" if result.quality_score is None else f"{result.quality_score:.2f} / 3.0"),
+         float(result.wall_ms_router + result.wall_ms_extract + result.wall_ms_quality),
+         preview,
+         md_text,
+         seg_rows,
+         feat_rows,
+         raw,
+     )
+
+
+ # ---------------------------------------------------------------------- UI
+
+ CSS = """
+ .small-num input { font-weight: 600; font-size: 1.1rem; }
+ footer { display: none !important; }
+ """
+
+
+ def build_demo() -> gr.Blocks:
+     # theme/css belong to the Blocks constructor, not launch().
+     with gr.Blocks(
+         title="PDFSystem-MNBVC Demo",
+         theme=gr.themes.Soft(primary_hue="emerald"),
+         css=CSS,
+     ) as demo:
+         gr.Markdown(DESCRIPTION)
+
+         with gr.Row():
+             # -------------------- left column: controls + diagram
+             with gr.Column(scale=1, min_width=320):
+                 pdf_input = gr.File(
+                     label="Upload a PDF",
+                     file_types=[".pdf"],
+                     type="filepath",
+                 )
+                 with gr.Accordion("Options", open=True):
+                     ocr_threshold = gr.Slider(
+                         0.0, 1.0, value=0.5, step=0.05,
+                         label="OCR probability threshold",
+                         info="ocr_prob ≥ threshold ⇒ route off the MuPDF fast path",
+                     )
+                     run_quality = gr.Checkbox(
+                         label="Run ModernBERT quality scorer",
+                         value=False,
+                         info="~3–5 s on CPU. First run downloads ~800 MB.",
+                     )
+                 run_btn = gr.Button("Run Pipeline", variant="primary", size="lg")
+                 gr.Markdown(PIPELINE_DIAGRAM_MD)
+
+             # -------------------- right column: outputs
+             with gr.Column(scale=2, min_width=520):
+                 summary_md = gr.Markdown(
+                     "Upload a PDF and click **Run Pipeline**.",
+                     label="Summary",
+                 )
+
+                 with gr.Row():
+                     backend_out = gr.Textbox(
+                         label="Backend", interactive=False, elem_classes=["small-num"]
+                     )
+                     ocr_prob_out = gr.Number(
+                         label="P(OCR)", interactive=False, precision=3,
+                         elem_classes=["small-num"],
+                     )
+                     pages_out = gr.Number(
+                         label="Pages", interactive=False,
+                         elem_classes=["small-num"],
+                     )
+                     quality_out = gr.Textbox(
+                         label="Quality", interactive=False,
+                         elem_classes=["small-num"],
+                     )
+                     wall_ms_out = gr.Number(
+                         label="Total ms", interactive=False, precision=0,
+                         elem_classes=["small-num"],
+                     )
+
+                 with gr.Tabs():
+                     with gr.Tab("Page preview"):
+                         preview_img = gr.Image(
+                             label="First page with extracted bboxes",
+                             type="pil",
+                             interactive=False,
+                             height=720,
+                         )
+                     with gr.Tab("Markdown"):
+                         md_out = gr.Markdown()
+                     with gr.Tab("Segments"):
+                         seg_df = gr.Dataframe(
+                             headers=["idx", "page", "type", "bbox_norm", "chars", "preview"],
+                             datatype=["number", "number", "str", "str", "number", "str"],
+                             wrap=True,
+                             label="Extracted segments (one row per block)",
+                         )
+                     with gr.Tab("Router features"):
+                         feat_df = gr.Dataframe(
+                             headers=["feature", "value"],
+                             datatype=["str", "str"],
+                             label="Curated subset (full 124-dim vector in Raw JSON)",
+                         )
+                     with gr.Tab("Raw JSON"):
+                         raw_json = gr.JSON(label="All pipeline outputs")
+
+         # ----------------------------------------------------------- wiring
+         outputs = [
+             summary_md,
+             backend_out, ocr_prob_out, pages_out, quality_out, wall_ms_out,
+             preview_img,
+             md_out,
+             seg_df,
+             feat_df,
+             raw_json,
+         ]
+         run_btn.click(
+             process_pdf,
+             inputs=[pdf_input, run_quality, ocr_threshold],
+             outputs=outputs,
+         )
+         # Auto-run on file upload (with quality off for snappiness).
+         pdf_input.upload(
+             lambda f, t: process_pdf(f, False, t),
+             inputs=[pdf_input, ocr_threshold],
+             outputs=outputs,
+         )
+
+         gr.Markdown(
+             "---\n"
+             "Repo: [pdfsystem_mnbvc](https://github.com/) · "
+             "Architecture: [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) · "
+             "Router weights: FinePDFs upstream (Apache-2.0) · "
+             "Quality model: `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`"
+         )
+
+     return demo
+
+
+ demo = build_demo()
+
+
+ if __name__ == "__main__":
+     # Sensible defaults for both local dev and HF Spaces.
+     server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
+     server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
+     demo.queue(max_size=8).launch(
+         server_name=server_name,
+         server_port=server_port,
+     )
+ )
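One idiom worth a note: `process_pdf` above and `run_pipeline` below both use `x == x` as a dependency-free NaN guard. It works because NaN is the only float value that compares unequal to itself:

```python
import math

nan = float("nan")
assert nan != nan             # NaN never equals itself (IEEE 754)
assert (nan == nan) is False  # so `x == x` is False only for NaN
assert math.isnan(nan)        # the explicit stdlib spelling of the same check

x = 0.42
assert x == x                 # any ordinary float passes the guard
```

Using the comparison instead of `math.isnan` saves an import inside hot callbacks; either spelling would do.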
demo/pipeline.py ADDED
@@ -0,0 +1,311 @@
+ """End-to-end wiring used by the Gradio demo.
+
+ Wraps the three production-path components in one callable:
+
+     Router (Stage-A XGBoost)
+       └─► Backend.MUPDF  → pdfsys_parser_mupdf.extract_doc
+       └─► anything else  → not extracted (Pipeline/VLM/Deferred are
+                            still stubs in this repo; we surface the
+                            router decision and stop).
+
+ Kept deliberately Gradio-free so the same code is unit-testable and
+ reusable from notebooks. ``app.py`` only imports :func:`run_pipeline`
+ and :func:`render_first_page_with_bboxes`.
+ """
+
+ from __future__ import annotations
+
+ import io
+ import time
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any
+
+ import pymupdf
+ from PIL import Image, ImageDraw
+
+
+ # ------------------------------------------------------------------ singletons
+
+ _ROUTER: Any = None
+ _SCORER: Any = None
+
+
+ def _ensure_router_weights() -> None:
+     """Make sure the XGBoost weights are on disk. No-op if already present."""
+     from pdfsys_router.download_weights import download, target_path
+
+     if not target_path().is_file():
+         download()
+
+
+ def get_router(ocr_threshold: float = 0.5):
+     """Lazy-load the singleton Router. Weights download on first call."""
+     global _ROUTER
+     _ensure_router_weights()
+     from pdfsys_router import Router
+
+     if _ROUTER is None or abs(_ROUTER.ocr_threshold - ocr_threshold) > 1e-9:
+         _ROUTER = Router(ocr_threshold=ocr_threshold)
+     return _ROUTER
+
+
+ def get_scorer():
+     """Lazy-load the singleton ModernBERT quality scorer (~800 MB download)."""
+     global _SCORER
+     if _SCORER is None:
+         from pdfsys_bench.quality import OcrQualityScorer
+
+         _SCORER = OcrQualityScorer()
+     return _SCORER
+
+
+ # ------------------------------------------------------------------ data class
+
+
+ @dataclass(slots=True)
+ class PipelineResult:
+     """Everything the UI needs in one flat object."""
+
+     # Router
+     backend: str
+     ocr_prob: float
+     num_pages: int
+     is_form: bool
+     garbled_text_ratio: float
+     is_encrypted: bool
+     needs_password: bool
+     router_error: str | None
+     router_features: dict[str, Any] = field(default_factory=dict)
+
+     # Extract (only when backend == mupdf)
+     sha256: str | None = None
+     segments: list[dict[str, Any]] = field(default_factory=list)
+     markdown: str = ""
+     extract_stats: dict[str, Any] = field(default_factory=dict)
+     extract_error: str | None = None
+
+     # Quality
+     quality_score: float | None = None
+     quality_num_tokens: int | None = None
+     quality_model: str | None = None
+     quality_error: str | None = None
+
+     # Wall times (ms)
+     wall_ms_router: float = 0.0
+     wall_ms_extract: float = 0.0
+     wall_ms_quality: float = 0.0
+
+     def to_record(self) -> dict[str, Any]:
+         """Flat JSON-friendly dict for the raw output tab."""
+         return {
+             "backend": self.backend,
+             "ocr_prob": self.ocr_prob,
+             "num_pages": self.num_pages,
+             "is_form": self.is_form,
+             "garbled_text_ratio": self.garbled_text_ratio,
+             "is_encrypted": self.is_encrypted,
+             "needs_password": self.needs_password,
+             "router_error": self.router_error,
+             "sha256": self.sha256,
+             "num_segments": len(self.segments),
+             "markdown_chars": len(self.markdown),
+             "extract_stats": self.extract_stats,
+             "extract_error": self.extract_error,
+             "quality_score": self.quality_score,
+             "quality_num_tokens": self.quality_num_tokens,
+             "quality_model": self.quality_model,
+             "quality_error": self.quality_error,
+             "wall_ms_router": round(self.wall_ms_router, 1),
+             "wall_ms_extract": round(self.wall_ms_extract, 1),
+             "wall_ms_quality": round(self.wall_ms_quality, 1),
+         }
+
+
+ # -------------------------------------------------------------------- helpers
+
+
+ def _segment_to_row(seg: Any) -> dict[str, Any]:
+     """Flatten a :class:`pdfsys_core.Segment` for the UI table."""
+     bbox = seg.bbox
+     bbox_tuple = None if bbox is None else (
+         round(bbox.x0, 4),
+         round(bbox.y0, 4),
+         round(bbox.x1, 4),
+         round(bbox.y1, 4),
+     )
+     return {
+         "index": seg.index,
+         "page": seg.page_index,
+         "type": seg.type.value,
+         "bbox_norm": bbox_tuple,
+         "chars": len(seg.content),
+         "preview": seg.content[:120].replace("\n", " "),
+     }
+
+
+ # ------------------------------------------------------------------ core entry
+
+
+ def run_pipeline(
+     pdf_path: str | Path,
+     *,
+     run_quality: bool = False,
+     ocr_threshold: float = 0.5,
+ ) -> PipelineResult:
+     """Route the PDF, extract if text-ok, optionally score quality.
+
+     Never raises on malformed input — all failure modes surface via the
+     ``*_error`` fields so the UI can present them uniformly.
+     """
+     pdf_path = Path(pdf_path)
+     if not pdf_path.is_file():
+         raise FileNotFoundError(f"PDF not found: {pdf_path}")
+
+     # -- Stage-A router -------------------------------------------------------
+     router = get_router(ocr_threshold=ocr_threshold)
+     t0 = time.perf_counter()
+     decision = router.classify(pdf_path)
+     t1 = time.perf_counter()
+
+     result = PipelineResult(
+         backend=decision.backend.value,
+         ocr_prob=float(decision.ocr_prob) if decision.ocr_prob == decision.ocr_prob else float("nan"),
+         num_pages=decision.num_pages,
+         is_form=decision.is_form,
+         garbled_text_ratio=decision.garbled_text_ratio,
+         is_encrypted=decision.is_encrypted,
+         needs_password=decision.needs_password,
+         router_error=decision.error,
+         router_features=dict(decision.features or {}),
+         wall_ms_router=(t1 - t0) * 1000.0,
+     )
+
+     # -- MuPDF extraction (only for text-ok path) -----------------------------
+     from pdfsys_core import Backend
+     from pdfsys_parser_mupdf import extract_doc
+
+     if decision.backend == Backend.MUPDF and decision.error is None:
+         try:
+             t2 = time.perf_counter()
+             extracted = extract_doc(pdf_path)
+             t3 = time.perf_counter()
+             result.sha256 = extracted.sha256
+             result.segments = [_segment_to_row(s) for s in extracted.segments]
+             result.markdown = extracted.markdown
+             result.extract_stats = dict(extracted.stats)
+             result.wall_ms_extract = (t3 - t2) * 1000.0
+         except Exception as e:  # noqa: BLE001 — surface to UI
+             result.extract_error = f"{type(e).__name__}: {e}"
+
+     # -- Quality scoring (optional, heavy) ------------------------------------
+     if run_quality and result.markdown:
+         try:
+             scorer = get_scorer()
+             t4 = time.perf_counter()
+             q = scorer.score(result.markdown)
+             t5 = time.perf_counter()
+             result.quality_score = q.score
+             result.quality_num_tokens = q.num_tokens
+             result.quality_model = q.model
+             result.wall_ms_quality = (t5 - t4) * 1000.0
+         except Exception as e:  # noqa: BLE001
+             result.quality_error = f"{type(e).__name__}: {e}"
+
+     return result
+
+
+ # ----------------------------------------------------------------- rendering
+
+
+ _BACKEND_COLOR = {
+     "mupdf": (39, 174, 96),       # green — text-ok fast path
+     "pipeline": (243, 156, 18),   # orange — OCR pipeline (stub)
+     "vlm": (155, 89, 182),        # purple — VLM (stub)
+     "deferred": (127, 140, 141),  # gray — held back
+ }
+
+
+ def render_first_page_with_bboxes(
+     pdf_path: str | Path,
+     result: PipelineResult,
+     page_index: int = 0,
+ target_max_side: int = 1100,
234
+ ) -> Image.Image | None:
235
+ """Render ``page_index`` of the PDF and overlay MuPDF segment bboxes.
236
+
237
+ Falls back to ``None`` on any failure (corrupted / encrypted / etc.).
238
+ """
239
+ pdf_path = Path(pdf_path)
240
+ try:
241
+ doc = pymupdf.open(str(pdf_path))
242
+ except Exception:
243
+ return None
244
+
245
+ try:
246
+ if len(doc) == 0 or page_index >= len(doc):
247
+ return None
248
+ page = doc[page_index]
249
+ rect = page.rect
250
+ # Scale so the longest side ~= target_max_side (for UI readability).
251
+ zoom = max(1.0, target_max_side / max(rect.width, rect.height))
252
+ pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom), alpha=False)
253
+ img = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
254
+ except Exception:
255
+ return None
256
+ finally:
257
+ doc.close()
258
+
259
+ # Overlay segment bboxes for the selected page only.
260
+ color = _BACKEND_COLOR.get(result.backend, (52, 152, 219))
261
+ draw = ImageDraw.Draw(img, "RGBA")
262
+ w, h = img.size
263
+
264
+ drawn = 0
265
+ for seg in result.segments:
266
+ if seg["page"] != page_index or seg["bbox_norm"] is None:
267
+ continue
268
+ x0, y0, x1, y1 = seg["bbox_norm"]
269
+ box = (int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h))
270
+ # Semi-transparent fill + solid outline.
271
+ draw.rectangle(box, fill=(*color, 28), outline=(*color, 220), width=2)
272
+ # Small index badge.
273
+ label = str(seg["index"])
274
+ tx, ty = box[0] + 2, box[1] + 2
275
+ draw.rectangle((tx, ty, tx + 6 + 7 * len(label), ty + 16), fill=(*color, 220))
276
+ draw.text((tx + 3, ty + 1), label, fill=(255, 255, 255))
277
+ drawn += 1
278
+
279
+ return img
280
+
281
+
282
+ def pick_curated_features(features: dict[str, Any]) -> list[list[Any]]:
283
+ """Select a small, meaningful subset of the 124-feature vector for display.
284
+
285
+ The full vector goes into the raw JSON tab; this is the "at a glance"
286
+ view. Ordered by importance / interpretability, not by XGBoost column
287
+ order.
288
+ """
289
+ keys_in_order = [
290
+ "num_pages_successfully_sampled",
291
+ "garbled_text_ratio",
292
+ "is_form",
293
+ "creator_or_producer_is_known_scanner",
294
+ "num_unique_image_xrefs",
295
+ "num_junk_image_xrefs",
296
+ "page_level_char_counts_page1",
297
+ "page_level_unique_font_counts_page1",
298
+ "page_level_text_area_ratios_page1",
299
+ "page_level_image_counts_page1",
300
+ "page_level_bitmap_proportions_page1",
301
+ "page_level_vector_graphics_obj_count_page1",
302
+ "page_level_hidden_char_counts_page1",
303
+ ]
304
+ rows: list[list[Any]] = []
305
+ for k in keys_in_order:
306
+ if k in features:
307
+ v = features[k]
308
+ if isinstance(v, float):
309
+ v = round(v, 4)
310
+ rows.append([k, v])
311
+ return rows
docs/ROADMAP.md ADDED
@@ -0,0 +1,807 @@
# pdfsys-mnbvc · Roadmap

> Optimisation plan and implementation schedule · v0.1 · 2026-04-17
>
> This document turns the goals described in [`PRD.md`](./PRD.md) into an actionable task pool **with priorities, effort estimates, and acceptance criteria**. The PRD answers "what are we building"; the ROADMAP answers "in what order, how, and how do we verify it when done".

---

## 0 · Summary

**In one sentence**: the design docs and architectural skeleton are first-rate; the engineering infrastructure is badly lacking; only 1.5 of the 6 stages have landed.

**Sprint plan**: a 2-week "make it collaborative" sprint (P0) is the precondition for all later work, followed by 4 weeks of performance and reliability hardening (P1), then 10–16 weeks to close the 6-stage loop (P2). P3 is PB-scale operation and ecosystem work, a long-running background track.

---

## 1 · Status scorecard

| Dimension | Status | Score |
|---|---|---|
| Design doc (PRD) | 441 lines, trade-offs clearly argued | 9/10 |
| Package architecture | 7 workspace packages, sensible boundaries | 8/10 |
| Core contracts (`pdfsys-core`) | frozen dataclasses + zero deps + atomic writes | 9/10 |
| MVP closed loop (Router→MuPDF→Scorer) | runs on OmniDocBench-100 | 7/10 |
| **Tests** | **zero test files, zero CI** | **0/10** |
| **Dependency management** | no lock file, no upper bounds | 2/10 |
| **Observability** | no logging, no metrics | 2/10 |
| Implementation completeness | 2,180 lines; 4 of 7 packages are stubs | 3/10 |
| Demo & contributor experience | polished Gradio demo + Cursor rules | 8/10 |

**Key risk**: in the current state a single person can hack forward, but **any collaboration beyond 3 people will immediately spin out of control**: there are no tests protecting parity, no CI, and no lock file, so the first dependency upgrade will poison the router.

---

## 2 · Optimisation at a glance

```
┌──────────────────────────────────────────────────────────────────┐
│ P0  Engineering foundations (2 weeks, blocks everything else)    │
│   ├─ 1.1 Test framework       pytest + key unit tests            │
│   ├─ 1.2 Code quality         ruff + mypy + pre-commit           │
│   ├─ 1.3 GitHub Actions CI                                       │
│   ├─ 1.4 uv.lock in the repo + dependency upper bounds           │
│   └─ 1.5 Parity harness (router regression gate)                 │
├──────────────────────────────────────────────────────────────────┤
│ P1  Performance & reliability (4 weeks)                          │
│   ├─ 2.1 Router hot-path optimisation (49 ms → 10 ms)            │
│   ├─ 2.2 Quality scorer batched inference                        │
│   ├─ 2.3 structlog logging                                       │
│   ├─ 2.4 Prometheus metrics export                               │
│   └─ 2.5 Error taxonomy + quarantine bucket                      │
├──────────────────────────────────────────────────────────────────┤
│ P2  Feature completion (8-12 weeks, per the PRD roadmap)         │
│   ├─ 3.1 Layout analyser (PP-DocLayoutV3 ONNX INT8)              │
│   ├─ 3.2 Pipeline parser (RapidOCR, simple layouts)              │
│   ├─ 3.3 Stage-B router (layout-cache driven)                    │
│   ├─ 3.4 VLM parser (MinerU 2.5 + LMDeploy)                      │
│   ├─ 3.5 Stage-3 post-processing                                 │
│   ├─ 3.6 Stage-4 quality / PII / MinHash dedup                   │
│   └─ 3.7 Stage-5 Parquet packaging                               │
├──────────────────────────────────────────────────────────────────┤
│ P3  Scale & ecosystem (3-6 months)                               │
│   ├─ 4.1 datatrove orchestration integration                     │
│   ├─ 4.2 Slurm / K8s runner                                      │
│   ├─ 4.3 Object-storage backends (S3 / OSS / MinIO)              │
│   ├─ 4.4 Chinese EduScore training                               │
│   └─ 4.5 Vertical-text classics LoRA                             │
└──────────────────────────────────────────────────────────────────┘
```

---

## 3 · P0 engineering foundations (Weeks 1-2)

### 3.1 Test framework · pytest

**Goal**: within 2 weeks, coverage of ≥ 90% for `pdfsys-core` and ≥ 60% each for `pdfsys-router` and `pdfsys-parser-mupdf`.

**Why first**: all 7 invariants in `.cursor/rules/01-architecture-invariants.mdc` (BBox normalisation, frozen dataclasses, atomic writes, schema isomorphism, etc.) **can be verified by unit tests**. Without tests, "don't violate the invariants" is an empty slogan.

**Deliverable layout**:

```
tests/
├── conftest.py                         # shared fixtures
├── fixtures/pdfs/                      # 5-10 PDFs across types (< 100 KB/file, committed)
├── unit/
│   ├── core/
│   │   ├── test_bbox.py                # BBox bounds, conversions, invalid values
│   │   ├── test_serde.py               # to_dict/from_dict roundtrip
│   │   ├── test_cache.py               # LayoutCache atomic write + crash recovery
│   │   └── test_types.py               # Backend / RegionType enum stability
│   ├── router/
│   │   ├── test_classifier_smoke.py    # classify() never raises on malformed input
│   │   ├── test_feature_shape.py       # output must have 124 columns, names locked
│   │   └── test_error_taxonomy.py      # encrypted/corrupt/empty error classes
│   ├── parser_mupdf/
│   │   ├── test_extract_basic.py       # paragraph extraction from a normal PDF
│   │   ├── test_bbox_normalized.py     # all bboxes ∈ [0, 1]
│   │   └── test_corrupted_pdf.py       # broken PDFs don't crash
│   └── bench/
│       └── test_loop_never_raises.py   # bad PDF in, JSONL row out
├── contract/
│   ├── test_extracted_doc_schema.py    # all parser outputs are isomorphic
│   └── test_cursor_rules_valid.py      # .mdc frontmatter is valid
└── integration/
    └── test_bench_smoke.py             # python -m pdfsys_bench --limit 3
```

**Key examples**:

```python
# tests/unit/core/test_bbox.py
import pytest
from pdfsys_core import BBox

class TestBBoxInvariants:
    @pytest.mark.parametrize("x0,y0,x1,y1", [
        (-0.1, 0, 0.5, 0.5),  # negative coordinate
        (0, 0, 1.1, 0.5),     # exceeds 1
        (0.5, 0, 0.3, 0.5),   # x1 < x0
        (0, 0, 0, 0),         # zero area
    ])
    def test_rejects_invalid(self, x0, y0, x1, y1):
        with pytest.raises(ValueError):
            BBox(x0=x0, y0=y0, x1=x1, y1=y1)

    def test_to_pixels_roundtrip(self):
        box = BBox(0.1, 0.2, 0.9, 0.8)
        assert box.to_pixels(1000, 500) == (100, 100, 900, 400)
```

```python
# tests/unit/router/test_feature_shape.py
EXPECTED_COLUMNS = 124

def test_feature_vector_has_124_columns(sample_pdf):
    router = Router()
    decision = router.classify(sample_pdf)
    assert not decision.error
    assert len(decision.features) == EXPECTED_COLUMNS, (
        f"Feature vector drifted from 124 to {len(decision.features)}. "
        "If intentional, retrain XGBoost weights."
    )
```

**Steps**:

1. `uv add --group dev pytest pytest-cov pytest-xdist hypothesis`
2. Add `[tool.pytest.ini_options]` and `[tool.coverage.run]` to the root `pyproject.toml`
3. Provide `sample_pdf` / `encrypted_pdf` / `corrupted_pdf` fixtures in `conftest.py`
4. Write the tests in the order of the tree above (one sub-directory per day)
5. Add a `Makefile` or `scripts/test.sh`: `uv run pytest -n auto tests/`

**Acceptance**: CI runs the full suite in < 2 minutes; all three packages hit their coverage targets.

**Effort**: 1 person · 10 days

---

### 3.2 Code quality · ruff + mypy + pre-commit

**Goal**: zero ruff errors, zero mypy errors in `pdfsys-core`, and automatic blocking before commit.

**Additions to the root `pyproject.toml`**:

```toml
[tool.ruff]
target-version = "py311"
line-length = 100
src = ["packages/pdfsys-core/src", "packages/pdfsys-router/src",
       "packages/pdfsys-parser-mupdf/src", "packages/pdfsys-bench/src",
       "demo"]

[tool.ruff.lint]
select = ["E", "F", "W", "I", "B", "UP", "SIM", "PLC0415", "BLE001", "RET", "ARG"]
ignore = ["E501"]
per-file-ignores = { "packages/pdfsys-bench/**" = ["BLE001"] }

[tool.mypy]
python_version = "3.11"
strict = true
exclude = ["^packages/pdfsys-parser-(pipeline|vlm)/", "^packages/pdfsys-layout-analyser/"]

[[tool.mypy.overrides]]
module = ["pymupdf.*", "xgboost.*", "gradio.*"]
ignore_missing_imports = true
```

**`.pre-commit-config.yaml`**:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
        files: ^packages/pdfsys-core/
  - repo: local
    hooks:
      - id: no-committed-weights
        name: Reject committed model weights
        entry: bash -c '! git diff --cached --name-only | grep -E "\.(ubj|safetensors|pt|bin)$"'
        language: system
        pass_filenames: false
      - id: validate-cursor-rules
        name: Validate .cursor/rules YAML frontmatter
        entry: python scripts/validate_rules.py
        language: system
        files: ^\.cursor/rules/.*\.mdc$
```

**Steps**:

1. `uv add --group dev ruff mypy pre-commit`
2. Write the two configs above
3. Fix existing issues with `uv run ruff check --fix .` + `uv run ruff format .`
4. Run `uv run mypy packages/pdfsys-core` until it reports zero errors
5. `pre-commit install`, appended to `scripts/setup_cursor.sh`
6. Implement the `scripts/validate_rules.py` mentioned in `03-doc-sync.mdc`

**Acceptance**: `pre-commit run --all-files` is fully green.

**Effort**: 1 person · 3 days

---

### 3.3 GitHub Actions CI

**`.github/workflows/ci.yml`**:

```yaml
name: CI
on:
  pull_request:
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
        with: { version: "0.4.x", enable-cache: true }
      - run: uv sync --frozen
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      - run: uv run mypy packages/pdfsys-core

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
        with: { python-version: "${{ matrix.python }}" }
      - run: uv sync --frozen
      - run: uv run python -m pdfsys_router.download_weights
      - run: uv run pytest -n auto --cov --cov-report=xml tests/
      - uses: codecov/codecov-action@v4
        if: matrix.python == '3.11'

  parity:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      # The github context exposes no changed-files list; gate on a git diff.
      - id: gate
        run: |
          echo "changed=$(git diff --name-only "origin/${{ github.base_ref }}"...HEAD \
            | grep -c feature_extractor.py || true)" >> "$GITHUB_OUTPUT"
      - uses: astral-sh/setup-uv@v3
        if: steps.gate.outputs.changed != '0'
      - if: steps.gate.outputs.changed != '0'
        run: |
          uv sync --frozen
          uv run python -m pdfsys_router.download_weights
          bash scripts/check_parity.sh origin/main HEAD
```

**Steps**:

1. Write the workflow above
2. Optional: a `.github/workflows/preview-hf-space.yml` that deploys a preview Space per PR
3. GitHub Settings → Branches: mark `main` protected and require the CI checks

**Acceptance**: three green ✅ within 3 minutes of opening a PR.

**Effort**: 1 person · 1 day

---

### 3.4 uv.lock in the repo + dependency upper bounds

**Current pain points**:
- `.gitignore:14` excludes `uv.lock` (an anti-pattern: lock files must be committed)
- every dependency has only a lower bound: `pymupdf>=1.24` would silently pull in a 2.0 release the day it ships

**Fix**:

1. Remove `uv.lock` from `.gitignore`
2. Add upper bounds to every dependency (conservative policy: major+1):

```toml
# packages/pdfsys-router/pyproject.toml
dependencies = [
    "pdfsys-core",
    "pymupdf>=1.24,<2.0",
    "xgboost>=2.0,<3.0",
    "scikit-learn>=1.3,<2.0",
    "pandas>=2.0,<3.0",
    "numpy>=1.26,<3.0",
]
```

3. `uv lock && git add uv.lock`
4. CI uses `uv sync --frozen` (see §3.3)

**Effort**: 0.5 day

---

### 3.5 Parity Harness

**Background**: `.cursor/rules/21-router-parity.mdc` already describes the parity verification flow, but **there is no executable script**.

**`scripts/check_parity.sh`**:

```bash
#!/usr/bin/env bash
# Verify router ocr_prob drift between two refs.
# Usage: bash scripts/check_parity.sh <baseline_ref> <candidate_ref>
set -euo pipefail

BASELINE="${1:-origin/main}"
CANDIDATE="${2:-HEAD}"
SAMPLE_DIR="${PARITY_SAMPLE_DIR:-tests/fixtures/pdfs}"
EPSILON="${PARITY_EPSILON:-1e-6}"
WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT

run_bench() {
  local ref="$1" out="$2"
  git worktree add "$WORK_DIR/$ref" "$ref"
  (cd "$WORK_DIR/$ref" && uv sync --frozen --quiet \
    && uv run python -m pdfsys_router.download_weights >/dev/null \
    && uv run python -m pdfsys_bench --pdf-dir "$SAMPLE_DIR" --out "$out" --no-quality)
  git worktree remove --force "$WORK_DIR/$ref"
}

run_bench "$BASELINE" "$WORK_DIR/baseline.jsonl"
run_bench "$CANDIDATE" "$WORK_DIR/candidate.jsonl"

uv run python scripts/parity_diff.py \
  "$WORK_DIR/baseline.jsonl" "$WORK_DIR/candidate.jsonl" \
  --epsilon "$EPSILON"
```

**`scripts/parity_diff.py`**: takes the two JSONL files, compares `ocr_prob` PDF by PDF, and exits non-zero when the drift exceeds the threshold.

**Effort**: 1 day

---

## 4 · P1 performance & reliability (Weeks 3-6)

### 4.1 Router hot-path optimisation

**Today**: 49 ms/PDF (PRD target ≤ 10 ms). Over a 1 PB corpus that is 10+ wasted CPU hours.

**Optimisation targets** (profile first, change second; the P0 tests must be in place):

#### (a) Drop the pandas DataFrame construction

```python
# ❌ today (packages/pdfsys-router/src/pdfsys_router/xgb_model.py)
df = pd.DataFrame([features])
names = getattr(self.model, "feature_names_in_", None)
if names is not None:
    df = df.reindex(columns=list(names), fill_value=0)
probs = self.model.predict_proba(df)

# ✅ optimised: cache the column order + feed a numpy array
class XgbRouterModel:
    def __init__(self, path):
        self._feature_order: list[str] | None = None

    def predict_proba(self, features: dict[str, float]) -> float:
        if self._feature_order is None:
            self._feature_order = list(self.model.feature_names_in_)
        arr = np.fromiter(
            (features.get(k, 0.0) for k in self._feature_order),
            dtype=np.float32, count=len(self._feature_order),
        ).reshape(1, -1)
        return float(self.model.predict_proba(arr)[0, 1])
```

Estimate: ~15 ms → ~2 ms.

#### (b) De-duplicate PyMuPDF text reads

`_get_garbled_text_per_page` calls `get_text()` on every page, and `compute_features_per_chunk` then reads the sampled pages a second time, so the same page is read twice.
Fix: cache a `page → text` dict while reading the sampled-page texts, and reuse it. Estimate: ~25 ms → ~12 ms.
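One possible shape for that cache (a sketch; `PageTextCache` is a hypothetical helper, and the real call sites would be `_get_garbled_text_per_page` and the chunk feature pass):

```python
class PageTextCache:
    """Read each sampled page's text at most once per document.

    Works with any doc object whose pages expose get_text(), as
    PyMuPDF pages do.
    """

    def __init__(self, doc):
        self._doc = doc
        self._texts: dict[int, str] = {}

    def text(self, page_index: int) -> str:
        # First access hits the PDF; later accesses hit the dict.
        if page_index not in self._texts:
            self._texts[page_index] = self._doc[page_index].get_text()
        return self._texts[page_index]
```

Both feature passes then take the cache instead of the raw document, which halves the text-extraction cost without changing any feature value.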

#### (c) Early return

Hard errors like `is_encrypted` / `needs_pass` / `len(doc) == 0` should short-circuit before feature extraction.

**Acceptance**: the parity harness confirms `|diff(ocr_prob)| < 1e-6`; p50 ≤ 10 ms on OmniDocBench-100.

**Effort**: 2-3 days

---

### 4.2 Quality scorer batched inference

**Today**: 3.6 s per document; 100k documents ≈ 100 hours.

**Change**: turn `OcrQualityScorer.score_many` from a loop into a real batch:

```python
def score_many(self, texts: list[str], batch_size: int = 8) -> list[QualityScore]:
    self._ensure_loaded()
    torch = self._torch
    results: list[QualityScore] = []
    for i in range(0, len(texts), batch_size):
        batch = [t[:self.max_chars] or " " for t in texts[i:i + batch_size]]
        enc = self._tokenizer(
            batch, return_tensors="pt", truncation=True,
            max_length=self.max_tokens, padding=True,
        ).to(self._device)
        with torch.inference_mode():
            logits = self._model(**enc).logits.squeeze(-1)
        for j, text in enumerate(batch):
            score = max(0.0, min(3.0, float(logits[j].item())))
            results.append(QualityScore(
                score=score,
                num_chars=len(text),
                num_tokens=int(enc["attention_mask"][j].sum()),
                model=self.model_name,
            ))
    return results
```

**Companion change**: rework `pdfsys_bench.loop.run_loop` into "extract everything first → batch-score → fan back out to JSONL", preserving output order.

**Acceptance**: batch=8 achieves ≥ 3× the throughput of batch=1; per-sample score difference < 1e-3.

**Effort**: 3 days

---

### 4.3 structlog logging

**Today**: 12 `print(...)` call sites across the repo; no levels, no structure.

**Approach**: adopt `structlog` in every package except `pdfsys-core` (core stays zero-dependency):

```python
# packages/pdfsys-router/src/pdfsys_router/_log.py
import structlog
log = structlog.get_logger("pdfsys.router")

# usage:
log.info("classified", backend=decision.backend.value,
         ocr_prob=decision.ocr_prob, pdf=str(path),
         num_pages=decision.num_pages)
```

Production uses `JSONRenderer()` (easy for Grafana/ELK to ingest); dev uses `ConsoleRenderer()`.

**Effort**: 2 days

---

### 4.4 Prometheus metrics

**Minimal implementation**:

```python
# packages/pdfsys-bench/src/pdfsys_bench/_metrics.py
from prometheus_client import Counter, Histogram, start_http_server

router_decisions = Counter("pdfsys_router_decisions_total",
                           "Router decisions by backend", ["backend"])
router_latency = Histogram("pdfsys_router_duration_seconds",
                           "Router classification latency",
                           buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
extract_failures = Counter("pdfsys_extract_failures_total",
                           "Extraction failures", ["backend", "error_class"])

def enable_metrics_endpoint(port: int = 9000) -> None:
    start_http_server(port)
```

The `pdfsys-bench` CLI gains a `--metrics-port` flag.

**Effort**: 2 days

---

### 4.5 Error taxonomy + quarantine bucket

**Today**: failures are recorded as free-form strings like `extract_error: "classify_failed: X"`, which cannot be aggregated.

**Approach**: add `errors.py` to `pdfsys-core`:

```python
from enum import Enum

class ErrorClass(str, Enum):
    OPEN_FAILED = "open_failed"
    ENCRYPTED = "encrypted"
    EMPTY = "empty"
    CORRUPTED_STREAM = "corrupted_stream"
    FEATURE_EXTRACTION_FAILED = "feature_extraction_failed"
    MODEL_INFERENCE_FAILED = "model_inference_failed"
    OOM = "oom"
    UNKNOWN = "unknown"
```

`RouterDecision.error_class: ErrorClass` replaces the free-form string; the bench aggregates counts per class.

Quarantine bucket: `out/quarantine/<error_class>/<sha256>.json` keeps a record of each failure (path + error + full feature vector, **never the PDF itself**) for offline analysis.

**Effort**: 3 days

---

## 5 · P2 feature completion (Weeks 7-16)

### Dependency DAG

```
Layout Analyser (3.1) ──┬──► Pipeline Parser (3.2) ──┐
                        │                            │
                        └──► VLM Parser (3.4) ───────┼──► Stage-3 (3.5) ──► Stage-4 (3.6) ──► Stage-5 (3.7)
                                                     │
              ┌──► Stage-B Router (3.3) ─────────────┘
              │
   (reads LayoutCache)
```

### 5.1 Layout Analyser · P2-1

**Choice**: PP-DocLayoutV3 ONNX INT8 (~50 ms/page on CPU); docling-layout-heron can be plugged in later.

**Deliverables**:

```
packages/pdfsys-layout-analyser/src/pdfsys_layout_analyser/
├── __init__.py
├── analyser.py               # LayoutAnalyser main class
├── runners/
│   ├── pp_doclayoutv3.py     # ONNX runtime driver
│   └── heuristic.py          # bbox column-count clustering fallback
├── render.py                 # PDF page → PNG (adjustable DPI)
└── postprocess.py            # reading order + cross-column merging
```

**API**:

```python
class LayoutAnalyser:
    def __init__(self, config: LayoutConfig = LayoutConfig()): ...
    def analyse(self, pdf_path: str | Path) -> LayoutDocument: ...
    def analyse_with_cache(
        self, pdf_path: str | Path, cache: LayoutCache
    ) -> LayoutDocument: ...  # idempotent
```

**Acceptance**:
- mAP ≥ 0.85 on OmniDocBench-100
- CPU INT8 throughput ≥ 20 pages/s/core
- `LayoutDocument` roundtrips cleanly through `LayoutCache.save/load`
- empty / encrypted / corrupted PDFs never crash

**Effort**: 1 person · 10 days

---

### 5.2 Pipeline Parser · P2-2

**Choice**: RapidOCR (PaddleOCR's ONNX forward pass, no Paddle dependency).

**Deliverables**:

```
packages/pdfsys-parser-pipeline/src/pdfsys_parser_pipeline/
├── extract.py            # extract_doc / extract_doc_bytes
├── ocr_engine.py         # RapidOCR wrapper (lazy load)
├── region_processor.py   # dispatch by RegionType
├── image_cropper.py      # bbox → image crop
└── markdown_emitter.py   # region + OCR → Segment
```

**Core logic**:

```python
def extract_doc(pdf_path, *, layout_cache: LayoutCache) -> ExtractedDoc:
    layout = layout_cache.load_or_compute(pdf_path, analyser)
    segments = []
    for page in layout.pages:
        for region in page.regions:
            img = crop_region_from_pdf(pdf_path, page.index, region.bbox)
            text = ocr_engine.recognise(img, region.type)
            segments.append(Segment(
                index=len(segments),
                backend=Backend.PIPELINE,
                page_index=page.index,
                type=region.type,
                content=text,
                bbox=region.bbox,
                source_region_id=region.region_id,
            ))
    return ExtractedDoc(
        sha256=sha256_of_file(pdf_path),
        backend=Backend.PIPELINE,
        segments=tuple(segments),
        markdown=merge_segments_to_markdown(tuple(segments)),
        stats={"page_count": len(layout.pages)},
    )
```

**Acceptance**:
- Chinese character F1 ≥ 0.90 on the OmniDocBench scanned subset
- output schema isomorphic to `parser-mupdf` (guarded by `tests/contract/test_extracted_doc_schema.py`)
- CPU throughput ≥ 5 pages/s/core

**Effort**: 1 person · 12 days

---

### 5.3 Stage-B Router · P2-3

Make the current 4-line `decider.py` stub real:

```python
def decide_complex_vs_simple(
    layout: LayoutDocument, config: RouterConfig
) -> Backend:
    if not config.vlm_enabled:
        return Backend.PIPELINE
    if layout.has_complex_content:
        return Backend.VLM
    return Backend.PIPELINE
```

`Router._route()`: when `ocr_prob ≥ threshold`, first check the `LayoutCache`; on a hit, call `decide_complex_vs_simple`; on a miss, return `DEFERRED`.

**Effort**: 2 days

---

### 5.4 VLM Parser · P2-4

**Choice** (PRD §4.4): in production, MinerU 2.5-Pro 1.2B served via LMDeploy.

**Deliverables**:

```
packages/pdfsys-parser-vlm/src/pdfsys_parser_vlm/
├── extract.py
├── engines/
│   ├── mineru.py          # LMDeploy wrapper
│   └── paddleocr_vl.py    # alternative engine
├── batching.py            # dynamic batching
├── rendering.py           # high-DPI page rendering
└── fallback.py            # retry with smaller batch on OOM
```

**Key constraints**:
- model resident in the worker (lazy-loaded singleton)
- `max_batch_size=16, max_seq=8192` (PRD §4.4)
- over-long pages: a single page > 8192 tokens is split into two chunks by bbox clustering
- per-page OOM retries with a smaller batch ≤ 2 times, then goes to quarantine (see §4.5)

**Effort**: 1 person · 15 days (including getting LMDeploy working)

---

### 5.5 Stage-3 post-processing

Split out as a new package `packages/pdfsys-postproc/`:

```
├── reading_order.py    # cross-page merging, re-attach footnotes, fix interleaved double columns
├── paragraph_merge.py  # un-wrap folded lines + Chinese sentence segmentation
├── formula_norm.py     # KaTeX syntax check; on failure convert to image placeholder
├── table_norm.py       # HTML↔Markdown dual format, row/column validation
└── unicode_norm.py     # NFC + full/half-width unification + zero-width char cleanup
```
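A minimal sketch of what `unicode_norm.py` could do (the exact character sets below are assumptions, not the final spec):

```python
import unicodedata

# Zero-width characters commonly left behind by PDF extraction.
_ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalise(text: str) -> str:
    """NFC + full→half-width ASCII + zero-width character cleanup."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in _ZERO_WIDTH:
            continue
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:   # full-width ASCII block
            ch = chr(code - 0xFEE0)
        elif code == 0x3000:           # ideographic space
            ch = " "
        out.append(ch)
    return "".join(out)
```

Width folding is done by hand rather than via NFKC so that CJK compatibility characters outside the ASCII block are left untouched.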

**Effort**: 1 person · 10 days

---

### 5.6 Stage-4 quality / PII / MinHash dedup

Split out as `packages/pdfsys-quality/`, reusing datatrove's MinHash block (PRD §4.6.5):

```
├── lang_id.py       # GlotLID paragraph-level language ID
├── heuristic.py     # repeated n-grams, non-CJK ratio, line-length variance
├── edu_score.py     # Chinese EduScore (fastText → DeBERTa-v3-tiny)
├── pii.py           # regex + NER backstop
└── dedup/
    ├── exact.py     # md5 exact content dedup
    └── minhash.py   # datatrove MinHash LSH wrapper
```
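The `exact.py` half is trivial; a sketch (the `"text"` field name is an assumption):

```python
import hashlib


def content_key(text: str) -> str:
    """Stable md5 hex digest over the normalised document text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct text payload."""
    seen: set[str] = set()
    kept: list[dict] = []
    for doc in docs:
        key = content_key(doc["text"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

The MinHash half cannot be sketched this compactly: near-duplicate clustering across shards needs the global shuffle noted in the effort estimate, which is exactly why the datatrove wrapper is reused.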

**Effort**: 2 people · 3 weeks (cross-shard MinHash needs a global shuffle; this is the hardest part)

---

### 5.7 Stage-5 Parquet packaging

Split out as `packages/pdfsys-output/`:
- Parquet shards of ~1 GB each, zstd-compressed
- bucketed paths: `v1/lang=zh/source=arxiv/qb=high/shard-NNNNN.parquet`
- JSONL mirror + Markdown sample archive (0.1% per shard)
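The bucketed path convention fits in one helper (a sketch; `NNNNN` is read here as a 5-digit zero-padded shard index, and the actual write would go through pyarrow with zstd compression):

```python
def shard_path(
    lang: str,
    source: str,
    quality_bucket: str,
    shard_index: int,
    version: str = "v1",
) -> str:
    """Bucketed output path, e.g. v1/lang=zh/source=arxiv/qb=high/shard-00042.parquet."""
    return (
        f"{version}/lang={lang}/source={source}/"
        f"qb={quality_bucket}/shard-{shard_index:05d}.parquet"
    )
```

Using `key=value` path segments keeps the layout Hive-partition compatible, so downstream readers can filter by `lang` / `source` / `qb` without opening shards.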

**Effort**: 1 person · 5 days

---

## 6 · P3 scale & ecosystem (3-6 months)

| Item | Description | Effort |
|---|---|---|
| **datatrove integration** | wrap the existing stages as `datatrove.Block`s; native Slurm backend | 2-3 weeks |
| **Slurm / K8s runner** | new `pdfsys-runner` package with shard checkpoints + backpressure | 3-4 weeks |
| **Object-storage backends** | abstract an `FSBackend` protocol in `pdfsys-core`; support `file://` / `s3://` / `oss://` / `minio://` | 1-2 weeks |
| **Chinese EduScore training** | fastText → DeBERTa-v3-tiny classifier + data labelling | 4-6 weeks (incl. labelling) |
| **Vertical-text classics LoRA** | targeted LoRA fine-tune of MinerU 2.5 | 4-6 weeks (GPU-heavy) |

---

## 7 · Milestone timeline

| Milestone | Week | Marker |
|---|---|---|
| **M1 · Collaborative** | 2 | CI green; coverage targets met; lock file committed; parity harness gating |
| **M2 · Production-grade core** | 6 | router p50 ≤ 10 ms; scorer at 3× throughput; unified logs + metrics; aggregatable errors |
| **M3 · 6 stages connected** | 16 | a 10 GB dataset runs end to end; all three backends share an isomorphic schema |
| **M4 · PB-ready** | 24 | datatrove + Slurm runner; object-storage backends; TCO estimate committed |
| **M5 · v0.1 dataset** | 32 | first publishable TB-scale dataset + evaluation report |

---

## 8 · Quick Wins · startable within two weeks

If we can only pick the five highest-ROI items to start immediately:

1. **Write 15 unit tests for core / router / parser-mupdf** · 2 days · turns the invariants into machine-verifiable checks
2. **Configure ruff + pre-commit** · 0.5 day · sets a quality floor for new PRs
3. **Write `.github/workflows/ci.yml`** · 0.5 day · moves feedback from "at review time" to "at push time"
4. **Commit `uv.lock` + add dependency upper bounds** · 0.5 day · dependencies stop changing under us
5. **`scripts/check_parity.sh` + 10 sample PDFs in fixtures** · 2 days · router changes gate themselves automatically

Total: **5-6 working days** for every precondition of real collaboration. Strongly recommended as the first sprint.

---

## 9 · Risks and things we will not do

### Temptations to resist

- ❌ **Do not touch the stub implementations before P0**: without tests and the parity harness, every feature added accrues interest on technical debt
- ❌ **Do not replace PyMuPDF**: its engineering maturity for Chinese documents is first-tier; switching to pdfminer/PyPDF2 is an instant regression
- ❌ **Do not introduce LangChain / LlamaIndex**: this is a data-processing pipeline, not a RAG application
- ❌ **Do not bring pydantic into `pdfsys-core`**: the existing `dataclass(frozen=True, slots=True)` + `serde.py` is enough, and pydantic would break the zero-dependency invariant

### Long-term risks and responses

| Risk | Response |
|---|---|
| MinerU 2.5 licence changes in a new release | keep PaddleOCR-VL as a hot spare; make `pdfsys-parser-vlm` an engine abstraction |
| PyMuPDF AGPL restrictions | evaluate pikepdf / pdfplumber as fallbacks (low priority) |
| Runaway PB-scale object-storage cost | write a `scripts/tco.py` estimator during P0 |
| Insufficient Chinese PII recall | NER model as backstop; keep an audit table for after-the-fact remediation |

---

## 10 · Tracking progress

- **Short term (P0-P1)**: GitHub Projects / Milestones. One issue per sub-item, each with acceptance criteria.
- **Medium term (P2)**: open a tracking issue per stage to aggregate its sub-PRs; update `CHANGELOG.md` per SemVer.
- **Long term (P3)**: revisit the PRD §10 P0/P1/P2/P3 roadmap monthly and iterate this document as v0.N.

Progress status is maintained in the root `README.md` §What's implemented table; per the mapping in `.cursor/rules/03-doc-sync.mdc`, any stage flipping from ❌ to ✅ must update that table.

---

## Appendix · Totals at a glance

| Phase | Duration | Core deliverables | People |
|---|---|---|---|
| **P0 engineering foundations** | 2 weeks | pytest + ruff + CI + lock + parity | 1 |
| **P1 performance/reliability** | 4 weeks | router 5×, scorer 3×, logs/metrics | 1-2 |
| **P2 feature completion** | 10-12 weeks | 6-stage closed loop | 2-3 |
| **P3 scale** | 3-6 months | datatrove + Slurm + PB-scale runs | 3-4 |

From zero to "PB-ready" is roughly 24 weeks and 20-30 person-weeks in total. That matches the PRD §6 resource budget of "100 × A100 + 32 CPU nodes, ~2 months wall clock": **build the toolchain first, then plug in the big compute**.
packages/pdfsys-router/src/pdfsys_router/download_weights.py CHANGED
@@ -12,39 +12,51 @@ Usage::
 
 from __future__ import annotations
 
+import socket
 import sys
+import urllib.error
 import urllib.request
 from pathlib import Path
 
-# media.githubusercontent.com serves the actual LFS payload directly,
-# bypassing the pointer file that raw.githubusercontent.com returns.
-WEIGHTS_URL = (
-    "https://media.githubusercontent.com/media/huggingface/finepdfs/main/"
-    "blocks/predictor/xgb.ubj"
-)
+# GitHub raw download URLs for the XGBoost router weights, tried in order.
+WEIGHTS_URLS = [
+    "https://github.com/huggingface/finepdfs/raw/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+    "https://raw.githubusercontent.com/huggingface/finepdfs/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+]
 
 
 def target_path() -> Path:
     return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
 
 
-def download(force: bool = False) -> Path:
+def download(force: bool = False, timeout: int = 30) -> Path:
     dst = target_path()
     if dst.exists() and not force:
         print(f"[download_weights] already present: {dst}")
         return dst
     dst.parent.mkdir(parents=True, exist_ok=True)
-    print(f"[download_weights] fetching {WEIGHTS_URL}")
-    with urllib.request.urlopen(WEIGHTS_URL) as r:  # noqa: S310 — pinned URL
-        data = r.read()
-    if len(data) < 10_000:
-        raise RuntimeError(
-            f"downloaded blob is suspiciously small ({len(data)} bytes) — "
-            "likely an LFS pointer, not the binary"
-        )
-    dst.write_bytes(data)
-    print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
-    return dst
+
+    last_error = None
+    for url in WEIGHTS_URLS:
+        print(f"[download_weights] fetching {url}")
+        try:
+            # Bound each attempt so a hung mirror cannot stall the script.
+            with urllib.request.urlopen(url, timeout=timeout) as r:  # noqa: S310 — pinned URL
+                data = r.read()
+            if len(data) < 10_000:
+                raise RuntimeError(
+                    f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+                    "likely an LFS pointer, not the binary"
+                )
+            dst.write_bytes(data)
+            print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+            return dst
+        except (urllib.error.URLError, socket.timeout) as e:
+            last_error = e
+            print(f"[download_weights] failed for {url}: {e}")
+            continue
+
+    raise RuntimeError(f"Failed to download weights from all URLs: {last_error}")
 
 
 if __name__ == "__main__":
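The mirror-fallback loop added above can be exercised offline by injecting the fetch step as a callable. The sketch below mirrors that control flow; the name `fetch_first` and the fake fetcher are illustrative, not part of the package API.

```python
# Sketch of the try-each-mirror logic from download_weights.py, with the HTTP
# fetch injected so the fallback behavior can be tested without a network.
from typing import Callable, Sequence


def fetch_first(urls: Sequence[str], fetch: Callable[[str], bytes],
                min_size: int = 10_000) -> bytes:
    """Try each URL in order; return the first payload that looks like a real binary."""
    last_error: Exception | None = None
    for url in urls:
        try:
            data = fetch(url)
        except OSError as e:  # urllib.error.URLError subclasses OSError
            last_error = e
            continue  # network failure: fall through to the next mirror
        if len(data) < min_size:
            # A Git LFS pointer file is ~130 bytes; the real model is far larger.
            raise RuntimeError(f"suspiciously small blob from {url}: {len(data)} bytes")
        return data
    raise RuntimeError(f"all URLs failed: {last_error}")
```

Routing network failures to the next mirror while failing fast on a too-small payload keeps a bad LFS pointer from being silently written over the model file.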
requirements.txt ADDED
@@ -0,0 +1,21 @@
+# Hugging Face Spaces installs from this file.
+# Note: Local workspace packages (pdfsys-*) are loaded via sys.path in demo/app.py
+# and do not need editable installation in HF Spaces.
+
+# --- Python 3.13 compatibility (audioop removed) --------------------------
+audioop-lts
+
+# --- CPU-only torch (HF Spaces free tier is CPU) --------------------------
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch>=2.1,<3.0
+
+# --- Third-party runtime deps -------------------------------------------
+gradio==5.12.0
+huggingface-hub>=0.26,<0.29
+pymupdf>=1.24
+xgboost>=2.0
+scikit-learn>=1.3
+pandas>=2.0
+numpy>=1.26
+transformers>=4.44
+pillow>=10.0