jieluo1024 committed on
Commit · 00b2f48
Parent(s): b8ca6f2
feat: update XGBoost weights URL and add Gradio demo
- Fix XGBoost router weights download URL to use GitHub raw links
- Add timeout and fallback URLs for model download
- Add Gradio demo interface (demo/app.py, demo/pipeline.py)
- Add app.py entry point for HuggingFace Spaces
- Add requirements.txt for dependencies
- .gitignore +7 -0
- README.md +75 -1
- app.py +25 -0
- demo/README.md +102 -0
- demo/app.py +377 -0
- demo/pipeline.py +311 -0
- docs/ROADMAP.md +807 -0
- packages/pdfsys-router/src/pdfsys_router/download_weights.py +29 -18
- requirements.txt +21 -0
.gitignore
CHANGED

```diff
@@ -7,6 +7,8 @@ __pycache__/
 .eggs/
 build/
 dist/
+.cursor/
+scripts/
 
 # uv / virtualenv
 .venv/
@@ -38,3 +40,8 @@ models/
 .idea/
 .vscode/
 *.swp
+
+# Gradio / HF Spaces runtime artifacts
+flagged/
+gradio_cached_examples/
+.gradio/
```
README.md
CHANGED

````diff
@@ -1,8 +1,25 @@
+---
+title: PDFSystem MNBVC Demo
+emoji: 📄
+colorFrom: green
+colorTo: purple
+sdk: gradio
+sdk_version: 5.12.0
+app_file: app.py
+pinned: false
+license: apache-2.0
+short_description: FinePDFs-style PDF pipeline demo for MNBVC
+---
+
 # pdfsys-mnbvc
 
 PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
 FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
 
+> **Try it:** `python app.py` locally, or deploy to Hugging Face Spaces with one click
+> — the YAML header above is all the Space config needed. See [`demo/README.md`](demo/README.md)
+> for both paths.
+
 ## Current status: MVP closed loop ✅
 
 The first end-to-end path — **Router → MuPDF parser → OCR quality scorer** — is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).
@@ -221,7 +238,64 @@ A companion `.summary.json` file is also written with aggregate statistics.
 
 ## Docs
 
-- `docs/PRD.md` — full PRD with resource budgets and
+- [`docs/PRD.md`](docs/PRD.md) — full PRD with resource budgets and architectural rationale (the "what & why").
+- [`docs/ROADMAP.md`](docs/ROADMAP.md) — prioritised implementation plan with work-estimates and acceptance criteria (the "how & when").
+- [`CONTRIBUTING.md`](CONTRIBUTING.md) — naming, parity rules, commit scopes.
+- [`demo/README.md`](demo/README.md) — Gradio demo + Hugging Face Spaces deploy guide.
+
+## Collaborating with Cursor
+
+This repo ships a full set of [Cursor project rules](https://docs.cursor.com/context/rules) under `.cursor/rules/`. They give the AI agent the same mental model senior contributors have — including the non-obvious bits (FinePDFs feature parity, the `pdfsys-core` zero-dep rule, Gradio UI/logic separation) that a new collaborator would otherwise step on.
+
+### Quick start
+
+```bash
+# One-shot bootstrap: checks python/uv, syncs workspace, downloads router weights.
+bash scripts/setup_cursor.sh
+```
+
+Then open the repo in Cursor (≥ 0.50, which supports `.cursor/rules/*.mdc`). The always-on rules activate immediately; file-specific rules attach as you open matching files.
+
+### Active rules
+
+| Rule | Scope | What it enforces |
+|------|-------|------------------|
+| `00-project-context.mdc` | always | Project goals, tech stack, must-read docs, explicit non-goals. |
+| `01-architecture-invariants.mdc` | always | 7 load-bearing invariants (zero-dep core, stateless processing, normalized bbox, etc.). |
+| `02-commit-workflow.mdc` | always | Conventional commits with package-scoped names; pre-commit checklist. |
+| `03-doc-sync.mdc` | always | Doc-sync mapping table: which code change forces which doc update. Cursor proactively scans after edits. |
+| `10-python-standards.mdc` | `**/*.py` | Type hints, frozen dataclass, lazy imports for heavy deps. |
+| `20-core-contracts.mdc` | `packages/pdfsys-core/**` | Zero external deps; no I/O; schema change ripple rules. |
+| `21-router-parity.mdc` | `packages/pdfsys-router/**` | FinePDFs 124-feature parity is sacred; how to verify. |
+| `22-parser-backends.mdc` | `packages/pdfsys-parser-*/**` | All three backends must emit identical `ExtractedDoc`. |
+| `23-bench-scorer.mdc` | `packages/pdfsys-bench/**` | torch/transformers lazy load; bf16 default; loop never raises. |
+| `30-gradio-demo.mdc` | `demo/**,app.py` | UI layer has no business logic; callbacks never raise; lazy singletons. |
+
+### Recommended Cursor workflow
+
+1. **Before touching `pdfsys-core`** — read `20-core-contracts.mdc`. The AI will refuse to add third-party deps here and surface schema-ripple questions.
+2. **Before touching `feature_extractor.py`** — `21-router-parity.mdc` kicks in; the AI will suggest running the parity check before you commit.
+3. **When building a new parser backend** — `22-parser-backends.mdc` walks through the 6-step addition procedure and refuses partial impls.
+4. **When writing demo UI** — `30-gradio-demo.mdc` rejects `import pymupdf` in `demo/app.py` (it belongs in `demo/pipeline.py`).
+
+### Authoring new rules
+
+Rules live in `.cursor/rules/*.mdc`. Format:
+
+```yaml
+---
+description: Short description shown in the rule picker
+globs: packages/<pkg>/**/*.py   # omit for always-on rules
+alwaysApply: false              # true = always loaded
+---
+
+# Rule Title
+
+- Bullet rule 1 (with ✅/❌ example)
+- Bullet rule 2
+```
+
+Keep each rule under 100 lines, one concern per file. See existing rules for patterns.
 
 ## License
 
````
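The threshold semantics the demo exposes (`ocr_prob ≥ threshold` routes off the MuPDF fast path) can be sketched as a two-way rule. This helper is hypothetical — the real `Router` chooses among all four lanes (`mupdf / pipeline / vlm / deferred`), not just two:

```python
def route(ocr_prob: float, threshold: float = 0.5) -> str:
    """Hypothetical simplification of the Stage-A routing rule.

    ocr_prob at or above the threshold leaves the MuPDF fast path;
    the real Router also distinguishes the vlm / deferred lanes.
    """
    return "pipeline" if ocr_prob >= threshold else "mupdf"


# A scanned PDF with high OCR probability leaves the fast path:
assert route(0.92) == "pipeline"
# A born-digital PDF stays on the MuPDF lane:
assert route(0.08) == "mupdf"
# Lowering the threshold routes more PDFs toward OCR:
assert route(0.3, threshold=0.25) == "pipeline"
```

Lowering the slider in the demo therefore trades fast-path coverage for OCR-lane recall.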
app.py
ADDED

```python
"""Hugging Face Spaces entry point.

HF Spaces looks for ``app.py`` at the repo root. We just import the
actual app from ``demo/`` so the demo code stays tucked away and the
root stays uncluttered.
"""

from __future__ import annotations

import sys
from pathlib import Path

_DEMO_DIR = Path(__file__).resolve().parent / "demo"
sys.path.insert(0, str(_DEMO_DIR))

from app import demo  # noqa: E402,F401 — re-exported for HF Spaces

if __name__ == "__main__":
    import os

    demo.queue(max_size=8).launch(
        server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
        server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
        show_api=False,
    )
```
demo/README.md
ADDED

````markdown
# pdfsys-mnbvc · Gradio Demo

A small self-contained Gradio app that runs the **actually-implemented** MVP
path of the pdfsys-mnbvc pipeline on a single PDF you upload.

It exercises the same three components the bench harness does:

1. **Stage-A XGBoost router** (`pdfsys_router.Router`) — 124 PyMuPDF features → `ocr_prob` → one of `mupdf / pipeline / vlm / deferred`.
2. **MuPDF fast path** (`pdfsys_parser_mupdf.extract_doc`) — runs only when the router picks `mupdf`. Emits `Segment[]` with normalized bboxes + a merged Markdown blob.
3. **ModernBERT OCR quality scorer** (`pdfsys_bench.quality.OcrQualityScorer`) — optional; heavy; gated behind a checkbox.

PIPELINE / VLM / DEFERRED backends are currently stubs in the repo, so the
demo surfaces the router decision and skips extraction for them.

## UI

```
┌─────────────────┬──────────────────────────────────────────────────┐
│ upload PDF      │ Summary · backend · P(ocr) · pages · timing      │
│ threshold       ├──────────────────────────────────────────────────┤
│ ☐ quality       │ [ Page preview │ Markdown │ Segments │           │
│ [Run Pipeline]  │   Router features │ Raw JSON ]                   │
│                 │                                                  │
│ pipeline        │ Page preview draws extracted bboxes (color =     │
│ diagram         │ chosen backend) directly on the first page.      │
└─────────────────┴──────────────────────────────────────────────────┘
```

## Run locally

```bash
# option A — full workspace install (recommended)
uv sync                                   # installs all packages + deps
python -m pdfsys_router.download_weights  # one-time: XGBoost weights (257 KB)
python app.py                             # http://localhost:7860

# option B — plain pip (matches HF Spaces)
pip install -r requirements.txt
python -m pdfsys_router.download_weights
python app.py
```

First run of the quality scorer pulls `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`
(~800 MB) from the HF Hub. Set `HF_HOME=/path/to/cache` to control where it lands.

## Deploy to Hugging Face Spaces

The root `README.md` already contains the required [Spaces YAML config](https://huggingface.co/docs/hub/spaces-config-reference):

```yaml
---
title: PDFSystem MNBVC Demo
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
license: apache-2.0
---
```

### Option 1 · One-click from GitHub (recommended)

1. Push this repo to GitHub.
2. Go to <https://huggingface.co/new-space>.
3. Pick the **Gradio** SDK; the **CPU basic** hardware tier is enough for the MVP loop.
4. In **Files** → **Create Space from an existing GitHub repo**, paste the repo URL.

HF Spaces will clone the whole repo, read the YAML header in the root
`README.md`, install `requirements.txt`, and launch `app.py`. The router's
XGBoost weights are downloaded automatically on first request (~257 KB, inline
in the Space container).

### Option 2 · Manual push

```bash
git clone https://huggingface.co/spaces/<you>/pdfsys-mnbvc-demo
cd pdfsys-mnbvc-demo
# copy repo contents into this dir (the four workspace packages must come
# along — they are installed editable by requirements.txt)
cp -r /path/to/pdfsystem_mnbvc/{app.py,requirements.txt,README.md,packages,demo} .
git add . && git commit -m "Initial deploy" && git push
```

### Resource notes (HF Spaces free tier: CPU, 16 GB RAM)

- Router: ~50–100 ms per PDF; effectively free.
- MuPDF extraction: ~10 ms per page.
- Quality scorer (ModernBERT-large): ~3–5 s per PDF at bf16; fits in RAM.
  Disabled by default in the UI. **Keep it off** unless you want to wait.
- GPU Spaces aren't required; the MVP path is CPU-only. A GPU Space becomes
  useful once the Pipeline / VLM parsers land.

## Files

| Path | Role |
| ---- | ---- |
| `demo/app.py` | Gradio `Blocks` definition + event handlers. |
| `demo/pipeline.py` | Pure-Python wrapper around `Router` + `extract_doc` + `OcrQualityScorer`. Rendering helpers live here too. |
| `app.py` (repo root) | Thin HF-Spaces entry; imports `demo.app`. |
| `requirements.txt` (repo root) | Pin-friendly deps for `pip install -r`. Installs the four workspace packages in editable mode. |

The demo imports the real pipeline modules — if you change `pdfsys-router`
or `pdfsys-parser-mupdf`, the demo picks it up on the next launch.
````
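The commit message mentions a timeout and fallback URLs for the weights download. A minimal sketch of that mirror-fallback pattern follows; the function and URL handling here are hypothetical — the real logic lives in `packages/pdfsys-router/src/pdfsys_router/download_weights.py` — and the opener is injectable so the fallback order can be exercised without network access:

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_first_available(urls: list[str], timeout: float = 10.0,
                          opener=urlopen) -> bytes:
    """Try each mirror in order; return the first successful payload.

    Hypothetical sketch of a timeout + fallback-URL download. `opener`
    defaults to urllib's urlopen but can be swapped for testing.
    """
    last_err: Exception | None = None
    for url in urls:
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (URLError, OSError) as e:
            last_err = e  # remember why this mirror failed, try the next
    raise RuntimeError(f"all {len(urls)} mirrors failed") from last_err


# Exercise the fallback order with a fake opener (no network needed):
class _Resp:
    def __init__(self, data): self.data = data
    def __enter__(self): return self
    def __exit__(self, *a): return False
    def read(self): return self.data

def fake_opener(url, timeout):
    if "primary" in url:
        raise OSError("primary mirror down")
    return _Resp(b"weights")

assert fetch_first_available(
    ["https://primary/x", "https://mirror/x"], opener=fake_opener
) == b"weights"
```

Raising only after every mirror fails, while chaining the last error, keeps the eventual traceback informative.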
demo/app.py
ADDED

````python
"""Gradio demo for the pdfsys-mnbvc MVP pipeline.

What this demonstrates (matching the code that actually exists in the
repo today, not the aspirational PRD):

* Stage-A XGBoost router — decides text-ok vs needs-ocr from 124
  PyMuPDF-derived features.
* MuPDF fast path — extracts Markdown-ready segments when the router
  picks ``Backend.MUPDF``. Overlaid on the first page as colored bboxes.
* ModernBERT OCR quality scorer — optional, heavy (~800 MB download,
  3–5 s per doc on CPU). Off by default to keep the demo snappy.

PIPELINE / VLM / DEFERRED backends are surfaced through the router
decision but are still stubs in ``packages/pdfsys-parser-*``; the UI
just reports the routing choice in that case and skips extraction.

Runs locally (``python demo/app.py``) and as a Hugging Face Space (see
the repo-root ``README.md`` frontmatter and ``demo/README.md``).
"""

from __future__ import annotations

import json
import os
import sys
import tempfile
import traceback
from pathlib import Path

import gradio as gr

# Allow ``python demo/app.py`` without installing the workspace by falling
# back to the in-tree sources. When running under HF Spaces / uv sync the
# packages are already on sys.path and these inserts become no-ops.
_REPO_ROOT = Path(__file__).resolve().parent.parent
for pkg in ("pdfsys-core", "pdfsys-router", "pdfsys-parser-mupdf", "pdfsys-bench"):
    src = _REPO_ROOT / "packages" / pkg / "src"
    if src.is_dir() and str(src) not in sys.path:
        sys.path.insert(0, str(src))

from pipeline import (  # noqa: E402 — must come after sys.path surgery
    PipelineResult,
    pick_curated_features,
    render_first_page_with_bboxes,
    run_pipeline,
)


# ------------------------------------------------------------------ constants

DESCRIPTION = """\
# PDFSystem-MNBVC · Pipeline Demo

**FinePDFs-inspired PB-scale PDF → pretraining-data pipeline**, adapted
for the Chinese MNBVC corpus. This demo shows the MVP closed loop that
is actually implemented in the repo today:

**Router (XGBoost, 124 features)** → **MuPDF fast path** → **OCR Quality Scorer (ModernBERT)**

The router decides whether a PDF is cheap to parse with PyMuPDF alone,
or whether it needs to go to the (still-stubbed) OCR / VLM backends.
Roughly 90% of a typical PDF corpus takes the green fast-path lane.
"""

PIPELINE_DIAGRAM_MD = """\
### Pipeline

```
            ┌────────────────┐
 PDF ──────►│  Stage-A       │  XGBoost · ~10 ms/PDF
            │  Router        │  124 PyMuPDF features
            └────────┬───────┘
                     │ ocr_prob
       ┌─────────────┼─────────────┐
       ▼             ▼             ▼
    MUPDF        PIPELINE    VLM / DEFERRED
  (text-ok)    (OCR, stub)    (VLM, stub)
       │
       ▼
 PyMuPDF blocks ─► Markdown + Segments (with bboxes)
       │
       ▼
 ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
```

**Backend color legend on page preview**

- 🟢 `mupdf` — text-ok fast path (implemented)
- 🟠 `pipeline` — OCR lane (stub, routing only)
- 🟣 `vlm` — VLM lane (stub, routing only)
- ⚪ `deferred` — held back until VLM workers online
"""


def _safe(val, default=""):
    """Coerce NaN / None for Gradio components that don't like them."""
    if val is None:
        return default
    try:
        import math

        if isinstance(val, float) and math.isnan(val):
            return default
    except Exception:
        pass
    return val


# ------------------------------------------------------------------ handlers


def process_pdf(
    pdf_file: str | None,
    run_quality: bool,
    ocr_threshold: float,
    progress: gr.Progress = gr.Progress(),
):
    """Main Gradio callback. Returns one value per output component."""
    empty_segments = [[0, 0, "-", "-", 0, ""]]
    empty_features = [["(no PDF uploaded)", ""]]
    empty_summary = "Upload a PDF to get started."

    if not pdf_file:
        return (
            empty_summary,
            "", 0.0, 0, "", 0.0,
            None,
            "_No markdown yet._",
            empty_segments,
            empty_features,
            {},
        )

    pdf_path = Path(pdf_file)

    try:
        progress(0.1, desc="Routing (XGBoost)…")
        result: PipelineResult = run_pipeline(
            pdf_path,
            run_quality=run_quality,
            ocr_threshold=ocr_threshold,
        )

        progress(0.7, desc="Rendering first page…")
        preview = render_first_page_with_bboxes(pdf_path, result, page_index=0)

    except Exception as e:  # noqa: BLE001
        tb = traceback.format_exc()
        err_json = {"error": str(e), "traceback": tb.splitlines()[-6:]}
        return (
            f"**Failed:** `{e}`",
            "", 0.0, 0, "", 0.0,
            None,
            f"```\n{tb}\n```",
            empty_segments,
            empty_features,
            err_json,
        )

    # ------------------------------------------------------------- summary
    lines = [
        f"**File:** `{pdf_path.name}` ({pdf_path.stat().st_size / 1024:.1f} KB)",
        f"**Routed to:** `{result.backend}` · "
        f"P(ocr) = **{result.ocr_prob:.3f}** · {result.num_pages} page(s)",
    ]
    flags = []
    if result.is_form:
        flags.append("is_form")
    if result.is_encrypted:
        flags.append("encrypted")
    if result.needs_password:
        flags.append("password-protected")
    if result.garbled_text_ratio > 0.01:
        flags.append(f"garbled_text_ratio={result.garbled_text_ratio:.2%}")
    if flags:
        lines.append("**Flags:** " + ", ".join(f"`{f}`" for f in flags))
    if result.router_error:
        lines.append(f"**Router error:** `{result.router_error}`")
    if result.extract_error:
        lines.append(f"**Extract error:** `{result.extract_error}`")
    if result.quality_error:
        lines.append(f"**Quality error:** `{result.quality_error}`")

    if result.backend == "mupdf" and not result.extract_error:
        stats = result.extract_stats
        lines.append(
            f"**Extracted:** {stats.get('segment_count', 0)} segments, "
            f"{stats.get('char_count', 0):,} chars "
            f"(pages {stats.get('pages_extracted', 0)}/{stats.get('page_count', 0)})"
        )
    else:
        lines.append(
            "_MuPDF extraction skipped — backend is not `mupdf`. "
            "PIPELINE/VLM backends are still stubs in this repo._"
        )

    if result.quality_score is not None:
        lines.append(
            f"**OCR quality:** **{result.quality_score:.2f}** / 3.0 "
            f"({result.quality_num_tokens} tokens, `{result.quality_model}`)"
        )

    lines.append(
        f"**Timing (ms):** router **{result.wall_ms_router:.0f}** · "
        f"extract **{result.wall_ms_extract:.0f}** · "
        f"quality **{result.wall_ms_quality:.0f}**"
    )
    summary_md = "\n\n".join(lines)

    # ------------------------------------------------------------- markdown
    md_text = result.markdown.strip() or "_No markdown — this PDF was not routed to MuPDF._"
    if len(md_text) > 20_000:
        md_text = md_text[:20_000] + "\n\n…\n\n**[truncated for UI — full Markdown in the JSON tab]**"

    # ------------------------------------------------------------- segments
    seg_rows = [
        [s["index"], s["page"], s["type"], str(s["bbox_norm"]), s["chars"], s["preview"]]
        for s in result.segments
    ] or empty_segments

    # ------------------------------------------------------------- features
    feat_rows = pick_curated_features(result.router_features) or empty_features

    # ------------------------------------------------------------- raw JSON
    raw = result.to_record()
    raw["router_features_full"] = result.router_features
    raw["segments_full"] = result.segments

    return (
        summary_md,
        result.backend,
        float(result.ocr_prob) if result.ocr_prob == result.ocr_prob else 0.0,
        int(result.num_pages),
        ("-" if result.quality_score is None else f"{result.quality_score:.2f} / 3.0"),
        float(result.wall_ms_router + result.wall_ms_extract + result.wall_ms_quality),
        preview,
        md_text,
        seg_rows,
        feat_rows,
        raw,
    )


# ---------------------------------------------------------------------- UI

CSS = """
.small-num input { font-weight: 600; font-size: 1.1rem; }
footer { display: none !important; }
"""


def build_demo() -> gr.Blocks:
    # theme / css are gr.Blocks() constructor args, not launch() kwargs.
    with gr.Blocks(
        title="PDFSystem-MNBVC Demo",
        theme=gr.themes.Soft(primary_hue="emerald"),
        css=CSS,
    ) as demo:
        gr.Markdown(DESCRIPTION)

        with gr.Row():
            # -------------------- left column: controls + diagram
            with gr.Column(scale=1, min_width=320):
                pdf_input = gr.File(
                    label="Upload a PDF",
                    file_types=[".pdf"],
                    type="filepath",
                )
                with gr.Accordion("Options", open=True):
                    ocr_threshold = gr.Slider(
                        0.0, 1.0, value=0.5, step=0.05,
                        label="OCR probability threshold",
                        info="ocr_prob ≥ threshold ⇒ route off the MuPDF fast path",
                    )
                    run_quality = gr.Checkbox(
                        label="Run ModernBERT quality scorer",
                        value=False,
                        info="~3–5 s on CPU. First run downloads ~800 MB.",
                    )
                run_btn = gr.Button("Run Pipeline", variant="primary", size="lg")
                gr.Markdown(PIPELINE_DIAGRAM_MD)

            # -------------------- right column: outputs
            with gr.Column(scale=2, min_width=520):
                summary_md = gr.Markdown(
                    "Upload a PDF and click **Run Pipeline**.",
                    label="Summary",
                )

                with gr.Row():
                    backend_out = gr.Textbox(
                        label="Backend", interactive=False, elem_classes=["small-num"]
                    )
                    ocr_prob_out = gr.Number(
                        label="P(OCR)", interactive=False, precision=3,
                        elem_classes=["small-num"],
                    )
                    pages_out = gr.Number(
                        label="Pages", interactive=False,
                        elem_classes=["small-num"],
                    )
                    quality_out = gr.Textbox(
                        label="Quality", interactive=False,
                        elem_classes=["small-num"],
                    )
                    wall_ms_out = gr.Number(
                        label="Total ms", interactive=False, precision=0,
                        elem_classes=["small-num"],
                    )

                with gr.Tabs():
                    with gr.Tab("Page preview"):
                        preview_img = gr.Image(
                            label="First page with extracted bboxes",
                            type="pil",
                            interactive=False,
                            height=720,
                        )
                    with gr.Tab("Markdown"):
                        md_out = gr.Markdown()
                    with gr.Tab("Segments"):
                        seg_df = gr.Dataframe(
                            headers=["idx", "page", "type", "bbox_norm", "chars", "preview"],
                            datatype=["number", "number", "str", "str", "number", "str"],
                            wrap=True,
                            label="Extracted segments (one row per block)",
                        )
                    with gr.Tab("Router features"):
                        feat_df = gr.Dataframe(
                            headers=["feature", "value"],
                            datatype=["str", "str"],
                            label="Curated subset (full 124-dim vector in Raw JSON)",
                        )
                    with gr.Tab("Raw JSON"):
                        raw_json = gr.JSON(label="All pipeline outputs")

        # ----------------------------------------------------------- wiring
        outputs = [
            summary_md,
            backend_out, ocr_prob_out, pages_out, quality_out, wall_ms_out,
            preview_img,
            md_out,
            seg_df,
            feat_df,
            raw_json,
        ]
        run_btn.click(
            process_pdf,
            inputs=[pdf_input, run_quality, ocr_threshold],
            outputs=outputs,
        )
        # Auto-run on file upload (with quality off for snappiness).
        pdf_input.upload(
            lambda f, t: process_pdf(f, False, t),
            inputs=[pdf_input, ocr_threshold],
            outputs=outputs,
        )

        gr.Markdown(
            "---\n"
            "Repo: [pdfsystem_mnbvc](https://github.com/) · "
            "Architecture: [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) · "
            "Router weights: FinePDFs upstream (Apache-2.0) · "
            "Quality model: `HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`"
        )

    return demo


demo = build_demo()


if __name__ == "__main__":
    # Sensible defaults for both local dev and HF Spaces.
    server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
    server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
    demo.queue(max_size=8).launch(
        server_name=server_name,
        server_port=server_port,
    )
````
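The "callbacks never raise" contract from `30-gradio-demo.mdc`, which `process_pdf` implements with an inline try/except, can also be factored into a decorator. This `never_raises` helper is an illustration, not part of the repo:

```python
import functools
import traceback

def never_raises(make_fallback):
    """Wrap a UI callback so exceptions become an error payload.

    `make_fallback` maps the caught exception to the tuple of output
    values the callback would normally return. Hypothetical helper —
    the demo's process_pdf inlines this pattern instead.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as e:  # noqa: BLE001 — UI boundary
                traceback.print_exc()  # keep the details in server logs
                return make_fallback(e)
        return wrapper
    return deco

@never_raises(lambda e: (f"**Failed:** `{e}`", None))
def flaky_callback(x):
    if x < 0:
        raise ValueError("negative input")
    return (f"ok: {x}", x)

assert flaky_callback(3) == ("ok: 3", 3)
assert flaky_callback(-1)[0].startswith("**Failed:**")
```

Keeping the exception boundary at the UI layer means the pipeline code below it can raise freely while the Gradio front end always receives a well-formed tuple.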
demo/pipeline.py
ADDED
|
@@ -0,0 +1,311 @@
"""End-to-end wiring used by the Gradio demo.

Wraps the three production-path components in one callable:

    Router (Stage-A XGBoost)
      └─► Backend.MUPDF → pdfsys_parser_mupdf.extract_doc
      └─► anything else → not extracted (Pipeline/VLM/Deferred are
                          still stubs in this repo; we surface the
                          router decision and stop).

Kept deliberately Gradio-free so the same code is unit-testable and
reusable from notebooks. ``app.py`` only imports :func:`run_pipeline`
and :func:`render_first_page_with_bboxes`.
"""

from __future__ import annotations

import io
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

import pymupdf
from PIL import Image, ImageDraw


# ------------------------------------------------------------------ singletons

_ROUTER: Any = None
_SCORER: Any = None


def _ensure_router_weights() -> None:
    """Make sure the XGBoost weights are on disk. No-op if already present."""
    from pdfsys_router.download_weights import download, target_path

    if not target_path().is_file():
        download()


def get_router(ocr_threshold: float = 0.5):
    """Lazy-load the singleton Router. Weights download on first call."""
    global _ROUTER
    _ensure_router_weights()
    from pdfsys_router import Router

    if _ROUTER is None or abs(_ROUTER.ocr_threshold - ocr_threshold) > 1e-9:
        _ROUTER = Router(ocr_threshold=ocr_threshold)
    return _ROUTER


def get_scorer():
    """Lazy-load the singleton ModernBERT quality scorer (~800 MB download)."""
    global _SCORER
    if _SCORER is None:
        from pdfsys_bench.quality import OcrQualityScorer

        _SCORER = OcrQualityScorer()
    return _SCORER


# ------------------------------------------------------------------ data class


@dataclass(slots=True)
class PipelineResult:
    """Everything the UI needs in one flat object."""

    # Router
    backend: str
    ocr_prob: float
    num_pages: int
    is_form: bool
    garbled_text_ratio: float
    is_encrypted: bool
    needs_password: bool
    router_error: str | None
    router_features: dict[str, Any] = field(default_factory=dict)

    # Extract (only when backend == mupdf)
    sha256: str | None = None
    segments: list[dict[str, Any]] = field(default_factory=list)
    markdown: str = ""
    extract_stats: dict[str, Any] = field(default_factory=dict)
    extract_error: str | None = None

    # Quality
    quality_score: float | None = None
    quality_num_tokens: int | None = None
    quality_model: str | None = None
    quality_error: str | None = None

    # Wall times (ms)
    wall_ms_router: float = 0.0
    wall_ms_extract: float = 0.0
    wall_ms_quality: float = 0.0

    def to_record(self) -> dict[str, Any]:
        """Flat JSON-friendly dict for the raw output tab."""
        return {
            "backend": self.backend,
            "ocr_prob": self.ocr_prob,
            "num_pages": self.num_pages,
            "is_form": self.is_form,
            "garbled_text_ratio": self.garbled_text_ratio,
            "is_encrypted": self.is_encrypted,
            "needs_password": self.needs_password,
            "router_error": self.router_error,
            "sha256": self.sha256,
            "num_segments": len(self.segments),
            "markdown_chars": len(self.markdown),
            "extract_stats": self.extract_stats,
            "extract_error": self.extract_error,
            "quality_score": self.quality_score,
            "quality_num_tokens": self.quality_num_tokens,
            "quality_model": self.quality_model,
            "quality_error": self.quality_error,
            "wall_ms_router": round(self.wall_ms_router, 1),
            "wall_ms_extract": round(self.wall_ms_extract, 1),
            "wall_ms_quality": round(self.wall_ms_quality, 1),
        }


# -------------------------------------------------------------------- helpers


def _segment_to_row(seg: Any) -> dict[str, Any]:
    """Flatten a :class:`pdfsys_core.Segment` for the UI table."""
    bbox = seg.bbox
    bbox_tuple = None if bbox is None else (
        round(bbox.x0, 4),
        round(bbox.y0, 4),
        round(bbox.x1, 4),
        round(bbox.y1, 4),
    )
    return {
        "index": seg.index,
        "page": seg.page_index,
        "type": seg.type.value,
        "bbox_norm": bbox_tuple,
        "chars": len(seg.content),
        "preview": seg.content[:120].replace("\n", " "),
    }


# ------------------------------------------------------------------ core entry


def run_pipeline(
    pdf_path: str | Path,
    *,
    run_quality: bool = False,
    ocr_threshold: float = 0.5,
) -> PipelineResult:
    """Route the PDF, extract if text-ok, optionally score quality.

    Never raises on malformed input — all failure modes surface via the
    ``*_error`` fields so the UI can present them uniformly.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # -- Stage-A router -------------------------------------------------------
    router = get_router(ocr_threshold=ocr_threshold)
    t0 = time.perf_counter()
    decision = router.classify(pdf_path)
    t1 = time.perf_counter()

    result = PipelineResult(
        backend=decision.backend.value,
        # x == x is False only for NaN, so this keeps NaN probabilities NaN.
        ocr_prob=float(decision.ocr_prob) if decision.ocr_prob == decision.ocr_prob else float("nan"),
        num_pages=decision.num_pages,
        is_form=decision.is_form,
        garbled_text_ratio=decision.garbled_text_ratio,
        is_encrypted=decision.is_encrypted,
        needs_password=decision.needs_password,
        router_error=decision.error,
        router_features=dict(decision.features or {}),
        wall_ms_router=(t1 - t0) * 1000.0,
    )

    # -- MuPDF extraction (only for text-ok path) -----------------------------
    from pdfsys_core import Backend
    from pdfsys_parser_mupdf import extract_doc

    if decision.backend == Backend.MUPDF and decision.error is None:
        try:
            t2 = time.perf_counter()
            extracted = extract_doc(pdf_path)
            t3 = time.perf_counter()
            result.sha256 = extracted.sha256
            result.segments = [_segment_to_row(s) for s in extracted.segments]
            result.markdown = extracted.markdown
            result.extract_stats = dict(extracted.stats)
            result.wall_ms_extract = (t3 - t2) * 1000.0
        except Exception as e:  # noqa: BLE001 — surface to UI
            result.extract_error = f"{type(e).__name__}: {e}"

    # -- Quality scoring (optional, heavy) ------------------------------------
    if run_quality and result.markdown:
        try:
            scorer = get_scorer()
            t4 = time.perf_counter()
            q = scorer.score(result.markdown)
            t5 = time.perf_counter()
            result.quality_score = q.score
            result.quality_num_tokens = q.num_tokens
            result.quality_model = q.model
            result.wall_ms_quality = (t5 - t4) * 1000.0
        except Exception as e:  # noqa: BLE001
            result.quality_error = f"{type(e).__name__}: {e}"

    return result


# ----------------------------------------------------------------- rendering


_BACKEND_COLOR = {
    "mupdf": (39, 174, 96),      # green — text-ok fast path
    "pipeline": (243, 156, 18),  # orange — OCR pipeline (stub)
    "vlm": (155, 89, 182),       # purple — VLM (stub)
    "deferred": (127, 140, 141), # gray — held back
}


def render_first_page_with_bboxes(
    pdf_path: str | Path,
    result: PipelineResult,
    page_index: int = 0,
    target_max_side: int = 1100,
) -> Image.Image | None:
    """Render ``page_index`` of the PDF and overlay MuPDF segment bboxes.

    Falls back to ``None`` on any failure (corrupted / encrypted / etc.).
    """
    pdf_path = Path(pdf_path)
    try:
        doc = pymupdf.open(str(pdf_path))
    except Exception:
        return None

    try:
        if len(doc) == 0 or page_index >= len(doc):
            return None
        page = doc[page_index]
        rect = page.rect
        # Scale so the longest side ~= target_max_side (for UI readability).
        zoom = max(1.0, target_max_side / max(rect.width, rect.height))
        pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom), alpha=False)
        img = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
    except Exception:
        return None
    finally:
        doc.close()

    # Overlay segment bboxes for the selected page only.
    color = _BACKEND_COLOR.get(result.backend, (52, 152, 219))
    draw = ImageDraw.Draw(img, "RGBA")
    w, h = img.size

    drawn = 0
    for seg in result.segments:
        if seg["page"] != page_index or seg["bbox_norm"] is None:
            continue
        x0, y0, x1, y1 = seg["bbox_norm"]
        box = (int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h))
        # Semi-transparent fill + solid outline.
        draw.rectangle(box, fill=(*color, 28), outline=(*color, 220), width=2)
        # Small index badge.
        label = str(seg["index"])
        tx, ty = box[0] + 2, box[1] + 2
        draw.rectangle((tx, ty, tx + 6 + 7 * len(label), ty + 16), fill=(*color, 220))
        draw.text((tx + 3, ty + 1), label, fill=(255, 255, 255))
        drawn += 1

    return img


def pick_curated_features(features: dict[str, Any]) -> list[list[Any]]:
    """Select a small, meaningful subset of the 124-feature vector for display.

    The full vector goes into the raw JSON tab; this is the "at a glance"
    view. Ordered by importance / interpretability, not by XGBoost column
    order.
    """
    keys_in_order = [
        "num_pages_successfully_sampled",
        "garbled_text_ratio",
        "is_form",
        "creator_or_producer_is_known_scanner",
        "num_unique_image_xrefs",
        "num_junk_image_xrefs",
        "page_level_char_counts_page1",
        "page_level_unique_font_counts_page1",
        "page_level_text_area_ratios_page1",
        "page_level_image_counts_page1",
        "page_level_bitmap_proportions_page1",
        "page_level_vector_graphics_obj_count_page1",
        "page_level_hidden_char_counts_page1",
    ]
    rows: list[list[Any]] = []
    for k in keys_in_order:
        if k in features:
            v = features[k]
            if isinstance(v, float):
                v = round(v, 4)
            rows.append([k, v])
    return rows
docs/ROADMAP.md
ADDED
@@ -0,0 +1,807 @@
# pdfsys-mnbvc · Roadmap

> Optimization plan and implementation schedule · v0.1 · 2026-04-17
>
> This document turns the goals described in [`PRD.md`](./PRD.md) into an executable task pool with **priorities, effort estimates, and acceptance criteria**. The PRD answers "what are we building"; the ROADMAP answers "in what order, how, and how do we verify it is done".

---

## 0 · Summary

**One sentence**: the design docs and architectural skeleton are first-rate, the engineering infrastructure is badly missing, and only 1.5 of the 6 stages have landed.

**Sprint plan**: a 2-week "make it collaborable" sprint (P0) is the precondition for all further work, followed by 4 weeks of performance and reliability hardening (P1), then 10–16 weeks to complete the 6-stage loop (P2). P3 covers PB-scale rollout and ecosystem work, run as a long-term background track.

---

## 1 · Current-state scorecard

| Dimension | Status | Score |
|---|---|---|
| Design docs (PRD) | 441 lines, trade-offs made explicit | 9/10 |
| Package architecture | 7 workspace packages with sensible boundaries | 8/10 |
| Core contracts (`pdfsys-core`) | frozen dataclasses + zero deps + atomic writes | 9/10 |
| MVP loop (Router→MuPDF→Scorer) | runs end-to-end on OmniDocBench-100 | 7/10 |
| **Tests** | **zero test files, zero CI** | **0/10** |
| **Dependency management** | no lock file, no upper bounds | 2/10 |
| **Observability** | no logging, no metrics | 2/10 |
| Implementation completeness | 2180 lines, 4 of 7 packages are stubs | 3/10 |
| Demo & contributor experience | Gradio demo + Cursor rules are solid | 8/10 |

**Key risk**: in the current state one person can keep hacking forward, but **any collaboration beyond 3 people will immediately spin out of control**: there are no tests protecting parity, no CI, and no lock file, so the first dependency upgrade will poison the router.

---

## 2 · Optimization overview

```
┌──────────────────────────────────────────────────────────────────┐
│ P0  Engineering foundation (2 weeks, blocks everything else)     │
│   ├─ 1.1 Test framework: pytest + key unit tests                 │
│   ├─ 1.2 Code quality: ruff + mypy + pre-commit                  │
│   ├─ 1.3 GitHub Actions CI                                       │
│   ├─ 1.4 Commit uv.lock + dependency upper bounds                │
│   └─ 1.5 Parity harness (router regression gate)                 │
├──────────────────────────────────────────────────────────────────┤
│ P1  Performance & reliability (4 weeks)                          │
│   ├─ 2.1 Router hot-path optimization (49 ms → 10 ms)            │
│   ├─ 2.2 Batched inference for the quality scorer                │
│   ├─ 2.3 structlog logging                                       │
│   ├─ 2.4 Prometheus metrics export                               │
│   └─ 2.5 Error taxonomy + quarantine bucket                      │
├──────────────────────────────────────────────────────────────────┤
│ P2  Feature completion (8–12 weeks, per the PRD roadmap)         │
│   ├─ 3.1 Layout analyser (PP-DocLayoutV3 ONNX INT8)              │
│   ├─ 3.2 Pipeline parser (RapidOCR, simple layouts)              │
│   ├─ 3.3 Stage-B router (layout-cache driven)                    │
│   ├─ 3.4 VLM parser (MinerU 2.5 + LMDeploy)                      │
│   ├─ 3.5 Stage-3 post-processing                                 │
│   ├─ 3.6 Stage-4 quality / PII / MinHash dedup                   │
│   └─ 3.7 Stage-5 Parquet packaging                               │
├──────────────────────────────────────────────────────────────────┤
│ P3  Scale & ecosystem (3–6 months)                               │
│   ├─ 4.1 datatrove orchestration integration                     │
│   ├─ 4.2 Slurm / K8s runner                                      │
│   ├─ 4.3 Object-storage backends (S3 / OSS / MinIO)              │
│   ├─ 4.4 Chinese EduScore training                               │
│   └─ 4.5 Vertical-text classical-Chinese LoRA                    │
└──────────────────────────────────────────────────────────────────┘
```

---

## 3 · P0 Engineering foundation (Week 1–2)

### 3.1 Test framework · pytest

**Goal**: within 2 weeks, coverage of ≥ 90% for `pdfsys-core` and ≥ 60% each for `pdfsys-router` and `pdfsys-parser-mupdf`.

**Why first**: all 7 invariants in `.cursor/rules/01-architecture-invariants.mdc` (BBox normalization, frozen dataclasses, atomic writes, schema isomorphism, etc.) **can be verified with unit tests**. Without tests, "don't violate the invariants" is just an empty phrase.

**Deliverable layout**:

```
tests/
├── conftest.py                         # shared fixtures
├── fixtures/pdfs/                      # 5-10 PDFs across types (< 100 KB/file, committed)
├── unit/
│   ├── core/
│   │   ├── test_bbox.py                # BBox bounds, conversion, invalid values
│   │   ├── test_serde.py               # to_dict/from_dict roundtrip
│   │   ├── test_cache.py               # LayoutCache atomic write + crash recovery
│   │   └── test_types.py              # Backend / RegionType enum stability
│   ├── router/
│   │   ├── test_classifier_smoke.py    # classify() never raises on malformed input
│   │   ├── test_feature_shape.py       # output must have 124 columns, names locked
│   │   └── test_error_taxonomy.py      # encrypted/corrupt/empty error classes
│   ├── parser_mupdf/
│   │   ├── test_extract_basic.py       # paragraph extraction from a normal PDF
│   │   ├── test_bbox_normalized.py     # all bboxes ∈ [0, 1]
│   │   └── test_corrupted_pdf.py       # broken PDF must not crash
│   └── bench/
│       └── test_loop_never_raises.py   # broken PDF in, JSONL row out
├── contract/
│   ├── test_extracted_doc_schema.py    # all parsers emit the same schema
│   └── test_cursor_rules_valid.py      # .mdc frontmatter is valid
└── integration/
    └── test_bench_smoke.py             # python -m pdfsys_bench --limit 3
```

**Key samples**:

```python
# tests/unit/core/test_bbox.py
import pytest
from pdfsys_core import BBox

class TestBBoxInvariants:
    @pytest.mark.parametrize("x0,y0,x1,y1", [
        (-0.1, 0, 0.5, 0.5),  # negative coordinate
        (0, 0, 1.1, 0.5),     # exceeds 1
        (0.5, 0, 0.3, 0.5),   # x1 < x0
        (0, 0, 0, 0),         # zero area
    ])
    def test_rejects_invalid(self, x0, y0, x1, y1):
        with pytest.raises(ValueError):
            BBox(x0=x0, y0=y0, x1=x1, y1=y1)

    def test_to_pixels_roundtrip(self):
        box = BBox(0.1, 0.2, 0.9, 0.8)
        assert box.to_pixels(1000, 500) == (100, 100, 900, 400)
```

```python
# tests/unit/router/test_feature_shape.py
EXPECTED_COLUMNS = 124

def test_feature_vector_has_124_columns(sample_pdf):
    router = Router()
    decision = router.classify(sample_pdf)
    assert not decision.error
    assert len(decision.features) == EXPECTED_COLUMNS, (
        f"Feature vector drifted from 124 to {len(decision.features)}. "
        "If intentional, retrain XGBoost weights."
    )
```

**Implementation steps**:

1. `uv add --group dev pytest pytest-cov pytest-xdist hypothesis`
2. Add `[tool.pytest.ini_options]` and `[tool.coverage.run]` to the root `pyproject.toml`
3. Provide `sample_pdf` / `encrypted_pdf` / `corrupted_pdf` fixtures in `conftest.py`
4. Write the tests in the order of the tree above (one sub-directory per day)
5. Add a `Makefile` or `scripts/test.sh`: `uv run pytest -n auto tests/`
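
Step 2 references pytest and coverage configuration without showing it. A minimal sketch for the root `pyproject.toml`; the source paths and the global coverage floor are assumptions to be adjusted to the workspace:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
# Fail on unknown markers and always show a short failure summary.
addopts = "-ra --strict-markers"

[tool.coverage.run]
branch = true
source = ["pdfsys_core", "pdfsys_router", "pdfsys_parser_mupdf"]

[tool.coverage.report]
# Global floor; the stricter per-package targets (90%/60%/60%) are
# enforced separately in CI.
fail_under = 60
show_missing = true
```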

**Acceptance**: CI runs the full suite in < 2 minutes; the three packages hit their coverage targets.

**Effort**: 1 person · 10 days

---

### 3.2 Code quality · ruff + mypy + pre-commit

**Goal**: zero ruff errors, zero mypy errors in `pdfsys-core`, automatic blocking before commit.

**Additions to the root `pyproject.toml`**:

```toml
[tool.ruff]
target-version = "py311"
line-length = 100
src = ["packages/pdfsys-core/src", "packages/pdfsys-router/src",
       "packages/pdfsys-parser-mupdf/src", "packages/pdfsys-bench/src",
       "demo"]

[tool.ruff.lint]
select = ["E", "F", "W", "I", "B", "UP", "SIM", "PLC0415", "BLE001", "RET", "ARG"]
ignore = ["E501"]
per-file-ignores = { "packages/pdfsys-bench/**" = ["BLE001"] }

[tool.mypy]
python_version = "3.11"
strict = true
exclude = ["^packages/pdfsys-parser-(pipeline|vlm)/", "^packages/pdfsys-layout-analyser/"]

[[tool.mypy.overrides]]
module = ["pymupdf.*", "xgboost.*", "gradio.*"]
ignore_missing_imports = true
```

**`.pre-commit-config.yaml`**:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
        files: ^packages/pdfsys-core/
  - repo: local
    hooks:
      - id: no-committed-weights
        name: Reject committed model weights
        entry: bash -c '! git diff --cached --name-only | grep -E "\.(ubj|safetensors|pt|bin)$"'
        language: system
        pass_filenames: false
      - id: validate-cursor-rules
        name: Validate .cursor/rules YAML frontmatter
        entry: python scripts/validate_rules.py
        language: system
        files: ^\.cursor/rules/.*\.mdc$
```

**Implementation steps**:

1. `uv add --group dev ruff mypy pre-commit`
2. Write the two configs above
3. Fix existing issues with `uv run ruff check --fix .` + `uv run ruff format .`
4. Run `uv run mypy packages/pdfsys-core` until there are zero errors
5. Append `pre-commit install` to `scripts/setup_cursor.sh`
6. Implement the `scripts/validate_rules.py` referenced by `03-doc-sync.mdc`
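
Step 6 and the pre-commit hook above both reference `scripts/validate_rules.py`, which does not exist yet. A minimal, dependency-free sketch; the exact frontmatter keys Cursor requires are an assumption, so tighten the checks as needed:

```python
"""Validate the YAML frontmatter block at the top of .cursor/rules/*.mdc files."""
import re
import sys
from pathlib import Path

# A .mdc file must open with a `---` ... `---` frontmatter block.
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def validate_rule_text(text: str) -> list[str]:
    """Return a list of problems; an empty list means the text is valid."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return ["missing '---' YAML frontmatter block"]
    errors: list[str] = []
    # Frontmatter body starts on file line 2, hence start=2.
    for lineno, line in enumerate(match.group(1).splitlines(), start=2):
        # Top-level frontmatter lines must look like `key: value`.
        if line and not line[0].isspace() and not line.startswith("#") and ":" not in line:
            errors.append(f"line {lineno}: not a 'key: value' pair: {line!r}")
    return errors


def main(paths: list[str]) -> int:
    failed = False
    for p in paths:
        for problem in validate_rule_text(Path(p).read_text(encoding="utf-8")):
            print(f"{p}: {problem}", file=sys.stderr)
            failed = True
    return 1 if failed else 0


# pre-commit passes the matched filenames as arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```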
|
| 228 |
+
**验收**:`pre-commit run --all-files` 全绿。
|
| 229 |
+
|
| 230 |
+
**工作量**:1 人 · 3 天
|
| 231 |
+
|
| 232 |
+
---
|
| 233 |
+
|
| 234 |
+
### 3.3 GitHub Actions CI
|
| 235 |
+
|
| 236 |
+
**`.github/workflows/ci.yml`**:
|
| 237 |
+
|
| 238 |
+
```yaml
|
| 239 |
+
name: CI
|
| 240 |
+
on:
|
| 241 |
+
pull_request:
|
| 242 |
+
push:
|
| 243 |
+
branches: [main]
|
| 244 |
+
|
| 245 |
+
jobs:
|
| 246 |
+
lint:
|
| 247 |
+
runs-on: ubuntu-latest
|
| 248 |
+
steps:
|
| 249 |
+
- uses: actions/checkout@v4
|
| 250 |
+
- uses: astral-sh/setup-uv@v3
|
| 251 |
+
with: { version: "0.4.x", enable-cache: true }
|
| 252 |
+
- run: uv sync --frozen
|
| 253 |
+
- run: uv run ruff check .
|
| 254 |
+
- run: uv run ruff format --check .
|
| 255 |
+
- run: uv run mypy packages/pdfsys-core
|
| 256 |
+
|
| 257 |
+
test:
|
| 258 |
+
runs-on: ubuntu-latest
|
| 259 |
+
strategy:
|
| 260 |
+
matrix:
|
| 261 |
+
python: ["3.11", "3.12"]
|
| 262 |
+
steps:
|
| 263 |
+
- uses: actions/checkout@v4
|
| 264 |
+
- uses: astral-sh/setup-uv@v3
|
| 265 |
+
with: { python-version: "${{ matrix.python }}" }
|
| 266 |
+
- run: uv sync --frozen
|
| 267 |
+
- run: uv run python -m pdfsys_router.download_weights
|
| 268 |
+
- run: uv run pytest -n auto --cov --cov-report=xml tests/
|
| 269 |
+
- uses: codecov/codecov-action@v4
|
| 270 |
+
if: matrix.python == '3.11'
|
| 271 |
+
|
| 272 |
+
parity:
|
| 273 |
+
runs-on: ubuntu-latest
|
| 274 |
+
if: contains(github.event.pull_request.changed_files, 'feature_extractor.py')
|
| 275 |
+
steps:
|
| 276 |
+
- uses: actions/checkout@v4
|
| 277 |
+
with: { fetch-depth: 2 }
|
| 278 |
+
- uses: astral-sh/setup-uv@v3
|
| 279 |
+
- run: uv sync --frozen
|
| 280 |
+
- run: uv run python -m pdfsys_router.download_weights
|
| 281 |
+
- run: bash scripts/check_parity.sh origin/main HEAD
|
| 282 |
+
```
|
| 283 |
+
|
| 284 |
+
**实施步骤**:
|
| 285 |
+
|
| 286 |
+
1. 写上面 workflow
|
| 287 |
+
2. 可选:`.github/workflows/preview-hf-space.yml` PR 自动部署预览 Space
|
| 288 |
+
3. GitHub Settings → Branches 把 `main` 设为 protected、必须通过 CI
|
| 289 |
+
|
| 290 |
+
**验收**:PR 打开 3 分钟内看到 ✅ × 3。
|
| 291 |
+
|
| 292 |
+
**工作量**:1 人 · 1 天
|
| 293 |
+
|
| 294 |
+
---
|
| 295 |
+
|
| 296 |
+
### 3.4 uv.lock 入库 + 依赖上界
|
| 297 |
+
|
| 298 |
+
**当前痛点**:
|
| 299 |
+
- `.gitignore:14` 把 `uv.lock` 排除了(反模式,lock 文件必须入库)
|
| 300 |
+
- 所有依赖只有下界:`pymupdf>=1.24` 明天升级到 2.0 会被自动拉进来
|
| 301 |
+
|
| 302 |
+
**修复**:
|
| 303 |
+
|
| 304 |
+
1. 从 `.gitignore` 移除 `uv.lock`
|
| 305 |
+
2. 给所有依赖加上界(保守策略 major+1):
|
| 306 |
+
|
| 307 |
+
```toml
|
| 308 |
+
# packages/pdfsys-router/pyproject.toml
|
| 309 |
+
dependencies = [
|
| 310 |
+
"pdfsys-core",
|
| 311 |
+
"pymupdf>=1.24,<2.0",
|
| 312 |
+
"xgboost>=2.0,<3.0",
|
| 313 |
+
"scikit-learn>=1.3,<2.0",
|
| 314 |
+
"pandas>=2.0,<3.0",
|
| 315 |
+
"numpy>=1.26,<3.0",
|
| 316 |
+
]
|
| 317 |
+
```
|
| 318 |
+
|
| 319 |
+
3. `uv lock && git add uv.lock`
|
| 320 |
+
4. CI 用 `uv sync --frozen`(见 §3.3)
|
| 321 |
+
|
| 322 |
+
**工作量**:0.5 天

---

### 3.5 Parity harness

**Background**: `.cursor/rules/21-router-parity.mdc` already describes the parity verification flow, but there is **no runnable script**.

**`scripts/check_parity.sh`**:

```bash
#!/usr/bin/env bash
# Verify router ocr_prob drift between two refs.
# Usage: bash scripts/check_parity.sh <baseline_ref> <candidate_ref>
set -euo pipefail

BASELINE="${1:-origin/main}"
CANDIDATE="${2:-HEAD}"
SAMPLE_DIR="${PARITY_SAMPLE_DIR:-tests/fixtures/pdfs}"
EPSILON="${PARITY_EPSILON:-1e-6}"
WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT

run_bench() {
  local ref="$1" out="$2"
  git worktree add "$WORK_DIR/$ref" "$ref"
  (cd "$WORK_DIR/$ref" && uv sync --frozen --quiet \
    && uv run python -m pdfsys_router.download_weights >/dev/null \
    && uv run python -m pdfsys_bench --pdf-dir "$SAMPLE_DIR" --out "$out" --no-quality)
  git worktree remove --force "$WORK_DIR/$ref"
}

run_bench "$BASELINE" "$WORK_DIR/baseline.jsonl"
run_bench "$CANDIDATE" "$WORK_DIR/candidate.jsonl"

uv run python scripts/parity_diff.py \
  "$WORK_DIR/baseline.jsonl" "$WORK_DIR/candidate.jsonl" \
  --epsilon "$EPSILON"
```

**`scripts/parity_diff.py`**: takes the two JSONL files, compares `ocr_prob` per PDF, and exits non-zero when drift exceeds the threshold.

**Effort**: 1 day

---

## 4 · P1 performance & reliability (weeks 3-6)

### 4.1 Router hot-path optimisation

**Status quo**: 49 ms/PDF (PRD target ≤ 10 ms). Over a 1 PB corpus that wastes 10+ CPU-hours.

**Optimisation targets** (profile first, then change; the P0 tests must already be in place):

#### (a) Drop the pandas DataFrame construction

```python
# ❌ current (packages/pdfsys-router/src/pdfsys_router/xgb_model.py)
df = pd.DataFrame([features])
names = getattr(self.model, "feature_names_in_", None)
if names is not None:
    df = df.reindex(columns=list(names), fill_value=0)
probs = self.model.predict_proba(df)

# ✅ optimised: cache the column order, feed a numpy array directly
class XgbRouterModel:
    def __init__(self, path):
        self._feature_order: list[str] | None = None

    def predict_proba(self, features: dict[str, float]) -> float:
        if self._feature_order is None:
            self._feature_order = list(self.model.feature_names_in_)
        arr = np.fromiter(
            (features.get(k, 0.0) for k in self._feature_order),
            dtype=np.float32, count=len(self._feature_order),
        ).reshape(1, -1)
        return float(self.model.predict_proba(arr)[0, 1])
```

Estimate: ~15 ms → ~2 ms.

#### (b) Deduplicate PyMuPDF text reads

`_get_garbled_text_per_page` calls `get_text()` on every page, then `compute_features_per_chunk` reads the sampled pages again, so the same page is read twice.
Optimisation: cache a `page → text` dict when the sampled pages are first read, and reuse it. Estimate: ~25 ms → ~12 ms.
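
The proposed cache can be sketched as follows; the helper name and the `read_page` callable are illustrative stand-ins for the repository's real functions, with `lambda i: doc[i].get_text()` as the PyMuPDF binding:

```python
from collections.abc import Callable


def build_page_text_cache(
    sample_indices: list[int],
    read_page: Callable[[int], str],
) -> dict[int, str]:
    """Read each sampled page's text exactly once and memoise it.

    Both the garble check and compute_features_per_chunk then share this
    dict instead of calling page.get_text() a second time per page.
    """
    cache: dict[int, str] = {}
    for idx in sample_indices:
        if idx not in cache:  # duplicate sample indices cost nothing extra
            cache[idx] = read_page(idx)
    return cache
```

Usage: `cache = build_page_text_cache(indices, lambda i: doc[i].get_text())`.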

#### (c) Early returns

Hard failures such as `is_encrypted` / `needs_pass` / `len(doc) == 0` should short-circuit before feature extraction.

**Acceptance**: the parity harness verifies `|diff(ocr_prob)| < 1e-6`; p50 ≤ 10 ms on OmniDocBench-100.

**Effort**: 2-3 days

---

### 4.2 Batched quality-scorer inference

**Status quo**: 3.6 s per document; 100k documents ≈ 100 hours.

**Change**: turn `OcrQualityScorer.score_many` from a loop into a real batch:

```python
def score_many(self, texts: list[str], batch_size: int = 8) -> list[QualityScore]:
    self._ensure_loaded()
    torch = self._torch
    results: list[QualityScore] = []
    for i in range(0, len(texts), batch_size):
        batch = [t[:self.max_chars] or " " for t in texts[i:i + batch_size]]
        enc = self._tokenizer(
            batch, return_tensors="pt", truncation=True,
            max_length=self.max_tokens, padding=True,
        ).to(self._device)
        with torch.inference_mode():
            logits = self._model(**enc).logits.squeeze(-1)
        for j, text in enumerate(batch):
            score = max(0.0, min(3.0, float(logits[j].item())))
            results.append(QualityScore(
                score=score,
                num_chars=len(text),
                num_tokens=int(enc["attention_mask"][j].sum()),
                model=self.model_name,
            ))
    return results
```

**Companion change**: rewrite `pdfsys_bench.loop.run_loop` as "extract everything → score in batches → fan back out to JSONL", preserving output order.
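
The restructure can be sketched as below. The function and parameter names are illustrative, not the real `run_loop` signature; `extract`, `score_many`, and `write_record` stand in for the actual pipeline stages, and the enumerated index is what preserves JSONL output order:

```python
def run_loop_batched(pdfs, extract, score_many, write_record):
    """Extract all docs first, score their markdown in one batched call,
    then write records back out in the original input order."""
    extracted = [(i, extract(p)) for i, p in enumerate(pdfs)]
    scores = score_many([doc.markdown for _, doc in extracted])
    for (i, doc), q in zip(extracted, scores):
        write_record(i, doc, q)  # i preserves the sequential loop's order
```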

**Acceptance**: batch=8 throughput ≥ 3× batch=1; per-sample numeric drift `< 1e-3`.

**Effort**: 3 days

---

### 4.3 structlog logging

**Status quo**: 12 bare `print(...)` calls across the repo; no levels, no structure.

**Plan**: introduce `structlog` in every package except `pdfsys-core` (core stays zero-dependency):

```python
# packages/pdfsys-router/src/pdfsys_router/_log.py
import structlog

log = structlog.get_logger("pdfsys.router")

# usage:
log.info("classified", backend=decision.backend.value,
         ocr_prob=decision.ocr_prob, pdf=str(path),
         num_pages=decision.num_pages)
```

Use `JSONRenderer()` in production (easy to ingest into Grafana/ELK) and `ConsoleRenderer()` in dev.

**Effort**: 2 days

---

### 4.4 Prometheus metrics

**Minimal implementation**:

```python
# packages/pdfsys-bench/src/pdfsys_bench/_metrics.py
from prometheus_client import Counter, Histogram, start_http_server

router_decisions = Counter("pdfsys_router_decisions_total",
                           "Router decisions by backend", ["backend"])
router_latency = Histogram("pdfsys_router_duration_seconds",
                           "Router classification latency",
                           buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
extract_failures = Counter("pdfsys_extract_failures_total",
                           "Extraction failures", ["backend", "error_class"])

def enable_metrics_endpoint(port: int = 9000) -> None:
    start_http_server(port)
```

The `pdfsys-bench` CLI gains a `--metrics-port` flag.

**Effort**: 2 days

---

### 4.5 Error taxonomy + quarantine bucket

**Status quo**: failures are written as free-form strings like `extract_error: "classify_failed: X"`, which cannot be aggregated.

**Plan**: add `errors.py` to `pdfsys-core`:

```python
from enum import Enum

class ErrorClass(str, Enum):
    OPEN_FAILED = "open_failed"
    ENCRYPTED = "encrypted"
    EMPTY = "empty"
    CORRUPTED_STREAM = "corrupted_stream"
    FEATURE_EXTRACTION_FAILED = "feature_extraction_failed"
    MODEL_INFERENCE_FAILED = "model_inference_failed"
    OOM = "oom"
    UNKNOWN = "unknown"
```

`RouterDecision.error_class: ErrorClass` replaces the free-form string. Bench aggregates counts per class.

Quarantine bucket: `out/quarantine/<error_class>/<sha256>.json` keeps a record of each failure (path + error + full feature vector, **not the PDF itself**) for offline analysis.
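
The quarantine writer could look like the sketch below; the record layout follows the description above (path, error, feature vector, no PDF bytes), while the function name itself is illustrative:

```python
import json
from pathlib import Path


def write_quarantine_record(
    out_dir: Path,
    error_class: str,
    sha256: str,
    pdf_path: str,
    error: str,
    features: dict[str, float],
) -> Path:
    """Persist one failure record under out/quarantine/<error_class>/<sha256>.json."""
    dst = out_dir / "quarantine" / error_class / f"{sha256}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(json.dumps({
        "pdf_path": pdf_path,   # where the source PDF lives; the PDF itself is not copied
        "error": error,
        "features": features,
    }, ensure_ascii=False, indent=2))
    return dst
```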

**Effort**: 3 days

---

## 5 · P2 feature completion (weeks 7-16)

### Dependency DAG

```
Layout Analyser (5.1) ──┬──► Pipeline Parser (5.2) ──┐
                        │                            │
                        └──► VLM Parser (5.4) ───────┼──► Stage-3 (5.5) ──► Stage-4 (5.6) ──► Stage-5 (5.7)
                                                     │
              ┌──► Stage-B Router (5.3) ─────────────┘
              │
        (reads LayoutCache)
```

### 5.1 Layout analyser · P2-1

**Model choice**: PP-DocLayoutV3 ONNX INT8 (~50 ms/page on CPU), with docling-layout-heron as a possible future drop-in.

**Deliverables**:

```
packages/pdfsys-layout-analyser/src/pdfsys_layout_analyser/
├── __init__.py
├── analyser.py            # LayoutAnalyser main class
├── runners/
│   ├── pp_doclayoutv3.py  # ONNX runtime driver
│   └── heuristic.py       # bbox column-count clustering fallback
├── render.py              # PDF page → PNG (configurable DPI)
└── postprocess.py         # reading order + cross-column merge
```

**API**:

```python
class LayoutAnalyser:
    def __init__(self, config: LayoutConfig = LayoutConfig()): ...
    def analyse(self, pdf_path: str | Path) -> LayoutDocument: ...
    def analyse_with_cache(
        self, pdf_path: str | Path, cache: LayoutCache
    ) -> LayoutDocument: ...  # idempotent
```

**Acceptance**:
- mAP ≥ 0.85 on OmniDocBench-100
- CPU INT8 throughput ≥ 20 pages/s/core
- `LayoutDocument` round-trips losslessly through `LayoutCache.save/load`
- empty / encrypted / corrupted PDFs never crash

**Effort**: 1 person · 10 days

---

### 5.2 Pipeline parser · P2-2

**Model choice**: RapidOCR (PaddleOCR's ONNX forward pass, no Paddle dependency).

**Deliverables**:

```
packages/pdfsys-parser-pipeline/src/pdfsys_parser_pipeline/
├── extract.py            # extract_doc / extract_doc_bytes
├── ocr_engine.py         # RapidOCR wrapper (lazy load)
├── region_processor.py   # dispatch by RegionType
├── image_cropper.py      # bbox → image crop
└── markdown_emitter.py   # region + OCR → Segment
```

**Core logic**:

```python
def extract_doc(pdf_path, *, layout_cache: LayoutCache) -> ExtractedDoc:
    layout = layout_cache.load_or_compute(pdf_path, analyser)
    segments = []
    for page in layout.pages:
        for region in page.regions:
            img = crop_region_from_pdf(pdf_path, page.index, region.bbox)
            text = ocr_engine.recognise(img, region.type)
            segments.append(Segment(
                index=len(segments),
                backend=Backend.PIPELINE,
                page_index=page.index,
                type=region.type,
                content=text,
                bbox=region.bbox,
                source_region_id=region.region_id,
            ))
    return ExtractedDoc(
        sha256=sha256_of_file(pdf_path),
        backend=Backend.PIPELINE,
        segments=tuple(segments),
        markdown=merge_segments_to_markdown(tuple(segments)),
        stats={"page_count": len(layout.pages)},
    )
```

**Acceptance**:
- Chinese character F1 ≥ 0.90 on the OmniDocBench scanned-document subset
- output schema isomorphic to `parser-mupdf` (guarded by `tests/contract/test_extracted_doc_schema.py`)
- CPU throughput ≥ 5 pages/s/core

**Effort**: 1 person · 12 days

---

### 5.3 Stage-B router · P2-3

Turn the current 4-line `decider.py` stub into a real implementation:

```python
def decide_complex_vs_simple(
    layout: LayoutDocument, config: RouterConfig
) -> Backend:
    if not config.vlm_enabled:
        return Backend.PIPELINE
    if layout.has_complex_content:
        return Backend.VLM
    return Backend.PIPELINE
```

`Router._route()`: when `ocr_prob ≥ threshold`, first check the `LayoutCache`; on a hit, call `decide_complex_vs_simple`; on a miss, return `DEFERRED`.
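
The `_route` integration can be sketched self-contained as below. The `Backend` members and the plain `route` function mirror this document's usage but are simplified stand-ins, not the repository's real classes:

```python
from enum import Enum


class Backend(Enum):
    TEXT = "text"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


def route(ocr_prob: float, threshold: float, layout, vlm_enabled: bool) -> Backend:
    if ocr_prob < threshold:
        return Backend.TEXT          # digital-born: MuPDF path
    if layout is None:
        return Backend.DEFERRED      # LayoutCache miss: decide later
    if vlm_enabled and layout.has_complex_content:
        return Backend.VLM
    return Backend.PIPELINE
```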

**Effort**: 2 days

---

### 5.4 VLM parser · P2-4

**Model choice** (PRD §4.4): MinerU 2.5-Pro 1.2B served via LMDeploy in production.

**Deliverables**:

```
packages/pdfsys-parser-vlm/src/pdfsys_parser_vlm/
├── extract.py
├── engines/
│   ├── mineru.py        # LMDeploy wrapper
│   └── paddleocr_vl.py  # fallback engine
├── batching.py          # dynamic batching
├── rendering.py         # high-DPI page rendering
└── fallback.py          # retry with smaller batch on OOM
```

**Key constraints**:
- the worker keeps the model resident (lazy-loaded singleton)
- `max_batch_size=16, max_seq=8192` (PRD §4.4)
- over-long pages: a single page over 8192 tokens is split in two by bbox clustering
- per-page OOM retries with a smaller batch ≤ 2 times, then writes to quarantine (see §4.5)
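
The OOM fallback policy could be sketched as below: halve the batch, retry up to `max_retries` times, then re-raise so the caller can quarantine the page. `infer` stands in for the LMDeploy call and `MemoryError` for the engine's actual OOM exception type:

```python
def infer_with_fallback(pages, infer, batch_size: int = 16, max_retries: int = 2):
    """Run infer over pages in chunks, halving the batch on OOM."""
    attempts = 0
    while True:
        try:
            results = []
            for i in range(0, len(pages), batch_size):
                results.extend(infer(pages[i:i + batch_size]))
            return results
        except MemoryError:
            attempts += 1
            if attempts > max_retries or batch_size == 1:
                raise  # caller writes the quarantine record (§4.5)
            batch_size = max(1, batch_size // 2)
```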

**Effort**: 1 person · 15 days (including LMDeploy bring-up)

---

### 5.5 Stage-3 post-processing

Split out into a new package `packages/pdfsys-postproc/`:

```
├── reading_order.py    # cross-page merge, footnote re-attachment, two-column interleave fix
├── paragraph_merge.py  # line-wrap restoration + Chinese sentence segmentation
├── formula_norm.py     # KaTeX syntax validation; on failure, fall back to an image placeholder
├── table_norm.py       # HTML ↔ Markdown dual format, row/column validation
└── unicode_norm.py     # NFC + fullwidth/halfwidth unification + zero-width character cleanup
```
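
A minimal sketch of what `unicode_norm.py` might do: NFC normalisation, explicit folding of the fullwidth ASCII block, and zero-width character removal. NFKC would fold fullwidth forms too, but it also rewrites other compatibility characters, so the sketch folds the fullwidth range by hand; the function name is illustrative:

```python
import unicodedata

_ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalise_text(text: str) -> str:
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in _ZERO_WIDTH:
            continue
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:   # fullwidth ASCII variants -> halfwidth
            out.append(chr(code - 0xFEE0))
        elif ch == "\u3000":           # ideographic space -> ASCII space
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```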

**Effort**: 1 person · 10 days

---

### 5.6 Stage-4 quality / PII / MinHash dedup

Split out into `packages/pdfsys-quality/`, reusing `datatrove`'s MinHash blocks (PRD §4.6.5):

```
├── lang_id.py     # GlotLID paragraph-level language ID
├── heuristic.py   # repeated n-grams, non-CJK ratio, line-length variance
├── edu_score.py   # Chinese EduScore (fastText → DeBERTa-v3-tiny)
├── pii.py         # regex rules + NER backstop
└── dedup/
    ├── exact.py   # md5 exact content dedup
    └── minhash.py # datatrove MinHash LSH wrapper
```
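
The `exact.py` half is the easy part and could look like the sketch below: an md5 over lightly normalised content, first occurrence wins. The normalisation step (a bare `strip()`) is illustrative; the real module would likely hash the post-Stage-3 markdown:

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct content hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.md5(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

MinHash near-dedup stays in `minhash.py` on top of datatrove, since it needs the cross-shard global shuffle noted below.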

**Effort**: 2 people · 3 weeks (MinHash needs a global shuffle across shards, the most complex piece)

---

### 5.7 Stage-5 Parquet packaging

Split out into `packages/pdfsys-output/`:
- Parquet shards of ~1 GB each, zstd-compressed
- bucketed paths: `v1/lang=zh/source=arxiv/qb=high/shard-NNNNN.parquet`
- JSONL mirror + sampled Markdown archive (0.1% per shard)
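
The bucketed path scheme above can be pinned down with a tiny helper; the field names mirror the example path and the function itself is illustrative:

```python
def shard_path(lang: str, source: str, quality_bucket: str, shard_index: int,
               version: str = "v1") -> str:
    """Build a bucketed shard path like v1/lang=zh/source=arxiv/qb=high/shard-00007.parquet."""
    return (f"{version}/lang={lang}/source={source}/"
            f"qb={quality_bucket}/shard-{shard_index:05d}.parquet")
```

Hive-style `key=value` path segments keep the buckets directly queryable by engines such as DuckDB or Spark without a separate index.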

**Effort**: 1 person · 5 days

---

## 6 · P3 scale-out & ecosystem (3-6 months)

| Item | Description | Effort |
|---|---|---|
| **datatrove integration** | wrap the existing stages as `datatrove.Block`s, native Slurm backend | 2-3 weeks |
| **Slurm / K8s runner** | new package `pdfsys-runner`, with shard checkpointing + backpressure | 3-4 weeks |
| **Object-storage backend** | abstract an `FSBackend` protocol in `pdfsys-core`, supporting `file://` / `s3://` / `oss://` / `minio://` | 1-2 weeks |
| **Chinese EduScore training** | fastText → DeBERTa-v3-tiny classifier + data labelling | 4-6 weeks (incl. labelling) |
| **Vertical-text classics LoRA** | targeted LoRA fine-tune of MinerU 2.5 | 4-6 weeks (GPU-heavy) |

---

## 7 · Milestone timeline

| Milestone | Week | Marker |
|---|---|---|
| **M1 · collaboration-ready** | 2 | CI green; coverage target met; lock file committed; parity harness gating |
| **M2 · production-grade core** | 6 | router p50 ≤ 10 ms; 3× scorer throughput; unified logs + metrics; aggregatable errors |
| **M3 · 6-stage pipeline** | 16 | 10 GB dataset runs end-to-end; all three backends share one schema |
| **M4 · PB-ready** | 24 | datatrove + Slurm runner; object-storage backend; TCO estimate committed |
| **M5 · v0.1 dataset** | 32 | first publishable TB-scale dataset + evaluation report |

---

## 8 · Quick wins · can start within two weeks

If we may pick only the five highest-ROI items to do right now:

1. **Write 15 unit tests for core / router / parser-mupdf** — 2 days · turns the invariants into machine-verifiable facts
2. **Set up ruff + pre-commit** — 0.5 days · establishes a quality floor for new PRs
3. **Write `.github/workflows/ci.yml`** — 0.5 days · feedback moves from "at review time" to "at push time"
4. **Commit `uv.lock` + add dependency upper bounds** — 0.5 days · dependencies stop shifting under us
5. **`scripts/check_parity.sh` + 10 sample PDFs in fixtures** — 2 days · router changes get an automatic gate

In total, **5-6 working days** buys every prerequisite of "collaboration-ready". Strongly recommended as the first sprint.

---

## 9 · Risks and deliberate non-goals

### Temptations to resist

- ❌ **Do not touch the stub implementations before P0 lands** — without tests and the parity harness, every feature added accrues interest on technical debt
- ❌ **Do not replace PyMuPDF** — its engineering maturity on Chinese documents is first-tier; switching to pdfminer/PyPDF2 is an immediate regression
- ❌ **Do not introduce LangChain / LlamaIndex** — this is a data-processing pipeline, not a RAG application
- ❌ **Do not add pydantic to `pdfsys-core`** — the existing `dataclass(frozen=True, slots=True)` + `serde.py` suffice; pydantic breaks the zero-dependency invariant

### Long-term risks and mitigations

| Risk | Mitigation |
|---|---|
| MinerU 2.5 licence changes in a new release | keep PaddleOCR-VL as a hot standby; make `pdfsys-parser-vlm` an engine abstraction |
| PyMuPDF AGPL restrictions | evaluate pikepdf / pdfplumber as an exit path (low priority) |
| PB-scale object-storage costs spiral | write `scripts/tco.py` during P0 for estimation |
| insufficient Chinese PII recall | NER model backstop; keep an audit table for after-the-fact remediation |

---

## 10 · Tracking progress

- **Short term (P0-P1)**: GitHub Projects / Milestones. One issue per sub-item, each with acceptance criteria.
- **Mid term (P2)**: open a "tracking issue" per landed stage to aggregate its sub-PRs; update `CHANGELOG.md` per SemVer.
- **Long term (P3)**: revisit the PRD §10 P0/P1/P2/P3 roadmap monthly; iterate this document to v0.N in lockstep.

Progress state lives in the root `README.md` §What's implemented table — per the mapping in `.cursor/rules/03-doc-sync.mdc`, any stage flipping ❌→✅ must update that table.

---

## Appendix · Totals at a glance

| Phase | Duration | Core deliverables | Headcount |
|---|---|---|---|
| **P0 engineering basics** | 2 weeks | pytest + ruff + CI + lock + parity | 1 person |
| **P1 performance/reliability** | 4 weeks | 5× router, 3× scorer, logs/metrics | 1-2 people |
| **P2 feature completion** | 10-12 weeks | full 6-stage loop | 2-3 people |
| **P3 scale-out** | 3-6 months | datatrove + Slurm + PB-scale runs | 3-4 people |

From zero to "PB-ready" is roughly 24 weeks and about 20-30 person-weeks in total, consistent with the PRD §6 resource budget of "100 × A100 + 32 CPU nodes, ~2 months wall-clock": **build the toolchain first, then plug in the big compute**.

packages/pdfsys-router/src/pdfsys_router/download_weights.py
CHANGED
@@ -12,39 +12,50 @@ Usage::
 from __future__ import annotations

+import socket
 import sys
+import urllib.error  # the except clause below references urllib.error.URLError
 import urllib.request
 from pathlib import Path

-#
-"https://…
-)
+# GitHub raw download URL for XGBoost router weights
+WEIGHTS_URLS = [
+    "https://github.com/huggingface/finepdfs/raw/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+    "https://raw.githubusercontent.com/huggingface/finepdfs/main/models/xgb_ocr_classifier/xgb_classifier.ubj",
+]


 def target_path() -> Path:
     return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"


-def download(force: bool = False) -> Path:
+def download(force: bool = False, timeout: int = 30) -> Path:
     dst = target_path()
     if dst.exists() and not force:
         print(f"[download_weights] already present: {dst}")
         return dst
     dst.parent.mkdir(parents=True, exist_ok=True)
-    …
+
+    last_error = None
+    for url in WEIGHTS_URLS:
+        print(f"[download_weights] fetching {url}")
+        try:
+            # set a timeout so a hung connection cannot block forever
+            with urllib.request.urlopen(url, timeout=timeout) as r:  # noqa: S310 — pinned URL
+                data = r.read()
+            if len(data) < 10_000:
+                raise RuntimeError(
+                    f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+                    "likely an LFS pointer, not the binary"
+                )
+            dst.write_bytes(data)
+            print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+            return dst
+        except (urllib.error.URLError, socket.timeout) as e:
+            last_error = e
+            print(f"[download_weights] failed for {url}: {e}")
+            continue
+
+    raise RuntimeError(f"Failed to download weights from all URLs: {last_error}")


 if __name__ == "__main__":
requirements.txt
ADDED
@@ -0,0 +1,21 @@
+# Hugging Face Spaces installs from this file.
+# Note: local workspace packages (pdfsys-*) are loaded via sys.path in demo/app.py
+# and do not need an editable install on HF Spaces.
+
+# --- Python 3.13 compatibility (audioop removed) --------------------------
+audioop-lts
+
+# --- CPU-only torch (HF Spaces free tier is CPU) --------------------------
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch>=2.1,<3.0
+
+# --- Third-party runtime deps ---------------------------------------------
+gradio==5.12.0
+huggingface-hub>=0.26,<0.29
+pymupdf>=1.24
+xgboost>=2.0
+scikit-learn>=1.3
+pandas>=2.0
+numpy>=1.26
+transformers>=4.44
+pillow>=10.0