yin (Claude Opus 4.6) committed
Commit · b8ca6f2
Parent(s): d423504
docs: add project README, CONTRIBUTING guide, and per-package READMEs
Rewrite the top-level README with:
- Quick start (uv sync + download weights + run bench CLI)
- Architecture diagram with implemented/stub status table
- MVP benchmark results on OmniDocBench-100
- Key data structures (RouterDecision, ExtractedDoc, QualityScore)
- Design principles, CLI reference, output format spec
Add CONTRIBUTING.md covering:
- Dev environment setup
- Project structure overview
- Code conventions (naming, immutability, error handling)
- Feature extractor parity rules (124-column contract)
- How to add a new parser backend
- Commit message conventions
Add per-package READMEs for all 7 workspace packages explaining
each one's role, usage, and scope.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CONTRIBUTING.md +126 -0
- README.md +188 -22
- packages/pdfsys-core/README.md +17 -0
- packages/pdfsys-layout-analyser/README.md +11 -0
- packages/pdfsys-parser-mupdf/README.md +30 -0
- packages/pdfsys-parser-pipeline/README.md +5 -0
- packages/pdfsys-parser-vlm/README.md +5 -0
- packages/pdfsys-router/README.md +44 -0
CONTRIBUTING.md (ADDED)
# Contributing to pdfsys-mnbvc

## Dev environment setup

```bash
# Prerequisites: Python >= 3.11, uv >= 0.4
uv sync                                   # installs all workspace packages in editable mode
python -m pdfsys_router.download_weights  # one-time: fetch XGBoost weights (257 KB)
```

If you'll be working on quality scoring, torch + transformers are pulled in by `pdfsys-bench`. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set `HF_HOME` to control the cache location.

## Project structure

```
pdfsystem_mnbvc/
├── pyproject.toml                 # uv workspace root (meta-package)
├── packages/
│   ├── pdfsys-core/               # shared types, enums, layout cache, serde
│   ├── pdfsys-router/             # Stage-A XGBoost classifier
│   │   ├── models/                # gitignored xgb_classifier.ubj lives here
│   │   └── src/pdfsys_router/
│   │       ├── feature_extractor.py   # 124-feature PyMuPDF extractor
│   │       ├── xgb_model.py           # lazy model loader
│   │       ├── classifier.py          # Router.classify() -> RouterDecision
│   │       └── download_weights.py    # fetch weights from HF LFS
│   ├── pdfsys-parser-mupdf/       # text-ok fast path (PyMuPDF blocks -> Markdown)
│   ├── pdfsys-parser-pipeline/    # OCR backend (stub)
│   ├── pdfsys-parser-vlm/         # VLM backend (stub)
│   ├── pdfsys-layout-analyser/    # layout model runner (stub)
│   └── pdfsys-bench/              # evaluation harness + quality scorer
│       ├── omnidocbench_100/      # gitignored bench dataset
│       └── src/pdfsys_bench/
│           ├── quality.py         # ModernBERT-large OCR quality scorer
│           ├── loop.py            # router -> parser -> scorer -> JSONL runner
│           └── __main__.py        # CLI entry point
└── out/                           # gitignored run outputs
```

## Code conventions

### Naming

- Package dirs: `pdfsys-<name>` (kebab-case in pyproject.toml and directory names).
- Import names: `pdfsys_<name>` (snake_case, matching `src/pdfsys_<name>/`).
- All packages live under `packages/` and use the `[tool.uv.workspace]` editable pattern.

### Types and immutability

- Core data structures are `@dataclass(frozen=True, slots=True)`.
- Enums live in `pdfsys_core.types`.
- BBox coordinates are always normalized to `[0, 1]`; convert to pixels/points at the call site.
- Parser backends all emit `ExtractedDoc` with a `tuple[Segment, ...]`; the schema is backend-agnostic.

### Error handling

- `Router.classify()` never raises. Errors are surfaced via `RouterDecision.error`.
- Parser `extract_doc()` may raise; the bench loop catches and records errors in JSONL.
- Prefer explicit `except Exception` with a recorded message over silent swallowing.
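A minimal sketch of the record-rather-than-raise pattern the bench loop uses; the function and record field names are illustrative, not the actual `loop.py` API:

```python
def safe_extract(extract_fn, pdf_path: str) -> dict:
    """Run an extractor, recording failure in the result instead of propagating it."""
    record = {"pdf_path": pdf_path, "extract_error": None}
    try:
        doc = extract_fn(pdf_path)
        record["markdown_chars"] = len(doc.markdown)
    except Exception as exc:  # explicit catch with a recorded message, never silent
        record["extract_error"] = f"{type(exc).__name__}: {exc}"
    return record

def broken_extractor(path):
    # Stand-in for a parser hitting a corrupt PDF.
    raise ValueError("bad xref table")

rec = safe_extract(broken_extractor, "x.pdf")
print(rec["extract_error"])  # ValueError: bad xref table
```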

### Feature extractor parity

The `feature_extractor.py` in `pdfsys-router` is a direct port of FinePDFs'
`blocks/predictor/ocr_predictor.py`. The 124-column feature vector MUST match
the upstream layout exactly: the XGBoost weights depend on column order. If you
change any feature extraction logic, verify against the FinePDFs reference output
before merging.

The feature ordering is:

1. `num_pages_successfully_sampled` (doc-level)
2. `garbled_text_ratio` (doc-level)
3. `is_form` (doc-level)
4. `creator_or_producer_is_known_scanner` (doc-level)
5. `page_level_unique_font_counts_page1` through `_page8`
6. ... (15 page-level features × 8 pages = 120 columns)

Total: 4 + 120 = 124 features.
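One way to sanity-check the column contract is to generate the expected ordering programmatically. The doc-level names below come from the list above; everything in `PAGE_FEATURES` past `unique_font_counts` is a placeholder for the real 15-name list in `feature_extractor.py`:

```python
# Doc-level columns, in the order listed above.
DOC_FEATURES = [
    "num_pages_successfully_sampled",
    "garbled_text_ratio",
    "is_form",
    "creator_or_producer_is_known_scanner",
]

# Placeholder names: the real extractor defines 15 page-level features.
PAGE_FEATURES = ["unique_font_counts"] + [f"placeholder_feature_{i}" for i in range(14)]
NUM_SAMPLED_PAGES = 8

# Page-level columns are grouped per feature, then per page (_page1 .. _page8).
columns = list(DOC_FEATURES)
for feat in PAGE_FEATURES:
    for page in range(1, NUM_SAMPLED_PAGES + 1):
        columns.append(f"page_level_{feat}_page{page}")

print(len(columns))  # 124
```

An assertion like `len(columns) == 124` against the extractor's actual column list is a cheap parity test to run before merging.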

### Dependencies

- `pdfsys-core` has **zero** external dependencies. Keep it that way.
- Heavy deps (torch, transformers) are lazy-imported so that `import pdfsys_bench` doesn't pull them in at module scope.
- XGBoost model weights are NOT committed to the repo. They're downloaded on demand via `download_weights.py`.

## Running the MVP

```bash
# Full run on OmniDocBench-100 (takes ~4 min on CPU)
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl \
    --markdown-dir out/bench_omnidoc100_md

# Fast smoke test (no quality scoring)
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/smoke.jsonl \
    --limit 5 --no-quality
```

Output: one JSONL file (per-doc results) + one `.summary.json` (aggregate stats).

## Adding a new parser backend

1. Implement the backend in its package under `packages/pdfsys-parser-<name>/`.
2. The entry point should accept a `Path` and return `ExtractedDoc` (from `pdfsys-core`).
3. Each `Segment` must have `page_index`, `type` (RegionType), `content`, and ideally a normalized `BBox`.
4. Call `merge_segments_to_markdown(segments)` from `pdfsys-core` to produce the `markdown` field.
5. Wire it into `loop.py` by handling the corresponding `Backend` enum value.
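The steps above can be sketched as a minimal backend skeleton. The types here are simplified stand-ins for the pdfsys-core definitions, and the segment content is hard-coded where a real backend would parse the PDF:

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# Simplified stand-ins for the pdfsys-core types.
class RegionType(Enum):
    TEXT = "text"

@dataclass(frozen=True)
class Segment:
    page_index: int
    type: RegionType
    content: str

@dataclass(frozen=True)
class ExtractedDoc:
    segments: tuple[Segment, ...]
    markdown: str

def merge_segments_to_markdown(segments: tuple[Segment, ...]) -> str:
    # Stand-in for the pdfsys-core helper; a \n\n join is assumed here.
    return "\n\n".join(s.content for s in segments)

def extract_doc(path: Path) -> ExtractedDoc:
    """Entry point shape from step 2: Path in, ExtractedDoc out."""
    # A real backend would parse the file; one hard-coded segment stands in.
    segments = (Segment(0, RegionType.TEXT, f"Parsed text from {path.name}"),)
    return ExtractedDoc(segments=segments, markdown=merge_segments_to_markdown(segments))

doc = extract_doc(Path("sample.pdf"))
print(doc.markdown)  # Parsed text from sample.pdf
```

With the entry point in this shape, wiring into `loop.py` is just a new `Backend` enum branch.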

## Adding new features to the router

**Do not** modify `feature_extractor.py` unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in `classifier.py` rather than changing the feature vector.

## Commit conventions

Commit messages follow conventional commits:

```
feat(router): add scanner metadata detection
fix(parser-mupdf): handle zero-width bbox on empty pages
docs: update quickstart for new deps
chore: bump pymupdf to 1.25
```

Scope is the package name without the `pdfsys-` prefix (e.g. `router`, `core`, `bench`, `parser-mupdf`).
README.md (CHANGED)
# pdfsys-mnbvc

PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.

## Current status: MVP closed loop ✅

The first end-to-end path (**Router → MuPDF parser → OCR quality scorer**) is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).

## Quick start

```bash
# 1. Install uv (>= 0.4)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repo and sync all workspace packages
git clone <this-repo-url>
cd pdfsystem_mnbvc
uv sync

# 3. Fetch the XGBoost router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run the MVP closed loop on the bench dataset
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl \
    --markdown-dir out/bench_omnidoc100_md
```

> **Note:** The first run downloads the ModernBERT-large quality scorer
> (~800 MB) from HuggingFace Hub. Set `HF_HOME` to control where it's
> cached. If you don't need quality scoring, add `--no-quality` to skip it.

> **Note:** The bench dataset (omnidocbench_100) is NOT committed to the repo.
> You need to obtain it separately and place it under
> `packages/pdfsys-bench/omnidocbench_100/`.

## Architecture

```
                ┌──────────────┐
     PDF ────►  │ pdfsys-router│   stage A: XGBoost (124 PyMuPDF features)
                └──────┬───────┘
                       │
         text-ok ◄─────┴─────► needs-ocr
             │                     │
      parser-mupdf          layout-analyser
                            + stage-B router
                             │           │
                   parser-pipeline   parser-vlm
```

### What's implemented

| Stage | Status | Description |
|-------|--------|-------------|
| **Stage-A router** | ✅ | XGBoost binary classifier, ported from FinePDFs. 124 features (4 doc-level + 15 page-level × 8 sampled pages). Routes to `MUPDF` (text-ok) or `PIPELINE` (needs-ocr). |
| **MuPDF parser** | ✅ | `page.get_text("blocks", sort=True)` → `ExtractedDoc` with normalized bbox and merged Markdown. Fast path for clean-text PDFs. |
| **OCR quality scorer** | ✅ | ModernBERT-large regression head (`HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`). Scores extracted text on a [0, 3] scale. |
| **Bench CLI** | ✅ | `python -m pdfsys_bench`: drives the full loop, emits per-doc JSONL + summary JSON. |
| Stage-B router | ❌ | Pending layout-analyser and LayoutCache integration. |
| Layout analyser | ❌ | PP-DocLayoutV3 / docling-layout-heron runner; not started. |
| Pipeline parser | ❌ | Region-level OCR (RapidOCR / PaddleOCR); not started. |
| VLM parser | ❌ | MinerU 2.5 / PaddleOCR-VL on complex regions; not started. |

### MVP benchmark results (OmniDocBench-100)

```
Backend split:  mupdf=70  pipeline=30
Avg ocr_prob:   mupdf=0.034  pipeline=0.634
Extracted: 70   Errors: 0
Quality: avg=1.71  min=0.39  max=2.73
Per-doc time: router=49ms  extract=7ms  quality=3.6s
```

## Workspace packages

| Package | Role | Dependencies |
|---------|------|--------------|
| `pdfsys-core` | Shared dataclasses, enums, layout cache, serde. No PDF/ML deps. | stdlib only |
| `pdfsys-router` | Stage-A XGBoost classifier + Stage-B layout decision (stub). | pymupdf, xgboost, pandas, numpy, scikit-learn |
| `pdfsys-layout-analyser` | Page layout model runner. Stub only. | (none) |
| `pdfsys-parser-mupdf` | Text-ok backend: PyMuPDF block extraction → Markdown. | pymupdf |
| `pdfsys-parser-pipeline` | OCR backend for simple layouts. Stub only. | (none) |
| `pdfsys-parser-vlm` | VLM backend for complex layouts. Stub only. | (none) |
| `pdfsys-bench` | Closed-loop evaluation harness + quality scorer. | torch, transformers, pdfsys-router, pdfsys-parser-mupdf |

### Package dependency graph

```
pdfsys-core ──► pdfsys-router
            ├─► pdfsys-parser-mupdf
            ├─► pdfsys-parser-pipeline (stub)
            ├─► pdfsys-parser-vlm (stub)
            └─► pdfsys-layout-analyser (stub)

pdfsys-router       ──► pdfsys-bench
pdfsys-parser-mupdf ──► pdfsys-bench
```

`pdfsys-core` is the root dependency: every other package imports it, and it has zero external deps beyond the Python stdlib.

## Key data structures

### Router output (`RouterDecision`)

```python
@dataclass
class RouterDecision:
    backend: Backend          # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float           # P(needs OCR) from XGBoost, [0, 1]
    num_pages: int
    is_form: bool
    garbled_text_ratio: float
    is_encrypted: bool
    needs_password: bool
    features: dict            # full 124-feature vector for debugging
    error: str | None
```

### Parser output (`ExtractedDoc`)

```python
@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]  # ordered block-level units
    markdown: str                  # segments merged with \n\n
    stats: dict
```

Each `Segment` carries `page_index`, `RegionType` (TEXT/IMAGE/TABLE/FORMULA), `content` (Markdown / HTML / LaTeX), and a normalized `BBox` in [0, 1].

### Quality score

```python
@dataclass
class QualityScore:
    score: float      # [0, 3]: 0=garbage, 1=format issues, 2=minor, 3=clean
    num_chars: int
    num_tokens: int
    model: str
```

## Design principles

1. **Stateless processing.** No manifest, no central DB. Every PDF produces self-contained output, following FinePDFs' datatrove-style design.
2. **Content-addressable caching.** LayoutCache shards by `sha256 + model_tag`. Bumping the model tag lazily invalidates old entries.
3. **Atomic writes.** All file outputs use `tmp + os.replace()` for crash safety.
4. **Normalized coordinates.** BBox is always `[0, 1]` with origin top-left; backends convert to pixels/points on demand.
5. **Backend-agnostic output.** All three parser backends emit the same `ExtractedDoc` / `Segment` schema, so downstream stages don't need to know which backend produced a document.
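Principle 3 (atomic writes) can be sketched as follows; the helper name is illustrative, not an actual pdfsys API:

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then os.replace() into place.

    os.replace() is an atomic rename on POSIX (same-volume), so readers never
    observe a half-written file even if the process dies mid-write.
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)  # atomic swap into the final name
    except BaseException:
        os.unlink(tmp)  # remove the orphaned temp file on failure
        raise

out = Path("demo.jsonl")
atomic_write_text(out, '{"ok": true}\n')
print(out.read_text())
```

Writing the temp file in the same directory as the target matters: `os.replace()` is only atomic within one filesystem.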

## CLI reference

### `python -m pdfsys_bench`

```
usage: pdfsys-bench [-h] --pdf-dir PDF_DIR --out OUT [--limit N]
                    [--no-quality] [--quality-model MODEL]
                    [--router-weights PATH] [--markdown-dir DIR]
                    [--ocr-threshold FLOAT]

Run the MVP pdfsys closed loop.

options:
  --pdf-dir PATH       Directory of PDFs to process (recursive).
  --out PATH           Output JSONL path (one line per PDF).
  --limit N            Cap the number of PDFs processed.
  --no-quality         Skip the ModernBERT quality scorer.
  --quality-model ID   HuggingFace model for quality scoring.
  --router-weights P   Path to xgb_classifier.ubj.
  --markdown-dir DIR   Dump per-PDF extracted markdown here.
  --ocr-threshold F    P(ocr) threshold (default: 0.5).
```

### `python -m pdfsys_router.download_weights`

Downloads the XGBoost router weights (~257 KB) from the FinePDFs Git LFS.

```bash
python -m pdfsys_router.download_weights          # first time
python -m pdfsys_router.download_weights --force  # re-download
```

## Output format

The JSONL output (`--out`) has one JSON object per PDF:

```json
{
  "pdf_path": "packages/pdfsys-bench/omnidocbench_100/pdfs/example.pdf",
  "sha256": "a53b50cb0d3d...",
  "backend": "mupdf",
  "ocr_prob": 0.003,
  "num_pages": 1,
  "is_form": false,
  "garbled_text_ratio": 0.0,
  "router_error": null,
  "extract_stats": {"page_count": 1, "pages_extracted": 1, "segment_count": 5, "char_count": 5734},
  "extract_error": null,
  "quality_score": 2.45,
  "quality_num_chars": 5734,
  "quality_num_tokens": 512,
  "quality_model": "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
  "markdown_chars": 5734,
  "wall_ms_router": 42.1,
  "wall_ms_extract": 6.3,
  "wall_ms_quality": 3421.0
}
```

A companion `.summary.json` file is also written with aggregate statistics.
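A minimal sketch for post-processing a run, assuming only the JSONL field names shown above (the `summarize` helper is illustrative, not part of pdfsys-bench):

```python
import json
from collections import Counter
from pathlib import Path

def summarize(jsonl_path: Path) -> dict:
    """Recompute a few aggregate stats from the per-doc JSONL records."""
    records = [json.loads(line) for line in jsonl_path.read_text().splitlines() if line]
    scores = [r["quality_score"] for r in records if r.get("quality_score") is not None]
    return {
        "backend_split": dict(Counter(r["backend"] for r in records)),
        "errors": sum(1 for r in records if r["extract_error"] or r["router_error"]),
        "avg_quality": sum(scores) / len(scores) if scores else None,
    }

# Two toy records in the documented shape:
demo = Path("demo_bench.jsonl")
demo.write_text(
    '{"backend": "mupdf", "quality_score": 2.0, "extract_error": null, "router_error": null}\n'
    '{"backend": "pipeline", "quality_score": null, "extract_error": null, "router_error": null}\n'
)
print(summarize(demo))
```

Records with `quality_score: null` (e.g. PDFs routed to the unimplemented `PIPELINE` backend) are excluded from the quality average.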

## Docs
packages/pdfsys-core/README.md (ADDED)
# pdfsys-core

Shared data contracts for the pdfsys pipeline. Every other package depends on this one.

## What's in here

- **Enums**: `RegionType` (TEXT / IMAGE / TABLE / FORMULA), `Backend` (MUPDF / PIPELINE / VLM / DEFERRED).
- **PdfRecord**: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance).
- **Layout schema**: `BBox` (normalized [0,1]), `LayoutRegion`, `LayoutPage`, `LayoutDocument`: the contract between layout-analyser and every parser backend.
- **ExtractedDoc / Segment**: Backend-agnostic output schema. All three parser backends emit these.
- **LayoutCache**: Content-addressable on-disk cache for LayoutDocuments, keyed by `sha256 + model_tag`.
- **PdfsysConfig**: Hierarchical configuration (paths, router, layout, per-backend settings, runtime).
- **Serde**: Generic `to_dict()` / `from_dict()` for all the above dataclasses.

## Key design rule

This package has **zero external dependencies**: stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries.
|
packages/pdfsys-layout-analyser/README.md
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# pdfsys-layout-analyser

Page layout model runner. **Stub only; not yet implemented.**

Will run a layout detection model (PP-DocLayoutV3 / docling-layout-heron) on each page and write a `LayoutDocument` to the `LayoutCache`. This layout is consumed by:

1. **pdfsys-router Stage B**: checks `has_complex_content` to decide pipeline vs VLM.
2. **pdfsys-parser-pipeline**: uses region bboxes to crop and OCR individual regions.
3. **pdfsys-parser-vlm**: sends complex regions to a vision-language model.

Layout inference runs at most once per PDF (keyed by sha256 + model_tag in the cache).
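The cache-key scheme can be sketched as follows; the shard layout and function name are assumptions for illustration, not the actual LayoutCache implementation:

```python
from pathlib import Path

def layout_cache_path(root: Path, sha256: str, model_tag: str) -> Path:
    """Content-addressable path keyed on (sha256, model_tag).

    Bumping model_tag changes every path, lazily invalidating old entries
    without deleting anything. The two-char shard prefix is an assumed
    layout to keep directory fan-out small.
    """
    return root / model_tag / sha256[:2] / f"{sha256}.json"

p = layout_cache_path(Path("cache"), "a53b50cb0d3d", "pp-doclayout-v3")
print(p.as_posix())  # cache/pp-doclayout-v3/a5/a53b50cb0d3d.json
```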
packages/pdfsys-parser-mupdf/README.md (ADDED)
# pdfsys-parser-mupdf

Text-ok extraction backend. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. `ocr_prob < threshold`).

## What it does

1. Opens the PDF with PyMuPDF.
2. Iterates every page, calling `page.get_text("blocks", sort=True)`.
3. Filters to text blocks (drops image blocks).
4. Normalizes each block's bbox to [0, 1] coordinates.
5. Produces one `Segment` per block, joined into an `ExtractedDoc` with merged Markdown.

## Usage

```python
from pdfsys_parser_mupdf import extract_doc

doc = extract_doc("path/to/clean.pdf")
print(doc.markdown[:500])
print(f"{doc.segment_count} segments, {doc.char_count} chars")
```

## Scope

This backend intentionally does NOT:

- Run OCR (that's what parser-pipeline and parser-vlm are for)
- Use a layout model (not needed for text-ok PDFs)
- Extract images or tables (image-heavy PDFs should be routed elsewhere)

It is the simplest possible extraction: unwrap PyMuPDF blocks into structured output.
packages/pdfsys-parser-pipeline/README.md (ADDED)
# pdfsys-parser-pipeline

Region-level OCR backend for scanned PDFs with simple layouts. **Stub only; not yet implemented.**

Will take a `LayoutDocument` from the cache, crop each region at the configured DPI, and run OCR (RapidOCR / PaddleOCR-classic) on each crop individually. Produces an `ExtractedDoc` following the same schema as parser-mupdf.
packages/pdfsys-parser-vlm/README.md (ADDED)
# pdfsys-parser-vlm

Vision-language model backend for scanned PDFs with complex content (tables, formulas). **Stub only; not yet implemented.**

Will handle regions flagged as TABLE or FORMULA by the layout analyser, sending them to a VLM (MinerU 2.5 / PaddleOCR-VL) that can produce structured output (HTML tables, LaTeX formulas). Simple text regions in the same document may still be handled by the pipeline backend.
packages/pdfsys-router/README.md (ADDED)
# pdfsys-router

Two-stage routing for the pdfsys extraction pipeline.

## Stage A (implemented)

XGBoost binary classifier ported from [FinePDFs](https://github.com/huggingface/finepdfs). Given a PDF, it extracts 124 features using PyMuPDF (4 doc-level + 15 page-level × 8 sampled pages) and predicts `P(needs OCR)`.

- `ocr_prob < threshold` → **MUPDF** (text-ok, fast path)
- `ocr_prob >= threshold` → **PIPELINE** (needs OCR)

### Usage

```python
from pdfsys_router import Router

router = Router()  # loads xgb_classifier.ubj lazily
decision = router.classify("path/to/document.pdf")
print(decision.backend, decision.ocr_prob)
```

### Weights

The XGBoost model (`models/xgb_classifier.ubj`, 257 KB) is gitignored. Fetch it once:

```bash
python -m pdfsys_router.download_weights
```

## Stage B (not yet implemented)

For PDFs routed to OCR, Stage B reads the cached `LayoutDocument` and decides:

- No complex content → `PIPELINE` (region-level OCR)
- Tables / formulas present → `VLM` (vision-language model)
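The Stage-B rule amounts to a one-line decision over the cached layout. The types and the `decide_stage_b` name here are illustrative stand-ins for the `decider.py` stub:

```python
from dataclasses import dataclass
from enum import Enum

class Backend(Enum):
    PIPELINE = "pipeline"
    VLM = "vlm"

@dataclass(frozen=True)
class LayoutDocument:
    # Illustrative stand-in; the real type lives in pdfsys-core.
    has_complex_content: bool  # any TABLE / FORMULA region detected

def decide_stage_b(layout: LayoutDocument) -> Backend:
    """Tables/formulas present -> VLM, otherwise region-level OCR."""
    return Backend.VLM if layout.has_complex_content else Backend.PIPELINE

print(decide_stage_b(LayoutDocument(has_complex_content=True)).value)   # vlm
print(decide_stage_b(LayoutDocument(has_complex_content=False)).value)  # pipeline
```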

## Module layout

| File | Purpose |
|------|---------|
| `feature_extractor.py` | Port of FinePDFs' `PDFFeatureExtractor`. DO NOT modify without retraining. |
| `xgb_model.py` | Lazy XGBoost model loader |
| `classifier.py` | `Router.classify()` → `RouterDecision` public API |
| `download_weights.py` | Fetches weights from FinePDFs Git LFS |
| `decider.py` | Stage-B stub |