feat(mvp): wire router → mupdf parser → OCR quality scorer closed loop
Ship the first end-to-end cut of the pdfsys pipeline on OmniDocBench-100:
* pdfsys-router: port FinePDFs PDFFeatureExtractor (15 page features × 8
sampled pages + 4 doc features = 124 columns) and load the upstream
xgb.ubj weights via a thin XgbRouterModel wrapper. Router.classify()
returns a RouterDecision with Backend {MUPDF, PIPELINE, VLM, DEFERRED},
ocr_prob, and the full feature dict for debugging. Seeded RNG keeps the
feature vector reproducible per PDF. Weights live under models/ and are
gitignored; download_weights.py fetches them from the huggingface/finepdfs
repo's GitHub LFS media URL.
* pdfsys-parser-mupdf: text-ok backend built on page.get_text("blocks",
sort=True). Emits one Segment per paragraph-shaped block, with bboxes
normalized to [0, 1], and merges the whole doc into an ExtractedDoc
with Markdown. No layout-analyser dependency by design.
* pdfsys-bench: add quality.py (ModernBERT-large regression head from
HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn, loaded in
bfloat16 with max_tokens=512 to fit a 4 GB RAM dev box), loop.py
(router → parser → scorer → JSONL runner), and a __main__ CLI; a minimal
wiring sketch follows this list.
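For orientation, here is a minimal sketch of what the loop does per document,
using only the APIs introduced in this change (run_loop wraps this same
sequence with timing and JSONL bookkeeping; the input directory is
illustrative):

```python
# Hedged sketch of the closed loop. Router, extract_doc and
# OcrQualityScorer are the APIs added in this commit; the input
# directory below is illustrative.
from pathlib import Path

from pdfsys_bench import OcrQualityScorer
from pdfsys_core import Backend
from pdfsys_parser_mupdf import extract_doc
from pdfsys_router import Router

router = Router(ocr_threshold=0.5)   # lazily loads models/xgb_classifier.ubj
scorer = OcrQualityScorer()          # ModernBERT head, loaded on first score()

for pdf in sorted(Path("packages/pdfsys-bench/omnidocbench_100/pdfs").glob("*.pdf")):
    decision = router.classify(pdf)  # never raises; failures land in decision.error
    if decision.backend is not Backend.MUPDF:
        continue                     # PIPELINE / VLM / DEFERRED: recorded, not extracted
    extracted = extract_doc(pdf)     # Segments + merged Markdown
    quality = scorer.score(extracted.markdown)  # float clamped to [0, 3]
    print(pdf.name, f"ocr_prob={decision.ocr_prob:.3f}", f"quality={quality.score:.2f}")
```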
End-to-end run on the full 100-doc OmniDocBench subset:
* 70 routed to MUPDF (avg ocr_prob 0.034), 30 routed to PIPELINE
(avg ocr_prob 0.634)
* 70 extracted + quality scored, 0 errors
* avg quality 1.71, wall clock 259 s
* per-doc: router 49 ms, extract 7 ms, quality 3.6 s
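The per-doc figures are consistent with the wall clock: quality scoring
dominates at 70 × 3.6 s ≈ 252 s, while routing (100 × 49 ms ≈ 5 s) and
extraction (70 × 7 ms ≈ 0.5 s) account for nearly all of the remaining
time in the 259 s total.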
Stage-B (LayoutCache-driven pipeline-vs-vlm decision) and the PIPELINE
and VLM parser backends are out of scope for this MVP.
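To reproduce the smoke run, the new CLI is the entry point (invocation as
documented in the __main__ docstring; the output path is illustrative):

```bash
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl
```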
- .gitignore +5 -0
- packages/pdfsys-bench/README.md +88 -0
- packages/pdfsys-bench/pyproject.toml +6 -0
- packages/pdfsys-bench/src/pdfsys_bench/__init__.py +18 -3
- packages/pdfsys-bench/src/pdfsys_bench/__main__.py +98 -0
- packages/pdfsys-bench/src/pdfsys_bench/loop.py +200 -0
- packages/pdfsys-bench/src/pdfsys_bench/quality.py +148 -0
- packages/pdfsys-parser-mupdf/pyproject.toml +1 -0
- packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py +8 -2
- packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py +181 -1
- packages/pdfsys-router/models/.gitignore +4 -0
- packages/pdfsys-router/models/README.md +16 -0
- packages/pdfsys-router/pyproject.toml +5 -0
- packages/pdfsys-router/src/pdfsys_router/__init__.py +20 -2
- packages/pdfsys-router/src/pdfsys_router/classifier.py +199 -2
- packages/pdfsys-router/src/pdfsys_router/download_weights.py +52 -0
- packages/pdfsys-router/src/pdfsys_router/feature_extractor.py +484 -0
- packages/pdfsys-router/src/pdfsys_router/xgb_model.py +66 -0
--- a/.gitignore
+++ b/.gitignore
@@ -16,11 +16,16 @@ uv.lock
 # local pipeline scratch
 work/
 output/
+out/
 .cache/
 samples/
 bench_data/
 *.layout.json
 
+# bench datasets — large binary corpora, distributed out of band
+packages/pdfsys-bench/omnidocbench_100/
+packages/pdfsys-bench/olmocr_bench_50/
+
 # models / weights (too big for git)
 models/
 *.onnx
--- /dev/null
+++ b/packages/pdfsys-bench/README.md
@@ -0,0 +1,88 @@
+# bench/ — PDF processing pipeline evaluation set
+
+This directory is the **canonical test set** for evaluating the end-to-end PDF
+processing pipeline (layout → OCR → markdown / structured text). It bundles
+two complementary, pre-sampled subsets so that runs are reproducible and
+cheap to iterate on.
+
+| Subset | PDFs | Source benchmark | Focus |
+|---|---:|---|---|
+| [`olmocr_bench_50/`](./olmocr_bench_50) | 50 | [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Fine-grained unit tests on text presence / absence, reading order, tables, math |
+| [`omnidocbench_100/`](./omnidocbench_100) | 100 | [OmniDocBench](https://github.com/opendatalab/OmniDocBench) | Holistic document-level eval with layout / language / special-issue coverage |
+
+Total footprint: ~108 MB, 150 PDFs.
+
+## Subset details
+
+### `olmocr_bench_50/`
+Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script
+`scripts/sample_olmocr_subset.py` (seed `20260411`). Covers all 7 document
+sources with a minimum floor of 3 PDFs per category plus largest-remainder
+proportional allocation, and diversifies by source document inside each
+category (at most one page per arXiv paper / scan ID before any repeat).
+
+```
+olmocr_bench_50/
+├── pdfs/
+│   ├── arxiv_math/       (14)
+│   ├── headers_footers/  (8)
+│   ├── long_tiny_text/   (4)
+│   ├── multi_column/     (8)
+│   ├── old_scans/        (5)
+│   ├── old_scans_math/   (4)
+│   └── tables/           (7)
+├── subset_tests.jsonl    # 283 olmOCR-bench unit tests for these 50 PDFs
+└── subset_manifest.json  # seed, quotas, selected file list, source bench_dir
+```
+
+The `subset_tests.jsonl` file is a filtered copy of the original per-category
+`*.jsonl` test files merged into one; each row keeps the exact schema used by
+the upstream olmOCR-bench evaluator (`pdf`, `type`, `max_diffs`, `checked`,
+and type-specific fields like `math`, `cell`, `before`/`after`, …).
+
+Regenerate or resize:
+```bash
+python3 scripts/sample_olmocr_subset.py --target 50              # default → bench/olmocr_bench_50
+python3 scripts/sample_olmocr_subset.py --target 100 --seed 42   # alt subset
+python3 scripts/sample_olmocr_subset.py --dry-run                # plan only
+```
+
+### `omnidocbench_100/`
+Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage
+across every categorical axis in the upstream dataset.
+
+```
+omnidocbench_100/
+├── pdfs/                  # 100 single-page PDFs
+├── img/                   # matching rendered JPGs (1 per PDF)
+├── subset_100.json        # full OmniDocBench annotations for the 100 samples
+├── subset_100_stats.json  # coverage & distribution stats vs. full 981-doc set
+├── subset_100_pdfs.txt    # flat list of selected PDF filenames
+└── subset_100_images.txt  # flat list of selected image filenames
+```
+
+Coverage (from `subset_100_stats.json`) — every bucket of every axis is hit:
+- **data_source** 9/9 · **language** 3/3 · **layout** 5/5
+- **special_issue** 13/13 · **stratum** 67/67
+
+## Using the bench
+
+These two subsets are intended to be run as a pair — olmOCR-bench gives you
+sharp per-feature pass/fail signals and OmniDocBench gives you an aggregate
+quality score across real-world document types. For each new pipeline
+version, run both subsets, record per-subset metrics, and diff against the
+previous run.
+
+Common entry points (to be wired up by the pipeline evaluator):
+
+```text
+bench/olmocr_bench_50/pdfs/**/*.pdf          # inputs
+bench/olmocr_bench_50/subset_tests.jsonl     # ground truth unit tests
+
+bench/omnidocbench_100/pdfs/*.pdf            # inputs
+bench/omnidocbench_100/subset_100.json       # ground truth annotations
+```
+
+Do **not** manually edit files under `bench/`. Regenerate with the sampling
+script (for olmocr) or re-export from the upstream builder (for omnidoc) so
+results stay reproducible.
--- a/packages/pdfsys-bench/pyproject.toml
+++ b/packages/pdfsys-bench/pyproject.toml
@@ -9,10 +9,16 @@ description = "Cross-backend benchmarking — throughput, latency, and F1 on a s
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pdfsys-router",
+    "pdfsys-parser-mupdf",
+    "torch>=2.1",
+    "transformers>=4.44",
 ]
 
 [tool.uv.sources]
 pdfsys-core = { workspace = true }
+pdfsys-router = { workspace = true }
+pdfsys-parser-mupdf = { workspace = true }
 
 [tool.hatch.build.targets.wheel]
 packages = ["src/pdfsys_bench"]
--- a/packages/pdfsys-bench/src/pdfsys_bench/__init__.py
+++ b/packages/pdfsys-bench/src/pdfsys_bench/__init__.py
@@ -1,7 +1,22 @@
-"""pdfsys-bench — evaluation harness.
+"""pdfsys-bench — evaluation harness and MVP closed-loop runner.
 
-Runs
-
+Runs a PDF directory through router → parser → OCR-quality scorer and
+writes one JSONL row per PDF. This is the minimal end-to-end harness; a
+richer benchmark (throughput, F1 against gold Markdown, cross-backend
+comparison) will layer on top of it.
 """
 
+from __future__ import annotations
+
+from .loop import LoopResult, run_loop
+from .quality import OcrQualityScorer, QualityScore
+
 __version__ = "0.0.1"
+
+__all__ = [
+    "__version__",
+    "LoopResult",
+    "run_loop",
+    "OcrQualityScorer",
+    "QualityScore",
+]
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/__main__.py
@@ -0,0 +1,98 @@
+"""pdfsys-bench CLI — run the MVP closed loop on a directory of PDFs.
+
+Usage::
+
+    python -m pdfsys_bench \\
+        --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \\
+        --out out/bench_omnidoc100.jsonl \\
+        --limit 20
+
+Flags exposed here are intentionally minimal — anything more is the job
+of a proper runner package. This CLI is meant for smoke-testing.
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+from .loop import run_loop
+
+
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(prog="pdfsys-bench", description="Run the MVP pdfsys closed loop.")
+    p.add_argument(
+        "--pdf-dir",
+        type=Path,
+        required=True,
+        help="Directory of PDFs to process (recursive).",
+    )
+    p.add_argument(
+        "--out",
+        type=Path,
+        required=True,
+        help="Output JSONL path (one line per PDF).",
+    )
+    p.add_argument(
+        "--limit",
+        type=int,
+        default=None,
+        help="Cap the number of PDFs processed. Default: no cap.",
+    )
+    p.add_argument(
+        "--no-quality",
+        action="store_true",
+        help="Skip the ModernBERT quality scorer (fast smoke test).",
+    )
+    p.add_argument(
+        "--quality-model",
+        default="HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+        help="HuggingFace repo id for the quality scorer.",
+    )
+    p.add_argument(
+        "--router-weights",
+        type=Path,
+        default=None,
+        help="Path to xgb_classifier.ubj. Defaults to the package's bundled path.",
+    )
+    p.add_argument(
+        "--markdown-dir",
+        type=Path,
+        default=None,
+        help="Optional directory to dump per-PDF extracted markdown.",
+    )
+    p.add_argument(
+        "--ocr-threshold",
+        type=float,
+        default=0.5,
+        help="P(ocr) threshold above which a PDF is routed off the text-ok path.",
+    )
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = build_parser().parse_args(argv)
+    summary = run_loop(
+        pdf_dir=args.pdf_dir,
+        out_path=args.out,
+        limit=args.limit,
+        score_quality=not args.no_quality,
+        router_weights=args.router_weights,
+        quality_model=args.quality_model,
+        markdown_dir=args.markdown_dir,
+        ocr_threshold=args.ocr_threshold,
+    )
+
+    print(f"[pdfsys-bench] processed {summary['num_pdfs']} PDFs in {summary['wall_seconds']:.1f}s")
+    print(f"[pdfsys-bench] by_backend: {summary['by_backend']}")
+    print(f"[pdfsys-bench] extracted={summary['num_extracted']} scored={summary['num_scored']} errors={summary['num_errors']}")
+    if summary.get("avg_quality") is not None:
+        print(f"[pdfsys-bench] avg_quality={summary['avg_quality']:.3f}")
+    print(f"[pdfsys-bench] jsonl: {summary['out_path']}")
+    print(f"[pdfsys-bench] summary: {summary['summary_path']}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/loop.py
@@ -0,0 +1,200 @@
+"""MVP closed-loop runner: router → parser → quality scorer → JSONL.
+
+This is the tiniest possible end-to-end harness for the pdfsys pipeline.
+Given a directory of PDFs, it:
+
+1. runs :class:`pdfsys_router.Router` to pick a backend per document;
+2. for PDFs routed to ``Backend.MUPDF``, runs :func:`pdfsys_parser_mupdf.extract_doc`
+   to produce an :class:`pdfsys_core.ExtractedDoc`;
+3. scores the resulting Markdown with :class:`pdfsys_bench.OcrQualityScorer`
+   (the ModernBERT-large regression head from FinePDFs);
+4. writes one JSON line per PDF to an output file with routing decision,
+   extraction stats, and quality score.
+
+PDFs routed to ``PIPELINE`` / ``VLM`` / ``DEFERRED`` are recorded with
+their routing decision but skipped for extraction — those backends are
+not implemented yet in this MVP.
+"""
+
+from __future__ import annotations
+
+import json
+import time
+from dataclasses import asdict, dataclass, field
+from pathlib import Path
+from typing import Any, Iterable
+
+from pdfsys_core import Backend
+from pdfsys_parser_mupdf import extract_doc
+from pdfsys_router import Router
+
+from .quality import OcrQualityScorer, QualityScore
+
+
+@dataclass(slots=True)
+class LoopResult:
+    """Per-PDF result row, serialized to JSONL."""
+
+    pdf_path: str
+    sha256: str | None
+    backend: str
+    ocr_prob: float
+    num_pages: int
+    is_form: bool
+    garbled_text_ratio: float
+    router_error: str | None
+    extract_stats: dict[str, Any] = field(default_factory=dict)
+    extract_error: str | None = None
+    quality_score: float | None = None
+    quality_num_chars: int | None = None
+    quality_num_tokens: int | None = None
+    quality_model: str | None = None
+    markdown_chars: int = 0
+    wall_ms_router: float = 0.0
+    wall_ms_extract: float = 0.0
+    wall_ms_quality: float = 0.0
+
+    def to_json_line(self) -> str:
+        return json.dumps(asdict(self), ensure_ascii=False)
+
+
+def _iter_pdfs(root: Path, limit: int | None) -> Iterable[Path]:
+    pdfs = sorted(p for p in root.rglob("*.pdf") if p.is_file())
+    if limit is not None:
+        pdfs = pdfs[:limit]
+    yield from pdfs
+
+
+def run_loop(
+    pdf_dir: str | Path,
+    out_path: str | Path,
+    *,
+    limit: int | None = None,
+    score_quality: bool = True,
+    router_weights: str | Path | None = None,
+    quality_model: str = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+    markdown_dir: str | Path | None = None,
+    ocr_threshold: float = 0.5,
+) -> dict[str, Any]:
+    """Drive the full MVP loop over a PDF directory.
+
+    Returns an aggregate summary dict. Individual result rows are written
+    to ``out_path`` as JSONL (one line per PDF, in input-order).
+    """
+    pdf_dir = Path(pdf_dir)
+    out_path = Path(out_path)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+
+    router = Router(model_path=router_weights, ocr_threshold=ocr_threshold)
+    scorer = OcrQualityScorer(model_name=quality_model) if score_quality else None
+
+    md_root = Path(markdown_dir) if markdown_dir else None
+    if md_root is not None:
+        md_root.mkdir(parents=True, exist_ok=True)
+
+    summary: dict[str, Any] = {
+        "pdf_dir": str(pdf_dir),
+        "out_path": str(out_path),
+        "num_pdfs": 0,
+        "by_backend": {},
+        "num_extracted": 0,
+        "num_scored": 0,
+        "num_errors": 0,
+        "sum_quality": 0.0,
+        "started_at": time.time(),
+    }
+
+    with out_path.open("w", encoding="utf-8") as out_f:
+        for pdf_path in _iter_pdfs(pdf_dir, limit):
+            row = _run_one(
+                pdf_path=pdf_path,
+                router=router,
+                scorer=scorer,
+                md_root=md_root,
+            )
+            out_f.write(row.to_json_line() + "\n")
+            out_f.flush()
+
+            summary["num_pdfs"] += 1
+            by_b = summary["by_backend"]
+            by_b[row.backend] = by_b.get(row.backend, 0) + 1
+            if row.extract_error is None and row.backend == Backend.MUPDF.value:
+                summary["num_extracted"] += 1
+            if row.quality_score is not None:
+                summary["num_scored"] += 1
+                summary["sum_quality"] += row.quality_score
+            if row.router_error or row.extract_error:
+                summary["num_errors"] += 1
+
+    summary["finished_at"] = time.time()
+    summary["wall_seconds"] = summary["finished_at"] - summary["started_at"]
+    summary["avg_quality"] = (
+        summary["sum_quality"] / summary["num_scored"] if summary["num_scored"] else None
+    )
+
+    summary_path = out_path.with_suffix(".summary.json")
+    summary_path.write_text(json.dumps(summary, indent=2, ensure_ascii=False))
+    summary["summary_path"] = str(summary_path)
+
+    return summary
+
+
+def _run_one(
+    *,
+    pdf_path: Path,
+    router: Router,
+    scorer: OcrQualityScorer | None,
+    md_root: Path | None,
+) -> LoopResult:
+    # -- Stage-A routing ------------------------------------------------------
+    t0 = time.perf_counter()
+    decision = router.classify(pdf_path)
+    t1 = time.perf_counter()
+
+    row = LoopResult(
+        pdf_path=str(pdf_path),
+        sha256=None,
+        backend=decision.backend.value,
+        ocr_prob=decision.ocr_prob,
+        num_pages=decision.num_pages,
+        is_form=decision.is_form,
+        garbled_text_ratio=decision.garbled_text_ratio,
+        router_error=decision.error,
+        wall_ms_router=(t1 - t0) * 1000.0,
+    )
+
+    # -- MVP only extracts the text-ok fast path ------------------------------
+    if decision.backend != Backend.MUPDF:
+        return row
+
+    try:
+        t2 = time.perf_counter()
+        extracted = extract_doc(pdf_path)
+        t3 = time.perf_counter()
+        row.sha256 = extracted.sha256
+        row.extract_stats = dict(extracted.stats)
+        row.markdown_chars = extracted.char_count
+        row.wall_ms_extract = (t3 - t2) * 1000.0
+    except Exception as e:  # noqa: BLE001
+        row.extract_error = f"extract_failed: {e}"
+        return row
+
+    if md_root is not None and extracted.markdown:
+        md_path = md_root / f"{extracted.sha256}.md"
+        md_path.write_text(extracted.markdown, encoding="utf-8")
+
+    # -- Quality scoring ------------------------------------------------------
+    if scorer is not None and extracted.markdown:
+        try:
+            t4 = time.perf_counter()
+            q: QualityScore = scorer.score(extracted.markdown)
+            t5 = time.perf_counter()
+            row.quality_score = q.score
+            row.quality_num_chars = q.num_chars
+            row.quality_num_tokens = q.num_tokens
+            row.quality_model = q.model
+            row.wall_ms_quality = (t5 - t4) * 1000.0
+        except Exception as e:  # noqa: BLE001
+            row.extract_error = f"quality_failed: {e}"
+
+    return row
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/quality.py
@@ -0,0 +1,148 @@
+"""OCR quality scorer backed by the FinePDFs ModernBERT classifier.
+
+Wraps ``HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`` — a
+single-head regression fine-tune of ModernBERT-large (~0.4 B params)
+that emits a float in ``[0, 3]`` where:
+
+* 0 → garbage / unreadable OCR
+* 1 → formatting issues but mostly readable
+* 2 → minor problems
+* 3 → clean text
+
+The scorer takes raw extracted text (Markdown or plain), truncates to at
+most ``max_chars`` characters before tokenization, tokenizes with the
+model's own tokenizer, runs one forward pass, and returns the scalar.
+
+Heavy dependencies (``torch`` + ``transformers``) are imported lazily so
+that merely importing :mod:`pdfsys_bench` does not pull them in.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+DEFAULT_MODEL = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn"
+DEFAULT_MAX_CHARS = 10_000
+# Upstream FinePDFs uses max_tokens=2048, but ModernBERT-large activations
+# at that length need ≈ 3 GB of RAM — too much for a 4 GB dev box. 512
+# tokens is enough to give a stable quality signal in practice and keeps
+# peak memory well under a gig.
+DEFAULT_MAX_TOKENS = 512
+
+
+@dataclass(slots=True)
+class QualityScore:
+    """Result of scoring one document."""
+
+    score: float
+    num_chars: int
+    num_tokens: int
+    model: str
+
+    def as_record(self) -> dict[str, Any]:
+        return {
+            "quality_score": self.score,
+            "quality_num_chars": self.num_chars,
+            "quality_num_tokens": self.num_tokens,
+            "quality_model": self.model,
+        }
+
+
+class OcrQualityScorer:
+    """Lazy ModernBERT regression scorer. Re-uses model/tokenizer across calls."""
+
+    def __init__(
+        self,
+        model_name: str = DEFAULT_MODEL,
+        max_chars: int = DEFAULT_MAX_CHARS,
+        max_tokens: int = DEFAULT_MAX_TOKENS,
+        device: str | None = None,
+        dtype: str = "bfloat16",
+    ) -> None:
+        self.model_name = model_name
+        self.max_chars = max_chars
+        self.max_tokens = max_tokens
+        self._device_name = device
+        self.dtype_name = dtype
+        self._tokenizer: Any = None
+        self._model: Any = None
+        self._torch: Any = None
+        self._device: Any = None
+
+    def _ensure_loaded(self) -> None:
+        if self._model is not None:
+            return
+        import torch  # noqa: PLC0415 — lazy import is intentional
+        from transformers import AutoModelForSequenceClassification, AutoTokenizer  # noqa: PLC0415
+
+        self._torch = torch
+        self._device = torch.device(
+            self._device_name
+            or ("cuda" if torch.cuda.is_available() else "cpu")
+        )
+        # Use bfloat16 on CPU to halve the model's memory footprint —
+        # ModernBERT-large is ~0.4 B params, so fp32 weights alone take
+        # ~1.6 GB and OOM a 4 GB-RAM dev box. bf16 inference is
+        # numerically stable enough for a regression head like this.
+        torch_dtype = getattr(torch, self.dtype_name, torch.float32)
+
+        self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+        # ``dtype`` is the transformers≥5 name; ``torch_dtype`` was the
+        # transformers<5 name. Pass ``dtype`` and fall back for older releases.
+        try:
+            model = AutoModelForSequenceClassification.from_pretrained(
+                self.model_name,
+                dtype=torch_dtype,
+            )
+        except TypeError:
+            model = AutoModelForSequenceClassification.from_pretrained(
+                self.model_name,
+                torch_dtype=torch_dtype,
+            )
+        model.eval()
+        model.to(self._device)
+        self._model = model
+
+    def score(self, text: str) -> QualityScore:
+        """Score a single document. Empty input returns 0.0."""
+        if not text or not text.strip():
+            return QualityScore(
+                score=0.0, num_chars=0, num_tokens=0, model=self.model_name
+            )
+
+        self._ensure_loaded()
+        assert self._tokenizer is not None and self._model is not None
+        torch = self._torch
+
+        clipped = text[: self.max_chars]
+        enc = self._tokenizer(
+            clipped,
+            return_tensors="pt",
+            truncation=True,
+            max_length=self.max_tokens,
+        )
+        num_tokens = int(enc["input_ids"].shape[1])
+        enc = {k: v.to(self._device) for k, v in enc.items()}
+
+        with torch.inference_mode():
+            out = self._model(**enc)
+            logits = out.logits  # shape [1, 1] for regression
+        raw = float(logits.squeeze().item())
+        # Drop the forward-pass tensors eagerly so large-seq runs on CPU
+        # don't hold onto activations between calls.
+        del enc, out, logits
+
+        # Clamp to the documented [0, 3] range.
+        clamped = max(0.0, min(3.0, raw))
+
+        return QualityScore(
+            score=clamped,
+            num_chars=len(clipped),
+            num_tokens=num_tokens,
+            model=self.model_name,
+        )
+
+    def score_many(self, texts: list[str]) -> list[QualityScore]:
+        """Serial scoring — tiny MVP harness, not a batched hot path."""
+        return [self.score(t) for t in texts]
--- a/packages/pdfsys-parser-mupdf/pyproject.toml
+++ b/packages/pdfsys-parser-mupdf/pyproject.toml
@@ -9,6 +9,7 @@ description = "Text-ok backend: PyMuPDF extraction + reading order + Markdown em
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pymupdf>=1.24",
 ]
 
 [tool.uv.sources]
--- a/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py
+++ b/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py
@@ -1,8 +1,14 @@
 """pdfsys-parser-mupdf — text-ok extraction backend.
 
 Consumes PDFs classified as text-ok by pdfsys-router. Uses PyMuPDF for
-block extraction
-Does NOT depend on pdfsys-layout-analyser.
+block extraction (``page.get_text("blocks", sort=True)``) and emits
+Markdown. Does NOT depend on pdfsys-layout-analyser.
 """
 
+from __future__ import annotations
+
+from .extract import extract_doc, extract_doc_bytes
+
 __version__ = "0.0.1"
+
+__all__ = ["__version__", "extract_doc", "extract_doc_bytes"]
--- a/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py
+++ b/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py
@@ -1 +1,181 @@
-"""PyMuPDF extraction
+"""PyMuPDF-based text extraction for the mupdf (text-ok) backend.
+
+This is the simplest of the three parser backends. It assumes the PDF
+already has a clean text layer and just needs unwrapping into Markdown —
+which is why the router routes here only when the XGBoost classifier says
+``ocr_prob < threshold``.
+
+We use ``page.get_text("blocks")`` which returns paragraph-shaped blocks
+with coordinates already in reading order (PyMuPDF's internal sorting).
+Each block becomes one :class:`pdfsys_core.Segment` of type
+:attr:`pdfsys_core.RegionType.TEXT`, with its bbox normalized to ``[0, 1]``.
+Empty and image-only blocks are dropped.
+
+No layout-model dependency, no GPU, no OCR — this is the text-ok fast
+path, and stays that way.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import io
+from pathlib import Path
+from typing import Any
+
+import pymupdf
+
+from pdfsys_core import (
+    Backend,
+    BBox,
+    ExtractedDoc,
+    RegionType,
+    Segment,
+    merge_segments_to_markdown,
+)
+
+
+# PyMuPDF block tuple layout: (x0, y0, x1, y1, text, block_no, block_type).
+# block_type 0 = text, 1 = image.
+_TEXT_BLOCK_TYPE = 0
+
+
+def _sha256_of_file(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def _sha256_of_bytes(data: bytes) -> str:
+    return hashlib.sha256(data).hexdigest()
+
+
+def _normalize_text(text: str) -> str:
+    """Trim trailing whitespace and collapse PyMuPDF's soft linebreaks.
+
+    PyMuPDF returns block text with intra-paragraph newlines. For Markdown
+    emission we keep paragraphs on one line; actual paragraph breaks come
+    from the block boundaries themselves.
+    """
+    if not text:
+        return ""
+    # Strip and replace single newlines with spaces while preserving
+    # double-newlines (rare, but occasionally emitted for list items).
+    paragraphs = [p.strip() for p in text.split("\n\n")]
+    joined = "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
+    return joined.strip()
+
+
+def _block_bbox(
+    block: tuple[Any, ...],
+    page_width_pt: float,
+    page_height_pt: float,
+) -> BBox | None:
+    """Normalize a PyMuPDF block bbox to ``[0, 1]`` or return None on overflow."""
+    x0, y0, x1, y1 = block[0], block[1], block[2], block[3]
+    if page_width_pt <= 0 or page_height_pt <= 0:
+        return None
+
+    def clamp(v: float) -> float:
+        if v < 0.0:
+            return 0.0
+        if v > 1.0:
+            return 1.0
+        return v
+
+    nx0 = clamp(x0 / page_width_pt)
+    ny0 = clamp(y0 / page_height_pt)
+    nx1 = clamp(x1 / page_width_pt)
+    ny1 = clamp(y1 / page_height_pt)
+    if nx1 <= nx0 or ny1 <= ny0:
+        return None
+    try:
+        return BBox(x0=nx0, y0=ny0, x1=nx1, y1=ny1)
+    except ValueError:
+        return None
+
+
+def extract_doc(pdf_path: str | Path) -> ExtractedDoc:
+    """Run the mupdf backend on a single PDF file and return its ExtractedDoc."""
+    path = Path(pdf_path)
+    sha256 = _sha256_of_file(path)
+    doc = pymupdf.open(str(path))
+    try:
+        return _extract(doc, sha256)
+    finally:
+        doc.close()
+
+
+def extract_doc_bytes(pdf_bytes: bytes, sha256: str | None = None) -> ExtractedDoc:
+    """Run the mupdf backend on an in-memory PDF buffer."""
+    sha = sha256 or _sha256_of_bytes(pdf_bytes)
+    doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+    try:
+        return _extract(doc, sha)
+    finally:
+        doc.close()
+
+
+def _extract(doc: pymupdf.Document, sha256: str) -> ExtractedDoc:
+    segments: list[Segment] = []
+    pages_extracted = 0
+    pages_skipped = 0
+
+    for page_index, page in enumerate(doc):
+        page_width_pt = float(page.rect.width)
+        page_height_pt = float(page.rect.height)
+
+        try:
+            blocks = page.get_text(
+                "blocks",
+                flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+                sort=True,
+            )
+        except Exception:
+            pages_skipped += 1
+            continue
+
+        pages_extracted += 1
+        for block in blocks:
+            # block tuple: (x0, y0, x1, y1, text, block_no, block_type)
+            if len(block) < 7:
+                continue
+            if block[6] != _TEXT_BLOCK_TYPE:
+                # image block — mupdf backend doesn't emit IMAGE segments by
+                # design; image-heavy PDFs should have been routed elsewhere.
+                continue
+            text = _normalize_text(block[4] or "")
+            if not text:
+                continue
+            bbox = _block_bbox(block, page_width_pt, page_height_pt)
+            segments.append(
+                Segment(
+                    index=len(segments),
+                    backend=Backend.MUPDF,
+                    page_index=page_index,
+                    type=RegionType.TEXT,
+                    content=text,
+                    bbox=bbox,
+                    source_region_id=None,
+                )
+            )
+
+    seg_tuple = tuple(segments)
+    markdown = merge_segments_to_markdown(seg_tuple)
+
+    stats: dict[str, Any] = {
+        "page_count": len(doc),
+        "pages_extracted": pages_extracted,
+        "pages_skipped": pages_skipped,
+        "segment_count": len(seg_tuple),
+        "char_count": len(markdown),
+    }
+
+    return ExtractedDoc(
+        sha256=sha256,
+        backend=Backend.MUPDF,
+        segments=seg_tuple,
+        markdown=markdown,
+        stats=stats,
+    )
--- /dev/null
+++ b/packages/pdfsys-router/models/.gitignore
@@ -0,0 +1,4 @@
+# External model weights downloaded by pdfsys_router.download_weights.
+# The xgb_classifier.ubj file is FinePDFs IP and should not be
+# committed. Run `python -m pdfsys_router.download_weights` to fetch it.
+xgb_classifier.ubj
--- /dev/null
+++ b/packages/pdfsys-router/models/README.md
@@ -0,0 +1,16 @@
+# Router model weights
+
+This directory is where the Stage-A XGBoost classifier weights live on disk.
+
+The file `xgb_classifier.ubj` (≈ 257 KB) is **not committed** — it's the
+ported FinePDFs binary classifier weights, owned by HuggingFace. Fetch it
+once with:
+
+```bash
+python -m pdfsys_router.download_weights
+```
+
+The downloader pulls from
+`media.githubusercontent.com/media/huggingface/finepdfs/main/blocks/predictor/xgb.ubj`,
+which is the actual Git-LFS payload (not the pointer file that plain
+`raw.githubusercontent.com` would return).
--- a/packages/pdfsys-router/pyproject.toml
+++ b/packages/pdfsys-router/pyproject.toml
@@ -9,6 +9,11 @@ description = "Stage-1 classifier: decides text-ok vs needs-ocr; consults Layout
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pymupdf>=1.24",
+    "xgboost>=2.0",
+    "scikit-learn>=1.3",
+    "pandas>=2.0",
+    "numpy>=1.26",
 ]
 
 [tool.uv.sources]
--- a/packages/pdfsys-router/src/pdfsys_router/__init__.py
+++ b/packages/pdfsys-router/src/pdfsys_router/__init__.py
@@ -1,9 +1,27 @@
 """pdfsys-router — two-stage routing for the pdfsys extraction pipeline.
 
-Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features
+Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features, using
+a ported FinePDFs XGBoost classifier over 124 hand-crafted features.
+
 Stage B (uses layout cache): for needs-ocr, read the LayoutDocument written
 by pdfsys-layout-analyser and decide pipeline vs vlm based on whether
-complex regions (tables / formulas) exist.
+complex regions (tables / formulas) exist. Stage B is not in the MVP.
 """
 
+from __future__ import annotations
+
+from .classifier import Router, RouterDecision
+from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+from .xgb_model import XgbRouterModel, default_weights_path
+
 __version__ = "0.0.1"
+
+__all__ = [
+    "__version__",
+    "Router",
+    "RouterDecision",
+    "PDFFeatureExtractor",
+    "flatten_per_page_features",
+    "XgbRouterModel",
+    "default_weights_path",
+]
--- a/packages/pdfsys-router/src/pdfsys_router/classifier.py
+++ b/packages/pdfsys-router/src/pdfsys_router/classifier.py
@@ -1,4 +1,201 @@
-"""Stage-A classifier: text-ok vs needs-ocr.
+"""Stage-A classifier: decides text-ok (MUPDF) vs needs-ocr (PIPELINE/VLM).
 
-
+This is the single public entry point of the router for the MVP. Stage-B
+(layout-cache driven pipeline-vs-vlm decision) will be added later; for
+now, anything that needs OCR is routed to ``Backend.PIPELINE`` unless the
+configured policy says otherwise.
+
+The classifier is deliberately stateless. It loads the XGBoost model once
+(lazily) and then exposes ``classify(pdf_path) -> RouterDecision``. No
+caching, no I/O side effects — pure in, pure out.
 """
+
+from __future__ import annotations
+
+import random
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pymupdf
+
+from pdfsys_core import Backend, RouterConfig
+
+from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+from .xgb_model import XgbRouterModel, default_weights_path
+
+
+@dataclass(slots=True)
+class RouterDecision:
+    """Result of running the Stage-A classifier on a single PDF."""
+
+    backend: Backend
+    ocr_prob: float
+    num_pages: int
+    is_form: bool
+    garbled_text_ratio: float
+    is_encrypted: bool
+    needs_password: bool
+    features: dict[str, Any] = field(default_factory=dict)
+    error: str | None = None
+
+    def as_record(self) -> dict[str, Any]:
+        """Flat dict for JSONL emission."""
+        return {
+            "backend": self.backend.value,
+            "ocr_prob": self.ocr_prob,
+            "num_pages": self.num_pages,
+            "is_form": bool(self.is_form),
+            "garbled_text_ratio": float(self.garbled_text_ratio),
+            "is_encrypted": bool(self.is_encrypted),
+            "needs_password": bool(self.needs_password),
+            "error": self.error,
+        }
+
+
+class Router:
+    """Stage-A router: PyMuPDF features → XGBoost → Backend."""
+
+    def __init__(
+        self,
+        config: RouterConfig | None = None,
+        model_path: str | Path | None = None,
+        num_pages_to_sample: int = 8,
+        ocr_threshold: float = 0.5,
+        seed: int = 42,
+    ) -> None:
+        self.config = config or RouterConfig()
+        self.num_pages_to_sample = num_pages_to_sample
+        self.ocr_threshold = ocr_threshold
+        self.seed = seed
+        self._extractor = PDFFeatureExtractor(
+            num_chunks=1, num_pages_to_sample=num_pages_to_sample
+        )
+        self._model = XgbRouterModel(model_path or default_weights_path())
+
+    # ------------------------------------------------------------------ api
+
+    def classify(self, pdf_path: str | Path) -> RouterDecision:
+        """Classify a PDF file. Never raises — errors are in ``decision.error``."""
+        path = Path(pdf_path)
+        try:
+            doc = pymupdf.open(str(path))
+        except Exception as e:  # noqa: BLE001 — we want to capture anything
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"open_failed: {e}",
+            )
+
+        try:
+            return self._classify_doc(doc)
+        finally:
+            try:
+                doc.close()
+            except Exception:
+                pass
+
+    def classify_bytes(self, pdf_bytes: bytes) -> RouterDecision:
+        """Same as :meth:`classify`, but from an in-memory buffer."""
+        import io
+
+        try:
+            doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+        except Exception as e:  # noqa: BLE001
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"open_failed: {e}",
+            )
+        try:
+            return self._classify_doc(doc)
+        finally:
+            try:
+                doc.close()
+            except Exception:
+                pass
+
+    # --------------------------------------------------------------- internal
+
+    def _classify_doc(self, doc: pymupdf.Document) -> RouterDecision:
+        # Seed the sampling RNGs so the same PDF always produces the same
+        # feature vector — critical for reproducibility and debugging.
+        random.seed(self.seed)
+        np.random.seed(self.seed)
+
+        try:
+            if doc.is_encrypted or doc.needs_pass:
+                return RouterDecision(
+                    backend=Backend.DEFERRED,
+                    ocr_prob=float("nan"),
+                    num_pages=len(doc),
+                    is_form=False,
+                    garbled_text_ratio=0.0,
+                    is_encrypted=bool(doc.is_encrypted),
+                    needs_password=bool(doc.needs_pass),
+                    error="encrypted_or_password_protected",
+                )
+
+            raw_chunks = self._extractor.extract_all_features(doc)
+            if not raw_chunks:
+                return RouterDecision(
+                    backend=Backend.DEFERRED,
+                    ocr_prob=float("nan"),
+                    num_pages=len(doc),
+                    is_form=False,
+                    garbled_text_ratio=0.0,
+                    is_encrypted=False,
+                    needs_password=False,
+                    error="no_pages_sampled",
+                )
+
+            flat = flatten_per_page_features(
+                raw_chunks[0], sample_to_k_page_features=self.num_pages_to_sample
+            )
+            ocr_prob = self._model.predict_proba(flat)
+
+            backend = self._route(ocr_prob)
+            return RouterDecision(
+                backend=backend,
+                ocr_prob=ocr_prob,
+                num_pages=len(doc),
+                is_form=bool(flat.get("is_form", False)),
+                garbled_text_ratio=float(flat.get("garbled_text_ratio", 0.0)),
+                is_encrypted=bool(doc.is_encrypted),
+                needs_password=bool(doc.needs_pass),
+                features=flat,
+            )
+        except Exception as e:  # noqa: BLE001
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=len(doc) if doc else 0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"classify_failed: {e}",
+            )
+
+    def _route(self, ocr_prob: float) -> Backend:
+        """Map XGBoost probability + fleet policy → concrete Backend."""
+        if ocr_prob < self.ocr_threshold:
+            return Backend.MUPDF
+        # OCR needed. Stage-B would check LayoutCache for complex content
+        # here. For the MVP we have no layout cache yet, so honour the
+        # fleet VLM gate: if VLM is enabled we'd need Stage-B to decide,
+        # otherwise pipeline handles everything flagged as scanned.
+        if self.config.vlm_enabled:
+            return Backend.DEFERRED  # Stage-B will run once layout is cached
+        return Backend.PIPELINE
--- /dev/null
+++ b/packages/pdfsys-router/src/pdfsys_router/download_weights.py
@@ -0,0 +1,52 @@
+"""Fetch the FinePDFs XGBoost router weights from upstream.
+
+The weights file (``xgb.ubj``, ~257 KB) is not committed to this repo —
+it's external IP owned by HuggingFace/FinePDFs and lives on their Git-LFS
+bucket. Running this module downloads it once into ``models/xgb_classifier.ubj``
+next to this package.
+
+Usage::
+
+    python -m pdfsys_router.download_weights
+"""
+
+from __future__ import annotations
+
+import sys
+import urllib.request
+from pathlib import Path
+
+# media.githubusercontent.com serves the actual LFS payload directly,
+# bypassing the pointer file that raw.githubusercontent.com returns.
+WEIGHTS_URL = (
+    "https://media.githubusercontent.com/media/huggingface/finepdfs/main/"
+    "blocks/predictor/xgb.ubj"
+)
+
+
+def target_path() -> Path:
+    return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
+
+
+def download(force: bool = False) -> Path:
+    dst = target_path()
+    if dst.exists() and not force:
+        print(f"[download_weights] already present: {dst}")
+        return dst
+    dst.parent.mkdir(parents=True, exist_ok=True)
+    print(f"[download_weights] fetching {WEIGHTS_URL}")
+    with urllib.request.urlopen(WEIGHTS_URL) as r:  # noqa: S310 — pinned URL
+        data = r.read()
+    if len(data) < 10_000:
+        raise RuntimeError(
+            f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+            "likely an LFS pointer, not the binary"
+        )
+    dst.write_bytes(data)
+    print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+    return dst
+
+
+if __name__ == "__main__":
+    force = "--force" in sys.argv
+    download(force=force)
@@ -0,0 +1,484 @@
+"""PyMuPDF-only feature extractor for the Stage-A router classifier.
+
+Ported verbatim (modulo stylistic cleanup and removal of datatrove imports)
+from FinePDFs' ``blocks/predictor/ocr_predictor.py``:
+
+https://github.com/huggingface/finepdfs/blob/main/blocks/predictor/ocr_predictor.py
+
+The goal is bit-exact feature compatibility with the upstream XGBoost
+``xgb.ubj`` weights. If you touch anything in here, run the parity harness
+in ``pdfsys-bench`` against FinePDFs' reference output first.
+
+The extractor samples up to ``num_pages_to_sample`` pages at random, then
+computes:
+
+* 4 doc-level features: ``num_pages_successfully_sampled``,
+  ``garbled_text_ratio``, ``is_form``, ``creator_or_producer_is_known_scanner``.
+* 15 page-level features × 8 sampled pages = 120 features.
+
+:func:`flatten_per_page_features` produces the flat 124-feature dict the
+XGBoost model expects, in the exact column order of ``feature_names_in_``.
+"""
+
+from __future__ import annotations
+
+import random
+from collections import Counter
+from typing import Any
+
+import numpy as np
+import pymupdf
+
+
+# Keep this list in sync with FinePDFs upstream. These strings are
+# lowercased substring-matched against PDF metadata creator/producer to
+# flag scanner-origin PDFs which almost always need OCR.
+KNOWN_SCANNER_STRINGS: tuple[str, ...] = (
+    "scanner",
+    "scan",
+    "epson",
+    "hp scanjet",
+    "canon",
+    "fujitsu",
+    "kodak",
+    "brother",
+    "xerox",
+    "lexmark",
+    "kmc",
+    "kofax",
+    "ricoh",
+    "iris",
+    "capturedocument",
+    "paperport",
+    "readiris",
+    "simpleocr",
+)
+
+# Strip-merge tuning constants — used to coalesce image slices that some
+# PDFs explode into dozens of thin rectangles, so we don't overcount.
+JUNK_IMAGE_THRESHOLD_RATIO = 0.5
+JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD = 3
+MERGE_MAX_OFFSET = 5
+MERGE_MAX_GAP = 2
+
+
+def flatten_per_page_features(
+    feature_dict_sample: dict[str, Any],
+    sample_to_k_page_features: int = 8,
+) -> dict[str, Any]:
+    """Flatten a nested feature dict into the flat schema XGBoost expects.
+
+    The XGBoost model was trained on a 124-column DataFrame whose columns
+    are, in order:
+
+        num_pages_successfully_sampled
+        garbled_text_ratio
+        is_form
+        creator_or_producer_is_known_scanner
+        page_level_unique_font_counts_page1
+        ...
+        page_level_vector_graphics_obj_count_page8
+
+    If fewer than 8 pages were actually sampled, pages are resampled with
+    replacement to pad the vector — this matches the upstream behavior.
+    Seed numpy before calling this function if you need determinism.
+    """
+    flattened: dict[str, Any] = {}
+
+    doc_level_features = (
+        "num_pages_successfully_sampled",
+        "num_unique_image_xrefs",
+        "num_junk_image_xrefs",
+        "garbled_text_ratio",
+        "is_form",
+        "creator_or_producer_is_known_scanner",
+        "class",
+    )
+
+    used_keys: set[str] = set()
+
+    for key in doc_level_features:
+        if key in feature_dict_sample:
+            flattened[key] = feature_dict_sample[key]
+            used_keys.add(key)
+
+    page_level_features = (
+        "page_level_unique_font_counts",
+        "page_level_char_counts",
+        "page_level_text_box_counts",
+        "page_level_avg_text_box_lengths",
+        "page_level_text_area_ratios",
+        "page_level_hidden_char_counts",
+        "page_level_hidden_text_box_counts",
+        "page_level_hidden_avg_text_box_lengths",
+        "page_level_hidden_text_area_ratios",
+        "page_level_image_counts",
+        "page_level_non_junk_image_counts",
+        "page_level_bitmap_proportions",
+        "page_level_max_merged_strip_areas",
+        "page_level_drawing_strokes_count",
+        "page_level_vector_graphics_obj_count",
+    )
+
+    num_pages = len(feature_dict_sample["page_level_unique_font_counts"])
+    page_indices = list(range(num_pages))
+    # If we don't have enough pages, resample random pages. Upstream uses
+    # np.random.choice here, so seed numpy if determinism matters.
+    if num_pages < sample_to_k_page_features:
+        extra = np.random.choice(
+            num_pages, sample_to_k_page_features - num_pages, replace=True
+        ).tolist()
+        page_indices += extra
+
+    for key in page_level_features:
+        list_data = feature_dict_sample.get(key)
+        if list_data is None:
+            continue
+        for page_idx, ind in enumerate(page_indices):
+            flattened[f"{key}_page{page_idx + 1}"] = list_data[ind]
+        used_keys.add(key)
+
+    return flattened
+
+
+class PDFFeatureExtractor:
+    """PyMuPDF feature extraction. Pure — no I/O, no network, no state."""
+
+    def __init__(self, num_pages_to_sample: int = 8, num_chunks: int = 1) -> None:
+        if not isinstance(num_pages_to_sample, int):
+            raise ValueError("num_pages_to_sample must be an integer.")
+        self.num_pages_to_sample = num_pages_to_sample
+        self.num_chunks = num_chunks
+
+    # --------------------------------------------------------------- sampling
+
+    def _get_sampled_page_indices(self, doc: pymupdf.Document) -> list[list[int]]:
+        total_pages = len(doc)
+        if total_pages == 0 or self.num_pages_to_sample <= 0:
+            return []
+
+        available = list(range(total_pages))
+        sampled: list[list[int]] = []
+
+        if self.num_chunks == -1:
+            num_chunks = len(available) // self.num_pages_to_sample + 1
+        else:
+            num_chunks = self.num_chunks
+
+        for _ in range(num_chunks):
+            if not available:
+                break
+            chunk_size = min(self.num_pages_to_sample, len(available))
+            chunk = random.sample(available, chunk_size)
+            for idx in chunk:
+                available.remove(idx)
+            sampled.append(sorted(chunk))
+
+        return sampled
+
+    # ----------------------------------------------------------- doc-level
+
+    def _get_garbled_text_per_page(
+        self, doc: pymupdf.Document
+    ) -> tuple[list[int], list[int]]:
+        all_text: list[int] = []
+        garbled_text: list[int] = []
+        replacement = chr(0xFFFD)
+        for page in doc:
+            text = page.get_text(
+                "text",
+                flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+            )
+            all_text.append(len(text))
+            garbled_text.append(text.count(replacement))
+        return all_text, garbled_text
+
+    def _check_creator_producer_scanner(self, doc: pymupdf.Document) -> bool:
+        metadata = doc.metadata or {}
+        creator = (metadata.get("creator") or "").lower()
+        producer = (metadata.get("producer") or "").lower()
+        for keyword in KNOWN_SCANNER_STRINGS:
+            if keyword in creator or keyword in producer:
+                return True
+        return False
+
+    def _extract_document_level_stats_from_sampled_pages(
+        self, doc: pymupdf.Document, sampled_page_indices: list[int]
+    ) -> dict[str, Any]:
+        """Identify junk images (same xref repeated on most sampled pages)."""
+        stats: dict[str, Any] = {"junk_image_xrefs_list": []}
+
+        if not sampled_page_indices:
+            return stats
+
+        all_instances: list[int] = []
+        per_page: dict[int, set[int]] = {}
+        for page_idx in sampled_page_indices:
+            try:
+                page = doc.load_page(page_idx)
+                unique_xrefs: set[int] = set()
+                for img_def in page.get_images(full=False):
+                    xref = img_def[0]
+                    if xref == 0:
+                        continue
+                    unique_xrefs.add(xref)
+                    all_instances.append(xref)
+                per_page[page_idx] = unique_xrefs
+            except Exception:
+                per_page[page_idx] = set()
+
+        if not all_instances:
+            return stats
+
+        stats["num_unique_image_xrefs"] = len(set(all_instances))
+
+        xref_page_counts: Counter[int] = Counter()
+        for page_xrefs in per_page.values():
+            xref_page_counts.update(page_xrefs)
+
+        num_sampled = len(sampled_page_indices)
+        # Upstream overrides the ratio check and requires an xref to be on
+        # every sampled page to be flagged as junk — matches FinePDFs.
+        min_threshold = num_sampled
+
+        junk_list: list[int] = []
+        if num_sampled >= JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD:
+            for xref, count in xref_page_counts.items():
+                if count >= min_threshold:
+                    junk_list.append(xref)
+
+        stats["num_junk_image_xrefs"] = len(junk_list)
+        stats["junk_image_xrefs_list"] = junk_list
+        return stats
+
+    # ------------------------------------------------------------- imaging
+
+    def _heuristic_merge_image_strips_on_page(
+        self,
+        single_page_image_list: list[list[Any]],
+        page_width: float,
+        page_height: float,
+    ) -> list[list[Any]]:
+        if not single_page_image_list:
+            return []
+
+        deduped: list[list[Any]] = []
+        seen: set[tuple[float, float, float, float]] = set()
+        for img_data in single_page_image_list:
+            key = (img_data[0], img_data[1], img_data[2], img_data[3])
+            if key not in seen:
+                seen.add(key)
+                deduped.append(img_data)
+        if not deduped:
+            return []
+
+        deduped.sort(key=lambda img: (img[1], img[0]))
+        merged: list[list[Any]] = [deduped[0]]
+
+        for img in deduped[1:]:
+            x0, y0, x1, y1, imgid = img
+            last = merged[-1]
+            lx0, ly0, lx1, ly1, _ = last
+
+            cur_w = abs(x1 - x0)
+            cur_h = abs(y1 - y0)
+            full_w = page_width > 0 and cur_w >= page_width * 0.9
+            full_h = page_height > 0 and cur_h >= page_height * 0.9
+
+            can_merge = False
+            if full_w:
+                if (
+                    abs(lx0 - x0) <= MERGE_MAX_OFFSET
+                    and abs(lx1 - x1) <= MERGE_MAX_OFFSET
+                    and abs(y0 - ly1) <= MERGE_MAX_GAP
+                ):
+                    can_merge = True
+            if not can_merge and full_h:
+                if (
+                    abs(ly0 - y0) <= MERGE_MAX_OFFSET
+                    and abs(ly1 - y1) <= MERGE_MAX_OFFSET
+                    and abs(x0 - lx1) <= MERGE_MAX_GAP
+                ):
+                    can_merge = True
+
+            if can_merge:
+                merged[-1] = [
+                    min(x0, lx0),
+                    min(y0, ly0),
+                    max(x1, lx1),
+                    max(y1, ly1),
+                    imgid,
+                ]
+            else:
+                merged.append(img)
+
+        return merged
+
+    # ---------------------------------------------------------------- main
+
+    def compute_features_per_chunk(
+        self, doc: pymupdf.Document, sampled_page_indices: list[int]
+    ) -> dict[str, Any]:
+        features: dict[str, Any] = {
+            "is_form": False,
+            "creator_or_producer_is_known_scanner": False,
+            "garbled_text_ratio": 0,
+            "page_level_unique_font_counts": [],
+            "page_level_char_counts": [],
+            "page_level_text_box_counts": [],
+            "page_level_avg_text_box_lengths": [],
+            "page_level_text_area_ratios": [],
+            "page_level_hidden_char_counts": [],
+            "page_level_hidden_text_box_counts": [],
+            "page_level_hidden_avg_text_box_lengths": [],
+            "page_level_hidden_text_area_ratios": [],
+            "page_level_image_counts": [],
+            "page_level_non_junk_image_counts": [],
+            "page_level_bitmap_proportions": [],
+            "page_level_max_merged_strip_areas": [],
+            "page_level_drawing_strokes_count": [],
+            "page_level_vector_graphics_obj_count": [],
+            "num_pages_successfully_sampled": 0,
+            "num_pages_requested_for_sampling": 0,
+            "sampled_page_indices": [],
+        }
+
+        features["num_pages_requested_for_sampling"] = len(sampled_page_indices)
+        if not sampled_page_indices:
+            return features
+
+        doc_stats = self._extract_document_level_stats_from_sampled_pages(
+            doc, sampled_page_indices
+        )
+        junk_xrefs: set[int] = set(doc_stats.get("junk_image_xrefs_list", []))
+
+        features["is_form"] = bool(doc.is_form_pdf) if doc.is_form_pdf is not None else False
+        features["creator_or_producer_is_known_scanner"] = self._check_creator_producer_scanner(doc)
+
+        # Garbled text: U+FFFD replacement character / total chars. Computed
+        # over ALL pages, but the rate reported to XGBoost is restricted to
+        # the sampled pages (upstream semantics).
+        all_text, garbled_text = self._get_garbled_text_per_page(doc)
+        all_sum = sum(all_text)
+        garb_sum = sum(garbled_text)
+        features["global_garbled_text_ratio"] = 0 if all_sum == 0 else garb_sum / all_sum
+
+        sampled_garb = sum(garbled_text[i] for i in sampled_page_indices)
+        sampled_all = sum(all_text[i] for i in sampled_page_indices)
+        features["garbled_text_ratio"] = 0 if sampled_all == 0 else sampled_garb / sampled_all
+
+        for page_idx in sampled_page_indices:
+            try:
+                page = doc.load_page(page_idx)
+            except Exception:
+                continue
+
+            features["sampled_page_indices"].append(page_idx)
+            features["num_pages_successfully_sampled"] += 1
+
+            page_rect = page.rect
+            page_area = float(page_rect.width * page_rect.height) or 1.0
+
+            # --- Fonts ---
+            fonts: set[str] = set()
+            try:
+                for fi in page.get_fonts(full=True):
+                    if len(fi) > 3 and fi[3]:
+                        fonts.add(fi[3])
+            except Exception:
+                pass
+            features["page_level_unique_font_counts"].append(len(fonts))
+
+            # --- Visible vs hidden text via texttrace ---
+            char_count = 0
+            text_area = 0.0
+            text_boxes = 0
+            hidden_chars = 0
+            hidden_area = 0.0
+            hidden_boxes = 0
+            try:
+                for tr in page.get_texttrace():
+                    n = len(tr.get("chars", []))
+                    bbox = tr.get("bbox", (0, 0, 0, 0))
+                    box_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
+                    if tr.get("type") == 3 or tr.get("opacity", 1.0) == 0:
+                        hidden_chars += n
+                        hidden_area += box_area
+                        hidden_boxes += 1
+                    else:
+                        char_count += n
+                        text_area += box_area
+                        text_boxes += 1
+            except Exception:
+                pass
+
+            features["page_level_char_counts"].append(char_count)
+            features["page_level_text_box_counts"].append(text_boxes)
+            features["page_level_avg_text_box_lengths"].append(
+                text_area / text_boxes if text_boxes else 0.0
+            )
+            features["page_level_text_area_ratios"].append(text_area / page_area)
+            features["page_level_hidden_char_counts"].append(hidden_chars)
+            features["page_level_hidden_text_box_counts"].append(hidden_boxes)
+            features["page_level_hidden_avg_text_box_lengths"].append(
+                hidden_area / hidden_boxes if hidden_boxes else 0.0
+            )
+            features["page_level_hidden_text_area_ratios"].append(hidden_area / page_area)
+
+            # --- Images ---
+            total_imgs = 0
+            non_junk_imgs = 0
+            non_junk_rects: list[list[Any]] = []
+            try:
+                for img_def in page.get_images(full=False):
+                    xref = img_def[0]
+                    if xref == 0:
+                        continue
+                    rects = page.get_image_rects(xref, transform=False)
+                    total_imgs += len(rects)
+                    if xref not in junk_xrefs:
+                        non_junk_imgs += len(rects)
+                        for r in rects:
+                            if r.is_empty or r.is_infinite:
+                                continue
+                            non_junk_rects.append([r.x0, r.y0, r.x1, r.y1, xref])
+            except Exception:
+                pass
+
+            features["page_level_image_counts"].append(total_imgs)
+            features["page_level_non_junk_image_counts"].append(non_junk_imgs)
+
+            merged = self._heuristic_merge_image_strips_on_page(
+                non_junk_rects, page_rect.width, page_rect.height
+            )
+            strip_areas = [abs(b[2] - b[0]) * abs(b[3] - b[1]) for b in merged]
+            if strip_areas:
+                features["page_level_max_merged_strip_areas"].append(max(strip_areas) / page_area)
+                features["page_level_bitmap_proportions"].append(sum(strip_areas) / page_area)
+            else:
+                features["page_level_max_merged_strip_areas"].append(0.0)
+                features["page_level_bitmap_proportions"].append(0.0)
+
+            # --- Drawings / vector graphics ---
+            stroke_count = 0
+            vector_objs = 0
+            try:
+                drawings = page.get_cdrawings()
+                vector_objs = len(drawings)
+                for path in drawings:
+                    for item in path.get("items", []):
+                        if item[0] in ("l", "c", "q"):
+                            stroke_count += 1
+                    if path.get("rect") or path.get("quad"):
+                        if path.get("stroke_opacity", 1) > 0 and path.get("color"):
+                            stroke_count += 1
+            except Exception:
+                pass
+            features["page_level_drawing_strokes_count"].append(stroke_count)
+            features["page_level_vector_graphics_obj_count"].append(vector_objs)
+
+        return features
+
+    def extract_all_features(self, doc: pymupdf.Document) -> list[dict[str, Any]]:
+        chunks = self._get_sampled_page_indices(doc)
+        return [self.compute_features_per_chunk(doc, c) for c in chunks]
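
To make the extractor's contract concrete, here is a minimal sketch (not part of the diff) of turning one PDF into the flat 124-key dict. The names come from this file; the explicit seeding is one way to satisfy the determinism notes in the docstrings, and "sample.pdf" is a hypothetical input path:

    # Sketch: one PDF -> one flat feature dict for the router model.
    import random

    import numpy as np
    import pymupdf

    from pdfsys_router.feature_extractor import (
        PDFFeatureExtractor,
        flatten_per_page_features,
    )

    random.seed(0)     # page sampling uses the stdlib RNG
    np.random.seed(0)  # padding below 8 pages uses np.random.choice

    doc = pymupdf.open("sample.pdf")  # hypothetical input
    nested = PDFFeatureExtractor(num_pages_to_sample=8).extract_all_features(doc)[0]
    flat = flatten_per_page_features(nested)  # keys end in _page1 .. _page8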

packages/pdfsys-router/src/pdfsys_router/xgb_model.py
@@ -0,0 +1,66 @@
+"""Thin loader around the FinePDFs XGBoost ``xgb.ubj`` weights.
+
+The model is a binary classifier where class 1 = "needs OCR" (scanned /
+garbled / image-heavy / form). It takes a 124-column feature vector whose
+column order is fixed by :func:`feature_extractor.flatten_per_page_features`.
+
+We keep the loader tiny on purpose: the calibration between feature layout
+and column order lives entirely in ``feature_extractor.py`` — this file
+only knows "give me a dict-of-features, I'll give you a probability".
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from xgboost import XGBClassifier
+
+
+class XgbRouterModel:
+    """Lazy-loading wrapper around an ``xgb.ubj`` binary classifier."""
+
+    def __init__(self, path_to_model: str | Path) -> None:
+        self.path_to_model = Path(path_to_model)
+        self._model: XGBClassifier | None = None
+
+    @property
+    def model(self) -> XGBClassifier:
+        if self._model is None:
+            if not self.path_to_model.is_file():
+                raise FileNotFoundError(
+                    f"XGBoost weights not found at {self.path_to_model}. "
+                    "Run `python -m pdfsys_router.download_weights` to fetch them."
+                )
+            m = XGBClassifier()
+            m.load_model(str(self.path_to_model))
+            self._model = m
+        return self._model
+
+    def predict_proba(self, features: dict[str, float]) -> float:
+        """Return P(class=1, i.e. needs OCR)."""
+        df = pd.DataFrame([features])
+        # Column ordering must match the training schema — realign using
+        # the model's recorded feature_names_in_ when available.
+        names = getattr(self.model, "feature_names_in_", None)
+        if names is not None:
+            df = df.reindex(columns=list(names), fill_value=0)
+        probs = self.model.predict_proba(df)
+        return float(probs[0][1])
+
+    @property
+    def feature_names(self) -> list[str]:
+        names = getattr(self.model, "feature_names_in_", None)
+        if names is None:
+            return []
+        return list(names)
+
+    @property
+    def n_features(self) -> int:
+        return int(getattr(self.model, "n_features_in_", 0))
+
+
+def default_weights_path() -> Path:
+    """Return the canonical on-disk location of the bundled weights."""
+    return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
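
Wiring the two files together, the router's hot path reduces to roughly the following (a sketch under the same assumptions as the previous one; the 0.5 cutoff is purely illustrative, since the actual backend thresholds live in Router.classify(), not in this loader):

    # Sketch: flat features -> P(needs OCR). The threshold shown is
    # illustrative; the real MUPDF-vs-PIPELINE decision is Router.classify()'s.
    from pdfsys_router.xgb_model import XgbRouterModel, default_weights_path

    model = XgbRouterModel(default_weights_path())
    ocr_prob = model.predict_proba(flat)  # `flat` from the previous sketch
    backend = "PIPELINE" if ocr_prob >= 0.5 else "MUPDF"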