yin committed
Commit d423504 · Parent: 1f63780

feat(mvp): wire router → mupdf parser → OCR quality scorer closed loop

Ship the first end-to-end cut of the pdfsys pipeline on OmniDocBench-100:

* pdfsys-router: port FinePDFs PDFFeatureExtractor (15 page features × 8
sampled pages + 4 doc features = 124 columns) and load the upstream
xgb.ubj weights via a thin XgbRouterModel wrapper. Router.classify()
returns a RouterDecision with Backend {MUPDF, PIPELINE, VLM, DEFERRED},
ocr_prob, and the full feature dict for debugging. Seeded RNG keeps the
feature vector reproducible per PDF. Weights live under models/ and are
gitignored; download_weights.py fetches them from the HF LFS media URL.

* pdfsys-parser-mupdf: text-ok backend built on page.get_text("blocks",
sort=True); emits one Segment per paragraph-shaped block with its bbox
normalized to [0, 1], then merges the whole doc into an ExtractedDoc
with combined Markdown. No layout-analyser dependency by design.

* pdfsys-bench: add quality.py (ModernBERT-large regression head from
HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn, loaded in
bfloat16 with max_tokens=512 to fit a 4 GB RAM dev box), loop.py
(router → parser → scorer → JSONL runner), and __main__ CLI.

End-to-end run on the full 100-doc OmniDocBench subset:
* 70 routed to MUPDF (avg ocr_prob 0.034), 30 routed to PIPELINE
(avg ocr_prob 0.634)
* 70 extracted + quality scored, 0 errors
* avg quality 1.71, wall clock 259 s
* per-doc: router 49 ms, extract 7 ms, quality 3.6 s

Stage-B (LayoutCache-driven pipeline-vs-vlm decision) and the PIPELINE
and VLM parser backends are out of scope for this MVP.
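
For reference, the closed loop composes like this (a minimal sketch using the
public APIs added in this commit; sample.pdf is a placeholder):

```python
from pdfsys_core import Backend
from pdfsys_router import Router
from pdfsys_parser_mupdf import extract_doc
from pdfsys_bench import OcrQualityScorer

router = Router(ocr_threshold=0.5)   # lazily loads models/xgb_classifier.ubj
scorer = OcrQualityScorer()          # ModernBERT head, loaded on first score()

decision = router.classify("sample.pdf")
if decision.backend == Backend.MUPDF:        # text-ok fast path
    doc = extract_doc("sample.pdf")
    quality = scorer.score(doc.markdown)
    print(decision.ocr_prob, quality.score)
```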

.gitignore CHANGED
@@ -16,11 +16,16 @@ uv.lock
  # local pipeline scratch
  work/
  output/
+ out/
  .cache/
  samples/
  bench_data/
  *.layout.json

+ # bench datasets — large binary corpora, distributed out of band
+ packages/pdfsys-bench/omnidocbench_100/
+ packages/pdfsys-bench/olmocr_bench_50/
+
  # models / weights (too big for git)
  models/
  *.onnx
packages/pdfsys-bench/README.md ADDED
@@ -0,0 +1,88 @@
+ # bench/ — PDF processing pipeline evaluation set
+
+ This directory is the **canonical test set** for evaluating the end-to-end PDF
+ processing pipeline (layout → OCR → markdown / structured text). It bundles
+ two complementary, pre-sampled subsets so that runs are reproducible and
+ cheap to iterate on.
+
+ | Subset | PDFs | Source benchmark | Focus |
+ |---|---:|---|---|
+ | [`olmocr_bench_50/`](./olmocr_bench_50) | 50 | [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Fine-grained unit tests on text presence / absence, reading order, tables, math |
+ | [`omnidocbench_100/`](./omnidocbench_100) | 100 | [OmniDocBench](https://github.com/opendatalab/OmniDocBench) | Holistic document-level eval with layout / language / special-issue coverage |
+
+ Total footprint: ~108 MB, 150 PDFs.
+
+ ## Subset details
+
+ ### `olmocr_bench_50/`
+ Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script
+ `scripts/sample_olmocr_subset.py` (seed `20260411`). Covers all 7 document
+ sources with a minimum floor of 3 PDFs per category plus largest-remainder
+ proportional allocation, and diversifies by source document inside each
+ category (at most one page per arXiv paper / scan ID before any repeat).
+
+ ```
+ olmocr_bench_50/
+ ├── pdfs/
+ │   ├── arxiv_math/        (14)
+ │   ├── headers_footers/    (8)
+ │   ├── long_tiny_text/     (4)
+ │   ├── multi_column/       (8)
+ │   ├── old_scans/          (5)
+ │   ├── old_scans_math/     (4)
+ │   └── tables/             (7)
+ ├── subset_tests.jsonl      # 283 olmOCR-bench unit tests for these 50 PDFs
+ └── subset_manifest.json    # seed, quotas, selected file list, source bench_dir
+ ```
+
+ The `subset_tests.jsonl` file is a filtered copy of the original per-category
+ `*.jsonl` test files merged into one; each row keeps the exact schema used by
+ the upstream olmOCR-bench evaluator (`pdf`, `type`, `max_diffs`, `checked`,
+ and type-specific fields like `math`, `cell`, `before`/`after`, …).
+
+ Regenerate or resize:
+ ```bash
+ python3 scripts/sample_olmocr_subset.py --target 50             # default → bench/olmocr_bench_50
+ python3 scripts/sample_olmocr_subset.py --target 100 --seed 42  # alt subset
+ python3 scripts/sample_olmocr_subset.py --dry-run               # plan only
+ ```
+
+ ### `omnidocbench_100/`
+ Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage
+ across every categorical axis in the upstream dataset.
+
+ ```
+ omnidocbench_100/
+ ├── pdfs/                  # 100 single-page PDFs
+ ├── img/                   # matching rendered JPGs (1 per PDF)
+ ├── subset_100.json        # full OmniDocBench annotations for the 100 samples
+ ├── subset_100_stats.json  # coverage & distribution stats vs. full 981-doc set
+ ├── subset_100_pdfs.txt    # flat list of selected PDF filenames
+ └── subset_100_images.txt  # flat list of selected image filenames
+ ```
+
+ Coverage (from `subset_100_stats.json`) — every bucket of every axis is hit:
+ - **data_source** 9/9 · **language** 3/3 · **layout** 5/5
+ - **special_issue** 13/13 · **stratum** 67/67
+
+ ## Using the bench
+
+ These two subsets are intended to be run as a pair — olmOCR-bench gives you
+ sharp per-feature pass/fail signals and OmniDocBench gives you an aggregate
+ quality score across real-world document types. For each new pipeline
+ version, run both subsets, record per-subset metrics, and diff against the
+ previous run.
+
+ Common entry points (to be wired up by the pipeline evaluator):
+
+ ```text
+ bench/olmocr_bench_50/pdfs/**/*.pdf          # inputs
+ bench/olmocr_bench_50/subset_tests.jsonl     # ground truth unit tests
+
+ bench/omnidocbench_100/pdfs/*.pdf            # inputs
+ bench/omnidocbench_100/subset_100.json       # ground truth annotations
+ ```
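+
+ For example, a smoke-test run over the OmniDocBench subset (this mirrors the
+ `python -m pdfsys_bench` usage documented in `__main__.py`; the output path
+ is illustrative):
+
+ ```bash
+ python -m pdfsys_bench \
+     --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
+     --out out/bench_omnidoc100.jsonl
+ ```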
+
+ Do **not** manually edit files under `bench/`. Regenerate with the sampling
+ script (for olmocr) or re-export from the upstream builder (for omnidoc) so
+ results stay reproducible.
packages/pdfsys-bench/pyproject.toml CHANGED
@@ -9,10 +9,16 @@ description = "Cross-backend benchmarking — throughput, latency, and F1 on a s
  requires-python = ">=3.11"
  dependencies = [
      "pdfsys-core",
+     "pdfsys-router",
+     "pdfsys-parser-mupdf",
+     "torch>=2.1",
+     "transformers>=4.44",
  ]

  [tool.uv.sources]
  pdfsys-core = { workspace = true }
+ pdfsys-router = { workspace = true }
+ pdfsys-parser-mupdf = { workspace = true }

  [tool.hatch.build.targets.wheel]
  packages = ["src/pdfsys_bench"]
packages/pdfsys-bench/src/pdfsys_bench/__init__.py CHANGED
@@ -1,7 +1,22 @@
- """pdfsys-bench — evaluation harness.
+ """pdfsys-bench — evaluation harness and MVP closed-loop runner.

- Runs the same sample PDF set through mupdf / pipeline / vlm backends and
- reports throughput, latency, and F1 against gold Markdown references.
+ Runs a PDF directory through router → parser → OCR-quality scorer and
+ writes one JSONL row per PDF. This is the minimal end-to-end harness; a
+ richer benchmark (throughput, F1 against gold Markdown, cross-backend
+ comparison) will layer on top of it.
  """

+ from __future__ import annotations
+
+ from .loop import LoopResult, run_loop
+ from .quality import OcrQualityScorer, QualityScore
+
  __version__ = "0.0.1"
+
+ __all__ = [
+     "__version__",
+     "LoopResult",
+     "run_loop",
+     "OcrQualityScorer",
+     "QualityScore",
+ ]
packages/pdfsys-bench/src/pdfsys_bench/__main__.py ADDED
@@ -0,0 +1,98 @@
+ """pdfsys-bench CLI — run the MVP closed loop on a directory of PDFs.
+
+ Usage::
+
+     python -m pdfsys_bench \\
+         --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \\
+         --out out/bench_omnidoc100.jsonl \\
+         --limit 20
+
+ Flags exposed here are intentionally minimal — anything more is the job
+ of a proper runner package. This CLI is meant for smoke-testing.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import sys
+ from pathlib import Path
+
+ from .loop import run_loop
+
+
+ def build_parser() -> argparse.ArgumentParser:
+     p = argparse.ArgumentParser(prog="pdfsys-bench", description="Run the MVP pdfsys closed loop.")
+     p.add_argument(
+         "--pdf-dir",
+         type=Path,
+         required=True,
+         help="Directory of PDFs to process (recursive).",
+     )
+     p.add_argument(
+         "--out",
+         type=Path,
+         required=True,
+         help="Output JSONL path (one line per PDF).",
+     )
+     p.add_argument(
+         "--limit",
+         type=int,
+         default=None,
+         help="Cap the number of PDFs processed. Default: no cap.",
+     )
+     p.add_argument(
+         "--no-quality",
+         action="store_true",
+         help="Skip the ModernBERT quality scorer (fast smoke test).",
+     )
+     p.add_argument(
+         "--quality-model",
+         default="HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+         help="HuggingFace repo id for the quality scorer.",
+     )
+     p.add_argument(
+         "--router-weights",
+         type=Path,
+         default=None,
+         help="Path to xgb_classifier.ubj. Defaults to the package's bundled path.",
+     )
+     p.add_argument(
+         "--markdown-dir",
+         type=Path,
+         default=None,
+         help="Optional directory to dump per-PDF extracted markdown.",
+     )
+     p.add_argument(
+         "--ocr-threshold",
+         type=float,
+         default=0.5,
+         help="P(ocr) threshold above which a PDF is routed off the text-ok path.",
+     )
+     return p
+
+
+ def main(argv: list[str] | None = None) -> int:
+     args = build_parser().parse_args(argv)
+     summary = run_loop(
+         pdf_dir=args.pdf_dir,
+         out_path=args.out,
+         limit=args.limit,
+         score_quality=not args.no_quality,
+         router_weights=args.router_weights,
+         quality_model=args.quality_model,
+         markdown_dir=args.markdown_dir,
+         ocr_threshold=args.ocr_threshold,
+     )
+
+     print(f"[pdfsys-bench] processed {summary['num_pdfs']} PDFs in {summary['wall_seconds']:.1f}s")
+     print(f"[pdfsys-bench] by_backend: {summary['by_backend']}")
+     print(f"[pdfsys-bench] extracted={summary['num_extracted']} scored={summary['num_scored']} errors={summary['num_errors']}")
+     if summary.get("avg_quality") is not None:
+         print(f"[pdfsys-bench] avg_quality={summary['avg_quality']:.3f}")
+     print(f"[pdfsys-bench] jsonl: {summary['out_path']}")
+     print(f"[pdfsys-bench] summary: {summary['summary_path']}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
packages/pdfsys-bench/src/pdfsys_bench/loop.py ADDED
@@ -0,0 +1,200 @@
+ """MVP closed-loop runner: router → parser → quality scorer → JSONL.
+
+ This is the tiniest possible end-to-end harness for the pdfsys pipeline.
+ Given a directory of PDFs, it:
+
+ 1. runs :class:`pdfsys_router.Router` to pick a backend per document;
+ 2. for PDFs routed to ``Backend.MUPDF``, runs :func:`pdfsys_parser_mupdf.extract_doc`
+    to produce an :class:`pdfsys_core.ExtractedDoc`;
+ 3. scores the resulting Markdown with :class:`pdfsys_bench.OcrQualityScorer`
+    (the ModernBERT-large regression head from FinePDFs);
+ 4. writes one JSON line per PDF to an output file with routing decision,
+    extraction stats, and quality score.
+
+ PDFs routed to ``PIPELINE`` / ``VLM`` / ``DEFERRED`` are recorded with
+ their routing decision but skipped for extraction — those backends are
+ not implemented yet in this MVP.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import time
+ from dataclasses import asdict, dataclass, field
+ from pathlib import Path
+ from typing import Any, Iterable
+
+ from pdfsys_core import Backend
+ from pdfsys_parser_mupdf import extract_doc
+ from pdfsys_router import Router
+
+ from .quality import OcrQualityScorer, QualityScore
+
+
+ @dataclass(slots=True)
+ class LoopResult:
+     """Per-PDF result row, serialized to JSONL."""
+
+     pdf_path: str
+     sha256: str | None
+     backend: str
+     ocr_prob: float
+     num_pages: int
+     is_form: bool
+     garbled_text_ratio: float
+     router_error: str | None
+     extract_stats: dict[str, Any] = field(default_factory=dict)
+     extract_error: str | None = None
+     quality_score: float | None = None
+     quality_num_chars: int | None = None
+     quality_num_tokens: int | None = None
+     quality_model: str | None = None
+     markdown_chars: int = 0
+     wall_ms_router: float = 0.0
+     wall_ms_extract: float = 0.0
+     wall_ms_quality: float = 0.0
+
+     def to_json_line(self) -> str:
+         return json.dumps(asdict(self), ensure_ascii=False)
+
+
+ def _iter_pdfs(root: Path, limit: int | None) -> Iterable[Path]:
+     pdfs = sorted(p for p in root.rglob("*.pdf") if p.is_file())
+     if limit is not None:
+         pdfs = pdfs[:limit]
+     yield from pdfs
+
+
+ def run_loop(
+     pdf_dir: str | Path,
+     out_path: str | Path,
+     *,
+     limit: int | None = None,
+     score_quality: bool = True,
+     router_weights: str | Path | None = None,
+     quality_model: str = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+     markdown_dir: str | Path | None = None,
+     ocr_threshold: float = 0.5,
+ ) -> dict[str, Any]:
+     """Drive the full MVP loop over a PDF directory.
+
+     Returns an aggregate summary dict. Individual result rows are written
+     to ``out_path`` as JSONL (one line per PDF, in input order).
+     """
+     pdf_dir = Path(pdf_dir)
+     out_path = Path(out_path)
+     out_path.parent.mkdir(parents=True, exist_ok=True)
+
+     router = Router(model_path=router_weights, ocr_threshold=ocr_threshold)
+     scorer = OcrQualityScorer(model_name=quality_model) if score_quality else None
+
+     md_root = Path(markdown_dir) if markdown_dir else None
+     if md_root is not None:
+         md_root.mkdir(parents=True, exist_ok=True)
+
+     summary: dict[str, Any] = {
+         "pdf_dir": str(pdf_dir),
+         "out_path": str(out_path),
+         "num_pdfs": 0,
+         "by_backend": {},
+         "num_extracted": 0,
+         "num_scored": 0,
+         "num_errors": 0,
+         "sum_quality": 0.0,
+         "started_at": time.time(),
+     }
+
+     with out_path.open("w", encoding="utf-8") as out_f:
+         for pdf_path in _iter_pdfs(pdf_dir, limit):
+             row = _run_one(
+                 pdf_path=pdf_path,
+                 router=router,
+                 scorer=scorer,
+                 md_root=md_root,
+             )
+             out_f.write(row.to_json_line() + "\n")
+             out_f.flush()
+
+             summary["num_pdfs"] += 1
+             by_b = summary["by_backend"]
+             by_b[row.backend] = by_b.get(row.backend, 0) + 1
+             if row.extract_error is None and row.backend == Backend.MUPDF.value:
+                 summary["num_extracted"] += 1
+             if row.quality_score is not None:
+                 summary["num_scored"] += 1
+                 summary["sum_quality"] += row.quality_score
+             if row.router_error or row.extract_error:
+                 summary["num_errors"] += 1
+
+     summary["finished_at"] = time.time()
+     summary["wall_seconds"] = summary["finished_at"] - summary["started_at"]
+     summary["avg_quality"] = (
+         summary["sum_quality"] / summary["num_scored"] if summary["num_scored"] else None
+     )
+
+     summary_path = out_path.with_suffix(".summary.json")
+     summary_path.write_text(json.dumps(summary, indent=2, ensure_ascii=False))
+     summary["summary_path"] = str(summary_path)
+
+     return summary
+
+
+ def _run_one(
+     *,
+     pdf_path: Path,
+     router: Router,
+     scorer: OcrQualityScorer | None,
+     md_root: Path | None,
+ ) -> LoopResult:
+     # -- Stage-A routing ------------------------------------------------------
+     t0 = time.perf_counter()
+     decision = router.classify(pdf_path)
+     t1 = time.perf_counter()
+
+     row = LoopResult(
+         pdf_path=str(pdf_path),
+         sha256=None,
+         backend=decision.backend.value,
+         ocr_prob=decision.ocr_prob,
+         num_pages=decision.num_pages,
+         is_form=decision.is_form,
+         garbled_text_ratio=decision.garbled_text_ratio,
+         router_error=decision.error,
+         wall_ms_router=(t1 - t0) * 1000.0,
+     )
+
+     # -- MVP only extracts the text-ok fast path ------------------------------
+     if decision.backend != Backend.MUPDF:
+         return row
+
+     try:
+         t2 = time.perf_counter()
+         extracted = extract_doc(pdf_path)
+         t3 = time.perf_counter()
+         row.sha256 = extracted.sha256
+         row.extract_stats = dict(extracted.stats)
+         row.markdown_chars = extracted.char_count
+         row.wall_ms_extract = (t3 - t2) * 1000.0
+     except Exception as e:  # noqa: BLE001
+         row.extract_error = f"extract_failed: {e}"
+         return row
+
+     if md_root is not None and extracted.markdown:
+         md_path = md_root / f"{extracted.sha256}.md"
+         md_path.write_text(extracted.markdown, encoding="utf-8")
+
+     # -- Quality scoring ------------------------------------------------------
+     if scorer is not None and extracted.markdown:
+         try:
+             t4 = time.perf_counter()
+             q: QualityScore = scorer.score(extracted.markdown)
+             t5 = time.perf_counter()
+             row.quality_score = q.score
+             row.quality_num_chars = q.num_chars
+             row.quality_num_tokens = q.num_tokens
+             row.quality_model = q.model
+             row.wall_ms_quality = (t5 - t4) * 1000.0
+         except Exception as e:  # noqa: BLE001
+             row.extract_error = f"quality_failed: {e}"
+
+     return row
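+
+ # Illustrative JSONL row shape (field values below are placeholders, not
+ # real output):
+ #
+ #     {"pdf_path": ".../doc.pdf", "backend": "mupdf", "ocr_prob": 0.03,
+ #      "extract_stats": {"page_count": 1, ...}, "quality_score": 1.9,
+ #      "wall_ms_router": 49.0, "wall_ms_extract": 7.0, ...}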
packages/pdfsys-bench/src/pdfsys_bench/quality.py ADDED
@@ -0,0 +1,148 @@
+ """OCR quality scorer backed by the FinePDFs ModernBERT classifier.
+
+ Wraps ``HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`` — a
+ single-head regression fine-tune of ModernBERT-large (~0.4 B params)
+ that emits a float in ``[0, 3]`` where:
+
+ * 0 → garbage / unreadable OCR
+ * 1 → formatting issues but mostly readable
+ * 2 → minor problems
+ * 3 → clean text
+
+ The scorer takes raw extracted text (Markdown or plain), truncates to at
+ most ``max_chars`` characters before tokenization, tokenizes with the
+ model's own tokenizer, runs one forward pass, and returns the scalar.
+
+ Heavy dependencies (``torch`` + ``transformers``) are imported lazily so
+ that merely importing :mod:`pdfsys_bench` does not pull them in.
+ """
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Any
+
+ DEFAULT_MODEL = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn"
+ DEFAULT_MAX_CHARS = 10_000
+ # Upstream FinePDFs uses max_tokens=2048, but ModernBERT-large activations
+ # at that length need ≈ 3 GB of RAM — too much for a 4 GB dev box. 512
+ # tokens is enough to give a stable quality signal in practice and keeps
+ # peak memory well under a gig.
+ DEFAULT_MAX_TOKENS = 512
+
+
+ @dataclass(slots=True)
+ class QualityScore:
+     """Result of scoring one document."""
+
+     score: float
+     num_chars: int
+     num_tokens: int
+     model: str
+
+     def as_record(self) -> dict[str, Any]:
+         return {
+             "quality_score": self.score,
+             "quality_num_chars": self.num_chars,
+             "quality_num_tokens": self.num_tokens,
+             "quality_model": self.model,
+         }
+
+
+ class OcrQualityScorer:
+     """Lazy ModernBERT regression scorer. Re-uses model/tokenizer across calls."""
+
+     def __init__(
+         self,
+         model_name: str = DEFAULT_MODEL,
+         max_chars: int = DEFAULT_MAX_CHARS,
+         max_tokens: int = DEFAULT_MAX_TOKENS,
+         device: str | None = None,
+         dtype: str = "bfloat16",
+     ) -> None:
+         self.model_name = model_name
+         self.max_chars = max_chars
+         self.max_tokens = max_tokens
+         self._device_name = device
+         self.dtype_name = dtype
+         self._tokenizer: Any = None
+         self._model: Any = None
+         self._torch: Any = None
+         self._device: Any = None
+
+     def _ensure_loaded(self) -> None:
+         if self._model is not None:
+             return
+         import torch  # noqa: PLC0415 — lazy import is intentional
+         from transformers import AutoModelForSequenceClassification, AutoTokenizer  # noqa: PLC0415
+
+         self._torch = torch
+         self._device = torch.device(
+             self._device_name
+             or ("cuda" if torch.cuda.is_available() else "cpu")
+         )
+         # Use bfloat16 on CPU to halve the model's memory footprint —
+         # ModernBERT-large is ~0.4 B params, so fp32 weights alone take
+         # ~1.6 GB and OOM a 4 GB-RAM dev box. bf16 inference is
+         # numerically stable enough for a regression head like this.
+         torch_dtype = getattr(torch, self.dtype_name, torch.float32)
+
+         self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+         # ``dtype`` is the transformers≥5 name; ``torch_dtype`` was the
+         # transformers<5 name. Pass ``dtype`` and fall back for older releases.
+         try:
+             model = AutoModelForSequenceClassification.from_pretrained(
+                 self.model_name,
+                 dtype=torch_dtype,
+             )
+         except TypeError:
+             model = AutoModelForSequenceClassification.from_pretrained(
+                 self.model_name,
+                 torch_dtype=torch_dtype,
+             )
+         model.eval()
+         model.to(self._device)
+         self._model = model
+
+     def score(self, text: str) -> QualityScore:
+         """Score a single document. Empty input returns 0.0."""
+         if not text or not text.strip():
+             return QualityScore(
+                 score=0.0, num_chars=0, num_tokens=0, model=self.model_name
+             )
+
+         self._ensure_loaded()
+         assert self._tokenizer is not None and self._model is not None
+         torch = self._torch
+
+         clipped = text[: self.max_chars]
+         enc = self._tokenizer(
+             clipped,
+             return_tensors="pt",
+             truncation=True,
+             max_length=self.max_tokens,
+         )
+         num_tokens = int(enc["input_ids"].shape[1])
+         enc = {k: v.to(self._device) for k, v in enc.items()}
+
+         with torch.inference_mode():
+             out = self._model(**enc)
+             logits = out.logits  # shape [1, 1] for regression
+             raw = float(logits.squeeze().item())
+         # Drop the forward-pass tensors eagerly so large-seq runs on CPU
+         # don't hold onto activations between calls.
+         del enc, out, logits
+
+         # Clamp to the documented [0, 3] range.
+         clamped = max(0.0, min(3.0, raw))
+
+         return QualityScore(
+             score=clamped,
+             num_chars=len(clipped),
+             num_tokens=num_tokens,
+             model=self.model_name,
+         )
+
+     def score_many(self, texts: list[str]) -> list[QualityScore]:
+         """Serial scoring — tiny MVP harness, not a batched hot path."""
+         return [self.score(t) for t in texts]
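+
+ # Illustrative usage (the text literal is a placeholder):
+ #
+ #     scorer = OcrQualityScorer()
+ #     q = scorer.score("# Title\n\nExtracted markdown body…")
+ #     q.score  # float in [0, 3]; higher means cleaner text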
packages/pdfsys-parser-mupdf/pyproject.toml CHANGED
@@ -9,6 +9,7 @@ description = "Text-ok backend: PyMuPDF extraction + reading order + Markdown em
  requires-python = ">=3.11"
  dependencies = [
      "pdfsys-core",
+     "pymupdf>=1.24",
  ]

  [tool.uv.sources]
packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py CHANGED
@@ -1,8 +1,14 @@
  """pdfsys-parser-mupdf — text-ok extraction backend.

  Consumes PDFs classified as text-ok by pdfsys-router. Uses PyMuPDF for
- block extraction, simple two-column reading order, and emits Markdown.
- Does NOT depend on pdfsys-layout-analyser.
+ block extraction (``page.get_text("blocks", sort=True)``) and emits
+ Markdown. Does NOT depend on pdfsys-layout-analyser.
  """

+ from __future__ import annotations
+
+ from .extract import extract_doc, extract_doc_bytes
+
  __version__ = "0.0.1"
+
+ __all__ = ["__version__", "extract_doc", "extract_doc_bytes"]
packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py CHANGED
@@ -1 +1,181 @@
- """PyMuPDF extraction entrypoint. Stub only."""
+ """PyMuPDF-based text extraction for the mupdf (text-ok) backend.
+
+ This is the simplest of the three parser backends. It assumes the PDF
+ already has a clean text layer and just needs unwrapping into Markdown —
+ which is why the router routes here only when the XGBoost classifier says
+ ``ocr_prob < threshold``.
+
+ We use ``page.get_text("blocks")``, which returns paragraph-shaped blocks
+ with coordinates already in reading order (PyMuPDF's internal sorting).
+ Each block becomes one :class:`pdfsys_core.Segment` of type
+ :attr:`pdfsys_core.RegionType.TEXT`, with its bbox normalized to ``[0, 1]``.
+ Empty and image-only blocks are dropped.
+
+ No layout-model dependency, no GPU, no OCR — this is the text-ok fast
+ path, and stays that way.
+ """
+
+ from __future__ import annotations
+
+ import hashlib
+ import io
+ from pathlib import Path
+ from typing import Any
+
+ import pymupdf
+
+ from pdfsys_core import (
+     Backend,
+     BBox,
+     ExtractedDoc,
+     RegionType,
+     Segment,
+     merge_segments_to_markdown,
+ )
+
+
+ # PyMuPDF block tuple layout: (x0, y0, x1, y1, text, block_no, block_type).
+ # block_type 0 = text, 1 = image.
+ _TEXT_BLOCK_TYPE = 0
+
+
+ def _sha256_of_file(path: Path) -> str:
+     h = hashlib.sha256()
+     with path.open("rb") as f:
+         for chunk in iter(lambda: f.read(1 << 20), b""):
+             h.update(chunk)
+     return h.hexdigest()
+
+
+ def _sha256_of_bytes(data: bytes) -> str:
+     return hashlib.sha256(data).hexdigest()
+
+
+ def _normalize_text(text: str) -> str:
+     """Trim trailing whitespace and collapse PyMuPDF's soft linebreaks.
+
+     PyMuPDF returns block text with intra-paragraph newlines. For Markdown
+     emission we keep paragraphs on one line; actual paragraph breaks come
+     from the block boundaries themselves.
+     """
+     if not text:
+         return ""
+     # Strip and replace single newlines with spaces while preserving
+     # double-newlines (rare, but occasionally emitted for list items).
+     paragraphs = [p.strip() for p in text.split("\n\n")]
+     joined = "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
+     return joined.strip()
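+
+ # Illustrative (hypothetical input): "Lorem ipsum\ndolor sit\n\n- item"
+ # normalizes to "Lorem ipsum dolor sit\n\n- item".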
+
+
+ def _block_bbox(
+     block: tuple[Any, ...],
+     page_width_pt: float,
+     page_height_pt: float,
+ ) -> BBox | None:
+     """Normalize a PyMuPDF block bbox to ``[0, 1]`` or return None on overflow."""
+     x0, y0, x1, y1 = block[0], block[1], block[2], block[3]
+     if page_width_pt <= 0 or page_height_pt <= 0:
+         return None
+
+     def clamp(v: float) -> float:
+         if v < 0.0:
+             return 0.0
+         if v > 1.0:
+             return 1.0
+         return v
+
+     nx0 = clamp(x0 / page_width_pt)
+     ny0 = clamp(y0 / page_height_pt)
+     nx1 = clamp(x1 / page_width_pt)
+     ny1 = clamp(y1 / page_height_pt)
+     if nx1 <= nx0 or ny1 <= ny0:
+         return None
+     try:
+         return BBox(x0=nx0, y0=ny0, x1=nx1, y1=ny1)
+     except ValueError:
+         return None
+
+
+ def extract_doc(pdf_path: str | Path) -> ExtractedDoc:
+     """Run the mupdf backend on a single PDF file and return its ExtractedDoc."""
+     path = Path(pdf_path)
+     sha256 = _sha256_of_file(path)
+     doc = pymupdf.open(str(path))
+     try:
+         return _extract(doc, sha256)
+     finally:
+         doc.close()
+
+
+ def extract_doc_bytes(pdf_bytes: bytes, sha256: str | None = None) -> ExtractedDoc:
+     """Run the mupdf backend on an in-memory PDF buffer."""
+     sha = sha256 or _sha256_of_bytes(pdf_bytes)
+     doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+     try:
+         return _extract(doc, sha)
+     finally:
+         doc.close()
+
+
+ def _extract(doc: pymupdf.Document, sha256: str) -> ExtractedDoc:
+     segments: list[Segment] = []
+     pages_extracted = 0
+     pages_skipped = 0
+
+     for page_index, page in enumerate(doc):
+         page_width_pt = float(page.rect.width)
+         page_height_pt = float(page.rect.height)
+
+         try:
+             blocks = page.get_text(
+                 "blocks",
+                 flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+                 sort=True,
+             )
+         except Exception:
+             pages_skipped += 1
+             continue
+
+         pages_extracted += 1
+         for block in blocks:
+             # block tuple: (x0, y0, x1, y1, text, block_no, block_type)
+             if len(block) < 7:
+                 continue
+             if block[6] != _TEXT_BLOCK_TYPE:
+                 # image block — mupdf backend doesn't emit IMAGE segments by
+                 # design; image-heavy PDFs should have been routed elsewhere.
+                 continue
+             text = _normalize_text(block[4] or "")
+             if not text:
+                 continue
+             bbox = _block_bbox(block, page_width_pt, page_height_pt)
+             segments.append(
+                 Segment(
+                     index=len(segments),
+                     backend=Backend.MUPDF,
+                     page_index=page_index,
+                     type=RegionType.TEXT,
+                     content=text,
+                     bbox=bbox,
+                     source_region_id=None,
+                 )
+             )
+
+     seg_tuple = tuple(segments)
+     markdown = merge_segments_to_markdown(seg_tuple)
+
+     stats: dict[str, Any] = {
+         "page_count": len(doc),
+         "pages_extracted": pages_extracted,
+         "pages_skipped": pages_skipped,
+         "segment_count": len(seg_tuple),
+         "char_count": len(markdown),
+     }
+
+     return ExtractedDoc(
+         sha256=sha256,
+         backend=Backend.MUPDF,
+         segments=seg_tuple,
+         markdown=markdown,
+         stats=stats,
+     )
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ # External model weights downloaded by pdfsys_router.download_weights.
2
+ # The xgb_classifier.ubj file is FinePDFs IP and should not be
3
+ # committed. Run `python -m pdfsys_router.download_weights` to fetch it.
4
+ xgb_classifier.ubj
packages/pdfsys-router/models/README.md ADDED
@@ -0,0 +1,16 @@
+ # Router model weights
+
+ This directory is where the Stage-A XGBoost classifier weights live on disk.
+
+ The file `xgb_classifier.ubj` (≈ 257 KB) is **not committed** — it's the
+ ported FinePDFs binary classifier weights, owned by HuggingFace. Fetch it
+ once with:
+
+ ```bash
+ python -m pdfsys_router.download_weights
+ ```
+
+ The downloader pulls from
+ `media.githubusercontent.com/media/huggingface/finepdfs/main/blocks/predictor/xgb.ubj`,
+ which is the actual Git-LFS payload (not the pointer file that plain
+ `raw.githubusercontent.com` would return).
packages/pdfsys-router/pyproject.toml CHANGED
@@ -9,6 +9,11 @@ description = "Stage-1 classifier: decides text-ok vs needs-ocr; consults Layout
  requires-python = ">=3.11"
  dependencies = [
      "pdfsys-core",
+     "pymupdf>=1.24",
+     "xgboost>=2.0",
+     "scikit-learn>=1.3",
+     "pandas>=2.0",
+     "numpy>=1.26",
  ]

  [tool.uv.sources]
packages/pdfsys-router/src/pdfsys_router/__init__.py CHANGED
@@ -1,9 +1,27 @@
  """pdfsys-router — two-stage routing for the pdfsys extraction pipeline.

- Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features.
+ Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features, using
+ a ported FinePDFs XGBoost classifier over 124 hand-crafted features.
+
  Stage B (uses layout cache): for needs-ocr, read the LayoutDocument written
  by pdfsys-layout-analyser and decide pipeline vs vlm based on whether
- complex regions (tables / formulas) exist.
+ complex regions (tables / formulas) exist. Stage B is not in the MVP.
  """

+ from __future__ import annotations
+
+ from .classifier import Router, RouterDecision
+ from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+ from .xgb_model import XgbRouterModel, default_weights_path
+
  __version__ = "0.0.1"
+
+ __all__ = [
+     "__version__",
+     "Router",
+     "RouterDecision",
+     "PDFFeatureExtractor",
+     "flatten_per_page_features",
+     "XgbRouterModel",
+     "default_weights_path",
+ ]
packages/pdfsys-router/src/pdfsys_router/classifier.py CHANGED
@@ -1,4 +1,201 @@
- """Stage-A classifier: text-ok vs needs-ocr.
+ """Stage-A classifier: decides text-ok (MUPDF) vs needs-ocr (PIPELINE/VLM).

- Stub only.
+ This is the single public entry point of the router for the MVP. Stage-B
+ (layout-cache driven pipeline-vs-vlm decision) will be added later; for
+ now, anything that needs OCR is routed to ``Backend.PIPELINE`` unless the
+ configured policy says otherwise.
+
+ The classifier is deliberately stateless. It loads the XGBoost model once
+ (lazily) and then exposes ``classify(pdf_path) -> RouterDecision``. No
+ caching, no I/O side effects — pure in, pure out.
  """

+ from __future__ import annotations
+
+ import random
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+ import pymupdf
+
+ from pdfsys_core import Backend, RouterConfig
+
+ from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+ from .xgb_model import XgbRouterModel, default_weights_path
+
+
+ @dataclass(slots=True)
+ class RouterDecision:
+     """Result of running the Stage-A classifier on a single PDF."""
+
+     backend: Backend
+     ocr_prob: float
+     num_pages: int
+     is_form: bool
+     garbled_text_ratio: float
+     is_encrypted: bool
+     needs_password: bool
+     features: dict[str, Any] = field(default_factory=dict)
+     error: str | None = None
+
+     def as_record(self) -> dict[str, Any]:
+         """Flat dict for JSONL emission."""
+         return {
+             "backend": self.backend.value,
+             "ocr_prob": self.ocr_prob,
+             "num_pages": self.num_pages,
+             "is_form": bool(self.is_form),
+             "garbled_text_ratio": float(self.garbled_text_ratio),
+             "is_encrypted": bool(self.is_encrypted),
+             "needs_password": bool(self.needs_password),
+             "error": self.error,
+         }
+
+
+ class Router:
+     """Stage-A router: PyMuPDF features → XGBoost → Backend."""
+
+     def __init__(
+         self,
+         config: RouterConfig | None = None,
+         model_path: str | Path | None = None,
+         num_pages_to_sample: int = 8,
+         ocr_threshold: float = 0.5,
+         seed: int = 42,
+     ) -> None:
+         self.config = config or RouterConfig()
+         self.num_pages_to_sample = num_pages_to_sample
+         self.ocr_threshold = ocr_threshold
+         self.seed = seed
+         self._extractor = PDFFeatureExtractor(
+             num_chunks=1, num_pages_to_sample=num_pages_to_sample
+         )
+         self._model = XgbRouterModel(model_path or default_weights_path())
+
+     # ------------------------------------------------------------------ api
+
+     def classify(self, pdf_path: str | Path) -> RouterDecision:
+         """Classify a PDF file. Never raises — errors are in ``decision.error``."""
+         path = Path(pdf_path)
+         try:
+             doc = pymupdf.open(str(path))
+         except Exception as e:  # noqa: BLE001 — we want to capture anything
+             return RouterDecision(
+                 backend=Backend.DEFERRED,
+                 ocr_prob=float("nan"),
+                 num_pages=0,
+                 is_form=False,
+                 garbled_text_ratio=0.0,
+                 is_encrypted=False,
+                 needs_password=False,
+                 error=f"open_failed: {e}",
+             )
+
+         try:
+             return self._classify_doc(doc)
+         finally:
+             try:
+                 doc.close()
+             except Exception:
+                 pass
+
+     def classify_bytes(self, pdf_bytes: bytes) -> RouterDecision:
+         """Same as :meth:`classify`, but from an in-memory buffer."""
+         import io
+
+         try:
+             doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+         except Exception as e:  # noqa: BLE001
+             return RouterDecision(
+                 backend=Backend.DEFERRED,
+                 ocr_prob=float("nan"),
+                 num_pages=0,
+                 is_form=False,
+                 garbled_text_ratio=0.0,
+                 is_encrypted=False,
+                 needs_password=False,
+                 error=f"open_failed: {e}",
+             )
+         try:
+             return self._classify_doc(doc)
+         finally:
+             try:
+                 doc.close()
+             except Exception:
+                 pass
+
+     # --------------------------------------------------------------- internal
+
+     def _classify_doc(self, doc: pymupdf.Document) -> RouterDecision:
+         # Seed the sampling RNGs so the same PDF always produces the same
+         # feature vector — critical for reproducibility and debugging.
+         random.seed(self.seed)
+         np.random.seed(self.seed)
+
+         try:
+             if doc.is_encrypted or doc.needs_pass:
+                 return RouterDecision(
+                     backend=Backend.DEFERRED,
+                     ocr_prob=float("nan"),
+                     num_pages=len(doc),
+                     is_form=False,
+                     garbled_text_ratio=0.0,
+                     is_encrypted=bool(doc.is_encrypted),
+                     needs_password=bool(doc.needs_pass),
+                     error="encrypted_or_password_protected",
+                 )
+
+             raw_chunks = self._extractor.extract_all_features(doc)
+             if not raw_chunks:
+                 return RouterDecision(
+                     backend=Backend.DEFERRED,
+                     ocr_prob=float("nan"),
+                     num_pages=len(doc),
+                     is_form=False,
+                     garbled_text_ratio=0.0,
+                     is_encrypted=False,
+                     needs_password=False,
+                     error="no_pages_sampled",
+                 )
+
+             flat = flatten_per_page_features(
+                 raw_chunks[0], sample_to_k_page_features=self.num_pages_to_sample
+             )
+             ocr_prob = self._model.predict_proba(flat)
+
+             backend = self._route(ocr_prob)
+             return RouterDecision(
+                 backend=backend,
+                 ocr_prob=ocr_prob,
+                 num_pages=len(doc),
+                 is_form=bool(flat.get("is_form", False)),
+                 garbled_text_ratio=float(flat.get("garbled_text_ratio", 0.0)),
+                 is_encrypted=bool(doc.is_encrypted),
+                 needs_password=bool(doc.needs_pass),
+                 features=flat,
+             )
+         except Exception as e:  # noqa: BLE001
+             return RouterDecision(
+                 backend=Backend.DEFERRED,
+                 ocr_prob=float("nan"),
+                 num_pages=len(doc) if doc else 0,
+                 is_form=False,
+                 garbled_text_ratio=0.0,
+                 is_encrypted=False,
+                 needs_password=False,
+                 error=f"classify_failed: {e}",
+             )
+
+     def _route(self, ocr_prob: float) -> Backend:
+         """Map XGBoost probability + fleet policy → concrete Backend."""
+         if ocr_prob < self.ocr_threshold:
+             return Backend.MUPDF
+         # OCR needed. Stage-B would check LayoutCache for complex content
+         # here. For the MVP we have no layout cache yet, so honour the
+         # fleet VLM gate: if VLM is enabled we'd need Stage-B to decide,
+         # otherwise pipeline handles everything flagged as scanned.
+         if self.config.vlm_enabled:
+             return Backend.DEFERRED  # Stage-B will run once layout is cached
+         return Backend.PIPELINE
packages/pdfsys-router/src/pdfsys_router/download_weights.py ADDED
@@ -0,0 +1,52 @@
+ """Fetch the FinePDFs XGBoost router weights from upstream.
+
+ The weights file (``xgb.ubj``, ~257 KB) is not committed to this repo —
+ it's external IP owned by HuggingFace/FinePDFs and lives on their Git-LFS
+ bucket. Running this module downloads it once into ``models/xgb_classifier.ubj``
+ next to this package.
+
+ Usage::
+
+     python -m pdfsys_router.download_weights
+ """
+
+ from __future__ import annotations
+
+ import sys
+ import urllib.request
+ from pathlib import Path
+
+ # media.githubusercontent.com serves the actual LFS payload directly,
+ # bypassing the pointer file that raw.githubusercontent.com returns.
+ WEIGHTS_URL = (
+     "https://media.githubusercontent.com/media/huggingface/finepdfs/main/"
+     "blocks/predictor/xgb.ubj"
+ )
+
+
+ def target_path() -> Path:
+     return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
+
+
+ def download(force: bool = False) -> Path:
+     dst = target_path()
+     if dst.exists() and not force:
+         print(f"[download_weights] already present: {dst}")
+         return dst
+     dst.parent.mkdir(parents=True, exist_ok=True)
+     print(f"[download_weights] fetching {WEIGHTS_URL}")
+     with urllib.request.urlopen(WEIGHTS_URL) as r:  # noqa: S310 — pinned URL
+         data = r.read()
+     if len(data) < 10_000:
+         raise RuntimeError(
+             f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+             "likely an LFS pointer, not the binary"
+         )
+     dst.write_bytes(data)
+     print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+     return dst
+
+
+ if __name__ == "__main__":
+     force = "--force" in sys.argv
+     download(force=force)
packages/pdfsys-router/src/pdfsys_router/feature_extractor.py ADDED
@@ -0,0 +1,484 @@
+ """PyMuPDF-only feature extractor for the Stage-A router classifier.
+
+ Ported verbatim (modulo stylistic cleanup and removal of datatrove imports)
+ from FinePDFs' ``blocks/predictor/ocr_predictor.py``:
+
+ https://github.com/huggingface/finepdfs/blob/main/blocks/predictor/ocr_predictor.py
+
+ The goal is bit-exact feature compatibility with the upstream XGBoost
+ ``xgb.ubj`` weights. If you touch anything in here, run the parity harness
+ in ``pdfsys-bench`` against FinePDFs' reference output first.
+
+ The extractor samples up to ``num_pages_to_sample`` pages at random, then
+ computes:
+
+ * 4 doc-level features: ``num_pages_successfully_sampled``,
+   ``garbled_text_ratio``, ``is_form``, ``creator_or_producer_is_known_scanner``.
+ * 15 page-level features × 8 sampled pages = 120 features.
+
+ :func:`flatten_per_page_features` produces the flat 124-feature dict the
+ XGBoost model expects, in the exact column order of ``feature_names_in_``.
+ """
+
+ from __future__ import annotations
+
+ import random
+ from collections import Counter
+ from typing import Any
+
+ import numpy as np
+ import pymupdf
+
+
+ # Keep this list in sync with FinePDFs upstream. These strings are
+ # lowercased substring-matched against PDF metadata creator/producer to
+ # flag scanner-origin PDFs, which almost always need OCR.
+ KNOWN_SCANNER_STRINGS: tuple[str, ...] = (
+     "scanner",
+     "scan",
+     "epson",
+     "hp scanjet",
+     "canon",
+     "fujitsu",
+     "kodak",
+     "brother",
+     "xerox",
+     "lexmark",
+     "kmc",
+     "kofax",
+     "ricoh",
+     "iris",
+     "capturedocument",
+     "paperport",
+     "readiris",
+     "simpleocr",
+ )
+
+ # Strip-merge tuning constants — used to coalesce image slices that some
+ # PDFs explode into dozens of thin rectangles, so we don't overcount.
+ JUNK_IMAGE_THRESHOLD_RATIO = 0.5
+ JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD = 3
+ MERGE_MAX_OFFSET = 5
+ MERGE_MAX_GAP = 2
+
+
+ def flatten_per_page_features(
+     feature_dict_sample: dict[str, Any],
+     sample_to_k_page_features: int = 8,
+ ) -> dict[str, Any]:
+     """Flatten a nested feature dict into the flat schema XGBoost expects.
+
+     The XGBoost model was trained on a 124-column DataFrame whose columns
+     are, in order:
+
+         num_pages_successfully_sampled
+         garbled_text_ratio
+         is_form
+         creator_or_producer_is_known_scanner
+         page_level_unique_font_counts_page1
+         ...
+         page_level_vector_graphics_obj_count_page8
+
+     If fewer than 8 pages were actually sampled, pages are resampled with
+     replacement to pad the vector — this matches the upstream behavior.
+     Seed numpy before calling this function if you need determinism.
+     """
+     flattened: dict[str, Any] = {}
+
+     doc_level_features = (
+         "num_pages_successfully_sampled",
+         "num_unique_image_xrefs",
+         "num_junk_image_xrefs",
+         "garbled_text_ratio",
+         "is_form",
+         "creator_or_producer_is_known_scanner",
+         "class",
+     )
+
+     used_keys: set[str] = set()
+
+     for key in doc_level_features:
+         if key in feature_dict_sample:
+             flattened[key] = feature_dict_sample[key]
+             used_keys.add(key)
+
+     page_level_features = (
+         "page_level_unique_font_counts",
+         "page_level_char_counts",
+         "page_level_text_box_counts",
+         "page_level_avg_text_box_lengths",
+         "page_level_text_area_ratios",
+         "page_level_hidden_char_counts",
+         "page_level_hidden_text_box_counts",
+         "page_level_hidden_avg_text_box_lengths",
+         "page_level_hidden_text_area_ratios",
+         "page_level_image_counts",
+         "page_level_non_junk_image_counts",
+         "page_level_bitmap_proportions",
+         "page_level_max_merged_strip_areas",
+         "page_level_drawing_strokes_count",
+         "page_level_vector_graphics_obj_count",
+     )
+
+     num_pages = len(feature_dict_sample["page_level_unique_font_counts"])
+     page_indices = list(range(num_pages))
+     # If we don't have enough pages, resample random pages. Upstream uses
+     # np.random.choice here, so seed numpy if determinism matters.
+     if num_pages < sample_to_k_page_features:
+         extra = np.random.choice(
+             num_pages, sample_to_k_page_features - num_pages, replace=True
+         ).tolist()
+         page_indices += extra
+
+     for key in page_level_features:
+         list_data = feature_dict_sample.get(key)
+         if list_data is None:
+             continue
+         for page_idx, ind in enumerate(page_indices):
+             flattened[f"{key}_page{page_idx + 1}"] = list_data[ind]
+         used_keys.add(key)
+
+     return flattened
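+
+ # Illustrative: for the page-level key "page_level_char_counts" this yields
+ # columns page_level_char_counts_page1 … page_level_char_counts_page8.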
+
+
+ class PDFFeatureExtractor:
+     """PyMuPDF feature extraction. Pure — no I/O, no network, no state."""
+
+     def __init__(self, num_pages_to_sample: int = 8, num_chunks: int = 1) -> None:
+         if not isinstance(num_pages_to_sample, int):
+             raise ValueError("num_pages_to_sample must be an integer.")
+         self.num_pages_to_sample = num_pages_to_sample
+         self.num_chunks = num_chunks
+
+     # --------------------------------------------------------------- sampling
+
+     def _get_sampled_page_indices(self, doc: pymupdf.Document) -> list[list[int]]:
+         total_pages = len(doc)
+         if total_pages == 0 or self.num_pages_to_sample <= 0:
+             return []
+
+         available = list(range(total_pages))
+         sampled: list[list[int]] = []
+
+         if self.num_chunks == -1:
+             num_chunks = len(available) // self.num_pages_to_sample + 1
+         else:
+             num_chunks = self.num_chunks
+
+         for _ in range(num_chunks):
+             if not available:
+                 break
+             chunk_size = min(self.num_pages_to_sample, len(available))
+             chunk = random.sample(available, chunk_size)
+             for idx in chunk:
+                 available.remove(idx)
+             sampled.append(sorted(chunk))
+
+         return sampled
+
+     # ----------------------------------------------------------- doc-level
+
+     def _get_garbled_text_per_page(
+         self, doc: pymupdf.Document
+     ) -> tuple[list[int], list[int]]:
+         all_text: list[int] = []
+         garbled_text: list[int] = []
+         replacement = chr(0xFFFD)
+         for page in doc:
+             text = page.get_text(
+                 "text",
+                 flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+             )
+             all_text.append(len(text))
+             garbled_text.append(text.count(replacement))
+         return all_text, garbled_text
+
+     def _check_creator_producer_scanner(self, doc: pymupdf.Document) -> bool:
+         metadata = doc.metadata or {}
+         creator = (metadata.get("creator") or "").lower()
+         producer = (metadata.get("producer") or "").lower()
+         for keyword in KNOWN_SCANNER_STRINGS:
+             if keyword in creator or keyword in producer:
+                 return True
+         return False
+
+     def _extract_document_level_stats_from_sampled_pages(
+         self, doc: pymupdf.Document, sampled_page_indices: list[int]
+     ) -> dict[str, Any]:
+         """Identify junk images (same xref repeated on most sampled pages)."""
+         stats: dict[str, Any] = {"junk_image_xrefs_list": []}
+
+         if not sampled_page_indices:
+             return stats
+
+         all_instances: list[int] = []
+         per_page: dict[int, set[int]] = {}
+         for page_idx in sampled_page_indices:
+             try:
+                 page = doc.load_page(page_idx)
+                 unique_xrefs: set[int] = set()
+                 for img_def in page.get_images(full=False):
+                     xref = img_def[0]
+                     if xref == 0:
+                         continue
+                     unique_xrefs.add(xref)
+                     all_instances.append(xref)
+                 per_page[page_idx] = unique_xrefs
+             except Exception:
+                 per_page[page_idx] = set()
+
+         if not all_instances:
+             return stats
+
+         stats["num_unique_image_xrefs"] = len(set(all_instances))
+
+         xref_page_counts: Counter[int] = Counter()
+         for page_xrefs in per_page.values():
+             xref_page_counts.update(page_xrefs)
+
+         num_sampled = len(sampled_page_indices)
+         # Upstream overrides the ratio check and requires an xref to be on
+         # every sampled page to be flagged as junk — matches FinePDFs.
+         min_threshold = num_sampled
+
+         junk_list: list[int] = []
+         if num_sampled >= JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD:
+             for xref, count in xref_page_counts.items():
+                 if count >= min_threshold:
+                     junk_list.append(xref)
+
+         stats["num_junk_image_xrefs"] = len(junk_list)
+         stats["junk_image_xrefs_list"] = junk_list
+         return stats
+
+     # ------------------------------------------------------------- imaging
+
+     def _heuristic_merge_image_strips_on_page(
+         self,
+         single_page_image_list: list[list[Any]],
+         page_width: float,
+         page_height: float,
+     ) -> list[list[Any]]:
+         if not single_page_image_list:
+             return []
+
+         deduped: list[list[Any]] = []
+         seen: set[tuple[float, float, float, float]] = set()
+         for img_data in single_page_image_list:
+             key = (img_data[0], img_data[1], img_data[2], img_data[3])
+             if key not in seen:
+                 seen.add(key)
+                 deduped.append(img_data)
+         if not deduped:
+             return []
+
+         deduped.sort(key=lambda img: (img[1], img[0]))
+         merged: list[list[Any]] = [deduped[0]]
+
+         for img in deduped[1:]:
+             x0, y0, x1, y1, imgid = img
+             last = merged[-1]
+             lx0, ly0, lx1, ly1, _ = last
+
+             cur_w = abs(x1 - x0)
+             cur_h = abs(y1 - y0)
+             full_w = page_width > 0 and cur_w >= page_width * 0.9
+             full_h = page_height > 0 and cur_h >= page_height * 0.9
+
+             can_merge = False
+             if full_w:
+                 if (
+                     abs(lx0 - x0) <= MERGE_MAX_OFFSET
+                     and abs(lx1 - x1) <= MERGE_MAX_OFFSET
+                     and abs(y0 - ly1) <= MERGE_MAX_GAP
+                 ):
+                     can_merge = True
+             if not can_merge and full_h:
+                 if (
+                     abs(ly0 - y0) <= MERGE_MAX_OFFSET
+                     and abs(ly1 - y1) <= MERGE_MAX_OFFSET
+                     and abs(x0 - lx1) <= MERGE_MAX_GAP
+                 ):
+                     can_merge = True
+
+             if can_merge:
+                 merged[-1] = [
+                     min(x0, lx0),
+                     min(y0, ly0),
+                     max(x1, lx1),
+                     max(y1, ly1),
+                     imgid,
+                 ]
+             else:
+                 merged.append(img)
+
+         return merged
+
+     # ---------------------------------------------------------------- main
+
+     def compute_features_per_chunk(
+         self, doc: pymupdf.Document, sampled_page_indices: list[int]
+     ) -> dict[str, Any]:
+         features: dict[str, Any] = {
+             "is_form": False,
+             "creator_or_producer_is_known_scanner": False,
+             "garbled_text_ratio": 0,
+             "page_level_unique_font_counts": [],
+             "page_level_char_counts": [],
+             "page_level_text_box_counts": [],
+             "page_level_avg_text_box_lengths": [],
+             "page_level_text_area_ratios": [],
+             "page_level_hidden_char_counts": [],
+             "page_level_hidden_text_box_counts": [],
+             "page_level_hidden_avg_text_box_lengths": [],
+             "page_level_hidden_text_area_ratios": [],
+             "page_level_image_counts": [],
+             "page_level_non_junk_image_counts": [],
+             "page_level_bitmap_proportions": [],
+             "page_level_max_merged_strip_areas": [],
+             "page_level_drawing_strokes_count": [],
+             "page_level_vector_graphics_obj_count": [],
+             "num_pages_successfully_sampled": 0,
+             "num_pages_requested_for_sampling": 0,
+             "sampled_page_indices": [],
+         }
+
+         features["num_pages_requested_for_sampling"] = len(sampled_page_indices)
+         if not sampled_page_indices:
+             return features
+
+         doc_stats = self._extract_document_level_stats_from_sampled_pages(
+             doc, sampled_page_indices
+         )
+         junk_xrefs: set[int] = set(doc_stats.get("junk_image_xrefs_list", []))
+
+         features["is_form"] = bool(doc.is_form_pdf) if doc.is_form_pdf is not None else False
+         features["creator_or_producer_is_known_scanner"] = self._check_creator_producer_scanner(doc)
+
+         # Garbled text: U+FFFD replacement character / total chars. Computed
+         # over ALL pages, but the rate reported to XGBoost is restricted to
+         # the sampled pages (upstream semantics).
+         all_text, garbled_text = self._get_garbled_text_per_page(doc)
+         all_sum = sum(all_text)
+         garb_sum = sum(garbled_text)
+         features["global_garbled_text_ratio"] = 0 if all_sum == 0 else garb_sum / all_sum
+
+         sampled_garb = sum(garbled_text[i] for i in sampled_page_indices)
+         sampled_all = sum(all_text[i] for i in sampled_page_indices)
+         features["garbled_text_ratio"] = 0 if sampled_all == 0 else sampled_garb / sampled_all
+
+         for page_idx in sampled_page_indices:
+             try:
+                 page = doc.load_page(page_idx)
+             except Exception:
+                 continue
+
+             features["sampled_page_indices"].append(page_idx)
+             features["num_pages_successfully_sampled"] += 1
+
+             page_rect = page.rect
+             page_area = float(page_rect.width * page_rect.height) or 1.0
+
+             # --- Fonts ---
+             fonts: set[str] = set()
+             try:
+                 for fi in page.get_fonts(full=True):
+                     if len(fi) > 3 and fi[3]:
+                         fonts.add(fi[3])
+             except Exception:
+                 pass
+             features["page_level_unique_font_counts"].append(len(fonts))
+
+             # --- Visible vs hidden text via texttrace ---
+             char_count = 0
+             text_area = 0.0
+             text_boxes = 0
+             hidden_chars = 0
+             hidden_area = 0.0
+             hidden_boxes = 0
+             try:
+                 for tr in page.get_texttrace():
+                     n = len(tr.get("chars", []))
+                     bbox = tr.get("bbox", (0, 0, 0, 0))
+                     box_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
+                     if tr.get("type") == 3 or tr.get("opacity", 1.0) == 0:
+                         hidden_chars += n
+                         hidden_area += box_area
+                         hidden_boxes += 1
+                     else:
+                         char_count += n
+                         text_area += box_area
+                         text_boxes += 1
+             except Exception:
+                 pass
+
+             features["page_level_char_counts"].append(char_count)
+             features["page_level_text_box_counts"].append(text_boxes)
+             features["page_level_avg_text_box_lengths"].append(
+                 text_area / text_boxes if text_boxes else 0.0
+             )
+             features["page_level_text_area_ratios"].append(text_area / page_area)
+             features["page_level_hidden_char_counts"].append(hidden_chars)
+             features["page_level_hidden_text_box_counts"].append(hidden_boxes)
+             features["page_level_hidden_avg_text_box_lengths"].append(
+                 hidden_area / hidden_boxes if hidden_boxes else 0.0
+             )
+             features["page_level_hidden_text_area_ratios"].append(hidden_area / page_area)
+
+             # --- Images ---
+             total_imgs = 0
+             non_junk_imgs = 0
+             non_junk_rects: list[list[Any]] = []
+             try:
+                 for img_def in page.get_images(full=False):
+                     xref = img_def[0]
+                     if xref == 0:
+                         continue
+                     rects = page.get_image_rects(xref, transform=False)
+                     total_imgs += len(rects)
+                     if xref not in junk_xrefs:
+                         non_junk_imgs += len(rects)
+                         for r in rects:
+                             if r.is_empty or r.is_infinite:
+                                 continue
+                             non_junk_rects.append([r.x0, r.y0, r.x1, r.y1, xref])
+             except Exception:
+                 pass
+
+             features["page_level_image_counts"].append(total_imgs)
+             features["page_level_non_junk_image_counts"].append(non_junk_imgs)
+
+             merged = self._heuristic_merge_image_strips_on_page(
+                 non_junk_rects, page_rect.width, page_rect.height
+             )
+             strip_areas = [abs(b[2] - b[0]) * abs(b[3] - b[1]) for b in merged]
+             if strip_areas:
+                 features["page_level_max_merged_strip_areas"].append(max(strip_areas) / page_area)
+                 features["page_level_bitmap_proportions"].append(sum(strip_areas) / page_area)
+             else:
+                 features["page_level_max_merged_strip_areas"].append(0.0)
+                 features["page_level_bitmap_proportions"].append(0.0)
+
+             # --- Drawings / vector graphics ---
+             stroke_count = 0
+             vector_objs = 0
+             try:
+                 drawings = page.get_cdrawings()
+                 vector_objs = len(drawings)
+                 for path in drawings:
+                     for item in path.get("items", []):
+                         if item[0] in ("l", "c", "q"):
+                             stroke_count += 1
+                     if path.get("rect") or path.get("quad"):
+                         if path.get("stroke_opacity", 1) > 0 and path.get("color"):
+                             stroke_count += 1
+             except Exception:
+                 pass
+             features["page_level_drawing_strokes_count"].append(stroke_count)
+             features["page_level_vector_graphics_obj_count"].append(vector_objs)
+
+         return features
+
+     def extract_all_features(self, doc: pymupdf.Document) -> list[dict[str, Any]]:
+         chunks = self._get_sampled_page_indices(doc)
+         return [self.compute_features_per_chunk(doc, c) for c in chunks]
packages/pdfsys-router/src/pdfsys_router/xgb_model.py ADDED
@@ -0,0 +1,66 @@
+ """Thin loader around the FinePDFs XGBoost ``xgb.ubj`` weights.
+
+ The model is a binary classifier where class 1 = "needs OCR" (scanned /
+ garbled / image-heavy / form). It takes a 124-column feature vector whose
+ column order is fixed by :func:`feature_extractor.flatten_per_page_features`.
+
+ We keep the loader tiny on purpose: the calibration between feature layout
+ and column order lives entirely in ``feature_extractor.py`` — this file
+ only knows "give me a dict of features, I'll give you a probability".
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import numpy as np
+ import pandas as pd
+ from xgboost import XGBClassifier
+
+
+ class XgbRouterModel:
+     """Lazy-loading wrapper around an ``xgb.ubj`` binary classifier."""
+
+     def __init__(self, path_to_model: str | Path) -> None:
+         self.path_to_model = Path(path_to_model)
+         self._model: XGBClassifier | None = None
+
+     @property
+     def model(self) -> XGBClassifier:
+         if self._model is None:
+             if not self.path_to_model.is_file():
+                 raise FileNotFoundError(
+                     f"XGBoost weights not found at {self.path_to_model}. "
+                     "Run `python -m pdfsys_router.download_weights` to fetch them."
+                 )
+             m = XGBClassifier()
+             m.load_model(str(self.path_to_model))
+             self._model = m
+         return self._model
+
+     def predict_proba(self, features: dict[str, float]) -> float:
+         """Return P(class=1, i.e. needs OCR)."""
+         df = pd.DataFrame([features])
+         # Column ordering must match the training schema — realign using
+         # the model's recorded feature_names_in_ when available.
+         names = getattr(self.model, "feature_names_in_", None)
+         if names is not None:
+             df = df.reindex(columns=list(names), fill_value=0)
+         probs = self.model.predict_proba(df)
+         return float(probs[0][1])
+
+     @property
+     def feature_names(self) -> list[str]:
+         names = getattr(self.model, "feature_names_in_", None)
+         if names is None:
+             return []
+         return list(names)
+
+     @property
+     def n_features(self) -> int:
+         return int(getattr(self.model, "n_features_in_", 0))
+
+
+ def default_weights_path() -> Path:
+     """Return the canonical on-disk location of the bundled weights."""
+     return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
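+
+ # Illustrative usage (flat_features comes from flatten_per_page_features):
+ #
+ #     model = XgbRouterModel(default_weights_path())
+ #     ocr_prob = model.predict_proba(flat_features)  # P(needs OCR), in [0, 1]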