feat(mvp): wire router → mupdf parser → OCR quality scorer closed loop
Ship the first end-to-end cut of the pdfsys pipeline on OmniDocBench-100:
* pdfsys-router: port FinePDFs PDFFeatureExtractor (15 page features × 8
sampled pages + 4 doc features = 124 columns) and load the upstream
xgb.ubj weights via a thin XgbRouterModel wrapper. Router.classify()
returns a RouterDecision with Backend {MUPDF, PIPELINE, VLM, DEFERRED},
ocr_prob, and the full feature dict for debugging. Seeded RNG keeps the
feature vector reproducible per PDF. Weights live under models/ and are
gitignored; download_weights.py fetches them from the huggingface/finepdfs
repo's GitHub LFS media URL.
* pdfsys-parser-mupdf: text-ok backend built on page.get_text("blocks",
sort=True). Emits one Segment per paragraph-shaped block, with bboxes
normalized to [0, 1], and merges the whole doc into an ExtractedDoc
with Markdown. No layout-analyser dependency by design.
* pdfsys-bench: add quality.py (ModernBERT-large regression head from
HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn, loaded in
bfloat16 with max_tokens=512 to fit a 4 GB RAM dev box), loop.py
(router → parser → scorer → JSONL runner), and a __main__ CLI; a minimal
wiring sketch follows this list.
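For orientation, here is a minimal sketch of what the loop does per document,
using only the APIs introduced in this change (run_loop wraps this same
sequence with timing and JSONL bookkeeping; the input directory is
illustrative):

```python
# Hedged sketch of the closed loop. Router, extract_doc and
# OcrQualityScorer are the APIs added in this commit; the input
# directory below is illustrative.
from pathlib import Path

from pdfsys_bench import OcrQualityScorer
from pdfsys_core import Backend
from pdfsys_parser_mupdf import extract_doc
from pdfsys_router import Router

router = Router(ocr_threshold=0.5)   # lazily loads models/xgb_classifier.ubj
scorer = OcrQualityScorer()          # ModernBERT head, loaded on first score()

for pdf in sorted(Path("packages/pdfsys-bench/omnidocbench_100/pdfs").glob("*.pdf")):
    decision = router.classify(pdf)  # never raises; failures land in decision.error
    if decision.backend is not Backend.MUPDF:
        continue                     # PIPELINE / VLM / DEFERRED: recorded, not extracted
    extracted = extract_doc(pdf)     # Segments + merged Markdown
    quality = scorer.score(extracted.markdown)  # float clamped to [0, 3]
    print(pdf.name, f"ocr_prob={decision.ocr_prob:.3f}", f"quality={quality.score:.2f}")
```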
End-to-end run on the full 100-doc OmniDocBench subset:
* 70 routed to MUPDF (avg ocr_prob 0.034), 30 routed to PIPELINE
(avg ocr_prob 0.634)
* 70 extracted + quality scored, 0 errors
* avg quality 1.71, wall clock 259 s
* per-doc: router 49 ms, extract 7 ms, quality 3.6 s
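The per-doc figures are consistent with the wall clock: quality scoring
dominates at 70 × 3.6 s ≈ 252 s, while routing (100 × 49 ms ≈ 5 s) and
extraction (70 × 7 ms ≈ 0.5 s) account for nearly all of the remaining
time in the 259 s total.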
Stage-B (LayoutCache-driven pipeline-vs-vlm decision) and the PIPELINE
and VLM parser backends are out of scope for this MVP.
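To reproduce the smoke run, the new CLI is the entry point (invocation as
documented in the __main__ docstring; the output path is illustrative):

```bash
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl
```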
- .gitignore +5 -0
- packages/pdfsys-bench/README.md +88 -0
- packages/pdfsys-bench/pyproject.toml +6 -0
- packages/pdfsys-bench/src/pdfsys_bench/__init__.py +18 -3
- packages/pdfsys-bench/src/pdfsys_bench/__main__.py +98 -0
- packages/pdfsys-bench/src/pdfsys_bench/loop.py +200 -0
- packages/pdfsys-bench/src/pdfsys_bench/quality.py +148 -0
- packages/pdfsys-parser-mupdf/pyproject.toml +1 -0
- packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py +8 -2
- packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py +181 -1
- packages/pdfsys-router/models/.gitignore +4 -0
- packages/pdfsys-router/models/README.md +16 -0
- packages/pdfsys-router/pyproject.toml +5 -0
- packages/pdfsys-router/src/pdfsys_router/__init__.py +20 -2
- packages/pdfsys-router/src/pdfsys_router/classifier.py +199 -2
- packages/pdfsys-router/src/pdfsys_router/download_weights.py +52 -0
- packages/pdfsys-router/src/pdfsys_router/feature_extractor.py +484 -0
- packages/pdfsys-router/src/pdfsys_router/xgb_model.py +66 -0
--- a/.gitignore
+++ b/.gitignore
@@ -16,11 +16,16 @@ uv.lock
 # local pipeline scratch
 work/
 output/
+out/
 .cache/
 samples/
 bench_data/
 *.layout.json
 
+# bench datasets — large binary corpora, distributed out of band
+packages/pdfsys-bench/omnidocbench_100/
+packages/pdfsys-bench/olmocr_bench_50/
+
 # models / weights (too big for git)
 models/
 *.onnx
--- /dev/null
+++ b/packages/pdfsys-bench/README.md
@@ -0,0 +1,88 @@
+# bench/ — PDF processing pipeline evaluation set
+
+This directory is the **canonical test set** for evaluating the end-to-end PDF
+processing pipeline (layout → OCR → markdown / structured text). It bundles
+two complementary, pre-sampled subsets so that runs are reproducible and
+cheap to iterate on.
+
+| Subset | PDFs | Source benchmark | Focus |
+|---|---:|---|---|
+| [`olmocr_bench_50/`](./olmocr_bench_50) | 50 | [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Fine-grained unit tests on text presence / absence, reading order, tables, math |
+| [`omnidocbench_100/`](./omnidocbench_100) | 100 | [OmniDocBench](https://github.com/opendatalab/OmniDocBench) | Holistic document-level eval with layout / language / special-issue coverage |
+
+Total footprint: ~108 MB, 150 PDFs.
+
+## Subset details
+
+### `olmocr_bench_50/`
+Stratified sample drawn from the 1,403-PDF olmOCR-bench with the script
+`scripts/sample_olmocr_subset.py` (seed `20260411`). Covers all 7 document
+sources with a minimum floor of 3 PDFs per category plus largest-remainder
+proportional allocation, and diversifies by source document inside each
+category (at most one page per arXiv paper / scan ID before any repeat).
+
+```
+olmocr_bench_50/
+├── pdfs/
+│   ├── arxiv_math/       (14)
+│   ├── headers_footers/  (8)
+│   ├── long_tiny_text/   (4)
+│   ├── multi_column/     (8)
+│   ├── old_scans/        (5)
+│   ├── old_scans_math/   (4)
+│   └── tables/           (7)
+├── subset_tests.jsonl    # 283 olmOCR-bench unit tests for these 50 PDFs
+└── subset_manifest.json  # seed, quotas, selected file list, source bench_dir
+```
+
+The `subset_tests.jsonl` file is a filtered copy of the original per-category
+`*.jsonl` test files merged into one; each row keeps the exact schema used by
+the upstream olmOCR-bench evaluator (`pdf`, `type`, `max_diffs`, `checked`,
+and type-specific fields like `math`, `cell`, `before`/`after`, …).
+
+Regenerate or resize:
+```bash
+python3 scripts/sample_olmocr_subset.py --target 50              # default → bench/olmocr_bench_50
+python3 scripts/sample_olmocr_subset.py --target 100 --seed 42   # alt subset
+python3 scripts/sample_olmocr_subset.py --dry-run                # plan only
+```
+
+### `omnidocbench_100/`
+Pre-built 100-PDF subset of OmniDocBench v2 with full stratified coverage
+across every categorical axis in the upstream dataset.
+
+```
+omnidocbench_100/
+├── pdfs/                  # 100 single-page PDFs
+├── img/                   # matching rendered JPGs (1 per PDF)
+├── subset_100.json        # full OmniDocBench annotations for the 100 samples
+├── subset_100_stats.json  # coverage & distribution stats vs. full 981-doc set
+├── subset_100_pdfs.txt    # flat list of selected PDF filenames
+└── subset_100_images.txt  # flat list of selected image filenames
+```
+
+Coverage (from `subset_100_stats.json`) — every bucket of every axis is hit:
+- **data_source** 9/9 · **language** 3/3 · **layout** 5/5
+- **special_issue** 13/13 · **stratum** 67/67
+
+## Using the bench
+
+These two subsets are intended to be run as a pair — olmOCR-bench gives you
+sharp per-feature pass/fail signals and OmniDocBench gives you an aggregate
+quality score across real-world document types. For each new pipeline
+version, run both subsets, record per-subset metrics, and diff against the
+previous run.
+
+Common entry points (to be wired up by the pipeline evaluator):
+
+```text
+bench/olmocr_bench_50/pdfs/**/*.pdf          # inputs
+bench/olmocr_bench_50/subset_tests.jsonl     # ground truth unit tests
+
+bench/omnidocbench_100/pdfs/*.pdf            # inputs
+bench/omnidocbench_100/subset_100.json       # ground truth annotations
+```
+
+Do **not** manually edit files under `bench/`. Regenerate with the sampling
+script (for olmocr) or re-export from the upstream builder (for omnidoc) so
+results stay reproducible.
--- a/packages/pdfsys-bench/pyproject.toml
+++ b/packages/pdfsys-bench/pyproject.toml
@@ -9,10 +9,16 @@ description = "Cross-backend benchmarking — throughput, latency, and F1 on a s
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pdfsys-router",
+    "pdfsys-parser-mupdf",
+    "torch>=2.1",
+    "transformers>=4.44",
 ]
 
 [tool.uv.sources]
 pdfsys-core = { workspace = true }
+pdfsys-router = { workspace = true }
+pdfsys-parser-mupdf = { workspace = true }
 
 [tool.hatch.build.targets.wheel]
 packages = ["src/pdfsys_bench"]
--- a/packages/pdfsys-bench/src/pdfsys_bench/__init__.py
+++ b/packages/pdfsys-bench/src/pdfsys_bench/__init__.py
@@ -1,7 +1,22 @@
-"""pdfsys-bench — evaluation harness.
+"""pdfsys-bench — evaluation harness and MVP closed-loop runner.
 
-Runs
-
+Runs a PDF directory through router → parser → OCR-quality scorer and
+writes one JSONL row per PDF. This is the minimal end-to-end harness; a
+richer benchmark (throughput, F1 against gold Markdown, cross-backend
+comparison) will layer on top of it.
 """
 
+from __future__ import annotations
+
+from .loop import LoopResult, run_loop
+from .quality import OcrQualityScorer, QualityScore
+
 __version__ = "0.0.1"
+
+__all__ = [
+    "__version__",
+    "LoopResult",
+    "run_loop",
+    "OcrQualityScorer",
+    "QualityScore",
+]
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/__main__.py
@@ -0,0 +1,98 @@
+"""pdfsys-bench CLI — run the MVP closed loop on a directory of PDFs.
+
+Usage::
+
+    python -m pdfsys_bench \\
+        --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \\
+        --out out/bench_omnidoc100.jsonl \\
+        --limit 20
+
+Flags exposed here are intentionally minimal — anything more is the job
+of a proper runner package. This CLI is meant for smoke-testing.
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+from .loop import run_loop
+
+
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(prog="pdfsys-bench", description="Run the MVP pdfsys closed loop.")
+    p.add_argument(
+        "--pdf-dir",
+        type=Path,
+        required=True,
+        help="Directory of PDFs to process (recursive).",
+    )
+    p.add_argument(
+        "--out",
+        type=Path,
+        required=True,
+        help="Output JSONL path (one line per PDF).",
+    )
+    p.add_argument(
+        "--limit",
+        type=int,
+        default=None,
+        help="Cap the number of PDFs processed. Default: no cap.",
+    )
+    p.add_argument(
+        "--no-quality",
+        action="store_true",
+        help="Skip the ModernBERT quality scorer (fast smoke test).",
+    )
+    p.add_argument(
+        "--quality-model",
+        default="HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+        help="HuggingFace repo id for the quality scorer.",
+    )
+    p.add_argument(
+        "--router-weights",
+        type=Path,
+        default=None,
+        help="Path to xgb_classifier.ubj. Defaults to the package's bundled path.",
+    )
+    p.add_argument(
+        "--markdown-dir",
+        type=Path,
+        default=None,
+        help="Optional directory to dump per-PDF extracted markdown.",
+    )
+    p.add_argument(
+        "--ocr-threshold",
+        type=float,
+        default=0.5,
+        help="P(ocr) threshold above which a PDF is routed off the text-ok path.",
+    )
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = build_parser().parse_args(argv)
+    summary = run_loop(
+        pdf_dir=args.pdf_dir,
+        out_path=args.out,
+        limit=args.limit,
+        score_quality=not args.no_quality,
+        router_weights=args.router_weights,
+        quality_model=args.quality_model,
+        markdown_dir=args.markdown_dir,
+        ocr_threshold=args.ocr_threshold,
+    )
+
+    print(f"[pdfsys-bench] processed {summary['num_pdfs']} PDFs in {summary['wall_seconds']:.1f}s")
+    print(f"[pdfsys-bench] by_backend: {summary['by_backend']}")
+    print(f"[pdfsys-bench] extracted={summary['num_extracted']} scored={summary['num_scored']} errors={summary['num_errors']}")
+    if summary.get("avg_quality") is not None:
+        print(f"[pdfsys-bench] avg_quality={summary['avg_quality']:.3f}")
+    print(f"[pdfsys-bench] jsonl: {summary['out_path']}")
+    print(f"[pdfsys-bench] summary: {summary['summary_path']}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/loop.py
@@ -0,0 +1,200 @@
+"""MVP closed-loop runner: router → parser → quality scorer → JSONL.
+
+This is the tiniest possible end-to-end harness for the pdfsys pipeline.
+Given a directory of PDFs, it:
+
+1. runs :class:`pdfsys_router.Router` to pick a backend per document;
+2. for PDFs routed to ``Backend.MUPDF``, runs :func:`pdfsys_parser_mupdf.extract_doc`
+   to produce an :class:`pdfsys_core.ExtractedDoc`;
+3. scores the resulting Markdown with :class:`pdfsys_bench.OcrQualityScorer`
+   (the ModernBERT-large regression head from FinePDFs);
+4. writes one JSON line per PDF to an output file with routing decision,
+   extraction stats, and quality score.
+
+PDFs routed to ``PIPELINE`` / ``VLM`` / ``DEFERRED`` are recorded with
+their routing decision but skipped for extraction — those backends are
+not implemented yet in this MVP.
+"""
+
+from __future__ import annotations
+
+import json
+import time
+from dataclasses import asdict, dataclass, field
+from pathlib import Path
+from typing import Any, Iterable
+
+from pdfsys_core import Backend
+from pdfsys_parser_mupdf import extract_doc
+from pdfsys_router import Router
+
+from .quality import OcrQualityScorer, QualityScore
+
+
+@dataclass(slots=True)
+class LoopResult:
+    """Per-PDF result row, serialized to JSONL."""
+
+    pdf_path: str
+    sha256: str | None
+    backend: str
+    ocr_prob: float
+    num_pages: int
+    is_form: bool
+    garbled_text_ratio: float
+    router_error: str | None
+    extract_stats: dict[str, Any] = field(default_factory=dict)
+    extract_error: str | None = None
+    quality_score: float | None = None
+    quality_num_chars: int | None = None
+    quality_num_tokens: int | None = None
+    quality_model: str | None = None
+    markdown_chars: int = 0
+    wall_ms_router: float = 0.0
+    wall_ms_extract: float = 0.0
+    wall_ms_quality: float = 0.0
+
+    def to_json_line(self) -> str:
+        return json.dumps(asdict(self), ensure_ascii=False)
+
+
+def _iter_pdfs(root: Path, limit: int | None) -> Iterable[Path]:
+    pdfs = sorted(p for p in root.rglob("*.pdf") if p.is_file())
+    if limit is not None:
+        pdfs = pdfs[:limit]
+    yield from pdfs
+
+
+def run_loop(
+    pdf_dir: str | Path,
+    out_path: str | Path,
+    *,
+    limit: int | None = None,
+    score_quality: bool = True,
+    router_weights: str | Path | None = None,
+    quality_model: str = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+    markdown_dir: str | Path | None = None,
+    ocr_threshold: float = 0.5,
+) -> dict[str, Any]:
+    """Drive the full MVP loop over a PDF directory.
+
+    Returns an aggregate summary dict. Individual result rows are written
+    to ``out_path`` as JSONL (one line per PDF, in input-order).
+    """
+    pdf_dir = Path(pdf_dir)
+    out_path = Path(out_path)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+
+    router = Router(model_path=router_weights, ocr_threshold=ocr_threshold)
+    scorer = OcrQualityScorer(model_name=quality_model) if score_quality else None
+
+    md_root = Path(markdown_dir) if markdown_dir else None
+    if md_root is not None:
+        md_root.mkdir(parents=True, exist_ok=True)
+
+    summary: dict[str, Any] = {
+        "pdf_dir": str(pdf_dir),
+        "out_path": str(out_path),
+        "num_pdfs": 0,
+        "by_backend": {},
+        "num_extracted": 0,
+        "num_scored": 0,
+        "num_errors": 0,
+        "sum_quality": 0.0,
+        "started_at": time.time(),
+    }
+
+    with out_path.open("w", encoding="utf-8") as out_f:
+        for pdf_path in _iter_pdfs(pdf_dir, limit):
+            row = _run_one(
+                pdf_path=pdf_path,
+                router=router,
+                scorer=scorer,
+                md_root=md_root,
+            )
+            out_f.write(row.to_json_line() + "\n")
+            out_f.flush()
+
+            summary["num_pdfs"] += 1
+            by_b = summary["by_backend"]
+            by_b[row.backend] = by_b.get(row.backend, 0) + 1
+            if row.extract_error is None and row.backend == Backend.MUPDF.value:
+                summary["num_extracted"] += 1
+            if row.quality_score is not None:
+                summary["num_scored"] += 1
+                summary["sum_quality"] += row.quality_score
+            if row.router_error or row.extract_error:
+                summary["num_errors"] += 1
+
+    summary["finished_at"] = time.time()
+    summary["wall_seconds"] = summary["finished_at"] - summary["started_at"]
+    summary["avg_quality"] = (
+        summary["sum_quality"] / summary["num_scored"] if summary["num_scored"] else None
+    )
+
+    summary_path = out_path.with_suffix(".summary.json")
+    summary_path.write_text(json.dumps(summary, indent=2, ensure_ascii=False))
+    summary["summary_path"] = str(summary_path)
+
+    return summary
+
+
+def _run_one(
+    *,
+    pdf_path: Path,
+    router: Router,
+    scorer: OcrQualityScorer | None,
+    md_root: Path | None,
+) -> LoopResult:
+    # -- Stage-A routing ------------------------------------------------------
+    t0 = time.perf_counter()
+    decision = router.classify(pdf_path)
+    t1 = time.perf_counter()
+
+    row = LoopResult(
+        pdf_path=str(pdf_path),
+        sha256=None,
+        backend=decision.backend.value,
+        ocr_prob=decision.ocr_prob,
+        num_pages=decision.num_pages,
+        is_form=decision.is_form,
+        garbled_text_ratio=decision.garbled_text_ratio,
+        router_error=decision.error,
+        wall_ms_router=(t1 - t0) * 1000.0,
+    )
+
+    # -- MVP only extracts the text-ok fast path ------------------------------
+    if decision.backend != Backend.MUPDF:
+        return row
+
+    try:
+        t2 = time.perf_counter()
+        extracted = extract_doc(pdf_path)
+        t3 = time.perf_counter()
+        row.sha256 = extracted.sha256
+        row.extract_stats = dict(extracted.stats)
+        row.markdown_chars = extracted.char_count
+        row.wall_ms_extract = (t3 - t2) * 1000.0
+    except Exception as e:  # noqa: BLE001
+        row.extract_error = f"extract_failed: {e}"
+        return row
+
+    if md_root is not None and extracted.markdown:
+        md_path = md_root / f"{extracted.sha256}.md"
+        md_path.write_text(extracted.markdown, encoding="utf-8")
+
+    # -- Quality scoring ------------------------------------------------------
+    if scorer is not None and extracted.markdown:
+        try:
+            t4 = time.perf_counter()
+            q: QualityScore = scorer.score(extracted.markdown)
+            t5 = time.perf_counter()
+            row.quality_score = q.score
+            row.quality_num_chars = q.num_chars
+            row.quality_num_tokens = q.num_tokens
+            row.quality_model = q.model
+            row.wall_ms_quality = (t5 - t4) * 1000.0
+        except Exception as e:  # noqa: BLE001
+            row.extract_error = f"quality_failed: {e}"
+
+    return row
--- /dev/null
+++ b/packages/pdfsys-bench/src/pdfsys_bench/quality.py
@@ -0,0 +1,148 @@
+"""OCR quality scorer backed by the FinePDFs ModernBERT classifier.
+
+Wraps ``HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`` — a
+single-head regression fine-tune of ModernBERT-large (~0.4 B params)
+that emits a float in ``[0, 3]`` where:
+
+* 0 → garbage / unreadable OCR
+* 1 → formatting issues but mostly readable
+* 2 → minor problems
+* 3 → clean text
+
+The scorer takes raw extracted text (Markdown or plain), truncates to at
+most ``max_chars`` characters before tokenization, tokenizes with the
+model's own tokenizer, runs one forward pass, and returns the scalar.
+
+Heavy dependencies (``torch`` + ``transformers``) are imported lazily so
+that merely importing :mod:`pdfsys_bench` does not pull them in.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+DEFAULT_MODEL = "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn"
+DEFAULT_MAX_CHARS = 10_000
+# Upstream FinePDFs uses max_tokens=2048, but ModernBERT-large activations
+# at that length need ≈ 3 GB of RAM — too much for a 4 GB dev box. 512
+# tokens is enough to give a stable quality signal in practice and keeps
+# peak memory well under a gig.
+DEFAULT_MAX_TOKENS = 512
+
+
+@dataclass(slots=True)
+class QualityScore:
+    """Result of scoring one document."""
+
+    score: float
+    num_chars: int
+    num_tokens: int
+    model: str
+
+    def as_record(self) -> dict[str, Any]:
+        return {
+            "quality_score": self.score,
+            "quality_num_chars": self.num_chars,
+            "quality_num_tokens": self.num_tokens,
+            "quality_model": self.model,
+        }
+
+
+class OcrQualityScorer:
+    """Lazy ModernBERT regression scorer. Re-uses model/tokenizer across calls."""
+
+    def __init__(
+        self,
+        model_name: str = DEFAULT_MODEL,
+        max_chars: int = DEFAULT_MAX_CHARS,
+        max_tokens: int = DEFAULT_MAX_TOKENS,
+        device: str | None = None,
+        dtype: str = "bfloat16",
+    ) -> None:
+        self.model_name = model_name
+        self.max_chars = max_chars
+        self.max_tokens = max_tokens
+        self._device_name = device
+        self.dtype_name = dtype
+        self._tokenizer: Any = None
+        self._model: Any = None
+        self._torch: Any = None
+        self._device: Any = None
+
+    def _ensure_loaded(self) -> None:
+        if self._model is not None:
+            return
+        import torch  # noqa: PLC0415 — lazy import is intentional
+        from transformers import AutoModelForSequenceClassification, AutoTokenizer  # noqa: PLC0415
+
+        self._torch = torch
+        self._device = torch.device(
+            self._device_name
+            or ("cuda" if torch.cuda.is_available() else "cpu")
+        )
+        # Use bfloat16 on CPU to halve the model's memory footprint —
+        # ModernBERT-large is ~0.4 B params, so fp32 weights alone take
+        # ~1.6 GB and OOM a 4 GB-RAM dev box. bf16 inference is
+        # numerically stable enough for a regression head like this.
+        torch_dtype = getattr(torch, self.dtype_name, torch.float32)
+
+        self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+        # ``dtype`` is the transformers≥5 name; ``torch_dtype`` was the
+        # transformers<5 name. Pass ``dtype`` and fall back for older releases.
+        try:
+            model = AutoModelForSequenceClassification.from_pretrained(
+                self.model_name,
+                dtype=torch_dtype,
+            )
+        except TypeError:
+            model = AutoModelForSequenceClassification.from_pretrained(
+                self.model_name,
+                torch_dtype=torch_dtype,
+            )
+        model.eval()
+        model.to(self._device)
+        self._model = model
+
+    def score(self, text: str) -> QualityScore:
+        """Score a single document. Empty input returns 0.0."""
+        if not text or not text.strip():
+            return QualityScore(
+                score=0.0, num_chars=0, num_tokens=0, model=self.model_name
+            )
+
+        self._ensure_loaded()
+        assert self._tokenizer is not None and self._model is not None
+        torch = self._torch
+
+        clipped = text[: self.max_chars]
+        enc = self._tokenizer(
+            clipped,
+            return_tensors="pt",
+            truncation=True,
+            max_length=self.max_tokens,
+        )
+        num_tokens = int(enc["input_ids"].shape[1])
+        enc = {k: v.to(self._device) for k, v in enc.items()}
+
+        with torch.inference_mode():
+            out = self._model(**enc)
+            logits = out.logits  # shape [1, 1] for regression
+        raw = float(logits.squeeze().item())
+        # Drop the forward-pass tensors eagerly so large-seq runs on CPU
+        # don't hold onto activations between calls.
+        del enc, out, logits
+
+        # Clamp to the documented [0, 3] range.
+        clamped = max(0.0, min(3.0, raw))
+
+        return QualityScore(
+            score=clamped,
+            num_chars=len(clipped),
+            num_tokens=num_tokens,
+            model=self.model_name,
+        )
+
+    def score_many(self, texts: list[str]) -> list[QualityScore]:
+        """Serial scoring — tiny MVP harness, not a batched hot path."""
+        return [self.score(t) for t in texts]
--- a/packages/pdfsys-parser-mupdf/pyproject.toml
+++ b/packages/pdfsys-parser-mupdf/pyproject.toml
@@ -9,6 +9,7 @@ description = "Text-ok backend: PyMuPDF extraction + reading order + Markdown em
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pymupdf>=1.24",
 ]
 
 [tool.uv.sources]
--- a/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py
+++ b/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/__init__.py
@@ -1,8 +1,14 @@
 """pdfsys-parser-mupdf — text-ok extraction backend.
 
 Consumes PDFs classified as text-ok by pdfsys-router. Uses PyMuPDF for
-block extraction
-Does NOT depend on pdfsys-layout-analyser.
+block extraction (``page.get_text("blocks", sort=True)``) and emits
+Markdown. Does NOT depend on pdfsys-layout-analyser.
 """
 
+from __future__ import annotations
+
+from .extract import extract_doc, extract_doc_bytes
+
 __version__ = "0.0.1"
+
+__all__ = ["__version__", "extract_doc", "extract_doc_bytes"]
--- a/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py
+++ b/packages/pdfsys-parser-mupdf/src/pdfsys_parser_mupdf/extract.py
@@ -1 +1,181 @@
-"""PyMuPDF extraction
+"""PyMuPDF-based text extraction for the mupdf (text-ok) backend.
+
+This is the simplest of the three parser backends. It assumes the PDF
+already has a clean text layer and just needs unwrapping into Markdown —
+which is why the router routes here only when the XGBoost classifier says
+``ocr_prob < threshold``.
+
+We use ``page.get_text("blocks")`` which returns paragraph-shaped blocks
+with coordinates already in reading order (PyMuPDF's internal sorting).
+Each block becomes one :class:`pdfsys_core.Segment` of type
+:attr:`pdfsys_core.RegionType.TEXT`, with its bbox normalized to ``[0, 1]``.
+Empty and image-only blocks are dropped.
+
+No layout-model dependency, no GPU, no OCR — this is the text-ok fast
+path, and stays that way.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import io
+from pathlib import Path
+from typing import Any
+
+import pymupdf
+
+from pdfsys_core import (
+    Backend,
+    BBox,
+    ExtractedDoc,
+    RegionType,
+    Segment,
+    merge_segments_to_markdown,
+)
+
+
+# PyMuPDF block tuple layout: (x0, y0, x1, y1, text, block_no, block_type).
+# block_type 0 = text, 1 = image.
+_TEXT_BLOCK_TYPE = 0
+
+
+def _sha256_of_file(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def _sha256_of_bytes(data: bytes) -> str:
+    return hashlib.sha256(data).hexdigest()
+
+
+def _normalize_text(text: str) -> str:
+    """Trim trailing whitespace and collapse PyMuPDF's soft linebreaks.
+
+    PyMuPDF returns block text with intra-paragraph newlines. For Markdown
+    emission we keep paragraphs on one line; actual paragraph breaks come
+    from the block boundaries themselves.
+    """
+    if not text:
+        return ""
+    # Strip and replace single newlines with spaces while preserving
+    # double-newlines (rare, but occasionally emitted for list items).
+    paragraphs = [p.strip() for p in text.split("\n\n")]
+    joined = "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
+    return joined.strip()
+
+
+def _block_bbox(
+    block: tuple[Any, ...],
+    page_width_pt: float,
+    page_height_pt: float,
+) -> BBox | None:
+    """Normalize a PyMuPDF block bbox to ``[0, 1]`` or return None on overflow."""
+    x0, y0, x1, y1 = block[0], block[1], block[2], block[3]
+    if page_width_pt <= 0 or page_height_pt <= 0:
+        return None
+
+    def clamp(v: float) -> float:
+        if v < 0.0:
+            return 0.0
+        if v > 1.0:
+            return 1.0
+        return v
+
+    nx0 = clamp(x0 / page_width_pt)
+    ny0 = clamp(y0 / page_height_pt)
+    nx1 = clamp(x1 / page_width_pt)
+    ny1 = clamp(y1 / page_height_pt)
+    if nx1 <= nx0 or ny1 <= ny0:
+        return None
+    try:
+        return BBox(x0=nx0, y0=ny0, x1=nx1, y1=ny1)
+    except ValueError:
+        return None
+
+
+def extract_doc(pdf_path: str | Path) -> ExtractedDoc:
+    """Run the mupdf backend on a single PDF file and return its ExtractedDoc."""
+    path = Path(pdf_path)
+    sha256 = _sha256_of_file(path)
+    doc = pymupdf.open(str(path))
+    try:
+        return _extract(doc, sha256)
+    finally:
+        doc.close()
+
+
+def extract_doc_bytes(pdf_bytes: bytes, sha256: str | None = None) -> ExtractedDoc:
+    """Run the mupdf backend on an in-memory PDF buffer."""
+    sha = sha256 or _sha256_of_bytes(pdf_bytes)
+    doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+    try:
+        return _extract(doc, sha)
+    finally:
+        doc.close()
+
+
+def _extract(doc: pymupdf.Document, sha256: str) -> ExtractedDoc:
+    segments: list[Segment] = []
+    pages_extracted = 0
+    pages_skipped = 0
+
+    for page_index, page in enumerate(doc):
+        page_width_pt = float(page.rect.width)
+        page_height_pt = float(page.rect.height)
+
+        try:
+            blocks = page.get_text(
+                "blocks",
+                flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+                sort=True,
+            )
+        except Exception:
+            pages_skipped += 1
+            continue
+
+        pages_extracted += 1
+        for block in blocks:
+            # block tuple: (x0, y0, x1, y1, text, block_no, block_type)
+            if len(block) < 7:
+                continue
+            if block[6] != _TEXT_BLOCK_TYPE:
+                # image block — mupdf backend doesn't emit IMAGE segments by
+                # design; image-heavy PDFs should have been routed elsewhere.
+                continue
+            text = _normalize_text(block[4] or "")
+            if not text:
+                continue
+            bbox = _block_bbox(block, page_width_pt, page_height_pt)
+            segments.append(
+                Segment(
+                    index=len(segments),
+                    backend=Backend.MUPDF,
+                    page_index=page_index,
+                    type=RegionType.TEXT,
+                    content=text,
+                    bbox=bbox,
+                    source_region_id=None,
+                )
+            )
+
+    seg_tuple = tuple(segments)
+    markdown = merge_segments_to_markdown(seg_tuple)
+
+    stats: dict[str, Any] = {
+        "page_count": len(doc),
+        "pages_extracted": pages_extracted,
+        "pages_skipped": pages_skipped,
+        "segment_count": len(seg_tuple),
+        "char_count": len(markdown),
+    }
+
+    return ExtractedDoc(
+        sha256=sha256,
+        backend=Backend.MUPDF,
+        segments=seg_tuple,
+        markdown=markdown,
+        stats=stats,
+    )
--- /dev/null
+++ b/packages/pdfsys-router/models/.gitignore
@@ -0,0 +1,4 @@
+# External model weights downloaded by pdfsys_router.download_weights.
+# The xgb_classifier.ubj file is FinePDFs IP and should not be
+# committed. Run `python -m pdfsys_router.download_weights` to fetch it.
+xgb_classifier.ubj
--- /dev/null
+++ b/packages/pdfsys-router/models/README.md
@@ -0,0 +1,16 @@
+# Router model weights
+
+This directory is where the Stage-A XGBoost classifier weights live on disk.
+
+The file `xgb_classifier.ubj` (≈ 257 KB) is **not committed** — it's the
+ported FinePDFs binary classifier weights, owned by HuggingFace. Fetch it
+once with:
+
+```bash
+python -m pdfsys_router.download_weights
+```
+
+The downloader pulls from
+`media.githubusercontent.com/media/huggingface/finepdfs/main/blocks/predictor/xgb.ubj`,
+which is the actual Git-LFS payload (not the pointer file that plain
+`raw.githubusercontent.com` would return).
--- a/packages/pdfsys-router/pyproject.toml
+++ b/packages/pdfsys-router/pyproject.toml
@@ -9,6 +9,11 @@ description = "Stage-1 classifier: decides text-ok vs needs-ocr; consults Layout
 requires-python = ">=3.11"
 dependencies = [
     "pdfsys-core",
+    "pymupdf>=1.24",
+    "xgboost>=2.0",
+    "scikit-learn>=1.3",
+    "pandas>=2.0",
+    "numpy>=1.26",
 ]
 
 [tool.uv.sources]
--- a/packages/pdfsys-router/src/pdfsys_router/__init__.py
+++ b/packages/pdfsys-router/src/pdfsys_router/__init__.py
@@ -1,9 +1,27 @@
 """pdfsys-router — two-stage routing for the pdfsys extraction pipeline.
 
-Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features
+Stage A (cheap): classify text-ok vs needs-ocr from PyMuPDF features, using
+a ported FinePDFs XGBoost classifier over 124 hand-crafted features.
+
 Stage B (uses layout cache): for needs-ocr, read the LayoutDocument written
 by pdfsys-layout-analyser and decide pipeline vs vlm based on whether
-complex regions (tables / formulas) exist.
+complex regions (tables / formulas) exist. Stage B is not in the MVP.
 """
 
+from __future__ import annotations
+
+from .classifier import Router, RouterDecision
+from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+from .xgb_model import XgbRouterModel, default_weights_path
+
 __version__ = "0.0.1"
+
+__all__ = [
+    "__version__",
+    "Router",
+    "RouterDecision",
+    "PDFFeatureExtractor",
+    "flatten_per_page_features",
+    "XgbRouterModel",
+    "default_weights_path",
+]
--- a/packages/pdfsys-router/src/pdfsys_router/classifier.py
+++ b/packages/pdfsys-router/src/pdfsys_router/classifier.py
@@ -1,4 +1,201 @@
-"""Stage-A classifier: text-ok vs needs-ocr.
+"""Stage-A classifier: decides text-ok (MUPDF) vs needs-ocr (PIPELINE/VLM).
 
-
+This is the single public entry point of the router for the MVP. Stage-B
+(layout-cache driven pipeline-vs-vlm decision) will be added later; for
+now, anything that needs OCR is routed to ``Backend.PIPELINE`` unless the
+configured policy says otherwise.
+
+The classifier is deliberately stateless. It loads the XGBoost model once
+(lazily) and then exposes ``classify(pdf_path) -> RouterDecision``. No
+caching, no I/O side effects — pure in, pure out.
 """
+
+from __future__ import annotations
+
+import random
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pymupdf
+
+from pdfsys_core import Backend, RouterConfig
+
+from .feature_extractor import PDFFeatureExtractor, flatten_per_page_features
+from .xgb_model import XgbRouterModel, default_weights_path
+
+
+@dataclass(slots=True)
+class RouterDecision:
+    """Result of running the Stage-A classifier on a single PDF."""
+
+    backend: Backend
+    ocr_prob: float
+    num_pages: int
+    is_form: bool
+    garbled_text_ratio: float
+    is_encrypted: bool
+    needs_password: bool
+    features: dict[str, Any] = field(default_factory=dict)
+    error: str | None = None
+
+    def as_record(self) -> dict[str, Any]:
+        """Flat dict for JSONL emission."""
+        return {
+            "backend": self.backend.value,
+            "ocr_prob": self.ocr_prob,
+            "num_pages": self.num_pages,
+            "is_form": bool(self.is_form),
+            "garbled_text_ratio": float(self.garbled_text_ratio),
+            "is_encrypted": bool(self.is_encrypted),
+            "needs_password": bool(self.needs_password),
+            "error": self.error,
+        }
+
+
+class Router:
+    """Stage-A router: PyMuPDF features → XGBoost → Backend."""
+
+    def __init__(
+        self,
+        config: RouterConfig | None = None,
+        model_path: str | Path | None = None,
+        num_pages_to_sample: int = 8,
+        ocr_threshold: float = 0.5,
+        seed: int = 42,
+    ) -> None:
+        self.config = config or RouterConfig()
+        self.num_pages_to_sample = num_pages_to_sample
+        self.ocr_threshold = ocr_threshold
+        self.seed = seed
+        self._extractor = PDFFeatureExtractor(
+            num_chunks=1, num_pages_to_sample=num_pages_to_sample
+        )
+        self._model = XgbRouterModel(model_path or default_weights_path())
+
+    # ------------------------------------------------------------------ api
+
+    def classify(self, pdf_path: str | Path) -> RouterDecision:
+        """Classify a PDF file. Never raises — errors are in ``decision.error``."""
+        path = Path(pdf_path)
+        try:
+            doc = pymupdf.open(str(path))
+        except Exception as e:  # noqa: BLE001 — we want to capture anything
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"open_failed: {e}",
+            )
+
+        try:
+            return self._classify_doc(doc)
+        finally:
+            try:
+                doc.close()
+            except Exception:
+                pass
+
+    def classify_bytes(self, pdf_bytes: bytes) -> RouterDecision:
+        """Same as :meth:`classify`, but from an in-memory buffer."""
+        import io
+
+        try:
+            doc = pymupdf.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
+        except Exception as e:  # noqa: BLE001
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"open_failed: {e}",
+            )
+        try:
+            return self._classify_doc(doc)
+        finally:
+            try:
+                doc.close()
+            except Exception:
+                pass
+
+    # --------------------------------------------------------------- internal
+
+    def _classify_doc(self, doc: pymupdf.Document) -> RouterDecision:
+        # Seed the sampling RNGs so the same PDF always produces the same
+        # feature vector — critical for reproducibility and debugging.
+        random.seed(self.seed)
+        np.random.seed(self.seed)
+
+        try:
+            if doc.is_encrypted or doc.needs_pass:
+                return RouterDecision(
+                    backend=Backend.DEFERRED,
+                    ocr_prob=float("nan"),
+                    num_pages=len(doc),
+                    is_form=False,
+                    garbled_text_ratio=0.0,
+                    is_encrypted=bool(doc.is_encrypted),
+                    needs_password=bool(doc.needs_pass),
+                    error="encrypted_or_password_protected",
+                )
+
+            raw_chunks = self._extractor.extract_all_features(doc)
+            if not raw_chunks:
+                return RouterDecision(
+                    backend=Backend.DEFERRED,
+                    ocr_prob=float("nan"),
+                    num_pages=len(doc),
+                    is_form=False,
+                    garbled_text_ratio=0.0,
+                    is_encrypted=False,
+                    needs_password=False,
+                    error="no_pages_sampled",
+                )
+
+            flat = flatten_per_page_features(
+                raw_chunks[0], sample_to_k_page_features=self.num_pages_to_sample
+            )
+            ocr_prob = self._model.predict_proba(flat)
+
+            backend = self._route(ocr_prob)
+            return RouterDecision(
+                backend=backend,
+                ocr_prob=ocr_prob,
+                num_pages=len(doc),
+                is_form=bool(flat.get("is_form", False)),
+                garbled_text_ratio=float(flat.get("garbled_text_ratio", 0.0)),
+                is_encrypted=bool(doc.is_encrypted),
+                needs_password=bool(doc.needs_pass),
+                features=flat,
+            )
+        except Exception as e:  # noqa: BLE001
+            return RouterDecision(
+                backend=Backend.DEFERRED,
+                ocr_prob=float("nan"),
+                num_pages=len(doc) if doc else 0,
+                is_form=False,
+                garbled_text_ratio=0.0,
+                is_encrypted=False,
+                needs_password=False,
+                error=f"classify_failed: {e}",
+            )
+
+    def _route(self, ocr_prob: float) -> Backend:
+        """Map XGBoost probability + fleet policy → concrete Backend."""
+        if ocr_prob < self.ocr_threshold:
+            return Backend.MUPDF
+        # OCR needed. Stage-B would check LayoutCache for complex content
+        # here. For the MVP we have no layout cache yet, so honour the
+        # fleet VLM gate: if VLM is enabled we'd need Stage-B to decide,
+        # otherwise pipeline handles everything flagged as scanned.
+        if self.config.vlm_enabled:
+            return Backend.DEFERRED  # Stage-B will run once layout is cached
+        return Backend.PIPELINE
--- /dev/null
+++ b/packages/pdfsys-router/src/pdfsys_router/download_weights.py
@@ -0,0 +1,52 @@
+"""Fetch the FinePDFs XGBoost router weights from upstream.
+
+The weights file (``xgb.ubj``, ~257 KB) is not committed to this repo —
+it's external IP owned by HuggingFace/FinePDFs and lives on their Git-LFS
+bucket. Running this module downloads it once into ``models/xgb_classifier.ubj``
+next to this package.
+
+Usage::
+
+    python -m pdfsys_router.download_weights
+"""
+
+from __future__ import annotations
+
+import sys
+import urllib.request
+from pathlib import Path
+
+# media.githubusercontent.com serves the actual LFS payload directly,
+# bypassing the pointer file that raw.githubusercontent.com returns.
+WEIGHTS_URL = (
+    "https://media.githubusercontent.com/media/huggingface/finepdfs/main/"
+    "blocks/predictor/xgb.ubj"
+)
+
+
+def target_path() -> Path:
+    return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
+
+
+def download(force: bool = False) -> Path:
+    dst = target_path()
+    if dst.exists() and not force:
+        print(f"[download_weights] already present: {dst}")
+        return dst
+    dst.parent.mkdir(parents=True, exist_ok=True)
+    print(f"[download_weights] fetching {WEIGHTS_URL}")
+    with urllib.request.urlopen(WEIGHTS_URL) as r:  # noqa: S310 — pinned URL
+        data = r.read()
+    if len(data) < 10_000:
+        raise RuntimeError(
+            f"downloaded blob is suspiciously small ({len(data)} bytes) — "
+            "likely an LFS pointer, not the binary"
+        )
+    dst.write_bytes(data)
+    print(f"[download_weights] wrote {len(data)} bytes -> {dst}")
+    return dst
+
+
+if __name__ == "__main__":
+    force = "--force" in sys.argv
+    download(force=force)
@@ -0,0 +1,484 @@
+"""PyMuPDF-only feature extractor for the Stage-A router classifier.
+
+Ported verbatim (modulo stylistic cleanup and removal of datatrove imports)
+from FinePDFs' ``blocks/predictor/ocr_predictor.py``:
+
+https://github.com/huggingface/finepdfs/blob/main/blocks/predictor/ocr_predictor.py
+
+The goal is bit-exact feature compatibility with the upstream XGBoost
+``xgb.ubj`` weights. If you touch anything in here, run the parity harness
+in ``pdfsys-bench`` against FinePDFs' reference output first.
+
+The extractor samples up to ``num_pages_to_sample`` pages at random, then
+computes:
+
+* 4 doc-level features: ``num_pages_successfully_sampled``,
+  ``garbled_text_ratio``, ``is_form``, ``creator_or_producer_is_known_scanner``.
+* 15 page-level features × 8 sampled pages = 120 features.
+
+:func:`flatten_per_page_features` produces the flat 124-feature dict the
+XGBoost model expects, in the exact column order of ``feature_names_in_``.
+"""
+
+from __future__ import annotations
+
+import random
+from collections import Counter
+from typing import Any
+
+import numpy as np
+import pymupdf
+
+
+# Keep this list in sync with FinePDFs upstream. These strings are
+# lowercased substring-matched against PDF metadata creator/producer to
+# flag scanner-origin PDFs which almost always need OCR.
+KNOWN_SCANNER_STRINGS: tuple[str, ...] = (
+    "scanner",
+    "scan",
+    "epson",
+    "hp scanjet",
+    "canon",
+    "fujitsu",
+    "kodak",
+    "brother",
+    "xerox",
+    "lexmark",
+    "kmc",
+    "kofax",
+    "ricoh",
+    "iris",
+    "capturedocument",
+    "paperport",
+    "readiris",
+    "simpleocr",
+)
+
+# Strip-merge tuning constants — used to coalesce image slices that some
+# PDFs explode into dozens of thin rectangles, so we don't overcount.
+JUNK_IMAGE_THRESHOLD_RATIO = 0.5
+JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD = 3
+MERGE_MAX_OFFSET = 5
+MERGE_MAX_GAP = 2
+
+
+def flatten_per_page_features(
+    feature_dict_sample: dict[str, Any],
+    sample_to_k_page_features: int = 8,
+) -> dict[str, Any]:
+    """Flatten a nested feature dict into the flat schema XGBoost expects.
+
+    The XGBoost model was trained on a 124-column DataFrame whose columns
+    are, in order:
+
+        num_pages_successfully_sampled
+        garbled_text_ratio
+        is_form
+        creator_or_producer_is_known_scanner
+        page_level_unique_font_counts_page1
+        ...
+        page_level_vector_graphics_obj_count_page8
+
+    If fewer than 8 pages were actually sampled, pages are resampled with
+    replacement to pad the vector — this matches the upstream behavior.
+    Seed numpy before calling this function if you need determinism.
+    """
+    flattened: dict[str, Any] = {}
+
+    doc_level_features = (
+        "num_pages_successfully_sampled",
+        "num_unique_image_xrefs",
+        "num_junk_image_xrefs",
+        "garbled_text_ratio",
+        "is_form",
+        "creator_or_producer_is_known_scanner",
+        "class",
+    )
+
+    used_keys: set[str] = set()
+
+    for key in doc_level_features:
+        if key in feature_dict_sample:
+            flattened[key] = feature_dict_sample[key]
+            used_keys.add(key)
+
+    page_level_features = (
+        "page_level_unique_font_counts",
+        "page_level_char_counts",
+        "page_level_text_box_counts",
+        "page_level_avg_text_box_lengths",
+        "page_level_text_area_ratios",
+        "page_level_hidden_char_counts",
+        "page_level_hidden_text_box_counts",
+        "page_level_hidden_avg_text_box_lengths",
+        "page_level_hidden_text_area_ratios",
+        "page_level_image_counts",
+        "page_level_non_junk_image_counts",
+        "page_level_bitmap_proportions",
+        "page_level_max_merged_strip_areas",
+        "page_level_drawing_strokes_count",
+        "page_level_vector_graphics_obj_count",
+    )
+
+    num_pages = len(feature_dict_sample["page_level_unique_font_counts"])
+    page_indices = list(range(num_pages))
+    # If we don't have enough pages, resample random pages. Upstream uses
+    # np.random.choice here, so seed numpy if determinism matters.
+    if num_pages < sample_to_k_page_features:
+        extra = np.random.choice(
+            num_pages, sample_to_k_page_features - num_pages, replace=True
+        ).tolist()
+        page_indices += extra
+
+    for key in page_level_features:
+        list_data = feature_dict_sample.get(key)
+        if list_data is None:
+            continue
+        for page_idx, ind in enumerate(page_indices):
+            flattened[f"{key}_page{page_idx + 1}"] = list_data[ind]
+        used_keys.add(key)
+
+    return flattened
+
+
+class PDFFeatureExtractor:
+    """PyMuPDF feature extraction. Pure — no I/O, no network, no state."""
+
+    def __init__(self, num_pages_to_sample: int = 8, num_chunks: int = 1) -> None:
+        if not isinstance(num_pages_to_sample, int):
+            raise ValueError("num_pages_to_sample must be an integer.")
+        self.num_pages_to_sample = num_pages_to_sample
+        self.num_chunks = num_chunks
+
+    # --------------------------------------------------------------- sampling
+
+    def _get_sampled_page_indices(self, doc: pymupdf.Document) -> list[list[int]]:
+        total_pages = len(doc)
+        if total_pages == 0 or self.num_pages_to_sample <= 0:
+            return []
+
+        available = list(range(total_pages))
+        sampled: list[list[int]] = []
+
+        if self.num_chunks == -1:
+            num_chunks = len(available) // self.num_pages_to_sample + 1
+        else:
+            num_chunks = self.num_chunks
+
+        for _ in range(num_chunks):
+            if not available:
+                break
+            chunk_size = min(self.num_pages_to_sample, len(available))
+            chunk = random.sample(available, chunk_size)
+            for idx in chunk:
+                available.remove(idx)
+            sampled.append(sorted(chunk))
+
+        return sampled
+
+    # ----------------------------------------------------------- doc-level
+
+    def _get_garbled_text_per_page(
+        self, doc: pymupdf.Document
+    ) -> tuple[list[int], list[int]]:
+        all_text: list[int] = []
+        garbled_text: list[int] = []
+        replacement = chr(0xFFFD)
+        for page in doc:
+            text = page.get_text(
+                "text",
+                flags=pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_MEDIABOX_CLIP,
+            )
+            all_text.append(len(text))
+            garbled_text.append(text.count(replacement))
+        return all_text, garbled_text
+
+    def _check_creator_producer_scanner(self, doc: pymupdf.Document) -> bool:
+        metadata = doc.metadata or {}
+        creator = (metadata.get("creator") or "").lower()
+        producer = (metadata.get("producer") or "").lower()
+        for keyword in KNOWN_SCANNER_STRINGS:
+            if keyword in creator or keyword in producer:
+                return True
+        return False
+
+    def _extract_document_level_stats_from_sampled_pages(
+        self, doc: pymupdf.Document, sampled_page_indices: list[int]
+    ) -> dict[str, Any]:
+        """Identify junk images (same xref repeated on most sampled pages)."""
+        stats: dict[str, Any] = {"junk_image_xrefs_list": []}
+
+        if not sampled_page_indices:
+            return stats
+
+        all_instances: list[int] = []
+        per_page: dict[int, set[int]] = {}
+        for page_idx in sampled_page_indices:
+            try:
+                page = doc.load_page(page_idx)
+                unique_xrefs: set[int] = set()
+                for img_def in page.get_images(full=False):
+                    xref = img_def[0]
+                    if xref == 0:
+                        continue
+                    unique_xrefs.add(xref)
+                    all_instances.append(xref)
+                per_page[page_idx] = unique_xrefs
+            except Exception:
+                per_page[page_idx] = set()
+
+        if not all_instances:
+            return stats
+
+        stats["num_unique_image_xrefs"] = len(set(all_instances))
+
+        xref_page_counts: Counter[int] = Counter()
+        for page_xrefs in per_page.values():
+            xref_page_counts.update(page_xrefs)
+
+        num_sampled = len(sampled_page_indices)
+        # Upstream overrides the ratio check and requires an xref to be on
+        # every sampled page to be flagged as junk — matches FinePDFs.
+        min_threshold = num_sampled
+
+        junk_list: list[int] = []
+        if num_sampled >= JUNK_IMAGE_MIN_PAGES_FOR_THRESHOLD:
+            for xref, count in xref_page_counts.items():
+                if count >= min_threshold:
+                    junk_list.append(xref)
+
+        stats["num_junk_image_xrefs"] = len(junk_list)
+        stats["junk_image_xrefs_list"] = junk_list
+        return stats
+
+    # ------------------------------------------------------------- imaging
+
+    def _heuristic_merge_image_strips_on_page(
+        self,
+        single_page_image_list: list[list[Any]],
+        page_width: float,
+        page_height: float,
+    ) -> list[list[Any]]:
+        if not single_page_image_list:
+            return []
+
+        deduped: list[list[Any]] = []
+        seen: set[tuple[float, float, float, float]] = set()
+        for img_data in single_page_image_list:
+            key = (img_data[0], img_data[1], img_data[2], img_data[3])
+            if key not in seen:
+                seen.add(key)
+                deduped.append(img_data)
+        if not deduped:
+            return []
+
+        deduped.sort(key=lambda img: (img[1], img[0]))
+        merged: list[list[Any]] = [deduped[0]]
+
+        for img in deduped[1:]:
+            x0, y0, x1, y1, imgid = img
+            last = merged[-1]
+            lx0, ly0, lx1, ly1, _ = last
+
+            cur_w = abs(x1 - x0)
+            cur_h = abs(y1 - y0)
+            full_w = page_width > 0 and cur_w >= page_width * 0.9
+            full_h = page_height > 0 and cur_h >= page_height * 0.9
+
+            can_merge = False
+            if full_w:
+                if (
+                    abs(lx0 - x0) <= MERGE_MAX_OFFSET
+                    and abs(lx1 - x1) <= MERGE_MAX_OFFSET
+                    and abs(y0 - ly1) <= MERGE_MAX_GAP
+                ):
+                    can_merge = True
+            if not can_merge and full_h:
+                if (
+                    abs(ly0 - y0) <= MERGE_MAX_OFFSET
+                    and abs(ly1 - y1) <= MERGE_MAX_OFFSET
+                    and abs(x0 - lx1) <= MERGE_MAX_GAP
+                ):
+                    can_merge = True
+
+            if can_merge:
+                merged[-1] = [
+                    min(x0, lx0),
+                    min(y0, ly0),
+                    max(x1, lx1),
+                    max(y1, ly1),
+                    imgid,
+                ]
+            else:
+                merged.append(img)
+
+        return merged
+
+    # ---------------------------------------------------------------- main
+
+    def compute_features_per_chunk(
+        self, doc: pymupdf.Document, sampled_page_indices: list[int]
+    ) -> dict[str, Any]:
+        features: dict[str, Any] = {
+            "is_form": False,
+            "creator_or_producer_is_known_scanner": False,
+            "garbled_text_ratio": 0,
+            "page_level_unique_font_counts": [],
+            "page_level_char_counts": [],
+            "page_level_text_box_counts": [],
+            "page_level_avg_text_box_lengths": [],
+            "page_level_text_area_ratios": [],
+            "page_level_hidden_char_counts": [],
+            "page_level_hidden_text_box_counts": [],
+            "page_level_hidden_avg_text_box_lengths": [],
+            "page_level_hidden_text_area_ratios": [],
+            "page_level_image_counts": [],
+            "page_level_non_junk_image_counts": [],
+            "page_level_bitmap_proportions": [],
+            "page_level_max_merged_strip_areas": [],
+            "page_level_drawing_strokes_count": [],
+            "page_level_vector_graphics_obj_count": [],
+            "num_pages_successfully_sampled": 0,
+            "num_pages_requested_for_sampling": 0,
+            "sampled_page_indices": [],
+        }
+
+        features["num_pages_requested_for_sampling"] = len(sampled_page_indices)
+        if not sampled_page_indices:
+            return features
+
+        doc_stats = self._extract_document_level_stats_from_sampled_pages(
+            doc, sampled_page_indices
+        )
+        junk_xrefs: set[int] = set(doc_stats.get("junk_image_xrefs_list", []))
+
+        features["is_form"] = bool(doc.is_form_pdf) if doc.is_form_pdf is not None else False
+        features["creator_or_producer_is_known_scanner"] = self._check_creator_producer_scanner(doc)
+
+        # Garbled text: U+FFFD replacement character / total chars. Computed
+        # over ALL pages, but the rate reported to XGBoost is restricted to
+        # the sampled pages (upstream semantics).
+        all_text, garbled_text = self._get_garbled_text_per_page(doc)
+        all_sum = sum(all_text)
+        garb_sum = sum(garbled_text)
+        features["global_garbled_text_ratio"] = 0 if all_sum == 0 else garb_sum / all_sum
+
+        sampled_garb = sum(garbled_text[i] for i in sampled_page_indices)
+        sampled_all = sum(all_text[i] for i in sampled_page_indices)
+        features["garbled_text_ratio"] = 0 if sampled_all == 0 else sampled_garb / sampled_all
+
+        for page_idx in sampled_page_indices:
+            try:
+                page = doc.load_page(page_idx)
+            except Exception:
+                continue
+
+            features["sampled_page_indices"].append(page_idx)
+            features["num_pages_successfully_sampled"] += 1
+
+            page_rect = page.rect
+            page_area = float(page_rect.width * page_rect.height) or 1.0
+
+            # --- Fonts ---
+            fonts: set[str] = set()
+            try:
+                for fi in page.get_fonts(full=True):
+                    if len(fi) > 3 and fi[3]:
+                        fonts.add(fi[3])
+            except Exception:
+                pass
+            features["page_level_unique_font_counts"].append(len(fonts))
+
+            # --- Visible vs hidden text via texttrace ---
+            char_count = 0
+            text_area = 0.0
+            text_boxes = 0
+            hidden_chars = 0
+            hidden_area = 0.0
+            hidden_boxes = 0
+            try:
+                for tr in page.get_texttrace():
+                    n = len(tr.get("chars", []))
+                    bbox = tr.get("bbox", (0, 0, 0, 0))
+                    box_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
+                    if tr.get("type") == 3 or tr.get("opacity", 1.0) == 0:
+                        hidden_chars += n
+                        hidden_area += box_area
+                        hidden_boxes += 1
+                    else:
+                        char_count += n
+                        text_area += box_area
+                        text_boxes += 1
+            except Exception:
+                pass
+
+            features["page_level_char_counts"].append(char_count)
+            features["page_level_text_box_counts"].append(text_boxes)
+            features["page_level_avg_text_box_lengths"].append(
+                text_area / text_boxes if text_boxes else 0.0
+            )
+            features["page_level_text_area_ratios"].append(text_area / page_area)
+            features["page_level_hidden_char_counts"].append(hidden_chars)
+            features["page_level_hidden_text_box_counts"].append(hidden_boxes)
+            features["page_level_hidden_avg_text_box_lengths"].append(
+                hidden_area / hidden_boxes if hidden_boxes else 0.0
+            )
+            features["page_level_hidden_text_area_ratios"].append(hidden_area / page_area)
+
+            # --- Images ---
+            total_imgs = 0
+            non_junk_imgs = 0
+            non_junk_rects: list[list[Any]] = []
+            try:
+                for img_def in page.get_images(full=False):
+                    xref = img_def[0]
+                    if xref == 0:
+                        continue
+                    rects = page.get_image_rects(xref, transform=False)
+                    total_imgs += len(rects)
+                    if xref not in junk_xrefs:
+                        non_junk_imgs += len(rects)
+                        for r in rects:
+                            if r.is_empty or r.is_infinite:
+                                continue
+                            non_junk_rects.append([r.x0, r.y0, r.x1, r.y1, xref])
+            except Exception:
+                pass
+
+            features["page_level_image_counts"].append(total_imgs)
+            features["page_level_non_junk_image_counts"].append(non_junk_imgs)
+
+            merged = self._heuristic_merge_image_strips_on_page(
+                non_junk_rects, page_rect.width, page_rect.height
+            )
+            strip_areas = [abs(b[2] - b[0]) * abs(b[3] - b[1]) for b in merged]
+            if strip_areas:
+                features["page_level_max_merged_strip_areas"].append(max(strip_areas) / page_area)
+                features["page_level_bitmap_proportions"].append(sum(strip_areas) / page_area)
+            else:
+                features["page_level_max_merged_strip_areas"].append(0.0)
+                features["page_level_bitmap_proportions"].append(0.0)
+
+            # --- Drawings / vector graphics ---
+            stroke_count = 0
+            vector_objs = 0
+            try:
+                drawings = page.get_cdrawings()
+                vector_objs = len(drawings)
+                for path in drawings:
+                    for item in path.get("items", []):
+                        if item[0] in ("l", "c", "q"):
+                            stroke_count += 1
+                    if path.get("rect") or path.get("quad"):
+                        if path.get("stroke_opacity", 1) > 0 and path.get("color"):
+                            stroke_count += 1
+            except Exception:
+                pass
+            features["page_level_drawing_strokes_count"].append(stroke_count)
+            features["page_level_vector_graphics_obj_count"].append(vector_objs)
+
+        return features
+
+    def extract_all_features(self, doc: pymupdf.Document) -> list[dict[str, Any]]:
+        chunks = self._get_sampled_page_indices(doc)
+        return [self.compute_features_per_chunk(doc, c) for c in chunks]
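
To make the extractor's contract concrete, here is a minimal sketch (not part of the diff) of turning one PDF into the flat 124-key dict. The names come from this file; the explicit seeding is one way to satisfy the determinism notes in the docstrings, and "sample.pdf" is a hypothetical input path:

    # Sketch: one PDF -> one flat feature dict for the router model.
    import random

    import numpy as np
    import pymupdf

    from pdfsys_router.feature_extractor import (
        PDFFeatureExtractor,
        flatten_per_page_features,
    )

    random.seed(0)     # page sampling uses the stdlib RNG
    np.random.seed(0)  # padding below 8 pages uses np.random.choice

    doc = pymupdf.open("sample.pdf")  # hypothetical input
    nested = PDFFeatureExtractor(num_pages_to_sample=8).extract_all_features(doc)[0]
    flat = flatten_per_page_features(nested)  # keys end in _page1 .. _page8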

packages/pdfsys-router/src/pdfsys_router/xgb_model.py
@@ -0,0 +1,66 @@
+"""Thin loader around the FinePDFs XGBoost ``xgb.ubj`` weights.
+
+The model is a binary classifier where class 1 = "needs OCR" (scanned /
+garbled / image-heavy / form). It takes a 124-column feature vector whose
+column order is fixed by :func:`feature_extractor.flatten_per_page_features`.
+
+We keep the loader tiny on purpose: the calibration between feature layout
+and column order lives entirely in ``feature_extractor.py`` — this file
+only knows "give me a dict-of-features, I'll give you a probability".
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from xgboost import XGBClassifier
+
+
+class XgbRouterModel:
+    """Lazy-loading wrapper around an ``xgb.ubj`` binary classifier."""
+
+    def __init__(self, path_to_model: str | Path) -> None:
+        self.path_to_model = Path(path_to_model)
+        self._model: XGBClassifier | None = None
+
+    @property
+    def model(self) -> XGBClassifier:
+        if self._model is None:
+            if not self.path_to_model.is_file():
+                raise FileNotFoundError(
+                    f"XGBoost weights not found at {self.path_to_model}. "
+                    "Run `python -m pdfsys_router.download_weights` to fetch them."
+                )
+            m = XGBClassifier()
+            m.load_model(str(self.path_to_model))
+            self._model = m
+        return self._model
+
+    def predict_proba(self, features: dict[str, float]) -> float:
+        """Return P(class=1, i.e. needs OCR)."""
+        df = pd.DataFrame([features])
+        # Column ordering must match the training schema — realign using
+        # the model's recorded feature_names_in_ when available.
+        names = getattr(self.model, "feature_names_in_", None)
+        if names is not None:
+            df = df.reindex(columns=list(names), fill_value=0)
+        probs = self.model.predict_proba(df)
+        return float(probs[0][1])
+
+    @property
+    def feature_names(self) -> list[str]:
+        names = getattr(self.model, "feature_names_in_", None)
+        if names is None:
+            return []
+        return list(names)
+
+    @property
+    def n_features(self) -> int:
+        return int(getattr(self.model, "n_features_in_", 0))
+
+
+def default_weights_path() -> Path:
+    """Return the canonical on-disk location of the bundled weights."""
+    return Path(__file__).resolve().parent.parent.parent / "models" / "xgb_classifier.ubj"
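
Wiring the two files together, the router's hot path reduces to roughly the following (a sketch under the same assumptions as the previous one; the 0.5 cutoff is purely illustrative, since the actual backend thresholds live in Router.classify(), not in this loader):

    # Sketch: flat features -> P(needs OCR). The threshold shown is
    # illustrative; the real MUPDF-vs-PIPELINE decision is Router.classify()'s.
    from pdfsys_router.xgb_model import XgbRouterModel, default_weights_path

    model = XgbRouterModel(default_weights_path())
    ocr_prob = model.predict_proba(flat)  # `flat` from the previous sketch
    backend = "PIPELINE" if ocr_prob >= 0.5 else "MUPDF"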