Contributing to pdfsys-mnbvc
Dev environment setup
# Prerequisites: Python >= 3.11, uv >= 0.4
uv sync # installs all workspace packages in editable mode
python -m pdfsys_router.download_weights # one-time: fetch XGBoost weights (257 KB)
If you'll be working on quality scoring, torch + transformers are pulled in by pdfsys-bench. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set HF_HOME to control the cache location.
Project structure
pdfsystem_mnbvc/
├── pyproject.toml                       # uv workspace root (meta-package)
├── packages/
│   ├── pdfsys-core/                     # shared types, enums, layout cache, serde
│   ├── pdfsys-router/                   # Stage-A XGBoost classifier
│   │   ├── models/                      # gitignored xgb_classifier.ubj lives here
│   │   └── src/pdfsys_router/
│   │       ├── feature_extractor.py     # 124-feature PyMuPDF extractor
│   │       ├── xgb_model.py             # lazy model loader
│   │       ├── classifier.py            # Router.classify() → RouterDecision
│   │       └── download_weights.py      # fetch weights from HF LFS
│   ├── pdfsys-parser-mupdf/             # text-ok fast path (PyMuPDF blocks → Markdown)
│   ├── pdfsys-parser-pipeline/          # OCR backend (stub)
│   ├── pdfsys-parser-vlm/               # VLM backend (stub)
│   ├── pdfsys-layout-analyser/          # layout model runner (stub)
│   └── pdfsys-bench/                    # evaluation harness + quality scorer
│       ├── omnidocbench_100/            # gitignored bench dataset
│       └── src/pdfsys_bench/
│           ├── quality.py               # ModernBERT-large OCR quality scorer
│           ├── loop.py                  # router → parser → scorer → JSONL runner
│           └── __main__.py              # CLI entry point
└── out/                                 # gitignored run outputs
Code conventions
Naming
- Package dirs: pdfsys-<name> (kebab-case in pyproject.toml and directory names).
- Import names: pdfsys_<name> (snake_case, matching src/pdfsys_<name>/).
- All packages live under packages/ and use the [tool.uv.workspace] editable pattern.
Types and immutability
- Core data structures are @dataclass(frozen=True, slots=True).
- Enums live in pdfsys_core.types.
- BBox coordinates are always normalized to [0, 1]; convert to pixels/points at the call site.
- Parser backends all emit ExtractedDoc with a tuple[Segment, ...]; the schema is backend-agnostic.
Error handling
- Router.classify() never raises. Errors are surfaced via RouterDecision.error.
- Parser extract_doc() may raise; the bench loop catches and records errors in JSONL.
- Prefer explicit except Exception with a recorded message over silent swallowing.
Feature extractor parity
The feature_extractor.py in pdfsys-router is a direct port of FinePDFs'
blocks/predictor/ocr_predictor.py. The 124-column feature vector MUST match
the upstream layout exactly, because the XGBoost weights depend on column order. If you
change any feature extraction logic, verify against the FinePDFs reference output
before merging.
The feature ordering is:
- num_pages_successfully_sampled (doc-level)
- garbled_text_ratio (doc-level)
- is_form (doc-level)
- creator_or_producer_is_known_scanner (doc-level)
- page_level_unique_font_counts_page1 through _page8
- ... (15 page-level features × 8 pages = 120 columns)
Total: 4 + 120 = 124 features.
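The arithmetic can be checked with a small sketch that builds the column names and validates a vector's length. Only the four doc-level names and page_level_unique_font_counts come from the list above; the other 14 page-level names are placeholders, and the feature-major ordering (all 8 pages of one feature before the next feature) is an assumption that should be confirmed against the FinePDFs reference.

```python
DOC_LEVEL = [
    "num_pages_successfully_sampled",
    "garbled_text_ratio",
    "is_form",
    "creator_or_producer_is_known_scanner",
]

# Only the first page-level name is given in the text; the rest are placeholders.
PAGE_LEVEL = ["page_level_unique_font_counts"] + [
    f"page_level_placeholder_{i}" for i in range(2, 16)
]
NUM_PAGES = 8


def build_columns() -> list[str]:
    """Build the column names, assuming feature-major ordering."""
    page_columns = [
        f"{name}_page{page}"
        for name in PAGE_LEVEL
        for page in range(1, NUM_PAGES + 1)
    ]
    return DOC_LEVEL + page_columns


COLUMNS = build_columns()


def check_vector(vector: list[float]) -> dict[str, float]:
    """Fail loudly on a length mismatch, then pair each value with its column."""
    if len(vector) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} features, got {len(vector)}")
    return dict(zip(COLUMNS, vector))


named = check_vector([0.0] * 124)
```

A length check like this catches added or dropped columns, but not reordered ones; only a comparison against the reference output does that.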
Dependencies
- pdfsys-core has zero external dependencies. Keep it that way.
- Heavy deps (torch, transformers) are lazy-imported so that import pdfsys_bench doesn't pull them in at module scope.
- XGBoost model weights are NOT committed to the repo. They're downloaded on demand via download_weights.py.
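One common way to get the lazy-import behaviour is a function-scope import that runs on first use. The sketch below uses the standard-library statistics module as a stand-in for torch/transformers so that it runs anywhere; the class and method names are hypothetical and do not mirror quality.py.

```python
class LazyScorerSketch:
    """Hypothetical scorer that defers its heavy import until first use."""

    def __init__(self) -> None:
        self._backend = None  # nothing heavy is imported at construction time

    def _load(self):
        if self._backend is None:
            # Stand-in for `import torch` / `import transformers`.
            import statistics

            self._backend = statistics
        return self._backend

    def score(self, values: list[float]) -> float:
        """Trigger the deferred import, then compute a toy score."""
        return self._load().mean(values)


scorer = LazyScorerSketch()
loaded_before = scorer._backend is not None
result = scorer.score([0.25, 0.75])
loaded_after = scorer._backend is not None
```

With this shape, importing the module that defines the class stays cheap, and the cost is paid only by code paths that actually score something.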
Running the MVP
# Full run on OmniDocBench-100 (takes ~4 min on CPU)
python -m pdfsys_bench \
--pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
--out out/bench_omnidoc100.jsonl \
--markdown-dir out/bench_omnidoc100_md
# Fast smoke test (no quality scoring)
python -m pdfsys_bench \
--pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
--out out/smoke.jsonl \
--limit 5 --no-quality
Output: one JSONL file (per-doc results) + one .summary.json (aggregate stats).
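For a quick look at a run, the JSONL file can be scanned line by line. The sketch below counts records whose error field is set; the field names pdf and error are assumptions about the record schema, so check them against a real output file first.

```python
import json
import tempfile
from pathlib import Path


def count_errors(jsonl_path: Path) -> tuple[int, int]:
    """Return (total records, records with a non-empty `error` field)."""
    total = failed = 0
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get("error"):  # assumed field name
                failed += 1
    return total, failed


# Demo on a throwaway file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "smoke.jsonl"
    records = [
        {"pdf": "a.pdf", "error": None},
        {"pdf": "b.pdf", "error": "ValueError: example failure"},
        {"pdf": "c.pdf", "error": None},
    ]
    sample.write_text(
        "\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8"
    )
    total, failed = count_errors(sample)
```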
Adding a new parser backend
- Implement the backend in its package under packages/pdfsys-parser-<name>/.
- The entry point should accept a Path and return ExtractedDoc (from pdfsys-core).
- Each Segment must have page_index, type (RegionType), content, and ideally a normalized BBox.
- Call merge_segments_to_markdown(segments) from pdfsys-core to produce the markdown field.
- Wire it into loop.py by handling the corresponding Backend enum value.
Adding new features to the router
Do not modify feature_extractor.py unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in classifier.py rather than changing the feature vector.
Commit conventions
Commit messages follow the Conventional Commits format:
feat(router): add scanner metadata detection
fix(parser-mupdf): handle zero-width bbox on empty pages
docs: update quickstart for new deps
chore: bump pymupdf to 1.25
Scope is the package name without the pdfsys- prefix (e.g. router, core, bench, parser-mupdf).