DocPipe / CONTRIBUTING.md
yin
docs: add project README, CONTRIBUTING guide, and per-package READMEs
b8ca6f2

A newer version of the Gradio SDK is available: 6.13.0

Upgrade

Contributing to pdfsys-mnbvc

Dev environment setup

# Prerequisites: Python >= 3.11, uv >= 0.4
uv sync                             # installs all workspace packages in editable mode
python -m pdfsys_router.download_weights   # one-time: fetch XGBoost weights (257 KB)

If you'll be working on quality scoring, torch + transformers are pulled in by pdfsys-bench. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set HF_HOME to control the cache location.

Project structure

pdfsystem_mnbvc/
β”œβ”€β”€ pyproject.toml              # uv workspace root (meta-package)
β”œβ”€β”€ packages/
β”‚   β”œβ”€β”€ pdfsys-core/            # shared types, enums, layout cache, serde
β”‚   β”œβ”€β”€ pdfsys-router/          # Stage-A XGBoost classifier
β”‚   β”‚   β”œβ”€β”€ models/             # gitignored xgb_classifier.ubj lives here
β”‚   β”‚   └── src/pdfsys_router/
β”‚   β”‚       β”œβ”€β”€ feature_extractor.py   # 124-feature PyMuPDF extractor
β”‚   β”‚       β”œβ”€β”€ xgb_model.py           # lazy model loader
β”‚   β”‚       β”œβ”€β”€ classifier.py          # Router.classify() β†’ RouterDecision
β”‚   β”‚       └── download_weights.py    # fetch weights from HF LFS
β”‚   β”œβ”€β”€ pdfsys-parser-mupdf/    # text-ok fast path (PyMuPDF blocks β†’ Markdown)
β”‚   β”œβ”€β”€ pdfsys-parser-pipeline/ # OCR backend (stub)
β”‚   β”œβ”€β”€ pdfsys-parser-vlm/      # VLM backend (stub)
β”‚   β”œβ”€β”€ pdfsys-layout-analyser/ # layout model runner (stub)
β”‚   └── pdfsys-bench/           # evaluation harness + quality scorer
β”‚       β”œβ”€β”€ omnidocbench_100/   # gitignored bench dataset
β”‚       └── src/pdfsys_bench/
β”‚           β”œβ”€β”€ quality.py      # ModernBERT-large OCR quality scorer
β”‚           β”œβ”€β”€ loop.py         # router β†’ parser β†’ scorer β†’ JSONL runner
β”‚           └── __main__.py     # CLI entry point
└── out/                        # gitignored run outputs

Code conventions

Naming

  • Package dirs: pdfsys-<name> (kebab-case in pyproject.toml and directory names).
  • Import names: pdfsys_<name> (snake_case, matching src/pdfsys_<name>/).
  • All packages live under packages/ and use the [tool.uv.workspace] editable pattern.

Types and immutability

  • Core data structures are @dataclass(frozen=True, slots=True).
  • Enums live in pdfsys_core.types.
  • BBox coordinates are always normalized to [0, 1]; convert to pixels/points at the call site.
  • Parser backends all emit ExtractedDoc with a tuple[Segment, ...] β€” the schema is backend-agnostic.

Error handling

  • Router.classify() never raises. Errors are surfaced via RouterDecision.error.
  • Parser extract_doc() may raise; the bench loop catches and records errors in JSONL.
  • Prefer explicit except Exception with a recorded message over silent swallowing.

Feature extractor parity

The feature_extractor.py in pdfsys-router is a direct port of FinePDFs' blocks/predictor/ocr_predictor.py. The 124-column feature vector MUST match the upstream layout exactly β€” the XGBoost weights depend on column order. If you change any feature extraction logic, verify against the FinePDFs reference output before merging.

The feature ordering is:

  1. num_pages_successfully_sampled (doc-level)
  2. garbled_text_ratio (doc-level)
  3. is_form (doc-level)
  4. creator_or_producer_is_known_scanner (doc-level)
  5. page_level_unique_font_counts_page1 through _page8
  6. ... (15 page-level features Γ— 8 pages = 120 columns)

Total: 4 + 120 = 124 features.

Dependencies

  • pdfsys-core has zero external dependencies. Keep it that way.
  • Heavy deps (torch, transformers) are lazy-imported so that import pdfsys_bench doesn't pull them in at module scope.
  • XGBoost model weights are NOT committed to the repo. They're downloaded on demand via download_weights.py.

Running the MVP

# Full run on OmniDocBench-100 (takes ~4 min on CPU)
python -m pdfsys_bench \
  --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
  --out out/bench_omnidoc100.jsonl \
  --markdown-dir out/bench_omnidoc100_md

# Fast smoke test (no quality scoring)
python -m pdfsys_bench \
  --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
  --out out/smoke.jsonl \
  --limit 5 --no-quality

Output: one JSONL file (per-doc results) + one .summary.json (aggregate stats).

Adding a new parser backend

  1. Implement the backend in its package under packages/pdfsys-parser-<name>/.
  2. The entry point should accept a Path and return ExtractedDoc (from pdfsys-core).
  3. Each Segment must have page_index, type (RegionType), content, and ideally a normalized BBox.
  4. Call merge_segments_to_markdown(segments) from pdfsys-core to produce the markdown field.
  5. Wire it into loop.py by handling the corresponding Backend enum value.

Adding new features to the router

Do not modify feature_extractor.py unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in classifier.py rather than changing the feature vector.

Commit conventions

Commit messages follow conventional commits:

feat(router): add scanner metadata detection
fix(parser-mupdf): handle zero-width bbox on empty pages
docs: update quickstart for new deps
chore: bump pymupdf to 1.25

Scope is the package name without the pdfsys- prefix (e.g. router, core, bench, parser-mupdf).