# Contributing to pdfsys-mnbvc

## Dev environment setup

```bash
# Prerequisites: Python >= 3.11, uv >= 0.4
uv sync                             # installs all workspace packages in editable mode
python -m pdfsys_router.download_weights   # one-time: fetch XGBoost weights (257 KB)
```

If you'll be working on quality scoring, torch + transformers are pulled in by `pdfsys-bench`. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set `HF_HOME` to control the cache location.

## Project structure

```
pdfsystem_mnbvc/
├── pyproject.toml              # uv workspace root (meta-package)
├── packages/
│   ├── pdfsys-core/            # shared types, enums, layout cache, serde
│   ├── pdfsys-router/          # Stage-A XGBoost classifier
│   │   ├── models/             # gitignored xgb_classifier.ubj lives here
│   │   └── src/pdfsys_router/
│   │       ├── feature_extractor.py   # 124-feature PyMuPDF extractor
│   │       ├── xgb_model.py           # lazy model loader
│   │       ├── classifier.py          # Router.classify() → RouterDecision
│   │       └── download_weights.py    # fetch weights from HF LFS
│   ├── pdfsys-parser-mupdf/    # text-ok fast path (PyMuPDF blocks → Markdown)
│   ├── pdfsys-parser-pipeline/ # OCR backend (stub)
│   ├── pdfsys-parser-vlm/      # VLM backend (stub)
│   ├── pdfsys-layout-analyser/ # layout model runner (stub)
│   └── pdfsys-bench/           # evaluation harness + quality scorer
│       ├── omnidocbench_100/   # gitignored bench dataset
│       └── src/pdfsys_bench/
│           ├── quality.py      # ModernBERT-large OCR quality scorer
│           ├── loop.py         # router → parser → scorer → JSONL runner
│           └── __main__.py     # CLI entry point
└── out/                        # gitignored run outputs
```

## Code conventions

### Naming

- Package dirs: `pdfsys-<name>` (kebab-case in pyproject.toml and directory names).
- Import names: `pdfsys_<name>` (snake_case, matching `src/pdfsys_<name>/`).
- All packages live under `packages/` and use the `[tool.uv.workspace]` editable pattern.

### Types and immutability

- Core data structures are `@dataclass(frozen=True, slots=True)`.
- Enums live in `pdfsys_core.types`.
- BBox coordinates are always normalized to `[0, 1]`; convert to pixels/points at the call site.
- Parser backends all emit `ExtractedDoc` with a `tuple[Segment, ...]`; the schema is backend-agnostic.

### Error handling

- `Router.classify()` never raises. Errors are surfaced via `RouterDecision.error`.
- Parser `extract_doc()` may raise; the bench loop catches and records errors in JSONL.
- Prefer explicit `except Exception` with a recorded message over silent swallowing.

### Feature extractor parity

The `feature_extractor.py` in `pdfsys-router` is a direct port of FinePDFs'
`blocks/predictor/ocr_predictor.py`. The 124-column feature vector MUST match
the upstream layout exactly, because the XGBoost weights depend on column order.
If you change any feature extraction logic, verify against the FinePDFs reference
output before merging.

The feature ordering is:
1. `num_pages_successfully_sampled` (doc-level)
2. `garbled_text_ratio` (doc-level)
3. `is_form` (doc-level)
4. `creator_or_producer_is_known_scanner` (doc-level)
5. `page_level_unique_font_counts_page1` through `_page8` (the first page-level feature)
6. ... (remaining page-level features; 15 page-level features × 8 pages = 120 columns in total)

Total: 4 + 120 = 124 features.
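
A minimal sketch of the column arithmetic above. Only the four doc-level names and `page_level_unique_font_counts` come from this document; the other 14 page-level names are placeholders, and the feature-major ordering (all eight pages of one feature before the next feature) is an assumption read off the list. Check the FinePDFs reference for the authoritative layout.

```python
DOC_LEVEL = [
    "num_pages_successfully_sampled",
    "garbled_text_ratio",
    "is_form",
    "creator_or_producer_is_known_scanner",
]

# Only the first name is documented above; the rest are placeholders.
PAGE_LEVEL = ["page_level_unique_font_counts"] + [
    f"page_level_placeholder_{i}" for i in range(2, 16)
]
NUM_PAGES = 8


def feature_columns() -> list[str]:
    """Build the column list: doc-level first, then each page-level feature for pages 1..8."""
    columns = list(DOC_LEVEL)
    for name in PAGE_LEVEL:
        for page in range(1, NUM_PAGES + 1):
            columns.append(f"{name}_page{page}")
    return columns


columns = feature_columns()
```

A parity check in a test suite could compare a list like this against the column names produced by the upstream extractor and fail on the first mismatch.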

### Dependencies

- `pdfsys-core` has **zero** external dependencies. Keep it that way.
- Heavy deps (torch, transformers) are lazy-imported so that `import pdfsys_bench` doesn't pull them in at module scope.
- XGBoost model weights are NOT committed to the repo. They're downloaded on demand via `download_weights.py`.
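
One way to keep heavy dependencies out of module scope is to defer the import until first use. This is a sketch of that pattern, not the code in `pdfsys-bench`; `LazyScorer` and its attribute names are hypothetical.

```python
import importlib
from types import ModuleType


class LazyScorer:
    """Hypothetical scorer that imports its heavy dependency on first use."""

    def __init__(self, module_name: str = "torch") -> None:
        self._module_name = module_name
        self._module: ModuleType | None = None  # nothing imported yet

    @property
    def backend(self) -> ModuleType:
        # The import cost is paid here, once, rather than at `import` time.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module


# Demonstrated with a stdlib module so the sketch runs without torch installed.
scorer = LazyScorer("fractions")
loaded_before = scorer._module is not None
half = scorer.backend.Fraction(1, 2)
```

With this shape, importing the package stays cheap, and only code paths that actually score text trigger the heavy import.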

## Running the MVP

```bash
# Full run on OmniDocBench-100 (takes ~4 min on CPU)
python -m pdfsys_bench \
  --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
  --out out/bench_omnidoc100.jsonl \
  --markdown-dir out/bench_omnidoc100_md

# Fast smoke test (no quality scoring)
python -m pdfsys_bench \
  --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
  --out out/smoke.jsonl \
  --limit 5 --no-quality
```

Output: one JSONL file (per-doc results) + one `.summary.json` (aggregate stats).
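
For a quick look at a run, the JSONL file can be read line by line. The sketch below counts records and failures; the `error` key is an assumption about the per-doc schema, so adjust `error_key` to whatever field the bench loop actually writes.

```python
import json
from pathlib import Path


def summarize(jsonl_path: Path, error_key: str = "error") -> dict[str, int]:
    """Count records and records with a non-empty error field (key name is assumed)."""
    total = 0
    failed = 0
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get(error_key):
                failed += 1
    return {"total": total, "failed": failed}
```

For example, `summarize(Path("out/smoke.jsonl"))` returns a dict with the total and failed counts for the smoke run.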

## Adding a new parser backend

1. Implement the backend in its package under `packages/pdfsys-parser-<name>/`.
2. The entry point should accept a `Path` and return `ExtractedDoc` (from `pdfsys-core`).
3. Each `Segment` must have `page_index`, `type` (RegionType), `content`, and ideally a normalized `BBox`.
4. Call `merge_segments_to_markdown(segments)` from `pdfsys-core` to produce the `markdown` field.
5. Wire it into `loop.py` by handling the corresponding `Backend` enum value.

## Adding new features to the router

**Do not** modify `feature_extractor.py` unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in `classifier.py` rather than changing the feature vector.

## Commit conventions

Commit messages follow conventional commits:

```
feat(router): add scanner metadata detection
fix(parser-mupdf): handle zero-width bbox on empty pages
docs: update quickstart for new deps
chore: bump pymupdf to 1.25
```

Scope is the package name without the `pdfsys-` prefix (e.g. `router`, `core`, `bench`, `parser-mupdf`).
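
A commit header can be checked locally with a small script such as the one below. The list of accepted types is an assumption (the examples above only show `feat`, `fix`, `docs`, and `chore`), and `check_header` is a hypothetical helper, not an existing repo tool.

```python
import re

# Assumed set of types; extend or trim to match what the maintainers accept.
_TYPES = "feat|fix|docs|chore|refactor|test|perf|build|ci|style|revert"
_HEADER = re.compile(
    rf"^(?P<type>{_TYPES})(?:\((?P<scope>[a-z0-9][a-z0-9-]*)\))?: (?P<subject>\S.*)$"
)


def check_header(header: str) -> bool:
    """Return True if the first line of a commit message follows the convention."""
    match = _HEADER.match(header)
    if match is None:
        return False
    scope = match.group("scope")
    # The scope omits the `pdfsys-` prefix, e.g. `router`, not `pdfsys-router`.
    return scope is None or not scope.startswith("pdfsys-")
```

For instance, `check_header("feat(router): add scanner metadata detection")` passes, while a header scoped as `pdfsys-router` is rejected.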