yin Claude Opus 4.6 committed on
Commit
b8ca6f2
·
1 Parent(s): d423504

docs: add project README, CONTRIBUTING guide, and per-package READMEs


Rewrite the top-level README with:
- Quick start (uv sync + download weights + run bench CLI)
- Architecture diagram with implemented/stub status table
- MVP benchmark results on OmniDocBench-100
- Key data structures (RouterDecision, ExtractedDoc, QualityScore)
- Design principles, CLI reference, output format spec

Add CONTRIBUTING.md covering:
- Dev environment setup
- Project structure overview
- Code conventions (naming, immutability, error handling)
- Feature extractor parity rules (124-column contract)
- How to add a new parser backend
- Commit message conventions

Add per-package READMEs for all 7 workspace packages explaining
each one's role, usage, and scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CONTRIBUTING.md ADDED
@@ -0,0 +1,126 @@
+ # Contributing to pdfsys-mnbvc
+
+ ## Dev environment setup
+
+ ```bash
+ # Prerequisites: Python >= 3.11, uv >= 0.4
+ uv sync                                   # installs all workspace packages in editable mode
+ python -m pdfsys_router.download_weights  # one-time: fetch XGBoost weights (257 KB)
+ ```
+
+ If you'll be working on quality scoring, torch + transformers are pulled in by `pdfsys-bench`. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set `HF_HOME` to control the cache location.
+
+ ## Project structure
+
+ ```
+ pdfsystem_mnbvc/
+ ├── pyproject.toml                 # uv workspace root (meta-package)
+ ├── packages/
+ │   ├── pdfsys-core/               # shared types, enums, layout cache, serde
+ │   ├── pdfsys-router/             # Stage-A XGBoost classifier
+ │   │   ├── models/                # gitignored xgb_classifier.ubj lives here
+ │   │   └── src/pdfsys_router/
+ │   │       ├── feature_extractor.py   # 124-feature PyMuPDF extractor
+ │   │       ├── xgb_model.py           # lazy model loader
+ │   │       ├── classifier.py          # Router.classify() → RouterDecision
+ │   │       └── download_weights.py    # fetch weights from HF LFS
+ │   ├── pdfsys-parser-mupdf/       # text-ok fast path (PyMuPDF blocks → Markdown)
+ │   ├── pdfsys-parser-pipeline/    # OCR backend (stub)
+ │   ├── pdfsys-parser-vlm/         # VLM backend (stub)
+ │   ├── pdfsys-layout-analyser/    # layout model runner (stub)
+ │   └── pdfsys-bench/              # evaluation harness + quality scorer
+ │       ├── omnidocbench_100/      # gitignored bench dataset
+ │       └── src/pdfsys_bench/
+ │           ├── quality.py         # ModernBERT-large OCR quality scorer
+ │           ├── loop.py            # router → parser → scorer → JSONL runner
+ │           └── __main__.py        # CLI entry point
+ └── out/                           # gitignored run outputs
+ ```
+
+ ## Code conventions
+
+ ### Naming
+
+ - Package dirs: `pdfsys-<name>` (kebab-case in pyproject.toml and directory names).
+ - Import names: `pdfsys_<name>` (snake_case, matching `src/pdfsys_<name>/`).
+ - All packages live under `packages/` and use the `[tool.uv.workspace]` editable pattern.
+
+ ### Types and immutability
+
+ - Core data structures are `@dataclass(frozen=True, slots=True)`.
+ - Enums live in `pdfsys_core.types`.
+ - BBox coordinates are always normalized to `[0, 1]`; convert to pixels/points at the call site.
+ - Parser backends all emit `ExtractedDoc` with a `tuple[Segment, ...]` — the schema is backend-agnostic.
+
+ ### Error handling
+
+ - `Router.classify()` never raises. Errors are surfaced via `RouterDecision.error`.
+ - Parser `extract_doc()` may raise; the bench loop catches and records errors in JSONL.
+ - Prefer explicit `except Exception` with a recorded message over silent swallowing.
+
+ ### Feature extractor parity
+
+ The `feature_extractor.py` in `pdfsys-router` is a direct port of FinePDFs'
+ `blocks/predictor/ocr_predictor.py`. The 124-column feature vector MUST match
+ the upstream layout exactly — the XGBoost weights depend on column order. If you
+ change any feature extraction logic, verify against the FinePDFs reference output
+ before merging.
+
+ The feature ordering is:
+ 1. `num_pages_successfully_sampled` (doc-level)
+ 2. `garbled_text_ratio` (doc-level)
+ 3. `is_form` (doc-level)
+ 4. `creator_or_producer_is_known_scanner` (doc-level)
+ 5. `page_level_unique_font_counts_page1` through `_page8`
+ 6. ... (15 page-level features × 8 pages = 120 columns)
+
+ Total: 4 + 120 = 124 features.
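As a sanity check, the column contract can be spelled out mechanically. This sketch assumes only what the list above states: 4 doc-level columns, then each page-level feature expanded across pages 1-8. Apart from `page_level_unique_font_counts`, the page-level names below are placeholders, not the real names from `feature_extractor.py`.

```python
DOC_LEVEL = [
    "num_pages_successfully_sampled",
    "garbled_text_ratio",
    "is_form",
    "creator_or_producer_is_known_scanner",
]

# One real name from the list above plus 14 placeholder names.
PAGE_LEVEL = ["page_level_unique_font_counts"] + [
    f"page_level_placeholder_{i}" for i in range(14)
]

def column_layout() -> list[str]:
    """Doc-level columns first, then each page-level feature across 8 sampled pages."""
    cols = list(DOC_LEVEL)
    for feat in PAGE_LEVEL:
        cols.extend(f"{feat}_page{p}" for p in range(1, 9))
    return cols

assert len(column_layout()) == 4 + 15 * 8 == 124
```

A unit test asserting the exact column-name list against the FinePDFs reference is the cheapest way to catch an accidental reorder.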
+
+ ### Dependencies
+
+ - `pdfsys-core` has **zero** external dependencies. Keep it that way.
+ - Heavy deps (torch, transformers) are lazy-imported so that `import pdfsys_bench` doesn't pull them in at module scope.
+ - XGBoost model weights are NOT committed to the repo. They're downloaded on demand via `download_weights.py`.
+
+ ## Running the MVP
+
+ ```bash
+ # Full run on OmniDocBench-100 (takes ~4 min on CPU)
+ python -m pdfsys_bench \
+     --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
+     --out out/bench_omnidoc100.jsonl \
+     --markdown-dir out/bench_omnidoc100_md
+
+ # Fast smoke test (no quality scoring)
+ python -m pdfsys_bench \
+     --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
+     --out out/smoke.jsonl \
+     --limit 5 --no-quality
+ ```
+
+ Output: one JSONL file (per-doc results) + one `.summary.json` (aggregate stats).
+
+ ## Adding a new parser backend
+
+ 1. Implement the backend in its package under `packages/pdfsys-parser-<name>/`.
+ 2. The entry point should accept a `Path` and return `ExtractedDoc` (from `pdfsys-core`).
+ 3. Each `Segment` must have `page_index`, `type` (RegionType), `content`, and ideally a normalized `BBox`.
+ 4. Call `merge_segments_to_markdown(segments)` from `pdfsys-core` to produce the `markdown` field.
+ 5. Wire it into `loop.py` by handling the corresponding `Backend` enum value.
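A skeleton of steps 1-4, with local stand-ins for the `pdfsys-core` types so it runs standalone. A real backend would import `Segment`, `ExtractedDoc`, and `merge_segments_to_markdown` from `pdfsys_core` instead of redefining them.

```python
from dataclasses import dataclass
from pathlib import Path

# Stand-ins for the pdfsys_core types, kept minimal for illustration.
@dataclass(frozen=True)
class Segment:
    page_index: int
    type: str                                # RegionType value, e.g. "text"
    content: str
    bbox: tuple[float, float, float, float]  # normalized to [0, 1]

@dataclass(frozen=True)
class ExtractedDoc:
    backend: str
    segments: tuple[Segment, ...]
    markdown: str

def merge_segments_to_markdown(segments: tuple[Segment, ...]) -> str:
    # The README describes markdown as segment contents merged with blank lines.
    return "\n\n".join(s.content for s in segments)

def extract_doc(pdf_path: Path) -> ExtractedDoc:
    """Entry point shape for a new backend: accept a Path, return an ExtractedDoc."""
    segments = (
        Segment(page_index=0, type="text",
                content=f"Placeholder content for {pdf_path.name}",
                bbox=(0.0, 0.0, 1.0, 0.1)),
    )
    return ExtractedDoc(backend="mybackend", segments=segments,
                        markdown=merge_segments_to_markdown(segments))
```

The point is the shape, not the body: keep the entry-point signature and the frozen output schema, and the bench loop can drive the new backend unchanged.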
+
+ ## Adding new features to the router
+
+ **Do not** modify `feature_extractor.py` unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in `classifier.py` rather than changing the feature vector.
+
+ ## Commit conventions
+
+ Commit messages follow conventional commits:
+
+ ```
+ feat(router): add scanner metadata detection
+ fix(parser-mupdf): handle zero-width bbox on empty pages
+ docs: update quickstart for new deps
+ chore: bump pymupdf to 1.25
+ ```
+
+ Scope is the package name without the `pdfsys-` prefix (e.g. `router`, `core`, `bench`, `parser-mupdf`).
README.md CHANGED
@@ -1,15 +1,46 @@
 # pdfsys-mnbvc
 
- PB-scale PDF → pretraining-data pipeline for the MNBVC corpus project.
 FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
 
- ## Architecture
 
- Two-stage routing, cascaded:
 
 ```
          ┌──────────────┐
- PDF ─►  │ pdfsys-router│  stage A (cheap classifier)
          └──────┬───────┘
                 │
      text-ok ◄──┴──► needs-ocr
@@ -26,32 +57,167 @@ PDF ─► │ pdfsys-router│ stage A (cheap classifier)
      parser-pipeline   parser-vlm
 ```
 
- The `LayoutDocument` produced by `pdfsys-layout-analyser` is cached to disk
- and consumed by **both** the stage-B decision in `pdfsys-router` **and** the
- downstream parser backend — layout inference runs at most once per PDF.
 
 ## Workspace packages
 
- | Package | Role |
- |---|---|
- | `pdfsys-core` | Shared dataclasses (`PdfRecord`, `LayoutDocument`), manifest IO, layout cache. No PDF/ML deps. |
- | `pdfsys-router` | Two-stage router. Stage A text-ok/needs-ocr; Stage B pipeline/vlm from cached layout. |
- | `pdfsys-layout-analyser` | Page layout model runner (PP-DocLayoutV3 / docling-layout-heron). Runs once, writes cache. |
- | `pdfsys-parser-mupdf` | Text-ok backend. PyMuPDF + reading order → Markdown. |
- | `pdfsys-parser-pipeline` | Needs-ocr + simple layout backend. Region-level OCR (RapidOCR / PaddleOCR-classic). |
- | `pdfsys-parser-vlm` | Needs-ocr + complex layout backend. MinerU 2.5 / PaddleOCR-VL on complex regions. |
- | `pdfsys-bench` | Cross-backend throughput / latency / F1 evaluation. |
 
- ## Setup (macOS)
 
 ```bash
- # Requires uv >= 0.4
- uv sync
 ```
 
- Running a single PDF through the pipeline, and orchestration above the
- extraction core (ingest / dedup / quality / tokenize) are not implemented
- yet — see `docs/PRD.md` for the full design.
 
 ## Docs
 
 # pdfsys-mnbvc
 
+ PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
 FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
 
+ ## Current status: MVP closed loop ✅
+
+ The first end-to-end path — **Router → MuPDF parser → OCR quality scorer** — is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is still a stub).
+
+ ## Quick start
+
+ ```bash
+ # 1. Install uv (>= 0.4)
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+
+ # 2. Clone the repo and sync all workspace packages
+ git clone <this-repo-url>
+ cd pdfsystem_mnbvc
+ uv sync
+
+ # 3. Fetch the XGBoost router weights (257 KB, one-time)
+ python -m pdfsys_router.download_weights
+
+ # 4. Run the MVP closed loop on the bench dataset
+ python -m pdfsys_bench \
+     --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
+     --out out/bench_omnidoc100.jsonl \
+     --markdown-dir out/bench_omnidoc100_md
+ ```
+
+ > **Note:** The first run downloads the ModernBERT-large quality scorer
+ > (~800 MB) from HuggingFace Hub. Set `HF_HOME` to control where it's
+ > cached. If you don't need quality scoring, add `--no-quality` to skip it.
+
+ > **Note:** The bench dataset (omnidocbench_100) is NOT committed to the repo.
+ > You need to obtain it separately and place it under
+ > `packages/pdfsys-bench/omnidocbench_100/`.
+
+ ## Architecture
 
 ```
          ┌──────────────┐
+ PDF ──► │ pdfsys-router│  stage A: XGBoost (124 PyMuPDF features)
          └──────┬───────┘
                 │
      text-ok ◄──┴──► needs-ocr
 
      parser-pipeline   parser-vlm
 ```
 
+ ### What's implemented
+
+ | Stage | Status | Description |
+ |-------|--------|-------------|
+ | **Stage-A router** | ✅ | XGBoost binary classifier, ported from FinePDFs. 124 features (4 doc-level + 15 page-level × 8 sampled pages). Routes to `MUPDF` (text-ok) or `PIPELINE` (needs-ocr). |
+ | **MuPDF parser** | ✅ | `page.get_text("blocks", sort=True)` → `ExtractedDoc` with normalized bbox and merged Markdown. Fast path for clean-text PDFs. |
+ | **OCR quality scorer** | ✅ | ModernBERT-large regression head (`HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`). Scores extracted text on a [0, 3] scale. |
+ | **Bench CLI** | ✅ | `python -m pdfsys_bench` — drives the full loop, emits per-doc JSONL + summary JSON. |
+ | Stage-B router | ❌ | Pending layout-analyser and LayoutCache integration. |
+ | Layout analyser | ❌ | PP-DocLayoutV3 / docling-layout-heron runner — not started. |
+ | Pipeline parser | ❌ | Region-level OCR (RapidOCR / PaddleOCR) — not started. |
+ | VLM parser | ❌ | MinerU 2.5 / PaddleOCR-VL on complex regions — not started. |
+
+ ### MVP benchmark results (OmniDocBench-100)
+
+ ```
+ Backend split:  mupdf=70  pipeline=30
+ Avg ocr_prob:   mupdf=0.034  pipeline=0.634
+ Extracted: 70   Errors: 0
+ Quality: avg=1.71  min=0.39  max=2.73
+ Per-doc time: router=49ms  extract=7ms  quality=3.6s
+ ```
 
 ## Workspace packages
 
+ | Package | Role | Dependencies |
+ |---------|------|-------------|
+ | `pdfsys-core` | Shared dataclasses, enums, layout cache, serde. No PDF/ML deps. | stdlib only |
+ | `pdfsys-router` | Stage-A XGBoost classifier + Stage-B layout decision (stub). | pymupdf, xgboost, pandas, numpy, scikit-learn |
+ | `pdfsys-layout-analyser` | Page layout model runner. Stub only. | — |
+ | `pdfsys-parser-mupdf` | Text-ok backend: PyMuPDF block extraction → Markdown. | pymupdf |
+ | `pdfsys-parser-pipeline` | OCR backend for simple layouts. Stub only. | — |
+ | `pdfsys-parser-vlm` | VLM backend for complex layouts. Stub only. | — |
+ | `pdfsys-bench` | Closed-loop evaluation harness + quality scorer. | torch, transformers, pdfsys-router, pdfsys-parser-mupdf |
+
+ ### Package dependency graph
+
+ ```
+ pdfsys-core ◄── pdfsys-router
+             ◄── pdfsys-parser-mupdf
+             ◄── pdfsys-parser-pipeline (stub)
+             ◄── pdfsys-parser-vlm (stub)
+             ◄── pdfsys-layout-analyser (stub)
+
+ pdfsys-router        ◄── pdfsys-bench
+ pdfsys-parser-mupdf  ◄── pdfsys-bench
+ ```
+
+ `pdfsys-core` is the root dependency: every other package imports it, and it has zero external deps beyond the Python stdlib.
+
+ ## Key data structures
 
+ ### Router output (`RouterDecision`)
+
+ ```python
+ @dataclass
+ class RouterDecision:
+     backend: Backend            # MUPDF | PIPELINE | VLM | DEFERRED
+     ocr_prob: float             # P(needs OCR) from XGBoost, [0, 1]
+     num_pages: int
+     is_form: bool
+     garbled_text_ratio: float
+     is_encrypted: bool
+     needs_password: bool
+     features: dict              # full 124-feature vector for debugging
+     error: str | None
+ ```
+
+ ### Parser output (`ExtractedDoc`)
+
+ ```python
+ @dataclass(frozen=True)
+ class ExtractedDoc:
+     sha256: str
+     backend: Backend
+     segments: tuple[Segment, ...]   # ordered block-level units
+     markdown: str                   # segments merged with \n\n
+     stats: dict
+ ```
+
+ Each `Segment` carries `page_index`, a `RegionType` (TEXT/IMAGE/TABLE/FORMULA), `content` (Markdown / HTML / LaTeX), and a normalized `BBox` in [0, 1].
+
+ ### Quality score
+
+ ```python
+ @dataclass
+ class QualityScore:
+     score: float       # [0, 3]: 0=garbage, 1=format issues, 2=minor, 3=clean
+     num_chars: int
+     num_tokens: int
+     model: str
+ ```
+
+ ## Design principles
+
+ 1. **Stateless processing.** No manifest, no central DB. Every PDF produces self-contained output, following FinePDFs' datatrove-style design.
+ 2. **Content-addressable caching.** LayoutCache shards by `sha256 + model_tag`. Bumping the model tag lazily invalidates old entries.
+ 3. **Atomic writes.** All file outputs use `tmp + os.replace()` for crash safety.
+ 4. **Normalized coordinates.** BBox is always `[0, 1]` with origin top-left; backends convert to pixels/points on demand.
+ 5. **Backend-agnostic output.** All three parser backends emit the same `ExtractedDoc` / `Segment` schema, so downstream stages don't need to know which backend produced a document.
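Principle 3 in miniature: a sketch of the `tmp + os.replace()` pattern (the helper name is illustrative, not the actual pdfsys function).

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(path: Path, text: str) -> None:
    """Write via a temp file in the target directory, then rename into place.

    os.replace() is atomic when source and destination are on the same
    filesystem, so a crash mid-write leaves the old file (or nothing),
    never a half-written one.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```

Creating the temp file in the destination directory (not `/tmp`) is what keeps the final rename on one filesystem.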
+
+ ## CLI reference
+
+ ### `python -m pdfsys_bench`
+
+ ```
+ usage: pdfsys-bench [-h] --pdf-dir PDF_DIR --out OUT [--limit N]
+                     [--no-quality] [--quality-model MODEL]
+                     [--router-weights PATH] [--markdown-dir DIR]
+                     [--ocr-threshold FLOAT]
+
+ Run the MVP pdfsys closed loop.
+
+ options:
+   --pdf-dir PATH        Directory of PDFs to process (recursive).
+   --out PATH            Output JSONL path (one line per PDF).
+   --limit N             Cap the number of PDFs processed.
+   --no-quality          Skip the ModernBERT quality scorer.
+   --quality-model ID    HuggingFace model for quality scoring.
+   --router-weights P    Path to xgb_classifier.ubj.
+   --markdown-dir DIR    Dump per-PDF extracted markdown here.
+   --ocr-threshold F     P(ocr) threshold (default: 0.5).
+ ```
+
+ ### `python -m pdfsys_router.download_weights`
+
+ Downloads the XGBoost router weights (~257 KB) from FinePDFs' Git LFS storage.
 
 ```bash
+ python -m pdfsys_router.download_weights          # first time
+ python -m pdfsys_router.download_weights --force  # re-download
+ ```
+
+ ## Output format
+
+ The JSONL output (`--out`) has one JSON object per PDF:
+
+ ```json
+ {
+   "pdf_path": "packages/pdfsys-bench/omnidocbench_100/pdfs/example.pdf",
+   "sha256": "a53b50cb0d3d...",
+   "backend": "mupdf",
+   "ocr_prob": 0.003,
+   "num_pages": 1,
+   "is_form": false,
+   "garbled_text_ratio": 0.0,
+   "router_error": null,
+   "extract_stats": {"page_count": 1, "pages_extracted": 1, "segment_count": 5, "char_count": 5734},
+   "extract_error": null,
+   "quality_score": 2.45,
+   "quality_num_chars": 5734,
+   "quality_num_tokens": 512,
+   "quality_model": "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
+   "markdown_chars": 5734,
+   "wall_ms_router": 42.1,
+   "wall_ms_extract": 6.3,
+   "wall_ms_quality": 3421.0
+ }
 ```
 
+ A companion `.summary.json` file is also written with aggregate statistics.
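The JSONL is easy to aggregate with downstream tooling. A sketch, with field names taken from the record above (the real `.summary.json` layout may differ):

```python
import json
from pathlib import Path

def summarize(jsonl_path: Path) -> dict:
    """Fold a bench JSONL into a few aggregate stats."""
    rows = [json.loads(line)
            for line in jsonl_path.read_text(encoding="utf-8").splitlines()
            if line.strip()]
    # quality_score is null for docs that were routed but not extracted/scored
    scores = [r["quality_score"] for r in rows if r.get("quality_score") is not None]
    backends = sorted({r["backend"] for r in rows})
    return {
        "num_docs": len(rows),
        "backend_split": {b: sum(r["backend"] == b for r in rows) for b in backends},
        "avg_quality": sum(scores) / len(scores) if scores else None,
    }
```

One line per PDF also means a partially completed run is still valid input: every line written before an interruption parses on its own.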
 
 ## Docs
 
packages/pdfsys-core/README.md ADDED
@@ -0,0 +1,17 @@
+ # pdfsys-core
+
+ Shared data contracts for the pdfsys pipeline. Every other package depends on this one.
+
+ ## What's in here
+
+ - **Enums**: `RegionType` (TEXT / IMAGE / TABLE / FORMULA), `Backend` (MUPDF / PIPELINE / VLM / DEFERRED).
+ - **PdfRecord**: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance).
+ - **Layout schema**: `BBox` (normalized [0, 1]), `LayoutRegion`, `LayoutPage`, `LayoutDocument` — the contract between layout-analyser and every parser backend.
+ - **ExtractedDoc / Segment**: Backend-agnostic output schema. All three parser backends emit these.
+ - **LayoutCache**: Content-addressable on-disk cache for LayoutDocuments, keyed by `sha256 + model_tag`.
+ - **PdfsysConfig**: Hierarchical configuration (paths, router, layout, per-backend settings, runtime).
+ - **Serde**: Generic `to_dict()` / `from_dict()` for all the above dataclasses.
+
+ ## Key design rule
+
+ This package has **zero external dependencies** — stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries.
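For intuition, a `sha256 + model_tag` cache key might map to a shard path like this. This is an illustrative sketch only; the real `LayoutCache` layout is defined in this package and may differ.

```python
import hashlib
from pathlib import Path

def layout_cache_path(root: Path, pdf_sha256: str, model_tag: str) -> Path:
    """Key = pdf sha256 + model tag; shard by the digest's first two hex chars.

    Bumping model_tag changes every key, so entries written under the old tag
    are simply never looked up again (lazy invalidation).
    """
    key = hashlib.sha256(f"{pdf_sha256}:{model_tag}".encode()).hexdigest()
    return root / model_tag / key[:2] / f"{key}.json"
```

Note this sketch uses only the stdlib, in keeping with the zero-dependency rule above.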
packages/pdfsys-layout-analyser/README.md ADDED
@@ -0,0 +1,11 @@
+ # pdfsys-layout-analyser
+
+ Page layout model runner. **Stub only — not yet implemented.**
+
+ Will run a layout detection model (PP-DocLayoutV3 / docling-layout-heron) on each page and write a `LayoutDocument` to the `LayoutCache`. This layout is consumed by:
+
+ 1. **pdfsys-router Stage B** — checks `has_complex_content` to decide pipeline vs VLM.
+ 2. **pdfsys-parser-pipeline** — uses region bboxes to crop and OCR individual regions.
+ 3. **pdfsys-parser-vlm** — sends complex regions to a vision-language model.
+
+ Layout inference runs at most once per PDF (keyed by `sha256 + model_tag` in the cache).
packages/pdfsys-parser-mupdf/README.md ADDED
@@ -0,0 +1,30 @@
+ # pdfsys-parser-mupdf
+
+ Text-ok extraction backend. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. `ocr_prob < threshold`).
+
+ ## What it does
+
+ 1. Opens the PDF with PyMuPDF.
+ 2. Iterates every page, calling `page.get_text("blocks", sort=True)`.
+ 3. Filters to text blocks (drops image blocks).
+ 4. Normalizes each block's bbox to [0, 1] coordinates.
+ 5. Produces one `Segment` per block, joined into an `ExtractedDoc` with merged Markdown.
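Steps 3-4 operate on the tuples PyMuPDF returns: `page.get_text("blocks")` yields `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 1 for image blocks. The filter-and-normalize step can therefore be written as a pure function over those tuples (a sketch, not this package's actual code):

```python
def normalize_text_blocks(blocks, page_width: float, page_height: float):
    """Drop image blocks and scale each bbox into [0, 1] page coordinates."""
    segments = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0 or not text.strip():
            continue  # skip image blocks and whitespace-only text
        bbox = (x0 / page_width, y0 / page_height,
                x1 / page_width, y1 / page_height)
        segments.append((bbox, text.strip()))
    return segments
```

Keeping this step free of PyMuPDF objects makes it trivially unit-testable with synthetic block tuples.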
+
+ ## Usage
+
+ ```python
+ from pdfsys_parser_mupdf import extract_doc
+
+ doc = extract_doc("path/to/clean.pdf")
+ print(doc.markdown[:500])
+ print(f"{doc.segment_count} segments, {doc.char_count} chars")
+ ```
+
+ ## Scope
+
+ This backend intentionally does NOT:
+ - Run OCR (that's what parser-pipeline and parser-vlm are for)
+ - Use a layout model (not needed for text-ok PDFs)
+ - Extract images or tables (image-heavy PDFs should be routed elsewhere)
+
+ It is the simplest possible extraction: unwrap PyMuPDF blocks into structured output.
packages/pdfsys-parser-pipeline/README.md ADDED
@@ -0,0 +1,5 @@
+ # pdfsys-parser-pipeline
+
+ Region-level OCR backend for scanned PDFs with simple layouts. **Stub only — not yet implemented.**
+
+ Will take a `LayoutDocument` from the cache, crop each region at the configured DPI, and run OCR (RapidOCR / PaddleOCR-classic) on each crop individually. Produces an `ExtractedDoc` following the same schema as parser-mupdf.
packages/pdfsys-parser-vlm/README.md ADDED
@@ -0,0 +1,5 @@
+ # pdfsys-parser-vlm
+
+ Vision-language model backend for scanned PDFs with complex content (tables, formulas). **Stub only — not yet implemented.**
+
+ Will handle regions flagged as TABLE or FORMULA by the layout analyser, sending them to a VLM (MinerU 2.5 / PaddleOCR-VL) that can produce structured output (HTML tables, LaTeX formulas). Simple text regions in the same document may still be handled by the pipeline backend.
packages/pdfsys-router/README.md ADDED
@@ -0,0 +1,44 @@
+ # pdfsys-router
+
+ Two-stage routing for the pdfsys extraction pipeline.
+
+ ## Stage A (implemented)
+
+ XGBoost binary classifier ported from [FinePDFs](https://github.com/huggingface/finepdfs). Given a PDF, it extracts 124 features using PyMuPDF (4 doc-level + 15 page-level × 8 sampled pages) and predicts `P(needs OCR)`.
+
+ - `ocr_prob < threshold` → **MUPDF** (text-ok, fast path)
+ - `ocr_prob >= threshold` → **PIPELINE** (needs OCR)
+
+ ### Usage
+
+ ```python
+ from pdfsys_router import Router
+
+ router = Router()  # loads xgb_classifier.ubj lazily
+ decision = router.classify("path/to/document.pdf")
+ print(decision.backend, decision.ocr_prob)
+ ```
+
+ ### Weights
+
+ The XGBoost model (`models/xgb_classifier.ubj`, 257 KB) is gitignored. Fetch it once:
+
+ ```bash
+ python -m pdfsys_router.download_weights
+ ```
+
+ ## Stage B (not yet implemented)
+
+ For PDFs routed to OCR, Stage B reads the cached `LayoutDocument` and decides:
+ - No complex content → `PIPELINE` (region-level OCR)
+ - Tables / formulas present → `VLM` (vision-language model)
+
36
+ ## Module layout
37
+
38
+ | File | Purpose |
39
+ |------|---------|
40
+ | `feature_extractor.py` | Port of FinePDFs' `PDFFeatureExtractor` β€” DO NOT modify without retraining |
41
+ | `xgb_model.py` | Lazy XGBoost model loader |
42
+ | `classifier.py` | `Router.classify()` β†’ `RouterDecision` public API |
43
+ | `download_weights.py` | Fetches weights from FinePDFs Git LFS |
44
+ | `decider.py` | Stage-B stub |