yin (Claude Opus 4.6) committed
Commit · b8ca6f2
Parent(s): d423504
docs: add project README, CONTRIBUTING guide, and per-package READMEs
Rewrite the top-level README with:
- Quick start (uv sync + download weights + run bench CLI)
- Architecture diagram with implemented/stub status table
- MVP benchmark results on OmniDocBench-100
- Key data structures (RouterDecision, ExtractedDoc, QualityScore)
- Design principles, CLI reference, output format spec
Add CONTRIBUTING.md covering:
- Dev environment setup
- Project structure overview
- Code conventions (naming, immutability, error handling)
- Feature extractor parity rules (124-column contract)
- How to add a new parser backend
- Commit message conventions
Add per-package READMEs for all 7 workspace packages explaining
each one's role, usage, and scope.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CONTRIBUTING.md +126 -0
- README.md +188 -22
- packages/pdfsys-core/README.md +17 -0
- packages/pdfsys-layout-analyser/README.md +11 -0
- packages/pdfsys-parser-mupdf/README.md +30 -0
- packages/pdfsys-parser-pipeline/README.md +5 -0
- packages/pdfsys-parser-vlm/README.md +5 -0
- packages/pdfsys-router/README.md +44 -0
CONTRIBUTING.md (ADDED)
# Contributing to pdfsys-mnbvc

## Dev environment setup

```bash
# Prerequisites: Python >= 3.11, uv >= 0.4
uv sync                                   # installs all workspace packages in editable mode
python -m pdfsys_router.download_weights  # one-time: fetch XGBoost weights (257 KB)
```

If you'll be working on quality scoring, torch + transformers are pulled in by `pdfsys-bench`. The ModernBERT-large model (~800 MB) downloads on first scorer use. Set `HF_HOME` to control the cache location.

## Project structure

```
pdfsystem_mnbvc/
├── pyproject.toml                 # uv workspace root (meta-package)
├── packages/
│   ├── pdfsys-core/               # shared types, enums, layout cache, serde
│   ├── pdfsys-router/             # Stage-A XGBoost classifier
│   │   ├── models/                # gitignored xgb_classifier.ubj lives here
│   │   └── src/pdfsys_router/
│   │       ├── feature_extractor.py   # 124-feature PyMuPDF extractor
│   │       ├── xgb_model.py           # lazy model loader
│   │       ├── classifier.py          # Router.classify() -> RouterDecision
│   │       └── download_weights.py    # fetch weights from HF LFS
│   ├── pdfsys-parser-mupdf/       # text-ok fast path (PyMuPDF blocks -> Markdown)
│   ├── pdfsys-parser-pipeline/    # OCR backend (stub)
│   ├── pdfsys-parser-vlm/         # VLM backend (stub)
│   ├── pdfsys-layout-analyser/    # layout model runner (stub)
│   └── pdfsys-bench/              # evaluation harness + quality scorer
│       ├── omnidocbench_100/      # gitignored bench dataset
│       └── src/pdfsys_bench/
│           ├── quality.py         # ModernBERT-large OCR quality scorer
│           ├── loop.py            # router -> parser -> scorer -> JSONL runner
│           └── __main__.py        # CLI entry point
└── out/                           # gitignored run outputs
```

## Code conventions

### Naming

- Package dirs: `pdfsys-<name>` (kebab-case in pyproject.toml and directory names).
- Import names: `pdfsys_<name>` (snake_case, matching `src/pdfsys_<name>/`).
- All packages live under `packages/` and use the `[tool.uv.workspace]` editable pattern.

### Types and immutability

- Core data structures are `@dataclass(frozen=True, slots=True)`.
- Enums live in `pdfsys_core.types`.
- BBox coordinates are always normalized to `[0, 1]`; convert to pixels/points at the call site.
- Parser backends all emit `ExtractedDoc` with a `tuple[Segment, ...]`; the schema is backend-agnostic.

### Error handling

- `Router.classify()` never raises. Errors are surfaced via `RouterDecision.error`.
- Parser `extract_doc()` may raise; the bench loop catches and records errors in JSONL.
- Prefer explicit `except Exception` with a recorded message over silent swallowing.
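A minimal sketch of the record-rather-than-raise pattern the bench loop uses; the function and record field names are illustrative, not the actual `loop.py` API:

```python
def safe_extract(extract_fn, pdf_path: str) -> dict:
    """Run an extractor, recording failure in the result instead of propagating it."""
    record = {"pdf_path": pdf_path, "extract_error": None}
    try:
        doc = extract_fn(pdf_path)
        record["markdown_chars"] = len(doc.markdown)
    except Exception as exc:  # explicit catch with a recorded message, never silent
        record["extract_error"] = f"{type(exc).__name__}: {exc}"
    return record

def broken_extractor(path):
    # Stand-in for a parser hitting a corrupt PDF.
    raise ValueError("bad xref table")

rec = safe_extract(broken_extractor, "x.pdf")
print(rec["extract_error"])  # ValueError: bad xref table
```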

### Feature extractor parity

The `feature_extractor.py` in `pdfsys-router` is a direct port of FinePDFs'
`blocks/predictor/ocr_predictor.py`. The 124-column feature vector MUST match
the upstream layout exactly: the XGBoost weights depend on column order. If you
change any feature extraction logic, verify against the FinePDFs reference output
before merging.

The feature ordering is:

1. `num_pages_successfully_sampled` (doc-level)
2. `garbled_text_ratio` (doc-level)
3. `is_form` (doc-level)
4. `creator_or_producer_is_known_scanner` (doc-level)
5. `page_level_unique_font_counts_page1` through `_page8`
6. ... (15 page-level features × 8 pages = 120 columns)

Total: 4 + 120 = 124 features.
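One way to sanity-check the column contract is to generate the expected ordering programmatically. The doc-level names below come from the list above; everything in `PAGE_FEATURES` past `unique_font_counts` is a placeholder for the real 15-name list in `feature_extractor.py`:

```python
# Doc-level columns, in the order listed above.
DOC_FEATURES = [
    "num_pages_successfully_sampled",
    "garbled_text_ratio",
    "is_form",
    "creator_or_producer_is_known_scanner",
]

# Placeholder names: the real extractor defines 15 page-level features.
PAGE_FEATURES = ["unique_font_counts"] + [f"placeholder_feature_{i}" for i in range(14)]
NUM_SAMPLED_PAGES = 8

# Page-level columns are grouped per feature, then per page (_page1 .. _page8).
columns = list(DOC_FEATURES)
for feat in PAGE_FEATURES:
    for page in range(1, NUM_SAMPLED_PAGES + 1):
        columns.append(f"page_level_{feat}_page{page}")

print(len(columns))  # 124
```

An assertion like `len(columns) == 124` against the extractor's actual column list is a cheap parity test to run before merging.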

### Dependencies

- `pdfsys-core` has **zero** external dependencies. Keep it that way.
- Heavy deps (torch, transformers) are lazy-imported so that `import pdfsys_bench` doesn't pull them in at module scope.
- XGBoost model weights are NOT committed to the repo. They're downloaded on demand via `download_weights.py`.

## Running the MVP

```bash
# Full run on OmniDocBench-100 (takes ~4 min on CPU)
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl \
    --markdown-dir out/bench_omnidoc100_md

# Fast smoke test (no quality scoring)
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/smoke.jsonl \
    --limit 5 --no-quality
```

Output: one JSONL file (per-doc results) + one `.summary.json` (aggregate stats).

## Adding a new parser backend

1. Implement the backend in its package under `packages/pdfsys-parser-<name>/`.
2. The entry point should accept a `Path` and return `ExtractedDoc` (from `pdfsys-core`).
3. Each `Segment` must have `page_index`, `type` (RegionType), `content`, and ideally a normalized `BBox`.
4. Call `merge_segments_to_markdown(segments)` from `pdfsys-core` to produce the `markdown` field.
5. Wire it into `loop.py` by handling the corresponding `Backend` enum value.
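The steps above can be sketched as a minimal backend skeleton. The types here are simplified stand-ins for the pdfsys-core definitions, and the segment content is hard-coded where a real backend would parse the PDF:

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# Simplified stand-ins for the pdfsys-core types.
class RegionType(Enum):
    TEXT = "text"

@dataclass(frozen=True)
class Segment:
    page_index: int
    type: RegionType
    content: str

@dataclass(frozen=True)
class ExtractedDoc:
    segments: tuple[Segment, ...]
    markdown: str

def merge_segments_to_markdown(segments: tuple[Segment, ...]) -> str:
    # Stand-in for the pdfsys-core helper; a \n\n join is assumed here.
    return "\n\n".join(s.content for s in segments)

def extract_doc(path: Path) -> ExtractedDoc:
    """Entry point shape from step 2: Path in, ExtractedDoc out."""
    # A real backend would parse the file; one hard-coded segment stands in.
    segments = (Segment(0, RegionType.TEXT, f"Parsed text from {path.name}"),)
    return ExtractedDoc(segments=segments, markdown=merge_segments_to_markdown(segments))

doc = extract_doc(Path("sample.pdf"))
print(doc.markdown)  # Parsed text from sample.pdf
```

With the entry point in this shape, wiring into `loop.py` is just a new `Backend` enum branch.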

## Adding new features to the router

**Do not** modify `feature_extractor.py` unless you're also retraining the XGBoost model. The weights and feature layout are coupled. If you need additional routing signals, add them as post-classification heuristics in `classifier.py` rather than changing the feature vector.

## Commit conventions

Commit messages follow conventional commits:

```
feat(router): add scanner metadata detection
fix(parser-mupdf): handle zero-width bbox on empty pages
docs: update quickstart for new deps
chore: bump pymupdf to 1.25
```

Scope is the package name without the `pdfsys-` prefix (e.g. `router`, `core`, `bench`, `parser-mupdf`).
README.md (CHANGED)
# pdfsys-mnbvc

PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.

## Current status: MVP closed loop ✅

The first end-to-end path (**Router → MuPDF parser → OCR quality scorer**) is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).

## Quick start

```bash
# 1. Install uv (>= 0.4)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repo and sync all workspace packages
git clone <this-repo-url>
cd pdfsystem_mnbvc
uv sync

# 3. Fetch the XGBoost router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run the MVP closed loop on the bench dataset
python -m pdfsys_bench \
    --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
    --out out/bench_omnidoc100.jsonl \
    --markdown-dir out/bench_omnidoc100_md
```

> **Note:** The first run downloads the ModernBERT-large quality scorer
> (~800 MB) from HuggingFace Hub. Set `HF_HOME` to control where it's
> cached. If you don't need quality scoring, add `--no-quality` to skip it.

> **Note:** The bench dataset (omnidocbench_100) is NOT committed to the repo.
> You need to obtain it separately and place it under
> `packages/pdfsys-bench/omnidocbench_100/`.

## Architecture

```
                ┌──────────────┐
     PDF ────►  │ pdfsys-router│   stage A: XGBoost (124 PyMuPDF features)
                └──────┬───────┘
                       │
         text-ok ◄─────┴─────► needs-ocr
             │                     │
      parser-mupdf          layout-analyser
                            + stage-B router
                             │           │
                   parser-pipeline   parser-vlm
```

### What's implemented

| Stage | Status | Description |
|-------|--------|-------------|
| **Stage-A router** | ✅ | XGBoost binary classifier, ported from FinePDFs. 124 features (4 doc-level + 15 page-level × 8 sampled pages). Routes to `MUPDF` (text-ok) or `PIPELINE` (needs-ocr). |
| **MuPDF parser** | ✅ | `page.get_text("blocks", sort=True)` → `ExtractedDoc` with normalized bbox and merged Markdown. Fast path for clean-text PDFs. |
| **OCR quality scorer** | ✅ | ModernBERT-large regression head (`HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`). Scores extracted text on a [0, 3] scale. |
| **Bench CLI** | ✅ | `python -m pdfsys_bench`: drives the full loop, emits per-doc JSONL + summary JSON. |
| Stage-B router | ❌ | Pending layout-analyser and LayoutCache integration. |
| Layout analyser | ❌ | PP-DocLayoutV3 / docling-layout-heron runner; not started. |
| Pipeline parser | ❌ | Region-level OCR (RapidOCR / PaddleOCR); not started. |
| VLM parser | ❌ | MinerU 2.5 / PaddleOCR-VL on complex regions; not started. |

### MVP benchmark results (OmniDocBench-100)

```
Backend split:  mupdf=70  pipeline=30
Avg ocr_prob:   mupdf=0.034  pipeline=0.634
Extracted: 70   Errors: 0
Quality: avg=1.71  min=0.39  max=2.73
Per-doc time: router=49ms  extract=7ms  quality=3.6s
```

## Workspace packages

| Package | Role | Dependencies |
|---------|------|--------------|
| `pdfsys-core` | Shared dataclasses, enums, layout cache, serde. No PDF/ML deps. | stdlib only |
| `pdfsys-router` | Stage-A XGBoost classifier + Stage-B layout decision (stub). | pymupdf, xgboost, pandas, numpy, scikit-learn |
| `pdfsys-layout-analyser` | Page layout model runner. Stub only. | (none) |
| `pdfsys-parser-mupdf` | Text-ok backend: PyMuPDF block extraction → Markdown. | pymupdf |
| `pdfsys-parser-pipeline` | OCR backend for simple layouts. Stub only. | (none) |
| `pdfsys-parser-vlm` | VLM backend for complex layouts. Stub only. | (none) |
| `pdfsys-bench` | Closed-loop evaluation harness + quality scorer. | torch, transformers, pdfsys-router, pdfsys-parser-mupdf |

### Package dependency graph

```
pdfsys-core ──► pdfsys-router
            ├─► pdfsys-parser-mupdf
            ├─► pdfsys-parser-pipeline (stub)
            ├─► pdfsys-parser-vlm (stub)
            └─► pdfsys-layout-analyser (stub)

pdfsys-router       ──► pdfsys-bench
pdfsys-parser-mupdf ──► pdfsys-bench
```

`pdfsys-core` is the root dependency: every other package imports it, and it has zero external deps beyond the Python stdlib.

## Key data structures

### Router output (`RouterDecision`)

```python
@dataclass
class RouterDecision:
    backend: Backend          # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float           # P(needs OCR) from XGBoost, [0, 1]
    num_pages: int
    is_form: bool
    garbled_text_ratio: float
    is_encrypted: bool
    needs_password: bool
    features: dict            # full 124-feature vector for debugging
    error: str | None
```

### Parser output (`ExtractedDoc`)

```python
@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]  # ordered block-level units
    markdown: str                  # segments merged with \n\n
    stats: dict
```

Each `Segment` carries `page_index`, `RegionType` (TEXT/IMAGE/TABLE/FORMULA), `content` (Markdown / HTML / LaTeX), and a normalized `BBox` in [0, 1].

### Quality score

```python
@dataclass
class QualityScore:
    score: float      # [0, 3]: 0=garbage, 1=format issues, 2=minor, 3=clean
    num_chars: int
    num_tokens: int
    model: str
```

## Design principles

1. **Stateless processing.** No manifest, no central DB. Every PDF produces self-contained output, following FinePDFs' datatrove-style design.
2. **Content-addressable caching.** LayoutCache shards by `sha256 + model_tag`. Bumping the model tag lazily invalidates old entries.
3. **Atomic writes.** All file outputs use `tmp + os.replace()` for crash safety.
4. **Normalized coordinates.** BBox is always `[0, 1]` with origin top-left; backends convert to pixels/points on demand.
5. **Backend-agnostic output.** All three parser backends emit the same `ExtractedDoc` / `Segment` schema, so downstream stages don't need to know which backend produced a document.
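Principle 3 (atomic writes) can be sketched as follows; the helper name is illustrative, not an actual pdfsys API:

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then os.replace() into place.

    os.replace() is an atomic rename on POSIX (same-volume), so readers never
    observe a half-written file even if the process dies mid-write.
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)  # atomic swap into the final name
    except BaseException:
        os.unlink(tmp)  # remove the orphaned temp file on failure
        raise

out = Path("demo.jsonl")
atomic_write_text(out, '{"ok": true}\n')
print(out.read_text())
```

Writing the temp file in the same directory as the target matters: `os.replace()` is only atomic within one filesystem.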

## CLI reference

### `python -m pdfsys_bench`

```
usage: pdfsys-bench [-h] --pdf-dir PDF_DIR --out OUT [--limit N]
                    [--no-quality] [--quality-model MODEL]
                    [--router-weights PATH] [--markdown-dir DIR]
                    [--ocr-threshold FLOAT]

Run the MVP pdfsys closed loop.

options:
  --pdf-dir PATH       Directory of PDFs to process (recursive).
  --out PATH           Output JSONL path (one line per PDF).
  --limit N            Cap the number of PDFs processed.
  --no-quality         Skip the ModernBERT quality scorer.
  --quality-model ID   HuggingFace model for quality scoring.
  --router-weights P   Path to xgb_classifier.ubj.
  --markdown-dir DIR   Dump per-PDF extracted markdown here.
  --ocr-threshold F    P(ocr) threshold (default: 0.5).
```

### `python -m pdfsys_router.download_weights`

Downloads the XGBoost router weights (~257 KB) from the FinePDFs Git LFS.

```bash
python -m pdfsys_router.download_weights          # first time
python -m pdfsys_router.download_weights --force  # re-download
```

## Output format

The JSONL output (`--out`) has one JSON object per PDF:

```json
{
  "pdf_path": "packages/pdfsys-bench/omnidocbench_100/pdfs/example.pdf",
  "sha256": "a53b50cb0d3d...",
  "backend": "mupdf",
  "ocr_prob": 0.003,
  "num_pages": 1,
  "is_form": false,
  "garbled_text_ratio": 0.0,
  "router_error": null,
  "extract_stats": {"page_count": 1, "pages_extracted": 1, "segment_count": 5, "char_count": 5734},
  "extract_error": null,
  "quality_score": 2.45,
  "quality_num_chars": 5734,
  "quality_num_tokens": 512,
  "quality_model": "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
  "markdown_chars": 5734,
  "wall_ms_router": 42.1,
  "wall_ms_extract": 6.3,
  "wall_ms_quality": 3421.0
}
```

A companion `.summary.json` file is also written with aggregate statistics.
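A minimal sketch for post-processing a run, assuming only the JSONL field names shown above (the `summarize` helper is illustrative, not part of pdfsys-bench):

```python
import json
from collections import Counter
from pathlib import Path

def summarize(jsonl_path: Path) -> dict:
    """Recompute a few aggregate stats from the per-doc JSONL records."""
    records = [json.loads(line) for line in jsonl_path.read_text().splitlines() if line]
    scores = [r["quality_score"] for r in records if r.get("quality_score") is not None]
    return {
        "backend_split": dict(Counter(r["backend"] for r in records)),
        "errors": sum(1 for r in records if r["extract_error"] or r["router_error"]),
        "avg_quality": sum(scores) / len(scores) if scores else None,
    }

# Two toy records in the documented shape:
demo = Path("demo_bench.jsonl")
demo.write_text(
    '{"backend": "mupdf", "quality_score": 2.0, "extract_error": null, "router_error": null}\n'
    '{"backend": "pipeline", "quality_score": null, "extract_error": null, "router_error": null}\n'
)
print(summarize(demo))
```

Records with `quality_score: null` (e.g. PDFs routed to the unimplemented `PIPELINE` backend) are excluded from the quality average.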

## Docs
packages/pdfsys-core/README.md (ADDED)
# pdfsys-core

Shared data contracts for the pdfsys pipeline. Every other package depends on this one.

## What's in here

- **Enums**: `RegionType` (TEXT / IMAGE / TABLE / FORMULA), `Backend` (MUPDF / PIPELINE / VLM / DEFERRED).
- **PdfRecord**: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance).
- **Layout schema**: `BBox` (normalized [0,1]), `LayoutRegion`, `LayoutPage`, `LayoutDocument`: the contract between layout-analyser and every parser backend.
- **ExtractedDoc / Segment**: Backend-agnostic output schema. All three parser backends emit these.
- **LayoutCache**: Content-addressable on-disk cache for LayoutDocuments, keyed by `sha256 + model_tag`.
- **PdfsysConfig**: Hierarchical configuration (paths, router, layout, per-backend settings, runtime).
- **Serde**: Generic `to_dict()` / `from_dict()` for all the above dataclasses.

## Key design rule

This package has **zero external dependencies**: stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries.
|
packages/pdfsys-layout-analyser/README.md
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# pdfsys-layout-analyser

Page layout model runner. **Stub only; not yet implemented.**

Will run a layout detection model (PP-DocLayoutV3 / docling-layout-heron) on each page and write a `LayoutDocument` to the `LayoutCache`. This layout is consumed by:

1. **pdfsys-router Stage B**: checks `has_complex_content` to decide pipeline vs VLM.
2. **pdfsys-parser-pipeline**: uses region bboxes to crop and OCR individual regions.
3. **pdfsys-parser-vlm**: sends complex regions to a vision-language model.

Layout inference runs at most once per PDF (keyed by sha256 + model_tag in the cache).
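The cache-key scheme can be sketched as follows; the shard layout and function name are assumptions for illustration, not the actual LayoutCache implementation:

```python
from pathlib import Path

def layout_cache_path(root: Path, sha256: str, model_tag: str) -> Path:
    """Content-addressable path keyed on (sha256, model_tag).

    Bumping model_tag changes every path, lazily invalidating old entries
    without deleting anything. The two-char shard prefix is an assumed
    layout to keep directory fan-out small.
    """
    return root / model_tag / sha256[:2] / f"{sha256}.json"

p = layout_cache_path(Path("cache"), "a53b50cb0d3d", "pp-doclayout-v3")
print(p.as_posix())  # cache/pp-doclayout-v3/a5/a53b50cb0d3d.json
```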
packages/pdfsys-parser-mupdf/README.md (ADDED)
# pdfsys-parser-mupdf

Text-ok extraction backend. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. `ocr_prob < threshold`).

## What it does

1. Opens the PDF with PyMuPDF.
2. Iterates every page, calling `page.get_text("blocks", sort=True)`.
3. Filters to text blocks (drops image blocks).
4. Normalizes each block's bbox to [0, 1] coordinates.
5. Produces one `Segment` per block, joined into an `ExtractedDoc` with merged Markdown.

## Usage

```python
from pdfsys_parser_mupdf import extract_doc

doc = extract_doc("path/to/clean.pdf")
print(doc.markdown[:500])
print(f"{doc.segment_count} segments, {doc.char_count} chars")
```

## Scope

This backend intentionally does NOT:

- Run OCR (that's what parser-pipeline and parser-vlm are for)
- Use a layout model (not needed for text-ok PDFs)
- Extract images or tables (image-heavy PDFs should be routed elsewhere)

It is the simplest possible extraction: unwrap PyMuPDF blocks into structured output.
packages/pdfsys-parser-pipeline/README.md (ADDED)
# pdfsys-parser-pipeline

Region-level OCR backend for scanned PDFs with simple layouts. **Stub only; not yet implemented.**

Will take a `LayoutDocument` from the cache, crop each region at the configured DPI, and run OCR (RapidOCR / PaddleOCR-classic) on each crop individually. Produces an `ExtractedDoc` following the same schema as parser-mupdf.
packages/pdfsys-parser-vlm/README.md (ADDED)
# pdfsys-parser-vlm

Vision-language model backend for scanned PDFs with complex content (tables, formulas). **Stub only; not yet implemented.**

Will handle regions flagged as TABLE or FORMULA by the layout analyser, sending them to a VLM (MinerU 2.5 / PaddleOCR-VL) that can produce structured output (HTML tables, LaTeX formulas). Simple text regions in the same document may still be handled by the pipeline backend.
packages/pdfsys-router/README.md (ADDED)
# pdfsys-router

Two-stage routing for the pdfsys extraction pipeline.

## Stage A (implemented)

XGBoost binary classifier ported from [FinePDFs](https://github.com/huggingface/finepdfs). Given a PDF, it extracts 124 features using PyMuPDF (4 doc-level + 15 page-level × 8 sampled pages) and predicts `P(needs OCR)`.

- `ocr_prob < threshold` → **MUPDF** (text-ok, fast path)
- `ocr_prob >= threshold` → **PIPELINE** (needs OCR)

### Usage

```python
from pdfsys_router import Router

router = Router()  # loads xgb_classifier.ubj lazily
decision = router.classify("path/to/document.pdf")
print(decision.backend, decision.ocr_prob)
```

### Weights

The XGBoost model (`models/xgb_classifier.ubj`, 257 KB) is gitignored. Fetch it once:

```bash
python -m pdfsys_router.download_weights
```

## Stage B (not yet implemented)

For PDFs routed to OCR, Stage B reads the cached `LayoutDocument` and decides:

- No complex content → `PIPELINE` (region-level OCR)
- Tables / formulas present → `VLM` (vision-language model)
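The Stage-B rule amounts to a one-line decision over the cached layout. The types and the `decide_stage_b` name here are illustrative stand-ins for the `decider.py` stub:

```python
from dataclasses import dataclass
from enum import Enum

class Backend(Enum):
    PIPELINE = "pipeline"
    VLM = "vlm"

@dataclass(frozen=True)
class LayoutDocument:
    # Illustrative stand-in; the real type lives in pdfsys-core.
    has_complex_content: bool  # any TABLE / FORMULA region detected

def decide_stage_b(layout: LayoutDocument) -> Backend:
    """Tables/formulas present -> VLM, otherwise region-level OCR."""
    return Backend.VLM if layout.has_complex_content else Backend.PIPELINE

print(decide_stage_b(LayoutDocument(has_complex_content=True)).value)   # vlm
print(decide_stage_b(LayoutDocument(has_complex_content=False)).value)  # pipeline
```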

## Module layout

| File | Purpose |
|------|---------|
| `feature_extractor.py` | Port of FinePDFs' `PDFFeatureExtractor`. DO NOT modify without retraining. |
| `xgb_model.py` | Lazy XGBoost model loader |
| `classifier.py` | `Router.classify()` → `RouterDecision` public API |
| `download_weights.py` | Fetches weights from FinePDFs Git LFS |
| `decider.py` | Stage-B stub |