title: 'PDFSystem: PB-Scale PDF Processing Pipeline'
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: PDF to Markdown pipeline with ML-powered routing

PDFSystem for MNBVC

PB-scale PDF → Pretraining Data Pipeline
FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs

Hugging Face Spaces | GitHub | Python 3.11 | Gradio | License


🚀 Quick Links

| Platform | Link | Description |
|----------|------|-------------|
| Live Demo | 🤗 HF Spaces | Upload a PDF and try the pipeline instantly |
| Source Code | GitHub | Full source code and documentation |

✨ Features

  • 🧠 ML-Powered Routing: XGBoost classifier (124 features) routes PDFs to optimal backend
  • ⚑ Fast Path: PyMuPDF extraction for text-ok documents (~10ms/page)
  • πŸ“Š Quality Scoring: ModernBERT-large OCR quality assessment [0-3 scale]
  • πŸ” Visual Debug: Page preview with extracted bbox overlays
  • πŸ“¦ Modular Design: Stateless, backend-agnostic pipeline components

🎯 Current Status

Component Status Description
Stage-A Router βœ… Ready XGBoost binary classifier with 124 PyMuPDF features
MuPDF Parser βœ… Ready Fast extraction for clean-text PDFs
OCR Quality Scorer βœ… Ready ModernBERT-large regression model
Stage-B Router 🚧 Planned Layout-based complexity routing
Pipeline Parser 🚧 Planned Region-level OCR for simple layouts
VLM Parser 🚧 Planned Vision-Language model for complex layouts

🏃 Quick Start

Option 1: Online Demo (Fastest)

Visit Hugging Face Spaces and upload a PDF; no installation is required.

Option 2: Local Development

# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
cd pdfsystem_mnbvc
uv sync

# 3. Download router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run interactive demo
python app.py
# Open http://localhost:7860

Option 3: Batch Processing

python -m pdfsys_bench \
  --pdf-dir /path/to/pdfs \
  --out results.jsonl \
  --markdown-dir ./extracted

🏗️ Architecture

                    ┌─────────────────┐
   PDF Input  ───►  │  Stage-A Router │  XGBoost (124 features)
                    │  (Implemented)  │  ~10ms per PDF
                    └────────┬────────┘
                             │ ocr_prob
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
      ┌─────────┐      ┌──────────┐      ┌─────────┐
      │  MUPDF  │      │ PIPELINE │      │   VLM   │
      │  (Fast) │      │  (OCR)   │      │(Complex)│
      └────┬────┘      └──────────┘      └─────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  ExtractedDoc: Markdown + Segments  │
   └─────────────────────────────────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  Quality Scorer (ModernBERT-large)  │
   │  Score: [0, 3]                      │
   └─────────────────────────────────────┘
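
The Stage-A routing step in the diagram can be read as a threshold on `ocr_prob`. The following is a minimal sketch of that idea, not the project's implementation: the `choose_backend` helper and the 0.5 threshold are assumptions made for illustration. Because Stage-A is a binary classifier, the sketch only distinguishes the fast path from the OCR path; the choice between PIPELINE and VLM belongs to the planned Stage-B router.

```python
def choose_backend(ocr_prob: float, threshold: float = 0.5) -> str:
    """Map P(needs OCR) to a backend label.

    Hypothetical helper: the threshold value is an assumption,
    not a number taken from the project.
    """
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError(f"ocr_prob must be in [0, 1], got {ocr_prob}")
    # Below the threshold the embedded text layer is trusted (fast path);
    # otherwise the document is sent to the OCR path.
    return "mupdf" if ocr_prob < threshold else "pipeline"


print(choose_backend(0.034))  # mupdf
print(choose_backend(0.634))  # pipeline
```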

📦 Workspace Packages

| Package | Purpose | Dependencies |
|---------|---------|--------------|
| pdfsys-core | Shared types, schemas, layout cache | stdlib only |
| pdfsys-router | Stage-A/Stage-B routing decisions | pymupdf, xgboost, pandas, sklearn |
| pdfsys-parser-mupdf | Fast PyMuPDF extraction | pymupdf |
| pdfsys-bench | Evaluation harness + quality scorer | torch, transformers |
| pdfsys-layout-analyser | Layout model runner | 🚧 Planned |
| pdfsys-parser-pipeline | OCR backend | 🚧 Planned |
| pdfsys-parser-vlm | VLM backend | 🚧 Planned |

📊 Benchmark Results

OmniDocBench-100 Dataset:

Backend Split:    mupdf=70    pipeline=30
Avg OCR Prob:     mupdf=0.034  pipeline=0.634
Extraction:       70 success   0 errors
Quality Score:    avg=1.71     min=0.39   max=2.73
Timing:           router=49ms  extract=7ms  quality=3.6s
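
A summary like the backend split and average OCR probability above could be derived from a JSONL results file along these lines. This is only a sketch: the `backend` and `ocr_prob` field names are assumptions about the schema of `results.jsonl`, the `summarize` helper is hypothetical, and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate per-backend counts and mean ocr_prob from JSONL lines.

    Assumes each record carries "backend" and "ocr_prob" keys
    (hypothetical schema, not confirmed by the project).
    """
    counts = defaultdict(int)
    prob_sums = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        backend = record["backend"]
        counts[backend] += 1
        prob_sums[backend] += record["ocr_prob"]
    return {
        backend: {"count": n, "avg_ocr_prob": prob_sums[backend] / n}
        for backend, n in counts.items()
    }


# Made-up records standing in for the contents of results.jsonl.
sample = [
    '{"backend": "mupdf", "ocr_prob": 0.02}',
    '{"backend": "mupdf", "ocr_prob": 0.04}',
    '{"backend": "pipeline", "ocr_prob": 0.60}',
]
summary = summarize(sample)
print(summary["mupdf"]["count"], summary["pipeline"]["count"])  # 2 1
```

With a real file, the same function accepts an open file handle, for example `with open("results.jsonl") as f: summarize(f)`.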

🎨 Demo Interface

The Gradio demo provides:

  • πŸ“€ PDF Upload: Drag-and-drop or click to upload
  • πŸ“ˆ Routing Info: OCR probability, selected backend, page count
  • πŸ–ΌοΈ Page Preview: First page with colored bbox overlays
  • πŸ“ Markdown Output: Extracted text content
  • πŸ“‹ Segment Table: Block-level extraction details
  • πŸ”§ Feature View: Selected router features
  • πŸ“„ Raw JSON: Complete pipeline output
  • ⭐ Quality Score: Optional ModernBERT scoring
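
Drawing bbox overlays on a rendered page requires converting PDF coordinates, which are measured in points (1/72 inch), into image pixels. The sketch below shows that conversion under stated assumptions: bboxes are `(x0, y0, x1, y1)` tuples in points with a top-left origin (the convention PyMuPDF uses), and the page image is rendered at a chosen DPI. `bbox_to_pixels` is a hypothetical helper, not a function from the demo code.

```python
def bbox_to_pixels(bbox, dpi=144):
    """Scale a bbox from PDF points to pixel coordinates at the given DPI."""
    scale = dpi / 72.0  # 72 points per inch
    x0, y0, x1, y1 = bbox
    return tuple(round(v * scale) for v in (x0, y0, x1, y1))


# At 144 DPI every point maps to two pixels.
print(bbox_to_pixels((72.0, 36.0, 300.0, 90.5)))  # (144, 72, 600, 181)
```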

📚 Documentation

| Document | Description |
|----------|-------------|
| docs/PRD.md | Product Requirements & Architecture Rationale |
| docs/ROADMAP.md | Implementation Timeline & Milestones |
| CONTRIBUTING.md | Development Guidelines & Commit Conventions |
| demo/README.md | Demo-specific Documentation |

💻 Development

Data Structures

Router Output:

@dataclass
class RouterDecision:
    backend: Backend          # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float           # P(needs OCR) [0, 1]
    num_pages: int
    is_form: bool
    features: dict            # 124-dim feature vector
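
To show how a consumer might use this structure, here is a self-contained sketch. The `Backend` enum values and the sample field values are assumptions; only the field names come from the definition above, and the real `features` dict holds 124 entries rather than the single placeholder used here.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    # Member names follow the comment in RouterDecision; values are assumed.
    MUPDF = "mupdf"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


@dataclass
class RouterDecision:
    backend: Backend
    ocr_prob: float
    num_pages: int
    is_form: bool
    features: dict


decision = RouterDecision(
    backend=Backend.MUPDF,
    ocr_prob=0.03,
    num_pages=12,
    is_form=False,
    features={"num_pages": 12},  # placeholder key, not a real feature name
)

# Branch on the routing result, e.g. to pick a parser.
if decision.backend is Backend.MUPDF:
    message = f"fast path, {decision.num_pages} pages, ocr_prob={decision.ocr_prob:.2f}"
else:
    message = "needs OCR"
print(message)  # fast path, 12 pages, ocr_prob=0.03
```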

Parser Output:

@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict
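
The `frozen=True` flag makes an `ExtractedDoc` immutable after construction. The sketch below illustrates this with simplified stand-ins: the `Segment` fields are hypothetical (its definition is not shown here), `backend` is reduced to a plain string, and hashing the raw PDF bytes to fill `sha256` is an assumption about how that field is populated.

```python
import hashlib
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical fields for illustration only.
    page: int
    bbox: tuple[float, float, float, float]
    text: str


@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: str  # simplified; the original uses the Backend type
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict


pdf_bytes = b"%PDF-1.7 placeholder bytes"  # stand-in for a real file's contents
segments = (Segment(page=0, bbox=(72.0, 72.0, 300.0, 90.0), text="Hello"),)
doc = ExtractedDoc(
    sha256=hashlib.sha256(pdf_bytes).hexdigest(),
    backend="mupdf",
    segments=segments,
    markdown="\n\n".join(s.text for s in segments),
    stats={"num_segments": len(segments)},
)

# Attribute assignment on a frozen dataclass raises FrozenInstanceError.
try:
    doc.markdown = "changed"
    frozen = False
except FrozenInstanceError:
    frozen = True
print(frozen)  # True
```

Using a tuple for `segments` keeps the container itself immutable as well, which fits the stateless design described in the feature list.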

CLI Reference

# Download router weights
python -m pdfsys_router.download_weights

# Run benchmark
python -m pdfsys_bench \
  --pdf-dir PATH \
  --out results.jsonl \
  --no-quality          # Skip quality scoring

🀝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

This project is licensed under the Apache License 2.0.


Built with ❤️ for the MNBVC corpus project