Spaces:

roger1024
/

DocPipe

Running

App Files Files Community

DocPipe / README.md

jieluo1024

fix: add back HF Spaces YAML frontmatter

d80f375 4 days ago

preview code

raw

history blame contribute delete

7.6 kB

A newer version of the Gradio SDK is available: 6.13.0

Upgrade

metadata

title: 'PDFSystem: PB-Scale PDF Processing Pipeline'
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: PDF to Markdown pipeline with ML-powered routing

PDFSystem for MNBVC

PB-scale PDF → Pretraining Data Pipeline
FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs

🚀 Quick Links

Platform	Link	Description
Live Demo	🤗 HF Spaces	Upload PDF and try the pipeline instantly
Source Code	GitHub	Full source code and documentation

✨ Features

🧠 ML-Powered Routing: XGBoost classifier (124 features) routes PDFs to optimal backend
⚡ Fast Path: PyMuPDF extraction for text-ok documents (~10ms/page)
📊 Quality Scoring: ModernBERT-large OCR quality assessment [0-3 scale]
🔍 Visual Debug: Page preview with extracted bbox overlays
📦 Modular Design: Stateless, backend-agnostic pipeline components

🎯 Current Status

Component	Status	Description
Stage-A Router	✅ Ready	XGBoost binary classifier with 124 PyMuPDF features
MuPDF Parser	✅ Ready	Fast extraction for clean-text PDFs
OCR Quality Scorer	✅ Ready	ModernBERT-large regression model
Stage-B Router	🚧 Planned	Layout-based complexity routing
Pipeline Parser	🚧 Planned	Region-level OCR for simple layouts
VLM Parser	🚧 Planned	Vision-Language model for complex layouts

🏃 Quick Start

Option 1: Online Demo (Fastest)

Visit Hugging Face Spaces and upload a PDF — no installation required.

Option 2: Local Development

# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
cd pdfsystem_mnbvc
uv sync

# 3. Download router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run interactive demo
python app.py
# Open http://localhost:7860

Option 3: Batch Processing

python -m pdfsys_bench \
  --pdf-dir /path/to/pdfs \
  --out results.jsonl \
  --markdown-dir ./extracted

🏗️ Architecture

                    ┌─────────────────┐
   PDF Input  ───►  │  Stage-A Router │  XGBoost (124 features)
                    │  (Implemented)  │  ~10ms per PDF
                    └────────┬────────┘
                             │ ocr_prob
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
      ┌─────────┐      ┌──────────┐      ┌─────────┐
      │  MUPDF  │      │ PIPELINE │      │   VLM   │
      │  (Fast) │      │  (OCR)   │      │(Complex)│
      └────┬────┘      └──────────┘      └─────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  ExtractedDoc: Markdown + Segments  │
   └─────────────────────────────────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  Quality Scorer (ModernBERT-large)  │
   │  Score: [0, 3]                      │
   └─────────────────────────────────────┘

📦 Workspace Packages

Package	Purpose	Dependencies
`pdfsys-core`	Shared types, schemas, layout cache	stdlib only
`pdfsys-router`	Stage-A/Stage-B routing decisions	pymupdf, xgboost, pandas, sklearn
`pdfsys-parser-mupdf`	Fast PyMuPDF extraction	pymupdf
`pdfsys-bench`	Evaluation harness + quality scorer	torch, transformers
`pdfsys-layout-analyser`	Layout model runner	🚧 Planned
`pdfsys-parser-pipeline`	OCR backend	🚧 Planned
`pdfsys-parser-vlm`	VLM backend	🚧 Planned

📊 Benchmark Results

OmniDocBench-100 Dataset:

Backend Split:    mupdf=70    pipeline=30
Avg OCR Prob:     mupdf=0.034  pipeline=0.634
Extraction:       70 success   0 errors
Quality Score:    avg=1.71     min=0.39   max=2.73
Timing:           router=49ms  extract=7ms  quality=3.6s

🎨 Demo Interface

The Gradio demo provides:

📤 PDF Upload: Drag-and-drop or click to upload
📈 Routing Info: OCR probability, selected backend, page count
🖼️ Page Preview: First page with colored bbox overlays
📝 Markdown Output: Extracted text content
📋 Segment Table: Block-level extraction details
🔧 Feature View: Selected router features
📄 Raw JSON: Complete pipeline output
⭐ Quality Score: Optional ModernBERT scoring

📚 Documentation

Document	Description
`docs/PRD.md`	Product Requirements & Architecture Rationale
`docs/ROADMAP.md`	Implementation Timeline & Milestones
`CONTRIBUTING.md`	Development Guidelines & Commit Conventions
`demo/README.md`	Demo-specific Documentation

💻 Development

Data Structures

Router Output:

@dataclass
class RouterDecision:
    backend: Backend          # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float           # P(needs OCR) [0, 1]
    num_pages: int
    is_form: bool
    features: dict            # 124-dim feature vector

Parser Output:

@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict

CLI Reference

# Download router weights
python -m pdfsys_router.download_weights

# Run benchmark
python -m pdfsys_bench \
  --pdf-dir PATH \
  --out results.jsonl \
  --no-quality          # Skip quality scoring

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the Apache License 2.0.

Built with ❤️ for the MNBVC corpus project