---
base_model:
- unsloth/qwen2-vl-2b-instruct-unsloth-bnb-4bit
- NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/qwen2-vl-2b-instruct-unsloth-bnb-4bit
- lora
- transformers
- unsloth
license: apache-2.0
datasets:
- ahmedheakl/arocrbench_synthesizear
- ahmedheakl/arocrbench_patsocr
- ahmedheakl/arocrbench_historyar
- ahmedheakl/arocrbench_historicalbooks
- ahmedheakl/arocrbench_khattparagraph
- ahmedheakl/arocrbench_adab
- ahmedheakl/arocrbench_muharaf
- ahmedheakl/arocrbench_onlinekhatt
- ahmedheakl/arocrbench_khatt
- ahmedheakl/arocrbench_isippt
- ahmedheakl/arocrbench_arabicocr
- ahmedheakl/arocrbench_hindawi
- ahmedheakl/arocrbench_evarest
metrics:
- wer
---
# Qari-OCR-Fine-Tuned-Kitab-Benchmark

## Model Description

This model is a LoRA fine-tune of [NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct](https://huggingface.co/NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct) for Arabic OCR, trained on a subset of the KITAB-Bench datasets.


### Model Details

- **Base Model:** NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct
- **Model Type:** Vision-Language Model with LoRA fine-tuning
- **Language:** Arabic (primary), with multilingual capabilities
- **License:** Apache 2.0
- **Fine-tuned for:** Arabic Optical Character Recognition (OCR)

### Training Configuration

- **Training Method:** LoRA (Low-Rank Adaptation)
- **LoRA Parameters:**
  - Rank (r): 16
  - Alpha: 32
  - Dropout: 0.05
  - Target Modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- **Training Epochs:** 5
- **Batch Size:** 4 per device
- **Learning Rate:** 2e-4
- **Optimizer:** AdamW 8-bit
- **Max Sequence Length:** 2048
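
As a rough illustration of what these settings imply, the sketch below records them in a plain dictionary and derives three quantities: the LoRA scaling factor (alpha / r), the number of trainable parameters LoRA adds to one linear layer, and the optimizer steps per epoch. The 1536×1536 layer shape, the single-device setup, and the absence of gradient accumulation are assumptions made for the example and are not stated in this card; the 3,572 training samples come from the Dataset section.

```python
import math

# Hyperparameters as listed in this card. The key names mirror the
# arguments of peft.LoraConfig, but this is only a plain dictionary.
config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "epochs": 5,
    "per_device_batch_size": 4,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
}


def lora_scaling(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, assuming one device and no gradient accumulation."""
    return math.ceil(num_samples / batch_size)


scaling = lora_scaling(config["r"], config["lora_alpha"])
# Hypothetical layer shape, chosen only to make the arithmetic concrete.
example_params = lora_params_per_linear(1536, 1536, config["r"])
steps = steps_per_epoch(3572, config["per_device_batch_size"])
total_steps = steps * config["epochs"]

print(scaling, example_params, steps, total_steps)
```

With alpha set to twice the rank, the low-rank update is scaled by 2. The step counts change if training used several devices or gradient accumulation.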

## Dataset

The model was trained on a curated subset of Arabic OCR datasets comprising **3,760 total samples** from **13 domain-specific datasets**:

### Training Data Composition
**Total Combined Dataset:** 3,760 samples
- **Training Set:** 3,572 samples (95% of total)
- **Held-out Test Set:** 188 samples (5% of total)

### Source Datasets Used:
- **ahmedheakl/arocrbench_synthesizear:** 500 samples
- **ahmedheakl/arocrbench_patsocr:** 500 samples  
- **ahmedheakl/arocrbench_historyar:** 200 samples
- **ahmedheakl/arocrbench_historicalbooks:** 10 samples
- **ahmedheakl/arocrbench_khattparagraph:** 200 samples
- **ahmedheakl/arocrbench_adab:** 200 samples
- **ahmedheakl/arocrbench_muharaf:** 200 samples
- **ahmedheakl/arocrbench_onlinekhatt:** 200 samples
- **ahmedheakl/arocrbench_khatt:** 200 samples
- **ahmedheakl/arocrbench_isippt:** 500 samples
- **ahmedheakl/arocrbench_arabicocr:** 50 samples
- **ahmedheakl/arocrbench_hindawi:** 200 samples
- **ahmedheakl/arocrbench_evarest:** 800 samples
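
The totals and the 95/5 split stated above can be checked from the per-dataset counts. The sketch below does that arithmetic and then splits sample indices with a seeded shuffle; the seed and the shuffle-based split are assumptions for illustration, since this card does not describe how the held-out set was drawn.

```python
import random

# Per-dataset sample counts as listed in this card.
samples_per_dataset = {
    "ahmedheakl/arocrbench_synthesizear": 500,
    "ahmedheakl/arocrbench_patsocr": 500,
    "ahmedheakl/arocrbench_historyar": 200,
    "ahmedheakl/arocrbench_historicalbooks": 10,
    "ahmedheakl/arocrbench_khattparagraph": 200,
    "ahmedheakl/arocrbench_adab": 200,
    "ahmedheakl/arocrbench_muharaf": 200,
    "ahmedheakl/arocrbench_onlinekhatt": 200,
    "ahmedheakl/arocrbench_khatt": 200,
    "ahmedheakl/arocrbench_isippt": 500,
    "ahmedheakl/arocrbench_arabicocr": 50,
    "ahmedheakl/arocrbench_hindawi": 200,
    "ahmedheakl/arocrbench_evarest": 800,
}

total = sum(samples_per_dataset.values())
test_size = total * 5 // 100   # 5% held out, integer arithmetic
train_size = total - test_size

# Hypothetical seeded split over sample indices (seed chosen arbitrarily).
rng = random.Random(42)
indices = list(range(total))
rng.shuffle(indices)
test_idx, train_idx = indices[:test_size], indices[test_size:]

print(len(samples_per_dataset), total, train_size, test_size)
```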

### Domain Coverage
- **Handwritten Text:** Historical manuscripts, personal notes, traditional calligraphy
- **Printed Text:** Books, newspapers, academic papers, legal documents
- **Scene Text:** Street signs, advertisements, natural environments
- **Structured Documents:** Tables, forms, layouts
- **Historical Documents:** Ancient texts, heritage manuscripts
- **Synthetic Data:** Generated text for augmentation

## Performance

### Evaluation Results on Held-Out Test Set

| Metric | Score |
|--------|-------|
| **Word Error Rate (WER)** | 0.4388 |
| **Character Error Rate (CER)** | 0.2231 |
| **BLEU Score** | 48.12 |
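
For readers unfamiliar with the first two metrics, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a minimal pure-Python version for a single reference/hypothesis pair; it is not the evaluation script used for the table above, whose implementation and text normalization are not described in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair: the hypothesis gets one of the four words wrong.
reference = "السلام عليكم ورحمة الله"
hypothesis = "السلام عليكم ورحمه الله"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

Lower is better for both metrics. A single wrong character makes the whole word count as an error, which is why WER is typically higher than CER.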

 
## Intended Use

### Primary Use Cases
- Arabic document digitization
- Historical manuscript transcription
- Multi-domain Arabic text recognition
- RAG (Retrieval-Augmented Generation) document processing pipelines
- Academic research in Arabic NLP and OCR

### Direct Use
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

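# Load model and processor. The repo is tagged as a PEFT (LoRA) adapter,
# so `peft` must be installed for transformers to attach it to the base model.
processor = AutoProcessor.from_pretrained("FatimahEmadEldin/Qari-OCR-Fine-Tuned-Kitab-Benchmark")
model = AutoModelForVision2Seq.from_pretrained("FatimahEmadEldin/Qari-OCR-Fine-Tuned-Kitab-Benchmark")

# Process image (convert to RGB so grayscale or RGBA scans are handled consistently)
image = Image.open("arabic_document.jpg").convert("RGB")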
prompt = "Below is the image of one page of a document. Please provide the plain text representation of this document as if you were reading it naturally, ensuring high accuracy."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text_prompt, images=image, return_tensors="pt").to(model.device)

# Generate, then drop the prompt tokens so only the new transcription is decoded
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Limitations and Considerations

### Known Limitations
- **Complex Fonts:** Performance may vary with highly stylized or decorative Arabic fonts
- **Numeral Recognition:** Some challenges with mixed Arabic-Indic numeral systems
- **Word Elongation:** Handling of kashida (Arabic text elongation) requires improvement
- **PDF-to-Markdown:** Limited accuracy (best models achieve ~65% on complex layouts)
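
One possible way to soften the numeral and kashida items is to normalize the model output before comparing or indexing it. The sketch below maps Arabic-Indic and Extended Arabic-Indic digits to ASCII digits and removes the tatweel character (U+0640). The function name is hypothetical and this post-processing step is a suggestion, not part of the released model or its evaluation.

```python
# Build one translation table covering both digit sets and the tatweel.
_TABLE = {}
for i in range(10):
    _TABLE[0x0660 + i] = str(i)  # Arabic-Indic digits (U+0660..U+0669)
    _TABLE[0x06F0 + i] = str(i)  # Extended Arabic-Indic digits (U+06F0..U+06F9)
_TABLE[0x0640] = None            # tatweel / kashida is deleted


def normalize_ocr_text(text: str) -> str:
    """Map Arabic-Indic digits to ASCII and drop kashida elongation."""
    return text.translate(_TABLE)


sample = "\u0643\u0640\u0640\u062a\u0627\u0628 \u0661\u0662\u0663"
print(normalize_ocr_text(sample))
```

Whether to normalize digits depends on the downstream use: for search and RAG indexing it usually helps, while a faithful transcription may need the original digit forms preserved.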

### Bias and Fairness
- Trained primarily on Modern Standard Arabic; dialectal variations may have reduced accuracy
- Historical document performance depends on manuscript quality and preservation state
- Geographic bias toward Gulf and Levantine Arabic text styles

## Technical Specifications

### Hardware Requirements
- **Minimum:** 8GB GPU memory for inference
- **Recommended:** 16GB+ GPU memory for optimal performance
- **Training:** Conducted on NVIDIA A100 GPUs

### Software Dependencies
- transformers >= 4.51.3
- torch >= 2.4.0
- peft (to load the LoRA adapter)
- unsloth (for efficient training)
- Pillow for image processing

## Training Details

### Training Infrastructure
- **Framework:** Unsloth for efficient LoRA training
- **Quantization:** 4-bit quantization for memory efficiency
- **Mixed Precision:** BF16/FP16 based on hardware support

### Data Processing
- Images processed at various resolutions maintaining aspect ratios
- Text preprocessing includes normalization of Arabic diacritics
- Synthetic data generation pipeline for augmentation
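
The first two steps can be sketched as follows. The example assumes that "normalization of Arabic diacritics" means removing the short-vowel marks (harakat) and that images are capped at a maximum side length while keeping their aspect ratio; neither detail is confirmed by this card, and the function names and the 1000-pixel cap are hypothetical.

```python
import re

# Harakat range U+064B..U+0652 plus the superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic short-vowel marks, leaving the base letters untouched."""
    return _DIACRITICS.sub("", text)


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Target size that keeps the aspect ratio and never upscales."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# "kataba" written with three fatha marks.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(strip_diacritics(vocalized))
print(fit_within(2000, 1000, 1000))
```

The size returned by `fit_within` can be passed to an image library's resize call, for example Pillow's `Image.resize`.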

## Citation

If you use this model, cite the original dataset:

```bibtex
@article{heakl2025kitab,
  title={KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding},
  author={Heakl, Ahmed and Sohail, Abdullah and Ranjan, Mukul and Hossam, Rania and Ahmad, Ghazi and El-Geish, Mohamed and Maher, Omar and Shen, Zhiqiang and Khan, Fahad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.14949},
  year={2025}
}
```


## Related Models

- **Base Model:** [NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct](https://huggingface.co/NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct)
- **KITAB-Bench Collection:** [ahmedheakl/kitab-bench](https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78)