---
base_model:
- unsloth/qwen2-vl-2b-instruct-unsloth-bnb-4bit
- NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/qwen2-vl-2b-instruct-unsloth-bnb-4bit
- lora
- transformers
- unsloth
license: apache-2.0
datasets:
- ahmedheakl/arocrbench_synthesizear
- ahmedheakl/arocrbench_patsocr
- ahmedheakl/arocrbench_historyar
- ahmedheakl/arocrbench_historicalbooks
- ahmedheakl/arocrbench_khattparagraph
- ahmedheakl/arocrbench_adab
- ahmedheakl/arocrbench_muharaf
- ahmedheakl/arocrbench_onlinekhatt
- ahmedheakl/arocrbench_khatt
- ahmedheakl/arocrbench_isippt
- ahmedheakl/arocrbench_arabicocr
- ahmedheakl/arocrbench_hindawi
- ahmedheakl/arocrbench_evarest
metrics:
- wer
---

# Qari-OCR-Fine-Tuned-Kitab-Benchmark

## Model Description

This model is a LoRA fine-tuned version of [NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct](https://huggingface.co/NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct) specifically optimized for Arabic OCR tasks using the comprehensive KITAB-Bench dataset.

### Model Details

- **Base Model:** NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct
- **Model Type:** Vision-Language Model with LoRA fine-tuning
- **Language:** Arabic (primary), with multilingual capabilities
- **License:** Apache 2.0 (as declared in the card metadata)
- **Fine-tuned for:** Arabic Optical Character Recognition (OCR)

### Training Configuration

- **Training Method:** LoRA (Low-Rank Adaptation)
- **LoRA Parameters:**
  - Rank (r): 16
  - Alpha: 32
  - Dropout: 0.05
  - Target Modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- **Training Epochs:** 5
- **Batch Size:** 4 per device
- **Learning Rate:** 2e-4
- **Optimizer:** AdamW 8-bit
- **Max Sequence Length:** 2048

## Dataset

The model was trained on a curated subset of Arabic OCR datasets comprising **3,760 total samples** from **13 domain-specific datasets**:

### Training Data Composition

**Total Combined Dataset:** 3,760 samples

- **Training Set:** 3,572 samples (95% of total)
- **Held-out Test Set:** 188 samples (5% of total)

### Source Datasets Used:

- **ahmedheakl/arocrbench_synthesizear:** 500 samples
- **ahmedheakl/arocrbench_patsocr:** 500 samples
- **ahmedheakl/arocrbench_historyar:** 200 samples
- **ahmedheakl/arocrbench_historicalbooks:** 10 samples
- **ahmedheakl/arocrbench_khattparagraph:** 200 samples
- **ahmedheakl/arocrbench_adab:** 200 samples
- **ahmedheakl/arocrbench_muharaf:** 200 samples
- **ahmedheakl/arocrbench_onlinekhatt:** 200 samples
- **ahmedheakl/arocrbench_khatt:** 200 samples
- **ahmedheakl/arocrbench_isippt:** 500 samples
- **ahmedheakl/arocrbench_arabicocr:** 50 samples
- **ahmedheakl/arocrbench_hindawi:** 200 samples
- **ahmedheakl/arocrbench_evarest:** 800 samples

### Data Split

- 95% training (3,572 samples)
- 5% held-out test (188 samples) for final evaluation

### Domain Coverage

- **Handwritten Text:** Historical manuscripts, personal notes, traditional calligraphy
- **Printed Text:** Books, newspapers, academic papers, legal documents
- **Scene Text:** Street signs, advertisements, natural environments
- **Structured Documents:** Tables, forms, layouts
- **Historical Documents:** Ancient texts, heritage manuscripts
- **Synthetic Data:** Generated text for augmentation

## Performance

### Evaluation Results on Held-Out Test Set

| Metric | Score |
|--------|-------|
| **Word Error Rate (WER)** | 0.4388 |
| **Character Error Rate (CER)** | 0.2231 |
| **BLEU Score** | 48.12 |

## Intended Use

### Primary Use Cases

- Arabic document digitization
- Historical manuscript transcription
- Multi-domain Arabic text recognition
- RAG (Retrieval-Augmented Generation) document processing pipelines
- Academic research in Arabic NLP and OCR

### Direct Use

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

MODEL_ID = "FatimahEmadEldin/Qari-OCR-Fine-Tuned-Kitab-Benchmark"

# Load model and processor.
# Note: the card lists this repository as a PEFT (LoRA) adapter,
# so the `peft` package likely needs to be installed as well.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

# Process image
image = Image.open("arabic_document.jpg")
prompt = "Below is the image of one page of a document. Please provide the plain text representation of this document as if you were reading it naturally, ensuring high accuracy."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text_prompt, images=image, return_tensors="pt")
inputs = inputs.to(model.device)  # keep inputs on the same device as the model

# Generate, then drop the prompt tokens so only the new text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Limitations and Considerations

### Known Limitations

- **Complex Fonts:** Performance may vary with highly stylized or decorative Arabic fonts
- **Numeral Recognition:** Some challenges with mixed Arabic-Indic numeral systems
- **Word Elongation:** Handling of kashida (Arabic text elongation) requires improvement
- **PDF-to-Markdown:** Limited accuracy (best models achieve ~65% on complex layouts)

### Bias and Fairness

- Trained primarily on Modern Standard Arabic; dialectal variations may have reduced accuracy
- Historical document performance depends on manuscript quality and preservation state
- Geographic bias toward Gulf and Levantine Arabic text styles

## Technical Specifications

### Hardware Requirements

- **Minimum:** 8 GB GPU memory for inference
- **Recommended:** 16 GB+ GPU memory for optimal performance
- **Training:** Conducted on NVIDIA A100 GPUs

### Software Dependencies

- transformers >= 4.51.3
- torch >= 2.4.0
- unsloth (for efficient training)
- Pillow for image processing

## Training Details

### Training Infrastructure

- **Framework:** Unsloth for efficient LoRA training
- **Quantization:** 4-bit quantization for memory efficiency
- **Mixed Precision:** BF16/FP16 based on hardware support

### Data Processing

- Images processed at various resolutions maintaining aspect ratios
- Text preprocessing includes normalization of Arabic diacritics
- Synthetic data generation pipeline for augmentation

## Citation

If
you use this model, cite the original dataset:

```bibtex
@article{heakl2025kitab,
  title={KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding},
  author={Heakl, Ahmed and Sohail, Abdullah and Ranjan, Mukul and Hossam, Rania and Ahmad, Ghazi and El-Geish, Mohamed and Maher, Omar and Shen, Zhiqiang and Khan, Fahad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.14949},
  year={2025}
}
```

## Related Models

- **Base Model:** [NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct](https://huggingface.co/NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct)
- **KITAB-Bench Collection:** [ahmedheakl/kitab-bench](https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78)
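
## Example: Computing WER and CER

The Performance section reports Word Error Rate and Character Error Rate on the held-out test set, but this card does not say which tool or text normalization produced those numbers. As a rough illustration of what the two metrics measure, the sketch below computes both from a Levenshtein edit distance in plain Python. The function names (`edit_distance`, `wer`, `cer`) and the sample strings are assumptions made for this example, and whitespace tokenization here may differ from the evaluation actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and model output, for illustration only
    reference = "كتاب جديد"
    hypothesis = "كتاب جديدة"
    print(f"WER: {wer(reference, hypothesis):.4f}")
    print(f"CER: {cer(reference, hypothesis):.4f}")
```

Both metrics are lower-is-better, and either can exceed 1.0 when the hypothesis contains many insertions. Choices such as whether to strip diacritics or collapse whitespace before scoring can shift the values noticeably, so scores are only comparable when the same normalization is applied.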