---
license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
- NousResearch/Hermes-3-Llama-3.1-8B
- HPAI-BSC/Llama3.1-Aloe-Beta-8B
tags:
- medical
- clinical-reasoning
- self-consistency
- merge
- dare-ties
- biology
- healthcare
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: false
model-index:
- name: Avicenna-8B-Base
  results:
  - task:
      type: text-generation
      name: Clinical Reasoning
    dataset:
      name: MedQA (USMLE)
      type: GBaker/MedQA-USMLE-4-options-hf
      config: 4-options
      split: test
    metrics:
    - name: Accuracy (SC N=5)
      type: accuracy
      value: 61.0
    - name: Accuracy (Greedy)
      type: accuracy
      value: 60.0
  - task:
      type: text-generation
      name: Biomedical Knowledge
    dataset:
      name: MMLU (Medical)
      type: cais/mmlu
      config: medical
      split: test
    metrics:
    - name: Accuracy (Greedy)
      type: accuracy
      value: 69.05
---

![Avicenna Banner](Avicenna-Banner.png)

# 🧬 Avicenna-8B-Base
![Avicenna Banner](https://img.shields.io/badge/Avicenna-Base_Series-0055a6?style=for-the-badge&logo=caduceus&logoColor=white) [![Medical](https://img.shields.io/badge/Task-Clinical_Reasoning-red?style=flat-square)](https://huggingface.co/salihfurkaan)
> **"Restoring the 'Think' in Medical AI."**

**Avicenna-8B-Base** is the foundational model of the Avicenna Project: a specialized medical language model engineered to achieve **SOTA reasoning** at the 8B parameter scale via architectural merging. It does so by surgically merging three distinct Llama 3.1 models and pairing the result with a **Self-Consistency Ensembling** inference strategy.

---

## 🏗️ Architecture: "The Surgical Merge"

Unlike standard merges that blend models uniformly, Avicenna-8B-Base uses a **layer-segmented DARE-TIES** configuration that assigns specific cognitive roles to different parts of the network.

| Model Region | Source Model | Role | Weights |
| :--- | :--- | :--- | :--- |
| **Foundation (Layers 0-8)** | `Llama-3.1-Instruct` | Syntax, instruction following, and grammar stability. | 100% |
| **Logic Core (Layers 8-20)** | `Hermes-3` + `Llama-3.1` | **Clinical Reasoning:** Implicit logic and causal analysis. | 45% Hermes / 55% Base |
| **Medical Cortex (Layers 20-28)** | `Aloe-Beta` + `Llama-3.1` | **Knowledge Retrieval:** High-density injection of medical textbooks and guidelines. | 52% Aloe / 48% Base |
| **Frontal Cortex (Layers 28-32)** | `Llama-3.1-Instruct` | **Safety & Output:** Ensures polite, structured, and compliant responses. | 100% |

This structure prevents catastrophic forgetting of general logic while injecting dense medical knowledge into the deep layers.

---

## 🏆 Benchmark Performance (Comprehensive Comparison)

We compared Avicenna-8B-Base against other leading medical models across three major benchmarks: **MedQA (USMLE)**, **MMLU-Medical**, and **MedMCQA**.
| Model | Size | Inference Method | MedQA (USMLE) | MMLU-Medical | MedMCQA |
| :--- | :---: | :--- | :---: | :---: | :---: |
| **Avicenna-8B-Base** | **8B** | **Self-Consistency (SC, N=5)** | **61.0%** | - | **50.0%** |
| **Avicenna-8B-Base** | **8B** | Greedy | 60.0% | **69.5%** | - |
| *GPT-3.5 Turbo* | *175B+* | Standard | 61.2% | 73.5% | 59.4% |
| ClinicalCamel-70B | 70B | Standard | 45.8% | 68.4% | 45.8% |
| PMC-LLaMA-13B | 13B | Standard | 39.6% | 56.3% | 37.7% |
| MedAlpaca-13B | 13B | Standard | 37.3% | 51.5% | 35.7% |
| BioMistral-7B | 7B | Standard | 35.4% | 52.6% | 34.8% |
| Meditron-7B | 7B | Standard | 33.5% | 45.2% | 31.1% |

> **Methodology Notes:**
> * **Hardware:** All results were obtained using **4-bit NF4 quantization** on NVIDIA T4 GPUs; full-precision scores are expected to be higher.
> * **Inference:** MedQA and MedMCQA used **Self-Consistency Ensembling (SC)** with N=5 voters; MMLU used standard greedy decoding.
> * **Sampling:** MedQA and MedMCQA results were computed on randomized subsets of the validation/test sets due to compute constraints. MMLU covers the complete evaluation of all six medical subsets.

---

## 🚀 How to Run (Self-Consistency Ensembling)

Use the following Python script, which applies the ensembling strategy to open-ended clinical queries: it samples three independent drafts and then synthesizes them into a single consensus answer.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- CONFIGURATION ---
MODEL_ID = "salihfurkaan/Avicenna-8B-Base"
TOKENIZER_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def setup_model():
    print(f"Loading {MODEL_ID} in 4-bit mode...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,  # remove this line for the non-quantized version
        device_map="auto"
    )
    return model, tokenizer

def solve_with_moa_open_ended(model, tokenizer, user_input):
    """
    Runs Mixture-of-Agents for open-ended queries:
    1. Generates 3 distinct clinical opinions (Drafts).
    2. Synthesizes them into a final consensus answer.
    """
    # --- PHASE 1: DRAFTING (3 Internal Specialists) ---
    system_prompt_draft = (
        "You are Avicenna, an expert medical consultant. Analyze the case step-by-step. "
        "Provide a Differential Diagnosis and Recommended Next Steps."
    )
    messages = [
        {"role": "system", "content": system_prompt_draft},
        {"role": "user", "content": user_input}
    ]
    # return_dict=True yields a BatchEncoding so it can be unpacked with ** below
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    print("Consulting 3 internal specialists (Drafting Phase)...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1536,
            temperature=0.7,          # High creativity for diverse perspectives
            do_sample=True,
            num_return_sequences=3,   # Generate 3 drafts
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract only the new tokens (answers) from the output.
    # outputs shape: [3, seq_len]; inputs shape: [1, prompt_len] -> slice off the prompt
    new_tokens = outputs[:, inputs.input_ids.shape[1]:]
    drafts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # --- PHASE 2: SYNTHESIS (Chief Resident) ---
    print("Synthesizing Final Consensus...")
    combined_drafts = ""
    for i, draft in enumerate(drafts):
        combined_drafts += f"\n[Opinion {i+1}]:\n{draft}\n"
        # Optional: print drafts to see the internal debate
        # print(f"\n--- Opinion {i+1} ---\n{draft[:200]}...")

    aggregator_prompt = (
        f"Clinical Case:\n{user_input}\n\n"
        f"Consider the following 3 medical opinions on this case:\n{combined_drafts}\n\n"
        "TASK: Synthesize these opinions into a single, highly accurate, and professional clinical assessment. "
        "Resolve any conflicts by prioritizing patient safety and standard of care. "
        "Structure the answer clearly: 1. Assessment, 2. Key Differentials, 3. Plan."
    )
    agg_messages = [
        {"role": "system", "content": "You are a Senior Chief Physician. Provide a final authoritative consultation."},
        {"role": "user", "content": aggregator_prompt}
    ]
    agg_inputs = tokenizer.apply_chat_template(
        agg_messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    with torch.no_grad():
        final_output = model.generate(
            **agg_inputs,
            max_new_tokens=768,
            temperature=0.2,  # Low temperature for stable synthesis
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    final_response = tokenizer.decode(
        final_output[0][agg_inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return final_response

if __name__ == "__main__":
    model, tokenizer = setup_model()
    print("\nAvicenna Interactive Consultant")
    print("Type 'exit' or 'quit' to stop.\n")
    while True:
        print("\n" + "-" * 30)
        question = input("Enter Clinical Case/Question: ")
        if question.lower() in ["exit", "quit"]:
            break
        final_answer = solve_with_moa_open_ended(model, tokenizer, question)
        print("\n" + "=" * 40)
        print("FINAL CLINICAL CONSENSUS")
        print("=" * 40)
        print(final_answer)
```

---

## ⚠️ Disclaimer

- **Research Use Only:** Avicenna-8B-Base is a derivative of Llama 3.1. It is intended for academic research, benchmarking, and decision-support prototyping.
- **Not a Doctor:** The model can hallucinate. It must never be used for real-world patient diagnosis or treatment without human supervision.