---
license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
- NousResearch/Hermes-3-Llama-3.1-8B
- HPAI-BSC/Llama3.1-Aloe-Beta-8B
tags:
- medical
- clinical-reasoning
- self-consistency
- merge
- dare-ties
- biology
- healthcare
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: false
model-index:
- name: Avicenna-8B-Base
  results:
  - task:
      type: text-generation
      name: Clinical Reasoning
    dataset:
      name: MedQA (USMLE)
      type: GBaker/MedQA-USMLE-4-options-hf
      config: 4-options
      split: test
    metrics:
    - name: Accuracy (SC N=5)
      type: accuracy
      value: 61.0
    - name: Accuracy (Greedy)
      type: accuracy
      value: 60.0
  - task:
      type: text-generation
      name: Biomedical Knowledge
    dataset:
      name: MMLU (Medical)
      type: cais/mmlu
      config: medical
      split: test
    metrics:
    - name: Accuracy (Greedy)
      type: accuracy
      value: 69.05
---

![Avicenna Banner](Avicenna-Banner.png)

# 🧬 Avicenna-8B-Base
![Avicenna Banner](https://img.shields.io/badge/Avicenna-Base_Series-0055a6?style=for-the-badge&logo=caduceus&logoColor=white) [![Medical](https://img.shields.io/badge/Task-Clinical_Reasoning-red?style=flat-square)](https://huggingface.co/salihfurkaan)
> **"Restoring the 'Think' in Medical AI."**

**Avicenna-8B-Base** is the foundational model of the Avicenna Project: a specialized medical language model engineered to achieve **SOTA reasoning** at the 8B parameter scale via architectural merging. It does so by surgically merging three distinct Llama 3.1 models and pairing the result with a **Self-Consistency Ensembling** inference strategy.

---

## 🏗️ Architecture: "The Surgical Merge"

Unlike standard merges that blend models uniformly, Avicenna-8B-Base uses a **layer-segmented DARE-TIES** configuration that assigns specific cognitive roles to different parts of the network.

| Model Region | Source Model | Role | Weights |
| :--- | :--- | :--- | :--- |
| **Foundation (Layers 0-8)** | `Llama-3.1-Instruct` | Syntax, instruction following, and grammar stability. | 100% |
| **Logic Core (Layers 8-20)** | `Hermes-3` + `Llama-3.1` | **Clinical Reasoning:** Implicit logic and causal analysis. | 45% Hermes / 55% Base |
| **Medical Cortex (Layers 20-28)** | `Aloe-Beta` + `Llama-3.1` | **Knowledge Retrieval:** High-density injection of medical textbooks and guidelines. | 52% Aloe / 48% Base |
| **Frontal Cortex (Layers 28-32)** | `Llama-3.1-Instruct` | **Safety & Output:** Ensures polite, structured, and compliant responses. | 100% |

This structure prevents catastrophic forgetting of general logic while injecting dense medical knowledge into the deep layers.

---

## 🏆 Benchmark Performance (Comprehensive Comparison)

We compared Avicenna-8B-Base against other leading medical models across three major benchmarks: **MedQA (USMLE)**, **MMLU-Medical**, and **MedMCQA**.
| Model | Size | Inference Method | MedQA (USMLE) | MMLU-Medical | MedMCQA |
| :--- | :---: | :--- | :---: | :---: | :---: |
| **Avicenna-8B-Base** | **8B** | **Self-Consistency (SC, N=5)** | **61.0%** | - | **50.0%** |
| **Avicenna-8B-Base** | **8B** | Greedy | 60.0% | **69.5%** | - |
| *GPT-3.5 Turbo* | *175B+* | Standard | 61.2% | 73.5% | 59.4% |
| ClinicalCamel-70B | 70B | Standard | 45.8% | 68.4% | 45.8% |
| PMC-LLaMA-13B | 13B | Standard | 39.6% | 56.3% | 37.7% |
| MedAlpaca-13B | 13B | Standard | 37.3% | 51.5% | 35.7% |
| BioMistral-7B | 7B | Standard | 35.4% | 52.6% | 34.8% |
| Meditron-7B | 7B | Standard | 33.5% | 45.2% | 31.1% |

> **Methodology Notes:**
> * **Hardware:** All results were obtained using **4-bit NF4 quantization** on NVIDIA T4 GPUs; full-precision scores are expected to be higher.
> * **Inference:** MedQA and MedMCQA used **Self-Consistency Ensembling (SC)** with N=5 voters; MMLU used standard greedy decoding.
> * **Sampling:** MedQA and MedMCQA results were computed on randomized subsets of the validation/test sets due to compute constraints. MMLU covers the complete evaluation of all six medical subsets.

---

## 🚀 How to Run (Self-Consistency Ensembling)

Use the following Python script, which applies the ensembling strategy to open-ended clinical queries: it samples three independent drafts and then synthesizes them into a single consensus answer.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- CONFIGURATION ---
MODEL_ID = "salihfurkaan/Avicenna-8B-Base"
TOKENIZER_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def setup_model():
    print(f"Loading {MODEL_ID} in 4-bit mode...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,  # remove this line for the non-quantized version
        device_map="auto"
    )
    return model, tokenizer

def solve_with_moa_open_ended(model, tokenizer, user_input):
    """
    Runs Mixture-of-Agents for open-ended queries:
    1. Generates 3 distinct clinical opinions (Drafts).
    2. Synthesizes them into a final consensus answer.
    """
    # --- PHASE 1: DRAFTING (3 Internal Specialists) ---
    system_prompt_draft = (
        "You are Avicenna, an expert medical consultant. Analyze the case step-by-step. "
        "Provide a Differential Diagnosis and Recommended Next Steps."
    )
    messages = [
        {"role": "system", "content": system_prompt_draft},
        {"role": "user", "content": user_input}
    ]
    # return_dict=True yields a BatchEncoding so it can be unpacked with ** below
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    print("Consulting 3 internal specialists (Drafting Phase)...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1536,
            temperature=0.7,          # High creativity for diverse perspectives
            do_sample=True,
            num_return_sequences=3,   # Generate 3 drafts
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract only the new tokens (answers) from the output.
    # outputs shape: [3, seq_len]; inputs shape: [1, prompt_len] -> slice off the prompt
    new_tokens = outputs[:, inputs.input_ids.shape[1]:]
    drafts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # --- PHASE 2: SYNTHESIS (Chief Resident) ---
    print("Synthesizing Final Consensus...")
    combined_drafts = ""
    for i, draft in enumerate(drafts):
        combined_drafts += f"\n[Opinion {i+1}]:\n{draft}\n"
        # Optional: print drafts to see the internal debate
        # print(f"\n--- Opinion {i+1} ---\n{draft[:200]}...")

    aggregator_prompt = (
        f"Clinical Case:\n{user_input}\n\n"
        f"Consider the following 3 medical opinions on this case:\n{combined_drafts}\n\n"
        "TASK: Synthesize these opinions into a single, highly accurate, and professional clinical assessment. "
        "Resolve any conflicts by prioritizing patient safety and standard of care. "
        "Structure the answer clearly: 1. Assessment, 2. Key Differentials, 3. Plan."
    )
    agg_messages = [
        {"role": "system", "content": "You are a Senior Chief Physician. Provide a final authoritative consultation."},
        {"role": "user", "content": aggregator_prompt}
    ]
    agg_inputs = tokenizer.apply_chat_template(
        agg_messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    with torch.no_grad():
        final_output = model.generate(
            **agg_inputs,
            max_new_tokens=768,
            temperature=0.2,  # Low temperature for stable synthesis
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    final_response = tokenizer.decode(
        final_output[0][agg_inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return final_response

if __name__ == "__main__":
    model, tokenizer = setup_model()
    print("\nAvicenna Interactive Consultant")
    print("Type 'exit' or 'quit' to stop.\n")
    while True:
        print("\n" + "-" * 30)
        question = input("Enter Clinical Case/Question: ")
        if question.lower() in ["exit", "quit"]:
            break
        final_answer = solve_with_moa_open_ended(model, tokenizer, question)
        print("\n" + "=" * 40)
        print("FINAL CLINICAL CONSENSUS")
        print("=" * 40)
        print(final_answer)
```

---

## ⚠️ Disclaimer

- **Research Use Only:** Avicenna-8B-Base is a derivative of Llama 3.1. It is intended for academic research, benchmarking, and decision-support prototyping.
- **Not a Doctor:** The model can hallucinate. It must never be used for real-world patient diagnosis or treatment without human supervision.