Medical Symptom to Diagnosis Classifier
A text classification model that predicts medical diagnoses from natural language symptom descriptions using TF-IDF vectorization and classical machine learning algorithms.
Model Description
This model takes natural language descriptions of symptoms as input and predicts one of 22 possible medical diagnoses. It uses TF-IDF (Term Frequency-Inverse Document Frequency) for text vectorization combined with various classical ML classifiers.
Intended Uses
- Educational purposes: Learning about NLP and medical text classification
- Research: Exploring symptom-diagnosis relationships
- Prototyping: Building medical assistance tools
Out-of-Scope Uses
- Clinical diagnosis: This model should NOT be used for actual medical diagnosis
- Medical decision-making: Never replace professional medical consultation
- Emergency situations: Always contact emergency services for urgent health issues
Training Data
Dataset: gretelai/symptom_to_diagnosis
| Split | Examples |
|---|---|
| Train | 853 |
| Test | 212 |
| Total | 1,065 |
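A minimal sketch of loading the two splits with the `datasets` library. The column names `input_text` (symptom description) and `output_text` (diagnosis label) are assumptions about the dataset schema, and both helper names are hypothetical; inspect the features of the loaded dataset before relying on them.

```python
def to_texts_and_labels(rows, text_col="input_text", label_col="output_text"):
    """Split an iterable of row dicts into parallel text and label lists."""
    texts, labels = [], []
    for row in rows:
        texts.append(row[text_col])    # assumed column: symptom description
        labels.append(row[label_col])  # assumed column: diagnosis name
    return texts, labels


def load_splits(name="gretelai/symptom_to_diagnosis"):
    """Download the dataset and return (train, test), each as (texts, labels)."""
    # Imported lazily so the helper above works without network access.
    from datasets import load_dataset

    ds = load_dataset(name)
    return to_texts_and_labels(ds["train"]), to_texts_and_labels(ds["test"])
```

Calling `load_splits()` requires network access; if the assumed column names are right, the returned lists should match the split sizes in the table above.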
Supported Diagnoses (22 classes)
| # | Diagnosis | # | Diagnosis |
|---|---|---|---|
| 1 | Allergy | 12 | Hypertension |
| 2 | Arthritis | 13 | Impetigo |
| 3 | Bronchial Asthma | 14 | Jaundice |
| 4 | Cervical Spondylosis | 15 | Malaria |
| 5 | Chicken Pox | 16 | Migraine |
| 6 | Common Cold | 17 | Peptic Ulcer Disease |
| 7 | Dengue | 18 | Pneumonia |
| 8 | Diabetes | 19 | Psoriasis |
| 9 | Drug Reaction | 20 | Typhoid |
| 10 | Fungal Infection | 21 | Urinary Tract Infection |
| 11 | Gastroesophageal Reflux Disease | 22 | Varicose Veins |
Model Architecture
Text Preprocessing (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 3),  # Unigrams, bigrams, trigrams
    min_df=2,
    max_df=0.8,
    stop_words='english'
)
Classifiers Evaluated
| Model | Key Parameters |
|---|---|
| Logistic Regression | max_iter=1000, class_weight='balanced' |
| Random Forest | n_estimators=200, max_depth=20 |
| SVM | kernel='linear', C=1.0 |
| Naive Bayes | alpha=0.1 |
| Gradient Boosting | n_estimators=100, learning_rate=0.1 |
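A sketch of how the five candidates in the table might be instantiated with scikit-learn. The class choices (`MultinomialNB` for Naive Bayes, `SVC` for SVM), the `random_state` values, and `probability=True` on the SVM are assumptions; the last one is needed only if `predict_proba` is called on the SVM. The helper name `build_classifiers` and the toy data are hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


def build_classifiers(random_state=42):
    """Return the five candidate models keyed by display name."""
    return {
        "Logistic Regression": LogisticRegression(
            max_iter=1000, class_weight="balanced"),
        "Random Forest": RandomForestClassifier(
            n_estimators=200, max_depth=20, random_state=random_state),
        # Assumption: SVC exposes predict_proba only when probability=True.
        "SVM": SVC(kernel="linear", C=1.0, probability=True,
                   random_state=random_state),
        "Naive Bayes": MultinomialNB(alpha=0.1),
        "Gradient Boosting": GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, random_state=random_state),
    }


# Hypothetical toy data, only to show the shared fit/score interface.
texts = [
    "throbbing headache with nausea",
    "headache and sensitivity to light",
    "one sided headache with blurred vision",
    "pulsing headache made worse by noise",
    "sneezing and watery eyes",
    "itchy nose and sneezing near dust",
    "runny nose with itchy eyes",
    "watery eyes after contact with pollen",
]
labels = ["Migraine"] * 4 + ["Allergy"] * 4
X = TfidfVectorizer().fit_transform(texts)

for name, clf in build_classifiers().items():
    clf.fit(X, labels)
    print(f"{name}: training accuracy {clf.score(X, labels):.2f}")
```

Training accuracy on eight sentences says nothing about generalization; a real comparison would score each model on the held-out test split.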
Pipeline
Symptom Text → TF-IDF Vectorization → Classifier → Probability Scores → Top N Diagnoses
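The flow above can be sketched as a single scikit-learn `Pipeline`. This is an illustration under stated assumptions, not the training code behind this model: the six example sentences are hypothetical, Logistic Regression stands in for whichever classifier was selected, and `min_df`/`max_df` are left at their defaults because pruning thresholds are not meaningful on six documents. The helper name `top_n_diagnoses` is also hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy examples covering three of the 22 classes.
texts = [
    "throbbing headache with nausea and sensitivity to light",
    "one sided headache and blurred vision",
    "sneezing, watery eyes and an itchy nose",
    "itchy eyes and sneezing after contact with dust",
    "burning pain when urinating and frequent urination",
    "frequent urination with lower abdominal pain",
]
labels = [
    "Migraine", "Migraine",
    "Allergy", "Allergy",
    "Urinary Tract Infection", "Urinary Tract Infection",
]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(texts, labels)


def top_n_diagnoses(text, n=3):
    """Return the n most probable (diagnosis, probability) pairs, best first."""
    probs = pipeline.predict_proba([text])[0]
    order = np.argsort(probs)[::-1][:n]  # indices sorted by descending probability
    return [(str(pipeline.classes_[i]), float(probs[i])) for i in order]


for diagnosis, prob in top_n_diagnoses("headache and nausea since this morning", n=2):
    print(f"{diagnosis}: {prob:.2f}")
```

Bundling the vectorizer and classifier in one object means raw text goes in and probabilities come out, so the two steps cannot drift apart between training and inference.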
Performance
| Metric | Score |
|---|---|
| Accuracy | ~90-95% |
| F1-Score (macro) | ~90-95% |
| Precision (macro) | ~90-95% |
| Recall (macro) | ~90-95% |
These are approximate ranges; exact performance varies with the selected classifier.
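For readers unfamiliar with macro averaging: each metric is computed per class and then averaged with equal weight, so a rare diagnosis counts as much as a common one. The sketch below spells this out in plain Python on hypothetical predictions; the function name `macro_scores` is illustrative and the numbers are not results from this model.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Compute accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    precision = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recall = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    f1 = [ratio(2 * p * r, p + r) for p, r in zip(precision, recall)]
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision_macro": sum(precision) / n,
        "recall_macro": sum(recall) / n,
        "f1_macro": sum(f1) / n,
    }


# Hypothetical predictions for three of the 22 classes.
y_true = ["Migraine", "Migraine", "Allergy", "Allergy", "Dengue", "Dengue"]
y_pred = ["Migraine", "Allergy", "Allergy", "Allergy", "Dengue", "Migraine"]

scores = macro_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In practice scikit-learn's `precision_recall_fscore_support` with `average="macro"` reports the same kind of unweighted per-class average.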
Usage
Installation
pip install datasets scikit-learn pandas numpy joblib
Inference Example
import joblib

# Load model and vectorizer
model = joblib.load('best_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

# Predict
symptoms = "I have been experiencing severe headaches, sensitivity to light, and nausea"
X = vectorizer.transform([symptoms])
prediction = model.predict(X)
# predict_proba requires a probabilistic classifier (an SVC must be trained with probability=True)
probabilities = model.predict_proba(X)

print(f"Predicted diagnosis: {prediction[0]}")
print(f"Highest class probability: {probabilities[0].max():.2f}")
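The training script that produced `best_model.pkl` and `tfidf_vectorizer.pkl` is not part of this card. As a hedged sketch, files like these could be written and read back with joblib as follows; the toy texts, the choice of Logistic Regression, and the temporary directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data.
texts = [
    "throbbing headache with nausea",
    "headache and sensitivity to light",
    "sneezing and watery eyes",
    "itchy nose and sneezing near dust",
]
labels = ["Migraine", "Migraine", "Allergy", "Allergy"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
X = vectorizer.fit_transform(texts)
model.fit(X, labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    vectorizer_path = Path(tmp) / "tfidf_vectorizer.pkl"

    # Persist both artifacts; the fitted vectorizer must travel with the model.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    restored_model = joblib.load(model_path)
    restored_vectorizer = joblib.load(vectorizer_path)
    reloaded_prediction = restored_model.predict(
        restored_vectorizer.transform(texts))

print(list(reloaded_prediction))
```

Pickled scikit-learn objects are generally tied to the library version that wrote them, so loading them with a matching scikit-learn release avoids surprises.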
With Gradio Interface
import gradio as gr

# Requires gradio (pip install gradio) and the model and vectorizer loaded as in the inference example

def predict_diagnosis(symptoms, top_n=3):
    top_n = int(top_n)  # Slider values may arrive as floats
    X = vectorizer.transform([symptoms])
    probs = model.predict_proba(X)[0]
    classes = model.classes_

    # Get top N predictions, highest probability first
    top_indices = probs.argsort()[-top_n:][::-1]
    # gr.Label expects a mapping of class name to confidence
    return {str(classes[i]): float(probs[i]) for i in top_indices}

demo = gr.Interface(
    fn=predict_diagnosis,
    inputs=[
        gr.Textbox(label="Describe your symptoms"),
        gr.Slider(1, 10, value=3, step=1, label="Number of diagnoses")
    ],
    outputs=gr.Label(num_top_classes=10)
)

demo.launch()
Limitations
- Small dataset: With only 1,065 examples, the model may not capture all symptom variations
- English only: Model trained exclusively on English text
- Limited diagnoses: Only covers 22 conditions
- No context: Does not consider patient history, age, gender, or other factors
- Text quality: Performance depends on symptom description quality
Ethical Considerations
Risks
- Users may rely on predictions for actual medical decisions
- Model may have biases based on training data representation
- False negatives could delay proper medical care
Recommendations
- Always display clear disclaimers about educational purpose
- Encourage users to consult healthcare professionals
- Do not deploy in clinical settings without proper validation
Technical Specifications
Dependencies
datasets>=2.14.0
scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
joblib>=1.3.0
gradio>=4.0.0
Hardware
- Training: CPU sufficient (< 5 minutes)
- Inference: CPU (< 100ms per prediction)
Citation
@misc{symptom_diagnosis_classifier,
title={Medical Symptom to Diagnosis Classifier},
author={Community},
year={2024},
publisher={Hugging Face},
dataset={gretelai/symptom_to_diagnosis}
}
License
- Model: Apache 2.0
- Dataset: Apache 2.0 (gretelai/symptom_to_diagnosis)
DISCLAIMER: This model is for educational purposes only. It is NOT a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of a qualified healthcare provider.