Medical Symptom to Diagnosis Classifier

A text classification model that predicts medical diagnoses from natural language symptom descriptions using TF-IDF vectorization and classical machine learning algorithms.

Model Description

This model takes natural language descriptions of symptoms as input and predicts one of 22 possible medical diagnoses. It uses TF-IDF (Term Frequency-Inverse Document Frequency) for text vectorization combined with various classical ML classifiers.

Intended Uses

  • Educational purposes: Learning about NLP and medical text classification
  • Research: Exploring symptom-diagnosis relationships
  • Prototyping: Building medical assistance tools

Out-of-Scope Uses

  • Clinical diagnosis: This model should NOT be used for actual medical diagnosis
  • Medical decision-making: Never replace professional medical consultation
  • Emergency situations: Always contact emergency services for urgent health issues

Training Data

Dataset: gretelai/symptom_to_diagnosis

| Split | Examples |
| --- | --- |
| Train | 853 |
| Test | 212 |
| Total | 1,065 |

Supported Diagnoses (22 classes)

| # | Diagnosis | # | Diagnosis |
| --- | --- | --- | --- |
| 1 | Allergy | 12 | Hypertension |
| 2 | Arthritis | 13 | Impetigo |
| 3 | Bronchial Asthma | 14 | Jaundice |
| 4 | Cervical Spondylosis | 15 | Malaria |
| 5 | Chicken Pox | 16 | Migraine |
| 6 | Common Cold | 17 | Peptic Ulcer Disease |
| 7 | Dengue | 18 | Pneumonia |
| 8 | Diabetes | 19 | Psoriasis |
| 9 | Drug Reaction | 20 | Typhoid |
| 10 | Fungal Infection | 21 | Urinary Tract Infection |
| 11 | Gastroesophageal Reflux Disease | 22 | Varicose Veins |

Model Architecture

Text Preprocessing (TF-IDF)

TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 3),    # Unigrams, bigrams, trigrams
    min_df=2,
    max_df=0.8,
    stop_words='english'
)

Classifiers Evaluated

| Model | Key Parameters |
| --- | --- |
| Logistic Regression | max_iter=1000, class_weight='balanced' |
| Random Forest | n_estimators=200, max_depth=20 |
| SVM | kernel='linear', C=1.0 |
| Naive Bayes | alpha=0.1 |
| Gradient Boosting | n_estimators=100, learning_rate=0.1 |
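The parameters above could map onto scikit-learn estimators roughly as follows. This is a sketch under assumptions: the card does not name the exact estimator classes, so `SVC` and `MultinomialNB` are guesses, and `random_state` and `probability=True` are additions (the latter is needed if the SVM should expose `predict_proba`).

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Estimator classes, random_state and probability=True are assumptions;
# the remaining parameters come from the table above.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42),
    "SVM": SVC(kernel="linear", C=1.0, probability=True, random_state=42),
    "Naive Bayes": MultinomialNB(alpha=0.1),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

for name, clf in classifiers.items():
    print(f"{name}: {clf.__class__.__name__}")
```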

Pipeline

Symptom Text → TF-IDF Vectorization → Classifier → Probability Scores → Top N Diagnoses
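The flow above can be sketched with a scikit-learn `Pipeline`. The training sentences, labels, and the helper `top_n_diagnoses` are hypothetical and exist only for illustration. `min_df` and `max_df` are left at their defaults here because the toy corpus is too small for the settings used in the model card.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not drawn from the real dataset.
texts = [
    "throbbing headache and sensitivity to light",
    "pounding headache with nausea",
    "runny nose and sneezing",
    "sore throat, sneezing and mild cough",
    "burning pain when urinating",
    "frequent urge to urinate with pelvic pain",
]
labels = [
    "migraine", "migraine",
    "common cold", "common cold",
    "urinary tract infection", "urinary tract infection",
]

# Symptom text -> TF-IDF -> classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(texts, labels)


def top_n_diagnoses(text, n=3):
    """Return the n most probable (diagnosis, probability) pairs."""
    probs = pipeline.predict_proba([text])[0]
    order = np.argsort(probs)[::-1][:n]  # indices sorted by descending probability
    return [(pipeline.classes_[i], float(probs[i])) for i in order]


print(top_n_diagnoses("pounding headache and nausea", n=2))
```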

Performance

| Metric | Score |
| --- | --- |
| Accuracy | ~90-95% |
| F1-Score (macro) | ~90-95% |
| Precision (macro) | ~90-95% |
| Recall (macro) | ~90-95% |

Exact performance varies with the selected classifier; the ranges above are approximate.
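For reference, macro-averaged metrics like those in the table can be computed with scikit-learn as sketched below. The labels here are made up to show the calculation and do not reproduce the reported scores. Macro averaging takes the unweighted mean of the per-class scores, so every diagnosis counts equally regardless of how many examples it has.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels for illustration only.
y_true = ["migraine", "migraine", "malaria", "malaria", "dengue", "dengue"]
y_pred = ["migraine", "migraine", "malaria", "dengue", "dengue", "dengue"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"Accuracy:          {accuracy:.3f}")
print(f"Precision (macro): {precision:.3f}")
print(f"Recall (macro):    {recall:.3f}")
print(f"F1-Score (macro):  {f1:.3f}")
```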

Usage

Installation

pip install datasets scikit-learn pandas numpy joblib gradio

Inference Example

import joblib

# Load model and vectorizer
model = joblib.load('best_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

# Predict
symptoms = "I have been experiencing severe headaches, sensitivity to light, and nausea"
X = vectorizer.transform([symptoms])
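prediction = model.predict(X)
# predict_proba is only available if the saved classifier supports it
# (for example, an SVC must have been trained with probability=True)
probabilities = model.predict_proba(X)

print(f"Predicted diagnosis: {prediction[0]}")
print(f"Confidence: {probabilities[0].max():.2%}")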
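The inference example assumes that `best_model.pkl` and `tfidf_vectorizer.pkl` already exist. The card does not show how they were produced; one plausible way is `joblib.dump`, sketched here with a toy model and a temporary directory so that nothing is written to the working directory. The training sentences and labels are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training split.
texts = [
    "severe headache and nausea",
    "headache with sensitivity to light",
    "runny nose and sneezing",
    "sneezing with sore throat",
]
labels = ["Migraine", "Migraine", "Common Cold", "Common Cold"]

vectorizer = TfidfVectorizer(stop_words="english")
model = MultinomialNB(alpha=0.1).fit(vectorizer.fit_transform(texts), labels)

query = "headache and nausea"
original = model.predict(vectorizer.transform([query]))[0]

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    vectorizer_path = Path(tmp) / "tfidf_vectorizer.pkl"

    # Persist both artifacts: the vectorizer must be saved alongside the
    # model so that inference uses the same vocabulary.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)
    restored = loaded_model.predict(loaded_vectorizer.transform([query]))[0]

print(original, restored)
```

Pickled scikit-learn objects are generally tied to the library version that created them, so loading them with a different scikit-learn release may raise warnings or fail.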

With Gradio Interface

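import gradio as gr
import joblib

# Load the saved model and vectorizer (same files as in the inference example)
model = joblib.load('best_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

def predict_diagnosis(symptoms, top_n=3):
    X = vectorizer.transform([symptoms])
    probs = model.predict_proba(X)[0]
    classes = model.classes_

    # Get top N predictions; the slider may pass a float, so cast to int
    top_indices = probs.argsort()[::-1][:int(top_n)]
    # gr.Label expects a dict mapping each label to its confidence
    return {str(classes[i]): float(probs[i]) for i in top_indices}

demo = gr.Interface(
    fn=predict_diagnosis,
    inputs=[
        gr.Textbox(label="Describe your symptoms"),
        gr.Slider(1, 10, value=3, step=1, label="Number of diagnoses")
    ],
    # Allow as many classes as the slider maximum so none are cut off
    outputs=gr.Label(num_top_classes=10)
)
demo.launch()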

Limitations

  • Small dataset: With only 1,065 examples, the data may not capture all symptom variations
  • English only: Model trained exclusively on English text
  • Limited diagnoses: Only covers 22 conditions
  • No context: Does not consider patient history, age, gender, or other factors
  • Text quality: Performance depends on symptom description quality

Ethical Considerations

Risks

  • Users may rely on predictions for actual medical decisions
  • Model may have biases based on training data representation
  • False negatives could delay proper medical care

Recommendations

  • Always display clear disclaimers about the model's educational purpose
  • Encourage users to consult healthcare professionals
  • Do not deploy in clinical settings without proper validation

Technical Specifications

Dependencies

datasets>=2.14.0
scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
joblib>=1.3.0
gradio>=4.0.0

Hardware

  • Training: CPU sufficient (< 5 minutes)
  • Inference: CPU (< 100ms per prediction)

Citation

@misc{symptom_diagnosis_classifier,
  title={Medical Symptom to Diagnosis Classifier},
  author={Community},
  year={2024},
  publisher={Hugging Face},
  dataset={gretelai/symptom_to_diagnosis}
}

License

  • Model: Apache 2.0
  • Dataset: Apache 2.0 (gretelai/symptom_to_diagnosis)

DISCLAIMER: This model is for educational purposes only. It is NOT a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of a qualified healthcare provider.


Dataset used to train A0lgk/Medical_Diagnosis