🇻🇳 BiLSTM-CNN-CRF for Vietnamese COVID-19 NER

📌 Overview

This repository provides a Named Entity Recognition (NER) model for Vietnamese text in the COVID-19 medical domain.

The model uses a hybrid BiLSTM-CNN-CRF architecture: a character-level CNN captures morphological features, a BiLSTM captures sentence context, and a CRF layer decodes the tag sequence.

It is trained on the PhoNER_COVID19 dataset for extracting structured medical information from clinical and COVID-related texts.


🧠 Model Architecture

🔹 Word-level Representation

  • Pre-trained FastText embeddings (300-dim)
  • Frozen during training

🔹 Character-level Representation (CNN)

  • Char embedding: 50-dim
  • CNN filters: 30
  • Kernel size: 3

🔹 Context Encoder (BiLSTM)

  • Hidden size: 200
  • Layers: 2
  • Bidirectional
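
The listed sizes can be combined into a rough shape and parameter estimate. This is a minimal sketch, not the repository's code. It assumes that word and max-pooled character features are concatenated per token, that the hidden size of 200 is per direction, and that the LSTM uses the PyTorch `nn.LSTM` parameter layout. The helper names are hypothetical.

```python
# Dimensions taken from the model card.
WORD_DIM = 300      # FastText word embedding size
CHAR_DIM = 50       # character embedding size
CNN_FILTERS = 30    # char-CNN output channels
KERNEL_SIZE = 3     # char-CNN kernel width
HIDDEN = 200        # LSTM hidden size (assumed to be per direction)
LAYERS = 2          # stacked BiLSTM layers


def conv1d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Convolution weights plus one bias per output channel."""
    return in_ch * out_ch * kernel + out_ch


def bilstm_params(input_dim: int, hidden: int, layers: int) -> int:
    """Parameter count for a stacked BiLSTM, assuming four gates and
    two bias vectors per direction (the PyTorch nn.LSTM layout)."""
    total = 0
    layer_input = input_dim
    for _ in range(layers):
        per_direction = 4 * hidden * (layer_input + hidden + 2)
        total += 2 * per_direction
        layer_input = 2 * hidden  # the next layer sees both directions
    return total


# Assumption: word and pooled char features are concatenated per token.
token_dim = WORD_DIM + CNN_FILTERS
encoder_out_dim = 2 * HIDDEN

char_cnn_params = conv1d_params(CHAR_DIM, CNN_FILTERS, KERNEL_SIZE)
encoder_params = bilstm_params(token_dim, HIDDEN, LAYERS)

print(token_dim, encoder_out_dim)       # 330 400
print(char_cnn_params, encoder_params)  # 4530 1814400
```

Under these assumptions, each token enters the encoder as a 330-dim vector and leaves it as a 400-dim vector that feeds the CRF emission layer.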

🔹 Sequence Decoder (CRF)

  • Ensures valid BIO tagging sequences
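
One way to read the BIO constraint is as a rule on adjacent tags: an `I-X` tag may only follow `B-X` or `I-X`. The sketch below checks that rule on a finished tag sequence. It illustrates the constraint only and does not show how the CRF layer in this repository is implemented; the function names are hypothetical.

```python
from typing import List


def is_valid_transition(prev: str, curr: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if not curr.startswith("I-"):
        return True
    entity = curr[2:]
    return prev in (f"B-{entity}", f"I-{entity}")


def is_valid_bio(tags: List[str]) -> bool:
    """Check a whole sequence, treating the position before it as "O"."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True


print(is_valid_bio(["B-LOCATION", "I-LOCATION", "O"]))  # True
print(is_valid_bio(["O", "I-AGE"]))                     # False
print(is_valid_bio(["B-AGE", "I-DATE"]))                # False
```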

🏷️ Named Entities

Entity               Description
AGE                  Age
DATE                 Date
GENDER               Gender
JOB                  Occupation
LOCATION             Location
NAME                 Person name
ORGANIZATION         Organization
PATIENT_ID           Patient ID
SYMPTOM_AND_DISEASE  Symptom and disease
TRANSPORTATION       Means of transportation
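
Under a BIO scheme, the ten entity types expand into a tag set of 21 labels: a `B-` and an `I-` tag per type, plus `O`. The sketch below builds such a mapping. The ordering is an assumption made for illustration; the `label2id.json` shipped with the model may order labels differently or include extra entries such as padding.

```python
from typing import Dict, List

ENTITY_TYPES = [
    "AGE", "DATE", "GENDER", "JOB", "LOCATION", "NAME",
    "ORGANIZATION", "PATIENT_ID", "SYMPTOM_AND_DISEASE", "TRANSPORTATION",
]


def build_label2id(entity_types: List[str]) -> Dict[str, int]:
    """Assign ids to "O" followed by a B-/I- pair for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return {label: idx for idx, label in enumerate(labels)}


example_label2id = build_label2id(ENTITY_TYPES)
print(len(example_label2id))  # 21
```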

⚙️ Training Configuration

Parameter       Value
Word Embedding  300
Char Embedding  50
LSTM Hidden     200
Dropout         0.25
Optimizer       Adam
Learning Rate   0.001
Batch Size      36
Epochs          30
Early Stopping  19
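
For scripting, the same settings can be collected in a small config object. This is only a sketch: the class and field names are hypothetical and are not taken from the repository's training code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    word_emb_dim: int = 300
    char_emb_dim: int = 50
    lstm_hidden: int = 200
    dropout: float = 0.25
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 36
    epochs: int = 30
    # Value as listed in the table; the card does not say whether it is
    # a patience setting or the epoch at which training stopped.
    early_stopping: int = 19


config = TrainConfig()
print(asdict(config))
```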

📊 Evaluation Results

🏆 Overall Performance

  • Micro Precision: 0.9302
  • Micro Recall: 0.9122
  • Micro F1-score: 0.9211
  • Macro F1-score: 0.8918
  • Weighted F1-score: 0.9194
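
The micro F1-score is the harmonic mean of micro precision and micro recall, so it can be recomputed from the two reported values. A minimal check (the helper name is hypothetical):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


micro_f1 = f1_score(0.9302, 0.9122)
print(round(micro_f1, 4))  # 0.9211
```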


📈 Detailed Results by Entity

Entity               Precision  Recall  F1-score  Support
AGE                  0.9609     0.9663  0.9636    356
DATE                 0.9670     0.9837  0.9753    1103
GENDER               0.9698     0.9380  0.9536    274
JOB                  0.8395     0.5191  0.6415    131
LOCATION             0.9221     0.9245  0.9233    2727
NAME                 0.9325     0.8172  0.8711    186
ORGANIZATION         0.8611     0.7985  0.8286    551
PATIENT_ID           0.9803     0.9819  0.9811    1269
SYMPTOM_AND_DISEASE  0.8368     0.7833  0.8092    766
TRANSPORTATION       0.9881     0.9540  0.9708    87
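
The macro and weighted F1-scores follow from the per-entity rows: the macro score is the plain mean of the ten F1 values, and the weighted score weights each F1 by its support. The sketch below recomputes both from the rounded table values, which reproduces the reported figures to four decimals. The function names are hypothetical.

```python
# (entity, F1, support) rows copied from the per-entity results.
RESULTS = [
    ("AGE", 0.9636, 356),
    ("DATE", 0.9753, 1103),
    ("GENDER", 0.9536, 274),
    ("JOB", 0.6415, 131),
    ("LOCATION", 0.9233, 2727),
    ("NAME", 0.8711, 186),
    ("ORGANIZATION", 0.8286, 551),
    ("PATIENT_ID", 0.9811, 1269),
    ("SYMPTOM_AND_DISEASE", 0.8092, 766),
    ("TRANSPORTATION", 0.9708, 87),
]


def macro_f1(rows):
    """Unweighted mean of the per-entity F1 values."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of the per-entity F1 values weighted by support."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total


print(round(macro_f1(RESULTS), 4))     # 0.8918
print(round(weighted_f1(RESULTS), 4))  # 0.9194
```

The gap between the macro and weighted scores comes mostly from JOB, which has a low F1 but little support.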

🔍 Observations

  • Strong performance: DATE, AGE, PATIENT_ID (F1 above 0.96)
  • Good performance: LOCATION, GENDER (F1 above 0.92)
  • Lower performance:
    • JOB (F1 0.6415, recall only 0.5191; data imbalance, sparse samples with a support of 131)
    • SYMPTOM_AND_DISEASE (F1 0.8092; complex multi-span entities)

🚀 How to Use

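import json
import os

# Load the vocabularies and the label map shipped with the model.
with open("word2id.json", "r", encoding="utf-8") as f:
    word2id = json.load(f)

with open("char2id.json", "r", encoding="utf-8") as f:
    char2id = json.load(f)

with open("label2id.json", "r", encoding="utf-8") as f:
    label2id = json.load(f)

# Invert label -> id into id -> label before writing it out
# (assumes label2id.json maps label strings to integer ids).
id2label = {int(idx): label for label, idx in label2id.items()}

# Save the inverse map; JSON stores the integer keys as strings.
os.makedirs("hf_model", exist_ok=True)
with open("hf_model/id2label.json", "w", encoding="utf-8") as f:
    json.dump(id2label, f, ensure_ascii=False, indent=2)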
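
Once the mappings are loaded, a sentence has to be turned into word and character ids before inference, and the predicted tags have to be grouped back into entities. The sketch below shows both steps with toy vocabularies and hand-written tags, not real model output. The `<UNK>` key, the fallback id 0, and the function names are assumptions; check the shipped JSON files for the actual unknown-token entry and any lowercasing.

```python
from typing import Dict, List, Tuple

UNK = "<UNK>"  # hypothetical unknown-token key


def encode_sentence(
    tokens: List[str], word2id: Dict[str, int], char2id: Dict[str, int]
) -> Tuple[List[int], List[List[int]]]:
    """Map tokens to word ids and to per-token lists of character ids."""
    unk_word = word2id.get(UNK, 0)
    unk_char = char2id.get(UNK, 0)
    word_ids = [word2id.get(tok, unk_word) for tok in tokens]
    char_ids = [[char2id.get(ch, unk_char) for ch in tok] for tok in tokens]
    return word_ids, char_ids


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO tags into (entity_type, text) pairs.
    A stray I-X tag that does not continue a span is ignored."""
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
            continue
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag.startswith("B-"):
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Toy data for illustration only.
toy_word2id = {UNK: 0, "Bệnh_nhân": 1, "ở": 2}
toy_char2id = {UNK: 0, "9": 1, "1": 2}
tokens = ["Bệnh_nhân", "số", "91", "ở", "Hà_Nội"]
tags = ["O", "O", "B-PATIENT_ID", "O", "B-LOCATION"]

word_ids, char_ids = encode_sentence(tokens, toy_word2id, toy_char2id)
print(word_ids)                     # [1, 0, 0, 2, 0]
print(extract_spans(tokens, tags))  # [('PATIENT_ID', '91'), ('LOCATION', 'Hà_Nội')]
```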