🇻🇳 BiLSTM-CNN-CRF for Vietnamese COVID-19 NER
📌 Overview
This repository provides a Named Entity Recognition (NER) model for Vietnamese text in the COVID-19 medical domain.
The model uses a hybrid BiLSTM-CNN-CRF architecture: a character-level CNN captures morphological features, a BiLSTM captures sentence context, and a CRF layer decodes the tag sequence.
It is trained on the PhoNER_COVID19 dataset for extracting structured medical information from clinical and COVID-related texts.
🧠 Model Architecture
🔹 Word-level Representation
- Pre-trained FastText embeddings (300-dim)
- Frozen during training
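To illustrate how pre-trained vectors line up with a vocabulary, here is a minimal sketch that reads FastText's text `.vec` format (a header line with word count and dimension, then one word and its vector per line) into a matrix indexed by word id. The `load_vectors` helper, the toy 3-dimensional vectors, and the zero-vector fallback for words missing from the file are assumptions for illustration; this README does not show the repository's actual loading code.

```python
import io

def load_vectors(handle, word2id, dim):
    """Build a row-per-id embedding matrix from a FastText-style .vec stream."""
    # Words missing from the file keep an all-zero row (an assumed fallback).
    matrix = [[0.0] * dim for _ in range(len(word2id))]
    header = handle.readline().split()  # "<num_words> <dim>"
    if int(header[1]) != dim:
        raise ValueError("unexpected embedding dimension")
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word2id and len(values) == dim:
            matrix[word2id[word]] = [float(v) for v in values]
    return matrix

# Toy stand-in for a real 300-dim file; the values are made up.
toy_vec = io.StringIO("2 3\nbệnh_nhân 0.1 0.2 0.3\nsốt 0.4 0.5 0.6\n")
toy_word2id = {"<PAD>": 0, "bệnh_nhân": 1, "sốt": 2, "ho": 3}
embeddings = load_vectors(toy_vec, toy_word2id, dim=3)
```

"Frozen" means these rows receive no gradient updates. If the model were implemented in PyTorch (the README does not say), `torch.nn.Embedding.from_pretrained(weights, freeze=True)` would be one way to get that behaviour.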
🔹 Character-level Representation (CNN)
- Char embedding: 50-dim
- CNN filters: 30
- Kernel size: 3
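The character-level step can be sketched in plain Python: each character is mapped to a 50-dim vector, 30 filters of width 3 slide over the word, and max-over-time pooling keeps one value per filter. The random weights, the one-position zero padding on each side, and the absence of a bias or activation are simplifying assumptions, not the trained model.

```python
import random

CHAR_DIM, NUM_FILTERS, KERNEL = 50, 30, 3  # values from the list above

rng = random.Random(0)  # random weights stand in for trained ones

def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

def char_cnn(word, char_emb, filters):
    """Return one max-pooled feature per filter for a single non-empty word."""
    pad = [0.0] * CHAR_DIM
    # One padding position on each side (assumed) so short words still have a window.
    rows = [pad] + [char_emb.setdefault(c, rand_vec(CHAR_DIM)) for c in word] + [pad]
    features = []
    for flt in filters:  # flt is a flat list of KERNEL * CHAR_DIM weights
        best = float("-inf")
        for start in range(len(rows) - KERNEL + 1):
            window = [x for row in rows[start:start + KERNEL] for x in row]
            best = max(best, sum(w * x for w, x in zip(flt, window)))
        features.append(best)  # max-over-time pooling
    return features

char_emb = {}
filters = [rand_vec(KERNEL * CHAR_DIM) for _ in range(NUM_FILTERS)]
vector = char_cnn("Covid", char_emb, filters)
```

Whatever the word length, the result has 30 values, which is what lets it sit next to a fixed-size word embedding.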
🔹 Context Encoder (BiLSTM)
- Hidden size: 200
- Layers: 2
- Bidirectional
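As a quick check on how the layer sizes fit together, the following sketch works out the per-token dimensions. It assumes the word embedding and the character CNN features are concatenated before the BiLSTM, and that forward and backward states are concatenated on output; both are the usual BiLSTM-CNN-CRF setup but are not spelled out in this README.

```python
WORD_DIM = 300        # FastText word embedding
CHAR_FEATURES = 30    # one feature per CNN filter
LSTM_HIDDEN = 200     # per direction
NUM_DIRECTIONS = 2    # bidirectional

# Assumption: word and character features are concatenated per token.
lstm_input_dim = WORD_DIM + CHAR_FEATURES
# Assumption: forward and backward hidden states are concatenated.
lstm_output_dim = LSTM_HIDDEN * NUM_DIRECTIONS

print(lstm_input_dim, lstm_output_dim)  # prints: 330 400
```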
🔹 Sequence Decoder (CRF)
- Ensures valid BIO tagging sequences
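The BIO rule itself is simple: an `I-X` tag may only follow `B-X` or `I-X` of the same entity type. The sketch below expresses that rule as a check. The helper names are hypothetical, and the README does not state whether the CRF applies this as a hard constraint or learns it through its transition scores.

```python
def allowed_transition(previous, current):
    """BIO rule: I-X may only follow B-X or I-X of the same entity type."""
    if not current.startswith("I-"):
        return True  # O and B-X may follow anything
    entity = current[2:]
    return previous in (f"B-{entity}", f"I-{entity}")

def is_valid_bio(tags):
    """Check a whole sequence, treating the sentence start like an O tag."""
    previous = "O"
    for tag in tags:
        if not allowed_transition(previous, tag):
            return False
        previous = tag
    return True

print(is_valid_bio(["B-LOCATION", "I-LOCATION", "O"]))  # prints: True
print(is_valid_bio(["O", "I-LOCATION"]))                # prints: False
```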
🏷️ Named Entities
| Entity | Description |
|---|---|
| AGE | Age |
| DATE | Date |
| GENDER | Gender |
| JOB | Occupation |
| LOCATION | Location |
| NAME | Person name |
| ORGANIZATION | Organization |
| PATIENT_ID | Patient ID |
| SYMPTOM_AND_DISEASE | Symptoms and diseases |
| TRANSPORTATION | Means of transportation |
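Under BIO tagging, each entity type contributes a `B-` and an `I-` label, plus a single `O` label for tokens outside any entity. The sketch below builds that label set from the table. The ordering and the `example_label2id` name are assumptions; the ids actually used by the model are whatever `label2id.json` contains.

```python
ENTITY_TYPES = [
    "AGE", "DATE", "GENDER", "JOB", "LOCATION", "NAME",
    "ORGANIZATION", "PATIENT_ID", "SYMPTOM_AND_DISEASE", "TRANSPORTATION",
]

# "O" first, then B-/I- pairs in table order (an illustrative ordering only).
labels = ["O"] + [f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")]
example_label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # prints: 21
```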
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Word Embedding | 300 |
| Char Embedding | 50 |
| LSTM Hidden | 200 |
| Dropout | 0.25 |
| Optimizer | Adam |
| Learning Rate | 0.001 |
| Batch Size | 36 |
| Epochs | 30 |
| Early Stopping | 19 |
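The table can be read as a plain configuration dictionary. The sketch below also shows one common form of early stopping, under the assumption that "Early Stopping: 19" is a patience value (epochs without validation improvement before training stops). The `EarlyStopping` class and the validation scores are made up for illustration.

```python
CONFIG = {
    "word_dim": 300, "char_dim": 50, "lstm_hidden": 200, "dropout": 0.25,
    "optimizer": "adam", "lr": 0.001, "batch_size": 36, "epochs": 30,
    "patience": 19,  # assumed meaning of the "Early Stopping" row
}

class EarlyStopping:
    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record a validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(CONFIG["patience"])
fake_scores = [0.5, 0.6] + [0.55] * 28  # made-up validation F1 values
stopped_at = None
for epoch, score in enumerate(fake_scores[:CONFIG["epochs"]], start=1):
    # A real loop would train for one epoch and evaluate here.
    if stopper.step(score):
        stopped_at = epoch
        break
```

With these made-up scores the best value is reached at epoch 2, so training stops 19 epochs later.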
📊 Evaluation Results
🏆 Overall Performance
- Micro Precision: 0.9302
- Micro Recall: 0.9122
- Micro F1-score: 0.9211
- Macro F1-score: 0.8918
- Weighted F1-score: 0.9194
📈 Detailed Results by Entity
| Entity | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| AGE | 0.9609 | 0.9663 | 0.9636 | 356 |
| DATE | 0.9670 | 0.9837 | 0.9753 | 1103 |
| GENDER | 0.9698 | 0.9380 | 0.9536 | 274 |
| JOB | 0.8395 | 0.5191 | 0.6415 | 131 |
| LOCATION | 0.9221 | 0.9245 | 0.9233 | 2727 |
| NAME | 0.9325 | 0.8172 | 0.8711 | 186 |
| ORGANIZATION | 0.8611 | 0.7985 | 0.8286 | 551 |
| PATIENT_ID | 0.9803 | 0.9819 | 0.9811 | 1269 |
| SYMPTOM_AND_DISEASE | 0.8368 | 0.7833 | 0.8092 | 766 |
| TRANSPORTATION | 0.9881 | 0.9540 | 0.9708 | 87 |
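The overall scores can be recomputed from the per-entity rows, which also shows how the averages differ: macro F1 weights every entity equally, weighted F1 weights by support, and micro scores pool the counts across entities. This is a sketch over the rounded table values, so the counts it recovers are approximations rather than the evaluation script's actual tallies.

```python
# (precision, recall, f1, support) copied from the table above
ROWS = {
    "AGE": (0.9609, 0.9663, 0.9636, 356),
    "DATE": (0.9670, 0.9837, 0.9753, 1103),
    "GENDER": (0.9698, 0.9380, 0.9536, 274),
    "JOB": (0.8395, 0.5191, 0.6415, 131),
    "LOCATION": (0.9221, 0.9245, 0.9233, 2727),
    "NAME": (0.9325, 0.8172, 0.8711, 186),
    "ORGANIZATION": (0.8611, 0.7985, 0.8286, 551),
    "PATIENT_ID": (0.9803, 0.9819, 0.9811, 1269),
    "SYMPTOM_AND_DISEASE": (0.8368, 0.7833, 0.8092, 766),
    "TRANSPORTATION": (0.9881, 0.9540, 0.9708, 87),
}

total_support = sum(row[3] for row in ROWS.values())
macro_f1 = sum(row[2] for row in ROWS.values()) / len(ROWS)
weighted_f1 = sum(row[2] * row[3] for row in ROWS.values()) / total_support

# Recover integer counts from the rounded ratios (approximate by nature).
true_pos = {name: round(row[1] * row[3]) for name, row in ROWS.items()}
predicted = {name: round(true_pos[name] / row[0]) for name, row in ROWS.items()}

micro_precision = sum(true_pos.values()) / sum(predicted.values())
micro_recall = sum(true_pos.values()) / total_support
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(round(macro_f1, 4), round(micro_recall, 4))
```

The low JOB score pulls macro F1 well below micro F1 because JOB counts as one tenth of the macro average but has only 131 of the 7450 gold entities.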
🔍 Observations
- Strong performance: DATE, AGE, PATIENT_ID
- Good performance: LOCATION, GENDER
- Lower performance:
  - JOB (data imbalance, sparse samples)
  - SYMPTOM_AND_DISEASE (complex multi-span entities)
🚀 How to Use
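```python
import json
import os

# Load the vocabularies and the label map.
with open("word2id.json", "r", encoding="utf-8") as f:
    word2id = json.load(f)
with open("char2id.json", "r", encoding="utf-8") as f:
    char2id = json.load(f)
with open("label2id.json", "r", encoding="utf-8") as f:
    label2id = json.load(f)

# Invert label -> id into id -> label for decoding predictions.
id2label = {int(idx): label for label, idx in label2id.items()}

# Save the inverted map next to the exported model.
os.makedirs("hf_model", exist_ok=True)
with open("hf_model/id2label.json", "w", encoding="utf-8") as f:
    json.dump(id2label, f, ensure_ascii=False, indent=2)
```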
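Once the mappings are loaded, inference needs two more steps around the model: turning tokens into ids, and turning predicted label ids back into entities. The sketch below covers both under stated assumptions: the `<UNK>` entry, the helper names, the toy `id2label`, and the predicted ids are all hypothetical, and the model's forward pass is not shown because this README does not document it.

```python
def encode(tokens, word2id, char2id, unk="<UNK>"):
    """Map tokens to word ids and per-token character ids."""
    word_unk = word2id.get(unk, 0)  # assumed unknown-token entry
    char_unk = char2id.get(unk, 0)
    word_ids = [word2id.get(tok, word_unk) for tok in tokens]
    char_ids = [[char2id.get(ch, char_unk) for ch in tok] for tok in tokens]
    return word_ids, char_ids

def decode_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) pairs."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new or tag == "O":
            if current_tokens:  # close the entity that was open
                spans.append((current_type, " ".join(current_tokens)))
            current_type = tag[2:] if starts_new else None
            current_tokens = [token] if starts_new else []
        else:  # I-X continuing the open entity of the same type
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Toy example: word-segmented tokens and made-up model predictions.
toy_id2label = {0: "O", 1: "B-PATIENT_ID", 2: "B-LOCATION", 3: "I-LOCATION"}
tokens = ["Bệnh_nhân", "91", "điều_trị", "tại", "Bệnh_viện", "Bạch_Mai"]
predicted_ids = [0, 1, 0, 0, 2, 3]
tags = [toy_id2label[i] for i in predicted_ids]
spans = decode_spans(tokens, tags)
# spans == [("PATIENT_ID", "91"), ("LOCATION", "Bệnh_viện Bạch_Mai")]
```

A stray `I-X` with no matching open entity is treated as the start of a new entity here, which is a lenient choice; a stricter decoder could discard it instead.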