🇻🇳 BiLSTM-CNN-CRF for Vietnamese COVID-19 NER

📌 Overview

This repository provides a Named Entity Recognition (NER) model for Vietnamese text in the COVID-19 medical domain.

The model uses a hybrid BiLSTM-CNN-CRF architecture: a character-level CNN captures morphological features, a BiLSTM captures sentence context, and a CRF layer decodes the tag sequence.

It is trained on the PhoNER_COVID19 dataset for extracting structured medical information from clinical and COVID-related texts.

🧠 Model Architecture

🔹 Word-level Representation

Pre-trained FastText embeddings (300-dim)
Frozen during training

🔹 Character-level Representation (CNN)

Char embedding: 50-dim
CNN filters: 30
Kernel size: 3
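As a rough illustration of how a character-level CNN turns words of any length into a fixed-size vector, here is a minimal pure-Python sketch of a 1-D convolution followed by max-over-time pooling. The dimensions (50, 30, 3) come from the list above; the random weights, the zero-padding scheme, and the helper names `char_cnn` and `embed` are assumptions made for illustration, not the repository's implementation.

```python
import random

def char_cnn(char_vectors, filters, kernel_size=3):
    """1-D convolution over a word's characters, then max-over-time pooling.

    char_vectors: one embedding vector per character of the word
    filters: each filter holds kernel_size rows of embedding-dim weights
    Returns one value per filter, so the output size does not depend on word length.
    """
    dim = len(char_vectors[0])
    # ASSUMPTION: zero-pad both ends so short words still produce a window.
    pad = [[0.0] * dim for _ in range(kernel_size - 1)]
    padded = pad + char_vectors + pad
    pooled = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(padded) - kernel_size + 1):
            window = padded[start:start + kernel_size]
            score = sum(
                w * x
                for row_w, row_x in zip(filt, window)
                for w, x in zip(row_w, row_x)
            )
            best = max(best, score)  # keep the strongest response over all positions
        pooled.append(best)
    return pooled

random.seed(0)
CHAR_DIM, NUM_FILTERS, KERNEL = 50, 30, 3  # values from the list above
char_table = {}  # toy stand-in for a learned character embedding table

def embed(ch):
    """Look up (or lazily create) a random embedding for one character."""
    if ch not in char_table:
        char_table[ch] = [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
    return char_table[ch]

filters = [
    [[random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for _ in range(KERNEL)]
    for _ in range(NUM_FILTERS)
]

short_vec = char_cnn([embed(c) for c in "ho"], filters, KERNEL)
long_vec = char_cnn([embed(c) for c in "khó_thở"], filters, KERNEL)
print(len(short_vec), len(long_vec))
```

Both words map to a vector with one entry per filter, which is what lets the character features be combined with a fixed-size word embedding.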

🔹 Context Encoder (BiLSTM)

Hidden size: 200
Layers: 2
Bidirectional
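To make the feature sizes concrete, the sketch below traces tensor shapes through the stack. It assumes that word and character features are concatenated per token, that the hidden size of 200 is per direction, and that the tag set is "O" plus B-/I- tags for the 10 entity types; none of these details is stated explicitly in this card, and `trace_shapes` is a hypothetical helper.

```python
WORD_DIM = 300      # FastText word embedding size (from the card)
CHAR_FILTERS = 30   # char-CNN filters, one pooled value each (from the card)
LSTM_HIDDEN = 200   # from the card; ASSUMED here to be per direction
NUM_LABELS = 21     # ASSUMPTION: "O" plus B-/I- tags for the 10 entity types

def trace_shapes(batch_size, seq_len):
    """Return the tensor shape after each stage for a padded batch."""
    token_dim = WORD_DIM + CHAR_FILTERS   # ASSUMPTION: word and char features are concatenated
    bilstm_dim = 2 * LSTM_HIDDEN          # forward and backward states are concatenated
    return {
        "token_features": (batch_size, seq_len, token_dim),
        "bilstm_output": (batch_size, seq_len, bilstm_dim),
        "emission_scores": (batch_size, seq_len, NUM_LABELS),
    }

shapes = trace_shapes(batch_size=36, seq_len=50)
for stage, shape in shapes.items():
    print(f"{stage}: {shape}")
```

If the hidden size of 200 is instead the total across both directions, the BiLSTM output width would be 200 rather than 400.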

🔹 Sequence Decoder (CRF)

Decodes the whole tag sequence jointly, scoring label-to-label transitions so that the output follows valid BIO patterns
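The sketch below shows one way sequence-level decoding can rule out invalid BIO sequences: Viterbi decoding with hard constraints on transitions. Whether this model hard-codes such constraints or relies only on learned transition scores is not stated here, so treat the masking rule, the toy scores, and the function names as assumptions.

```python
def is_valid_transition(prev, curr):
    """In BIO tagging, I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True

def viterbi(emissions, transitions, labels):
    """Find the best-scoring valid label sequence.

    emissions[t][j]: score of label j at position t
    transitions[i][j]: score of moving from label i to label j
    """
    n = len(labels)
    neg_inf = float("-inf")
    # An I-X tag cannot open a sentence.
    score = [
        emissions[0][j] if not labels[j].startswith("I-") else neg_inf
        for j in range(n)
    ]
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n):
            best_i, best = 0, neg_inf
            for i in range(n):
                if not is_valid_transition(labels[i], labels[j]):
                    continue  # hard constraint: skip invalid BIO transitions
                cand = score[i] + transitions[i][j]
                if cand > best:
                    best_i, best = i, cand
            new_score.append(best + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Backtrack from the best final label.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for ptr in reversed(backptrs):
        j = ptr[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]

labels = ["O", "B-AGE", "I-AGE"]
# Toy scores: a per-token argmax would give O, I-AGE, O, which is invalid.
emissions = [[2.0, 1.5, 0.0], [0.0, 1.0, 3.0], [3.0, 0.0, 0.5]]
transitions = [[0.0] * 3 for _ in range(3)]
decoded = viterbi(emissions, transitions, labels)
print(decoded)
```

With these toy scores the decoder picks B-AGE for the first token, because that is the only way to keep the high-scoring I-AGE at the second position.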

🏷️ Named Entities

Entity	Description
AGE	Age
DATE	Date
GENDER	Gender
JOB	Occupation
LOCATION	Location
NAME	Person name
ORGANIZATION	Organization
PATIENT_ID	Patient ID
SYMPTOM_AND_DISEASE	Symptom and disease
TRANSPORTATION	Means of transportation
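Under a BIO scheme, each entity type in the table contributes a B- (begin) and an I- (inside) tag, plus a single O tag for non-entity tokens. The sketch below builds such a label set; the ordering of ids is an assumption, and the actual `label2id.json` shipped with the model may differ.

```python
ENTITY_TYPES = [
    "AGE", "DATE", "GENDER", "JOB", "LOCATION", "NAME",
    "ORGANIZATION", "PATIENT_ID", "SYMPTOM_AND_DISEASE", "TRANSPORTATION",
]

def build_label_maps(entity_types):
    """Build BIO label <-> id maps (id order here is illustrative only)."""
    labels = ["O"]
    for ent in entity_types:
        labels += [f"B-{ent}", f"I-{ent}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

label2id_demo, id2label_demo = build_label_maps(ENTITY_TYPES)
print(len(label2id_demo))  # 10 entity types * 2 tags + "O"
```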

⚙️ Training Configuration

Parameter	Value
Word Embedding	300
Char Embedding	50
LSTM Hidden	200
Dropout	0.25
Optimizer	Adam
Learning Rate	0.001
Batch Size	36
Epochs	30
Early Stopping	19
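The table lists Early Stopping as 19 without saying what the number measures. Assuming it is a patience value (the number of consecutive epochs without improvement on a validation metric before training stops), the logic could look like the sketch below. The `EarlyStopping` class and the toy scores are hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run with patience=2 (the card's value would be 19 under this assumption).
stopper = EarlyStopping(patience=2)
val_f1_per_epoch = [0.80, 0.85, 0.84, 0.83, 0.90]  # made-up scores
stopped_at = None
for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
    if stopper.step(val_f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```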

📊 Evaluation Results

🏆 Overall Performance

Micro Precision: 0.9302
Micro Recall: 0.9122
Micro F1-score: 0.9211
Macro F1-score: 0.8918
Weighted F1-score: 0.9194

📈 Detailed Results by Entity

Entity	Precision	Recall	F1-score	Support
AGE	0.9609	0.9663	0.9636	356
DATE	0.9670	0.9837	0.9753	1103
GENDER	0.9698	0.9380	0.9536	274
JOB	0.8395	0.5191	0.6415	131
LOCATION	0.9221	0.9245	0.9233	2727
NAME	0.9325	0.8172	0.8711	186
ORGANIZATION	0.8611	0.7985	0.8286	551
PATIENT_ID	0.9803	0.9819	0.9811	1269
SYMPTOM_AND_DISEASE	0.8368	0.7833	0.8092	766
TRANSPORTATION	0.9881	0.9540	0.9708	87
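The aggregate scores can be recomputed from the per-entity rows above: macro F1 is the unweighted mean of the per-entity F1 scores, weighted F1 weights each entity by its support, and micro F1 is the harmonic mean of micro precision and micro recall. The sketch below copies the numbers from this section; the variable names are illustrative.

```python
# entity: (precision, recall, f1, support), copied from the table above
RESULTS = {
    "AGE": (0.9609, 0.9663, 0.9636, 356),
    "DATE": (0.9670, 0.9837, 0.9753, 1103),
    "GENDER": (0.9698, 0.9380, 0.9536, 274),
    "JOB": (0.8395, 0.5191, 0.6415, 131),
    "LOCATION": (0.9221, 0.9245, 0.9233, 2727),
    "NAME": (0.9325, 0.8172, 0.8711, 186),
    "ORGANIZATION": (0.8611, 0.7985, 0.8286, 551),
    "PATIENT_ID": (0.9803, 0.9819, 0.9811, 1269),
    "SYMPTOM_AND_DISEASE": (0.8368, 0.7833, 0.8092, 766),
    "TRANSPORTATION": (0.9881, 0.9540, 0.9708, 87),
}

total_support = sum(support for _, _, _, support in RESULTS.values())

# Macro: every entity type counts equally, so a weak class like JOB pulls it down.
macro_f1 = sum(f1 for _, _, f1, _ in RESULTS.values()) / len(RESULTS)

# Weighted: each entity type counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, _, f1, support in RESULTS.values()) / total_support

# Micro: harmonic mean of the reported micro precision and micro recall.
micro_p, micro_r = 0.9302, 0.9122
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f} micro={micro_f1:.4f}")
```

The recomputed values agree with the reported 0.8918, 0.9194, and 0.9211 to within rounding. The gap between macro and micro F1 comes mostly from JOB, which has a low F1 but little support.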

🔍 Observations

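Strong performance (F1 above 0.96): PATIENT_ID, DATE, TRANSPORTATION, AGE
Good performance (F1 between 0.92 and 0.96): GENDER, LOCATION
Moderate performance (F1 between 0.82 and 0.88): NAME, ORGANIZATION
Lower performance:
- JOB (F1 0.6415, with recall of only 0.5191; attributed to data imbalance and sparse samples)
- SYMPTOM_AND_DISEASE (F1 0.8092; complex multi-span entities)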

🚀 How to Use

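import json
import os

# Load the vocabularies and the label map saved with the model.
with open("word2id.json", "r", encoding="utf-8") as f:
    word2id = json.load(f)

with open("char2id.json", "r", encoding="utf-8") as f:
    char2id = json.load(f)

with open("label2id.json", "r", encoding="utf-8") as f:
    label2id = json.load(f)

# Invert label2id (label -> id) to get id2label (id -> label).
id2label = {int(idx): label for label, idx in label2id.items()}

# Save id2label; JSON stores the integer keys as strings.
os.makedirs("hf_model", exist_ok=True)
with open("hf_model/id2label.json", "w", encoding="utf-8") as f:
    json.dump(id2label, f, ensure_ascii=False, indent=2)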
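Once the vocabularies are loaded, inference needs two more steps around the model itself: turning tokens into ids, and turning predicted BIO tags back into entity spans. The sketch below illustrates both with toy vocabularies. The `<PAD>` and `<UNK>` keys, the helper names, and the hand-written tags are assumptions; check the actual JSON files for the special tokens this model uses.

```python
def encode(tokens, word2id, char2id, unk="<UNK>"):
    """Map tokens to word ids and per-token character ids, falling back to unk."""
    word_ids = [word2id.get(tok, word2id[unk]) for tok in tokens]
    char_ids = [[char2id.get(ch, char2id[unk]) for ch in tok] for tok in tokens]
    return word_ids, char_ids

def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans; stray I- tags are ignored."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        if ent is not None and tag != f"I-{ent}":
            spans.append((ent, " ".join(tokens[start:i])))
            start, ent = None, None
        if tag.startswith("B-"):
            start, ent = i, tag[2:]
    return spans

# Toy vocabularies standing in for word2id.json and char2id.json.
toy_word2id = {"<PAD>": 0, "<UNK>": 1, "Hà": 2, "Nội": 3}
toy_char2id = {"<PAD>": 0, "<UNK>": 1, "H": 2, "à": 3, "N": 4, "ộ": 5, "i": 6}

word_ids, char_ids = encode(["Hà", "Nội", "xyz"], toy_word2id, toy_char2id)
print(word_ids, char_ids)

# Hand-written tags for illustration, not actual model output.
tokens = ["Bệnh", "nhân", "91", "trú", "tại", "Hà", "Nội"]
tags = ["O", "O", "B-PATIENT_ID", "O", "O", "B-LOCATION", "I-LOCATION"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The tokens here are split by syllable for simplicity; if the model was trained on word-segmented text, the input should be segmented the same way before encoding.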
