# BiLSTM-CNN-CRF for Vietnamese COVID-19 NER
## Model Description
This model performs Named Entity Recognition (NER) on Vietnamese text specifically related to the COVID-19 pandemic. It utilizes a BiLSTM-CNN-CRF architecture, which combines:
- BiLSTM (Bidirectional Long Short-Term Memory) to capture forward and backward word-level context.
- CNN (Convolutional Neural Network) to extract character-level morphological features (number of filters: 30, filter length: 3).
- CRF (Conditional Random Field) to decode the most probable sequence of entity tags.
The model was trained on the PhoNER_COVID19 dataset at the syllable level and integrates frozen, pre-trained 100-dimensional Word2Vec embeddings for Vietnamese.
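As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of the encoder: a character-level CNN (30 filters of width 3) whose max-pooled features are concatenated with word embeddings and fed to a word-level BiLSTM that produces per-tag emission scores. The CRF decoding layer is omitted, and all class and parameter names are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: 30 filters of width 3, as stated on this card."""
    def __init__(self, n_chars, char_dim=50, n_filters=30, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, chars):                    # chars: (n_words, max_word_len)
        x = self.emb(chars).transpose(1, 2)      # (n_words, char_dim, max_word_len)
        x = torch.relu(self.conv(x))             # (n_words, n_filters, max_word_len)
        return x.max(dim=2).values               # max-pool over character positions

class BiLSTMEncoder(nn.Module):
    """Word BiLSTM over [word embedding ; char-CNN features] -> emission scores.

    In the full model these emissions would be decoded by a CRF layer.
    """
    def __init__(self, vocab_size, n_tags, word_dim=100, hidden=200, n_chars=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_cnn = CharCNN(n_chars)
        self.dropout = nn.Dropout(0.25)
        # hidden // 2 per direction gives a total hidden dimension of 200
        self.lstm = nn.LSTM(word_dim + 30, hidden // 2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, words, chars):
        w = self.word_emb(words)                          # (n_words, word_dim)
        c = self.char_cnn(chars)                          # (n_words, 30)
        x = self.dropout(torch.cat([w, c], dim=1)).unsqueeze(0)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(0)                     # (n_words, n_tags)
```

In the actual model the word embedding table would be initialized from the frozen 100-dimensional Word2Vec vectors rather than learned from scratch.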
## Entities Recognized
The model is trained to extract the following 10 entity types:
- AGE: Age (Tuổi)
- DATE: Date (Ngày tháng)
- GENDER: Gender (Giới tính)
- JOB: Occupation (Nghề nghiệp)
- LOCATION: Location (Địa điểm)
- NAME: Person name (Tên người)
- ORGANIZATION: Organization (Tổ chức)
- PATIENT_ID: Patient ID (Mã bệnh nhân)
- SYMPTOM_AND_DISEASE: Symptom and disease (Triệu chứng và Bệnh lý)
- TRANSPORTATION: Means of transport (Phương tiện giao thông)
## Training Parameters
- Word Embedding Dimension: 100
- Char Embedding Dimension: 50
- LSTM Hidden Dimension: 200
- Dropout Rate: 0.25
- Optimizer: Adam (Learning Rate: 0.001)
- Epochs: 30 (with Early Stopping triggered at epoch 23)
- Batch Size: 36
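The stop at epoch 23 of 30 follows the usual patience-based early-stopping pattern: stop once the validation score has not improved for a fixed number of epochs. A minimal sketch of that loop (the patience value of 5 and the function names are assumptions, not taken from the actual training script):

```python
def train_with_early_stopping(step, max_epochs=30, patience=5):
    """Run `step(epoch) -> validation F1` each epoch; stop when the score
    has not improved for `patience` consecutive epochs."""
    best_f1, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        f1 = step(epoch)
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch      # new best: reset the clock
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_f1, best_epoch
```

In practice `step` would train one epoch with Adam (lr 0.001, batch size 36) and return the micro F1 on the validation split, checkpointing the model whenever the best score improves.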
## Evaluation Results (Test Set)
Based on the seqeval classification report, the model achieves the following overall performance:
### 🏆 Key Metrics (Micro Average)
- Micro F1-Score: 0.9052
- Micro Precision: 0.9174
- Micro Recall: 0.8933
### Detailed Performance by Entity
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| AGE | 0.9283 | 0.9570 | 0.9425 | 582 |
| DATE | 0.9742 | 0.9831 | 0.9786 | 1654 |
| GENDER | 0.9258 | 0.9459 | 0.9358 | 462 |
| JOB | 0.5410 | 0.3815 | 0.4475 | 173 |
| LOCATION | 0.9099 | 0.9023 | 0.9060 | 4441 |
| NAME | 0.9209 | 0.8050 | 0.8591 | 318 |
| ORGANIZATION | 0.7899 | 0.7704 | 0.7800 | 771 |
| PATIENT_ID | 0.9722 | 0.9771 | 0.9746 | 2005 |
| SYMPTOM_AND_DISEASE | 0.8641 | 0.7165 | 0.7834 | 1136 |
| TRANSPORTATION | 0.9653 | 0.8653 | 0.9126 | 193 |
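The micro averages above are consistent with the per-entity table: pooling approximate true-positive and predicted counts recovered from each row's precision, recall, and support reproduces them. A quick pure-Python check (the `rows` values are copied from the table; the rounding-based recovery is an approximation):

```python
# Per-entity (precision, recall, support) from the table above.
rows = {
    "AGE":                 (0.9283, 0.9570, 582),
    "DATE":                (0.9742, 0.9831, 1654),
    "GENDER":              (0.9258, 0.9459, 462),
    "JOB":                 (0.5410, 0.3815, 173),
    "LOCATION":            (0.9099, 0.9023, 4441),
    "NAME":                (0.9209, 0.8050, 318),
    "ORGANIZATION":        (0.7899, 0.7704, 771),
    "PATIENT_ID":          (0.9722, 0.9771, 2005),
    "SYMPTOM_AND_DISEASE": (0.8641, 0.7165, 1136),
    "TRANSPORTATION":      (0.9653, 0.8653, 193),
}

# Recover approximate true-positive and predicted counts per entity,
# then pool them across entities for the micro average.
tp = sum(round(r * s) for p, r, s in rows.values())
pred = sum(round(round(r * s) / p) for p, r, s in rows.values())
gold = sum(s for _, _, s in rows.values())

micro_p = tp / pred                                   # ≈ 0.9174
micro_r = tp / gold                                   # ≈ 0.8933
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)  # ≈ 0.9052
```

Note the weak spots visible in the table: JOB (F1 0.4475, only 173 test mentions) and SYMPTOM_AND_DISEASE recall (0.7165) drag the micro average down despite near-perfect DATE and PATIENT_ID scores.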
## How to Use
To use this model for inference, you need to load the saved state dictionary (`pytorch_model.bin`) and the vocabulary mappings (`mappings.pkl`), and reconstruct the `BiLSTM_CRF` PyTorch class exactly as defined in the training script.
```python
import pickle

import numpy as np
import torch

# 1. Load Mappings
with open("mappings.pkl", "rb") as f:
    mapping_data = pickle.load(f)
word_to_id = mapping_data['word_to_id']
char_to_id = mapping_data['char_to_id']
id_to_tag = mapping_data['id_to_tag']

# 2. Initialize Model
# Ensure your BiLSTM_CRF class is defined in your script
# model = BiLSTM_CRF(vocab_size=len(word_to_id), tag_to_ix=..., ...)
# model.load_state_dict(torch.load("pytorch_model.bin", map_location=torch.device('cpu')))
# model.eval()

# 3. Preprocessing & Inference Example
sentence = "Bệnh nhân nhập viện tối qua ở Bệnh Viện 115 là bệnh nhân thứ 82"
str_words = sentence.split()
words = [word_to_id.get(w, word_to_id['<UNK>']) for w in str_words]
chars = [[char_to_id.get(c, 0) for c in w] for w in str_words]  # adjust OOV handling as needed

# Pad every word's character sequence to the longest word in the sentence
chars2_length = [len(c) for c in chars]
char_maxl = max(chars2_length)
chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
for i, c in enumerate(chars):
    chars2_mask[i, :chars2_length[i]] = c

# torch.autograd.Variable is deprecated; plain tensors work directly
dwords = torch.LongTensor(words)
chars2_mask = torch.LongTensor(chars2_mask)

# Get predictions
# _, predicted_id = model(dwords, chars2_mask, chars2_length, {})
# prediction_label = [id_to_tag[val] for val in predicted_id]
# print(prediction_label)
```
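The `prediction_label` produced above is a flat list of BIO tags (e.g. `B-DATE`, `I-DATE`, `O`); to report actual entities you still need to group consecutive tags into spans. A minimal sketch of that post-processing step (the helper name and the exact tags in the example are illustrative assumptions):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags (e.g. B-DATE, I-DATE, O) into (label, text) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the final span
        boundary = tag.startswith("B-") or tag == "O" or \
                   (label is not None and tag[2:] != label)
        if boundary:
            if label is not None:
                spans.append((label, " ".join(tokens[start:i])))
            if tag.startswith("B-"):
                start, label = i, tag[2:]   # a B- tag opens a new span
            else:
                start, label = None, None
        elif tag.startswith("I-") and label is None:
            start, label = i, tag[2:]       # tolerate a stray I- as a new span
    return spans

tokens = "Bệnh nhân nhập viện tối qua ở Bệnh Viện 115 là bệnh nhân thứ 82".split()
tags = ["O", "O", "O", "O", "B-DATE", "I-DATE", "O",
        "B-ORGANIZATION", "I-ORGANIZATION", "I-ORGANIZATION",
        "O", "O", "O", "O", "B-PATIENT_ID"]
print(bio_to_spans(tokens, tags))
# [('DATE', 'tối qua'), ('ORGANIZATION', 'Bệnh Viện 115'), ('PATIENT_ID', '82')]
```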