# BiLSTM-CNN-CRF for Vietnamese COVID-19 NER
## Model Description
This model performs Named Entity Recognition (NER) on Vietnamese text specifically related to the COVID-19 pandemic. It utilizes a BiLSTM-CNN-CRF architecture, which combines:
- BiLSTM (Bidirectional Long Short-Term Memory) to capture forward and backward word-level context.
- CNN (Convolutional Neural Network) to extract character-level morphological features (number of filters: 30, filter length: 3).
- CRF (Conditional Random Field) to decode the most probable sequence of entity tags.
The model was trained on the PhoNER_COVID19 dataset at the syllable level and integrates frozen, pre-trained 100-dimensional Word2Vec embeddings for Vietnamese.
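As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of the encoder: a character-level CNN (30 filters of width 3) whose max-pooled features are concatenated with word embeddings and fed to a word-level BiLSTM that produces per-tag emission scores. The CRF decoding layer is omitted, and all class and parameter names are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: 30 filters of width 3, as stated on this card."""
    def __init__(self, n_chars, char_dim=50, n_filters=30, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, chars):                    # chars: (n_words, max_word_len)
        x = self.emb(chars).transpose(1, 2)      # (n_words, char_dim, max_word_len)
        x = torch.relu(self.conv(x))             # (n_words, n_filters, max_word_len)
        return x.max(dim=2).values               # max-pool over character positions

class BiLSTMEncoder(nn.Module):
    """Word BiLSTM over [word embedding ; char-CNN features] -> emission scores.

    In the full model these emissions would be decoded by a CRF layer.
    """
    def __init__(self, vocab_size, n_tags, word_dim=100, hidden=200, n_chars=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_cnn = CharCNN(n_chars)
        self.dropout = nn.Dropout(0.25)
        # hidden // 2 per direction gives a total hidden dimension of 200
        self.lstm = nn.LSTM(word_dim + 30, hidden // 2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, words, chars):
        w = self.word_emb(words)                          # (n_words, word_dim)
        c = self.char_cnn(chars)                          # (n_words, 30)
        x = self.dropout(torch.cat([w, c], dim=1)).unsqueeze(0)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(0)                     # (n_words, n_tags)
```

In the actual model the word embedding table would be initialized from the frozen 100-dimensional Word2Vec vectors rather than learned from scratch.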
## Entities Recognized
The model is trained to extract the following 10 entity types:
- AGE: Age (Tuổi)
- DATE: Date (Ngày tháng)
- GENDER: Gender (Giới tính)
- JOB: Occupation (Nghề nghiệp)
- LOCATION: Location (Địa điểm)
- NAME: Person name (Tên người)
- ORGANIZATION: Organization (Tổ chức)
- PATIENT_ID: Patient ID (Mã bệnh nhân)
- SYMPTOM_AND_DISEASE: Symptom and disease (Triệu chứng và Bệnh lý)
- TRANSPORTATION: Means of transport (Phương tiện giao thông)
## Training Parameters
- Word Embedding Dimension: 100
- Char Embedding Dimension: 50
- LSTM Hidden Dimension: 200
- Dropout Rate: 0.25
- Optimizer: Adam (Learning Rate: 0.001)
- Epochs: 30 (with Early Stopping triggered at epoch 23)
- Batch Size: 36
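The stop at epoch 23 of 30 follows the usual patience-based early-stopping pattern: stop once the validation score has not improved for a fixed number of epochs. A minimal sketch of that loop (the patience value of 5 and the function names are assumptions, not taken from the actual training script):

```python
def train_with_early_stopping(step, max_epochs=30, patience=5):
    """Run `step(epoch) -> validation F1` each epoch; stop when the score
    has not improved for `patience` consecutive epochs."""
    best_f1, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        f1 = step(epoch)
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch      # new best: reset the clock
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_f1, best_epoch
```

In practice `step` would train one epoch with Adam (lr 0.001, batch size 36) and return the micro F1 on the validation split, checkpointing the model whenever the best score improves.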
## Evaluation Results (Test Set)
Based on the seqeval classification report, the model achieves the following overall performance:
### 🏆 Key Metrics (Micro Average)
- Micro F1-Score: 0.9052
- Micro Precision: 0.9174
- Micro Recall: 0.8933
### Detailed Performance by Entity
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| AGE | 0.9283 | 0.9570 | 0.9425 | 582 |
| DATE | 0.9742 | 0.9831 | 0.9786 | 1654 |
| GENDER | 0.9258 | 0.9459 | 0.9358 | 462 |
| JOB | 0.5410 | 0.3815 | 0.4475 | 173 |
| LOCATION | 0.9099 | 0.9023 | 0.9060 | 4441 |
| NAME | 0.9209 | 0.8050 | 0.8591 | 318 |
| ORGANIZATION | 0.7899 | 0.7704 | 0.7800 | 771 |
| PATIENT_ID | 0.9722 | 0.9771 | 0.9746 | 2005 |
| SYMPTOM_AND_DISEASE | 0.8641 | 0.7165 | 0.7834 | 1136 |
| TRANSPORTATION | 0.9653 | 0.8653 | 0.9126 | 193 |
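The micro averages above are consistent with the per-entity table: pooling approximate true-positive and predicted counts recovered from each row's precision, recall, and support reproduces them. A quick pure-Python check (the `rows` values are copied from the table; the rounding-based recovery is an approximation):

```python
# Per-entity (precision, recall, support) from the table above.
rows = {
    "AGE":                 (0.9283, 0.9570, 582),
    "DATE":                (0.9742, 0.9831, 1654),
    "GENDER":              (0.9258, 0.9459, 462),
    "JOB":                 (0.5410, 0.3815, 173),
    "LOCATION":            (0.9099, 0.9023, 4441),
    "NAME":                (0.9209, 0.8050, 318),
    "ORGANIZATION":        (0.7899, 0.7704, 771),
    "PATIENT_ID":          (0.9722, 0.9771, 2005),
    "SYMPTOM_AND_DISEASE": (0.8641, 0.7165, 1136),
    "TRANSPORTATION":      (0.9653, 0.8653, 193),
}

# Recover approximate true-positive and predicted counts per entity,
# then pool them across entities for the micro average.
tp = sum(round(r * s) for p, r, s in rows.values())
pred = sum(round(round(r * s) / p) for p, r, s in rows.values())
gold = sum(s for _, _, s in rows.values())

micro_p = tp / pred                                   # ≈ 0.9174
micro_r = tp / gold                                   # ≈ 0.8933
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)  # ≈ 0.9052
```

Note the weak spots visible in the table: JOB (F1 0.4475, only 173 test mentions) and SYMPTOM_AND_DISEASE recall (0.7165) drag the micro average down despite near-perfect DATE and PATIENT_ID scores.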
## How to Use
To use this model for inference, you need to load the saved state dictionary (`pytorch_model.bin`) and the vocabulary mappings (`mappings.pkl`), and reconstruct the `BiLSTM_CRF` PyTorch class exactly as defined in the training script.
```python
import pickle

import numpy as np
import torch

# 1. Load Mappings
with open("mappings.pkl", "rb") as f:
    mapping_data = pickle.load(f)
word_to_id = mapping_data['word_to_id']
char_to_id = mapping_data['char_to_id']
id_to_tag = mapping_data['id_to_tag']

# 2. Initialize Model
# Ensure your BiLSTM_CRF class is defined in your script
# model = BiLSTM_CRF(vocab_size=len(word_to_id), tag_to_ix=..., ...)
# model.load_state_dict(torch.load("pytorch_model.bin", map_location=torch.device('cpu')))
# model.eval()

# 3. Preprocessing & Inference Example
sentence = "Bệnh nhân nhập viện tối qua ở Bệnh Viện 115 là bệnh nhân thứ 82"
str_words = sentence.split()
words = [word_to_id.get(w, word_to_id['<UNK>']) for w in str_words]
chars = [[char_to_id.get(c, 0) for c in w] for w in str_words]  # adjust OOV handling as needed

# Pad every word's character sequence to the longest word in the sentence
chars2_length = [len(c) for c in chars]
char_maxl = max(chars2_length)
chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
for i, c in enumerate(chars):
    chars2_mask[i, :chars2_length[i]] = c

# torch.autograd.Variable is deprecated; plain tensors work directly
dwords = torch.LongTensor(words)
chars2_mask = torch.LongTensor(chars2_mask)

# Get predictions
# _, predicted_id = model(dwords, chars2_mask, chars2_length, {})
# prediction_label = [id_to_tag[val] for val in predicted_id]
# print(prediction_label)
```
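The `prediction_label` produced above is a flat list of BIO tags (e.g. `B-DATE`, `I-DATE`, `O`); to report actual entities you still need to group consecutive tags into spans. A minimal sketch of that post-processing step (the helper name and the exact tags in the example are illustrative assumptions):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags (e.g. B-DATE, I-DATE, O) into (label, text) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the final span
        boundary = tag.startswith("B-") or tag == "O" or \
                   (label is not None and tag[2:] != label)
        if boundary:
            if label is not None:
                spans.append((label, " ".join(tokens[start:i])))
            if tag.startswith("B-"):
                start, label = i, tag[2:]   # a B- tag opens a new span
            else:
                start, label = None, None
        elif tag.startswith("I-") and label is None:
            start, label = i, tag[2:]       # tolerate a stray I- as a new span
    return spans

tokens = "Bệnh nhân nhập viện tối qua ở Bệnh Viện 115 là bệnh nhân thứ 82".split()
tags = ["O", "O", "O", "O", "B-DATE", "I-DATE", "O",
        "B-ORGANIZATION", "I-ORGANIZATION", "I-ORGANIZATION",
        "O", "O", "O", "O", "B-PATIENT_ID"]
print(bio_to_spans(tokens, tags))
# [('DATE', 'tối qua'), ('ORGANIZATION', 'Bệnh Viện 115'), ('PATIENT_ID', '82')]
```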