BiLSTM-CNN-CRF for Vietnamese COVID-19 NER

Model Description

This model performs Named Entity Recognition (NER) on Vietnamese text specifically related to the COVID-19 pandemic. It utilizes a BiLSTM-CNN-CRF architecture, which combines:

  • BiLSTM (Bidirectional Long Short-Term Memory) to capture forward and backward word-level context.
  • CNN (Convolutional Neural Network) to extract character-level morphological features (30 filters, kernel width 3).
  • CRF (Conditional Random Field) to decode the most probable sequence of entity tags.

The model was trained on the PhoNER_COVID19 dataset at the syllable level and integrates frozen, pre-trained 100-dimensional Word2Vec embeddings for Vietnamese.
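The architecture described above can be sketched in PyTorch. This is a minimal illustration, not the exact class from the training script: the class name, argument names, and the unbatched single-sentence forward pass are assumptions, and the CRF layer is omitted — the module only produces per-token tag emissions, which a CRF (for example the `torchcrf` package) would then decode into the best tag sequence.

```python
import torch
import torch.nn as nn

class BiLSTMCNNTagger(nn.Module):
    """Word + char-CNN features -> BiLSTM -> per-token tag emissions.

    Dimensions follow the model card: 100-d word embeddings, 50-d char
    embeddings, 30 CNN filters of width 3, LSTM hidden size 200,
    dropout 0.25. A CRF head (not shown) consumes the emissions.
    """

    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=100, char_dim=50, n_filters=30,
                 kernel=3, hidden=200, dropout=0.25):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # padding keeps the char sequence length so max-pooling sees every position
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel, padding=kernel // 2)
        self.lstm = nn.LSTM(word_dim + n_filters, hidden,
                            bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars):
        # words: (seq_len,) word ids; chars: (seq_len, max_word_len) char ids
        w = self.word_emb(words)                             # (seq, word_dim)
        c = self.char_emb(chars).transpose(1, 2)             # (seq, char_dim, max_len)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values   # (seq, n_filters)
        x = torch.cat([w, c], dim=1).unsqueeze(0)            # (1, seq, word_dim + n_filters)
        out, _ = self.lstm(self.dropout(x))
        return self.fc(out).squeeze(0)                       # (seq, n_tags) emissions
```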

Entities Recognized

The model is trained to extract the following 10 entity types:

  • AGE: Age (Tuổi)
  • DATE: Date (Ngày tháng)
  • GENDER: Gender (Giới tính)
  • JOB: Occupation (Nghề nghiệp)
  • LOCATION: Location (Địa điểm)
  • NAME: Person name (Tên người)
  • ORGANIZATION: Organization (Tổ chức)
  • PATIENT_ID: Patient ID (Mã bệnh nhân)
  • SYMPTOM_AND_DISEASE: Symptom and disease (Triệu chứng và Bệnh lý)
  • TRANSPORTATION: Means of transport (Phương tiện giao thông)

Training Parameters

  • Word Embedding Dimension: 100
  • Char Embedding Dimension: 50
  • LSTM Hidden Dimension: 200
  • Dropout Rate: 0.25
  • Optimizer: Adam (Learning Rate: 0.001)
  • Epochs: 30 (with Early Stopping triggered at epoch 23)
  • Batch Size: 36
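Early stopping here means training halted at epoch 23 of the 30-epoch budget because the dev-set score stopped improving. A patience-based loop of the kind implied can be sketched as follows; `train_step`, `eval_f1`, and the patience value of 5 are illustrative assumptions, not values taken from the training script.

```python
def train_with_early_stopping(train_step, eval_f1, max_epochs=30, patience=5):
    """Run up to max_epochs, stopping once the dev F1 has not improved
    for `patience` consecutive epochs. Returns the best epoch and score."""
    best_f1, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)          # one pass over the training data
        f1 = eval_f1(epoch)        # entity-level F1 on the dev set
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            break                  # no dev improvement for `patience` epochs
    return best_epoch, best_f1
```

In practice the model weights are also checkpointed whenever `best_f1` improves, so the saved `pytorch_model.bin` corresponds to the best dev epoch rather than the last one.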

Evaluation Results (Test Set)

Based on the seqeval classification report, the model achieves the following overall performance:

🏆 Key Metrics (Micro Average)

  • Micro F1-Score: 0.9052
  • Micro Precision: 0.9174
  • Micro Recall: 0.8933
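Micro-averaged scores pool true positives, false positives, and false negatives across all entity types before computing precision and recall, so frequent classes like LOCATION weigh more than rare ones like JOB. A minimal sketch of entity-level micro scoring under exact span match (the convention seqeval uses); the function name and the `(type, start, end)` span representation are illustrative.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision/recall/F1 over entity spans.

    gold, pred: lists (one entry per sentence) of sets of
    (entity_type, start, end) tuples. A prediction counts as correct
    only if type and boundaries match a gold span exactly.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted and present in gold
        fp += len(p - g)   # spans predicted but not in gold
        fn += len(g - p)   # gold spans the model missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```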

Detailed Performance by Entity

| Entity              | Precision | Recall | F1-Score | Support |
|---------------------|-----------|--------|----------|---------|
| AGE                 | 0.9283    | 0.9570 | 0.9425   | 582     |
| DATE                | 0.9742    | 0.9831 | 0.9786   | 1654    |
| GENDER              | 0.9258    | 0.9459 | 0.9358   | 462     |
| JOB                 | 0.5410    | 0.3815 | 0.4475   | 173     |
| LOCATION            | 0.9099    | 0.9023 | 0.9060   | 4441    |
| NAME                | 0.9209    | 0.8050 | 0.8591   | 318     |
| ORGANIZATION        | 0.7899    | 0.7704 | 0.7800   | 771     |
| PATIENT_ID          | 0.9722    | 0.9771 | 0.9746   | 2005    |
| SYMPTOM_AND_DISEASE | 0.8641    | 0.7165 | 0.7834   | 1136    |
| TRANSPORTATION      | 0.9653    | 0.8653 | 0.9126   | 193     |

How to Use

To use this model for inference, you need to load the saved state dictionary (pytorch_model.bin), the vocabulary mappings (mappings.pkl), and reconstruct the BiLSTM_CRF PyTorch class exactly as defined in the training script.

import torch
import pickle
import numpy as np

# 1. Load Mappings
with open("mappings.pkl", "rb") as f:
    mapping_data = pickle.load(f)

word_to_id = mapping_data['word_to_id']
char_to_id = mapping_data['char_to_id']
id_to_tag = mapping_data['id_to_tag']

# 2. Initialize Model
# Ensure your BiLSTM_CRF class is defined in your script
# model = BiLSTM_CRF(vocab_size=len(word_to_id), tag_to_ix=..., ...)
# model.load_state_dict(torch.load("pytorch_model.bin", map_location=torch.device('cpu')))
# model.eval()

# 3. Preprocessing & Inference Example
sentence = "Bệnh nhân nhập viện tối qua ở Bệnh Viện 115 là bệnh nhân thứ 82"
str_words = sentence.split()  # syllable-level tokenization: split on whitespace
words = [word_to_id.get(w, word_to_id.get('<UNK>', 0)) for w in str_words]
chars = [[char_to_id.get(c, 0) for c in w] for w in str_words]  # adjust OOV handling to match training

# Pad each word's character ids to the length of the longest word
chars2_length = [len(c) for c in chars]
char_maxl = max(chars2_length)
chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int64')
for i, c in enumerate(chars):
    chars2_mask[i, :chars2_length[i]] = c

# torch.autograd.Variable is deprecated; plain tensors suffice since PyTorch 0.4
dwords = torch.LongTensor(words)
chars2_mask = torch.LongTensor(chars2_mask)

# Get predictions (CRF Viterbi decoding returns the best tag-id sequence)
# with torch.no_grad():
#     _, predicted_id = model(dwords, chars2_mask, chars2_length, {})
# prediction_label = [id_to_tag[val] for val in predicted_id]
# print(prediction_label)
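The decoded labels are BIO tags (e.g. `B-PATIENT_ID`, `I-PATIENT_ID`, `O`), so a small post-processing step is needed to group them into entity spans. A sketch of that conversion; the function name and the `(type, start, end)` output format (end index exclusive) are illustrative, and stray `I-` tags without a preceding `B-` are treated as opening a new span.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (entity_type, start, end) spans,
    with `end` exclusive, over token positions."""
    spans, start, etype = [], 0, None
    for i, tag in enumerate(tags + ["O"]):   # trailing "O" flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:                       # the current span (if any) ends here
            if etype is not None:
                spans.append((etype, start, i))
            etype = tag[2:] if tag != "O" else None  # B-X (or stray I-X) opens a span
            start = i
    return spans
```

Pairing the spans with `str_words` then yields the surface strings for each entity, e.g. `" ".join(str_words[s:e])` for a span `(t, s, e)`.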