Recognition of Bangla and English Words in Bangla Texts Using a Modified BERT-base-NER Model
Abstract
A combination of Bangla and English words is commonly used, particularly on social media. This tendency greatly hampers the next generation’s ability to learn Bangla.
This study proposes an approach for identifying which words in Bangla texts are
English and which are Bangla. It also translates the identified English terms into standard Bangla words. The Transformer architecture, which uses an attention mechanism
to identify the connections between words and their contexts inside a text, is the foundation of bidirectional encoder representations from transformers (BERT). In this study, we
modify the BERT-base-NER model using the training dataset. For the named entity recognition (NER) task, the modified BERT-base-NER model achieves
state-of-the-art performance. We evaluate the model with a
holdout validation procedure, using 80% of the data for training and 20%
for testing. We use the Google Translate API (application programming interface) to
translate the identified English words into standard Bangla words. To assess the
modified BERT-base-NER model, we also applied the same dataset to existing machine
learning (ML) and deep learning (DL) techniques. The ML baselines are support vector machines (SVM) and
Naive Bayes (NB). The DL baselines are
long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and
convolutional neural network (CNN) models. According to the experimental results, the modified BERT-base-NER model is highly accurate and efficient at identifying Bangla and English words.
With an accuracy of 95%, it achieves the best result
among the compared methods. For Bangla–English code-mixed text, this study presents a
reliable BERT-based word-level language identification system that successfully resolves
Banglish ambiguity and allows downstream Bangla language processing applications such
as standard Bangla conversion, machine translation, and information extraction.
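
The holdout evaluation described in the abstract can be illustrated with a minimal sketch. It only shows an 80/20 split and a word-level accuracy measure using the Python standard library. It is not the thesis's implementation: the function names, the `BN`/`EN` label scheme, the random seed, and the toy word list are all assumptions made for illustration, and no model is trained here.

```python
import random


def holdout_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def word_accuracy(gold_labels, predicted_labels):
    """Fraction of words whose predicted language label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Hypothetical toy data: (word, label) pairs, BN = Bangla, EN = English.
# The labels are illustrative and are not taken from the thesis dataset.
toy_data = [
    ("আমি", "BN"), ("ফোন", "EN"), ("বই", "BN"), ("স্কুল", "EN"), ("ভালো", "BN"),
    ("টেবিল", "EN"), ("জল", "BN"), ("কম্পিউটার", "EN"), ("মা", "BN"), ("বাস", "EN"),
]

train_set, test_set = holdout_split(toy_data, train_fraction=0.8)
print(len(train_set), len(test_set))
```

With ten labeled words, the split yields eight training items and two test items. In a real evaluation, `word_accuracy` would be applied to the gold labels of the held-out words and the labels predicted by the trained model.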
Collections
- M.Sc Thesis/Project [163]
