Recognition of Bangla and English Words in Bangla Texts Using a Modified BERT-base-NER Model

Hossain, Md. Parvez

Date

2026-01-12

Abstract

Mixing Bangla and English words is common, particularly on social media, and this tendency greatly hampers the next generation’s ability to learn Bangla. This study proposes an approach for identifying which words in a Bangla text are Bangla and which are English, and it translates the identified English words into standard Bangla words. Bidirectional Encoder Representations from Transformers (BERT) is built on the Transformer architecture, which uses an attention mechanism to capture the relationships between words and their contexts within a text. We modify the BERT-base-NER model using the training portion of our input dataset, and the modified model achieves state-of-the-art performance on the named entity recognition (NER) task. For training and testing we use a holdout validation procedure, with 80% of the data used for training and 20% for testing. The identified English words are translated into standard Bangla words with the Google Translate application programming interface (API). To assess the modified BERT-base-NER model, we applied the same input dataset to existing machine learning (ML) and deep learning (DL) techniques. The ML baselines are support vector machines (SVM) and Naive Bayes (NB); the DL baselines are a convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM). According to the simulation results, the modified BERT-base-NER model identifies Bangla and English words accurately and efficiently. With an accuracy of 95%, it achieves the best result among the compared methods.
For Bangla–English code-mixed text, this study presents a reliable BERT-based word-level language identification system that resolves Banglish ambiguity and enables downstream Bangla language processing applications such as standard Bangla conversion, machine translation, and information extraction.
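
The evaluation protocol described in the abstract (an 80/20 holdout split and word-level accuracy) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the thesis: the tag names `BN` and `EN`, the toy sentences, the helper names `holdout_split`, `word_accuracy`, and `script_baseline`, and the naive script-based tagger, which stands in for a model and is not the modified BERT-base-NER model.

```python
import random


def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle once, then hold out `test_fraction` of the samples for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def word_accuracy(gold, predicted):
    """Fraction of words whose predicted language tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


def script_baseline(words):
    """Naive stand-in tagger (assumption): ASCII words are EN, all others BN.

    It cannot handle Banglish (Bangla written in Latin letters), which is
    exactly the ambiguity a context-aware model is meant to resolve.
    """
    return ["EN" if w.isascii() else "BN" for w in words]


# Hypothetical toy corpus: (words, gold tags) pairs, repeated to get 10 samples.
corpus = [
    (["আমি", "school", "যাই"], ["BN", "EN", "BN"]),
    (["এটা", "খুব", "nice"], ["BN", "BN", "EN"]),
] * 5

train, test = holdout_split(corpus, test_fraction=0.2)
gold = [tags for _, tags in test]
pred = [script_baseline(words) for words, _ in test]
print(len(train), len(test), word_accuracy(gold, pred))
```

A fixed seed keeps the split reproducible, so every compared method can be trained and scored on the same two partitions.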

URI

http://dspace.uiu.ac.bd/handle/52243/3391

Collections

M.Sc Thesis/Project [163]