A Hybrid Approach to Bangla Regional Text Classification Using BERT Ensemble and Region-Specific Lexical Oversampling

UIU Institutional Repository

    Date
    2026-01-12
    Author
    Sultana, Babe
    Abstract
    Regional text analysis captures the linguistic richness of dialects and so reflects the lived realities of the communities that speak them. It bridges the gap between everyday regional usage and standardized language forms, making language technologies more inclusive. In this paper, we focus on five regional dialects of Bangladesh (Chittagong, Sylhet, Noakhali, Barishal, and Rangpur) using a dataset of 4,218 text samples. Five regional experts validated the dataset, which is divided into three tiers according to an assigned agreement criterion. Tier 1 is a strictly filtered, high-confidence subset used primarily for evaluation. We also introduce a set of region-specific special words, each exclusive to its region and validated by domain experts. These words drive a linguistically informed oversampling technique that balances the dataset in both experiments. The first experiment demonstrates the effectiveness of the tiered dataset structure: Tier 2 and Tier 3 (the medium- and low-confidence subsets) are used for training, and Tier 1 (the high-quality subset) is used for testing. In this setting, BanglaBERT achieves the best individual performance, with 67.45% accuracy and a weighted F1-score of 67.62%. The second experiment uses only the Tier 1 dataset and applies a wide range of machine learning and deep learning models to assess their effectiveness. The key contribution is a heterogeneous deep ensemble technique that combines three BERT models (BanglaBERT, BUETBERT, and DistilBERT) and achieves an accuracy of 85.17% and a weighted F1-score of 84.84% on the Tier 1 dataset.
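The abstract does not say how the three BERT models are combined. As a rough illustration only, the sketch below assumes unweighted soft voting, that is, averaging each model's per-class probabilities and picking the highest. The function names (`soft_vote`, `predict_region`) and the probability values are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence

# The five dialect classes named in the abstract.
REGIONS = ["Chittagong", "Sylhet", "Noakhali", "Barishal", "Rangpur"]


def soft_vote(model_probs: Dict[str, Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (unweighted soft voting).

    Assumption: the thesis may use a different combination rule.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_classes = len(REGIONS)
    averaged = [0.0] * n_classes
    for name, probs in model_probs.items():
        if len(probs) != n_classes:
            raise ValueError(f"{name}: expected {n_classes} probabilities")
        for i, p in enumerate(probs):
            averaged[i] += p / len(model_probs)
    return averaged


def predict_region(model_probs: Dict[str, Sequence[float]]) -> str:
    """Return the region with the highest averaged probability."""
    averaged = soft_vote(model_probs)
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return REGIONS[best]


# Hypothetical softmax outputs for one sentence (illustrative numbers only).
example = {
    "BanglaBERT": [0.50, 0.30, 0.10, 0.05, 0.05],
    "BUETBERT": [0.20, 0.60, 0.10, 0.05, 0.05],
    "DistilBERT": [0.25, 0.55, 0.10, 0.05, 0.05],
}

if __name__ == "__main__":
    print(predict_region(example))
```

In this made-up example BanglaBERT alone would pick Chittagong, but the averaged probabilities favor Sylhet, which shows how an ensemble can overrule a single model's prediction.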
    URI
    http://dspace.uiu.ac.bd/handle/52243/3393
    Collections
    • M.Sc Thesis/Project [163]

    Copyright 2003-2017 United International University
    Contact Us | Send Feedback
    Developed by UIU CITS
     

     
