Show simple item record

dc.contributor.author: Sultana, Babe
dc.date.accessioned: 2026-01-12T04:44:07Z
dc.date.available: 2026-01-12T04:44:07Z
dc.date.issued: 2026-01-12
dc.identifier.citation: CSE (en_US)
dc.identifier.uri: http://dspace.uiu.ac.bd/handle/52243/3390
dc.description: pdf (en_US)
dc.description.abstract: Regional text analysis reflects the lived realities of diverse communities by capturing the linguistic richness and diversity present in various dialects. It bridges the gap between everyday regional usage and standardized language forms, thereby enhancing the inclusivity of language technologies. In this paper, we focus on five regional dialects in Bangladesh, namely Chittagong, Sylhet, Noakhali, Barishal, and Rangpur, using a dataset of 4,218 text samples. The dataset is validated by five regional experts and categorized into three tiers based on an assigned agreement criterion. Tier 1 represents a strictly filtered, high-confidence subset and is used primarily for evaluation. We introduce a set of region-specific special words, which belong exclusively to their respective regions and are validated by domain experts. These words are used in a linguistically informed oversampling technique to balance the dataset in both experiments. In the first experiment, we demonstrate the effectiveness of the tiered dataset structure, where Tier 2 and Tier 3 (medium- and low-confidence subsets) are used for training, and Tier 1 (high-quality subset) is used for testing. In this setting, BanglaBERT achieves the best individual performance with 67.45% accuracy and a weighted F1-score of 67.62%. In the second experiment, we focus exclusively on the Tier 1 dataset, applying a wide range of machine learning and deep learning models to assess their effectiveness. The key contribution is a heterogeneous deep ensemble technique that combines three BERT models, BanglaBERT, BUETBERT, and DistilBERT, achieving an accuracy of 85.17% and a weighted F1-score of 84.84% on the Tier 1 dataset. (en_US)
dc.description.sponsorship: CSE UIU (en_US)
dc.language.iso: en_US (en_US)
dc.publisher: UIU (en_US)
dc.subject: Regional text analysis, dialects, standardized language forms, BanglaBERT (en_US)
dc.title: A Hybrid Approach to Bangla Regional Text Classification Using BERT Ensemble and Region-Specific Lexical Oversampling (en_US)
dc.type: Thesis (en_US)

