| dc.description.abstract | Regional text analysis reflects the lived realities of diverse communities by capturing
the linguistic richness and diversity present in various dialects. It bridges the gap between
everyday regional usage and standardized language forms, thereby enhancing the
inclusivity of language technologies. In this paper, we focus on five regional dialects in
Bangladesh, namely Chittagong, Sylhet, Noakhali, Barishal, and Rangpur, using a dataset
of 4,218 text samples. The dataset is validated by five regional experts and categorized into
three tiers based on an assigned agreement criterion. Tier 1 represents a strictly filtered,
high-confidence subset and is used primarily for evaluation. We introduce a set of
region-specific special words, each exclusive to its region and validated by domain
experts. These words are used in a linguistically informed oversampling
technique to balance the dataset in both experiments. In the first experiment, we demonstrate
the effectiveness of the tiered dataset structure, where Tier 2 and Tier 3 (medium- and
low-confidence subsets) are used for training, and Tier 1 (high-quality subset) is used
for testing. In this setting, BanglaBERT achieves the best individual performance with
67.45% accuracy and a weighted F1-score of 67.62%. In the second experiment, we focus
exclusively on the Tier 1 dataset, applying a wide range of machine learning and deep
learning models to assess their effectiveness. The key contribution is a heterogeneous deep
ensemble technique that combines three BERT models, BanglaBERT, BUETBERT, and
DistilBERT, achieving an accuracy of 85.17% and a weighted F1-score of 84.84% on the
Tier 1 dataset.
| en_US |