Improving Machine Learning Methods for Handling Data Imbalance Problem

Abdullah-All-Tanvir

Date

2022-07

Author

Abdullah-All-Tanvir

Abstract

Class-imbalanced datasets are widespread in many fields, including health, security, and banking. When trained on an imbalanced dataset, a standard supervised learning algorithm is biased toward the majority class. In real-life applications, however, the minority class instances are usually of greater interest than the majority class instances, since they represent the concept to be learned. To classify imbalanced datasets, numerous strategies based on sampling methods (undersampling of the majority class and oversampling of the minority class), cost-sensitive learning methods, and ensemble learning have recently been employed in the literature. However, deleting majority class samples at random using a uniform distribution may result in needless data loss. In this thesis, we propose three cluster-based undersampling models to prevent unnecessary data loss. In the first, we inject the test data into the training data for clustering. In the second, we select 25% of the samples close to the cluster centroid and 25% from the boundary. In the third, we remove 50% of the majority data around the minority data. We evaluate our methods on 49 datasets using auROC, auPR, F1-score, and MCC. According to the experimental results, our methods are promising strategies for dealing with severely imbalanced datasets.
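
The centroid-and-boundary selection described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes that the majority class has already been split into clusters by some clustering algorithm, that distances are Euclidean, and that "boundary" means the samples farthest from the cluster centroid. The function names and the two fraction parameters are hypothetical and chosen only for illustration.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def undersample_cluster(points, near_frac=0.25, boundary_frac=0.25):
    """Keep the samples nearest to and farthest from the cluster centroid.

    Assumption: "boundary" is interpreted as "farthest from the centroid";
    the thesis may define the boundary differently.
    """
    if not points:
        return []
    c = centroid(points)
    # Indices ordered by Euclidean distance to the centroid, nearest first.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], c))
    n_near = int(len(points) * near_frac)
    n_far = int(len(points) * boundary_frac)
    near = order[:n_near]
    far = order[len(order) - n_far:] if n_far else []
    # Deduplicate in case the two selections overlap, and keep input order.
    keep = sorted(set(near + far))
    return [points[i] for i in keep]


# One hypothetical majority-class cluster with eight 2-D samples.
majority_cluster = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (0.5, 0.5), (0.4, 0.6), (3.0, 3.0), (-2.0, -2.0),
]
kept = undersample_cluster(majority_cluster)
print(kept)
```

With the default fractions, half of the cluster is retained: the two samples closest to the centroid and the two samples farthest from it. In a full pipeline, this function would be applied to each majority cluster separately, and the retained samples would be combined with all minority samples to form the training set.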

URI

http://dspace.uiu.ac.bd/handle/52243/2493

Collections

M.Sc Thesis/Project