Active Learning for Mining Big Data
Abstract
Active learning, also known as optimal experimental design, is a process
for building a classifier or learning model with fewer training instances
in a semi-supervised setting. It is a well-known approach that is
used in many real-life machine learning and data mining applications. Active
learning uses a query function and an oracle or expert (e.g., a human
or information source) to label unlabeled data instances and thereby improve the
performance of a classifier. Labeling unlabeled data instances is difficult,
time-consuming, and expensive. In this paper, we propose an approach
based on cluster analysis for selecting informative training instances
from a large number of unlabeled data instances (big data), so that
fewer training instances are needed to build a classifier suitable for active
learning. The proposed method clusters the unlabeled big data into several
clusters and finds the informative instances in each cluster using the
center of the cluster, the nearest neighbors of that center, and
randomly selected instances from the cluster. The objective is to find the
informative unlabeled instances and have the oracle label them, in order to improve
the classification results of machine learning algorithms applied to
big data. We tested the performance of the proposed method on seven
benchmark datasets from the UC Irvine Machine Learning Repository using
the following five well-known machine learning algorithms: C4.5 (decision
tree induction), SVM (support vector machine), Random Forest, Bagging,
and Boosting (AdaBoost). The experimental analysis shows that the proposed
method improves the performance of classifiers in active learning with fewer
training instances.
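
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes k-means as the clustering algorithm and Euclidean distance (the abstract names neither), it uses the cluster member closest to the centroid as a stand-in for "the center of the cluster", and the names `kmeans`, `select_queries`, `n_near`, and `n_random` are hypothetical.

```python
import math
import random


def assign_points(points, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = assign_points(points, centers)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign_points(points, centers)


def select_queries(points, k=3, n_near=2, n_random=1, seed=0):
    """Return sorted indices of the instances to send to the oracle.

    Per cluster: the member closest to the center, the next n_near members
    closest to the center, and n_random members drawn at random from the rest.
    """
    rng = random.Random(seed)
    centers, assign = kmeans(points, k, seed=seed)
    chosen = set()
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        # Order cluster members by distance to their center.
        idx.sort(key=lambda i: math.dist(points[i], centers[c]))
        picked = idx[: 1 + n_near]
        rest = idx[1 + n_near:]
        picked += rng.sample(rest, min(n_random, len(rest)))
        chosen.update(picked)
    return sorted(chosen)


if __name__ == "__main__":
    # Toy unlabeled data: three Gaussian blobs in 2-D (illustrative only).
    gen = random.Random(1)
    data = [(gen.gauss(cx, 0.3), gen.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
    queries = select_queries(data, k=3)
    print(f"{len(queries)} of {len(data)} instances selected for labeling")
```

Only the returned indices would be labeled by the oracle; a classifier such as C4.5 or an SVM would then be trained on that labeled subset rather than on the full dataset.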