A Feature Group Weighting Method for Classifying High-Dimensional Big Data
Abstract
Features capture the distinctive characteristics and intrinsic values of data, but this is of little use unless important information and patterns can be extracted from data arriving from disparate sources and applications. In big data analytics, feature selection is one of the most important pre-processing steps: it reduces the large number of unessential, irrelevant, and noisy features that can seriously degrade the outcomes of classifier models. The main motivation for feature selection is to reduce the high dimensionality of large-scale data, since training on high-dimensional big data makes performance evaluation challenging and costly. The aim of this research is to build models with several hybrid feature selection techniques so that classification algorithms use only those features that are truly relevant and that help achieve better performance, and to identify informative features and group them so that knowledge can be extracted from big data. For this study, we collected 10 benchmark datasets from the UC Irvine Machine Learning Repository and evaluated ten feature selection methods: CFS, Chi-Square, Consistency Subset Evaluator, Gain Ratio, Information Gain, OneR, PCA, ReliefF, Symmetrical Uncertainty, and Wrapper.
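As an illustration of how such filter methods operate, the sketch below applies two of the listed methods, Information Gain and Chi-Square, using scikit-learn; the dataset, the value of k, and the downstream classifier are illustrative placeholders, not the experimental setup of the thesis.

    # Minimal sketch of two filter-based selectors (assumed details,
    # not the thesis's exact pipeline). A stand-in dataset is used.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)  # placeholder UCI-style dataset

    for name, score_fn in [("InfoGain", mutual_info_classif), ("Chi2", chi2)]:
        X_sel = SelectKBest(score_fn, k=10).fit_transform(X, y)  # keep top 10 features
        acc = cross_val_score(RandomForestClassifier(), X_sel, y, cv=5).mean()
        print(f"{name}: mean CV accuracy = {acc:.3f}")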
Three feature grouping methods, Random Grouping, Correlation-based Grouping, and Attribute Weighting Grouping, were then evaluated with the ensemble classifiers Random Forest, Bagging, and Boosting (AdaBoost). The results show that these feature groups yield performance similar to, or better than, the entire feature set on the tested datasets, with the Attribute Weighting Grouping method showing particularly promising performance for big data.
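To make the grouping idea concrete, the following sketch shows one plausible reading of attribute-weighting grouping, under assumed details rather than the thesis's exact implementation: features are weighted by information gain, ranked, partitioned into weight-ordered groups, and each group is compared against the full feature set with a Random Forest.

    # Minimal sketch of attribute-weighting grouping (assumed procedure).
    # Weighting scheme, number of groups, and dataset are placeholders.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
    weights = mutual_info_classif(X, y)          # per-feature information gain
    ranked = np.argsort(weights)[::-1]           # highest-weight features first
    groups = np.array_split(ranked, 3)           # three weight-ordered groups

    full = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()
    print(f"all {X.shape[1]} features: {full:.3f}")
    for i, g in enumerate(groups):
        acc = cross_val_score(RandomForestClassifier(), X[:, g], y, cv=5).mean()
        print(f"group {i} ({len(g)} features): {acc:.3f}")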