A Feature Group Weighting Method for Classifying High-Dimensional Big Data
Features hold the distinctive characteristics and intrinsic values of data, but they are of little use if the important information and patterns cannot be extracted from data arriving from disparate sources and applications. In big data, feature selection is one of the most important pre-processing steps, reducing the large number of inessential, irrelevant, and noisy features that can seriously affect the outcomes of classifier models. The main motivation for applying feature selection is to reduce the high dimensionality of large-scale data: because high-dimensional big data supplies many features for training, measuring performance becomes challenging and costly. The aim of this research is to build models with several hybrid feature selection techniques so that classification algorithms receive only those features that are truly relevant and help achieve better performance, and to find the informative features and group them so that knowledge can be extracted from big data. In this research, we collected 10 benchmark datasets from the UC Irvine Machine Learning Repository and applied several feature selection methods, testing their performance: CFS, Chi-Square, Consistency Subset Evaluator, Gain Ratio, Information Gain, OneR, PCA, ReliefF, Symmetrical Uncertainty, and Wrapper. Three feature grouping methods were studied, named Random Grouping, Correlation-based Grouping, and Attribute Weighting Grouping; these groups were evaluated with the ensemble classifiers Random Forest, Bagging, and Boosting (AdaBoost). The observed results show that these groups achieve similar or even better results than the entire feature sets for these datasets, and the Attribute Weighting Grouping method has shown promising performance for big data.
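The pipeline the abstract describes (weight attributes, partition the ranked features into groups, then fit an ensemble classifier per group and compare against the full feature set) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: scikit-learn's `mutual_info_classif` stands in for the Information Gain scorer, the breast cancer dataset stands in for the 10 UCI benchmarks, and the group count `k = 3` is an arbitrary choice.

```python
# Hypothetical sketch of attribute-weighting feature grouping:
# rank features by a weight score, split the ranking into k groups,
# and score a Random Forest on each group versus all features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Weight every attribute (mutual information here stands in for
# Information Gain), then order features from most to least informative.
weights = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(weights)[::-1]

# Partition the ranked features into k roughly equal-sized groups.
k = 3
groups = np.array_split(ranked, k)

# Cross-validate an ensemble classifier on each group and on the full set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
group_scores = [cross_val_score(clf, X[:, g], y, cv=5).mean() for g in groups]
full_score = cross_val_score(clf, X, y, cv=5).mean()

for i, (g, s) in enumerate(zip(groups, group_scores)):
    print(f"group {i}: {len(g)} features, accuracy {s:.3f}")
print(f"all features: accuracy {full_score:.3f}")
```

Under this setup, the group built from the highest-weighted attributes typically scores close to the full feature set, which mirrors the abstract's observation that well-chosen groups can match the performance of all features at a fraction of the dimensionality.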
- M.Sc Thesis/Project