CEC Theses and Dissertations

Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Information Systems (DISS)

Department

College of Engineering and Computing

Advisor

Junping Sun

Committee Member

Easwar Nyshadham

Committee Member

Steven Zhou

Abstract

The use of data mining methods in corporate decision making has increased over the past decades. Their popularity can be attributed to better utilization of data mining algorithms, increased computing performance, and results that can be measured and applied to decision making. The effective use of data mining methods to analyze various types of data has shown great advantages in various application domains. While some data sets need little preparation to be mined, others, in particular high-dimensional data sets, must be preprocessed first because mining high-dimensional data is complex and inefficient. Feature selection, or attribute selection, is one of the techniques used for dimensionality reduction. Previous research has shown that data mining results can be improved in terms of accuracy and efficacy by selecting the most significant attributes. This study analyzes vehicle service and sales data from multiple car dealerships. The purpose of this study is to find a model that better classifies existing customers as new car buyers based on their vehicle service histories. Six different feature selection methods, including Information Gain, Correlation Based Feature Selection, Relief-F, Wrapper, and Hybrid methods, were used to reduce the number of attributes in the data sets and were compared. The data sets with the selected attributes were run through three popular classification algorithms, Decision Trees, k-Nearest Neighbor, and Support Vector Machines, and the results were compared and analyzed. This study concludes with a comparative analysis of feature selection methods and their effects on different classification algorithms within the domain. As a basis for comparison, the same procedures were run on a standard data set from the financial institution domain.