Date of Award
Doctor of Philosophy (PhD)
College of Engineering and Computing
Supervised machine learning models are increasingly being used for medical diagnosis. The diagnostic problem is formulated as a binary classification task in which trained classifiers make predictions based on a set of input features. In diagnosis, these features are typically procedures or tests with associated costs. The cost of applying a trained classifier for diagnosis may be estimated as the total cost of obtaining values for the features that serve as inputs for the classifier. Obtaining classifiers based on a low cost set of input features with acceptable classification accuracy is of interest to practitioners and researchers. What makes this problem even more challenging is that costs associated with features vary with patients and service providers and change over time.
This dissertation aims to address this problem by proposing a method for obtaining low cost classifiers that meet specified accuracy requirements under dynamically changing costs. Given a set of relevant input features and accuracy requirements, the goal is to identify all qualifying classifiers based on subsets of the feature set. Then, for any arbitrary costs associated with the features, the cost of the classifiers may be computed and candidate classifiers selected based on cost-accuracy tradeoff. Since the number of relevant input features k tends to be large for typical diagnosis problems, training and testing classifiers based on all 2^k-1 possible non-empty subsets of features is computationally prohibitive. Under the reasonable assumption that the accuracy of a classifier is no lower than that of any classifier based on a subset of its input features, this dissertation aims to develop an efficient method to identify all qualifying classifiers.
This study used two types of classifiers – artificial neural networks and classification trees – that have proved promising for numerous problems as documented in the literature. The approach was to measure the accuracy obtained with the classifiers when all features were used. Then, reduced thresholds of accuracy were arbitrarily established which were satisfied with subsets of the complete feature set. Threshold values for three measures –true positive rates, true negative rates, and overall classification accuracy were considered for the classifiers. Two cost functions were used for the features; one used unit costs and the other random costs. Additional manipulation of costs was also performed.
The order in which features were removed was found to have a material impact on the effort required (removing the most important features first was most efficient, removing the least important features first was least efficient). The accuracy and cost measures were combined to produce a Pareto-Optimal Frontier. There were consistently few elements on this Frontier. At most 15 subsets were on the Frontier even when there were hundreds of thousands of acceptable feature sets. Most of the computational time is taken for training and testing the models. Given costs, models in the Pareto-Optimal Frontier can be efficiently identified and the models may be presented to decision makers. Both the Neural Networks and the Decision Trees performed in a comparable fashion suggesting that any classifier could be employed.
Richard Clyde McCrae. 2018. The Impact of Cost on Feature Selection for Classifiers. Doctoral dissertation. Nova Southeastern University. Retrieved from NSUWorks, College of Engineering and Computing. (1057)