CCE Theses and Dissertations

Comparison of Data Mining Techniques used to Predict Student Retention

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Information Systems (DISS)


Graduate School of Computer and Information Sciences


Sumitra Mukherjee

Committee Member

Maxine S. Cohen

Committee Member

Easwar Nyshadham


Retaining undergraduate students at four-year public institutions has been a long-standing problem. Although the retention issue has been the focus of thousands of studies over the past 75 years, it is widely acknowledged that this problem remains complex. Many retention studies have focused on a single variable or a single set of variables, and even a well-established factor such as low grade point average (GPA) explains only a small percentage of the variance in retention. Researchers in this area have noted the need for more sophisticated models that can take into account multiple variables that may contribute to student attrition, as well as the need for retention research to be useful to practitioners in higher education settings. In addition, there are major gaps in the persistence literature when considering retention for part-time students, transfer students, and upperclassmen. Most retention research focuses on one-year retention for first-time, full-time freshmen, and new models that extend beyond this traditional focus are needed. The purpose of this study was to expand understanding of how educational institutions might benefit from including data mining processes and multivariate analysis to inform student retention strategies. This study applies data mining techniques to student demographic and behavioral data in an institution of higher education, providing a detailed description of the data mining process. Both full-time and part-time students, as well as students at every class level, were included in the analysis. Findings indicate that neural networks, Naive Bayesian classification, and decision tree induction are comparable to logistic regression when used to predict individual student retention. In addition, the data was segmented into several more homogeneous student groups, and predictive performance improved for selected data segments, most notably for the part-time student segment. Finally, attribute evaluators were applied to each data segment, and results indicate that a data mining approach can be used to isolate variables that predict persistence differently for different groups of students.

This document is currently not available here.
