CCE Theses and Dissertations

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Information Systems (DISS)


Graduate School of Computer and Information Sciences


Sumitra Mukherjee

Committee Member

Michael J. Laszlo

Committee Member

Junping Sun


The current decline in the U.S. economy has been accompanied by an increase in foreclosure rates that began in 2007. Although the earliest figures for 2009-2010 indicate a significant decrease, home foreclosures in the U.S. remain at an alarming level (Gutierrez, 2009a). Recent research at the University of Michigan suggested that many foreclosures could have been averted had there been a predictive system that relied on more than credit scores and loan-to-value ratios (DeGroat, 2009). Furthermore, Grover, Smith & Todd (2008) contend that foreclosure prediction can make foreclosure mitigation more efficient by directing resources to areas where foreclosure rates are predicted to be high.

The primary goal of this dissertation was to develop a foreclosure prediction model that builds upon established bankruptcy and credit scoring models. The study utilized and compared the predictive accuracy of three supervised machine learning (ML) techniques when applied to mortgage data. The selected ML techniques were:

ML1. Classification Trees

ML2. Support Vector Machines (SVM)

ML3. Genetic Programming

The data used for the study comprised mortgage data, demographic metrics, and certain macroeconomic indicators that are available at the inception of the loan.

The hypothesis of the study was based on the assumption that foreclosure rates, and associated actions, depend on critical demographic (age, gender), economic (per capita income, inflation), and regional (predatory lending, unemployment index) variables. The task of the machine learning techniques was to identify a function that closely approximates the relationship between these explanatory variables and the binary outcome of interest (mortgage status at +3 years from inception).
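As an illustration of this learning task, the following minimal sketch fits a one-split classification tree (a decision stump) to synthetic records and reports its accuracy on held-out records. It is not the dissertation's implementation. The two features, the rule that generates the labels, the record counts, and all function names are hypothetical assumptions chosen only to make the example self-contained.

```python
import random


def make_synthetic_loans(n, seed=0):
    """Generate hypothetical (features, label) pairs. This is NOT the study's data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.uniform(20.0, 120.0)      # assumed per capita income, in thousands
        unemployment = rng.uniform(2.0, 15.0)  # assumed regional unemployment index
        # Assumed labelling rule: foreclosure (1) is far more likely
        # when the unemployment index exceeds 9.
        risk = 0.9 if unemployment > 9.0 else 0.1
        label = 1 if rng.random() < risk else 0
        rows.append(((income, unemployment), label))
    return rows


def predict(stump, x):
    """Return `polarity` if the chosen feature exceeds the threshold, else the other class."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else 1 - polarity


def accuracy(stump, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)


def fit_stump(rows):
    """Exhaustively search for the single split with the highest training accuracy."""
    best_stump, best_score = None, -1.0
    for feature in range(len(rows[0][0])):
        for threshold in sorted({x[feature] for x, _ in rows}):
            for polarity in (0, 1):
                candidate = (feature, threshold, polarity)
                score = accuracy(candidate, rows)
                if score > best_score:
                    best_stump, best_score = candidate, score
    return best_stump


data = make_synthetic_loans(600)
train, test = data[:400], data[400:]  # simple holdout split
stump = fit_stump(train)
holdout_accuracy = accuracy(stump, test)
print("split feature index:", stump[0])
print("split threshold:", round(stump[1], 2))
print("holdout accuracy:", round(holdout_accuracy, 3))
```

A full classification tree applies this kind of split recursively. Measuring accuracy on records that were withheld from fitting is what makes a score comparable across different techniques.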

The predictive accuracy of ML1 through ML3 was significantly better than expected given the size of the record set (1,000 records) and the number of input variables (~110). Each ML technique achieved a classification accuracy better than 75%, with ML3 scoring in the upper 90% range. Given such high scores, it was concluded that the hypothesis was satisfied and that ML techniques are suitable for prediction tasks in this problem domain.
