CEC Theses and Dissertations


An Adaptive Machine Learning Approach to Knowledge Discovery in Large Datasets

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Information Systems (DISS)


Graduate School of Computer and Information Sciences


James D. Cannady

Committee Member

William L. Hafner

Committee Member

Junping Sun


Large text databases, such as medical records, on-line journals, or the Internet, potentially contain a great wealth of data and knowledge. However, factual information and knowledge represented as text are difficult to process, and analyzing these large text databases often relies on time-consuming human effort for data mining. Since a textual format is a very flexible way to describe and store various types of information, large amounts of information are often retained and distributed as text. "The amount of accessible textual data has been increasing rapidly. Such data may potentially contain a great wealth of knowledge. However, analyzing huge amounts of textual data requires a tremendous amount of work in reading all of the text and organizing the content. Thus, the increase in accessible textual data has caused an information flood in spite of hope of becoming knowledgeable about various topics" (Nasukawa and Nagano, 2001).

Preliminary research focused on key concepts and techniques derived from clustering methodology, machine learning, and other communities within the arena of data mining. The research was based on a two-stage machine-intelligence system that clustered and filtered large datasets. The overall objective was to optimize response time through parallel processing while attempting to reduce potential errors due to knowledge manipulation. The results generated by the two-stage system were reviewed by domain experts and tested using traditional methods that included multivariable regression analysis and logic testing for accuracy. The two-stage prototype developed a model that was 85 to 90% accurate in determining childhood asthma and disproved existing stereotypes related to sleep breathing disorders. Detailed results will be discussed in the proposed dissertation.

While the initial research demonstrated positive results in processing large text datasets, limitations were identified. These limitations included processing delays resulting from equal distribution of processing in a heterogeneous client environment and utilizing the results derived from the second stage as inputs for the first stage. To address these limitations, the proposed doctoral research will investigate the dynamic distribution of processing in a heterogeneous environment and cyclical learning in which the second-stage neural network clients modify the first-stage expert systems.

This document is currently not available here.
