CCE Theses and Dissertations


Use of Abstract Categories in Data Clustering Algorithms

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Information Systems (DISS)


Graduate School of Computer and Information Sciences


William Hafner

Committee Member

Yair Levy

Committee Member

Easwar Nyshadham


This research was motivated by the difficulty of distinguishing data anomalies and patterns of information within the enormous volume of data generated in audit logs, and by the goal of approaching or achieving real-time analysis of audit log data in applications and operating systems.

The research question was whether employing data clustering techniques to reduce the volume of audit data would enable data analysts to approach or achieve real-time intrusion detection.

The approach used in this study was to apply data mining clustering techniques to develop abstract categories from the audit log data that can be used with pattern matching techniques already in use or under experimental development. The results of this study demonstrate that, among the algorithms tested, the suffix tree algorithm is the fastest in terms of real-time processing speed, the most accurate in terms of producing the fewest false positives, and the best at reducing audit data file size.
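To make the idea of abstract categories concrete, the following is a minimal sketch of single pass clustering applied to log lines. It is not the implementation used in the study: the token-set representation, the Jaccard similarity measure, the 0.5 threshold, and the sample log entries are all assumptions chosen for illustration.

```python
def tokens(line):
    """Split a log line into a set of lowercase word tokens."""
    return set(line.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def single_pass_cluster(lines, threshold=0.5):
    """Assign each line to the first cluster whose representative is
    similar enough; otherwise start a new cluster.

    Returns a list of (representative_line, member_count) pairs.
    The threshold value is an assumption, not taken from the study.
    """
    clusters = []  # each entry: [representative_tokens, representative_line, count]
    for line in lines:
        toks = tokens(line)
        for cluster in clusters:
            if jaccard(toks, cluster[0]) >= threshold:
                cluster[2] += 1
                break
        else:
            # No existing cluster was close enough: open a new category.
            clusters.append([toks, line, 1])
    return [(rep, count) for _, rep, count in clusters]


# Hypothetical audit log entries, invented for illustration only.
sample_log = [
    "login failed for user alice from 10.0.0.5",
    "login failed for user bob from 10.0.0.5",
    "login failed for user carol from 10.0.0.7",
    "file /etc/passwd opened by root",
    "file /etc/shadow opened by root",
]

categories = single_pass_cluster(sample_log)
for representative, count in categories:
    print(count, representative)
```

In this sketch each cluster stands in for one abstract category, so a downstream pattern matcher would scan one representative per category rather than every raw entry, which is the kind of volume reduction the abstract describes.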

The objectives of this research were to identify and test a group of algorithms for speed and accuracy in searching audit log data, as well as to determine their ability to distinguish anomalies and patterns. While researchers have developed a number of different algorithms for intrusion detection and pattern matching in databases, the approaches used in these algorithms have not previously been tested against one another to compare their processing speed and the accuracy of their results.

The methodology used in this study involved assessing data mining clustering techniques to develop abstract categories from the audit log data that can be used with existing pattern matching techniques. The methodology established criteria for the selection of clustering algorithms, resulting in the selection of the k-means, single pass, fuzzy c-means, and suffix tree clustering algorithms for analysis. The test procedure was to measure the execution time of each algorithm in clustering the datasets. The number of false positives in the answer to the query was then compared against the execution time. The findings determined which of the algorithms produced the greatest accuracy in the shortest amount of time, and therefore which approach is the most effective. The suffix tree algorithm was found to be superior, followed by k-means, single pass, and fuzzy c-means.
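The test procedure described above can be sketched as a small harness that times a clustering function and then counts false positives in a query answer. This is an illustrative sketch only: the function names, the rule that a query returns every record in a matching cluster, and the toy grouping function standing in for a real clustering algorithm are all assumptions, not details from the study.

```python
import time


def evaluate(cluster_fn, records, query_fn, relevant):
    """Time cluster_fn on records, then count false positives for a query.

    cluster_fn: maps a list of records to a list of clusters (lists of records).
    query_fn:   returns True for a cluster that matches the query.
    relevant:   set of records that truly satisfy the query (ground truth).
    """
    start = time.perf_counter()
    clusters = cluster_fn(records)
    elapsed = time.perf_counter() - start

    # Assumption: every record in a matching cluster is returned in the answer.
    answer = [r for c in clusters if query_fn(c) for r in c]
    false_positives = sum(1 for r in answer if r not in relevant)
    return elapsed, false_positives


def group_by_first_word(records):
    """Toy stand-in for a clustering algorithm: group records by first word."""
    groups = {}
    for record in records:
        groups.setdefault(record.split()[0], []).append(record)
    return list(groups.values())


# Hypothetical records and ground truth, invented for illustration only.
records = ["login failed alice", "login ok bob", "logout bob"]
relevant = {"login failed alice"}


def has_failure(cluster):
    """Query: does any record in the cluster mention a failure?"""
    return any("failed" in r for r in cluster)


elapsed, false_positives = evaluate(
    group_by_first_word, records, has_failure, relevant
)
print(f"time: {elapsed:.6f} s, false positives: {false_positives}")
```

Running the same harness over each candidate algorithm and dataset would give the paired time and false-positive figures that the comparison relies on.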

This document is currently not available here.
