CCE Theses and Dissertations


Preliminary Examination for Empirical Knowledge (PEEK): A Fast Heuristic to Estimate the Inherent Degree of Clustering in Data Sets

Date of Award


Document Type


Degree Name

Doctor of Philosophy in Computing Technology in Education (DCTE)


Graduate School of Computer and Information Sciences


Michael J. Laszlo

Committee Member

James D. Cannady

Committee Member

Sumitra Mukherjee


The general area of this research is data clustering, in which an unsupervised classification process is used to discover and extract the clusters that naturally exist in a data set. These inherent patterns are then used to understand the data in a manner consistent with what the data represent. Such clustering methods may be used to discover natural groupings in raw data and to abstract structures that might reside there, without any prior knowledge of whether such structures exist.

Many different clustering algorithms are in use, each having relative strengths or other points of merit. For example, some have lower asymptotic running time than others, some require a priori knowledge of the underlying data, and some produce results which are highly dependent on the input parameters.

The goal of this dissertation was to develop an approach that measures the degree to which the data under study contain natural clusters. Using the measures developed, a researcher can determine whether the underlying data possess natural clusters, so that further processing can proceed with a method known to give the best results for the degree of clustering exhibited by the native data. Moreover, understanding the native clustering tendency of the data will facilitate the measurement of clustering validity, that is, how well the produced clusters partition the data in a manner that is meaningful in the real-world domain of the data set.

Such measures permit a researcher to choose intelligently, perhaps employing computationally intensive techniques with the confidence that the underlying data warrant such effort.
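As a rough illustration of what estimating clustering tendency before clustering can look like, the sketch below computes the Hopkins statistic, a standard measure from the general literature. It is not the PEEK heuristic described in this dissertation, whose details are not given here. The function names, sample size, and synthetic data sets are all assumptions made for the example. Values near 0.5 suggest no more structure than uniformly scattered points; values near 1 suggest strong natural clusters.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, q) for q in others)


def hopkins(data, m, rng):
    """Hopkins statistic of `data` (a list of equal-length tuples).

    m   -- number of sampled points and of uniform probes (hypothetical choice)
    rng -- a random.Random instance, so results are reproducible
    """
    dims = len(data[0])
    lows = [min(p[k] for p in data) for k in range(dims)]
    highs = [max(p[k] for p in data) for k in range(dims)]

    # w: nearest-neighbour distances for m real points drawn without replacement.
    w = []
    for i in rng.sample(range(len(data)), m):
        others = data[:i] + data[i + 1:]
        w.append(nearest_distance(data[i], others) ** dims)

    # u: nearest-neighbour distances from m uniform probes in the bounding box.
    u = []
    for _ in range(m):
        probe = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        u.append(nearest_distance(probe, data) ** dims)

    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)

# Synthetic data (assumed for illustration): two tight blobs vs. uniform scatter.
clustered = [(rng.gauss(0, 0.05), rng.gauss(0, 0.05)) for _ in range(100)]
clustered += [(rng.gauss(5, 0.05), rng.gauss(5, 0.05)) for _ in range(100)]
uniform = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(200)]

h_clustered = hopkins(clustered, 50, rng)
h_uniform = hopkins(uniform, 50, rng)
print(f"clustered: {h_clustered:.3f}  uniform: {h_uniform:.3f}")
```

A researcher might use such a score as a cheap preliminary check: a value close to 0.5 would argue against spending effort on an expensive clustering method, while a value close to 1 would support it.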
