CEC Theses and Dissertations

Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Information Systems (DCIS)

Department

Graduate School of Computer and Information Sciences

Advisor

Sumitra Mukherjee

Committee Member

Francisco J. Mitropoulos

Committee Member

Michael J. Lazlo

Abstract

Clustering gene expression data such that the diameters of the clusters formed are no greater than a specified threshold prompted the development of the Quality Threshold Clustering (QTC) algorithm. It iteratively forms clusters of non-increasing size until all points are clustered; the largest cluster is always selected first. The QTC algorithm applies in many other domains that require a similar quality guarantee based on cluster diameter. The worst-case complexity of the original QTC algorithm is (n5). Since practical applications often involve large datasets, researchers called for more efficient versions of the QTC algorithm.

This dissertation aimed to develop and evaluate efficient variations of the QTC algorithm that guarantee a maximum cluster diameter while producing partitions that are similar to those produced by the original QTC algorithm. The QTC algorithm is expensive because it considers forming clusters around every item in the dataset. This dissertation addressed this issue by developing methods for selecting a small subset of promising items around which to form clusters. A second factor that adversely affects the efficiency of the QTC algorithm is the computational cost of updating cluster diameters as new items are added to clusters. This dissertation proposed and evaluated alternate methods to meet the cluster diameter constraint while not having to repeatedly update the cluster diameters.

The variations of the QTC algorithm developed in this dissertation were evaluated on benchmark datasets using two measures: execution time and quality of solutions produced. Execution times were compared to the time taken to execute the most efficient published implementation of the QTC algorithm. Since the partitions produced by the proposed variations are not guaranteed to be identical to those produced by the original algorithm, the Jaccard measure of partition similarity was used to measure the quality of the solutions.

The findings of this research were threefold. First, the Stochastic QTC alone wasn’t computationally helpful since in order to produce partitions that were acceptably similar to those found by the deterministic QTCs, the algorithm had to be seeded with a large number of centers (ntryn). Second, the preprocessed data methods are desirable since they reduce the complexity of the search for candidate cluster points. Third, radius based methods are promising since they produce partitions that are acceptably similar to those found by the deterministic QTCs in significantly less time.

Share

COinS