CCE Theses and Dissertations

Parallel Hybrid Clustering using Genetic Programming and Multi-Objective Fitness with Density(PYRAMID)

Date of Award

2006

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Graduate School of Computer and Information Sciences

Advisor

Junping Sun

Committee Member

Michael J. Laszlo

Committee Member

William Sverdlik

Abstract

Clustering is the art of locating patterns in large data sets. It is an active research area that provides value to scientific as well as business applications. There are some challenges that face practical clustering including: identifying clusters of arbitrary shapes, sensitivity to the order of input, dynamic determination of the number of clusters, outlier handling, high dependency on user-defined parameters, processing speed of massive data sets, and the potential to fall into sub-optimal solutions. Many studies that were conducted in the realm of clustering have addressed some of these challenges. This study proposes a new approach, called parallel hybrid clustering using genetic programming and multi-objective fitness with density (PYRAMID), that tackles several of these challenges from a different perspective. PYRAMID employs genetic programming to represent arbitrary cluster shapes and circumvent falling in local optima. It accommodates large data sets and avoids dependency on the order of input by quantizing the data space, i.e., the space on which the data set resides, thus abstracting it into hyper-rectangular cells and creating genetic programming individuals as concatenations of these cells. Thus the cells become the subject of clustering, rather than the data points themselves. PYRAMID also utilizes a density-based multi-objective fitness function to handle outliers. It gathers statistics in a pre-processing step and uses them so not to rely on user-defined parameters. Finally, PYRAMID employs data parallelism in a master-slave model in an attempt to cure the inherent slow performance of evolutionary algorithms and provide speedup. A master processor distributes the clustering data evenly onto multiple slave processors. The slave processors conduct the clustering on their local data sets and report their clustering results back to the master, which consolidates them by merging the partial results into a final clustering solution. This last step also involves determining the number of clusters dynamically and labeling them accordingly. Experiments have demonstrated that, using these features, PYRAMID offers an advantage over some of the existing approaches by tackling the clustering challenges from a different angle.

This document is currently not available here.

  Link to NovaCat

Share

COinS