A Method for Finding the Number of Clusters

Description

This presentation proposes a maximum clustering similarity (MCS) method for determining the number of clusters in a data set by studying how similarity indices behave when comparing two (of several) clustering methods. The similarity between the two clusterings is calculated at the same number of clusters using the indices of Rand, Fowlkes and Mallows, and Kulczynski, each corrected for chance agreement.

The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated bivariate normal data and then extended to circular data. Its performance is compared with the criteria discussed in Tibshirani, Walther, and Hastie’s 2001 work. The method makes no distributional or other assumptions about the data, so it applies to any type of data that can be clustered using at least two clustering algorithms.
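
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's implementation: the function names `adjusted_rand` and `mcs_candidate` are hypothetical, only the Rand index is shown (the talk also uses the Fowlkes and Mallows and the Kulczynski indices), and the chance correction is assumed to be the common Hubert-Arabie form. The two clustering algorithms themselves are left out; their cluster labels at each number of clusters are taken as input.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Rand index corrected for chance (assumed Hubert-Arabie form)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same observations")
    total = comb(len(labels_a), 2)  # number of pairs of observations
    if total == 0:
        raise ValueError("need at least two observations")
    # Pairs placed together by both clusterings, by A alone, and by B alone.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total   # agreement expected by chance
    max_index = (in_a + in_b) / 2    # largest attainable agreement
    if max_index == expected:        # degenerate case: trivial partitions
        return 1.0
    return (both - expected) / (max_index - expected)


def mcs_candidate(labelings_a, labelings_b, index=adjusted_rand):
    """Pick the k at which the two methods agree most.

    labelings_a and labelings_b map a number of clusters k to the labels
    that each (hypothetical) clustering method assigns at that k.
    """
    shared = sorted(set(labelings_a) & set(labelings_b))
    if not shared:
        raise ValueError("no common number of clusters to compare")
    scores = {k: index(labelings_a[k], labelings_b[k]) for k in shared}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

In use, `labelings_a` and `labelings_b` might hold, for example, the labels produced by a hierarchical method and by k-means at k = 2, 3, ..., and the returned `best_k` is the candidate for the number of clusters. Passing a different `index` function would cover the other chance-corrected indices.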

Presenter Bio

Ahmed Albatineh holds a Ph.D. and teaches at Florida Atlantic University.

Date of Event

October 5, 2011, 12:00 PM - 1:00 PM

Location

Alvin Sherman Library, Room 2053, 3301 College Ave., Fort Lauderdale (main campus)

NSU News Release Link

http://nsunews.nova.edu/gather-mathematics-colloquium-series-talk-clusters/

https://nsuworks.nova.edu/mathematics_colloquium/ay_2011-2012/events/12