Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labeled k as in the kmeans algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
For a certain class of clustering algorithms (in particular kmeans, kmedoids and Expectationmaximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS algorithm do not require the specification of this parameter; hierarchical clustering avoids the problem altogether.
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.
Contents
Rule of thumb
One simple rule of thumb sets the number to^{[1]}^{:365}
with n as the number of objects (data points).
Finding Number of Clusters in Text Databases
In text databases, a document collection defined by a document by term D matrix (of size m by n, m: no. of documents, n: no. of terms) number of clusters can be estimated by the following formula (m x n) / t where t is the no. of nonzero entries in D. Note that in D each row and each column must contain at least one nonzero element. ^{[2]}
The Elbow Method
Another method looks at the percentage of variance explained as a function of the number of clusters: You should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters are chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.^{[3]} Percentage of variance explained is the ratio of the betweengroup variance to the total variance, also known as an Ftest. A slight variation of this method plots the curvature of the within group variance.^{[4]} The method can be traced to speculation by Robert L. Thorndike in 1953.^{[5]}
Information Criterion Approach
Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), or the Deviance information criterion (DIC) — if it is possible to make a likelihood function for the clustering model. For example: The kmeans model is "almost" a Gaussian mixture model and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values.^{[6]}
An Information Theoretic Approach^{[7]}
Rate distortion theory has been applied to choosing k called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information theoretic standards. The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as kmeans for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump representing the best choice.
The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a pdimensional random variable, X, consisting of a mixture distribution of G components with common covariance, Γ. If we let c_{1}...c_{K} be a set of K cluster centers, with c_{X} the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is:
This is also the average Mahalanobis distance per dimension between X and the set of cluster centers C. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudocode for the jump method with an input set of pdimensional data points X is:
JumpMethod(X): Let Y = (p/2) Init a list D, of size n+1 Let D[0] = 0 For k = 1 ... n: Cluster X with k clusters (e.g., with kmeans) Let d = Distortion of the resulting clustering D[k] = d^(Y) Define J(i) = D[i]  D[i1] Return the k between 1 and n that maximizes J(k)
The choice of the transform power Y = (p / 2) is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrarily pdimensional Gaussian distribution, and let fixed K = floor(α^{p}), for some α greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is α ^{− 2}. It can be seen that asymptotically, the distortion of a clustering to the power ( − p / 2) is proportional to α^{p}, which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in distortion. This behavior is important in the general case of a mixture of multiple distribution components.
Let X be a mixture of G pdimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering of less than the correct number of clusters is unable to describe asymptotically highdimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely, K = floor(α^{p}), the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to α ^{− 2}. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K.
Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion is approximately zero for K < G, then jumps suddenly and begins increasing linearly for K >= G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters.
Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the xaxis for K < G, and along the linearly increasing phase of the transformed distortion plot for K >= G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully nonparametric and has been shown to be viable for general mixture distributions.
Choosing k Using the Silhouette
The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a datum is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster, i.e. the cluster whose average distance from the datum is lowest.^{[8]} A silhouette close to 1 implies the datum is in an appropriate cluster, whilst a silhouette close to − 1 implies the datum is in the wrong cluster. Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette.^{[9]}
Crossvalidation
One can also use the process of crossvalidation to analyze the number of clusters. In this process, the data is partitioned into v parts. Each of the parts is then set aside at turn as a test set, a clustering model computed on the other v1 training sets, and the value of the goal function (for example, the sum of the squared distances to the centroids for kmeans) calculated for the test set. These v values are calculated and averaged for each alternative number of clusters, and the cluster number selected that minimizes the test set errors.^{[10]}
Analyzing the Kernel Matrix
Kernel matrix defines the proximity of the input information. For example, in Gaussian Radial basis function, determines the dot product of the inputs in a higherdimensional space, called feature space. It is believed that the data become more linearly separable in the feature space, and hence, linear algorithms can be applied on the data with a higher success.
The kernel matrix can thus be analyzed in order to find the optimal number of clusters ^{[11]}. The method proceeds by the eigenvalue decomposition of the kernel matrix. It will then analyze the eigenvalues and eigenvectors to obtain a measure of the compactness of the input distribution. Finally, a plot will be drawn, where the elbow of that plot indicates the optimal number of clusters in the data set. Unlike previous methods, this technique does not need to perform any clustering apriori. It directly find the number of clusters from the data.
External links
 Clustergram  cluster diagnostic plot  for visual diagnostics of choosing the number of (k) clusters (R code)
Bibliography
 ^ Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
 ^ Fazli Can, Esen A. Ozkarahan (1990). "Concepts and effectiveness of the cover coefficientbased clustering methodology for text databases". ACM Transactions on Database Systems 15 (4): 483–517. doi:10.1145/99935.99938. http://portal.acm.org/citation.cfm?id=99935.99938&coll=DL&dl=ACM&CFID=36101954&CFTOKEN=18170009. especially see Section 2.7.
 ^ See, e.g., David J. Ketchen, Jr & Christopher L. Shook (1996). "The application of cluster analysis in Strategic Management Research: An analysis and critique". Strategic Management Journal 17 (6): 441–458. doi:10.1002/(SICI)10970266(199606)17:6<441::AIDSMJ819>3.0.CO;2G. http://www3.interscience.wiley.com/cgibin/fulltext/17435/PDFSTART.
 ^ See, e.g., Figure 6 in
 Cyril Goutte, Peter Toft, Egill Rostrup, Finn Årup Nielsen, Lars Kai Hansen (March 1999). "On Clustering fMRI Time Series". NeuroImage 9 (3): 298–310. doi:10.1006/nimg.1998.0391. PMID 10075900.
 ^ Robert L. Thorndike (December 1953). "Who Belong in the Family?". Psychometrika 18 (4).
 ^ Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot & Egill Rostrup (2001). "FeatureSpace Clustering for fMRI MetaAnalysis". Human Brain Mapping 13 (3): 165–183. doi:10.1002/hbm.1031. PMID 11376501. http://www3.interscience.wiley.com/cgibin/fulltext/82002382/. see especially Figure 14 and appendix.
 ^ Catherine A. Sugar and Gareth M. James (2003). "Finding the number of clusters in a data set: An information theoretic approach". Journal of the American Statistical Association 98 (January): 750–763.
 ^ Peter J. Rousseuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics 20: 53–65. doi:10.1016/03770427(87)901257.
 ^ R. Lleti, M.C. Ortiz, L.A. Sarabia, M.S. Sánchez (2004). "Selecting Variables for kMeans Cluster Analysis by Using a Genetic Algorithm that Optimises the Silhouettes". Analytica Chimica Acta 515: 87–100. doi:10.1016/j.aca.2003.12.020.
 ^ See e.g. "Finding the Right Number of Clusters in kMeans and EM Clustering: vFold CrossValidation". Electronic Statistics Textbook. StatSoft. 2010. http://www.statsoft.com/textbook/clusteranalysis/#vfold. Retrieved 20100503.
 ^ Honarkhah, M and Caers, J, 2010, Stochastic Simulation of Patterns Using DistanceBased Pattern Modeling, Mathematical Geosciences, 42: 487  517
 Ralf Wagner, Sören W. Scholz, Reinhold Decker (2005): The Number of Clusters in Market Segmentation, in: Daniel Baier, Reinhold Decker; Lars SchmidtThieme (Eds.): Data Analysis and Decision Support, Berlin, Springer, 157  176.
Categories:
Wikimedia Foundation. 2010.
Look at other dictionaries:
Data Intensive Computing — is a class of parallel computing applications which use a data parallel approach to processing large volumes of data typically terabytes or petabytes in size and typically referred to as Big Data. Computing applications which devote most of their … Wikipedia
Clustering highdimensional data — is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of… … Wikipedia
Anaxagoras and the atomists — C.C.W.Taylor ANAXAGORAS In the course of the fifth century BC the political and cultural pre eminence of Athens attracted to the city a considerable number of intellectuals of various kinds from all over the Greek world. This phenomenon, the so… … History of philosophy
kmeans clustering — In statistics and data mining, k means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results into a partitioning of… … Wikipedia
File Allocation Table — For other uses, see Fat (disambiguation). FAT Developer Microsoft Full Name File Allocation Table FAT12 (12‑bit version) FAT16/FAT16B (16‑bit versions) FAT32 (32‑bit version with 28 bits used) Introduced … Wikipedia
chemical bonding — ▪ chemistry Introduction any of the interactions that account for the association of atoms into molecules, ions, crystals, and other stable species that make up the familiar substances of the everyday world. When atoms approach one another … Universalium
Message Passing Interface — MPI, the Message Passing Interface, is standardized and portable message passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and… … Wikipedia
Cluster analysis (in marketing) — Cluster analysis is a class of statistical techniques that can be applied to data that exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a group of relatively homogeneous cases or … Wikipedia
Raymond Cattell — Infobox Person name = Raymond Cattell image size = 150px caption = Raymond Cattell birth name = birth date = 20 March, 1905 birth place = death date = 2 February, 1998 death place = death cause = resting place = resting place coordinates =… … Wikipedia
Middle Chinese — 中古漢語 Spoken in China Region Medieval China Extinct Evolved into Proto Mandarin and other Chinese dialects apart from Min … Wikipedia