Correlation clustering

In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a signed graph G = (V,E) where the edge label indicates whether two nodes are similar (+) or dissimilar (−), the task is to cluster the vertices so that similar objects are grouped together. Unlike other clustering algorithms, this approach does not require choosing the number of clusters k in advance, because the objective of minimizing the disagreements is independent of the number of clusters.

It may not be possible to find a perfect clustering, in which all similar items are in the same cluster and all dissimilar ones are in different clusters. If the graph does admit a perfect clustering, then simply deleting all the negative edges and finding the connected components of the remaining graph returns the required clusters.
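The recovery step for a perfectly clusterable graph can be sketched in a few lines of Python. This is an illustrative helper, not code from the cited papers; the function name and the edge representation (pairs of vertex labels for the + edges) are assumptions:

```python
from collections import defaultdict

def perfect_clustering(nodes, pos_edges):
    """Keep only the + edges and return the connected components.

    When the signed graph admits a perfect clustering, these components
    are exactly its clusters. `nodes` is an iterable of vertices and
    `pos_edges` an iterable of (u, v) pairs labelled +.
    """
    adj = defaultdict(set)
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search over positive edges only.
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

If the graph is not perfectly clusterable, this still returns a clustering, but it no longer certifies zero disagreements.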

In general, however, a graph may not have a perfect clustering. For example, given nodes a, b, c such that the pairs a,b and a,c are similar while b,c is dissimilar, a perfect clustering is not possible. In such cases, the task is to find a clustering that either maximizes the number of agreements (the number of + edges inside clusters plus the number of − edges between clusters) or minimizes the number of disagreements (the number of − edges inside clusters plus the number of + edges between clusters). Maximizing the agreements is NP-complete: the multiway cut problem reduces to maximizing weighted agreements, and the problem of partitioning into triangles[1] can be reduced to the unweighted version.
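The disagreement objective for a candidate clustering can be computed directly from its definition. The following sketch is illustrative (the function name and the representation of a clustering as a list of vertex sets are assumptions, not taken from the literature):

```python
def disagreements(clusters, pos_edges, neg_edges):
    """Count disagreements of a clustering on a signed graph.

    A disagreement is a + edge between clusters or a - edge inside a
    cluster. `clusters` is a list of disjoint sets of vertices covering
    every endpoint; edges are (u, v) pairs.
    """
    label = {v: i for i, c in enumerate(clusters) for v in c}
    same = lambda u, v: label[u] == label[v]
    # + edges crossing clusters plus - edges inside clusters.
    return (sum(1 for u, v in pos_edges if not same(u, v)) +
            sum(1 for u, v in neg_edges if same(u, v)))
```

On the a, b, c example above, both putting all three nodes in one cluster and splitting off c yield one disagreement, which is the best possible, so the optimum here is nonzero.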

Bansal et al.[2] discuss the NP-completeness proof and also present both a constant-factor approximation algorithm and a polynomial-time approximation scheme to find the clusters in this setting. Ailon et al.[3] propose a randomized 3-approximation algorithm for the same problem:

CC-Pivot(G = (V, E+, E−))

   Pick a random pivot i ∈ V
   Set C = {i}, V′ = Ø
   For all j ∈ V, j ≠ i:
       If (i,j) ∈ E+ then
            Add j to C
       Else (i.e. (i,j) ∈ E−)
            Add j to V′
   Let G′ be the subgraph induced by V′
   Return clustering C, CC-Pivot(G′)

The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering.
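The pseudocode translates almost line for line into Python. In this sketch, edges are represented as sets of unordered vertex pairs (frozensets), a choice made here for illustration; every vertex pair is assumed to carry exactly one of the two labels:

```python
import random

def cc_pivot(vertices, pos, neg):
    """Randomized pivot clustering, following the CC-Pivot pseudocode.

    `pos` and `neg` are sets of frozenset({u, v}) giving the + and -
    edges. Returns a list of disjoint vertex sets covering `vertices`.
    """
    vertices = list(vertices)
    if not vertices:
        return []
    i = random.choice(vertices)          # pick a random pivot i ∈ V
    cluster = {i}                        # C = {i}
    rest = []                            # V′
    for j in vertices:
        if j == i:
            continue
        if frozenset((i, j)) in pos:     # (i, j) ∈ E+ : j joins the pivot
            cluster.add(j)
        else:                            # (i, j) ∈ E− : defer to recursion
            rest.append(j)
    return [cluster] + cc_pivot(rest, pos, neg)
```

Because each non-pivot vertex is placed either with the pivot or into V′, the returned clusters always partition the vertex set; the 3-approximation guarantee holds in expectation over the random pivot choices.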

Correlation clustering (data mining)

Correlation clustering also refers to a different task, in which correlations among the attributes of feature vectors in a high-dimensional space are assumed to exist and to guide the clustering process. These correlations may differ in different clusters, so a global decorrelation cannot reduce this problem to traditional (uncorrelated) clustering.

Correlations among subsets of attributes result in different spatial shapes of clusters. Hence, the similarity between cluster objects is defined by taking the local correlation patterns into account. With this notion, the term was introduced in [4] simultaneously with the notion discussed above. Different methods for this type of correlation clustering are discussed in [5], and its relationship to other types of clustering is discussed in [6]; see also clustering high-dimensional data.

Correlation clustering (according to this definition) can be shown to be closely related to biclustering. As in biclustering, the goal is to identify groups of objects that share a correlation in some of their attributes, where the correlation is usually specific to the individual clusters.

References

  1. ^ Garey, M. and Johnson, D. (2000). Computers and Intractability: A Guide to the Theory of NP-Completeness.
  2. ^ Bansal, N., Blum, A. and Chawla, S. (2004). "Correlation Clustering". Machine Learning (Special Issue on Theoretical Advances in Data Clustering). pp. 86–113. doi:10.1023/B:MACH.0000033116.57574.95.
  3. ^ Ailon, N., Charikar, M. and Newman, A. (2005). "Aggregating inconsistent information: ranking and clustering". STOC '05: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing. pp. 684–693. doi:10.1145/1060590.1060692.
  4. ^ Böhm, C., Kailing, K., Kröger, P. and Zimek, A. (2004). "Computing Clusters of Correlation Connected Objects". Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD'04), Paris, France. pp. 455–467. doi:10.1145/1007568.1007620.
  5. ^ Zimek, A. (2008). Correlation Clustering. http://edoc.ub.uni-muenchen.de/8736/.
  6. ^ Kriegel, H.-P., Kröger, P. and Zimek, A. (March 2009). "Clustering High Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering". ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1): 1–58. doi:10.1145/1497577.1497578.

Wikimedia Foundation. 2010.


