Graph-based data clustering via multiscale community detection

We present a graph-theoretical approach to data clustering, which combines the creation of a graph from the data with Markov Stability, a multiscale community detection framework. We show how the multiscale capabilities of the method allow the estimation of the number of clusters and alleviate the sensitivity to the parameters of the graph construction. We use both synthetic and benchmark real datasets to compare and evaluate several graph construction methods and clustering algorithms, and show that multiscale graph-based clustering achieves improved performance compared to popular clustering methods without the need to set the number of clusters externally.


Introduction
Clustering is a classic task in data mining, whereby input data are organised into groups (or clusters) such that data points within a group are more similar to each other than to those outside the group [1]. Such a task is distinct from supervised (or semi-supervised) classification, where examples of the different classes are known a priori and are used to train a computational model to assign other objects to the known groups. Instead, clustering aims to find natural, intrinsic sub-classes in the data, without assuming a priori the number or type of clusters. Indeed, a key open issue in this field is the principled determination of the number of clusters in an unsupervised manner, without the assumption of a generative model [2,3]. The obtained groups can then constitute the basis for a simpler, yet informative, representation of large, complex datasets.
Data clustering has a long history and there exist a myriad of clustering algorithms based on different principles and heuristics [4]. In their most basic form, many popular clustering techniques (e.g., k-means [5] and mixture models [6]) are based on the assumption that the data follows an explicit (typically multivariate Gaussian) distribution. Clusters are then defined as the samples most likely generated from the same distribution, and learned by likelihood maximisation. However, in real applications, the model that generates the data is unknown and the resulting data distribution may be complex. In this case of data-driven analysis, model-based clustering often yields poor results [7,8].
An alternative approach is provided by spectral clustering, which uses the eigenvectors of a (normalised) similarity matrix derived from the data to find relevant subgroups in the dataset [8,9]. Spectral clustering is underpinned by results in matrix analysis (e.g., singular value decomposition), and has strong connections to model reduction, geometric projections and dimensionality reduction [9,10]. The choice of similarity measure is a crucial ingredient to the clustering performance but, as long as a similarity matrix can be computed, spectral methods provide an attractive choice for non-vector data or for data sampled from irregular and non-convex data manifolds [11,12].
From a different perspective, the similarity matrix of a dataset can also be viewed as the adjacency matrix of a fully connected, weighted graph, where the nodes correspond to data points and the edge between two nodes is weighted by their similarity. One can then apply graph-based algorithms for community detection or graph partitioning to the problem of data clustering. Graph-based methods typically operate by searching for balanced graph cuts, sometimes invoking notions from spectral graph theory, i.e., using the spectral decomposition of the adjacency or Laplacian matrices of the graph [13,14]. Spectral clustering can thus be understood as a special case of the broader class of graph-based clustering methods [10]. Importantly, graph-based clustering is also able to reveal modular structure in graphs across levels of resolution through multiscale community detection [15,16,17]. This approach allows for the discovery of natural data clusterings of different coarseness [18], thus recasting the problem of finding the appropriate number of clusters to the detection of relevant scales in the graph.
Methods for graph construction usually involve a sparsification of the similarity (or distance) matrix under different heuristics (from simple thresholding to sophisticated regularisations) in order to extract a similarity graph that preserves key properties of the dataset [19]. The representation of data through graphs has attractive characteristics, including the capability of capturing efficiently the local and global properties of the data through graph-theoretical concepts that embody naturally the notions of local neighbourhoods, paths, and global connectivity [20,21,16]. The use of graphs provides a natural link between spectral clustering and other clustering methods, and allows for easy generalisation to a semi-supervised setting [22,23]. Graphs also provide a means to capture the geometry of complex manifolds, a feature of interest in realistic datasets [24]. Graph representations not only reduce the computational cost of spectral graph methods, but also allow us to use techniques developed for complex networks as an alternative means to address problems in data clustering. However, it has been shown that both the method of graph construction and the choice of method parameters (i.e., sparsity) have a strong impact on the performance of graph-based clustering methods [25].
Here, we study the use of multiscale community detection applied to similarity graphs extracted from data for the purpose of unsupervised data clustering. The basic idea of graph-based clustering is shown schematically in Figure 1. Specifically, we focus on the problem of assessing how to construct graphs that appropriately capture the structure of the dataset, with the aim of being used within a multiscale graph-based clustering framework. In particular, we carry out an empirical study of different graph construction methods used in conjunction with Markov Stability (MS), a dynamics-based framework for multiscale community detection on graphs [17,26]. MS allows for the unsupervised identification of communities at different levels of resolution, and has been applied successfully to a variety of problems, including protein structures [27,28], airport networks [16], social networks [29] and neuronal network analyses [30]. We evaluate several geometric graph constructions, from methods that use only local distances to others that balance local and global measures, and find that the recently proposed Continuous k-nearest Neighbours (CkNN) graph [31] performs well for graph-based data clustering via community detection. We then show how the capability of Markov Stability to scan across scales can be exploited to deliver robust clusterings, reducing the sensitivity to the parameters of the graph construction; in other words, a range of parameters in the graph construction lead to good clustering performance. We validate our graph-based clustering approach on real datasets and compare its performance to several other popular clustering methods, including k-means, mixture models, spectral clustering and hierarchical clustering [32].
The rest of the paper is structured as follows. We first introduce several methods for graph construction, apply them to nine public datasets with ground truths, and evaluate the performance of graph-based data clustering on the ensuing similarity graphs. We then describe briefly the Markov Stability framework for multiscale community detection, and use a synthetic example dataset to illustrate how the multi-resolution clustering reduces the sensitivity to graph construction parameters. Finally, we validate the Markov Stability graph-based clustering through comparisons with other clustering methods on real datasets.

[Figure 1: Schematic of graph-based data clustering: data points are turned into a similarity graph (graph construction), which is then partitioned into clusters of points (graph partitioning).]

Graph construction methods for data clustering
Let us consider a dataset consisting of n samples $\{y_i\}_{i=1}^{n}$, where each sample $y_i$ is a d-dimensional vector. We will assume that we can define a measure of pairwise dissimilarity between the samples: $d(i,j) \geq 0$. In some cases, the vectors will be defined in a metric space, and the dissimilarity will be a true distance $d(i,j) = \|y_i - y_j\| \geq 0$. Other dissimilarity measures, such as the cosine distance, can also be used depending on the application. In the examples below, we will restrict ourselves to Euclidean distances as the measure of dissimilarity.
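To fix the notation in computational terms, the following minimal sketch (in Python, with illustrative toy data) computes the pairwise distance matrix that all of the graph constructions below take as input:

```python
# Pairwise Euclidean distances d(i, j) = ||y_i - y_j|| for a toy dataset;
# D is the n x n symmetric distance matrix with zero diagonal.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 5))                   # n = 100 samples in d = 5 dimensions
D = squareform(pdist(Y, metric="euclidean"))    # D[i, j] = ||y_i - y_j||
```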
The aim of transforming the data into a graph is to capture the geometry of the data through inherent features of the graph topology or through processes taking place on the graph. There exist a variety of ways to construct a graph from a high-dimensional dataset, invoking different principles. Here we focus on geometric graphs and examine two of the most widely used methods (the $\epsilon$-ball graph and the k-nearest neighbour (kNN) graph) and three recent methods (the continuous k-nearest neighbours (CkNN) graph [31], the perturbed minimum spanning tree (PMST) [33] and the relaxed minimum spanning tree (RMST) [21]). These methods can be broadly ascribed to two categories depending on how they use the geometry of the data, as follows.

Neighbourhood based methods: $\epsilon$-ball, kNN and CkNN graphs
The idea of neighbourhood based methods is to connect two nodes if they are local neighbours, as given by their pairwise distance $d(i,j)$. The two simplest and most popular ways to construct a graph from pairwise distances are the $\epsilon$-ball graph and the k-nearest neighbour (kNN) graph: in the $\epsilon$-ball graph, any two points at a distance smaller than $\epsilon$ are connected; in the kNN graph, every point is connected to its k nearest neighbours. These two methods capture the local information of the data but are highly sensitive to the parameters $\epsilon$ or k [25]. The parameter $\epsilon$ is usually set according to the density of the data points, but in many datasets the data points are not uniformly distributed. A sketch of both constructions is given below.
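As a concrete illustration, here is a minimal sketch of the two constructions (the function names are ours, chosen for this example; both operate on the distance matrix D computed earlier):

```python
import numpy as np

def epsilon_ball_graph(D, eps):
    """Connect any two points at distance smaller than eps (no self-loops)."""
    A = (D < eps) & (D > 0)
    return A.astype(int)

def knn_graph(D, k):
    """Connect every point to its k nearest neighbours, then symmetrise."""
    n = D.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        neighbours = np.argsort(D[i])[1:k + 1]   # index 0 is the point itself
        A[i, neighbours] = 1
    return np.maximum(A, A.T)                    # keep edge if either endpoint selects it
```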
A recently proposed method that can resolve this problem is the continuous k-nearest neighbours (CkNN) graph [31]. If $d(i,j)$ is the distance between sample i and sample j, and $d_k(i)$ is the distance between sample i and its k-th nearest neighbour, the CkNN graph is constructed by connecting sample i and sample j if

$$ d(i,j) < \delta \sqrt{d_k(i)\, d_k(j)}, \qquad (1) $$

where $\delta$ is a positive parameter that controls the sparsity of the graph. Through this construction, the topology of the CkNN graph captures the geometric features of the data, with the additional consistency that the CkNN graph Laplacian converges to the Laplace-Beltrami operator in the limit of large data [31]. It should be kept in mind that in CkNN the distance $d(i,j)$ must be a metric to ensure geometrical consistency, whereas kNN graphs can be generated from any dissimilarity measure (i.e., one does not need a true distance) since only the ranking of node closeness matters.
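Condition (1) translates directly into code; the following sketch (our own helper, following the same conventions as above) implements it:

```python
import numpy as np

def cknn_graph(D, k, delta=1.0):
    """CkNN construction, Eq. (1): connect i and j if
    d(i, j) < delta * sqrt(d_k(i) * d_k(j))."""
    d_k = np.sort(D, axis=1)[:, k]   # distance to the k-th nearest neighbour
                                     # (column 0 is the zero self-distance)
    A = D < delta * np.sqrt(np.outer(d_k, d_k))
    np.fill_diagonal(A, False)
    return A.astype(int)
```

Note how the local scales $d_k(i)$ and $d_k(j)$ act as a variable bandwidth: in dense regions the connection threshold shrinks and in sparse regions it grows, which is what makes the construction robust to non-uniform sampling.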

Minimum spanning tree based methods: PMST and RMST graphs
A different class of approaches for graph construction attempts to capture the global geometry of the overall dataset by constructing graphs based on measures of global connectivity of the ensuing graph. A popular way to ensure such global connectivity is through the minimum spanning tree (MST) [34], as follows. If we consider the matrix of all pairwise distances $d(i,j)$ as the adjacency matrix of a weighted, fully connected graph, the MST is the subgraph such that all the nodes are path connected and the sum of edge weights is minimised. In other words, the MST provides a graph that connects all the points in the dataset with minimal global distance. MST-based approaches can thus capture the geometry of inhomogeneously sampled data points in a high-dimensional space, since the MST contains not only local but also global features of the dataset. In its simplest form, the MST is sometimes added to sparse neighbourhood graphs as a means to guarantee global connectivity of the dataset, i.e., the final graph is the union of the MST and a kNN graph with a small k. These schemes, which are sometimes referred to as MST+kNN graphs, are the ones we adopt by default in our neighbourhood constructions. However, the global properties of the MST can be exploited to generate MST-based graphs from data with distinct properties. We have explored here the use of two such MST-based algorithms: the perturbed minimum spanning tree (PMST) [33] and the relaxed minimum spanning tree (RMST) [21,35].
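The default MST+kNN union is straightforward to sketch, reusing the knn_graph helper from above (SciPy provides the MST computation):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_knn_graph(D, k):
    """Union of the MST and a small-k kNN graph, guaranteeing connectivity."""
    mst = minimum_spanning_tree(D).toarray() > 0   # tree edges (upper triangular)
    mst = mst | mst.T                              # symmetrise
    return np.maximum(mst.astype(int), knn_graph(D, k))
```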
In PMST [33], each data point $y_i$ is perturbed by a small amount of noise of standard deviation $s_i = r\, d_k(i)$ with $r \in [0, 1]$, where $d_k(i)$ is used as an estimate of the local noise and r is a parameter controlling the level of noise. The MST is then computed for each realisation of the perturbed data, and the process is repeated to generate an ensemble of MSTs. The PMST graph is given by the union of all the perturbed MSTs, plus the original MST. The intuition behind this algorithm is that random perturbations of the points in high-dimensional space will induce changes inhomogeneously in different parts of the MST, depending on how globally important certain edges of the graph are, i.e., globally important edges will be consistently captured across all MSTs in the perturbed ensemble. One limitation of this algorithm is its heavy computational demand, since both the distance matrix and the MST need to be computed for each random realisation. The computational burden makes it impractical to sweep over the parameters of the PMST, so we fix r = 0.5 and k = 1 in this paper.
RMST [35] proposes a different heuristic for estimating an MST-based graph with lower computational cost. Note that any two nodes are connected by a unique path in the MST; we denote the longest edge on this path by $d^{\max}_{\mathrm{path}(i,j)}$. In RMST, the samples $y_i$ and $y_j$ are connected if

$$ d(i,j) < d^{\max}_{\mathrm{path}(i,j)} + \gamma \left( d_k(i) + d_k(j) \right), \qquad (2) $$

where $\gamma > 0$ is a parameter that weights the local density (measured by the distances to the k-th neighbours of $y_i$ and $y_j$) against a global property (the maximum distance found on the MST path linking samples i and j). Similarly to the parameter $\delta$ in CkNN, the value of $\gamma$ controls the sparsity of the resulting graph. The RMST construction was proposed as a means to reconstruct data that have been inhomogeneously sampled from continuous manifolds, and has been shown to provide a good description of datasets when preserving a measure of continuity (due to temporal or parametric changes) is important [21].
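A possible implementation of criterion (2) is sketched below; the quadratic-time traversal that fills $d^{\max}_{\mathrm{path}(i,j)}$ exploits the uniqueness of MST paths, and the helper name is again ours:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def rmst_graph(D, k, gamma):
    """RMST sketch, Eq. (2): connect i, j if
    d(i, j) < d_max_path(i, j) + gamma * (d_k(i) + d_k(j))."""
    n = D.shape[0]
    mst = minimum_spanning_tree(D).toarray()
    mst = np.maximum(mst, mst.T)               # symmetric weighted MST

    # d_max[i, j]: largest MST edge on the unique path from i to j,
    # filled by a depth-first traversal rooted at every node i.
    d_max = np.zeros((n, n))
    for i in range(n):
        stack, visited = [i], {i}
        while stack:
            u = stack.pop()
            for v in np.nonzero(mst[u])[0]:
                if v not in visited:
                    visited.add(v)
                    d_max[i, v] = max(d_max[i, u], mst[u, v])
                    stack.append(v)

    d_k = np.sort(D, axis=1)[:, k]             # distance to the k-th neighbour
    A = D < d_max + gamma * (d_k[:, None] + d_k[None, :])
    np.fill_diagonal(A, False)
    return A.astype(int)
```

Since $d^{\max}_{\mathrm{path}(i,j)}$ equals $d(i,j)$ for edges already in the MST, the MST itself is always contained in the RMST graph, which preserves global connectivity.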
In Figure 2, we use a synthetic dataset of four groups of points sampled from a geometric structure (a noisy circle and its centre) to illustrate the different graph construction schemes [36]. The graph representations allow us to gain intuition about the suitability of the different graph constructions for clustering, and the effect of their parameters. For instance, the sparsity of the RMST graph is controlled by the parameter γ. Note that RMST gives a similar graph to PMST with much lower computational cost. Note also that, for the same number of neighbours k, the CkNN gives a sparser graph than the kNN. To make CkNN and kNN comparable, we thus fix the parameter δ = 1 in CkNN and vary k.
As discussed above, the MST-based methods are not optimised for clustering but are aimed at manifold learning. It follows from (2) that, in the RMST (and PMST) graphs, an edge is more likely to appear between nodes i and j if the longest edge on the MST path between them is large. This feature makes the geodesic distance on the graph a good approximation to the true distance in the underlying space. Hence both RMST and PMST are closer to manifold learning approaches such as Isomap [20]. In contrast, the kNN and CkNN graphs tend to have better modular structure and hence appear as potentially more suitable for graph partitioning. We examine these issues in detail in Section Tests on benchmark real datasets below.

Markov Stability for graph-based clustering
Let us consider a graph representing the dataset. The unweighted, undirected graph with n nodes is encoded by its adjacency matrix A, where $A_{ij} = 1$ if there is an edge connecting nodes i and j and $A_{ij} = 0$ otherwise. The degrees of the nodes are summarised in the degree vector d, with $d_i = \sum_{j=1}^{n} A_{ij}$, and we also define the diagonal degree matrix D with $D_{ii} = d_i$. The total number of edges of the network is $m = \frac{1}{2}\sum_{i,j} A_{ij}$. We then apply multiscale community detection to extract relevant subgraphs in an unsupervised manner using the framework of Markov Stability.

Multiscale community detection with Markov Stability
Markov Stability is a quality measure for community detection which adopts a dynamical perspective to unfold relevant structures in the graph at all scales, as revealed by a diffusion process [15,16,17,37]. Consider a continuous-time Markov process on the graph governed by the dynamics $\dot{p} = -p(I - M)$, where p is an n-dimensional row vector defined on the nodes and $M = D^{-1}A$ is the one-step random walk transition matrix. This Markov process has a unique stationary distribution $\pi = d^T/2m$. Let us denote the autocovariance matrix of this process by $B(t) = \Pi P(t) - \pi^T \pi$, where $\Pi = D/2m$ encodes the stationary distribution and $P(t) = \exp(-t(I - M))$ is the transition matrix. Given a partition g of the nodes into c non-overlapping groups, denoted by $g = \{g_1, g_2, \ldots, g_c\}$, the Markov Stability of g is defined as:

$$ r(t, g) = \sum_{s=1}^{c} \sum_{i,j \in g_s} B_{ij}(t). \qquad (3) $$

A partition has a high value of (3) if the probability of finding a random walker at time t within the group where it started at t = 0 is higher than that expected by mere chance. In this sense, Markov Stability is a quality function for partitions of a graph, and the objective is therefore to find the partitions that achieve high values of the Markov Stability as a function of t:

$$ r^*(t) = \max_g \; r(t, g), \qquad (4) $$

achieved by the partition $g^*(t)$.
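Under the definitions above, the autocovariance B(t) and the Markov Stability of a given partition can be computed directly; the following sketch (with the partition encoded as an array of community labels) follows the formulas verbatim:

```python
import numpy as np
from scipy.linalg import expm

def markov_stability(A, partition, t):
    """r(t, g) of Eq. (3): sum of the autocovariance B(t) over node pairs
    that share a community in `partition` (an array of labels)."""
    d = A.sum(axis=1).astype(float)            # degree vector
    m = d.sum() / 2.0                          # number of edges
    M = A / d[:, None]                         # transition matrix D^{-1} A
    pi = d / (2.0 * m)                         # stationary distribution
    P_t = expm(-t * (np.eye(len(d)) - M))      # P(t) = exp(-t (I - M))
    B = np.diag(pi) @ P_t - np.outer(pi, pi)   # B(t) = Pi P(t) - pi^T pi

    r = 0.0
    for c in np.unique(partition):
        idx = np.where(partition == c)[0]
        r += B[np.ix_(idx, idx)].sum()
    return r
```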
The (time) parameter t is the so-called Markov time, and can be viewed as the resolution parameter that leads to multiscale community detection: as t grows, the partitions of the graph become coarser. Computationally, the Markov Stability is optimised at different Markov times through a version of the Louvain algorithm [38]. Through the computational maximisation (4), Markov Stability detects optimised partitions $g^*(t)$ at all scales, parameterised by the value of t. However, we are interested in finding robust partitions and robust scales, in the sense that a partition is found to optimise MS over a long interval of Markov time. We thus compute the dissimilarity between the partitions obtained at different times t and t':

$$ VI(t, t') = VI\big(g^*(t), g^*(t')\big), \qquad (5) $$

where we use the variation of information (VI) [39] as the metric of dissimilarity between partitions. If $g^*(t)$ is a robust partition, the partition $g^*(t')$ found at a Markov time t' close to t should be very similar, and hence VI(t, t') will be small. We therefore look for large diagonal blocks of small values in the VI(t, t') matrix. Such blocks correspond to a robust scale with an associated robust partition.
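The variation of information between two partitions is itself simple to compute from the joint distribution of their labels; a self-contained sketch:

```python
import numpy as np

def variation_of_information(g1, g2):
    """VI(g1, g2) = H(g1) + H(g2) - 2 I(g1; g2) for two label arrays [39]."""
    n = len(g1)
    labels1, labels2 = np.unique(g1), np.unique(g2)
    joint = np.array([[np.sum((g1 == a) & (g2 == b)) for b in labels2]
                      for a in labels1]) / n
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    mutual = sum(joint[i, j] * np.log(joint[i, j] / (p1[i] * p2[j]))
                 for i in range(len(p1)) for j in range(len(p2))
                 if joint[i, j] > 0)
    return entropy(p1) + entropy(p2) - 2.0 * mutual
```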
As an additional feature, we look for optimised partitions that are also robust to the Louvain optimisation. Since the Louvain method is a greedy algorithm dependent on the random initialisation, the consistency of the output of the algorithm can be used as an indicator of the robustness of the solution. At each t, we run the Louvain optimisation multiple times; if the Markov time corresponds to a robust scale, the output partition should always be the same. We therefore expect a low value of the average variation of information of the optimised partitions at time t:

$$ VI(t) = \frac{1}{n_L (n_L - 1)} \sum_{s \neq s'} VI\big(g_s(t), g_{s'}(t)\big), \qquad (6) $$

where the Louvain algorithm is run $n_L$ times and $g_s(t)$ denotes the partition found in the s-th run. Depending on the structure of the graph, several such robust scales and associated graph partitions might be found, which can then be used as the basis of unsupervised data clustering.

Using Markov Stability for data clustering
We illustrate the application of MS to data clustering through the synthetic dataset in Figure 3. The example dataset has geometric structure and is designed to have two scales, so that it can be divided into 3 big clusters or 9 small clusters. First, we construct an unweighted CkNN graph (k = 7, δ = 1.8) and apply MS as described above. We optimise the Markov Stability (3) for $n_T$ Markov times $t \in [1, 1000]$, and at each t, we run the Louvain algorithm $n_L = 500$ times. For each Markov time, we record the partition with the maximal Markov Stability, $g^*(t)$, and the average dissimilarity of the partitions found in the $n_L$ optimisations, VI(t) (6). Once the scan across Markov time is completed, we also compute VI(t, t'), the matrix recording the dissimilarity of the optimal partitions found across the scan; a schematic of this pipeline is sketched below. Note that with Markov Stability, there is a range of δ for the CkNN graph (k = 7) that can reveal the multiscale structure of the data (see Fig. 5).
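In pseudocode terms, the scan can be organised as follows. Here maximise_markov_stability is a hypothetical stand-in for the Louvain-based maximisation of r(t, g) (we do not show its internals), and the helpers from the previous sketches are reused; in practice one would subsample the pairs in the VI(t) average for speed:

```python
import numpy as np
from itertools import combinations

t_grid = np.logspace(0, 3, 50)   # Markov times spanning [1, 1000]
n_L = 500                        # Louvain runs per Markov time

best, vi_t = [], []
for t in t_grid:
    runs = [maximise_markov_stability(A, t) for _ in range(n_L)]  # hypothetical optimiser
    scores = [markov_stability(A, g, t) for g in runs]
    best.append(runs[int(np.argmax(scores))])                     # g*(t)
    # average pairwise dissimilarity of the n_L optimised partitions, Eq. (6)
    vi_t.append(np.mean([variation_of_information(g, h)
                         for g, h in combinations(runs, 2)]))

# cross-time dissimilarity matrix VI(t, t') of the optimal partitions, Eq. (5)
VI = np.array([[variation_of_information(g, h) for h in best] for g in best])
```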
The results are presented in Figure 3, where, as a function of Markov time t, we plot the number of communities in the optimal partition $g^*(t)$; the optimised Markov Stability $r^*(t)$ (3); the average dissimilarity due to algorithmic variability VI(t); and the dissimilarity of partitions across time given by the VI(t, t') matrix. The diagonal blocks of low values of VI(t, t') (which also correspond to plateaux in the number of communities) and the low values (or dips) of VI(t) suggest that there are two relevant scales in this graph, which correspond to a finer clustering into 9 groups (at small t) and a partition into 3 groups (at larger t > 200). The inset shows that the clustering recovers the planted groups of this synthetic example.
The robustness of the partitions found across scales is further examined in Figure 4. To understand how the graph partitions evolve with t, we compute the VI metric between all the partitions found across the Markov time scan ($n_L \times n_T$ partitions) and project them onto a low-dimensional space using multidimensional scaling (MDS). In Figure 4, we plot the first MDS coordinate of each partition as a function of t, coloured by its frequency of appearance. Several partitions can coexist at a given Markov time, but our numerics show that the two robust partitions (c = 9 and c = 3) have a long-lived, high frequency of appearance when t matches the corresponding resolution. Between c = 9 and c = 3, other partitions of lesser robustness appear through mergers of clusters one by one, as shown by the Sankey diagram in Figure 4, and only exist for short Markov times until the robust partition with c = 3 appears.

Scanning across Markov time reduces the sensitivity to graph construction
Graph-based clustering performance is sensitive to the parameters of the graph construction method, which modulate sparsity, but there is no easy way to select the best parameter if the ground truth is unknown. In practice, the parameters are usually set empirically, with little guarantee that the chosen values will lead to good clustering results for a particular dataset. Within the Markov Stability framework, we can use the robustness provided by scanning across Markov time to reduce the sensitivity to the details of graph construction, thus improving the reliability of the detected clusters.
To illustrate this idea, we use the same dataset as in Figure 3 and construct CkNN (k = 7) graphs with different values of δ = 1.5, 1.8, 2.4. Our numerics show that Markov Stability detects the relevant underlying scales of the data (c = 9 and c = 3) for all these values of δ, as shown by the long diagonal blocks of low VI(t, t') and the low values of VI(t) in Figure 5. Although the degree of the CkNN graph varies markedly with the parameter δ, the two significant scales are identified by scanning across the Markov time. Hence the scanning across scales inherent to multiscale community detection provides additional robustness to the parameters of the graph construction algorithm.

Tests on benchmark real datasets
We have tested several graph-based clustering approaches (both graph constructions and clustering methods) using nine benchmark datasets from the UCI repository (see Table 1 for a summary of attributes) [40]. All the datasets have ground truth labels, which we use to validate the results of the different methods.

Comparison between graph constructions
Starting from the Euclidean distance $d(i,j) = \|y_i - y_j\|_2$, we generated geometric graphs from each of the nine UCI datasets using the five graph construction methods described in Section Graph construction methods for data clustering. If the constructed graph is disconnected, the MST is added to the graph to ensure connectedness. Each graph was analysed using Markov Stability to obtain optimised partitions at all scales, and we selected the partition closest to the ground truth, as measured by the normalised mutual information (NMI) [41]. The computed NMI serves as a quality index for the graph construction under clustering. We also compute the adjusted Rand index (ARI) [42] as an additional quality index. The results of the comparison are shown in Table 2. The kNN and $\epsilon$-ball graphs, both widely used in machine learning and based on local neighbourhoods, give good results for a range of k. (Note that $\epsilon$ is set to the average of the distances to the 7-th neighbour, $d_7(i)$.) For the MST-based methods, RMST achieves better performance for sparser graphs (with smaller γ), when the cluster structure is not obscured by the objective of manifold reconstruction (Fig. 2). The same applies to the PMST graph. The empirical tests show that the CkNN graph gives the best average results over the nine datasets for k = 7 and above. Hence we adopt the CkNN graph with k in the range of 7 to 12 as a good choice for graph-based clustering.
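Both quality indices are available in standard libraries; a short sketch of the scoring step (here labels_true holds the ground-truth labels and best the optimised partitions $g^*(t)$ from the Markov time scan, as in the earlier sketches):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

nmi_scores = [normalized_mutual_info_score(labels_true, g) for g in best]
g_best = best[int(np.argmax(nmi_scores))]   # partition closest to ground truth

print("NMI:", max(nmi_scores))
print("ARI:", adjusted_rand_score(labels_true, g_best))
```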
The CkNN graph can be understood as using a variable-bandwidth diffusion kernel [43], where the bandwidth is inversely proportional to the sampling density; this allows for uniform estimation errors over the underlying manifold, whereas graph constructions using a fixed-bandwidth kernel incur large errors in regions of low sampling density. Through such a construction, the graph Laplacian of the CkNN graph converges to the Laplace-Beltrami operator on the manifold. This explains why the CkNN graph shows better clustering performance than the other graph constructions when the graph is partitioned with Markov Stability, which is based on a diffusion process on the graph.

Comparison between clustering methods
In this section, we evaluate the performance of graph-based clustering through Markov Stability against several other clustering methods applied to the datasets from the UCI repository. We include a variety of clustering approaches. The model-based methods include k-means and the Gaussian mixture model (each clustering is repeated 10 times and the partition with the best NMI is reported). We also apply hierarchical clustering with complete linkage, using the Euclidean distance as the distance measure. Since graph-based clustering is closely related to spectral clustering, we compare to two spectral clustering methods: the multiclass n-cut algorithm [44] and the classic NJW algorithm [8]. The affinity matrix for these two spectral clustering algorithms is calculated with a local density kernel, as described in [45]. Note that all these algorithms need the number of clusters to be given as an input; hence we use the number of classes in the ground truth, $c^*$, to set the number of clusters. For comparability, in the case of Markov Stability, we construct the CkNN graph with Euclidean distance (k = 7 and δ = 1) and find the optimised partition with the number of communities equal to $c^*$; a sketch of the baseline setup is given below. The results are presented in Table 3. Given the distinct features of the datasets, no method is expected to achieve consistently better performance across all datasets. Hierarchical clustering performs worst, as it is easily affected by the noise in real datasets. Model-based methods, such as the Gaussian mixture model, can achieve good performance if the properties of the data fit the assumptions (e.g., the Iris dataset), but can also perform poorly. On datasets that have non-convex geometries, graph-based methods and spectral clustering tend to perform better. On average, the Markov Stability approach achieves the best NMI and ARI scores.
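For reference, the baselines can be reproduced approximately with scikit-learn; note that sklearn's built-in spectral clustering is only a stand-in for the n-cut and NJW variants with the local density kernel used here:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture

c_star = len(set(labels_true))   # number of ground-truth classes, c*

baselines = {
    "k-means": KMeans(n_clusters=c_star, n_init=10).fit_predict(Y),
    "GMM": GaussianMixture(n_components=c_star, n_init=10).fit_predict(Y),
    "hierarchical": AgglomerativeClustering(n_clusters=c_star,
                                            linkage="complete").fit_predict(Y),
    "spectral": SpectralClustering(n_clusters=c_star,
                                   affinity="nearest_neighbors").fit_predict(Y),
}
```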
To further validate the quality of the robust partitions found by Markov Stability, we also carried out MS clustering in a fully unsupervised manner, i.e., without providing the number of classes in the ground truth as an input. Using the principles described in Section Markov Stability for graph-based clustering, we identify robust scales and robust partitions in order to establish the number of clusters inherent to the data in an unsupervised manner. Table 4 presents the NMI and ARI values and the number of communities detected by MS, and compares them to the results in Table 3, where the number of communities in the ground truth was provided. Although the number of detected clusters differs slightly from the ground truth, the clusters found by unsupervised MS have a higher average NMI value, i.e., they provide more information about the ground truth. This highlights the advantage of multiscale clustering with Markov Stability as an unsupervised means to find a partition close to the ground truth.

Conclusion
We have investigated the use of multiscale community detection for graph-based data clustering. The first step in graph-based clustering is to construct a graph from the data, and our empirical study shows that the recently proposed CkNN graph is a good choice for this purpose. In contrast to other neighbourhood-based graph constructions like the kNN or $\epsilon$-ball graphs, the CkNN graph is designed to provide a consistent discrete approximation of the diffusion operator on the underlying data manifold. Since many community detection methods are closely related to diffusion or random walks (e.g., Markov Stability and spectral methods), this explains the good performance of CkNN for clustering purposes. Other graph construction methods specifically designed for manifold learning (e.g., RMST) performed well but are not optimised for cluster separation. Our work has also examined the suitability of multiscale community detection as a means for unsupervised data clustering. Specifically, we have used the Markov Stability framework, which employs a diffusion process on the graph to detect the presence of relevant subgraphs at all scales. The time of the diffusion process acts as a resolution parameter, and a cost function for graph partitioning is optimised at different scales by scanning over time. Robust partitions and robust scales can be identified by analysing the consistency of the ensemble of optimised partitions found by the Louvain algorithm. Our numerics show that the Markov Stability framework is able to determine the number of clusters and reveal the multiscale structure in data. Further, by scanning across Markov time, the MS analysis can reduce the sensitivity to the parameters of the graph construction step, thus improving the robustness of graph-based clustering.
We have validated our graph-based clustering approach on several real datasets by comparing with other popular clustering methods, including k-means, Gaussian mixture model, hierarchical clustering, and two spectral clustering algorithms. The graph-based clustering method achieves the best NMI and ARI values on average across the datasets. Importantly, we show that the clustering can be done in a completely unsupervised way (without assuming a knowledge of the number of clusters), whereas for the other standard methods the number of clusters needs to be given as an input.
Our study also suggests several directions for future work. Here we showed that the CkNN graph is a good choice for graph-based clustering, but it will be interesting to establish the performance of CkNN in other data mining problems, such as manifold learning, where graphs also play an important role [46]. We showed that the variation of information between partitions and across the ensemble of partitions found by greedy optimisation can be used to guide the identification of robust partitions. However, a quantitative, statistically sound procedure to choose the significant partitions automatically would be desirable and useful in practice.