Motif-Based Spectral Clustering of Weighted Directed Networks

Clustering is an essential technique for network analysis, with applications in a diverse range of fields. Although spectral clustering is a popular and effective method, it fails to consider higher-order structure and can perform poorly on directed networks. One approach is to capture and cluster higher-order structures using motif adjacency matrices. However, current formulations fail to take edge weights into account, and thus are somewhat limited when weight is a key component of the network under study. We address these shortcomings by exploring motif-based weighted spectral clustering methods. We present new and computationally useful matrix formulae for motif adjacency matrices on weighted networks, which can be used to construct efficient algorithms for any anchored or non-anchored motif on three nodes. In a very sparse regime, our proposed method can handle graphs with five million nodes and tens of millions of edges in under ten minutes. We further use our framework to construct a motif-based approach for clustering bipartite networks. We provide comprehensive experimental results, demonstrating (i) the scalability of our approach, (ii) advantages of higher-order clustering on synthetic examples, and (iii) the effectiveness of our techniques on a variety of real world data sets. We conclude that motif-based spectral clustering is a valuable tool for analysis of directed and bipartite weighted networks, which is also scalable and easy to implement.


Introduction
Networks are ubiquitous in modern society; from the internet and online blogs to protein interactions and human migration, we are surrounded by inherently connected structures (Kolaczyk and Csárdi, 2014). The mathematical and statistical analysis of networks is therefore an important area of modern research, with applications in a diverse range of fields, including biology (Albert, 2005), chemistry (Jacob and Lapkin, 2018), physics (Newman, 2008) and sociology (Adamic and Glance, 2005).
A common task in network analysis is that of clustering (Schaeffer, 2007). Network clustering refers to the partitioning of a network into "clusters", so that nodes in each cluster are similar (in some sense), while nodes in different clusters are dissimilar. For a review of approaches, see for example (Fortunato, 2010) or (Fortunato and Hric, 2016). Spectral methods for network clustering have a long and successful history (Cheeger, 1969; Donath and Hoffman, 1972; Guattery and Miller, 1995), and have become increasingly popular in recent years. These techniques exhibit many attractive properties including generality, ease of implementation and scalability (Von Luxburg, 2007), in addition to often being amenable to theoretical analysis using tools from matrix perturbation theory (Stewart and Sun, 1990). However, traditional spectral methods have shortcomings, particularly their inability to consider higher-order network structures (Benson et al, 2016), and their insensitivity to edge directions. Such weaknesses can lead to unsatisfactory results, especially when considering directed networks. Motif-based spectral methods have proven more effective for clustering directed networks on the basis of higher-order structures (Tsourakakis et al, 2017), with the introduction of the motif adjacency matrix (MAM) (Benson et al, 2016).
Contribution In this paper, we explore motif-based spectral clustering methods with a focus on addressing these shortcomings by generalizing motif-based spectral methods to weighted directed networks. Our main contributions include a collection of new matrix-based formulae for MAMs on weighted directed networks, and a motif-based approach for clustering bipartite networks. We also provide computational analysis of our approaches, demonstrations of scalability and comprehensive experimental results, both from synthetic data (variants of stochastic block models) and from real world network data.
Paper layout Our paper is organized as follows. In Section 2, we describe our graph-theoretic framework which provides a natural model for real world weighted directed networks and weighted bipartite networks. In Section 3 we develop our methodology, and state and prove new matrix-based formulae for MAMs. We explore the computational complexity of our approaches and demonstrate their scalability on sparse graphs. In Section 4 we explore the performance of our approaches on several synthetic examples. We demonstrate the utility of considering weight and higher-order structure, and compare against non-weighted and non-higher-order methods. In Section 5 we apply our methods to real world data sets, demonstrating that they can uncover interesting divisions in directed and bipartite networks, and noting their ability (in contrast to other methods) to avoid misclassification. Finally, in Section 6, we present our conclusions and discuss future work.

Framework
In this section, we give notation and definitions for our graph-theoretic framework, with the aim of being able to define our weighted generalizations of motif adjacency matrices in Section 3. The motif adjacency matrix (Benson et al, 2016) is the central object in motif-based spectral clustering, and serves as a similarity matrix for spectral clustering. In an unweighted MAM M, the entry M_ij is proportional to the number of motifs of a given type that include both of the vertices i and j. We consider the following three weighted adjacency matrices, corresponding to directed edges, single edges, and double edges respectively, where I{·} denotes an indicator function:

G_ij := W((i, j)) I{(i, j) ∈ E},
(G_s)_ij := W((i, j)) I{(i, j) ∈ E and (j, i) ∉ E},
(G_d)_ij := (W((i, j)) + W((j, i))) I{(i, j) ∈ E and (j, i) ∈ E}.
We can extend to the setting of undirected graphs by constructing a new graph, where each undirected edge is replaced by a bi-directional edge.
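As a concrete illustration, the single-edge and double-edge matrices above can be assembled directly from a weighted adjacency matrix. The following NumPy sketch is our own illustration (the function name is not from the paper):

```python
import numpy as np

def split_adjacency(G):
    """Build the single-edge and double-edge matrices G_s and G_d from a
    weighted directed adjacency matrix G (an entry of 0 means no edge)."""
    E = G > 0                    # edge indicator: (i, j) in E
    single = E & ~E.T            # (i, j) in E and (j, i) not in E
    double = E & E.T             # both (i, j) and (j, i) in E
    G_s = np.where(single, G, 0.0)
    G_d = np.where(double, G + G.T, 0.0)  # reciprocal weights are summed
    return G_s, G_d
```

For a pair of reciprocal edges with weights 3 and 4, both entries of G_d equal 7, matching the definition of (G_d)_ij.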
Definition 2.1 (Motifs and anchor sets). A motif is a pair (M, A) where M = (V_M, E_M) is a connected graph with V_M = {1, . . . , m} for some small m ≥ 2, and an anchor set is A ⊆ V_M with |A| ≥ 2. If A ⊊ V_M, we say the motif is anchored, and if A = V_M we say it is simple. We say that H is a functional instance of M in G if M ≅ H ≤ G, and we say that H is a structural instance of M in G if M ≅ H < G. When an anchor set is not given, the motif is assumed to be simple. Figure 1 shows all the simple motifs (up to isomorphism) on at most three vertices.
Structural instances occur when an exact copy of the motif is present in the graph: i.e. edges present (resp. not present) in the motif are present (resp. not present) in the graph. Functional instances are less restrictive, and occur when the motif is present in a graph, but potentially with extra edges. Anchor sets (Benson et al, 2016) can be thought of as the set of locations in a motif that we consider important for the motif structure. For example, we could consider the 2-path motif M_9, and try to cluster together nodes which appear at the start or end of a 2-path, but not necessarily in the middle. Then we can define the anchor set as the subset of motif vertices which should not be separated. In M_9, the start and end vertices would correspond to A = {1, 3} (Figure 1). Anchor sets are crucial for defining the collider and expander motifs given in Section 3.2.
Finally, we require one additional definition before we can state our generalization of MAMs to weighted networks.
Definition 2.2 (Anchored pairs). Let G be a graph and (M, A) a motif. Suppose H is an instance of M in G. Define the anchored pairs of the instance H as

A_H := { {u, v} : {u, v} ⊆ σ(A) for some isomorphism σ from M to H }.

This is the set of pairs of vertices for which both vertices lie in the image of the motif's anchor set, under some isomorphism from the motif to the instance.

Methodology
In this section we detail our methods for motif-based spectral clustering of weighted directed networks. Firstly we define our weighted generalizations of motif adjacency matrices (MAMs) (Definition 3.1). We further provide computationally useful formulae for weighted MAMs (Proposition 3.1), and discuss their applications to clustering (Section 3.3). Finally we present a complexity analysis of our method for computing weighted MAMs (Proposition 3.3), and empirically demonstrate the scalability of our approach (Section 3.4).

Weighted motif adjacency matrices
When generalizing an unweighted concept to weighted networks it is often helpful to first consider its generalization to multi-edges (Newman, 2004), and to view general weighted edges as multi-edges.
Thus, we first consider the weight to place on a motif which contains multi-edges. One option is to count the number of unique instances of the motif among the multi-edges, which gives the total weight as the product of the number of multi-edges between each pair of nodes. Alternatively, we could consider the total number of edges involved in the motif. Both have merits: the former "product" approach is useful if we consider each possible set of edges to be a separate entity, and the latter "sum" approach is useful if we consider the motif to be a single object and wish to count the total number of edges it contains, perhaps as some notion of "capacity". When considering weighted edges however, these two approaches have some mathematical differences, particularly in how they handle the presence of large weights. For example, suppose that a graph contains just a few heavy edges. Were two of these heavy edges to appear together in a motif instance, the product formulation would assign a very large total weight to that instance, possibly completely dominating any other instances of the motif in the graph. On the other hand, using the sum formulation ensures that the total weight of any instance remains on the same scale as the individual edge weights. For this reason, and following (Simmons et al, 2019;Mora et al, 2018), we use the sum formulation in this paper.
To be more precise, we define the weight of a motif instance by its average (mean) edge weight.
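To make the contrast between the two formulations concrete, the following toy comparison (our own illustration) evaluates both weightings on two hypothetical motif instances:

```python
import numpy as np

def product_weight(edge_weights):
    """'Product' formulation: multiply edge weights, as when counting
    combinations of multi-edges."""
    return float(np.prod(edge_weights))

def mean_weight(edge_weights):
    """'Sum' formulation used in this paper: the mean edge weight, which
    stays on the scale of the individual edge weights."""
    return float(np.mean(edge_weights))

# Two heavy edges dominate under the product but not under the mean:
heavy, light = [100.0, 100.0], [1.0, 2.0]
```

Here the heavy instance receives weight 10000 under the product formulation but only 100 under the mean, illustrating why the sum formulation keeps heavy edges from dominating.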
Definition 3.1 (Weighted motif adjacency matrices). Let G = (V, E, W) be a weighted graph on n vertices and let (M, A) be a motif. The functional and structural motif adjacency matrices (MAMs) of size n × n of (M, A) in G are given by

M^func_ij := Σ_{H ∈ F} w̄(H) I{{i, j} ∈ A_H}  and  M^struc_ij := Σ_{H ∈ S} w̄(H) I{{i, j} ∈ A_H},

where F (resp. S) denotes the set of functional (resp. structural) instances of M in G, A_H denotes the anchored pairs of H (Definition 2.2), and w̄(H) is the mean edge weight of the instance H. When W ≡ 1 and M is simple, the (functional or structural) MAM entry M_ij (i ≠ j) simply counts the (functional or structural) instances of M in G containing i and j. When M is not simple, M_ij counts only those instances with anchor sets containing both i and j. MAMs are always symmetric, since the only dependency on (i, j) is via the unordered set {i, j}. In order to state Proposition 3.1, we need one more definition. The motivation for this definition of anchored automorphism classes is as follows: suppose we are looking for instances of (M, A) in G which contain nodes i and j. We set k_1 = i and k_m = j (where m is the number of nodes in the motif; note that we could use any two fixed indices instead of 1 and m here), and choose some other
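Under our reading of Definition 3.1, a naive reference implementation for the simple 2-path motif M_9 (all vertices anchored) looks as follows. This brute-force O(n^3) sketch is purely illustrative; it is not the matrix-based method developed in Proposition 3.1:

```python
import numpy as np

def mam_2path_functional(G):
    """Functional MAM for the simple 2-path motif a -> b -> c, weighting
    each instance by its mean edge weight. Brute-force O(n^3) sketch."""
    n = G.shape[0]
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            for c in range(n):
                if len({a, b, c}) < 3:
                    continue
                if G[a, b] > 0 and G[b, c] > 0:    # a functional instance
                    w = (G[a, b] + G[b, c]) / 2.0  # mean edge weight
                    for u, v in ((a, b), (a, c), (b, c)):
                        M[u, v] += w               # all pairs are anchored
                        M[v, u] += w
    return M
```

For a single path 0 → 1 → 2 with weights 2 and 4, every pair among the three vertices receives the instance's mean weight 3.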

Motif adjacency matrices for bipartite graphs
We also extend our formulation to weighted bipartite networks, by considering certain 3-node anchored motifs. These are used to create separate similarity matrices for each part of a bipartite graph.
Definition 3.3. A bipartite graph is a directed graph whose vertices can be partitioned as V = S ⊔ D, such that every edge starts in S and ends in D. We refer to S as the source vertices and to D as the destination vertices.
Our method for clustering bipartite graphs uses two anchored motifs; the collider and the expander (Figure 2). For both motifs the anchor set is A = {1, 3}. These motifs are useful for bipartite clustering because when restricted to the source or destination vertices, their MAMs are the adjacency matrices of the weighted projections (Chessa et al, 2014) of the graph G (Proposition 3.2). In particular they can be used as similarity matrices for the source and destination vertices respectively. Note that if a bipartite graph is connected, then so are the projections onto its source or destination vertices.
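For intuition, here is a sketch of the source projection implied by the collider motif: an instance pairs sources i and j through a shared destination k via the two edges (i, k) and (j, k), so under the mean-edge-weight convention it contributes (B[i, k] + B[j, k]) / 2. This assumes B is the (source × destination) weighted biadjacency matrix; the function name and details are ours, not the paper's exact construction:

```python
import numpy as np

def collider_projection(B):
    """Similarity matrix for source vertices: entry (i, j) sums, over
    destinations k adjacent to both i and j, the mean instance weight
    (B[i, k] + B[j, k]) / 2. The expander projection is the analogous
    computation on the transpose of B."""
    J = (B > 0).astype(float)        # indicator of bipartite edges
    P = 0.5 * (B @ J.T + J @ B.T)    # only shared destinations contribute
    np.fill_diagonal(P, 0.0)         # drop degenerate i = j pairs
    return P
```

If destination k is adjacent to i but not j, both cross terms vanish, so only shared neighbors contribute, matching the idea of a weighted projection.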
We note that there are other options available for constructing projections of weighted bipartite graphs, such as the approach in Stram et al (2017), which, similarly to our framework, uses sums of edge weights over shared neighbors. Another relevant line of work is that of (Zha et al, 2001), who proposed a certain minimization problem on the bipartite graph, showing that an approximation solution could be obtained via a partial singular value decomposition of a suitably scaled edge weight matrix.

Clustering the motif adjacency matrix
A motif adjacency matrix can be construed as a general pairwise similarity measure, and as such, can be analyzed directly as a new weighted undirected graph. Following (Benson et al, 2016), one of the most interesting applications is identifying higher-order clusters with spectral methods, leveraging the fact that the similarity is based on motifs.
To extract clusters, we use a standard approach from the literature on spectral clustering, based on the spectrum of the random-walk Laplacian, followed by k-means++ (see Appendix C). To this end, we need to define two parameters: l is the number of random-walk Laplacian eigenvectors to use, and k is the number of clusters for k-means++ (see Algorithm 1). We detail our approach for general directed weighted networks in Algorithm 2, and for weighted bipartite networks in Algorithm 3.
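A minimal sketch of this clustering step follows; it is our own illustration of the standard random-walk Laplacian recipe with k-means++, not the paper's exact Algorithm 2:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_cluster(M, k, l, seed=0):
    """Embed nodes with the l leading eigenvectors of the random-walk
    Laplacian of a symmetric similarity matrix M, then run k-means++."""
    n = M.shape[0]
    d = M.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian I - D^{-1/2} M D^{-1/2}
    L_sym = np.eye(n) - d_is[:, None] * M * d_is[None, :]
    _, vecs = eigh(L_sym)              # eigenvalues in ascending order
    emb = d_is[:, None] * vecs[:, :l]  # rows of random-walk eigenvectors
    np.random.seed(seed)               # kmeans2 uses the global RNG
    _, labels = kmeans2(emb, k, minit='++')
    return labels
```

The random-walk eigenvectors are recovered from the symmetric problem via the scaling D^{-1/2}, which avoids forming a non-symmetric matrix.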

Cluster evaluation
When a ground-truth clustering is available, we compare it to our recovered clustering using the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985). The ARI between two clusterings has expected value 0 under random cluster assignment, and maximum value 1, denoting perfect agreement between the clusterings. A larger ARI indicates more similar clusterings, and hence a recovered clustering closer to the ground truth.
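For completeness, the ARI can be computed from the contingency table of the two label vectors; a self-contained sketch:

```python
import numpy as np

def adjusted_rand_index(a, b):
    """ARI of two clustering label vectors, following Hubert and Arabie (1985)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Contingency table of co-occurring labels
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(C, (ia, ib), 1)
    comb2 = lambda x: x * (x - 1) / 2.0    # number of pairs
    index = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    return (index - expected) / (max_index - expected)
```

Note that the ARI is invariant to relabeling of the clusters, so two identical partitions with permuted labels still score 1.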

Connected components
In order for spectral clustering to produce nontrivial clusters from a graph, it is necessary to restrict the graph to its largest connected component. When forming MAMs, even if the original graph is connected, there is no guarantee that the MAM is connected too. Hence we restrict the MAM to its largest connected component before spectral clustering is applied. While this may initially seem to be a flaw with motif-based spectral clustering (since some vertices may not be assigned to any cluster), in fact it can be useful; we only attempt to cluster vertices which are in some sense "well connected" to the rest of the graph. This can result in fewer misclassifications than with traditional spectral clustering, as seen in Section 5.2.
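This restriction is straightforward with standard tools; for example, with SciPy, treating the MAM as an undirected sparse graph (a sketch with names of our choosing):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def largest_component(M):
    """Restrict a symmetric similarity matrix M to its largest connected
    component; returns the submatrix and the retained node indices."""
    _, labels = connected_components(sp.csr_matrix(M), directed=False)
    keep = np.flatnonzero(labels == np.bincount(labels).argmax())
    return M[np.ix_(keep, keep)], keep
```

Returning the retained indices makes it possible to map cluster labels back to the original vertex set, since some vertices are left unclustered.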

Functional vs. structural MAMs
There is a choice of whether to use functional or structural MAMs for motif-based clustering, and their different properties make them suitable for different circumstances. Firstly, note that 0 ≤ M^struc_ij ≤ M^func_ij for all i, j ∈ V. This implies that the largest connected component of M^func is always at least as large as that of M^struc, meaning that sometimes more vertices can be assigned to a cluster by using functional MAMs. However, structural MAMs are more discerning about finding motifs, since they require both "existence" and "non-existence" of edges.

Computational analysis
For motifs on at most three vertices, we provide two fast and potentially parallelizable matrix-based procedures for computing weighted MAMs: one for dense regimes and one for sparse regimes. In this section, we explore and demonstrate the scalability of these two approaches.

Dense approach
First, Proposition 3.3 bounds the number of matrix operations required to compute a weighted MAM, for a motif on at most three vertices, using our dense formulation. In practice, additional symmetries of the motif often allow computation with even fewer matrix operations (Appendix A).
Proposition 3.3 (Complexity of MAM formula). Let G be a (weighted, directed) graph on n vertices, and suppose that the n × n directed adjacency matrix G of G is known. Then, computing the adjacency and indicator matrices and calculating a MAM, for a motif on at most three vertices, using Equations (1) and (2) in Proposition 3.1, involves at most 18 matrix multiplications, 22 entry-wise multiplications and 21 additions of n × n matrices.
Thus, as we have a fixed bound on the number of each operation for any motif, and as each operation is performed on matrices of the same size, our approach scales with the largest complexity among these operations. For dense matrices, element-wise products and additions are O(n^2) and naive matrix multiplication is O(n^3). Therefore, in a naive implementation for a graph on n vertices, the overall complexity of our dense approach is O(n^3), and the memory requirement is O(n^2). We note that somewhat faster algorithms are available for multiplication of large dense matrices, such as the O(n^2.81) algorithm given by Strassen (1969).
A list of functional MAM formulae for our dense approach, for all simple motifs on at most three vertices, as well as for the collider and expander motifs (used in Section 3.2), is given in Table 3 in Appendix A. These formulae generalize those stated in Table S6 in the supplementary materials of Benson et al (2016) (a list of structural MAMs for triangular motifs in unweighted graphs). Note that the functional MAM formula for the two-vertex motif M_s yields the symmetrized adjacency matrix M = G + G^T, which can be used for traditional spectral clustering (Appendix C.1), as in Meilă and Pentney (2007).

Sparse approach
Real world networks are often sparse (Leskovec and Krevl, 2014), and operations with sparse matrices are significantly faster: in sparse matrices with b non-zero entries, element-wise products and additions are O(b), and matrix multiplications are O(bn). Therefore, we give a slightly different approach with a better computational running time (Section 3.4.3) on sparse graphs, which can extend to graph sizes that are infeasible for our dense approach.
For motifs on at most three vertices, the formulae in Proposition 3.1 are simply sums of terms of the form A ∘ (BC), where A, B and C are adjacency or indicator matrices, and ∘ represents the entry-wise (Hadamard) product. If the directed graph has n vertices and b edges, then most of the adjacency and indicator matrices have at most b non-zero entries. For these matrices, we can compute the (possibly dense) matrix BC in O(bn) time, and then the matrix A ∘ (BC) in O(n^2) time. Summing the terms together is also done in O(n^2) time. However, when considering motifs with missing edges, we use the dense indicator matrices J_0 and J_n. This is problematic, as the term BC can then involve a dense matrix, which is an issue both for computational reasons and, more importantly, for memory requirements. To address this we rewrite these two matrices as

J_n = 1 − I  and  J_0 = 1 − I − J_s − J_s^T − J_d,

where 1 is a matrix of 1s, I is the identity, and J_s and J_d are the (sparse) indicator matrices of single and double edges, and expand out the resulting formulae. Element-wise and matrix products with 1 and I are at most O(n^2), so are computationally simple, and do not require generating dense matrices. This allows scaling simply using standard linear algebra libraries.
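The two tricks above can be sketched with SciPy sparse matrices: the Hadamard product with a sparse A means only entries in A's sparsity pattern survive, and terms involving the all-ones matrix 1 reduce to row or column sums. These helpers are illustrative, not the paper's implementation:

```python
import numpy as np
import scipy.sparse as sp

def hadamard_term(A, B, C):
    """A ∘ (B C) for sparse matrices; only entries where A is non-zero
    survive the Hadamard product."""
    return A.multiply(B @ C)

def hadamard_ones_term(A, B):
    """A ∘ (B 1): since (B 1)_ij is the i-th row sum of B, left-multiply A
    by diag(row sums of B) instead of ever forming the dense matrix 1."""
    row_sums = np.asarray(B.sum(axis=1)).ravel()
    return sp.diags(row_sums) @ A
```

The second helper shows why the expansion avoids dense intermediates: the product B·1 is never materialized, only a length-n vector of row sums.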
For sparse graphs, the key limitation of this approach is in fact not CPU but memory (RAM): while B and C might be sparse, BC may not be. The worst-case density of BC, for B and C indicator matrices of a connected graph, is n^2 non-zero entries (e.g. for a star graph). However, in practice the density is often substantially lower than this, allowing our approach to scale to very large graphs (Section 3.4.3).

Empirical computational speed
In this section, we demonstrate the scalability of our two approaches to computing MAMs. We showcase running times for two different random graph ensembles, corresponding to a very sparse regime and a less sparse regime. For both, we use directed Erdős–Rényi (ER) random graphs (Erdős and Rényi, 1959) with n nodes. For the sparser regime, we use an edge existence probability of p = 10/n, and for the less sparse case we set p = 100/n. In both cases, the expected number of edges in the graph scales as O(n). We measure the performance of computing a functional MAM for a representative sample of motifs: one triangle motif M_1, the 2-path M_9, the 2-star M_10, and finally a motif with a reciprocal edge, M_11. We do not include the time taken to generate the graph in either sparse or dense matrix form. To make this a fair test, we use an optimized Python environment from the Data Science Virtual Machine image on a D64v3 Azure virtual machine with 64 vCPUs and 256 GiB of RAM.
The results are summarized in Figure 3, where we note that in both cases the multi-core dense approach is fastest for graphs with densities greater than around 0.01 (i.e. n < 1000 for p = 10/n and n < 10 000 for p = 100/n), highlighting the advantages of using this approach on dense graphs. In the sparser regime (p = 10/n), our sparse approach can handle graphs with five million nodes and tens of millions of edges. In the less sparse regime (p = 100/n), we consider graphs of size up to n = 100 000 nodes and of order 10^7 edges, for which our sparse approach takes around 200 seconds.
We note that the time required for each method is roughly constant across most of the tested motifs. The exception is the sparse approach for M_11, which takes substantially less time than the other motifs in both regimes. We believe this is related to the rare reciprocated edge in M_11 (not present in the other motifs), which heavily reduces the computation required. The various scenarios we experimented with highlight that our approach is highly scalable to very large sparse graphs.

Applications to Synthetic Data
To validate our method, we consider three synthetic data sets as examples. We selected each example to highlight a different aspect of our approach and to demonstrate the advantages of considering a weighted higher-order clustering measure. Example 1 demonstrates the importance of taking the edge weights into account when performing higher-order clustering; Example 2 shows the value of higher-order clustering, demonstrating that clustering using different motifs can yield different insights into the same data; and finally, Example 3 demonstrates the value of our bipartite clustering scheme for detecting structures in weighted bipartite networks.
Figure 3: Run time for our two approaches on the sparse (p = 10/n, top) and less sparse (p = 100/n, bottom) graph ensembles. We average over 5 repeats and error bars are one sample standard deviation.

Directed stochastic block models
For these tests we use weighted and unweighted directed stochastic block models (DSBMs), a broad class of generative models for directed graphs (Nowicki and Snijders, 2001). An unweighted DSBM is characterized by a block count k, a list of block sizes (n_i)_{i=1}^k, and a connection matrix F ∈ [0, 1]^{k×k}. We define the cumulative block sizes n^c_i = Σ_{j=1}^i n_j, and the total graph size n = n^c_k. These are used to construct the group allocations g_i = min{r : n^c_r ≥ i}, and finally a graph G is generated with adjacency matrix entries

G_ij ~ Bernoulli(F_{g_i g_j}),

with all Bernoulli random variables sampled independently. A weighted DSBM is constructed in a similar manner (see for example Mariadassou et al (2010) and Aicher et al (2014)), but also requires a weight matrix Λ ∈ [0, ∞)^{k×k}. In this case, the weighted adjacency matrix entries are generated by the mixture model

G_ij ~ Bernoulli(F_{g_i g_j}) · Poisson(Λ_{g_i g_j}),

with all Bernoulli and Poisson random variables sampled independently. Note that the Bernoulli variable now no longer directly corresponds to edge existence, since there is always a non-zero chance of the Poisson variable being zero, which sets G_ij = 0 and thus removes the edge. We assume a DSBM is unweighted unless stated otherwise. We evaluate performance using the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), as in Section 3.3.1.
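The weighted DSBM above can be sampled directly, with G_ij = Bernoulli(F_{g_i g_j}) · Poisson(Λ_{g_i g_j}); a NumPy sketch of our own:

```python
import numpy as np

def sample_weighted_dsbm(block_sizes, F, Lam, seed=0):
    """Sample a weighted DSBM adjacency matrix and the group allocations."""
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(len(block_sizes)), block_sizes)  # group labels
    P = F[np.ix_(g, g)]       # per-entry Bernoulli probabilities
    L = Lam[np.ix_(g, g)]     # per-entry Poisson means
    G = rng.binomial(1, P) * rng.poisson(L)  # edge indicator times weight
    np.fill_diagonal(G, 0)    # no self-loops
    return G, g
```

Setting Λ ≡ 1 does not recover the unweighted model exactly (the Poisson factor can be zero or exceed one), which is the caveat noted above about the Bernoulli variable no longer determining edge existence.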

Bipartite stochastic block models
We define the unweighted bipartite stochastic block model (BSBM), with source block count k_S, destination block count k_D, source block sizes (n^S_i)_{i=1}^{k_S}, destination block sizes (n^D_i)_{i=1}^{k_D}, and bipartite connection matrix F_b ∈ [0, 1]^{k_S × k_D}, as the DSBM with connection matrix

F = ( 0  F_b ; 0  0 ).

A weighted BSBM with weight matrix Λ_b can similarly be constructed by using a weighted DSBM with weight matrix

Λ = ( 0  Λ_b ; 0  0 ).

This is a generalization of the model in Florescu and Perkins (2016).
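Under this construction, a BSBM is just a DSBM whose connection and weight matrices have the bipartite block form; a small helper sketch (function name is ours):

```python
import numpy as np

def bsbm_matrices(F_b, Lam_b):
    """Embed bipartite connection and weight matrices into the DSBM block
    form F = [[0, F_b], [0, 0]] and Lam = [[0, Lam_b], [0, 0]], so that all
    edges run from source blocks to destination blocks."""
    kS, kD = F_b.shape
    z = lambda r, c: np.zeros((r, c))
    F = np.block([[z(kS, kS), F_b], [z(kD, kS), z(kD, kD)]])
    Lam = np.block([[z(kS, kS), Lam_b], [z(kD, kS), z(kD, kD)]])
    return F, Lam
```

The zero blocks guarantee that no edges appear within the source part, within the destination part, or from destinations back to sources.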

Example 1
In this example we demonstrate the advantages of taking edge weights into account when detecting higher-order structures. We compare Algorithm 2 with two clusters (k = 2) and two eigenvectors (l = 2) against two standard approaches. The first comparison is random-walk spectral clustering using a symmetrized weighted matrix, which captures the ability of non-motif-based methods to uncover the underlying structure; this is equivalent to Algorithm 2 with motif M_s. Secondly, we compare against our own motif-based approaches, but with all edge weights set to 1, similar to the formulation of Benson et al (2016).
To demonstrate the advantages of our approach, we consider an example for which (i) a higher-order structure is present, and (ii) weight is important in the structure. Constructing higher-order structures in a stochastic block model is challenging, as by definition all the edges are independent. Thus the block structure in a DSBM with strongly connected blocks is captured by density rather than by the existence of motifs (although these are correlated).
To this end, we construct clusters that consist of several blocks, and introduce a higher-order structure between blocks of the same cluster. For this example, we use two clusters (Figure 4, upper panel), each consisting of two blocks of size 100, for a total of n = 400 nodes. Within each cluster, the two blocks are joined by strong unidirectional (probability p = 0.25) connections with potentially large weights (Poisson with mean w_1). The two clusters are then linked by weaker inter-block connections with potentially smaller weights (Poisson with mean 50). By design, this model has strong heavy-weighted unidirectional structures, and is thus well captured by M_8 and M_10, both of which capture unidirectional structure. Thus, following Benson et al (2016), rather than focusing on an individual motif, we use the sum of both MAMs.
The lower panel of Figure 4 displays the results for functional motifs (structural motifs in Appendix D). As our procedure only clusters the largest connected MAM component, we compute the ARI over this component. For w_1 = 50, all edges have the same expected weight, and thus the performances of weighted and non-weighted motifs are equal. In the mid-range of weights (50 < w_1 ≤ 90), our approach outperforms both of the others, indicating the advantages of accounting for weights.
Finally, for large weights (w_1 > 90), we are (on average) slightly outperformed by M_s, the symmetrized weighted adjacency matrix, although the error bars overlap. One possible reason for this is that while using M_8 and M_10 gives a strong signal, it also introduces noise when motifs with heavy and non-heavy weighted edges span clusters.
The pattern for structural motifs is similar (Figure 12 in Appendix D), with two key differences: first, the non-weighted motifs have a larger ARI (≈ 0.7 vs. ≈ 0.3), and second, the weighted motif equals or outperforms M_s. We hypothesize that this is due to the structural motifs' ability to filter out some of the noise described in the previous paragraph, thus leading to better performance.
Finally, we note that this example has been designed to highlight the advantages of our motifs; several other structures we considered do not have this property. Following Benson et al (2016), when applying this to real world data, it is important to consider which motif to use. We note that the DSBM has been constructed to favor weighted motifs.
Figure 4: Top: the left and middle panels display the connection matrix and weight matrix (Equation (6)); the right panel shows a schematic diagram of the structure, where blocks of the same color belong to the same cluster and larger arrows represent connections with higher probabilities and larger edge weights. Each block contains 100 nodes. Bottom: performance of MAMs based on M_8 and M_10 on the model in the upper panel, compared both to the unweighted case and to the symmetrized case. We perform 100 repeats, and error bars are one sample standard deviation.

Example 2
In the second example, we explore the value of higher-order clustering, by demonstrating that clustering using different motifs can give different albeit potentially equally valuable insights. We showcase this with two experiments. First, we consider the case where, due to the construction of the network, certain motifs are better at detecting the underlying structure in different parameterization regimes. Second, expanding on our first example, we consider a DSBM where, by construction, the network does not have strong densely connected blocks, and thus different motifs highlight distinct and equally relevant groupings in the network.

Experiment 1
For the first experiment, we use the DSBM with k = 2, n_1 = n_2 = 100 (so n = 200) and connection matrix

F = ( 0.2  q_1 ; 0.05  0.2 ),

as depicted in the upper panel of Figure 5.
We test the performance of Algorithm 2 across a few structural motifs with parameters k = l = 2 on this model (functional motifs in Appendix D). It can be seen that the motifs perform well in different regions, with M_1 outperforming other methods in the regime 0.3 ≤ q_1 ≤ 0.7, and also performing well outside this range.
For large values of q_1, motif M_9 performs best. For lower values of q_1, M_2 has a slightly higher average ARI, although the difference is within the standard errors. This altogether demonstrates the importance of selecting the right motif. Furthermore, we compare against M_s, the symmetrized weighted adjacency matrix, and each presented motif outperforms this baseline.
Figure 5: Top: the connection matrix (Equation (5)) and, in the right panel, a schematic diagram of the structure, where larger arrows represent connections with higher probabilities. Bottom: performance on this benchmark using MAMs, compared with standard symmetrization M_s. We perform 100 repeats, and error bars are one sample standard deviation.
We also observe an artifact in this plot: for certain motifs, the performance drops away at q_1 = 0 (triangles) and at q_1 = 1 (all motifs). For these parameter values, the MAM becomes entirely disconnected, with each of the two groups in its own connected component. As our clustering scheme only considers the largest connected component (Section 3.3.2), we then obtain an ARI of 0. In real world graphs, we would recommend investigating all reasonably large connected components.
We note that there is a bi-modality with certain motifs, notably with M_9 performing well for small and large values of q_1, a feature not present with the functional motifs (Figure 13). We believe this is because structural motifs act as a filter: with high values of q_1, it is difficult to form 2-paths (M_9) between groups without also forming triangular motifs, which do not contribute towards a structural MAM.

Experiment 2
In this experiment, we demonstrate the ability of MAMs to uncover different structures. We construct a DSBM with 8 groups of size 100 (n = 800) with block structure given by the diagram in the left panel of Figure 6: we place edges which match the block structure with probability 0.5 (i.e. F_ab = 0.5 if there is an arrow from a to b in the diagram), and with probability 0.0001 otherwise, in order to maintain connectivity.
Following Example 1, in this DSBM we do not have the usual strongly connected groups: in fact, each group has a close to zero probability (0.0001) of within-group connections. Thus, while there are clearly 8 groups with unique connection patterns, there does not exist a "correct" division of the nodes into densely connected groups, but rather, different motifs highlight different structures.
We display the results of using functional motifs with Algorithm 2 with k = l = 2 in Figure 6 (structural results are similar in Figure 14). Each column represents the structure uncovered by one of the motifs. For robustness, we perform the experiment on 100 replicates, and place the resulting partitions side by side in each column of Figure 6. For each motif, the replicates (i.e. columns) are ordered within each block to promote contiguity of the colors across the block, highlighting similarities across replicates.

Figure 6: Left: diagram of the block structure for Example 2 Experiment 2; edges in the diagram are present with probability 0.5, and all other edges (including edges within blocks) are present with probability 0.0001. Right: the detected groups found by each method, with k = 2 and l = 2, with the exception of the last column, which has k = 8 and l = 3. We test 100 replicates and present the results as columns in the plot. The nodes are ordered by block, and colors represent the group allocations in a given block for each motif-based method. Replicates (i.e. columns) are ordered within each motif to highlight similarities: the columns are sorted within each block to promote contiguity of the colors across the block. While the clustering assignments may contain some errors, the results are relatively robust across replicates.
We observe that (by design) this graph has 3 different clusterings, highlighted by different motifs. First, consider M 8, a 2-star with the edges pointing outwards (Figure 1). Each pair of connected blocks is connected by this motif, but the connection is much stronger when the motif is part of the higher-order block structure. Thus, when we cluster using M 8, we obtain two consistent clusters consisting of {V 1, V 2, V 4} and {V 5, V 7, V 8}, each of which is united by this motif at a block level. The remaining blocks without this block-level structure (V 3 and V 6) are then essentially randomly placed into one of the two clusters, giving the behavior observed in Figure 6. We observe a similar structure with the other motifs: M 9 (a 2-path) splits the graph into the two groups characterized by their 2-paths, {V 1, V 2, V 3, V 5} and {V 4, V 6, V 7, V 8}; and finally M 10 (a 2-star with the edges pointing inwards) splits the graph into two clusters based on this structure, namely {V 1, V 4, V 6} and {V 3, V 5, V 8} (with behavior similar to M 8 for the remaining blocks). Thus, we conclude that our MAMs can uncover very different and equally valid structures in the same graph.
Finally, we compare with M s which is equivalent to clustering with G + G T . Under symmetrization, this graph is a ring of indistinguishable blocks. Therefore, the best possible division is to arbitrarily divide the ring into two roughly equal-sized pieces. Considering Figure 6, we observe this behavior with many different divisions, and no pair of blocks is consistently placed within the same cluster. For completeness we also compare against M s with k = 8 and l = 3, which divides the network into the 8 blocks of the underlying DSBM. Although this is a specially constructed example, it highlights the different equally valid structures that can be uncovered, emphasizing the importance of considering the correct motif (Benson et al, 2016), and demonstrating the value of this procedure above standard spectral clustering.

Example 3
In our final example, we demonstrate clustering the source vertices of bipartite networks, using the collider motif as in Algorithm 3. Clustering the destination vertices with the expander motif is exactly analogous (by simply reversing edge directions), so we do not demonstrate it explicitly. Note also that in bipartite graphs, all instances of the collider or expander motifs are structural, so we need not compare functional and structural MAMs.

[Figure caption fragment: (see (6) and Section 4.2 for details); right panel, a schematic diagram of the structure. Bottom: performance of our bipartite collider method with and without edge weights. We perform 100 repeats, and error bars are one sample standard deviation.]
We illustrate this with two experiments. In the first one, we show that in certain networks, edge weights are important for our collider method to perform well. In the second experiment, we compare against an alternative method for clustering weighted bipartite networks, showing that under certain conditions, our collider method performs better.

Experiment 1
This example illustrates the advantages of using weights for bipartite clustering. We use a weighted BSBM with block counts k S = k D = 2, block sizes n 1 S = n 2 S = n 1 D = n 2 D = 100, bipartite connection matrix F b = (0.15 0.1; 0.1 0.15), and bipartite weight matrix Λ b = (w 1 50; 50 w 1), where w 1 is a varying parameter. This network has been constructed so that vertices in S 1 are slightly more likely to connect to vertices in D 1 than to vertices in D 2, and vice versa for S 2. This makes the source vertex clusters S 1 and S 2 only weakly distinguishable when edge weights are not considered. However, the edge weights exhibit the same preferences, and this effect becomes stronger as w 1 increases.
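A Poisson-weighted BSBM of this kind can be sampled as follows. This is a hedged sketch of the generative model described above; the function name `sample_weighted_bsbm` is ours, and w 1 = 80 below is an arbitrary illustrative setting.

```python
import numpy as np

def sample_weighted_bsbm(sizes_S, sizes_D, F_b, Lambda_b, rng=None):
    """Sample a Poisson-weighted bipartite SBM.

    An edge from source block a to destination block b is present with
    probability F_b[a, b]; its weight is drawn from Poisson(Lambda_b[a, b]).
    Returns the |S| x |D| weighted biadjacency matrix.
    """
    rng = np.random.default_rng(rng)
    ls = np.repeat(np.arange(len(sizes_S)), sizes_S)
    ld = np.repeat(np.arange(len(sizes_D)), sizes_D)
    P = F_b[np.ix_(ls, ld)]                # edge probabilities
    L = Lambda_b[np.ix_(ls, ld)]           # Poisson weight means
    present = rng.random(P.shape) < P
    # note: a Poisson draw of 0 removes an otherwise-present edge
    A = rng.poisson(L) * present
    return A.astype(float)

# The model of Experiment 1 with a hypothetical setting w_1 = 80.
w1 = 80
F_b = np.array([[0.15, 0.1], [0.1, 0.15]])
Lambda_b = np.array([[w1, 50], [50, w1]])
A = sample_weighted_bsbm([100, 100], [100, 100], F_b, Lambda_b, rng=0)
```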
We test the performance of Algorithm 3 for clustering the source vertices (using the collider motif), with parameters k S = l S = 2 on this model. We compare the results against those obtained by ignoring the edge weights. Figure 7 shows that the weights in this model allow the structure to be recovered well when the weighting effect is large enough.

Experiment 2
In this final experiment, we show the advantages of using the collider motif over simply clustering with the matrix AA T, where A ∈ [0, ∞) |S|×|D| is the weighted bipartite adjacency matrix of the graph (Chessa et al, 2014). We use a weighted BSBM with block counts k S = 2, k D = 3, block sizes n 1 S = n 2 S = n 1 D = n 2 D = n 3 D = 200, bipartite connection matrix F b = (0.9 0.3 0; 0 0.3 0.9), and bipartite weight matrix Λ b = (w 1 1 0; 0 1 w 1), where w 1 is a varying parameter. This model has been constructed such that vertices in S 1 are more likely to be connected to D 1 than to D 2, and similarly vertices in S 2 are more likely to be connected to D 3 than to D 2. However, the weights show the reverse preference (for w 1 < 1, as in our model), with larger weights assigned to the lower-probability edges. Note that in this regime, where weights are small, there is a significant probability of the Poisson weight variables being zero, removing edges which might otherwise have had a high probability of existing.
The results are shown in Figure 8, which illustrates that our collider method (Algorithm 3 with k S = l S = 2) is better at distinguishing the source vertex clusters S 1 and S 2 than the method based on AA T. Although it must be made clear that the model has been specifically constructed to show this, and that the effect occurs only for a limited parameter set, it gives an example of the different behavior observed when using a product-based weight formulation (Section 3.1). For our collider method, the expected similarity between two source vertices in the same cluster depends directly on w 1, while the expected similarity between two source vertices in different clusters is fixed. Hence, for large enough values of w 1, our collider method performs well. For the AA T method, however, the expected similarity between two source vertices in the same cluster depends on the square of w 1 (since the entries of AA T are sums of products of edge weights), which in this model is small (as w 1 < 1). Hence the AA T method does not perform well on this model.
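The contrast between the two similarity matrices can be made concrete. The snippet below is a sketch, assuming a sum-of-weights collider formulation as our reading of Section 3.1; the function name `source_similarities` is ours.

```python
import numpy as np

def source_similarities(A):
    """Two source-vertex similarity matrices for a weighted bipartite graph.

    A is the |S| x |D| weighted biadjacency matrix. The AA^T similarity
    multiplies the weights of the two edges in each collider, whereas a
    sum-based collider similarity adds them, so it scales linearly
    rather than quadratically in the edge weights.
    """
    J = (A > 0).astype(float)              # unweighted indicator matrix
    prod_sim = A @ A.T                     # product of edge weights
    sum_sim = A @ J.T + J @ A.T            # sum of edge weights over colliders
    np.fill_diagonal(prod_sim, 0)
    np.fill_diagonal(sum_sim, 0)
    return prod_sim, sum_sim

# Two sources sharing one destination, with edge weights 2 and 3:
# the product similarity is 2 * 3 = 6, the sum similarity is 2 + 3 = 5.
A = np.array([[2.0], [3.0]])
prod_sim, sum_sim = source_similarities(A)
```

With weights below 1, the product shrinks the within-cluster signal while the sum does not, matching the behavior described above.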

Applications to Real World Data
This section details the results of our proposed methodology on a number of directed networks arising from real world data sets. We show that weighted edges and motif-based clustering allow different structures to be uncovered in migration data with the US-Migration network, while the US-Political-Blogs network demonstrates how motif-based methods can control misclassification error for weakly-connected vertices. Finally, we consider the bipartite territory-to-language network, Unicode-Languages, and show that our bipartite clustering method produces meaningful clusters.

US-Migration network
The first data set is the US-Migration network (U.S. Census Bureau, 2002), consisting of data collected during the US Census in 2000. Vertices represent the 3075 counties in the 49 contiguous states (excluding Alaska and Hawaii, and including the District of Columbia). The 721 432 weighted directed edges represent the number of people migrating from county to county, capped at 10 000 (the 99.9th percentile) to control large entries, as in Cucuringu et al (2019b).

We test the performance of Algorithm 2 with three selected motifs: M s, M 6 and M 9 (see Figure 1). M s gives a spectral clustering method with naive symmetrization. M 6 represents a pair of counties exchanging migrants, with both also receiving migrants from a third. M 9 is a path of length two, allowing counties to be deemed similar if they both lie on a heavily weighted 2-path in the migration network.

Figure 9 plots maps of the US, with counties colored by the clustering obtained using Algorithm 2 with k = l = 7. The top row shows clusterings from the unweighted graph, while those in the bottom row are from the weighted graph. The advantage of using weighted graphs is clear, with the clustering showing better spatial coherence (compared with the fragmented pale yellow clusters produced by the unweighted graph). Furthermore, the different motifs uncover distinct structures. For example, M 6 and M 9 both allocate a cluster to the southern states of Mississippi, Alabama, Georgia and Tennessee, while M s identifies the northern states of Michigan and Wisconsin. Also, M 6 favors a larger "central" region, which includes significant parts of Colorado, Oklahoma, Arkansas and Illinois.
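The weight-capping preprocessing step above can be sketched as follows (a minimal illustration; the function name `cap_weights` is ours, and the toy percentile below is chosen only to make the effect visible on four values).

```python
import numpy as np

def cap_weights(weights, q=99.9):
    """Cap edge weights at the q-th percentile to control large entries."""
    cap = np.percentile(weights, q)
    return np.minimum(weights, cap)

# A toy example with one extreme weight; for US-Migration the paper caps
# at the 99.9th percentile, corresponding to a cap of 10 000.
w = np.array([1.0, 2.0, 3.0, 1000.0])
capped = cap_weights(w, q=75)
```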

US-Political-Blogs network
The US-Political-Blogs network (Adamic and Glance, 2005) consists of data collected two months before the 2004 US election. Vertices represent blogs, and are labeled by their political leaning ("liberal" or "conservative"). Weighted directed edges represent the number of citations from one blog to another. After restricting to the largest connected component, there are 536 liberal blogs, 636 conservative blogs (total 1222) and 19 024 edges. The network is plotted in Figure 10.
Traditional two-cluster spectral clustering performs very poorly on this network, with one of the clusters containing just four vertices (circled in Figure 10), which are very weakly connected to the rest of the graph. However, motif-based clustering performs significantly better. The choice of motif determines the trade-off between the number of vertices assigned to clusters, and the accuracy of those clusters (see Figure 10 for a plot of ARI score against number of vertices clustered). Motif M 9 clusters 1197 vertices with an ARI of 0.82, while the more strongly connected M 4 only clusters 378 vertices, with an improved ARI of 0.92. This is because the largest connected component of the MAM will be smaller for the more strongly-connected motifs such as M 4 . This phenomenon of not assigning every vertex to a cluster could therefore be seen as a method for reducing misclassification error. In effect, the algorithm refrains from allocating weakly-connected vertices to any cluster.

Unicode-Languages bipartite network
The Unicode-Languages network (KONECT: The Koblenz Network Collection, 2019) is a bipartite network consisting of data collected in 2014 on languages spoken around the world. Source vertices are territories, and destination vertices are languages. Weighted edges from territory to language indicate the number of inhabitants (GeoNames, 2019) in that territory who speak the specified language. After removing territories with fewer than one million inhabitants, and languages with fewer than one million speakers, there are 155 territories, 270 languages and 705 edges remaining.
We test Algorithm 3 with parameters k S = l S = k D = l D = 6 on this network. For the source vertices, Figure 11 plots maps of the world with territories colored by the recovered clustering. The top 20 territories (by population) in each cluster are given in Table 1. Cluster 1 is by far the largest cluster, and includes a wide variety of territories. Cluster 2 contains several Persian-speaking and African French-speaking territories, while Cluster 3 mostly captures Spanish-speaking territories in the Americas. Cluster 4 includes the Slavic territories of Russia and some of its neighbors. Cluster 5 covers China, Hong Kong, Mongolia and some of South-East Asia. Cluster 6 is the smallest cluster and contains only Japan and the Koreas. There are a few territories and languages which are not contained in the large connected component of the network due to their linguistic isolation. These territories are Laos, Norway and Timor-Leste, and the languages are Lao, Norwegian Bokmål and Norwegian Nynorsk.
For the destination vertices, we present the six clusters obtained by Algorithm 3. Table 2 contains the top 20 languages (by number of speakers) in each cluster. Cluster 1 is the largest cluster and contains some European languages and dialects of Arabic. Cluster 2 is also large and includes English, as well as several South Asian languages. Cluster 3 consists of many indigenous African languages. Cluster 4 captures languages from South-East Asia, mostly spoken in Indonesia and Malaysia. Cluster 5 identifies several varieties of Chinese and a few other Central and East Asian languages. Cluster 6 captures more South-East Asian languages, this time from Thailand, Myanmar and Cambodia.

[Table residue: cluster sizes |Cluster 1| = 87, |Cluster 2| = 29, |Cluster 3| = 15, |Cluster 4| = 11, |Cluster 5| = 10, |Cluster 6| = 3.]

Contribution

We have introduced generalizations of MAMs to weighted directed graphs and presented new matrix-based formulae for calculating them, which are easy to implement and scalable. By leveraging the popular random-walk spectral clustering algorithm, we have proposed motif-based techniques for clustering both weighted directed graphs and weighted bipartite graphs. We demonstrated using synthetic data examples that accounting for edge weights and higher-order structure can be essential for good cluster recovery in directed and bipartite graphs, and that different motifs can uncover different, but equally insightful, clusters. Applications to real world networks have shown that weighted motif-based clustering can reveal a variety of different structures, and that it can reduce the misclassification rate.
Future work

There are limitations to our proposed methodology, which raise a number of research avenues to explore. While our matrix-based formulae for MAMs are simple to implement and scalable, there is scope for further computational improvement. As mentioned in Benson et al (2016), fast triangle enumeration algorithms (Demeyer et al, 2013; Wernicke, 2006; Wernicke and Rasche, 2006) may offer superior performance at scale for triangular motifs, although MAMs for motifs with missing edges are inherently denser. The implementation could also be optimized to further exploit any sparsity structures in the network, rather than relying on general-purpose linear algebra libraries.

[Table residue: cluster sizes |Cluster 1| = 120, |Cluster 2| = 90, |Cluster 3| = 25, |Cluster 4| = 15, |Cluster 5| = 13, |Cluster 6| = 7.]
We have seen that choosing the correct motif for clustering a network can play a crucial role (Section 4.4), but we lack a principled method for motif selection. It may be that concepts such as motif conductance (Benson et al, 2016) or other suitable objective functions can guide the motif selection process. We would also like to see a more principled approach to dealing with disconnected MAMs, perhaps involving some criterion for when a connected component is "large enough" to be worth keeping. Simple extensions to our methodology could involve further analysis of the differences between clustering methods based on functional and structural MAMs, or investigating the effects of replacing the random-walk Laplacian with the un-normalized Laplacian or the symmetric normalized Laplacian (Von Luxburg, 2007). Since we have considered only unweighted or Poisson-weighted DSBMs, it would also be interesting to investigate other types of weighted DSBMs, perhaps following the exponential-family approach detailed by Aicher et al (2013). Further experimental work is also desirable. We would like to explore the utility of our framework on additional real data: collaboration networks such as Leskovec and Krevl (2007), transportation networks such as the airports network in Frey and Dueck (2007), and bipartite preference networks such as Clauset et al (2007) could all be interesting. Direct comparison of results with other state-of-the-art clustering methods from the literature could also be insightful; suitable benchmarks for performance include Hermitian matrix-based clustering, the Motif-based Approximate Personalized PageRank (MAPPR) method (Yin et al, 2017), and Tectonic (Tsourakakis et al, 2017).
Other established and fast methods from the literature on directed networks include SaPa (Satuluri and Parthasarathy, 2011) and DI-SIM (Rohe et al, 2016), which respectively perform a degree-discounted symmetrization; and a combined row and column clustering into four sets using the concatenation of the left and right singular vectors.
Further potential extensions of our work pertain to the detection of overlapping clusters, allowing nodes to belong to multiple groups (Li et al, 2016), to the detection of core-periphery structures in directed graphs (Elliott et al, 2019a), as well as to implications for existing methods on network comparison (Wegner et al, 2018), anomaly detection in directed networks (Elliott et al, 2019b), and ranking from a sparse set of noisy pairwise comparisons (Cucuringu, 2016). Furthermore, incorporating our pipeline into deep learning architectures is another avenue worth exploring, building on the recent MotifNet architecture (Monti et al, 2018), a graph convolutional neural network for directed graphs, or on SiGAT (Huang et al, 2019), another recent graph neural network which incorporates graph motifs in the context of signed networks containing both positive and negative links.

A Motif Adjacency Matrix Formulae
We give explicit matrix-based formulae for motif adjacency matrices M for all simple motifs M on at most three vertices, along with the anchored motifs M coll and M expa. Entry-wise (Hadamard) products are denoted by •. Table 3 gives our dense formulation of functional MAMs. For the dense formulation of structural MAMs, simply replace J n, J and G with J 0, J s and G s respectively. For the sparse formulation of functional MAMs, replace J n with 1 − I, and expand and simplify the expressions as in Section 3.4. For the sparse formulation of structural MAMs, replace J and G with J s and G s respectively, and replace J 0 with 1 − (I + J̃), where J̃ = J s + J s T + J d; then expand and simplify the expressions.
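The ingredient matrices appearing in these formulae can be built directly from the weighted adjacency matrix G. The following is a sketch of plausible definitions consistent with the substitutions above (the function name is ours; verify against Table 3 and the paper's notation before use).

```python
import numpy as np

def indicator_matrices(G):
    """Indicator matrices used in the MAM formulae (a sketch of the
    conventions in this section, not a verbatim reproduction)."""
    n = G.shape[0]
    I = np.eye(n)
    J = (G > 0).astype(float)                  # edge indicator
    J_d = J * J.T                              # bidirectional (double) edges
    J_s = J - J_d                              # single (unreciprocated) edges
    G_s = G * J_s                              # weights on single edges
    J_n = np.ones((n, n)) - I                  # off-diagonal ones
    J_tilde = J_s + J_s.T + J_d                # any edge, either direction
    J_0 = np.ones((n, n)) - (I + J_tilde)      # "missing edge" indicator
    return dict(J=J, J_s=J_s, J_d=J_d, G_s=G_s, J_n=J_n, J_0=J_0)

# A three-vertex example: a single edge 0 -> 1 and a reciprocated
# pair 1 <-> 2, with no edge between vertices 0 and 2.
G = np.array([[0, 2, 0],
              [0, 0, 3],
              [0, 4, 0]], dtype=float)
mats = indicator_matrices(G)
```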

C Spectral Clustering
We provide a summary of traditional random-walk spectral clustering and show how it applies to motif-based clustering.

C.1 Overview of spectral clustering
For an undirected graph, the objects to be clustered are the vertices of the graph, and a similarity matrix is provided by the graph's adjacency matrix G. To cluster directed graphs, the adjacency matrix must first be symmetrized, perhaps by considering G + G T (Meilȃ and Pentney, 2007). This symmetrization ignores information about edge direction and higher-order structures, and can lead to poor performance (Benson et al, 2016). Spectral clustering consists of two steps. Firstly, eigendecomposition of a Laplacian matrix embeds the vertices into R l . The k clusters are then extracted from this space.

C.2 Graph Laplacians
The Laplacians of an undirected graph are a family of matrices which play a central role in spectral clustering. While many different graph Laplacians are available, we focus on just the random-walk Laplacian, for reasons concerning objective functions, consistency and computation (Von Luxburg et al, 2004; Von Luxburg, 2007). The random-walk Laplacian of an undirected graph G with (symmetric) adjacency matrix G is given by L rw := I − D −1 G, where I is the identity matrix and D is the diagonal matrix of weighted degrees, D ii := Σ j G ij.
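The random-walk Laplacian is straightforward to compute; a minimal sketch (our function name):

```python
import numpy as np

def random_walk_laplacian(G):
    """Random-walk Laplacian L_rw = I - D^{-1} G of an undirected graph.

    G is a symmetric weighted adjacency matrix; D is the diagonal matrix
    of weighted degrees. All vertices must have nonzero degree.
    """
    d = G.sum(axis=1)
    return np.eye(G.shape[0]) - G / d[:, None]

# A path graph on three vertices. Since the rows of D^{-1} G sum to 1,
# the rows of L_rw sum to 0, and the constant vector is an eigenvector
# with eigenvalue 0.
G = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = random_walk_laplacian(G)
```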

C.3 Graph cuts
Graph cuts provide objective functions which we seek to minimize while clustering the vertices of a graph.
Definition C.1. Let G be a graph and let P 1, . . ., P k be a partition of V. Then the normalized cut (Shi and Malik, 2000) of G with respect to P 1, . . ., P k is

Ncut(P 1, . . ., P k) := Σ i cut(P i, P̄ i) / vol(P i),

where cut(P i, P̄ i) := Σ u∈P i, v∈V\P i G uv and vol(P i) := Σ u∈P i D uu.

Note that more desirable partitions have a lower Ncut value; the numerators penalize partitions which cut a large number of heavily weighted edges, and the denominators penalize partitions which have highly imbalanced cluster sizes. It can be shown (Von Luxburg, 2007) that minimizing Ncut over partitions P 1, . . ., P k is equivalent to finding the cluster indicator matrix H ∈ R n×k minimizing Tr(H T (D − G) H) subject to H ij = vol(P j)^(−1/2) I{v i ∈ P j} (†) and H T D H = I. Solving this problem is in general NP-hard (Wagner and Wagner, 1993). However, by dropping the constraint (†) and applying the Rayleigh Principle (Lütkepohl, 1996), we find that the solution to this relaxed problem is the matrix H containing the first k eigenvectors of L rw as columns (Von Luxburg, 2007). In practice, to find k clusters it is often sufficient to use only the first l ≤ k eigenvectors of L rw.

C.4 Cluster extraction
Once Laplacian eigendecomposition has been used to embed the data into R l , the clusters may be extracted using a variety of methods. We propose k-means++ (Arthur and Vassilvitskii, 2007), a popular clustering algorithm for data in R l , as an appropriate technique. It aims to minimize the within-cluster sum of squares, based on the standard Euclidean metric on R l . This makes it a reasonable candidate for clustering spectral data, since the Euclidean metric corresponds to notions of "diffusion distance" in the original graph (Nadler et al, 2006).

C.5 Random-walk spectral clustering
Algorithm 1 gives random-walk spectral clustering (Von Luxburg, 2007), which takes a symmetric connected adjacency matrix as input. We drop the first column of H (the first eigenvector of L rw ) since although it should be constant and uninformative, numerical imprecision may give unwanted artifacts. It is worth noting that although the relaxation used in Appendix C.3 is reasonable and often leads to good approximate solutions of the Ncut problem, there are cases where it performs poorly (Guattery and Miller, 1998). The Cheeger inequality (Chung, 2005) gives a bound on the error introduced by this relaxation.
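A compact sketch of Algorithm 1, assuming scipy and scikit-learn are available (the function name is ours, and the eigenvectors of L rw are obtained via the equivalent symmetric-definite generalized problem (D − G) v = λ D v, a standard reformulation):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def rw_spectral_clustering(G, k, l):
    """A sketch of random-walk spectral clustering (Algorithm 1).

    G must be a symmetric, connected weighted adjacency matrix. The
    generalized problem (D - G) v = lambda * D v has the same
    eigenvectors as L_rw = I - D^{-1} G.
    """
    D = np.diag(G.sum(axis=1))
    vals, vecs = eigh(D - G, D)     # generalized eigenpairs, ascending
    H = vecs[:, 1:l]                # first l eigenvectors, dropping the
                                    # constant (uninformative) first one
    return KMeans(n_clusters=k, init="k-means++", n_init=10,
                  random_state=0).fit_predict(H)

# Two triangles joined by a single bridge edge: two clear clusters.
G = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    G[i, j] = G[j, i] = 1.0
labels = rw_spectral_clustering(G, k=2, l=2)
```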

C.6 Motif-based random-walk spectral clustering
Algorithm 2 gives motif-based random-walk spectral clustering. The algorithm forms a motif adjacency matrix, restricts it to its largest connected component, and applies random-walk spectral clustering (Algorithm 1) to produce clusters. The most computationally expensive part of Algorithm 2 is the calculation of the MAM using a formula from Table 3, as noted by Benson et al (2016) for their unweighted MAMs. The complexity of this is analyzed in Section 3.4.

C.7 Bipartite spectral clustering
Algorithm 3 gives our procedure for clustering a bipartite graph. The algorithm uses the collider and expander motifs to create similarity matrices for the source and destination vertices respectively (as in Section 3.2), and then applies random-walk spectral clustering (Algorithm 1) to produce the partitions.
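The source-vertex half of this procedure can be sketched as follows. This is a hedged illustration: the function name is ours, the collider similarity uses a sum-of-weights formulation as our reading of Section 3.2, and the similarity matrix is assumed connected so no component restriction is performed.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def cluster_sources(A, k_S, l_S):
    """A sketch of the source-vertex half of Algorithm 3.

    A is the |S| x |D| weighted biadjacency matrix.
    """
    J = (A > 0).astype(float)
    M = A @ J.T + J @ A.T            # collider similarity between sources
    np.fill_diagonal(M, 0)
    D = np.diag(M.sum(axis=1))
    vals, vecs = eigh(D - M, D)      # random-walk Laplacian eigenpairs
    H = vecs[:, 1:l_S]               # drop the constant eigenvector
    return KMeans(n_clusters=k_S, n_init=10,
                  random_state=0).fit_predict(H)

# Four sources: 0 and 1 share destination 0, 2 and 3 share destination 1,
# with a weak overlap between sources 0 and 3 through destination 2.
A = np.array([[5.0, 0.0, 1.0],
              [5.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 5.0, 1.0]])
labels = cluster_sources(A, k_S=2, l_S=2)
```

Clustering the destination vertices with the expander motif amounts to applying the same function to A transposed.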

D Additional Synthetic Results
In this section we present additional results for some of our synthetic experiments.
Example 1 In Figure 12 we present the results for structural motifs, to complement the functional motifs presented in the main paper. We note that when we consider structural motifs, the unweighted motifs perform better but are still outperformed by our weighted method.
Example 2 - Experiment 1 In Figure 13 we present the results for the functional motifs for Example 2 Experiment 1. As noted in the main paper, our approaches outperform the baseline, but we do not observe the same bi-modal structure: M 1 outperforms all other approaches for q 1 < 0.6, and there is no recovery in performance for high values of q 1.
Example 2 - Experiment 2 In Figure 14 we present the structural version of Example 2 Experiment 2. We observe the same structures as in the functional version, with each motif highlighting a different but equally relevant structure. Finally, the symmetrized adjacency matrix (M s), for which the blocks are indistinguishable, does not robustly identify the higher-order structures for k = 2, but can uncover the blocks when clustering with k = 8 and l = 3.

[Figure caption fragment (cf. Figure 4): each block contains 100 nodes; we compare both to the unweighted case and to the symmetrized case; we perform 100 repeats, and error bars are one sample standard deviation.]

Figure 14: The detected groups uncovered by each method using structural motifs, with k = 2 and l = 2, with the exception of the last column, which has k = 8 and l = 3. We test 100 replicates and present the results as columns in the plot. We order and color the columns as in Figure 6. While the clustering assignments may contain some errors, the results are relatively robust across replicates.