Graph neural network inspired algorithm for unsupervised network community detection

Network community detection often relies on optimizing partition quality functions, like modularity. This optimization appears to be a complex problem traditionally relying on discrete heuristics. Although the problem can be reformulated as continuous optimization, direct application of standard optimization methods has limited efficiency in overcoming the numerous local extrema. However, the rise of deep learning and its applications to graphs offers new opportunities. While graph neural networks have been used for supervised and unsupervised learning on networks, their application to modularity optimization has not yet been explored. This paper proposes a new variant of the recurrent graph neural network algorithm for unsupervised network community detection through modularity optimization. The new algorithm's performance is compared against state-of-the-art methods. The approach also serves as a proof of concept for the broader application of recurrent graph neural networks to unsupervised network optimization.

problems. Such problems often arise when the aim is to reconstruct nodes' attributes based on their features and the network structure, including various types of unsupervised graph clustering (Kampffmeyer et al. 2019; Bianchi 2022). Following the work applying machine learning to combinatorial optimization problems by Bengio et al. (2021), several attempts to apply GNNs to hard combinatorial optimization problems were made recently. The first promising results were obtained for the problems of minimum vertex cover, maximal clique, maximal independent set, and the satisfiability problem (Li et al. 2018). Furthermore, various GNN architectures were adapted to address the graph correlation clustering problem formulated as the minimum cost multicuts problem (Jung and Keuper 2022). Initial steps were even taken towards graph clustering via maximizing some variant of the modularity function (Lobov and Ivanov 2019; Tsitsulin et al. 2020). However, their results still fall behind the current state-of-the-art approaches in terms of modularity score. In this work, we show that with GNN-inspired techniques it is possible to achieve more practical results close to the state-of-the-art.
In the following sections, we first recall the formulation of community detection through modularity maximization and show how it can be framed as continuous quadratic optimization. Then we propose our GNN-inspired method and describe how to select its parameters. Lastly, we evaluate our approach using three benchmarks: classic real-world networks used previously in the literature to test modularity maximization algorithms, a synthetic-network benchmark, and a temporal network of taxi trips. The paper ends with conclusions and discussion.

The modularity optimization problem
The network modularity was among the first quality/objective functions proposed to assess and optimize the community structure. However, it is now known to have certain shortcomings, including a resolution limit (Fortunato and Barthélémy 2007; Good et al. 2010) and the fact that, lacking a proper baseline comparison, it finds communities even in random networks. Therefore alternative objective functions should be mentioned, e.g., the Infomap description code length (Rosvall and Bergstrom 2007, 2008), the Stochastic Block Model likelihood (Ball et al. 2011; Bickel and Chen 2009; Yan et al. 2014), and Surprise (Aldecoa and Marín 2011). Nevertheless, despite its limitations, modularity remains perhaps the most commonly used objective function so far.
In 2014, the authors proposed a novel optimization technique for community detection, "Combo" (Sobolevsky et al. 2014), capable of maximizing various objective functions, including modularity, description code length, and virtually any other metric based on scoring the links and assessing the cumulative score of the intra-community links. At the time of publication, for modularity optimization, Combo outperformed other state-of-the-art algorithms, including the popular Louvain method (Blondel et al. 2008), in terms of the quality (modularity score) of the resulting partitioning, which could be achieved within a reasonable time for most real-world and synthetic networks of up to tens of thousands of nodes. The size limitation for the algorithm evaluation is due to the current implementation holding the full modularity matrix in memory. However, this is not a fundamental limitation and could be overcome by using sparse matrix operations. Recently, more precise algorithms were proposed, but they are slower and often designed to run on a distributed cluster (Džamić et al. 2019; Biedermann et al. 2018; Hamann et al. 2018; Lu et al. 2015). Moreover, their code is not available.
The proposed algorithms, including Combo, are often quite efficient and, in some cases, are able to reach the theoretical maximum of the modularity score as revealed by a suitable upper bound estimate. However, in general, finding the theoretically optimal solution may not be feasible, and one has to rely on heuristic algorithmic solutions without being certain of their optimality. Instead, an empirical assessment of their performance in comparison with other available algorithms could be performed.

The modularity function
In short, the modularity (Newman and Girvan 2004) function of the proposed network partition quantifies how relatively strong all the edges between the nodes attached to the same community are. Specifically, if the network edge weights between each pair of nodes i, j are denoted as e_{i,j}, then the modularity of the partition com(i) (expressed as a mapping assigning community number com to each node i) can be defined as

M = \sum_{i,j:\, com(i)=com(j)} q_{i,j},    (1)

where the quantity q_{i,j} for each edge i, j (call q a modularity score for an edge) is defined as its normalized relative edge weight in comparison with the random network model with the same node strengths. Namely,

q_{i,j} = \frac{e_{i,j}}{T} - \frac{w_{out}(i)\, w_{in}(j)}{T^2},

where w_{out}(i) = \sum_k e_{i,k}, w_{in}(j) = \sum_k e_{k,j}, and T = \sum_i w_{out}(i) = \sum_j w_{in}(j) = \sum_{i,j} e_{i,j}.
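The definitions above can be illustrated with a short numerical sketch. This is a minimal illustration, not the paper's implementation; the toy graph and function names are ours:

```python
import numpy as np

def modularity_matrix(E):
    """Matrix Q of per-edge scores q_{i,j} = e_{i,j}/T - w_out(i)*w_in(j)/T^2."""
    T = E.sum()
    w_out = E.sum(axis=1)  # node out-strengths
    w_in = E.sum(axis=0)   # node in-strengths
    return E / T - np.outer(w_out, w_in) / T**2

def modularity(E, com):
    """Sum q_{i,j} over all pairs of nodes attached to the same community."""
    Q = modularity_matrix(E)
    com = np.asarray(com)
    same = com[:, None] == com[None, :]
    return Q[same].sum()

# toy example: two 3-node cliques joined by a single bridge edge
E = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    E[a, b] = E[b, a] = 1.0
print(modularity(E, [0, 0, 0, 1, 1, 1]))  # clearly positive for the natural split
```

Note that placing all nodes in one community yields modularity exactly zero, since the entries of Q sum to 1 - T^2/T^2 = 0.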
Rewrite the modularity optimization problem in a vector form: let Q = (q_{i,j}) be the matrix of all the modularity scores for all the edges (call it a modularity matrix). Let C be an n × k matrix, where n is the number of network nodes and k is the number of communities we are looking to build. Each element c_{i,p} of the matrix equals one or zero depending on whether the node i belongs to the community p or not, i.e., whether com(i) = p. If the communities are not overlapping, then each row of the matrix has a single unit element, and the rest of its elements are zeros.
More generally, if we admit uncertainty in community attachment, then the elements c_{i,p} of the matrix C could represent the probabilities of the node i to be attached to the community p. This way, c_{i,p} ∈ [0, 1] and the sum of each row of the matrix C equals 1.
Page 5 of 19 Sobolevsky and Belyi Applied Network Science (2022) 7:63

Then the modularity score M in the case of a discrete community attachment could be represented as a trace of a matrix product:

M = tr(C^T Q C),    (2)

where tr denotes the trace of the matrix, i.e., the sum of all of its diagonal elements.
This way, finding the community structure of up to k communities optimizing the network modularity could be expressed as a constrained quadratic optimization problem: find the n × k matrix C maximizing the trace of the matrix product M = tr(C^T Q C), such that all c_{i,p} ∈ {0, 1} and the sum of each row of the matrix C equals 1 (i.e., each row has a single unit element).
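The equivalence between the trace form and the pairwise sum in (1) can be checked numerically; this is a quick sanity check with an arbitrary score matrix, not part of the original evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
Q = rng.normal(size=(n, n))          # any modularity-score matrix
com = rng.integers(0, k, size=n)     # a discrete partition

C = np.zeros((n, k))
C[np.arange(n), com] = 1.0           # one-hot community-attachment matrix

M_trace = np.trace(C.T @ Q @ C)
M_pairs = sum(Q[i, j] for i in range(n) for j in range(n) if com[i] == com[j])
print(np.isclose(M_trace, M_pairs))  # True
```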
Replacing the binary attachment constraint c_{i,p} ∈ {0, 1} with a continuous attachment c_{i,p} ∈ [0, 1] relaxes the optimization problem to finding probabilistic community attachments. It can be easily shown that the optimal solution of the binary attachment problem could be derived from the optimal solution of the probabilistic attachment problem after assigning q_{i,i} = 0 for all the diagonal elements of the matrix Q. As the diagonal elements q_{i,i} are always included in the sum M (since com(i) = com(j) holds trivially for i = j), their values serve as a constant adjustment of the objective function M and do not affect the choice of the optimal partition, so we are free to null them without loss of generality. At the same time, for each given i, once q_{i,i} = 0, if we fix the community attachments of all the other nodes j ≠ i, the objective function M becomes a linear function of the variables c_{i,p} subject to the constraints Σ_p c_{i,p} = 1 and c_{i,p} ∈ [0, 1]. Obviously, the maximum of a linear function with linear constraints is reached at one of the vertices of the domain of the allowed values for c_{i,p}, which involves a single c_{i,p} being one and the rest being zeros. This way, we have proven the following: Proposition. The optimal probabilistic attachment c_{i,p} ∈ [0, 1] maximizing (2) in the case of q_{i,i} = 0 represents a binary attachment c_{i,p} ∈ {0, 1} maximizing (2) for an arbitrary original Q.
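The argument about the diagonal can be verified directly: the diagonal contributes the same constant tr(Q) to the modularity of every partition, so zeroing it never changes which partition is optimal. A small numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
Q = rng.normal(size=(n, n))
Q0 = Q.copy()
np.fill_diagonal(Q0, 0.0)  # the diagonal-free version of Q

def M(Q, com):
    """Modularity of a discrete partition: sum of q_{i,j} over same-community pairs."""
    com = np.asarray(com)
    return Q[com[:, None] == com[None, :]].sum()

# the diagonal contributes the same constant trace(Q) to every partition,
# so zeroing it shifts all objective values equally
for com in ([0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1], [0, 0, 1, 1, 2, 2]):
    assert np.isclose(M(Q, com) - M(Q0, com), np.trace(Q))
print("diagonal shift is constant across partitions")
```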
So the discrete community detection problem through modularity optimization could be solved within the continuous constrained quadratic optimization framework. This allows the application of methods and techniques developed for continuous optimization, such as gradient descent. However, despite its analytic simplicity, the quadratic programming problem with an indefinite matrix Q is still NP-hard. In particular, the dimensionality of the problem leads to multiple local maxima challenging the direct application of standard continuous optimization techniques, like gradient descent. Indeed, any discrete partition such that no single node could be moved to a different community with a modularity gain will become such a local maximum. Unfortunately, finding such a local maximum rarely provides a plausible partition: such solutions could have been obtained with a simple greedy discrete heuristic iteratively adjusting single node attachments, while we know that modularity optimization, being NP-hard, generally requires more sophisticated non-greedy heuristics, like Sobolevsky et al. (2014). To address this challenge, we introduce a new heuristic method that efficiently finds high-quality solutions to the described quadratic optimization problem, although without guaranteed achievement of the global optimum.

The GNNS method
In this section, we present a GNN-style method (GNNS) for unsupervised network partition through modularity optimization inspired by recurrent graph neural network models (in the definition of Wu et al. (2020)) as well as the older Weisfeiler-Lehman graph node labeling algorithm (Weisfeiler and Leman 1968). Ideas similar to the Weisfeiler-Lehman algorithm have already found their application to community detection in the well-known label propagation algorithm (Raghavan et al. 2007). Their development and application to modularity maximization were proposed and studied in detail in subsequent works (Barber and Clark 2009; Liu and Murata 2010). However, those still were discrete optimization heuristics in spirit and could not take into account recent advances in graph neural networks. Unlike these methods, we propose a continuous optimization technique that considers current nodes' attachments combined with the attachments of their neighbors. Namely, we propose a simple iterative process starting with a random initial matrix C = C^0 and at each step t = 1, 2, 3, ..., N performing an iterative update of the rows c_i of the matrix C representing the node i community attachments as follows:

c_i^t = F(c_i^{t-1}, Q_i C^{t-1}),    (3)

where Q_i = (q_{i,j} : j = 1, 2, ..., n) is the i-th row of the modularity matrix Q representing the outgoing edges from the node i. This way, the term Q_i C^{t-1} collects information about the neighbor nodes' community attachments (this could be viewed as a development of the ideas discussed in Barber and Clark (2009)), and equation (3) updates the node community attachments with respect to their previous attachments as well as the neighbor node attachments. In order to ensure the conditions Σ_p c_{i,p} = 1, a further normalization c_{i,p}^t = c̃_{i,p}^t / Σ_{p*} c̃_{i,p*}^t (where c̃ denotes the values before normalization) needs to be applied at each iteration.
A simple form for the activation function F could be a superposition of a linear function, subject to appropriate scale normalization, and a rectified linear unit:

c̃_{i,p}^t = ReLU( f_0 + f_1 c_{i,p}^{t-1} + f_2 Q_i C_p^{t-1} / τ_i^t ),    (4)

where f_0, f_1, f_2 are the model parameters, C_p = (c_{j,p} : j = 1, 2, ..., n) is the p-th column of the matrix C representing all the node attachments to the community p, and the τ_i^t are the normalization coefficients ensuring the same scale for the terms of the formula.
Intuitive considerations allow defining possible ranges for the model coefficients f_0, f_1, f_2. Defining the coefficient f_1 within the range f_1 ∈ [0, 1] would ensure decay scaling of the community attachment at each iteration unless confirmed by the strength of the node's attachment to the rest of the community expressed by Q_i C_p^{t-1}. A free term f_0 ∈ [−1, 0] provides some additional constant decay of the community attachment at each iteration, while the term Q_i C_p^{t-1} / τ_i^t strengthens the attachment of node i to those communities having positive modularity scores of the edges between node i and the rest of the community. The normalization term τ_i^t ensures that the strongest community attachment gets a maximum improvement of a fixed scale f_2.
A consistent node attachment that cannot be improved by assigning the given node i to a different community p (i.e., one having the maximal neighbor-attachment score Q_i C_p^{t-1} among all communities) should see the fastest increase in the attachment score c_{i,p}^t, eventually converging to the case of c_{i,p} = 1. Any weaker attachment should see a decreasing community membership score c_{i,p}^t, eventually dropping to zero. This could be ensured by the balancing equation f_1 + f_2 + f_0 = 1, allowing to define the appropriate f_2 given f_0 and f_1.
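Putting equations (3)-(4) and the balancing condition together, one iteration can be sketched as follows. This is our own minimal reading of the update: in particular, the choice of τ as the largest |Q_i C_p| per node is an assumption, since the text only requires τ to normalize the scale of the neighbor term; the zero-row restart is also our safeguard:

```python
import numpy as np

def gnns_step(C, Q, f0=-0.2, f1=0.6):
    """One GNNS iteration: linear update plus ReLU (eq. 4), then row normalization.
    The tau scaling is an assumption: the largest |Q_i C_p| per node is scaled to f2."""
    f2 = 1.0 - f0 - f1                       # balancing condition f0 + f1 + f2 = 1
    S = Q @ C                                # neighbor-attachment scores Q_i C_p
    tau = np.abs(S).max(axis=1, keepdims=True) + 1e-12
    C_new = np.maximum(f0 + f1 * C + f2 * S / tau, 0.0)   # ReLU
    totals = C_new.sum(axis=1, keepdims=True)
    # if a row collapses to all zeros, restart it uniformly (our safeguard)
    return np.where(totals > 0, C_new / np.maximum(totals, 1e-12), 1.0 / C.shape[1])
```

Under this reading, a node firmly attached to its best community (c_{i,p} = 1 with the maximal score) receives f0 + f1 + f2 = 1 and stays a fixed point, as the argument above requires.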

Training the GNNS
The sequence of the GNNS iterations (4) depends on the choice of the model parameters f_0, f_1, as well as the initial community attachments. The final convergence also often depends on those choices. Given that, a good strategy is to simulate multiple iteration sequences with different initial attachments and choose the best final result. Also, it turns out that the method demonstrates reasonable performance for a broad range of parameter values f_0 ∈ [−1, 0], f_1 ∈ [0, 1], so rather than trying to fit the best choice for all the networks or a given network, one may simply include a random choice of f_0, f_1 along with a random choice of the initial community attachments.
So the proposed GNNS algorithm starts with a certain number S of random partitions and parameter choices. Then the GNNS performs 10 iterations of the partition updates according to (4). Among those, a batch of the best ⌊S/3⌋ partitions (with the highest achieved modularity scores) is selected and further supplemented with another S − ⌊S/3⌋ configurations derived from the selected batch by randomly shuffling partitions and assigning new random parameters. Another 10 iterations are performed. Then another batch of the ⌊S/9⌋ best partitions is selected and shuffled, creating a total of ⌊S/3⌋ samples. Another 30 iterations are performed with those, and for small S (S ≤ 1000), the best resulting partition is selected as the final outcome of the algorithm. For larger S (S > 1000), another batch of the ⌊S/30⌋ best partitions is selected and shuffled, creating a total of ⌊S/10⌋ samples. Then the final 100 iterations are performed with those, and the best resulting partition is selected as the final outcome of the algorithm. Finally, the partition is discretized by assigning each node to the cluster with the maximum probability. The following algorithm describes the whole process.
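The staged schedule (for the small-S branch, S ≤ 1000) can be sketched as follows. This is a loose illustration, not the authors' implementation: the update rule is a simplified stand-in for equation (4), and the refill sizes only approximate the ⌊S/3⌋ and ⌊S/9⌋ batches described above:

```python
import numpy as np

def gnns_schedule(Q, k, S=100, seed=0):
    """Staged GNNS search (small-S branch): 10 iterations for S samples, keep the
    best floor(S/3) and refill, 10 more iterations, keep floor(S/9) and refill,
    30 final iterations, return the best discretized partition."""
    rng = np.random.default_rng(seed)

    def step(C, f0, f1):
        f2 = 1.0 - f0 - f1                    # balancing condition
        S_ = Q @ C                            # neighbor-attachment scores
        tau = np.abs(S_).max(axis=1, keepdims=True) + 1e-12
        C = np.maximum(f0 + f1 * C + f2 * S_ / tau, 0.0)
        tot = C.sum(axis=1, keepdims=True)
        return np.where(tot > 0, C / np.maximum(tot, 1e-12), 1.0 / k)

    def score(C):                             # modularity of the discretized partition
        H = np.zeros_like(C)
        H[np.arange(len(C)), C.argmax(axis=1)] = 1.0
        return np.trace(H.T @ Q @ H)

    def new_sample():
        C = rng.random((Q.shape[0], k))
        return C / C.sum(axis=1, keepdims=True), rng.uniform(-1, 0), rng.uniform(0, 1)

    pool = [new_sample() for _ in range(S)]
    for n_iter, keep in [(10, S // 3), (10, S // 9), (30, 1)]:
        for _ in range(n_iter):
            pool = [(step(C, f0, f1), f0, f1) for C, f0, f1 in pool]
        pool.sort(key=lambda t: score(t[0]), reverse=True)
        pool = pool[:max(keep, 1)]
        while len(pool) < max(3 * keep, 1):   # refill by reshuffling the survivors
            C, _, _ = pool[rng.integers(len(pool))]
            pool.append((C.copy(), rng.uniform(-1, 0), rng.uniform(0, 1)))
    return pool[0][0].argmax(axis=1)          # discretized final partition
```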
If the matrix Q is stored as a sparse graph matrix together with two separate vectors of in- and out-strengths, this algorithm has a time complexity of O(S i m k), where S is the number of attempts with different initial attachments and parameter values, i is the number of iterations, m is the number of edges, and k is the maximum number of communities. The value k is usually selected as an (educated) guess of the expected maximum number of communities or as the maximum feasible value. One nice property the GNNS inherits from neural networks is that it is easily parallelizable by modern frameworks.
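The sparse-storage remark can be made concrete: the product Q C needed by each iteration can be computed without ever materializing the dense modularity matrix, using only the sparse adjacency matrix and the two strength vectors. A sketch (our own helper, with a hypothetical name):

```python
import numpy as np
from scipy import sparse

def neighbor_scores_sparse(A, C):
    """Compute Q @ C without the dense modularity matrix:
    Q C = A C / T - w_out (w_in^T C) / T^2,
    requiring only the sparse adjacency A and the two strength vectors,
    i.e. O(mk + nk) work per call instead of O(n^2 k)."""
    T = A.sum()
    w_out = np.asarray(A.sum(axis=1)).ravel()
    w_in = np.asarray(A.sum(axis=0)).ravel()
    return (A @ C) / T - np.outer(w_out, w_in @ C) / T**2

# sanity check against the dense formula
A = sparse.random(50, 50, density=0.1, random_state=0, format="csr")
C = np.random.default_rng(0).random((50, 4))
Ad = A.toarray()
dense_Q = Ad / Ad.sum() - np.outer(Ad.sum(1), Ad.sum(0)) / Ad.sum()**2
print(np.allclose(neighbor_scores_sparse(A, C), dense_Q @ C))  # True
```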
Below we evaluate the three versions of the GNNS: a fast version with S = 100 denoted GNNS100, a slower but more precise version with S = 2500 denoted GNNS2500, and a slow but very precise version with S = 25000 denoted GNNS25000.

Comparative evaluation
We implemented the proposed GNNS modularity optimization algorithm in Python and ran our experiments in Google Colab (results of Combo runs on the three largest networks were obtained on a laptop because of the time limits in Colab). The source code is available on GitHub. We evaluate our algorithm against other state-of-the-art methods, including Combo (Sobolevsky et al. 2014), often capable of reaching the best known modularity score over a wide range of networks, and the ADVNDS method claimed to be the state-of-the-art method in 2017 (Džamić et al. 2019). Both papers introducing Combo and ADVNDS make a thorough comparison with other methods available at their time and conclude that those methods outperform their competitors. We also tried all other modularity maximization methods implemented in the CDLib library developed specifically for evaluating community detection methods (Rossetti et al. 2019). However, their performance was much worse both in terms of running time and achieved modularity scores. Some more recent methods claim to be the new state-of-the-art, but they are designed to be run on distributed systems, and their results cannot be fairly compared (Biedermann et al. 2018). Thus, first, we shall compare our method with four selected approaches over a sample of classic network examples. And then, all those methods but ADVNDS (whose implementation is not available to us) shall also be evaluated over two types of random graphs often used for benchmarking community detection algorithms: LFR graphs (Lancichinetti et al. 2008) and Stochastic Block-Model graphs (Holland et al. 1983).
As the GNNS chooses the best partition among multiple runs, we consider its three configurations involving a) 100, b) 2500, and c) 25000 initial random samples of partitions plus model configurations. We shall refer to those as GNNS100, GNNS2500, and GNNS25000. All other algorithms in our comparison also involve random steps, and the best partition they converge to is not perfectly stable. Thus their performance could also benefit from choosing the best partition among multiple runs; for the Louvain method especially, it often takes up to 10-20 attempts to find the best partition it is capable of producing. E.g., applying Combo and Louvain to the classic case of the Email network leads to the performance reported in Table 1 below. As we see, both reach their best performance after 20 iterations (although the results of Combo for this network are significantly better compared to Louvain) and do not further improve over the subsequent 30 iterations. Based on that, in the further experiments, we shall report the best performance of the Combo, Louvain, and Leiden algorithms achieved after 20 attempts each. Since the implementation of the ADVNDS algorithm is not available to us, we provide the results reported in its original paper after ten runs, with the running time calculated as the reported average multiplied by ten (Džamić et al. 2019).
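The best-of-several-runs practice is straightforward to script. A sketch using the networkx implementation of Louvain (`louvain_communities`, available since networkx 2.8) rather than the igraph implementation used in the paper:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

def best_of_n_louvain(G, n_runs=20):
    """Run Louvain n_runs times with different seeds, keep the best-modularity partition."""
    best_part, best_q = None, float("-inf")
    for seed in range(n_runs):
        part = louvain_communities(G, seed=seed)
        q = modularity(G, part)
        if q > best_q:
            best_part, best_q = part, q
    return best_part, best_q

# example on a classic small network
G = nx.karate_club_graph()
part, q = best_of_n_louvain(G)
print(len(part), round(q, 3))
```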

Classic examples
Most of the classic instances were taken from the clustering chapter of the 10th DIMACS Implementation Challenge and were often reused in the community detection literature (Sanders et al. 2014). Table 2 reports the sources and details of those networks. Since the complexity of our method is proportional to the maximum number of communities, we limited this dataset to the networks with less than three hundred communities, according to the ADVNDS paper (Džamić et al. 2019). All originally directed networks were symmetrized, and self-loops were removed. The largest network, Krong-500slogn16, is synthetic, while all other networks represent real-world data. Tables 3 and 4 report the achieved modularity scores and the running times. Missing values in the ADVNDS column mean that the corresponding networks were not present in the original paper. For the Leiden and Louvain methods, we used their implementation in the Python igraph package, setting the number of iterations to −1, allowing the Leiden algorithm to run until the best modularity is achieved. The missing value in the Combo column corresponds to the network too large for the current implementation.
According to the results reported in Tables 3 and 4, Louvain is the fastest algorithm, closely followed by the GNNS100 method, especially for larger networks, with GNNS100 often providing higher modularity scores, especially for smaller networks, while ADVNDS, Combo, and GNNS25000 are by far the slowest, demonstrating, however, superior performance. GNNS2500 finds pretty good partitions for some networks, and it works much faster than ADVNDS and Combo. In general, its performance is comparable to Leiden: both algorithms work quickly and find partitions with modularity just a bit below the best known. Moreover, Combo could not handle the largest network, while GNNS2500 found a partition better than Leiden and did so almost six times faster. For that network, GNNS25000 finds the highest modularity score, outperforming all other methods, including ADVNDS. While ADVNDS reported the highest known modularity scores for all the networks where it was applied except the largest one, it is much slower than GNNS, and since its implementation is not available, we could not test it on other networks. Besides, the current GNNS implementation uses pure Python, while implementing it in C++ (as done for all other algorithms) could provide further speed improvements.
Overall, while no single heuristic is the best solution for all the cases, the GNNS algorithm often finds a plausible solution, sometimes the best-known one, and provides a flexible parameter-controlled trade-off between speed and quality, ranging from the fastest to the close-to-optimal performance, which makes it a valuable addition to the existing collection of algorithms. More importantly, since this simple GNN-style heuristic can perform comparably to the state-of-the-art, it serves as a proof of concept for considering more sophisticated GNN architectures, configurations, and learning techniques that could provide further improvement in solving the community detection problem as well as other complex network optimization problems, like minimum vertex cover and maximal independent set (Li et al. 2018), clique partitioning and correlation clustering (Jung and Keuper 2022), spectral clustering (Bianchi et al. 2020), and others (Bengio et al. 2021; Yow and Luo 2022).

Synthetic networks
In the next test, the methods were applied to two sets of synthetic networks: Lancichinetti-Fortunato-Radicchi (LFR) (Lancichinetti et al. 2008) and Stochastic Block-Model (SBM) (Holland et al. 1983) graphs. We built two sets of ten LFR networks of size 250 each, with overall average node degrees of d = 29.6 and d = 13.3, and three sets of ten SBM networks of size 300 for each value of the parameter ν, defined as the ratio of the probability for the model graph to have inner-community edges (three communities of size 100 each) to the probability of the inter-community edges. For all SBM networks, the average probability of an edge was 0.1, leading to the average node degree d = 30. The Python package networkx was used to generate these synthetic networks. Table 5 shows the average values of modularity, normalized mutual information (NMI), and execution time obtained after partitioning ten instances of each type using the Leiden, Louvain, Combo (best of the 20 runs), and GNNS100 algorithms. As we can see, Combo demonstrates superior performance in terms of modularity on all sets of networks except for the last SBM set with the highest ν = 3, where all algorithms achieve the same score. While GNNS100 demonstrates suboptimal performance for those networks compared to Combo, it works at unparalleled speed, several times faster than Louvain and Leiden and two orders of magnitude faster than Combo. Based on that, GNNS100 proves itself to be the fastest solution for partitioning the synthetic networks while demonstrating reasonable performance in terms of the modularity and NMI scores achieved.
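The SBM setup can be reproduced approximately with networkx. The parameterization below (solving for p_in and p_out from the ratio ν and the overall average edge probability 0.1) is our reconstruction of the description above, not the authors' exact generator settings:

```python
import networkx as nx

def sbm_probabilities(nu, sizes=(100, 100, 100), p_avg=0.1):
    """Solve for (p_in, p_out) given p_in / p_out = nu and an overall average
    edge probability p_avg, averaged over all node pairs (an assumed convention)."""
    n = sum(sizes)
    pairs_total = n * (n - 1) / 2
    pairs_in = sum(s * (s - 1) / 2 for s in sizes)
    frac_in = pairs_in / pairs_total
    p_out = p_avg / (frac_in * nu + (1 - frac_in))
    return nu * p_out, p_out

p_in, p_out = sbm_probabilities(nu=3)
P = [[p_in if a == b else p_out for b in range(3)] for a in range(3)]
G = nx.stochastic_block_model([100, 100, 100], P, seed=42)
print(G.number_of_nodes(), sum(d for _, d in G.degree()) / 300)  # 300 nodes, avg degree close to 30
```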

Temporal networks
Earlier works established the applicability of the GNN architecture for capturing dynamic properties of evolving graphs (Ma et al. 2020). As GNNS is well-suited for iterative partition improvement, it could be suggested for active learning of the temporal network partition. For example, an initial warm-up training could be performed over the first temporal layers, with subsequent tuning iterations while moving from the current temporal layer to the next one. Below we apply this approach to the temporal network of the daily taxi mobility between the taxi zones in New York City (NYC). We use the 2016-2017 data provided by the NYC Taxi and Limousine Commission to build the origin-destination network of yellow and green taxi ridership between the NYC taxi zones (edges of the network are weighted by the number of trips). The results of the temporal GNNS (GNNStemp) for each daily ridership network are compared against single runs (for the sake of speed) of the Louvain and Combo algorithms as the fastest and the most precise among the available algorithms. GNNStemp uses an initial warm-up over the aggregated network for the year 2016 and then performs a single run of 20 fine-tune iterations for each daily temporal layer in 2017, starting with the previously achieved partition. The achieved best modularity scores fluctuate slightly between the daily layers. The 2017 yearly averages of the ratios of the daily scores achieved by each algorithm to the best score of all three algorithms for that day are as follows: 97.97% for Louvain, 99.99% for Combo, and 99.78% for GNNStemp. At the same time, the total elapsed times are as follows: 1.71 sec for Louvain, 16.04 sec for Combo, and 8.40 sec for GNNStemp. Furthermore, GNNStemp managed to find the best modularity score, not reached by the two other algorithms, on 11.2% of the temporal layers.
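The warm-up-then-fine-tune scheme can be sketched as follows. This is our own simplified stand-in: the update rule mirrors the sketch of equation (4) with assumed parameters, and is not the GNNStemp implementation:

```python
import numpy as np

def mod_matrix(E):
    """Modularity-score matrix with zeroed diagonal for a weighted adjacency E."""
    T = E.sum()
    Q = E / T - np.outer(E.sum(axis=1), E.sum(axis=0)) / T**2
    np.fill_diagonal(Q, 0.0)
    return Q

def gnns_iterate(Q, C, n_iter, f0=-0.2, f1=0.6):
    """Run n_iter GNNS-style updates (simplified stand-in for eq. 4)."""
    f2 = 1.0 - f0 - f1
    for _ in range(n_iter):
        S = Q @ C
        tau = np.abs(S).max(axis=1, keepdims=True) + 1e-12
        C = np.maximum(f0 + f1 * C + f2 * S / tau, 0.0)
        tot = C.sum(axis=1, keepdims=True)
        C = np.where(tot > 0, C / np.maximum(tot, 1e-12), 1.0 / C.shape[1])
    return C

def temporal_gnns(layers, k, warmup_iter=50, tune_iter=20, seed=0):
    """Warm up on the aggregated network, then fine-tune the running attachment
    matrix on each temporal layer, carrying the partition forward."""
    rng = np.random.default_rng(seed)
    n = layers[0].shape[0]
    C = rng.random((n, k))
    C /= C.sum(axis=1, keepdims=True)
    C = gnns_iterate(mod_matrix(sum(layers)), C, warmup_iter)   # warm-up phase
    partitions = []
    for E_t in layers:                                          # daily layers
        C = gnns_iterate(mod_matrix(E_t), C, tune_iter)
        partitions.append(C.argmax(axis=1))
    return partitions
```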
So the performance of GNNStemp in terms of the achieved modularity score falls right in the middle between Louvain and Combo, while GNNStemp is nearly twice as fast as the single run of Combo.
Details of the day-by-day GNNStemp performance compared to the best of the three algorithms are shown in fig. 1. On some days, one may notice somewhat visible differences between the top algorithm performance and that of the GNNStemp, but GNNStemp captures the temporal pattern of the modularity score dynamics pretty well, with a correlation of 99.44% between the GNNStemp score and the top algorithm score timelines.
By performing time-series periodicity and trend-seasonality analysis, one could notice some interesting temporal patterns in the strength of the network community structure, as also presented in fig. 1. The community structure quantified by means of the best achieved modularity score demonstrates a strong weekly periodicity with a stronger community structure over the weekends, including Fridays. The strength of the community structure also shows noticeable seasonality, with stronger communities over the winter and weaker over the summer. One may relate this observation with people exploring more destinations during their weekend and holiday time.

Fig. 1 Comparative performance of GNNS vs. the best of GNNS, Louvain, and Combo for community detection on the temporal network of daily taxi ridership in NYC. Subplots depict achieved network modularity and its time-series properties: autocorrelation, seasonality, and periodicity obtained by seasonal decomposition using moving averages.
As seen in fig. 1, the GNNStemp accurately reproduces the patterns discovered for the best partition timeline.
In conclusion, while Combo runs could demonstrate higher partition accuracy at certain temporal layers, the GNNStemp is capable of reaching similar performance much faster while adequately reproducing the qualitative temporal patterns. So GNNStemp could be trusted as a fast and efficient solution for extracting insights into the dynamics of the temporal network structure.

Conclusions
We proposed a novel recurrent GNN-inspired framework for unsupervised learning of the network community structure through modularity optimization. A simple iterative algorithm depends on only two variable parameters, and we propose an integrated technique for tuning those so that parameter selection and modularity optimization are performed within the same iterative learning process.
The algorithm's performance has been evaluated on classic network examples, synthetic network models, and a real-world temporal network case. Despite its simplicity, the new algorithm reaches similar and, in some cases, higher modularity scores compared to more sophisticated state-of-the-art discrete optimization algorithms. One possible limitation of the proposed method is its dependence on the number of runs. However, this could be viewed as an advantage since it allows flexible adjustment of the algorithm's complexity, tuning the model parameters to find the right balance between speed and quality. At the low-complexity settings, it can significantly outperform alternative methods in terms of running time while maintaining reasonable partition quality. Thus, the algorithm is efficiently applicable in both scenarios: when the execution time is of the essence, as well as when the quality of the resulting partition is a paramount priority. Furthermore, the algorithm enables a special configuration for the active learning of the community structure on temporal networks, reconstructing all the important longitudinal patterns.
But more importantly, we believe the algorithm serves as a successful proof of concept for applying more advanced GNN-type techniques for unsupervised network learning, and opens possibilities for solving a broader range of network optimization problems. Similar approaches could work for such related problems as clique partitioning and correlation clustering. Applying more sophisticated model architectures along with constantly developing new methods for neural network training can possibly further improve the quality of found solutions. Developing these methods will constitute our future work in this direction.

Appendix. Results for directed networks
The analysis presented in the paper includes 17 classic sample networks. Their sources and characteristics (network size and whether the network is weighted and/or directed) are presented in Table 2. Six of the 17 sample networks are directed in their original form. However, the Python igraph implementations of the Leiden and Louvain methods handle only undirected networks, but Combo and GNNS