A novel algorithm for finding top-k weighted overlapping densest connected subgraphs in dual networks

The use of networks for modelling and analysing relations among data is currently growing. Recently, the use of a single network for capturing all the aspects of some complex scenarios has shown some limitations. Consequently, it has been proposed to use Dual Networks (DNs), pairs of related networks, to analyse complex systems. The two graphs in a DN have the same set of vertices and different edge sets. Common subgraphs among these networks may convey insights about the modelled scenarios. For instance, the detection of the Top-k Densest Connected subgraphs, i.e. a set of k subgraphs having the largest density in the conceptual network which are also connected in the physical network, may reveal sets of highly related nodes. After formalising the problem, we propose a heuristic to find a solution, since the problem is computationally hard. A set of experiments on synthetic and real networks is also presented to support our approach.

"The proposed algorithm" section presents our heuristic; "Experiments" section discusses the case studies; finally, "Conclusion" section concludes the paper.

Related work
Many complex systems cannot be efficiently modelled using a single network without loss of information. Therefore, the use of dual networks is growing (Wu et al. 2016; Sun and Kardia 2010). These applications span a large number of fields, as introduced before: from bioinformatics to social networks. In genetics, dual networks are used to describe and analyse interactions among genetic variants. They can discover the common effects among multiple genetic variants (Sun and Kardia 2010), using a protein-protein interaction network that represents physical interactions and a weighted network that represents the relations between two genetic variants, usually measured by statistical tests.
A relevant problem in network analysis is that of discovering dense communities, as they represent strongly related nodes. The problem of finding communities in a network or a dual network depends on the specific model of dense or cohesive graph considered. Several models of cohesive subgraph have been considered in the literature and applied in different contexts. One of the first definitions of a cohesive subgraph is a fully connected subgraph, i.e. a clique. However, the determination of a clique of maximum size, also referred to as the Maximum Clique Problem, is NP-hard (Hastad 1996), and it is difficult to approximate (Zuckerman 2006). Moreover, in real networks communities may have missing edges; therefore, the clique model is often too strict and may fail to find some important subgraphs. Consequently, many alternative definitions of cohesive subgraphs that are not fully interconnected have been introduced, including s-club, s-plex and densest subgraph (Komusiewicz 2016).

Fig. 1 Workflow of the proposed approach. In the first step the input conceptual and physical networks are merged together using a network alignment approach; then Weighted-Top-k-Overlapping DCS is applied on the alignment graph. Each extracted subgraph induces a connected subgraph in the physical network and one of the top-k overlapping weighted densest subgraphs in the conceptual one.
A densest subgraph is a subgraph with maximum density (where the density is the ratio between the number of edges and the number of nodes of the subgraph), and the Densest-Subgraph problem asks for a subgraph of maximum density in a given graph. The problem can be solved in polynomial time (Goldberg 1984; Kawase and Miyauchi 2018) and approximated within factor 1/2 (Asahiro et al. 2000). Notice that the Densest-Subgraph problem can also be extended to edge-weighted networks. Recently, Wu et al. (2016) proposed an algorithm for finding a densest connected subgraph in a dual network. The approach is based on a two-step strategy. In the first step, the algorithm prunes the dual network without eliminating the optimal solution. In the second step, two greedy approaches are developed to build a search strategy for finding a densest connected subgraph. Briefly, the first step finds the densest subgraph in the conceptual network. The second step refines this subgraph to guarantee that it is connected in the physical network.
In this contribution we use an approach based on local network alignment (LNA), which aims to find (relatively) small regions of similarity among two or more input networks. Such regions may be overlapping or not, and they represent conserved topological regions among networks. For instance, in protein interaction networks these regions are related to conserved motifs or patterns of interactions (Guzzi and Milenković 2017). LNA algorithms are usually based on building an intermediate structure, defined as the alignment graph, and on its subsequent mining (Milano et al. 2020). For instance, the algorithm of Ciriello et al. (2012) and its successor AlignMCL (Mina and Guzzi 2014) are based on the construction of alignment graphs (see the related papers for complete details about the construction of the alignment graph). GLAlign (Global Local Aligner) is a more recent local network alignment methodology (Milano et al. 2018) that mixes topology information from global alignment and biological information according to a linear combination schema, while the more recent L-HetNetAligner (Milano et al. 2020) extends local alignment to heterogeneous networks.
While the literature of network mining has mainly focused on the problem of finding a single subgraph, recently the interest in finding more than one subgraph has emerged (Balalau 2015; Galbrun et al. 2016; Hosseinzadeh 2020; Cho et al. 2013). The proposed approaches usually allow overlapping between the computed dense subgraphs. Indeed, there can be nodes that are shared between interesting dense subgraphs, for example hubs. The proposed approaches differ in the way they deal with overlapping. The problem defined in Balalau (2015) controls the overlap by limiting the Jaccard coefficient between each pair of subgraphs of the solution. The Top-k-Overlapping problem, introduced in Galbrun et al. (2016), includes a distance function in the objective function. In this paper, we follow this last approach and we extend it to weighted networks.

Definitions
This section introduces the main concepts related to our problem.

Definition 1 Dual Network.
A Dual Network (DN) G(V, E_c, E_p) is a pair of networks: a conceptual weighted network G_c(V, E_c) and a physical unweighted one G_p(V, E_p). Now, we introduce the definition of the weighted density of a graph.

Definition 2 Density.
Given a weighted graph G(V, E, weight), let v ∈ V be a node of G, and let vol(v) be the sum of the weights of the edges incident in v. The density of the weighted graph G is defined as

ρ(G) = weight(E) / |V| = (Σ_{v ∈ V} vol(v)) / (2 |V|).

Given a graph (weighted or unweighted) G with a set V of nodes and a subset Z ⊆ V, we denote by G[Z] the subgraph of G induced by Z. Given E′ ⊆ E, we denote by weight(E′) the sum of the weights of the edges in E′. Given a dual network, we denote by G_p[I] and G_c[I], respectively, the subgraphs induced in the physical and in the conceptual network by the set I ⊆ V.
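The notions of vol(v) and weighted density can be sketched in a few lines of Python (a toy illustration, not the paper's implementation), representing a weighted graph as a dictionary mapping node-pair tuples to weights:

```python
def vol(edges, v):
    """Sum of the weights of the edges incident to node v."""
    return sum(w for e, w in edges.items() if v in e)

def density(nodes, edges):
    """Weighted density: total edge weight divided by the number of nodes."""
    return sum(edges.values()) / len(nodes)

# Toy graph: a triangle with unit weights has density 3/3 = 1.0,
# and the sum of vol(v) over all nodes equals twice the total edge weight.
nodes = {1, 2, 3}
edges = {(1, 2): 1.0, (2, 3): 1.0, (1, 3): 1.0}
```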
A densest common subgraph (DCS), formally defined in the following, is a subset of nodes I that induces a connected subgraph in the physical network and a subgraph of maximum density in the conceptual network.

Definition 3 Densest Common Subgraph.
Given a dual network G(V, E_c, E_p), a densest common subgraph in G(V, E_c, E_p) is a subset of nodes I ⊆ V such that G_p[I] is connected and the density of G_c[I] is maximum.
In this paper, we are interested in finding k ≥ 1 densest connected subgraphs. However, to avoid taking the same copy of a subgraph, or subgraphs that are very similar, we consider the following distance function, introduced in Galbrun et al. (2016): given two node sets A and B,

d(A, B) = 2 − |A ∩ B|² / (|A| |B|) if A ≠ B, and d(A, B) = 0 otherwise.

Notice that 2 − |A ∩ B|² / (|A| |B|) decreases as the overlapping between A and B increases. Now, we are able to introduce the problem we are interested in.
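In code, the distance d(A, B) = 2 − |A ∩ B|²/(|A| |B|) (with d(A, B) = 0 when A = B) can be sketched as:

```python
def distance(A, B):
    """Distance of Galbrun et al. (2016) between two node sets."""
    if A == B:
        return 0.0
    inter = len(A & B)
    return 2.0 - inter * inter / (len(A) * len(B))
```

Disjoint sets are at distance 2, identical sets at distance 0, and partially overlapping sets fall in between.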

Problem 1 Weighted-Top-k-Overlapping DCS
Input: a weighted graph G and an integer k ≥ 1.
Output: a set X = {G[X_1], ..., G[X_k]} of k connected subgraphs of G such that the following objective function is maximised:

Σ_{i=1}^{k} ρ(G[X_i]) + λ Σ_{1 ≤ i < j ≤ k} d(X_i, X_j),

where λ > 0 is a parameter balancing density and diversity. Weighted-Top-k-Overlapping DCS, for k ≥ 3, is NP-hard, as it is NP-hard already on unweighted graphs. Notice that for k = 1, Weighted-Top-k-Overlapping DCS is exactly the problem of finding a single weighted densest connected subgraph; hence, it can be solved in polynomial time (Goldberg 1984).
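As an illustration only — assuming the objective has the same shape as in Galbrun et al. (2016), i.e. the sum of the subgraph densities plus λ times the sum of pairwise distances, with λ a balancing parameter whose value here is purely illustrative — its evaluation can be sketched as:

```python
from itertools import combinations

def distance(A, B):
    """Distance of Galbrun et al. (2016): 2 - |A n B|^2 / (|A| |B|), 0 if A == B."""
    if A == B:
        return 0.0
    inter = len(A & B)
    return 2.0 - inter * inter / (len(A) * len(B))

def objective(densities, node_sets, lam=0.1):
    """Sum of subgraph densities plus lam times the sum of pairwise distances.
    lam is a hypothetical balancing parameter, not a value from the paper."""
    diversity = sum(distance(A, B) for A, B in combinations(node_sets, 2))
    return sum(densities) + lam * diversity
```

With two disjoint subgraphs the diversity term contributes λ · 2 per pair; with two identical subgraphs it contributes nothing, which is exactly what penalises duplicated solutions.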

Greedy algorithms for DCS
One of the ingredients of our method is a variant of a greedy algorithm for DCS, denoted by Greedy, which is an approximation algorithm for the problem of computing a connected densest subgraph of a given graph. Given a weighted graph G, Greedy (Asahiro et al. 2000) iteratively removes from G a vertex v having lowest vol(v) and stops when all the vertices of the graph have been removed. It follows that at each iteration i, with 1 ≤ i ≤ |V|, Greedy computes a subgraph G_i of G. The output of this algorithm is a densest of the subgraphs G_1, ..., G_|V|. The algorithm has a time complexity of O(|E| + |V| log |V|) on weighted graphs and achieves an approximation factor of 1/2 (Asahiro et al. 2000). We introduce here a variant of the Greedy algorithm, called V-Greedy. Given an input weighted graph G, V-Greedy, similarly to Greedy, at each iteration i, with 1 ≤ i ≤ |V|, removes a vertex v having lowest vol(v) and computes a subgraph G_i. Then, among the subgraphs G_1, ..., G_|V|, V-Greedy returns a subgraph G_i that maximises the value:

ρ(G_i) + 2 ρ(G_i) / |V_i|.

Essentially, when selecting the subgraph to return among G_1, ..., G_|V|, we add to the density the correction factor 2 ρ(G_i) / |V_i|. This factor is added to avoid returning a subgraph that is not well-connected in terms of edge connectivity, that is, one that contains a small cut. For example, consider a graph with two equal-size cliques K_1 and K_2 having the same (large) weighted density and a single edge of large weight connecting them. Then the union of K_1 and K_2 is denser than both K_1 and K_2; hence, Greedy returns the union of K_1 and K_2. This may prevent us from finding K_1 and K_2 as a solution of Weighted-Top-k-Overlapping DCS. In this example, when the density of K_1 and K_2 is close enough to the density of their union, V-Greedy will return one of K_1, K_2.
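A minimal sketch of the V-Greedy peeling described above (an illustration, not the paper's MATLAB implementation; ties among minimum-volume vertices are broken arbitrarily, and no fast priority queue is used):

```python
def vol(edges, v):
    """Sum of the weights of the edges incident to v."""
    return sum(w for e, w in edges.items() if v in e)

def v_greedy(nodes, edges):
    """Peel off a minimum-volume vertex at each step; among all intermediate
    subgraphs, return the node set maximising rho(G_i) + 2*rho(G_i)/|V_i|."""
    nodes, edges = set(nodes), dict(edges)

    def score(ns, es):
        rho = sum(es.values()) / len(ns)   # weighted density
        return rho + 2.0 * rho / len(ns)   # density plus connectivity correction

    best, best_score = set(nodes), score(nodes, edges)
    while len(nodes) > 1:
        v = min(nodes, key=lambda u: vol(edges, u))   # lowest vol(v)
        nodes.discard(v)
        edges = {e: w for e, w in edges.items() if v not in e}
        s = score(nodes, edges)
        if s > best_score:
            best, best_score = set(nodes), s
    return best
```

On the two-cliques example from the text (two unit-weight triangles joined by a single unit-weight edge), the sketch returns one of the two triangles rather than their union, even though the union has higher plain density (7/6 versus 1).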

The proposed algorithm
In this section we present our heuristic for Weighted-Top-k-Overlapping DCS in dual networks. The approach is based on two main steps:
1. First, the input networks are integrated into a single weighted alignment graph, preserving the connectivity properties of the physical network.
2. Second, the obtained alignment graph is mined by using an ad-hoc heuristic for Weighted-Top-k-Overlapping DCS based on the V-Greedy algorithm.

Building of the alignment graph
In the first step the algorithm receives in input: a weighted graph G_c(V, E_c) (the conceptual graph); an unweighted graph G_p(V, E_p) (the physical graph); an initial set (seed nodes) of node pairs P, where each pair defines a correspondence between a node of G_c and a node of G_p; a distance threshold δ that represents the maximum distance that two nodes may have in the physical network. For example, when δ is set to one, only adjacent nodes in both networks are considered. Given the input data, the algorithm starts by building the nodes of the alignment graph. The alignment graph contains a node for each pair in P. The edges and weights of the alignment graph are defined as follows:
• An edge {u, v} is defined in the alignment graph when the nodes corresponding to u and v are adjacent in G_p and in G_c; the weight of {u, v} is equal to the weight of the edge connecting the nodes corresponding to u and v in G_c.
• An edge {u, v} is defined in the alignment graph when the nodes corresponding to u and v are adjacent in G_p and have distance lower than δ in G_c; the weight of {u, v} is equal to the average of the weights on a shortest path connecting the nodes corresponding to u and v in G_c.
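The two edge rules above can be sketched as follows, under simplifying assumptions: the seed set P is taken to be the identity mapping (so nodes are shared between the two networks), the conceptual shortest path is taken as a fewest-hop path, and helper names such as `conceptual_path` are illustrative, not from the paper:

```python
from collections import deque

def conceptual_path(adj_c, s, t, delta):
    """BFS in the conceptual graph: return the edges of a fewest-hop path from
    s to t of length at most delta, or None if no such path exists."""
    prev = {s: None}
    queue = deque([(s, 0)])
    while queue:
        u, d = queue.popleft()
        if u == t:                       # reconstruct the path edges
            path = []
            while prev[u] is not None:
                path.append((prev[u], u))
                u = prev[u]
            return path
        if d == delta:                   # do not expand beyond the threshold
            continue
        for v in adj_c.get(u, ()):
            if v not in prev:
                prev[v] = u
                queue.append((v, d + 1))
    return None

def alignment_graph(edges_p, adj_c, weights_c, delta):
    """For every physical edge whose endpoints are within distance delta in the
    conceptual graph, add an alignment edge weighted by the average conceptual
    weight along the path (the conceptual edge weight itself when adjacent)."""
    aligned = {}
    for (u, v) in edges_p:
        path = conceptual_path(adj_c, u, v, delta)
        if path is not None:
            ws = [weights_c[frozenset(e)] for e in path]
            aligned[(u, v)] = sum(ws) / len(ws)
    return aligned
```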

A heuristic for Weighted-top-k-overlapping DCS
In the second phase of our algorithm, we solve Weighted-Top-k-Overlapping DCS on the alignment graph G computed in phase 1 via a heuristic. We present here our heuristic for Weighted-Top-k-Overlapping DCS, called Iterative Weighted Dense Subgraphs (IWDS). The heuristic starts with a set X = ∅ and consists of k iterations. At each iteration i, with 1 ≤ i ≤ k , given a set X = {G[X 1 ], . . . , G[X i−1 ]} of subgraphs of G, IWDS computes a subgraph G[X i ] and adds it to X.
The first iteration of IWDS applies the V-Greedy algorithm (see "Greedy algorithms for DCS" section) on G and computes G[X_1]. In iteration i, with 2 ≤ i ≤ k, IWDS applies one of the two following cases, depending on a parameter f, 0 < f ≤ 1, and on the size of the set C_{i−1} = X_1 ∪ ... ∪ X_{i−1} (the set of nodes already covered by the subgraphs in X).
Case 1. If |C_{i−1}| ≤ f|V| (that is, at most f|V| nodes of G are covered by the subgraphs in X), IWDS applies the V-Greedy algorithm on a subgraph G′ of G obtained by retaining α nodes (α is a parameter) of C_{i−1} having highest weighted degree in G and removing the other nodes of C_{i−1}.
Case 2. If |C_{i−1}| > f|V| (more than f|V| nodes of G are covered by the subgraphs in X), IWDS applies the V-Greedy algorithm on a subgraph G″ of G obtained by removing (1 − α) nodes (recall that α is a parameter of IWDS) of C_{i−1} having lowest weighted degree in G.
In both cases, IWDS computes G[X_i] as a weighted connected dense subgraph, distinct from those in X.
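The Case 1 / Case 2 node selection can be sketched as follows, interpreting "α nodes" as a fraction α of the covered set C_{i−1} (an assumption; the helper name is illustrative). The returned set is the one V-Greedy would then be applied to:

```python
def nodes_for_iteration(all_nodes, covered, wdeg, alpha, f):
    """Case 1 / Case 2 of IWDS: decide which already-covered nodes survive.
    covered: nodes in previously extracted subgraphs; wdeg: weighted degree in G."""
    by_degree = sorted(covered, key=wdeg.get, reverse=True)
    if len(covered) <= f * len(all_nodes):
        # Case 1: retain a fraction alpha of the covered nodes with highest
        # weighted degree, remove the rest of the covered nodes.
        keep = set(by_degree[:int(alpha * len(covered))])
    else:
        # Case 2: remove a fraction (1 - alpha) of the covered nodes with
        # lowest weighted degree, keep the rest.
        n_drop = int((1 - alpha) * len(covered))
        keep = set(by_degree[:len(by_degree) - n_drop])
    return (set(all_nodes) - set(covered)) | keep
```

Larger α retains more covered nodes, so later iterations run V-Greedy on larger subgraphs — consistent with the running-time behaviour reported in the experiments.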

Complexity evaluation.
We denote by n (respectively, m) the number of nodes (respectively, edges) of the dual network. The first step requires the analysis of both the physical and the conceptual graph, and the construction of the alignment graph; this requires O(n²) time plus the time needed to compute the edge weights. The calculation of the edge weights requires computing shortest paths among all the node pairs in the physical graph using the implementation of Chan (2012); therefore, it requires O(n m_p) time (m_p is the number of edges of the physical graph).
As for Step 2, IWDS makes k iterations. Each iteration applies V-Greedy on G and requires O(mn log n) time, as for the Greedy algorithm. Iteration i, with 2 ≤ i ≤ k, first computes the set of covered nodes in order to find those nodes that have to be removed (or retained). For this purpose, we sort the nodes in C_{i−1} based on their weighted degree in O(n log n) time. Thus, the overall time complexity of IWDS is O(kmn log n).

Experiments
In this section, we provide an experimental evaluation of IWDS on synthetic and real networks. The design of a strong evaluation scheme for our algorithm is not simple, since we have to face two main issues:
1. Existing methods for computing the top-k overlapping subgraphs (Galbrun et al. 2016) are defined for unweighted graphs and cannot be used on dual networks.
2. Existing network alignment algorithms do not aim to extract top-k densest subgraphs.
Consequently, we cannot easily compare our approach with existing state-of-the-art methods, and we design an ad-hoc procedure for the evaluation of our method based on the following steps. First, we consider the performance of our approach on synthetic networks. In this way, we show that, in many of the cases we considered, IWDS can correctly recover the top-k weighted densest subgraphs. Then we apply our method to four real-world dual networks. The alignment algorithm described in the "Building of the alignment graph" section is implemented in Python 3.7 using the NetworkX package for managing networks (Hagberg et al. 2008). IWDS is implemented in MATLAB R2020a. We perform the experiments on a MacBook Pro (OS version 10.15.3) with a 2.9 GHz Intel Core i5 processor, 8 GB of 2133 MHz LPDDR3 RAM, and Intel Iris Graphics 550 with 1536 MB.

Synthetic networks
In the first part of our experimental evaluation, we analyse the performance of IWDS to find planted ground-truth subgraphs on synthetic datasets.
In Synthetic1, each planted dense subgraph contains 30 nodes and has edge weights randomly generated in the interval [0.8, 1]. In Synthetic3, each planted dense subgraph contains 20 nodes not shared with other planted subgraphs. The subgraphs are arranged in a cycle: 5 nodes of each subgraph are shared with the subgraph on one side and 5 nodes with the subgraph on the other side. Edge weights are randomly generated in the interval [0.8, 1].
These cliques are then connected to a background subgraph of 100 nodes. We consider three different ways to generate the background subgraph: Erdős–Rényi with parameter p = 0.1, Erdős–Rényi with parameter p = 0.2 and Barabási–Albert with parameter equal to 10. Weights of the background graphs are randomly generated in the interval [0, 0.5]. Then 50 edges connecting cliques and the background graph are randomly added (with weights randomly generated in the interval [0, 0.5]).
Based on this approach, we generate four different sets of synthetic networks, called Synthetic1, Synthetic2, Synthetic3 and Synthetic4. Synthetic1 (for the non-overlapping case) and Synthetic3 (for the overlapping case) are generated as described above. Synthetic2 and Synthetic4, respectively, are obtained by applying noise to the synthetic networks in Synthetic1 and Synthetic3, respectively. The noise is added by varying 5%, 10% and 15% of the node relations of each network. A set of pairs of nodes is chosen randomly: if the two nodes belong to the same clique, the weight of the edge connecting them is changed to a random value in the interval [0, 0.5]; otherwise, an edge connecting the two nodes is (possibly) added (if not already in the network) and its weight is randomly assigned a value in the interval [0.8, 1].
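A scaled-down sketch of the Synthetic1-style generation (planted heavy-weight cliques, an Erdős–Rényi background, and random links between them; the sizes, counts and function name here are illustrative, not those of the actual datasets):

```python
import random

def planted_clique_network(num_cliques=3, clique_size=5, background=20, p=0.1, seed=0):
    """Toy Synthetic1-style generator: disjoint planted cliques with weights in
    [0.8, 1], an Erdos-Renyi background with weights in [0, 0.5], and random
    clique-to-background links (all sizes scaled down for illustration)."""
    rng = random.Random(seed)
    edges, cliques, nid = {}, [], 0
    for _ in range(num_cliques):
        members = list(range(nid, nid + clique_size))
        nid += clique_size
        cliques.append(members)
        for i, u in enumerate(members):          # heavy-weight clique edges
            for v in members[i + 1:]:
                edges[(u, v)] = rng.uniform(0.8, 1.0)
    bg = list(range(nid, nid + background))
    for i, u in enumerate(bg):                   # Erdos-Renyi background
        for v in bg[i + 1:]:
            if rng.random() < p:
                edges[(u, v)] = rng.uniform(0.0, 0.5)
    for _ in range(10):                          # random clique-background links
        u = rng.choice([v for c in cliques for v in c])
        v = rng.choice(bg)
        edges[(min(u, v), max(u, v))] = rng.uniform(0.0, 0.5)
    return cliques, bg, edges
```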
Outcome. We present the results of our experimental evaluation, in particular the average running time, density, distance and F1-score, varying the parameter α. We recall that the F1-score is the harmonic mean of precision and recall, and, as in Galbrun et al. (2016), we consider this measure to evaluate the accuracy of our method in detecting the ground-truth subgraphs. Following Yang and Leskovec (2012), we consider the number of shared nodes between each ground-truth subgraph and each detected subgraph, so that we are able to define the best-matching of ground-truth and detected subgraphs. Then, we compute the F1[t/d] measure as the average F1-score of the best-matching ground-truth subgraph to each detected subgraph (truth to detected) and the F1[d/t] measure as the average F1-score of the best-matching detected subgraph to each ground-truth subgraph (detected to truth). Notice that in most of the cases considered, the running time of IWDS increases as α increases. Also, generally, the solutions returned by IWDS for larger values of α are denser than for small values, while the solutions for small values of α have a higher value of distance (hence the subgraphs returned have a smaller overlap).
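The F1 matching scheme can be sketched as follows (F1[t/d] and F1[d/t] are obtained by swapping the arguments of the averaging helper; this is an illustration, not the evaluation code used in the experiments):

```python
def f1(a, b):
    """F1-score (harmonic mean of precision and recall) between two node sets."""
    inter = len(a & b)
    if inter == 0:
        return 0.0
    prec, rec = inter / len(b), inter / len(a)
    return 2.0 * prec * rec / (prec + rec)

def avg_best_f1(first, second):
    """Average, over the sets in `second`, of the F1 of their best match in `first`."""
    return sum(max(f1(t, d) for t in first) for d in second) / len(second)
```

For example, with one ground-truth subgraph and two detected subgraphs (one exact match, one spurious), F1[t/d] = avg_best_f1(truth, detected) averages a perfect and a zero match, while F1[d/t] = avg_best_f1(detected, truth) is 1.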
Tables 1 and 3 report the average results of running time (in minutes), density, distance and F1 scores for the two noiseless datasets. Table 1 shows the experimental results for the noiseless Synthetic1 dataset, where ground-truth subgraphs are disjoint. In this case, IWDS is able to detect the ground-truth subgraphs for all values of α, averaged over 300 examples. For Synthetic4, the added noise has a significant impact on the quality of the computed solutions, even for a noise value equal to 0.05. While increasing noise has a limited effect on IWDS for small values of α (α ≤ 0.25), higher values of α lead to degraded performance, in particular for F1[t/d].

Dual networks
We evaluate IWDS on four real-world dual network datasets.
Datasets. G-graphA. The G-graphA dataset is derived from the GoWalla social network, where users share their locations (expressed as GPS coordinates) by checking in on the website (Cho et al. 2011). Each node represents a user and each edge links two friends in the network. We obtained the physical network by considering the friendship relations of the social network. We calculated the conceptual network by considering the distance among users. Then we ran the first step of our algorithm and obtained the alignment graph G-graphA, containing 2,241,339 interactions and 9878 nodes (we set δ = 4). In this case a DCS represents a set of friends that share check-ins in nearby locations.
DBLP-graphA. The DBLP-graphA dataset is extracted from a computer science bibliography and represents interactions between authors. Nodes represent authors; each edge in the physical network connects two authors that have coauthored at least one paper. Edges in the conceptual network represent the similarity of the research interests of the authors, calculated on the basis of all their publications. After running the first step of the algorithm (using δ = 4), we obtained an alignment graph, the DBLP-graphA dataset, containing 553,699 interactions and 18,954 nodes. In this case a DCS represents a set of co-authors that share some strong common research interests, and the use of DNs is essential, since the physical network shows only co-authors, who may not have many common interests, while the conceptual network represents authors with common interests, who may not be co-authors.
HS-graphA. HS-graphA is a biological dataset taken from the STRING database (Szklarczyk et al. 2016). Each node represents a protein, and each edge is annotated with the reliability of the corresponding interaction. We use two networks for modelling the database: a conceptual network representing the reliability values and a physical network storing the binary interactions. The HS-graphA dataset contains 5,879,727 interactions and 19,354 nodes (we set δ = 4).
Protein-Interaction. We extracted from the STRING database a subnetwork of proteins involved in the SARS-CoV-2 infection (Szklarczyk et al. 2016). The physical network contains interacting proteins, while the conceptual network contains the strength of the association among them. Protein-Interaction contains 192 nodes and 418 edges (Table 5).
Outcome. For these large datasets, we set the value of k to 20, following the approach in Galbrun et al. (2016). Table 6 reports the running time of IWDS, and the density and distance of the solutions returned by IWDS. As for the synthetic datasets, we consider six different values of α. As shown in Table 6, by increasing the value of α from 0.05 to 0.5, IWDS (except for one case, HS-graphA with α = 0.1) returns solutions that are denser, but with lower distance. Table 6 also shows how the running time of IWDS is influenced by the size of the network and by the value of α. We put a bound of 20 h on the running time of IWDS, and the method was not able to return a solution for HS-graphA for α ≥ 0.5 within this time. The running time is influenced in particular by the number of edges of the input network. DBLP-graphA and HS-graphA have almost the same number of nodes, but HS-graphA is much denser than DBLP-graphA. IWDS on HS-graphA is remarkably slower than on DBLP-graphA (1.986 times slower for α = 0.05, 6.218 times slower for α = 0.25). The running time of IWDS is considerably influenced by the value of the parameter α, since it increases as α increases. Indeed, by increasing the value of α, fewer nodes are removed in Case 1 and Case 2 of IWDS; hence, in the iterations of IWDS, V-Greedy is applied to larger subgraphs. This can be seen in particular for HS-graphA, for which IWDS failed to terminate within 20 h when α ≥ 0.5.

Biological evaluation of results
For biological data, it is possible to evaluate the relevance of the results by considering the biological knowledge that they may convey.
Biological data are usually annotated with terms extracted from ontologies, e.g. the Gene Ontology. Consequently, experiments on biological data may be evaluated in terms of the biological knowledge inferred from the analysis and in terms of the statistical relevance of the results themselves. For instance, given a DCS extracted from two biological networks, it is interesting to determine the biological meaning of the DCS and how relevant it is, i.e. how much biological relevance this DCS may convey with respect to a random one. Usually, subgraphs of biological networks may represent groups of interacting proteins sharing some common functions or playing similar biological roles. Consequently, it is possible to evaluate the biological relevance of the obtained results by considering the role of the proteins. Such information is stored and organised in biological ontologies such as the Gene Ontology (GO) (Harris et al. 2004). GO functional enrichment has been proposed to evaluate the significant presence of common roles or functions in a solution represented as a list of genes/proteins. It has been shown that the use of semantic similarities (SS) is a feasible and efficient way to quantify biological similarity among proteins. SS measures are able to quantify the functional similarity of pairs of proteins/genes by comparing the GO terms that annotate them; therefore, proteins that share a biological role have high values of semantic similarity. As a consequence, genes/proteins that are found in the same solution should have a semantic similarity significantly higher than random expectation. These considerations guided the design of the evaluation of our results, which we adapted from the evaluation scheme proposed in Mina and Guzzi (2014).
Given a DCS DCS_k, we calculate its internal semantic similarity SS(DCS_k) as the average semantic similarity over all the node pairs of the DCS:

SS(DCS_k) = Σ_{n_i ∈ DCS_k} Σ_{n_j ∈ DCS_k, j ≠ i} SS(n_i, n_j) / (|DCS_k| (|DCS_k| − 1)).

We compare the DCSs extracted from the biological networks against random ones, obtained by randomly sampling the input networks, to prove their statistical significance. Given a DCS DCS_i, we can test the null hypothesis H0: the average semantic similarity SS(DCS_i) of the proteins internal to the DCS is higher than by chance, where the background distribution can be estimated from the semantic similarity SS(RS_i) of random subgraphs RS_i taken from the alignment graph, using for instance 0.05 as the significance level.
Consequently, we design this test as described in the following algorithm:
• Let DCS_i be a given DCS;
• Let SS(DCS_i) be its internal semantic similarity;
• Let V_s = {RS_j}, j = 0, ..., 99, be a set of 100 random subgraphs of the same size;
• For each RS_j ∈ V_s, calculate SS(RS_j), the internal semantic similarity of the random solution;
• Compare SS(DCS_i) with all the SS(RS_j) using a non-parametric test;
• Accept or reject the hypothesis that SS(DCS_i) is significantly higher than SS(RS_j).
Consequently, for each graph in the solution we generate 100 random graphs of the same size by sampling the obtained alignment graph. For each graph we calculate its internal semantic similarity using the Resnik measure (Resnik 1999). The results, summarised in Table 7, demonstrate that our solution is biologically relevant and that its relevance is higher than by chance.

Table 7 Comparison of the average semantic similarity for the two biological networks considered

Random solutions: 0.3 ± 0.1
DCS: 0.6 ± 0.1
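The significance test above can be sketched as an empirical (permutation-style) p-value: sample random subgraphs of the same size and count how often their internal SS reaches the observed one. The semantic-similarity values are assumed to be precomputed (the actual evaluation uses the Resnik measure); all names here are illustrative:

```python
import random

def internal_ss(nodes, ss):
    """Average pairwise semantic similarity inside a node collection.
    ss: symmetric similarity table keyed by frozenset node pairs."""
    nodes = list(nodes)
    pairs = [frozenset((u, v)) for i, u in enumerate(nodes) for v in nodes[i + 1:]]
    return sum(ss[p] for p in pairs) / len(pairs)

def empirical_pvalue(dcs, universe, ss, trials=100, seed=0):
    """Fraction of same-size random subgraphs whose internal SS reaches the
    observed one (an empirical significance estimate)."""
    rng = random.Random(seed)
    observed = internal_ss(dcs, ss)
    hits = sum(internal_ss(rng.sample(universe, len(dcs)), ss) >= observed
               for _ in range(trials))
    return hits / trials
```

On a toy similarity table where one triple of nodes is functionally coherent (high pairwise SS) and all other pairs are dissimilar, the coherent triple yields a small empirical p-value, mirroring the comparison in Table 7.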

Conclusion
DNs are used to model two kinds of relationships among elements of the same scenario. A DN is a pair of networks that have the same set of nodes: one network has unweighted edges (the physical network), while the second one has weighted edges (the conceptual network). In this contribution, we introduced an approach that first integrates a physical and a conceptual network into an alignment graph. Then, we applied the Weighted-Top-k-Overlapping DCS problem to the alignment graph to find k dense connected subgraphs. These subgraphs represent subsets of nodes that are strongly related in the conceptual network and connected in the physical one. We presented a heuristic, called IWDS, for Weighted-Top-k-Overlapping DCS and an experimental evaluation of IWDS. We first considered, as a proof of concept, the ability of our algorithm to retrieve known densest subgraphs in synthetic networks. Then we tested the approach on four real networks to demonstrate its effectiveness. Future work will consider a possible high-performance implementation of our approach and the application of the IWDS algorithm to other scenarios (e.g. financial or marketing datasets).