
Detecting and generating overlapping nested communities

Abstract

Nestedness has been observed in a variety of networks but has been primarily viewed in the context of bipartite networks. Numerous metrics quantify nestedness and some clustering methods identify fully nested parts of graphs, but all with similar limitations. Clustering approaches also fail to uncover the overlap between fully nested subgraphs, as they assign vertices to a single group only. In this paper, we look at the nestedness of a network through an auxiliary graph, in which a directed edge represents a nested relationship between the two corresponding vertices of the network. We present an algorithm that recovers this so-called community graph, and finds the overlapping fully nested subgraphs of a network. We also introduce an algorithm for generating graphs with such nested structure, given by a community graph. This algorithm can be used to test a nested community detection algorithm of this kind, and potentially to evaluate different metrics of nestedness as well. Finally, we evaluate our nested community detection algorithm on a large variety of networks, including bipartite and non-bipartite ones, too. We derive a new metric from the community graph to quantify the nestedness of both bipartite and non-bipartite networks.

Introduction

Identifying clusters or communities of nodes in graphs is an important problem in graph-based data mining and network science. The standard methods aim for many edges within clusters (or communities) and only a few between distinct clusters (Schaeffer 2007), imitating concepts of data clustering in statistics and machine learning (Xu and Tian 2015). In general, this approach works well and provides meaningful clusters for social networks due to some of their widely observed common properties. For instance, the number of triangles in a social network is much larger than in a random graph with similar edge density, and such networks often show heterogeneous degree distributions and have small diameters (McGlohon et al. 2011). Taking these properties into account helps to find the dense parts of the network. Many algorithms rely on maximizing the modularity function, which measures the quality of a given clustering (Newman and Girvan 2004), but many other approaches exist. For clustering, where each node is assigned to exactly one cluster, both bottom-up and top-down algorithms have been proposed. For community detection, where a node can be a member of several communities, mostly bottom-up algorithms are used, i.e., smaller initial communities are expanded with new nodes during the community detection process (Bóta et al. 2010).

On the other hand, some social networks and especially technological or transaction networks generally contain fewer triangles and often have tree-like structures (Adcock et al. 2013). Therefore, trying to find disjoint dense parts is inadequate for such networks in principle. Moreover, certain bipartite networks, such as pollination networks of plant species and their pollinators or trade networks of countries and their exported/imported goods, show the presence of special structures such as nestedness (Bastolla et al. 2009; Mariani et al. 2019; Uzzi 1996; Wright et al. 1997). That is, the nodes of each side of the bipartite network can be ordered in such a way that the neighborhood of any lower-ranked node contains the neighborhood of any higher-ranked node. Ecological networks often display a nested structure in which specialist species (low-degree nodes of the species interaction network) interact with generalist species (high-degree nodes), while generalists interact with each other and with specialists, too (Bascompte 2010). The ecological concept of nestedness was first published by Darlington (1943), and it was formally defined using graph-theoretical concepts by Patterson and Atmar (1986). In the field of economics, the bipartite networks of industrial firms and locations also show a high level of nestedness (Bustos et al. 2012; Saavedra et al. 2009). At the macroeconomic level, world trade can be described by a bipartite graph, where nodes represent either countries or products, and weighted edges between a country and a product represent the ratio related to the total amount of the product imported or exported. Like other economic networks, the world trade network is also highly nested (Ermann and Shepelyansky 2013), with global and regional dynamics coexisting in terms of the network communities within it (Zhu et al. 2014).

Extending the concept of nestedness to unipartite (non-bipartite) networks can be done in various ways, see e.g., Chapter 2 of Mariani et al. (2019) and London et al. (2022). Since perfect nestedness is rarely observed in real-world networks, several metrics have been proposed to quantify the level of nestedness (Payrató-Borràs et al. 2020; Ulrich et al. 2009). In this paper we do not aim to review all the relevant literature in detail; instead, we refer the reader to the surveys (Csermely et al. 2013; Mariani et al. 2019; Ulrich et al. 2009).

The problem of identifying perfectly nested parts (i.e., nested subgraphs) of a network has received much less attention in the literature, mostly in the context of image processing only. Junttila and Kaski call a binary matrix (that is, a matrix whose entries are either zero or one) fully nested if its rows and columns can be reordered such that the ones are in an echelon form (Junttila and Kaski 2011). They define a binary matrix A fully k-nested if its columns can be partitioned into k pairwise disjoint submatrices, called blocks, each of which is fully nested. Given a matrix A, a natural optimization problem is to find the smallest k such that A is fully k-nested and also to provide a partitioning into k fully-nested parts. The problem can be solved in polynomial time. Note that any \(m\times n\) binary matrix can be considered as the incidence matrix of a bipartite network with m and n nodes on its respective sides.

Extending the above definition to non-bipartite graphs can be done in several ways, but much less is known about the problem’s complexity. For instance, for a graph G and a fixed bipartite graph H, London, Pluhár and Martin defined the concept of induced H-avoiding coloring (London et al. 2022), meaning that the union of any two color classes spans an induced H-free graph, and defined \(\chi _H(G)\) as the minimum number of colors in an induced H-avoiding coloring of G. In the case of \(H = 2K_2\) (a graph of four vertices with two non-adjacent edges) this coloring realizes a partitioning of G into bipartite, fully nested clusters. Although determining \(\chi _H(G)\) is NP-hard, their approach and those we present in this paper are both applicable to general graphs.

Here we present a novel method that identifies overlapping nested subgraphs and represents them as paths of a directed graph we call a community graph. We also introduce a method that generates bipartite graphs with (any) ground-truth overlapping nested structure, making it possible to generate example nested graphs and test nested community detection algorithms. Since our algorithm detects nestedness in non-bipartite graphs, too, in order to be able to quantify nestedness in any graph, we derive a new metric from the output of our algorithm called vertex presence. To measure the “generalist-ness” of a vertex, we derive another metric called vertex position.

The rest of the paper is organized as follows. First, we introduce the core definitions we are going to use throughout the paper. In the section “Nestedness and community detection” we introduce an algorithm for detecting overlapping fully nested subgraphs of an arbitrary input graph. We represent the resulting nested community structure with a community graph that encodes additional information about the hierarchy and relationship of the nested subgraphs in the network. Then, in the “Generation of overlapping communities” section we introduce an algorithm that can generate a class of bipartite graphs that exhibits the nested community structure of the input community graph. This algorithm may also be suitable for testing nestedness metrics. In the “Experiments” section we introduce two new metrics to measure nestedness on both node and graph level, and test our nested community detection algorithm on typical nested and non-nested artificial and real-world graphs. The artificial graphs are either generated using the algorithm introduced in the “Generation of overlapping communities” section or taken from bipartite nestedness benchmarks and non-bipartite community detection benchmarks. Finally, we summarize our findings in “Conclusions”.

Throughout this paper, \(G = (V, E)\) will be a finite and unweighted graph with \(|V|=n\) and \(|E|=m\) (with no self-edges, i.e., \((i, i) \notin E~ \forall i \in V\)), N(i) denotes the neighborhood of node i and |N(i)| its size, i.e., the degree of i. In the case of directed graphs, we will denote the incoming neighbors of a node i with \(\textrm{in}(i) = \{ j: (j, i) \in E \}\) and its outgoing neighbors with \(\textrm{out}(i) = \{ j: (i, j) \in E \}\).

Nestedness

A graph G is fully nested if, for any pair of vertices \(i, j \in V(G)\) such that j has a higher or equal degree than i, \(N(i) \subseteq N(j)\) holds (Mariani et al. 2019). In other words, the vertices of G can be ordered such that the respective neighborhoods (as sets) form a chain. In case of bipartite graphs, the two compared vertices must be in the same (color) class. If perfect or full nestedness holds for one class, then it holds for the other. Figure 1a shows a fully nested graph, while Fig. 1b depicts one that is not fully nested. Observe that the presence of an induced \(2K_2\), that is two independent edges (formed by the edges (2, 5), (3, 6) in Fig. 1b, colored in red), is responsible for breaking full nestedness. We want to emphasize the use of not fully nested instead of not nested as graphs may not be fully nested themselves, but may have fully nested subgraphs. In other words, the definition of nestedness may not hold for the entire graph, but it might be true for one or more subsets of vertices.

Fig. 1

Examples of (a) fully nested and (b) partially nested bipartite graphs. The \(2K_2\) breaking nestedness is highlighted in red
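The definition of full nestedness given above can be checked directly for one class of a bipartite graph. The following is a minimal sketch, not the authors' implementation: the function name `is_fully_nested` and the two small example graphs are hypothetical, and neighborhoods are assumed to be stored as Python sets keyed by vertex.

```python
def is_fully_nested(neighborhoods):
    """Return True if the given neighborhoods form a chain under inclusion."""
    # Sort by size: in a chain, a smaller set is contained in every larger one,
    # and two sets of equal size must be identical.
    ordered = sorted(neighborhoods.values(), key=len)
    # Inclusion is transitive, so comparing consecutive sets is sufficient.
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Hypothetical example: vertices 1-3 form one class, 4-6 the other.
nested = {1: {4, 5, 6}, 2: {4, 5}, 3: {4}}
# Edges (2, 5) and (3, 6) form an induced 2K_2 here, which breaks the chain.
broken = {1: {4, 5, 6}, 2: {4, 5}, 3: {6}}

print(is_fully_nested(nested))
print(is_fully_nested(broken))
```

Only one class needs to be checked, since full nestedness of one class implies it for the other.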

We distinguish between the definitions of nested vertices and nested graphs. As a building block to define nestedness of graphs, we first define nestedness of a pair of vertices. The amount (or strength) of nestedness between two vertices can be defined as

$$\begin{aligned} \textrm{nest}(i,j) = \frac{\left| N(i) \cap N(j)\right| }{\min \left\{ \left| N(i)\right| , \left| N(j)\right| \right\} }. \end{aligned}$$
(1)

If \(\textrm{nest}(i,j) = 1\), then \(N(i) \subseteq N(j)\) or \(N(j) \subseteq N(i)\), i.e., the nestedness criterion holds for the given vertices i and j. In the special case of either of the vertices being isolated (where \(\min \left\{ \left| N(i)\right| , \left| N(j)\right| \right\} = 0\)), we consider the vertices non-nested and define \(\textrm{nest}(i,j) = 0\).
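As an illustration, Eq. 1 together with the isolated-vertex convention can be written as a short function. This is a sketch under the assumption that the graph is stored as a dictionary `adj` mapping each vertex to the set of its neighbors; the names `nest` and `adj` and the example data are chosen here for illustration only.

```python
def nest(adj, i, j):
    """Pairwise nestedness of Eq. 1; 0.0 if either vertex is isolated."""
    smaller = min(len(adj[i]), len(adj[j]))
    if smaller == 0:
        # Convention from the text: isolated vertices are never nested.
        return 0.0
    return len(adj[i] & adj[j]) / smaller


# Hypothetical neighborhoods of one class of a bipartite graph.
adj = {1: {4, 5, 6}, 2: {4, 5}, 3: {6, 7}, 8: set()}

print(nest(adj, 1, 2))  # 1.0: N(2) is a subset of N(1)
print(nest(adj, 1, 3))  # 0.5: only one of the two neighbors of 3 is shared
print(nest(adj, 1, 8))  # 0.0: vertex 8 is isolated
```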

While Eq. 1 is not ideal for measuring the nestedness of the whole graph (it would require \(\approx n^2\) calculations to get an average nestedness value, for example), we can use it to find groups of vertices that form perfectly nested subgraphs—which is our ultimate goal.

Existing methods

Clustering to nested parts

One way to find fully nested subgraphs is to use the incidence matrix to identify submatrices of echelon form (Junttila and Kaski 2011)—this, however, works for bipartite graphs only. Another possibility is to use Eq. 1 and assign vertices to a group where the pairwise nestedness values are equal to 1. This can be done by, for example, performing a \(2K_2\)-free coloring on the graph (London et al. 2022).

All of these methods exhibit the same problem, though. The graph in Fig. 1b contains two nested subgraphs: one with the vertices \(\{ 1, 2 \}\) and another with \(\{ 1, 3 \}\). Notice that vertex 1 is present in both fully nested subgraphs, but we can only assign that vertex to a single group in the clustering task. This would create two ambiguous cluster structures: both \(\{ \{ 1, 2 \}, \{ 3 \} \}\) and \(\{ \{ 1, 3 \}, \{ 2 \} \}\) are valid clusterings of the same graph. Although this is not necessarily a problem as one will often search for a single clustering, the resulting structure does not encode such overlaps, potentially losing valuable information.

Edge-based nested community detection

We differentiate clustering from community detection based on the number of groups a vertex can belong to. We refer to clustering when a vertex can be assigned to a single group only, as in the previous case, and community detection when vertices can belong to multiple, and thus potentially overlapping, groups. Note that in our case, we are looking for a special (or constrained) overlapping community structure, where each community is a fully nested subgraph.

One method that avoids the problem of nodes being constrained to a single group is a greedy edge-based community detection algorithm (Gera et al. 2022). This method first assigns community indices to the edges of the graph. It calculates \(\textrm{nest}(i, j)\) for vertices i and j and if they are nested, the edges of both vertices are assigned to a common community. Otherwise, the edges of i get a different community index than the edges of j. In the end, the communities that a vertex i belongs to will be the union of the communities of its incident edges.

Nestedness in non-bipartite graphs

Since we are focusing on general—not just bipartite—graphs, we need to take into account the connection between a vertex pair when measuring their nestedness. This was not an issue in bipartite graphs, since, by definition, there are no edges between vertices of the same class.

If we use Eq. 1 to measure nestedness between vertices i and j, and there exists an edge \((i, j) \in E(G)\), the two vertices will never be considered fully nested, because \(i \in N(j)\) and \(j \in N(i)\), but \(i \notin N(i)\) and \(j \notin N(j)\). Thus, \(\textrm{nest}(i, j) < 1\) when \((i, j) \in E(G)\). This would also mean that the graph in Fig. 2 or even \(K_n\) (\(n \ge 2\)), the complete graph of n vertices, are not fully nested graphs.

Fig. 2

Example of a non-bipartite, fully nested graph. Nodes 1 and 2 are considered nested, despite them having an edge (dashed line) between them

The approach we are going to follow when comparing two vertices is to ignore the edge between them if there exists one. Thus, we may use the following equation instead of Eq. 1:

$$\begin{aligned} \textrm{nest}(i, j) = \frac{\left| (N(i) \setminus \{j\}) \cap (N(j) \setminus \{i\})\right| }{\min \left\{ \left| N(i) \setminus \{j\}\right| , \left| N(j) \setminus \{i\}\right| \right\} }. \end{aligned}$$
(2)

Note that considering the existence of edges between nodes when searching for fully nested parts depends on the application. If we are looking for fully nested bipartite subgraphs in graphs that are not necessarily bipartite themselves (e.g., as in Junttila and Kaski 2011; London et al. 2022), the fact that two nodes are connected or not should not be ignored.
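To make the difference between Eq. 1 and Eq. 2 concrete, the sketch below evaluates both on the complete graph \(K_4\). The function names and the dictionary-of-sets representation are assumptions made for this example. Returning 0 when a reduced neighborhood is empty carries over the isolated-vertex convention of Eq. 1, which is also an assumption here.

```python
def nest_eq1(adj, i, j):
    """Eq. 1: the edge between i and j, if any, is counted."""
    smaller = min(len(adj[i]), len(adj[j]))
    return len(adj[i] & adj[j]) / smaller if smaller else 0.0


def nest_eq2(adj, i, j):
    """Eq. 2: the edge between i and j, if any, is ignored."""
    ni = adj[i] - {j}
    nj = adj[j] - {i}
    smaller = min(len(ni), len(nj))
    # Assumption: an empty reduced neighborhood is treated like an isolated vertex.
    return len(ni & nj) / smaller if smaller else 0.0


# Complete graph K_4 on the vertices 0..3.
k4 = {v: {0, 1, 2, 3} - {v} for v in range(4)}

print(nest_eq1(k4, 0, 1))  # below 1: vertex 1 is in N(0) but not in N(1)
print(nest_eq2(k4, 0, 1))  # 1.0: both reduced neighborhoods equal {2, 3}
```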

Nestedness and community detection

In this section, we will present an algorithm that retrieves the nested community structure of the input (bipartite or general) graph. First, we will talk about how the order of nodes inside nested community structures allows us to store more information compared to traditional communities. We use this additional information to construct a so-called community graph. Then, we introduce an algorithm that reconstructs not only the nested communities, but the entire community graph from an arbitrary input graph.

While in this work we frequently mention community detection, we refer to it as a framework for detecting overlapping groups (communities) of vertices that are, in some sense, similar to vertices within the group, while, in the same sense, different to vertices in other groups. Traditionally, this similarity meant vertices being densely connected within a group, while vertices across different groups were less connected. Here, we are looking for overlapping groups (communities) of vertices that form fully nested subgraphs instead of being densely connected. We will call these nested communities.

Nested hierarchy from directed graphs

Using Eq. 2 we are able to decide whether two vertices are nested, but the direction of nestedness, i.e., which vertex’s neighborhood is a subset of the other, is not considered. This is important because we will use this information to determine the hierarchy of vertices. Knowing the direction of pairwise nestedness, we can use it to create a graph representation that encodes the nested relationships of the entire graph.

To do this, we construct a directed graph, where a directed edge \(i \rightarrow j\) means \(N(i) \subseteq N(j)\). We will refer to this graph as the community graph. Before we proceed, we first verify some basic scenarios from Fig. 3.

  1.

    If we have edges \(i \rightarrow j\) and \(j \rightarrow k\) (as in Fig. 3a), we get \(N(i) \subseteq N(j)\) and \(N(j) \subseteq N(k)\). Nestedness is transitive, so this also means \(N(i) \subseteq N(k)\). For simplicity, we omit these edges from our community graphs, or equivalently, we work with the transitive reduction of the community graph (see the “Community detection algorithm” section for more details).

  2.

    A node can have multiple out-neighbors (Fig. 3b). If we have edges \(i \rightarrow j\) and \(i \rightarrow k\), then we get \(N(i) \subseteq N(j)\) and \(N(i) \subseteq N(k)\). This can be solved by letting both j and k have all the neighbors of i, while making sure that j and k each have at least one neighbor that the other does not.

  3.

    A node can also have multiple in-neighbors (Fig. 3c). Here we have edges \(i \rightarrow k\) and \(j \rightarrow k\) and get \(N(i) \subseteq N(k)\) and \(N(j) \subseteq N(k)\). This case can be solved by taking the union of the neighbors of i and j to create the neighborhood of k (\(N(k) \supseteq N(i) \cup N(j)\)).

  4.

    Finally, nodes may have the exact same neighbors as other nodes (Fig. 3d). This results in edges \(i \leftrightarrow j\), and thus in both \(N(i) \subseteq N(j)\) and \(N(j) \subseteq N(i)\) (\(N(i) \equiv N(j)\)). To reduce complexity, we will draw a path with bidirectional edges instead of a clique.

Fig. 3

Cases of nested relations in a community graph. From left to right: a fully nested graph (a), a node nested with different nodes (b), multiple nodes nested with the same node (c), and nodes having equal neighborhoods (d)

It is important to note that a maximal (non-extendable) path of the community graph represents a nested community or, in other words, a nested subgraph. For example, if the community graph of G is a single \(P_n\) (a path of n vertices, as in Fig. 3a), the original graph G is fully nested, whereas if G is not nested at all (i.e., G does not have a single pair of vertices \((i, j)\) where \(N(i) \subseteq N(j)\)), its community graph will be a graph with no edges.

Community detection algorithm

Now we introduce an algorithm to find overlapping nested communities, that is, fully nested subgraphs of G. The core parts of the detection algorithm reconstruct the community graph from the input graph and then find the community graph’s maximal (non-extendable) paths to enumerate the communities. The main steps are the following.

Reconstructing the community graph

To reconstruct the community graph, we first enumerate all nested vertex pairs. Here, instead of naively performing \(\approx n^2\) comparisons, we can use the same trick used in Gera et al. (2022). That is, we do not compare vertices that have no common neighbors, since they are certainly not nested. Instead, we calculate \(\textrm{nest}(i, k)\) by first going through \(j \in N(i)\), and picking \(k \in N(j)\), where \(k < i\). This way, i and k have at least one common neighbor (j), so they are potentially nested. In practice, this can save a lot of computational time (especially in sparse graphs), and in our experience the number of discarded comparisons can be large. Since we do not exclude potentially nested vertex pairs, we do not lose any information in this step.
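The pair enumeration described above can be sketched as a generator. This is an illustrative sketch rather than the authors' code: the name `candidate_pairs` and the dictionary-of-sets graph representation are assumptions, and vertex labels are assumed to be mutually comparable so that the condition \(k < i\) is meaningful.

```python
def candidate_pairs(adj):
    """Yield each pair (k, i) with k < i that shares at least one neighbor."""
    for i in adj:
        seen = set()  # avoid reporting the same k twice for this i
        for j in adj[i]:
            for k in adj[j]:
                if k < i and k not in seen:
                    seen.add(k)
                    yield (k, i)


# Path graph 0 - 1 - 2 - 3: only vertices at distance two share a neighbor.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

print(sorted(candidate_pairs(path)))  # [(0, 2), (1, 3)]
```

On this path graph only two of the six possible vertex pairs are compared, which illustrates why the saving can be substantial in sparse graphs.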

When comparing vertices, we also need to know the direction of nestedness between the vertices, for example, by calculating \(\textrm{sgn}(\left| N(i)\right| - \left| N(j)\right| )\) when \(\textrm{nest}(i, j) = 1\). Once we have done all the comparisons, we build the directed edge list of the nested pairs, where \(i \rightarrow j\) is an edge if \(\textrm{nest}(i, j) = 1\) and \(\textrm{sgn}(\left| N(i)\right| - \left| N(j)\right| ) \le 0\) (or equally, \(N(i) \subseteq N(j)\)).
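A possible way to build the directed edge list is sketched below. Two simplifications are assumptions of this example: it tests set inclusion directly instead of combining \(\textrm{nest}(i, j) = 1\) with the sign of the degree difference (the two formulations should agree), and it loops over all vertex pairs for brevity rather than only over pairs with a common neighbor. The function name and example graph are hypothetical.

```python
from itertools import combinations


def nested_edges(adj):
    """Directed edges (i, j) meaning N(i) is contained in N(j), following Eq. 2."""
    edges = set()
    for i, j in combinations(adj, 2):
        ni = adj[i] - {j}  # ignore the edge between i and j, if present
        nj = adj[j] - {i}
        if not ni or not nj:
            continue  # an empty reduced neighborhood is treated as non-nested
        if ni <= nj:
            edges.add((i, j))
        if nj <= ni:
            edges.add((j, i))  # equal neighborhoods yield edges in both directions
    return edges


# Hypothetical fully nested bipartite graph with classes {1, 2, 3} and {4, 5, 6}.
adj = {1: {4, 5, 6}, 2: {4, 5}, 3: {4},
       4: {1, 2, 3}, 5: {1, 2}, 6: {1}}

print(sorted(nested_edges(adj)))
```

The result still contains the transitive edges (3, 1) and (6, 4).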

However, the list will contain a lot more edges than we need. Let’s revisit the fully nested graph from Fig. 1a for an example. Here \(N(3) \subseteq N(2)\) and \(N(2) \subseteq N(1)\), but as such, \(N(3) \subseteq N(1)\) will also hold. Since we need to find maximal paths in the resulting graph, the transitive \(N(3) \subseteq N(1)\) relationship and its corresponding edge are redundant and certainly not part of the maximal path \(3 \rightarrow 2 \rightarrow 1\). To remove them, we perform a transitive reduction on the graph built from the nested edge list. As a result, for all triples \(i \rightarrow j \rightarrow k\) the edge \(i \rightarrow k\) will be deleted. This significantly reduces the number of edges to consider in the next step, which greatly improves the performance of the algorithm. This completes the community graph discovery.
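The transitive reduction step can be illustrated with a small pure-Python sketch. It assumes that the input edge list describes a directed acyclic graph; the function name is chosen for this example, and the repeated reachability searches favor readability over speed.

```python
def transitive_reduction(edges):
    """Drop every edge (u, w) of a DAG that is implied by a longer path."""
    succ = {}
    for u, w in edges:
        succ.setdefault(u, set()).add(w)
        succ.setdefault(w, set())

    def reachable(start):
        # Vertices reachable from `start` using at least one edge (iterative DFS).
        seen, stack = set(), list(succ[start])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        return seen

    reduced = set()
    for u in succ:
        for w in succ[u]:
            # Keep (u, w) only if no other successor of u also leads to w.
            if not any(w in reachable(x) for x in succ[u] if x != w):
                reduced.add((u, w))
    return reduced


# Nested pairs of a three-vertex chain, including the transitive edge (3, 1).
print(sorted(transitive_reduction({(3, 2), (2, 1), (3, 1)})))  # [(2, 1), (3, 2)]
```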

Finding nested communities

Now that we have a community graph, we need to retrieve the list of nested communities. To do this, we enumerate the maximal (non-extendable) directed paths in the graph. Listing these paths can be done using any traversal method, such as a depth-first search. Due to the transitive reduction performed in the previous step, this can be accomplished quite quickly.
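A depth-first enumeration of maximal paths might look as follows. This sketch assumes an acyclic community graph given as an edge list; vertices without any incident edge do not appear in such a list and would have to be added as single-vertex communities separately. The function name and the example edges are hypothetical.

```python
def maximal_paths(edges):
    """List all directed paths that start at a source and end at a sink."""
    succ, has_pred = {}, set()
    for u, w in edges:
        succ.setdefault(u, []).append(w)
        succ.setdefault(w, [])
        has_pred.add(w)

    paths = []

    def extend(path):
        followers = succ[path[-1]]
        if not followers:
            paths.append(path)  # sink reached: the path cannot be extended
        for w in followers:
            extend(path + [w])

    # A maximal path must start at a vertex without predecessors.
    for source in (v for v in succ if v not in has_pred):
        extend([source])
    return paths


# Hypothetical community graph: a and b are both nested in c, and c in d.
print(sorted(maximal_paths([("a", "c"), ("b", "c"), ("c", "d")])))
```

In this example the two resulting communities overlap in the vertices c and d. The number of maximal paths can grow quickly with the number of branching vertices, so the enumeration cost depends on the structure of the community graph.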

Here, each directed path represents a fully nested community, with the order of the vertices also encoding hierarchy. For example, if \(K_n\) (a clique of n vertices) is the input graph, the community graph will be \(P_n\) (a path of n vertices), which will have a single community with all n vertices in it.

Vertex compacting based on neighborhood

There is one edge case that complicates the search for maximal paths, where cycles are created due to bidirectional edges between vertices. To solve this problem, and also improve search performance, we first find vertices with equal neighborhoods and merge them into a single vertex before building the community graph. Isolated vertices are not compacted, and edges between vertices are ignored when checking for neighborhood equality. This means that an isolated \(K_2\) (two nodes with an edge between them), for example, is not compacted. This change has multiple positive effects. First, the community graph is now guaranteed to be a directed acyclic graph (DAG) as there are no other factors that can introduce cycles, making maximal path finding much easier. It also makes the resulting community graph smaller by having it be built from fewer vertices, improving performance. For example, a star graph with any number of nodes will have a community graph of just two nodes and a single edge.

We then find the maximal paths as normal, treating the merged vertices as a single vertex. Finally, we recover the original vertices by expanding the merged vertices and inserting edges in both directions between them.
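The compaction step can be sketched with a small union-find structure. The function name, the all-pairs loop, and the dictionary-of-sets representation are assumptions made for brevity; a practical implementation would more likely group vertices by hashing their neighborhoods.

```python
from itertools import combinations


def compact_groups(adj):
    """Group vertices whose neighborhoods coincide once a shared edge is ignored."""
    parent = {v: v for v in adj}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, j in combinations(adj, 2):
        ni = adj[i] - {j}
        nj = adj[j] - {i}
        # Empty reduced neighborhoods (isolated vertices, an isolated K_2) are skipped.
        if ni and ni == nj:
            parent[find(i)] = find(j)

    groups = {}
    for v in adj:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Star graph with center 0: the three leaves share the neighborhood {0}.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(sorted(sorted(g) for g in compact_groups(star)))  # [[0], [1, 2, 3]]
```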

Figure 4 shows all the steps of the algorithm on an example graph. When enumerating the communities, we traverse the compacted graph (Fig. 4c) and then insert the removed vertices into the paths.

Fig. 4

Steps of the community detection algorithm. Starting from the input graph (a), we first compact vertices with equal neighborhoods (b), then build the community graph (c), and finally reverse the vertex compaction, adding the bidirectional edges

Remarks

Here we highlight some key aspects of the algorithm’s behavior. The algorithm’s pseudocode is shown in Fig. 5 and its source code is included in Additional file 1.

  1.

    Non-bipartite graphs A major advantage of the algorithm is that it does not exploit any property specific to bipartite graphs. In theory, this could make it directly applicable to any (unweighted) non-bipartite graph. In practice, we need to solve the problem of connected vertices described in the “Nestedness in non-bipartite graphs” section. As with Eq. 2, we ignore the connection between two vertices when comparing them. This, combined with vertex compacting, results in the algorithm correctly finding nestedness in non-bipartite graphs too (as later demonstrated in the “Results on typical examples” section).

  2.

    On constrained community detection We also need to make some comments about the community structure detected by the algorithm. The algorithm is designed to detect certain types of overlapping communities (specifically communities that satisfy the constraint of being fully nested), essentially performing a “constrained” community detection. The algorithm is also not a heuristic to detect nestedness. This is due to the fact that we start by enumerating all possible \(\approx n^2\) comparisons and then exclude only those pairs that are guaranteed not to be nested. As a result, all remaining, potentially nested, vertex pairs are compared, and no stochastic elements are included in the process.

  3.

    Permissive nestedness As a possible future direction, we would also like to mention the potential of relaxing the requirements of nestedness for communities. So far, we have talked about how the algorithm detects communities that satisfy a certain “constraint”. This constraint can be quite strict, as two nodes that share largely the same neighborhood (with a few deviations) are considered to be non-nested. There are many metrics that quantify the degree of nestedness of a graph. They allow us to see not only whether a graph is fully nested or not, but also how nested it is. To increase the flexibility of our algorithm, we can similarly allow pairs of vertices that are not fully nested to belong to the same community, for example when their nestedness exceeds a certain threshold. The algorithm currently does not support this, but it is easy to implement.

Fig. 5

Pseudocode of the nested community detection algorithm. Despite having three nested for loops, the algorithm only iterates over pairs of vertices \((i, k)\) that have at least one common neighbor

Generation of overlapping communities

Previously, we have shown an algorithm that can reconstruct the community graph from an arbitrary input graph. Now, we present an algorithm that is capable of generating bipartite graphs with multiple overlapping fully nested groups of vertices, based on an input community graph. The method also returns the ground truth nested structure, making it suitable for use when benchmarking algorithms that find overlapping nested communities.

The generated structure is more general than in-block nestedness (Solé-Ribalta et al. 2018), where the graph is partitioned into disjoint, fully nested “blocks”. Our method can generate this structure but is not limited to it, which makes it a more versatile approach.

We believe that the proposed algorithm is useful not only for testing our nested community detection algorithm, but also for creating benchmark data sets for future methods that detect overlapping nested structures.

Benchmark generator algorithm

Now that we have a method for describing nestedness using a directed graph, we will present an algorithm that generates a bipartite graph that satisfies the nested structure described by the community graph. That is, if there is an edge \(i \rightarrow j\) in the community graph, the resulting graph will have \(N(i) \subseteq N(j)\). We will show that the algorithm is capable of generating not only fully nested graphs, but also graphs with overlapping nested communities.

To generate a bipartite graph from a community graph, denoted by \(G_c\), we first perform a topological sorting on \(G_c\), since we need to generate neighbors for each vertex v such that its predecessors (denoted by \(\textrm{in}(v)\)) already have their neighbors. To do this, we must assume that the community graph is acyclic, as there must be at least one vertex with no predecessors, which will be the starting vertex. When visiting a vertex, we add all the neighbors of its predecessors to the neighbors of the current vertex and generate a new neighbor for it (if we visit the ith vertex, we can label the new vertex \(n + i\)). The first part guarantees nestedness, while the new neighbor makes sure that the two vertices do not have the same neighborhood (in which case there should be both a \(i \rightarrow j\) and a \(j \rightarrow i\) edge in \(G_c\)). Formally, we have the subroutine visible in Fig. 6.

Fig. 6

Pseudocode of the nested graph generator algorithm
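The generator described above can be sketched in a few lines using the standard-library `graphlib` module (Python 3.9 or later). This is an approximation written for illustration, not the authors' code. It assumes that the community graph is given as a dictionary `preds` mapping every vertex to the set of its in-neighbors, and that vertices are labeled 1 to n so that the new labels \(n + i\) do not clash with existing ones.

```python
from graphlib import TopologicalSorter


def generate_nested_bipartite(preds):
    """Return the generated neighborhood of every community-graph vertex."""
    n = len(preds)
    # static_order() lists predecessors first and raises CycleError on cycles.
    order = list(TopologicalSorter(preds).static_order())
    neighbors = {}
    for i, v in enumerate(order, start=1):
        nbrs = {n + i}  # fresh neighbor, so v differs from its predecessors
        for p in preds[v]:
            nbrs |= neighbors[p]  # inherit everything: N(p) is contained in N(v)
        neighbors[v] = nbrs
    return neighbors


# Community graph 3 -> 2 -> 1, i.e. a single directed path.
chain = generate_nested_bipartite({1: {2}, 2: {3}, 3: set()})
print(chain)
```

For the path above, the three generated neighborhoods have sizes 1, 2 and 3 and form a chain, so the resulting bipartite graph is fully nested.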

Remarks

  1.

    Generating random nested community structures The algorithm we have described so far is used to generate a graph with a given nested structure from an input community graph. To generate random graphs with overlapping nested structures, we can use random input graphs. However, the input graph must be a DAG. This can be achieved, for example, by sampling a random spanning tree of the complete graph \(K_n\) and randomly orienting its edges.

  2.

    Cycles in the community graph Because the algorithm performs topological sorting and requires the input graph to have a vertex with no predecessors (i.e., one that is not part of a cycle), it is much simpler to work with an acyclic community graph. This prevents us from generating vertices with equal neighborhoods. However, the vertex compacting approach described in the “Community detection algorithm” section could be adapted to the generator algorithm to make this possible.

  3.

    Generated graph sizes Since the algorithm takes a community graph of size n and generates a neighbor for each vertex, the resulting graph will have exactly 2n nodes. This also makes the algorithm incapable of generating bipartite graphs with classes of different sizes—a trivial example is the star graph.

  4.

    Multiple blocks We have not touched on whether the input \(G_c\) DAG has to be weakly or strongly connected yet. The algorithm can handle community graphs with multiple components and will render a component of the community graph as a component in the generated bipartite graph. For example, a graph with multiple disjoint \(P_k\) components (that is, a directed path of k nodes) results in an in-block nested bipartite graph (Solé-Ribalta et al. 2018).

  5.

    Nested structure of the other class As mentioned in the description of the algorithm and in remark 2, the input to the algorithm is a DAG that describes the community structure of one class of the graph. The other class of the generated bipartite graph is not part of the input community graph (and thus the ground-truth). However, generating the bipartite graph from the community graph of the other class, we get a final bipartite graph that is isomorphic to the one generated using the original input.
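The first remark above suggests obtaining a random acyclic input by randomly orienting the edges of a random tree. The sketch below follows that idea with one loudly stated simplification: the tree is grown by attaching each new vertex to a randomly chosen earlier one, which does not sample spanning trees of \(K_n\) uniformly. The function name and the `seed` parameter are hypothetical.

```python
import random


def random_tree_dag(n, seed=None):
    """Return n - 1 randomly oriented edges of a random tree on vertices 1..n."""
    rng = random.Random(seed)
    edges = []
    for v in range(2, n + 1):
        u = rng.randint(1, v - 1)  # attach v to an earlier vertex
        # Orient the edge at random; a tree has no cycles in its underlying
        # graph, so every orientation of it is acyclic.
        edges.append((u, v) if rng.random() < 0.5 else (v, u))
    return edges


edges = random_tree_dag(8, seed=1)
print(edges)
```

The resulting edge list is weakly connected and acyclic, so it can serve as a community graph for the generator.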

Experiments

In this section, we will examine the performance of the nested community detection algorithm from several perspectives. First, we verify that the algorithm is able to detect the nested structures in a few basic examples. Next, we check whether it can find all ground-truth communities of the benchmark graphs generated by the algorithm described in the “Generation of overlapping communities” section, fully reconstructing the input community graph. We then examine the community structure of graphs commonly used when benchmarking nestedness metrics. Then, in a separate section, we compare the results of our method with other community detection algorithms to identify key differences in the discovered community structures. Finally, we measure the execution time of our algorithm as a function of node and edge counts in various graphs.

To evaluate our results and quantify nestedness, we use the NODF (Almeida-Neto and Ulrich 2011), discrepancy (Brualdi and Sanderson 1999) and temperature (using the Binmatnest algorithm) (Ángel Rodríguez-Gironés and Santamaría 2010) nestedness metrics on bipartite graphs. To make it easier to compare these metrics with our community structure, we derive our own nestedness metric from the community graph: the average fraction of communities a vertex is part of, called vertex presence. For normalization purposes, this number is multiplied by 2 in bipartite graphs, since perfectly nested bipartite graphs have 2 perfectly nested communities, one for each class. Vertex presence ranges from 0 to 1, where 1 means every vertex is part of every community, i.e., there is only a single community with all vertices in it. When vertex presence is low, it means that vertices are part of few communities while the total number of communities is high. When vertex presence is calculated specifically on the nested community structure, a maximal presence means there is one nested community (path), so the graph is fully nested, and a minimal presence means that every vertex is in its own nested community, having no nestedness in the network at all. Intuitively, a larger value means a vertex is part of a larger portion of the nested communities, which increases the overall nestedness of the network. We also note that since our community detection algorithm works on non-bipartite graphs, unlike the previously mentioned metrics, vertex presence is not restricted to bipartite graphs. Formally, we can obtain vertex presence for a vertex v by calculating

$$\begin{aligned} \textrm{pres}(v) = \frac{\left| \left\{ C: C \in {\mathcal {C}}, v \in C \right\} \right| }{\left| {\mathcal {C}}\right| }, \end{aligned}$$
(3)

where \({\mathcal {C}}\) is the set of nested communities (maximal paths in the community graph). Since the range of \(\textrm{pres}(v)\) depends on n (its lowest value is \(\frac{1}{n}\)), we normalize it into the [0, 1] range, so that 0 always means not nested at all and 1 still means fully nested. This can be achieved using

$$\begin{aligned} \overline{\textrm{pres}}(v) = \frac{n}{n - 1} \left( \textrm{pres}(v) - \frac{1}{n} \right) . \end{aligned}$$
(4)

For compactness, we will use average vertex presence as a metric of the entire graph by averaging the normalized vertex presence across all vertices.
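As an illustration, Eqs. (3) and (4) can be sketched in Python. The sketch assumes that the community graph is available as a transitively reduced DAG and that the nested communities are its source-to-sink paths; the function names and the toy graph are hypothetical and are not taken from the reference implementation. The normalization maps the lowest value \(\frac{1}{n}\) to 0 and the highest value 1 to 1, and the factor of 2 used for bipartite graphs is left out.

```python
def maximal_paths(vertices, edges):
    """Enumerate source-to-sink paths of a DAG given as a list of (u, v) edges."""
    successors = {v: [] for v in vertices}
    has_predecessor = set()
    for u, v in edges:
        successors[u].append(v)
        has_predecessor.add(v)

    paths = []

    def extend(path):
        following = successors[path[-1]]
        if not following:          # reached a sink: the path is maximal
            paths.append(path)
            return
        for w in following:
            extend(path + [w])

    for source in vertices:
        if source not in has_predecessor:
            extend([source])
    return paths


def presence(paths, v):
    """Eq. (3): fraction of communities that contain v."""
    return sum(v in path for path in paths) / len(paths)


def normalized_presence(paths, v, n):
    """Eq. (4): rescale so that 1/n maps to 0 and 1 maps to 1 (assumes n > 1)."""
    return n / (n - 1) * (presence(paths, v) - 1 / n)


# Hypothetical community graph: 1 -> 2, then 2 -> 3 and 2 -> 4.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (2, 4)]
paths = maximal_paths(vertices, edges)  # two communities: 1-2-3 and 1-2-4
average_presence = sum(
    normalized_presence(paths, v, len(vertices)) for v in vertices
) / len(vertices)
```

In this toy graph, vertices 1 and 2 are in both communities and vertices 3 and 4 in one each, so the normalized values are 1, 1, 1/3 and 1/3.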

As opposed to regular communities, an interesting property of nested communities is that the position of a vertex inside a community is informative, too. If a vertex is at the beginning of a community (path), then its neighborhood is a subset of the neighborhoods of the other vertices of the community. If, on the other hand, a vertex is at the end of a community, its neighborhood is a superset of the neighborhoods of the other vertices in the same community. We call the quantification of this property vertex position, and it can be calculated using

$$\begin{aligned} \textrm{pos}(v, C) = \frac{i - 1}{\max \left\{ 1, \left| C\right| - 1 \right\} }, \end{aligned}$$
(5)

where \(\exists i: C_i = v\) (v is the ith vertex of C), thus \(\textrm{pos}(v, C)\) is only valid if \(v \in C\). Assuming indexing starts at 1, the normalization in the denominator makes the position of a vertex at the beginning of a community 0, even if it is the sole vertex in that community. With this, we can calculate the position of a vertex in all nested communities it is present in, then take the average of these values to get the mean vertex position of said vertex.
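A minimal Python sketch of Eq. (5) and of the mean vertex position follows. It assumes that each nested community is stored as an ordered list of vertices (the path order); the function names and the example communities are hypothetical.

```python
def position(v, community):
    """Eq. (5): relative position of v in a community given as an ordered list."""
    i = community.index(v) + 1            # 1-based index, as in the text
    return (i - 1) / max(1, len(community) - 1)


def mean_position(v, communities):
    """Average position of v over the communities that contain it.

    v must appear in at least one community.
    """
    values = [position(v, c) for c in communities if v in c]
    return sum(values) / len(values)


# Hypothetical nested communities (paths of a community graph).
communities = [[1, 2, 3], [1, 2, 4, 5], [6]]
```

Here vertex 1 has position 0 in both of its communities, vertex 3 has position 1, and the singleton vertex 6 gets position 0 because of the `max` in the denominator.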

These two metrics allow us to measure nestedness both on a graph level (by calculating average vertex presence) and on a vertex level (through vertex position).

Results on typical examples

Before testing the algorithm on generated benchmark examples, we first demonstrate that it finds basic nested structures. Figure 7a, e show that the algorithm correctly identifies the two fully nested parts of the complete bipartite graph: nodes in one class belong to the same community. Figure 7b, f show the same concept in a special case: the star graph is also considered fully nested, and the algorithm correctly identifies the upper class as one fully nested community and the single node of the bottom class as another. Figure 7c, g show that the nodes of the fully non-nested bipartite graph are all correctly put into different communities.

Fig. 7
Graphs showing typical nested configurations (a–d) and their community graphs (e–h)

Finally, to demonstrate that the algorithm works with non-bipartite graphs, through Fig. 7d, h we show that all five nodes of the complete graph are correctly classified as a single community.
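For the bipartite examples above, the nested relation within one class can be sketched as a neighborhood-inclusion test. This is only an illustration under stated assumptions: two vertices of the same class are taken to be nested when one neighborhood is contained in the other, ties between equal neighborhoods are oriented by vertex id, and a perfect matching stands in for a fully non-nested graph. The function name and the toy inputs are hypothetical, and the non-bipartite case is not covered.

```python
from itertools import combinations


def nested_edges(neighbors):
    """Directed edges (i, j) with N(i) a subset of N(j), for vertices of one class.

    `neighbors` maps each vertex of the class to the set of its neighbors in the
    other class. Equal neighborhoods are oriented by vertex id (an assumption).
    """
    edges = []
    for i, j in combinations(sorted(neighbors), 2):
        if neighbors[i] <= neighbors[j]:
            edges.append((i, j))
        elif neighbors[j] <= neighbors[i]:
            edges.append((j, i))
    return edges


# Upper class of a complete bipartite graph: identical neighborhoods.
complete = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"a", "b"}}
# Leaves of a star graph: every leaf sees only the center.
star = {1: {"c"}, 2: {"c"}, 3: {"c"}}
# Perfect matching: pairwise disjoint neighborhoods, no nestedness.
matching = {1: {"a"}, 2: {"b"}, 3: {"c"}}
```

For the complete bipartite and star inputs, every pair of vertices is nested, so after dropping the transitive edge (1, 3) a single path 1, 2, 3 remains; for the matching, no edge is produced and every vertex forms its own community.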

Results on benchmarks

In this section, we will compare the ground-truth community structure of the generated benchmark graphs with the one found by our algorithm. In order to create a benchmark graph, we use one or more random spanning trees (creating a spanning forest) sampled from complete graphs and orient their edges randomly. These oriented spanning trees will be the community graphs of the benchmark graphs.
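The construction of one such community graph can be sketched as follows. The text does not state how the spanning trees are sampled, so the sketch assumes Prüfer sequence decoding, which draws a labeled tree on n vertices (a spanning tree of the complete graph) uniformly at random; the function names and the seed are hypothetical. For a benchmark with several blocks, the function would be called once per block.

```python
import heapq
import random


def random_labeled_tree(n, rng):
    """Decode a random Pruefer sequence into the edge list of a tree on 0..n-1."""
    if n < 2:
        return []
    prufer = [rng.randrange(n) for _ in range(n - 2)]
    degree = [1] * n
    for v in prufer:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in prufer:
        leaf = heapq.heappop(leaves)      # smallest remaining leaf
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # Exactly two leaves remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


def random_community_tree(n, rng):
    """Orient each tree edge at random; any orientation of a tree is acyclic."""
    return [(u, v) if rng.random() < 0.5 else (v, u)
            for u, v in random_labeled_tree(n, rng)]


rng = random.Random(42)                   # fixed seed, only for reproducibility
community_graph = random_community_tree(8, rng)
```

Because the underlying undirected graph is a tree, no orientation can create a directed cycle, so the result is always a valid DAG input for the generator.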

The benchmark is set up as follows. We generated 2000 random graphs with ground-truth communities with 1 to 4 blocks (components) and 1 to 60 nodes per block. Let \(n_b\) denote the sum of the number of nodes in the input to the generator across all blocks. Note that we know the ground truth only for the first \(n_b\) nodes, as these are the nodes of the input community graph; the remaining \(n_b\) nodes are generated. This was discussed in more detail in the generator algorithm's “Remarks” section. The generated benchmark graphs are available as Additional file 2.

After generating the benchmark graphs, we run the algorithm on them and compare the first \(n_b\) nodes of the result with the known ground truth. Again, we cannot compare the rest due to the limitations of the generator algorithm. However, we do not need to, since nestedness is a symmetric property for bipartite graphs. Furthermore, correctly recovering the ground-truth community structure for the first \(n_b\) nodes means that the algorithm is capable of reconstructing the entire community graph on the input.

Our tests show that all community graphs in the benchmark set were correctly recovered and that the resulting community structures were identical to the ground truths. Figure 8 shows a generated benchmark graph and its detected nested community structure. Even from this figure, we can see that there are many communities, even in smaller graphs, with many of them overlapping over a large part of the vertices.

Fig. 8
A generated bipartite graph (a) and its detected community graph (b). The ground truth is known for the green (bottom) class

Results on real-world networks

Bipartite networks

After validating that the algorithm is capable of reconstructing the community graph, we take a look at real-world networks used to test nestedness metrics. First, we examine the nested structure of ecological networks from Web of Life (Bascompte Lab 2014). We examine the algorithm’s output on pollinator (mutualistic) and host-parasite networks. These are two sets of small bipartite networks of species interactions. For an overview of the results, see Table 1.

Fig. 9
The original M_PL_069_03 graph (a) and its nested community graph (b)

Table 1 Computed properties on a subset of the full dataset

The first network we examine is M_PL_069_03 (Kohler 2011) (Fig. 9), a tiny pollination network of seven plant and four hummingbird species created from observations in eastern South America. Vertices 1–7 represent plants, while vertices 8–11 represent hummingbirds. For legibility reasons, we show the vertex IDs instead of species names on the plot and clarify where needed. The NODF value of the network is 75.926 (in the range [0, 100], where 100 means fully nested), and its discrepancy value is 2 (where 0 means fully nested), suggesting that it is indeed a highly nested network. The average vertex presence of the community graph is 0.606.

We can see that the graph is clearly not fully nested, as there are multiple communities (paths) on its community graph in both classes. However, it does have large nested communities that cover most of the vertices in each class with high overlap. In both classes, we have three communities that cover all vertices of that class. In the upper class (plant species, purple), the communities cover a larger part of the class with 5 (71%) and 4 (57%) vertices. Three plant species (vertices 3, 5, and 1) also play a key role in these communities, as they are part of all three communities. They represent generalist entities in the network, connected to most vertices of the other class, i.e., most hummingbirds visit them. Vertex 7 is also part of two communities, while vertices 2, 4, and 6 are each part of a single community. We can see that the latter three are all visited by only one species of hummingbird.

Looking at the class of hummingbirds, the community structure is simpler because the class has only four entities. Interestingly, we can see that vertex 8 (Amazilia versicolor) is part of all three communities but visits only a single plant (vertex 1, Aechmea cylindrata), a plant that all other hummingbird species visit, too. This connection alone makes Amazilia versicolor nested with all other species. This is an important aspect of nested networks, where specialist species tend to pick the generalists in the other class.

Fig. 10
The original M_PL_069_01 graph (a) and its nested community graph (b)

M_PL_069_01 (Kohler 2011) (Fig. 10) is a slightly bigger pollination network of 18 plants (vertices 1–18) and 6 hummingbirds (vertices 19–24), with a lower connectance. Interestingly, the community structure shows less symmetry in terms of the classes, with eight overlapping plant communities and the six hummingbird species all in their own community. On closer inspection, we can see that some hummingbird species visit largely the same plants as others, but there are always plants that one visits and the other does not, and vice versa. For example, vertex 21 (Clytolaema rubricauda) is connected to most of the neighbors of vertex 24 (Thalurania glaucopis), but vertex 16 (Vriesea erythrodactylon) is only connected to 21, creating a \(2K_2\).

This graph also highlights the asymmetric nature of nested communities: while we can observe some degree of nestedness (with some paths covering half the vertices) in the class of plants, the class of birds is fully non-nested.

Fig. 11
Community graph of the slightly larger M_PL_058 graph. Larger vertex sizes correspond to higher in-degrees (including transitive edges)

Moving on to larger networks, such as M_PL_058 (Bartomeus et al. 2008) (community graph visible in Fig. 11), untangling the community structure becomes increasingly difficult, with many communities (277 over two classes) overlapping each other. However, we can make some important observations on the community graph. For example, the community graph has few zero-degree vertices, which means that most vertices contribute to the overall nestedness of the graph, but they are nested only with some other vertices. We can also see that in both large components, there are only a few vertices with high total degrees. In the largest component of the community graph (colored in green, containing bees), there is a bee (of the Andrena genus) with a high in-degree, connecting lots of nested communities by interacting with 22 plants out of 32. In the second-largest component, which consists of plants, there is a vertex with a high out-degree (Vicia lutea), which is part of many communities at once because its single connection is to the aforementioned Andrena bee.

Such large networks also show that partially nested networks can have a huge number of nested communities. We expect a fully nested bipartite network to have two communities (one for each class), a fully nested non-bipartite network to have a single community, and a fully non-nested network to have n communities, each vertex belonging to its own community. Partly nested networks, on the other hand, may have more than n communities: M_PL_057 has \(n = 997\) vertices and \(m = 1920\) edges but at the same time \(2000> m > n\) communities, with an NODF value of 7.23 and a vertex presence of 0.045. Thus, the number of communities is not a linear function of nestedness. This also means that, unfortunately, the number of communities does not perfectly reflect the network’s nestedness.

Host-parasite networks

The host-parasite networks scored an average vertex presence of 0.28 and an average NODF score of 52.033, versus 0.188 and 30.857 in the pollination set, suggesting that the host-parasite networks are on average more nested.

Fig. 12
The A_HP_015 host-parasite network (a) and its nested community graph (b)

This set contains a fully nested network: A_HP_015 (Fig. 12), a network of three rodents (vertices 1–3) and seven parasites (vertices 4–10). We can see that a specialist entity (Microtus oeconomus, vertex 3) interacts only with a generalist species (Amphipsylla marikovskii, vertex 5) and vice versa. The network has a vertex presence score of 1 and a discrepancy of 0, but its NODF value is 75 (where 100 would mean fully nested). Similarly, its temperature score does not show perfect nestedness either, at a value of 1.05 instead of 0.

Nestedness metrics

Examining the relationship between vertex presence and some nestedness metrics, we can see that while vertex presence and NODF behave similarly with few exceptions (Fig. 13a, c), Binmatnest gave small scores (meaning high nestedness) to some graphs with low vertex presence (Fig. 13b). These graphs had low NODF and high discrepancy scores, both suggesting low nestedness. Similarly, NODF gave a score of 75 to a fully nested network in the host-parasite network set, where the discrepancy was 0 and vertex presence was 1 (Fig. 13c, d).
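One property of NODF that can produce such a gap is that it counts a pair of rows (or columns) only when their marginal totals strictly decrease, so pairs with equal degree contribute 0 even if their neighborhoods are nested. The text does not state that this is the cause of the score of 75 on A_HP_015; the Python sketch below only illustrates the effect on two small hypothetical matrices, and the function name is an assumption.

```python
def nodf(matrix):
    """NODF of a binary matrix: mean paired nestedness over row and column pairs."""

    def paired_sum(vectors):
        total, pairs = 0.0, 0
        for a in range(len(vectors)):
            for b in range(a + 1, len(vectors)):
                pairs += 1
                high, low = vectors[a], vectors[b]
                if sum(high) < sum(low):
                    high, low = low, high
                # Decreasing fill: only pairs with strictly different totals count.
                if sum(high) > sum(low) > 0:
                    overlap = sum(1 for x, y in zip(high, low) if x and y)
                    total += 100.0 * overlap / sum(low)
        return total, pairs

    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*matrix)]
    row_total, row_pairs = paired_sum(rows)
    col_total, col_pairs = paired_sum(cols)
    return (row_total + col_total) / (row_pairs + col_pairs)


# Fully nested, all degrees distinct.
strict = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0]]
# Still fully nested (neighborhoods form a chain), but with tied degrees.
tied = [[1, 1, 1],
        [1, 1, 1],
        [1, 0, 0]]
```

The first matrix scores 100, while the second scores 400/6 (about 66.7) because the tied pair of rows and the tied pair of columns each contribute 0.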

Fig. 13
Comparison of vertex presence with the NODF and Binmatnest nestedness metrics on the pollination (a, b) and host-parasite (c, d) network set

Non-bipartite graphs

Now we are going to examine the algorithm’s output in some common non-bipartite, mostly social networks. Nestedness in a social network can be interesting in the sense of information exchange and domination. If some i is connected to all acquaintances of j and more, and they get into a conflict, i can spread their position to everyone j knows, potentially dominating j.

A common example used when testing community detection algorithms is Zachary’s karate club network (Zachary 1977), visible in Fig. 14a. This is a social network of 34 members of a karate club who interacted outside the club. The club split into two, creating two communities, marked with two colors.

Fig. 14
Zachary’s karate club network and its community graph. Vertex colors represent the ground-truth communities

Although this is not an ecological network, we can see that there are only three weakly connected components in the community graph (Fig. 14b), two of them being isolated vertices and the third formed by the rest of the graph. Most nested relationships are between vertices of the same color (karate community), except for three nodes. Node 1 (the instructor) is in a nested relationship with 9 other members out of 16 in its karate community, while node 34 (the administrator) is with 7 out of 16. The graph contains 33 nested communities with an average size of 3.67 vertices per community and the mean vertex presence is 0.107. These observations lead us to believe that nestedness does not play a key role in the formation of the network.

The Florentine families network (Breiger and Pattison 1986) (Fig. 15a) contains marriage links between families during the Italian Renaissance. While this network is typically used to demonstrate centrality, the presence of nestedness may be interesting in the sense that families can be in a dominating position if they have connections to all neighbors of another family.

Fig. 15
The Florentine families network and its community graph

Looking at the community graph in Fig. 15b, we can see that there are few cases of this, most of them due to having a single neighbor that is common with one of the central families. For example, the Acciaiuoli family is nested with five others because they have a connection only with the Medici family. This means that if someone could influence the Medici family, they might also be able to influence the Acciaiuoli. Another interesting observation is that the Castellani family is not nested with anyone, so while they have their connections, they are not dominated by any other family. With a low mean vertex presence of 0.128 and a maximum community size of 2, nestedness does not appear to play a key role in this network, either.

Comparison with general community detection algorithms

Now, we compare the nested community structure found by our algorithm with the output of three traditional community detection algorithms: Linkcomm (Ahn et al. 2010), MOSES (McDaid and Hurley 2010) and CFinder (Adamcsek et al. 2006). These algorithms use different approaches for community detection, Linkcomm being a link partitioning method, MOSES a fuzzy algorithm, and CFinder a clique search algorithm. Note that while Linkcomm, MOSES and CFinder aim to find communities where vertices within communities are more densely connected, our algorithm defines communities as fully nested subgraphs. While the community definitions are different, we perform these comparisons to highlight the differences between nested and traditional community structures.

Since our algorithm assigns each vertex to a community (vertices that do not belong anywhere else are put into their own communities), we modify the outputs of the community detection algorithms mentioned above so that omitted vertices are also assigned a community. This helps us to level the comparisons and also to avoid drawing false conclusions based on the number of communities, since a single community is only possible if all vertices are included in it.
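This adjustment can be sketched in a few lines of Python. The function name and the example output are hypothetical; the sketch assumes that each algorithm's result is available as a list of vertex sets.

```python
def add_singletons(communities, vertices):
    """Return the communities plus one singleton community per uncovered vertex."""
    covered = set().union(*communities) if communities else set()
    padded = [set(c) for c in communities]
    padded.extend({v} for v in vertices if v not in covered)
    return padded


# Hypothetical output of an algorithm that leaves vertices 5 and 6 unassigned.
detected = [{1, 2, 3}, {3, 4}]
padded = add_singletons(detected, range(1, 7))
```

After padding, the two omitted vertices appear as communities of size one, so community counts and average sizes are computed over the same vertex set for every algorithm.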

Looking at the number of communities and their average sizes (Fig. 16a, b), we can see that MOSES tends to identify fewer but occasionally larger communities, while CFinder created many communities (sometimes as many as 4n). Its average community size was not the smallest in these cases either, meaning that there had to be greater overlap between them. This is confirmed by the vertex presence in Fig. 16c. Vertex presence was low across all graphs and algorithms, which means that overall there is little overlap between the communities. The different community structures detected on Zachary’s karate club network are shown in Fig. 17.

Fig. 16
Comparison of community statistics across algorithms and graphs

Fig. 17
Community structures detected by different algorithms on Zachary’s karate club network. Communities with a single vertex are not plotted

Finally, we also calculate the Generalized Conventional Normalized Mutual Information (GenConvNMI) index (Lutov et al. 2019) to measure the mutual information between the community structure found by our algorithm and those found by the other algorithms. The NMI indices shown in Table 2 are high, especially in the case of CFinder. The full results are included in Additional file 3.

Table 2 Generalized Conventional Normalized Mutual Information (Lutov et al. 2019) between the communities of our algorithm and the compared traditional community detection algorithms (higher = more similar)

Implementation and performance benchmarks

Finally, we show that our algorithm is scalable enough to be used on large networks. It was implemented in R (R Core Team 2023) and C++, and its reference implementation, together with the example generator code, is open source (see Note 2). Here, we measure the runtime of our algorithm across all analyzed non-synthetic graphs.

Fig. 18
Average execution times of the nested community detection algorithm’s implementation over 100 runs

Figure 18 shows average runtimes of 100 executions on each of the analyzed bipartite and non-bipartite networks. The figure shows that, while it is not perfectly linear, the algorithm’s runtime does not scale steeply with the number of vertices or edges. We can see that it is capable of achieving runtimes of under a second on graphs with well over 1000 vertices or more than 10,000 edges. For the details of the test environment, please refer to “Appendix A”.

Conclusions

We introduced a novel constrained community detection algorithm that finds overlapping nested subgraphs of a given input graph and constructs a directed graph, called the community graph, representing this community structure in a compact form. In reverse, we also introduced an algorithm that generates bipartite graphs with any given nested community structure from the input community graph. Derived from the resulting community graph, we also introduced two metrics to measure nestedness in networks on both graph- and vertex levels. We have shown that our community detection method can uncover detailed nested relationships within the input graph. We demonstrated its capabilities through several benchmark networks and real-world networks as well. Finally, we compared our method with multiple commonly used community detection algorithms that are able to find overlapping communities. We showed that our method can reveal a different type of community structure in both bipartite and non-bipartite graphs.

Availability of data and materials

The pollination and host-parasite datasets analyzed during the current study are available in the Web of Life repository at www.web-of-life.es. The analyzed non-bipartite networks are available in their respective published articles: Zachary’s karate club (Zachary 1977), Florentine families network (Breiger and Pattison 1986). Other graphs that we did not analyze in detail but on which we performed calculations are: coappearance network of characters in the novel Les Misérables (les_miserables) (Knuth 1993), adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens (adjnoun) (Newman 2006), network representing the neural network of C. elegans (celegansneural) (Watts and Strogatz 1998), social network of frequent associations between dolphins (dolphins) (Lusseau et al. 2003), coauthorships in network science (netscience) (Newman 2006), network representing the topology of the Western States Power Grid (power) (Watts and Strogatz 1998), Davis Southern women social network (dswomen) (Davis et al. 1941). The generated random bipartite networks (along with their base community graphs and ground truths) analyzed in the “Results on benchmarks” section are included in this published article and its supplementary information files.

Notes

  1. This condition enables us to skip the calculation of \(\textrm{nest}(k, i)\), which would have the same result as the previously calculated \(\textrm{nest}(i, k)\).

  2. https://github.com/Hanziness/r-nested-comms/releases/tag/v0.2.

References

Acknowledgements

The authors would like to express their gratitude to András Pluhár for his useful insights and to the reviewers for their detailed suggestions and comments. All of them have contributed to a significant improvement of the quality of the article.

Funding

Open access funding provided by University of Szeged. This work was supported by the National Research, Development and Innovation Office-NKFIH Fund No. SNN-135643. This work was supported by the University of Szeged Open Access Fund No. 6267.

Author information

Contributions

IG designed and implemented both algorithms and carried out the experiments using them. AL helped in formalizing and structuring multiple aspects of the paper. The authors jointly wrote and reviewed the manuscript.

Corresponding author

Correspondence to Imre Gera.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Source code of the community detection package nested.comms: This is an R package that can be built and installed using the devtools package. To use it, one can call the install_local function from devtools on the source zip file. The installed package (nested.comms) exposes functions to perform nested community detection (described in the “Nestedness and community detection” section) using the nested_node_comm function, and test graph generation (“Generation of overlapping communities” section) through the generate_benchmark_random function.

Additional file 2.

Generated networks: The 2000 generated networks used in the “Results on benchmarks” section are included in a compressed archive. The final networks are named gen_id.csv, their base community graphs are named base_id.csv, and the ground-truth community list is included as truth_id.txt, where id is the identifier of the generated graph and each line corresponds to one community. Both the final and the base graphs are given in the form of edge lists. The final graph is always a bipartite graph.

Additional file 3.

Computational results: A CSV file containing all the computed properties of all the non-synthetic graphs we have analyzed, including graphs and properties omitted for compactness from Table 1.

Appendix A: Test environment

The performance tests were carried out on an ASUS ExpertCenter computer with an Intel® Core\(^{\textrm{TM}}\) i7-10700 CPU @ 2.90GHz (16 cores) and 16 GB of RAM. The system was running Manjaro Linux (kernel version 6.3.5-2-MANJARO (64-bit)) with R 4.3.1 (latest packages as of 2023-07-01), compiled with the Intel Math Kernel Library (MKL). Execution time was measured using the microbenchmark package.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gera, I., London, A. Detecting and generating overlapping nested communities. Appl Netw Sci 8, 51 (2023). https://doi.org/10.1007/s41109-023-00575-2

  • DOI: https://doi.org/10.1007/s41109-023-00575-2

Keywords