ANGEL: efficient, and effective, node-centric community discovery in static and dynamic networks

Rossetti, Giulio

doi:10.1007/s41109-020-00270-6

Research
Open access
Published: 10 June 2020

Giulio Rossetti ORCID: orcid.org/0000-0003-3373-1240¹

Applied Network Science volume 5, Article number: 26 (2020)

Abstract

Community discovery is one of the most challenging tasks in social network analysis. During the last decades, several algorithms have been proposed with the aim of identifying communities in complex networks, each one searching for mesoscale topologies having different and peculiar characteristics. Among such vast literature, an interesting family of Community Discovery algorithms, designed for the analysis of social network data, is represented by overlapping, node-centric approaches. In this work, following this line of research, we propose Angel, an algorithm that aims to lower the computational complexity of previous solutions while ensuring the identification of high-quality overlapping partitions. We compare Angel, on both synthetic and real-world datasets, against state-of-the-art community discovery algorithms designed for the same community definition. Our experiments underline the effectiveness and efficiency of the proposed methodology, confirmed by its ability to consistently outperform the identified competitors.

Introduction

Community discovery (henceforth CD), the task of decomposing a complex network topology into meaningful node clusters, is one of the oldest and most discussed problems in complex network analysis (Coscia et al. 2011; Fortunato 2010). One of the main reasons behind the attention it has received during the last decades lies in its intrinsic complexity, strongly tied to its overall ill-posedness. Indeed, complex network researchers agree that it is not possible to provide a single and unique formalization that covers all the possible characteristics a community partition may satisfy. Usually, every CD approach is designed to provide a different point of view on how to partition a graph: in this scenario, the solutions proposed by different authors were often proven to perform well when specific assumptions can be made on the analyzed topology. Nonetheless, decomposing a complex structure into a set of meaningful components represents per se a step required by several analytical tasks. Such peculiarity has led to the definition of several “meta” community definitions, often tied to specific analytical needs. For instance, classic works intuitively describe communities as sets of nodes closer to each other than to the rest of the network, while others, looking at the same problem from another angle, only define such topologies as dense network subgraphs. A general, high-level formulation of the Community Discovery problem is the following:

Definition 1

(Community Discovery (CD)) Given a network G, a community C is defined as a set of nodes in G: $C= \{ v_{1},v_{2}, \dots,v_{n}\} $. The community discovery problem aims to identify the set $\mathcal {C}$ of all the communities in G.

The absence of a unique, well-posed definition of what a community in a complex network should represent is only one of the issues to face when approaching network clustering. Indeed, the evolution through time of a network topology plays a major role in the way communities can be defined and extracted. Even though the CD problem has been classically studied considering the underlying network topology as “frozen in time”, recently a novel branch of research addressed the problem of studying the dependent evolution of networks and their communities. Complex networks are often used to model dynamic objects – e.g., social phenomena, economic transactions, human mobility – composed of nodes and edges that may appear and vanish as time goes by. When considering this temporally enriched scenario, we need to revise the formulation of the classical Community Discovery problem. We will then talk of Dynamic Community Discovery (henceforth referred to as DCD), a problem that can be defined by abstracting the specific CD definition as done in Rossetti and Cazabet (2018):

Definition 2

(Dynamic Community Discovery (DCD)) Given a dynamic network DG, a Dynamic Community DC is defined as a set of (node, periods) pairs: $DC= \{ (v_{1},P_{1}),(v_{2},P_{2}), \dots,(v_{n},P_{n}) \} $, with $P_{n} =((t_{s0},t_{e0}), (t_{s1},t_{e1})\dots (t_{sN}, t_{eN}))$ and $t_{s*} \leq t_{e*}$. Dynamic Community Discovery aims to identify the set $\mathcal {C}$ of all the dynamic communities in DG.
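As an illustration of Definition 2, a dynamic community can be stored as a mapping from each node to its list of membership periods. The following is a minimal sketch, not part of the original formulation: the names `is_valid` and `members_at` are hypothetical, and periods are assumed to be closed intervals over numeric timestamps.

```python
def is_valid(dc):
    """Check the constraint t_s <= t_e on every period of every node."""
    return all(ts <= te for periods in dc.values() for ts, te in periods)


def members_at(dc, t):
    """Nodes belonging to the dynamic community at time t.

    Assumption: each period (ts, te) is a closed interval.
    """
    return {
        v
        for v, periods in dc.items()
        if any(ts <= t <= te for ts, te in periods)
    }


# A dynamic community with two nodes: v1 joins twice, v2 once.
dc = {"v1": [(0, 2), (5, 7)], "v2": [(1, 3)]}
print(is_valid(dc), members_at(dc, 2), members_at(dc, 4))
```

Under this representation, the static community of Definition 1 is recovered by fixing a single time instant and calling `members_at`.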

Both proposed problem meta-definitions allow multiple solutions to the network clustering under different constraints. As an example, such definitions do not explicitly require complete coverage of the nodes, nor specify if the identified clustering represents a neat node partition or, instead, a cover (thus allowing overlaps among communities). In this work, we introduce a CD algorithm, ANGEL, tailored to extract overlapping communities from a complex network. Our approach is primarily designed for social network analysis and belongs to a well-known subfamily of Community Discovery approaches often identified by the keywords bottom-up and node-centric (Rossetti et al. 2017b). ANGEL aims to provide a fast way to compute reliable overlapping network partitions in the absence of topology dynamics. However, as we underlined, the unfolding of time plays a significant role in the structures describing social phenomena. To cope with such intrinsic evolution, we leverage ANGEL to design a simple Dynamic Community Discovery approach that can be used to track dynamic communities and their life-cycles. Both the proposed approaches focus on lowering the computational complexity of existing methods, proposing scalable sequential – although easily parallelizable – solutions to a very demanding task: overlapping network decomposition.

The paper is organized as follows. “Related works” section covers the relevant literature on community discovery needed to frame the proposed approach. In “ANGEL: static community discovery” section we introduce our static node-centric algorithm, ANGEL. There we discuss its rationale, the properties it holds as well as its computational complexity. In “ANGEL evaluation” section we evaluate the proposed method on both synthetic and real-world datasets for which ground truth communities are known in advance. To better discuss the resemblance of ANGEL partitions to ground truth ones as well as its execution times, we compare the proposed method with state-of-the-art competitors sharing the same rationale. In “ANGEL on dynamic networks” section we introduce the extension of ANGEL we designed to cope with dynamic network topologies. There we frame the proposed method in its general class and discuss its computational complexity. In “Dynamic ANGEL evaluation” section, as done for ANGEL, we evaluate its extension on both synthetic benchmarks and real-world dynamic networks. To do so, the concept of community life-cycle is introduced, and a qualitative analysis of community event trends is performed. Finally, “Conclusion” section concludes the paper.

Related works

Community discovery is a widely discussed and studied problem. Researchers continuously propose novel approaches with the aim of solving specific declinations of this complex, and ill-posed, problem. Due to the massive literature available in this field, several attempts were made to organize and cluster methods identifying some common grounds. Among others, the surveys of Fortunato (2010), Fortunato and Hric (2016) and Coscia et al. (2011) propose complete, detailed and extensive taxonomies for classic algorithms. However, due to the intrinsic complexity of the problem, several thematic surveys emerged, each focusing on a different declination (for instance considering overlapping (Xie et al. 2013), directed (Malliaros and Vazirgiannis 2013), node-centric (Rossetti et al. 2017b) as well as dynamic community discovery (Cazabet et al. 2017; Rossetti and Cazabet 2018)).

Static Community Discovery. The algorithmic solutions we propose share a very specific goal: identify overlapping network partitions following a bottom-up, node-centric strategy. Such an approach is often adopted while analyzing social network contexts (Rossetti et al. 2015; 2016; Milli et al. 2015), scenarios in which it is important to take into account the individuals' perspective on their local communities. Most importantly, in social scenarios, neat partitions are rarely semantically coherent or easily identifiable. Following such a rationale, ANGEL leverages individual ego-networks to access the node-centric perspective of the analyzed social graph. The growing availability of social media data has indeed allowed for extensive studies of such ego-centered topologies: among them, in Arnaboldi et al. (2017) Facebook and Twitter datasets were studied to relate online and offline properties of ego-networks. Such a procedure, originally proposed in Coscia et al. (2012) and Coscia et al. (2014a), was also extended to parallel implementations, as in Amoretti et al. (2016), and generalized in a high-level framework (Soundarajan and Hopcroft 2015). Moreover, several approaches leverage the concept of ego-network to design heterogeneous community definitions (Epasto et al. 2017; Buzun et al. 2014). Other common strategies to design node-centric approaches, avoiding the use of ego-networks, are the seed set expansion (Moradi et al. 2014; Whang et al. 2016) and community diffusion ones (Kumpula et al. 2008; Raghavan et al. 2007). The former decomposes the community detection into two steps: identification of the seed nodes and definition of an iterative rule that describes how they attract nodes to form communities around them. Conversely, the latter lets each node in the graph autonomously choose its community by observing the choices made by its neighborhood. A classic example of this family of approaches is offered by the Label Propagation algorithm used by ANGEL to identify local communities (Raghavan et al. 2007).

Dynamic Community Discovery. A significant number of systems can be modeled as temporal networks: cellular processes, social communications, large infrastructures (e.g., call graphs and web graphs) possess both network and temporal aspects that make them a perfect fit for dynamic network modeling. One of the first works underlining the need for a dedicated framework for analyzing evolving network structure is Holme and Saramäki (2012). Several formalisms for representing evolving networks have been proposed to support the definition of such a revised analytical framework: Temporal Networks (Holme and Saramäki 2012), Time-Varying Graphs (Casteigts et al. 2012), Interaction Networks (Rossetti et al. 2016), and Link Streams (Viard et al. 2016), to name the most famous. Leveraging such temporally enriched models, novel community discovery approaches started taking into account the temporal dimension, following different strategies. In Rossetti and Cazabet (2018), three families of DCD algorithms are identified and discussed:

Instant-optimal CD assumes that communities existing at t only depend on the current state of the network at t, as done by our dynamic Angel extension.
Temporal Trade-off CD assumes that communities defined at an instant t do not only depend on the topology of the network at that time, but also on the past evolutions of the topology, past partitions found, or both.
Finally, Cross-Time CD shifts the focus from searching communities relevant at a particular time to searching communities relevant when considering the whole network evolution.

The first family groups several two-step, Identify&Match, algorithms. The common ground of such approaches, e.g., (Palla et al. 2007; Takaffoli et al. 2011; Morini et al. 2017), is that they are easily parallelizable while suffering from some instability due to the matching phase performed as post-processing. Conversely, Temporal Trade-off approaches focus on smoothly identifying community evolutions as they happen. Such algorithms, e.g., (Cazabet et al. 2010; Zakrzewska and Bader 2015; Rossetti et al. 2017a), are designed to deal with high-frequency node interactions, are not easily parallelizable and are prone to “avalanche effects” (i.e., since they focus on local community perturbations, the node groups tend, as time goes by, to increase their sizes). Finally, algorithms of the last family search for stable communities across time, e.g., (Matias and Miele 2016; Mucha et al. 2010; Himmel et al. 2016). They often work upon temporal network aggregations built leveraging the complete knowledge of node and edge evolution. As a result, they are usually neither easily parallelizable nor applicable in online scenarios.

ANGEL: static community discovery

In this section, we present our bottom-up solution to the community discovery problem. In “Algorithm rationale” section we discuss the core of ANGEL^{Footnote 1}. Our approach follows a well-known pattern composed of two phases: (i) construction of local communities moving from ego-network structures and (ii) definition of mesoscale topologies by aggregating the identified local-scale ones. Moreover, in “Properties” and “Complexity” sections we discuss the properties of the proposed algorithm and provide bounds to its complexity.

Algorithm rationale

The algorithmic schema of ANGEL is borrowed from the one first adopted in Coscia et al. (2012), where the authors propose DEMON, an approach whose main goal was to identify local communities by capturing individual nodes' perspectives on their neighbourhoods and using them to build mesoscale ones. ANGEL follows the same rationale: however, unlike its predecessor, it focuses on lowering the time complexity while at the same time increasing the partition quality (as will be discussed in “Complexity” section).

ANGEL takes as input a graph $\mathcal {G}$, a merging threshold ϕ and an empty set of communities $\mathcal {C}$. The algorithm's main loop cycles over each node, so as to generate all the possible points of view of the network structure and guarantee complete coverage of its overall topology (Step #1 in Algorithm 1). To do so, for each node v, our algorithm applies the $EgoMinusEgo(v, \mathcal {G})$ (Step #2 in Algorithm 1) operation as defined in Coscia et al. (2014b). Such function extracts the ego-network centred in the node v – i.e., the subgraph of $\mathcal {G}$ induced by v and its first-order neighbours – then removes v from it, obtaining a novel, filtered graph substructure. ANGEL removes v since, by definition, it is directly linked to all nodes in its ego-network, connections that would lead to noise in the identification of local communities. A single node connecting the entire sub-graph will make all nodes very close, even if they are not in the same local community. Once the ego-minus-ego graph is obtained, the next step is to compute the local communities it contains (Step #3 in Algorithm 1). The algorithm performs this step by using a CD approach borrowed from the literature: Label Propagation (LP) (Raghavan et al. 2007). This choice, already adopted in Coscia et al. (2012), has been made for the following reasons:

1. LP is known as the least complex algorithm in the literature, reaching a quasi-linear time complexity in terms of nodes.
2. However, LP will return results of a quality comparable to more complex algorithms (Coscia et al. 2011).

Reason #1 is particularly important since Step #3 of our pseudocode needs to be performed once for every node of the network, thus making it unacceptable to spend a super-linear time for each node. Notice that instead of LP any other community discovery algorithm (either overlapping or not) can be used (impacting both the algorithmic complexity and the partition quality). Given the linear complexity (in the number of nodes of the extracted ego-minus-ego graph) of Step #3, we refer to this as the inner loop for finding the local communities. Due to the importance of LP for our approach and to shed light on how it works, we briefly describe its classical formulation (Raghavan et al. 2007). Suppose that a node v has neighbors v₁,v₂,...,v_k and that each one of them carries a label denoting the community it belongs to. Then, during each iteration, the label of v is updated to the majority label of its neighbours. As the labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation process, nodes sharing the same labels identify the resulting communities.
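The ego-minus-ego extraction and the classical label propagation step described above can be sketched as follows. This is an illustrative sketch rather than the reference implementation: the graph is assumed to be an undirected, self-loop-free dictionary mapping each node to the set of its neighbours, the function names are hypothetical, and ties are broken at random as in the classical formulation (not with the multiple memberships adopted by ANGEL).

```python
import random
from collections import Counter


def ego_minus_ego(graph, ego):
    """Subgraph induced by the neighbours of `ego`, with `ego` removed."""
    neighbours = graph[ego]
    return {u: graph[u] & neighbours for u in neighbours}


def label_propagation(graph, max_iter=100, seed=0):
    """Classical asynchronous LP with random tie-breaking."""
    rng = random.Random(seed)
    labels = {v: v for v in graph}   # each node starts in its own community
    nodes = list(graph)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not graph[v]:
                continue             # isolated nodes keep their label
            counts = Counter(labels[u] for u in graph[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[v] in candidates:
                continue             # already carrying a majority label
            labels[v] = rng.choice(candidates)
            changed = True
        if not changed:
            break                    # every node agrees with its neighbours
    communities = {}
    for v, label in labels.items():
        communities.setdefault(label, set()).add(v)
    return sorted(communities.values(), key=sorted)


# Node 0 bridges two triangles; removing it separates them.
g = {0: {1, 2, 3, 4, 5, 6},
     1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
     4: {0, 5, 6}, 5: {0, 4, 6}, 6: {0, 4, 5}}
print(label_propagation(ego_minus_ego(g, 0)))
```

In the toy graph, keeping node 0 would connect all remaining nodes at distance two, which is the noise the ego removal is meant to avoid.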

In case of bow-tie situations – i.e., a node having an equal maximum number of neighbors in two or more communities (example in Fig. 1) – the classic definition of the LP algorithm randomly selects a single label for the contended node. ANGEL, conversely, handles this situation – which otherwise can lead to nondeterministic behaviours – by allowing soft community memberships: each node can thus belong to multiple communities in case of a bow-tie configuration. The result of Steps #1-3 of Algorithm 1 is a set of local communities $\mathcal {C}(v)$, according to the perspective of a specific node, v, of the network. Unlike DEMON, ANGEL does not reintroduce the ego in each local community, so as to reduce the noise that hubs introduce during the merging step. Since local communities can be seen as a biased and partial view of the real community structure of $\mathcal {G}$, the result of ANGEL needs further processing: namely, a merging step that simplifies the local partition present in $\mathcal {C}$.

Once the outer loop on the network nodes is completed, ANGEL leverages the PRECISIONMERGE function to compact the community set $\mathcal {C}$ so as to avoid the presence of fully contained communities in it. Such function (Step #6, detailed in Algorithm 2) implements a deterministic merging strategy and is applied iteratively until reaching convergence (Step #4) – i.e., until the communities in $\mathcal {C}$ cannot be merged further. To ensure that all the possible community merges are performed at each iteration, $\mathcal {C}$ is ordered from the smallest community to the biggest (Algorithm 1, Step #6). This merging step is a crucial one since it needs to be repeated for each one of the local communities. In DEMON such operation requires the computation, for each pair of communities (x,y) with $x\in \mathcal {C}(v)$ and $y\in \mathcal {C}$, of an overlap measure (i.e., Jaccard index) and a check of whether it exceeds a user-defined threshold. This approach, although valid, has a major drawback: given a community $x\in \mathcal {C}(v)$ it requires $O(|\mathcal {C}|)$ evaluations to identify its best match among its peers. Indeed, such kind of strategy represents a costly bottleneck requiring an overall $O(|\mathcal {C}|^{2})$ complexity when applied to all the identified local communities. ANGEL aims to drastically reduce such computational complexity by performing the matches leveraging a greedy strategy. To do so, it proceeds in the following way:

Angel assumes that each node carries, as additional information, the identifiers of all the communities in $\mathcal {C}$ it already belongs to;
in Step #A (Algorithm 2), the frequency of the community identifiers associated with the nodes of each local community x is computed;
in Step #B, for each pair (community_id, frequency) the Precision w.r.t. x is computed, namely the percentage of nodes in x that also belong to community_id;
iff the precision is greater than (or equal to) a given threshold ϕ, the local community x is merged with community_id: their union is added to $\mathcal {C}$ and the original communities are removed from the same set.

Operating in this way avoids the time-expensive computation of community intersections required by Jaccard-like measures, since all the containment testing can be done in place. Figure 2 shows two examples of Angel clustering of the Zachary Karate club network obtained varying the ϕ threshold. As expected, increasing the ϕ threshold, we obtain a higher number of communities since lower quality merges cannot take place.
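The precision-based merging strategy can be sketched as follows. This is a simplified illustration and not a transcription of Algorithm 2: the function names are hypothetical, communities are plain sets of comparable node identifiers, and a local community is assumed to be merged with every previously seen community whose precision reaches ϕ.

```python
from collections import Counter


def precision_merge(local_communities, phi):
    """One merging pass over the communities, smallest first."""
    communities = {}   # community id -> set of member nodes
    membership = {}    # node -> ids of the communities it belongs to
    next_id = 0
    for x in sorted(local_communities, key=lambda c: (len(c), sorted(c))):
        # Step A: frequency of the community ids carried by the nodes of x.
        freq = Counter(cid for v in x for cid in membership.get(v, ()))
        # Step B: precision of each candidate w.r.t. x, compared with phi.
        targets = [cid for cid, f in freq.items() if f / len(x) >= phi]
        merged = set(x)
        for cid in targets:
            merged |= communities.pop(cid)
        # The union replaces x and every community it was merged with.
        for v in merged:
            ids = membership.setdefault(v, set())
            ids.difference_update(targets)
            ids.add(next_id)
        communities[next_id] = merged
        next_id += 1
    return sorted(communities.values(), key=sorted)


def merge_until_stable(local_communities, phi):
    """Repeat the pass until no further merge takes place."""
    current = [set(c) for c in local_communities]
    while True:
        merged = precision_merge(current, phi)
        if len(merged) == len(current):   # no merge happened in this pass
            return merged
        current = merged


local = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(merge_until_stable(local, 0.6))   # {2, 3, 4} has precision 2/3
print(merge_until_stable(local, 1.0))   # stricter threshold, no merge
```

No set intersection is computed explicitly: the overlap between x and each candidate is read off the identifiers already attached to the nodes of x.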

Properties

The proposed approach possesses two nice properties: it produces a deterministic output (given as input a network G and a threshold ϕ), and it allows for a parallel implementation.

Property 1

(Determinism) There exists a unique $\mathcal {C}$=ANGEL (G,ϕ) for any given G and ϕ, disregarding the order of visit of the nodes in G.

To prove the determinism of ANGEL it is mandatory to break its execution in two well-defined steps: (i) local community extraction and (ii) merging of local communities.

Local communities: Label Propagation identifies communities by applying a greedy strategy. In its classical formulation (Raghavan et al. 2007) it does not ensure convergence to a stable partition due to the so-called “label ping-pong problem” (i.e., an instability scenario primarily due to bow-tie configurations). However, as already discussed in “Algorithm rationale” section, we solved such a problem by relaxing the single-label constraint on nodes, thus allowing for the identification of a stable configuration of overlapping local communities.
Merging: this step operates on a well-determined set of local communities on which the PrecisionMerge procedure is applied iteratively. Since we explicitly impose the community visit ordering the determinism of the solution is given by construction.

Property 2

(Compositionality) ANGEL is easily parallelizable since the local community extraction can be applied locally on well defined subgraphs (i.e., ego-minus-ego networks).

Given a graph G=(V,E) it is possible to instantiate ANGEL local community extraction simultaneously on all the nodes u∈V and then apply the PRECISIONMERGE recursively in order to reduce and compact the final overlapping partition:

$$ Angel(G,\phi) =PrecisionMerge\left(\bigcup_{u\in V}LP(EgoMinusEgo(u))\right) $$

(1)

The underlying idea is to operate community merging only when all the local communities have already been identified (i.e., LABELPROPAGATION is applied to all the ego-minus-ego of the nodes u∈V – LP(EgoMinusEgo(u)) in Eq. 1 – as shown in Fig. 3). Moreover, this parallelization schema is assured to produce the same network partition obtained by the original sequential approach due to the determinism property.

Complexity

To evaluate the time complexity we proceed by decomposing ANGEL in its main components. Given the pseudocode description provided in Algorithm 1 we can divide our approach into the following sub-procedures:

Outer loop (lines 3-6): the algorithm cycles over all the nodes of the network to extract the ego-minus-ego networks and identify local communities. This main loop has thus complexity $\mathcal {O}(|V|)$.
Local Communities extraction: the Label Propagation algorithm has complexity $\mathcal {O}(n + m)$ (Raghavan et al. 2007), where n is the number of nodes and m is the number of edges of the ego-minus-ego network. Let us assume that we are working with a scale-free network, whose degree distribution is $p_{k} = k^{-\alpha}$: in this scenario the majority of the identified ego-minus-ego networks are composed of $n \ll |V|$ nodes and $m \ll |E|$ edges, thus the average complexity of each iteration will be $\mathcal {O}(n+m) \ll \mathcal {O}(|V|+|E|)$.
PrecisionMerge final cycle (lines 9-14): for each local community Angel evaluates if it can be merged with one or more previously identified substructures. To efficiently implement this task, we assume that once a community is identified a new identifier is generated and assigned to all the nodes within it. All the nodes will then have multiple labels attached (each one representing the identifier of a community the node belongs to). Given a community x the PrecisionMerge function (Algorithm 2) leverages such information to efficiently compute – for each community identifier y attached to the nodes in x – the ratio of nodes in it that already belong to y w.r.t. the size of x. If the ratio is greater than (or equal to) a given threshold, the merge is applied and the node labels are updated. This step can be performed with constant complexity, $\mathcal {O}(1)$, employing a hash map. Considering the complete loop the overall cost is thus given by the initial sorting of the communities by decreasing size, $\mathcal {O}(|C|\log |C|)$ (where C is the community set), and the evaluation of PrecisionMerge on each community in C, $\mathcal {O}(|C|)$. Moreover, we can assume the number of iterations $k \ll |C|$ since at each step the number of communities decreases: thus we can consider k as a constant factor, giving as final complexity $\mathcal {O}(|C|\log |C|)$ + $\mathcal {O}(|C|)$ = $\mathcal {O}(|C|\log |C|)$.

Considered together, such sub-procedures give us a final complexity of $\mathcal {O}(|V|(n+m)) + \mathcal {O}(|C|\log |C|)$: considering a scale-free network, for which we can reasonably expect $|V| \gg (n+m)$ and $|V| > |C|$, the final complexity can be approximated as $\mathcal {O}(|V|)$.

ANGEL evaluation

Evaluating a community discovery approach is not an easy task. In this section, we propose a two-stage evaluation, focusing both on underlining ANGEL efficiency – in terms of scalability and running time – as well as on its ability to retrieve ground-truth communities. As a first step, in “Competitors and datasets” section we identify the competitors of our algorithm, approaches that share with it the same rationale. After that, in “Community resemblance” section, we briefly describe the quality function we adopt to compare the partition produced by the selected algorithms and to assess their resemblance w.r.t. ground-truth communities. Finally, we evaluate ANGEL and its competitors on two different community resemblance tasks: (i) identification of planted ground truth partition in synthetically generated networks, “Synthetic benchmarks” section, and (ii) identification of semantic communities in real-world network datasets, “Evaluation on real world data” section.

Competitors and datasets

We defined ANGEL as a two-phase bottom-up approach that leverages label propagation to extract overlapping communities. To evaluate its performance, we compare it with state-of-the-art competitors having a similar rationale^{Footnote 2}. In particular, our analysis includes:

DEMON (Coscia et al. 2012; 2014b) is an incremental and limited time complexity algorithm for community discovery. It extracts ego networks, i.e., the set of nodes connected to an ego node u, and identifies the real communities by adopting a democratic, bottom-up merging approach of such structures.

PANDEMON (Amoretti et al. 2016) is a parallel implementation of DEMON designed to increase its scalability and to reduce the computational complexity of its community merging phase.

NODEPERCEPTION. In Soundarajan and Hopcroft (2015) the authors propose a generalization of the DEMON approach: NODEPERCEPTION instantiates the local two-phase schema by employing alternative community discovery approaches to Label Propagation in the local community extraction phase. Thanks to such flexibility, NODEPERCEPTION allows the final user to search for network partitions that optimize specific quality functions.

SLPA. Xie and Szymanski (2012) introduce an overlapping hierarchical community discovery algorithm designed for large-scale networks. SLPA leverages a label propagation strategy built upon dynamic interaction rules. The time complexity of SLPA scales linearly with the number of edges in the network.

The first three (DEMON, PANDEMON and NODEPERCEPTION) move from the same algorithmic schema as our approach. They all are node-centric algorithms (Rossetti et al. 2017b) that, moving from the analysis of ego-networks, generate overlapping partitions following a non-deterministic approach and providing different computational complexity. Conversely, the last competitor, SLPA, represents a fast implementation of the label propagation algorithm used by ANGEL, DEMON and PANDEMON to identify ego-network local communities. Even though SLPA does not fall in the node-centric algorithmic family, we decided to include it in our analysis since it can be seen as a baseline for all those algorithms employing label propagation as the internal function.

Synthetic benchmarks. To evaluate how ANGEL behaves under specific, controlled settings we tested it, along with its competitors, against synthetic networks having planted ground truth communities generated through the LFR benchmark^{Footnote 3} (Lancichinetti et al. 2008). The networks described by LFR have well-known characteristics: among others, both their node degrees and community sizes follow power-law distributions. Moreover, similar to the planted l-partition model (Condon and Karp 2001), LFR network vertices share a predefined fraction of their links with other vertices of their cluster. Finally, LFR allows the analyst to decide the average cluster density and size of the generated graph. We generated multiple networks varying the following LFR parameters:

N, the network size (from 100 to 100k nodes);
C, the network density (from 0.1 to 0.4, steps of 0.1);
μ, the mixing coefficient describing the average per-node ratio between the number of edges to its communities and the number of edges with the rest of the network (from 0.1 to 0.5, steps of 0.1).

Real world data. To understand how ANGEL and its competitors behave on real-world data, we tested them against four network datasets having annotated ground-truth community structure^{Footnote 4}. We analyzed the following datasets (Yang and Leskovec 2015), whose summary statistics are reported in Table 1:

emailEU. Network built upon email exchange data among members of a large European research institution. The ground truth communities identify members’ departments.
Amazon. Network built using the Customers Who Bought This Item Also Bought feature of the Amazon website. Each product category provided by Amazon defines a ground-truth community.
dblp. Co-authorship network where two authors are connected if they published at least one paper together. Publication venues, e.g., journals or conferences, define ground-truth communities.
Youtube. Subgraph of the Youtube social network. User-defined groups identify ground-truth communities.

Table 1 Datasets statistics

Unlike synthetic benchmarks, where the planted communities respect specific topological characteristics, real data annotation provides a semantic partition of network nodes. Since none of the considered algorithms is parameter-free, in our analysis we instantiate each one of them multiple times, performing a grid-search estimation of the optimal parameter for each target network. Such parameter fitting strategy ensures that, for each network, we compare the performance of the selected algorithms leveraging the partitions that best approximate the ground truth ones.

Community resemblance

One way to assess the effectiveness of a CD algorithm is to measure how well the communities it identifies approximate a given ground truth partition. To quantify the degree of resemblance of two graph partitions we apply an efficient methodology proposed in Rossetti et al. (2016). Given a community set X produced by an algorithm and a ground truth community set Y, for each community x∈X we label its nodes with the ground truth community y∈Y they belong to. Then we match community x with the ground truth community with the highest number of labels in the algorithm community. Such procedure produces (x,y) pairs having the highest homophily between the node labels in x and all the ground truth communities. The quality of the produced mappings is estimated in terms of precision and recall:

Precision: identifies the percentage of nodes in x labeled as y. It is defined as:

$$ P=\frac{|x\cap y|}{|x|} \in [0, 1] $$

(2)

Recall: identifies the percentage of nodes in y covered by x. It is defined as:

$$ R=\frac{|x\cap y|}{|y|} \in [0, 1]. $$

(3)

Given a pair (x,y) the two scores describe the overlap of their members. A perfect match is obtained when both precision and recall are equal to 1. Indeed, many-to-one mappings can occur: multiple communities in X can be connected to a single ground truth community in Y. This peculiarity allows the adoption of such methodology to evaluate both algorithms producing crisp partitions as well as approaches producing overlapping ones. In Rossetti et al. (2016), precision and recall are combined into their harmonic mean obtaining the F1-measure, a concise quality score for the individual pairing:

$$ F1=2\frac{\textit{precision}*\textit{recall}}{\textit{precision}+\textit{recall}}. $$

(4)
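The matching procedure and Eqs. 2-4 can be illustrated with a minimal Python sketch. It is not the reference implementation of Rossetti et al. (2016): the function names `node_labels` and `match_community`, and the choice of representing communities as sets of nodes stored in a dictionary, are assumptions made here for illustration only.

```python
from collections import Counter, defaultdict


def node_labels(ground_truth):
    """Map every node to the ids of the ground truth communities it belongs to.

    ground_truth: dict {community id: set of nodes} (hypothetical input format).
    """
    labels = defaultdict(set)
    for y_id, y in ground_truth.items():
        for node in y:
            labels[node].add(y_id)
    return labels


def match_community(x, ground_truth, labels):
    """Match x with the ground truth community whose label is most frequent in x.

    Returns (y_id, precision, recall, f1), or None if no node of x is labelled.
    """
    # Count how many nodes of x carry each ground truth label.
    counts = Counter(y_id for node in x for y_id in labels.get(node, ()))
    if not counts:
        return None
    y_id, overlap = counts.most_common(1)[0]        # overlap = |x ∩ y|
    precision = overlap / len(x)                    # Eq. 2
    recall = overlap / len(ground_truth[y_id])      # Eq. 3
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 4
    return y_id, precision, recall, f1


ground_truth = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}}
labels = node_labels(ground_truth)
result = match_community({1, 2, 3, 5}, ground_truth, labels)
```

In this toy example the community {1, 2, 3, 5} is matched with "A": three of its four nodes carry that label and three of the four nodes of "A" are covered. Labelling nodes once and reusing the labels for every x is what keeps the cost of each match proportional to the size of x.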

Given a network, the F1 score can be averaged over all the identified pairs in order to summarize the overall correspondence between the algorithm community set and the ground truth community set. In the following, we adopt a normalized version of the F1 score, namely NF1, that mitigates the issues related to coverage and redundancy of communities in assessing the final matching quality. In particular, denoting by Y_id the set of communities of Y matched by at least one community in X, we can define Coverage as:

$$ Coverage=\frac{|Y_{id}|}{|Y|} \in [0,1] $$

(5)

which identifies the percentage of communities in Y that are matched by at least one community in X. Redundancy instead can be defined as:

$$ Redundancy=\frac{|X|}{|Y_{id}|} \in [1,+\infty) $$

(6)

Redundancy is minimized when no conflicting matches exist among the communities in X and the ones in Y_id. Finally NF1 can be defined as:

$$ NF1=\frac{F1*Coverage}{Redundancy} \in (0, 1] $$

(7)

NF1 is maximized when: (i) the average F1 is maximal (perfect match), (ii) the communities in X provide complete coverage of the ones in Y and (iii) the redundancy is minimized (i.e., each community in X is matched with a distinct community in Y). As shown in Rossetti et al. (2016), it is possible to compute F1 (and thus NF1) with a complexity that is linear in the size of the community set X. The reduced complexity makes NF1 a suitable alternative to the widely used NMI (Lancichinetti et al. 2008). Moreover, as discussed in Lancichinetti et al. (2009) and McDaid et al. (2011), NMI is not stable when comparing overlapping partitions with non-overlapping ones, while NF1 does not suffer from such a limitation. Since all the compared algorithms produce overlapping partitions, NF1 represents a reasonable resemblance function to adopt. Moreover, considering our analytical setup, we avoid applying aggregate ranking solutions (as proposed in Jebabli et al. (2018) and Dao et al. (2018)) that are designed to obtain a more comprehensive view of clustering properties. Evaluation approaches belonging to such a family generate an aggregate ranking able to summarize the behaviors of alternative CD solutions once multiple clustering fitness/comparison scores are available (see, for instance, Orman et al. (2012)). However, considering that all the compared algorithms are intended to operate under the same rationale, we feel that focusing on NF1 provides enough information to draw a few considerations on the obtained clusterings.
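As a worked illustration of Eqs. 5-7, the following Python sketch derives Coverage, Redundancy and NF1 from a list of already computed matches. The function name `nf1` and its input format (one `(ground truth id, F1)` pair per matched community of X) are assumptions introduced here, not part of the original methodology.

```python
def nf1(matches, n_found, n_ground_truth):
    """Compute NF1 from per-community matches.

    matches: list of (ground truth id, F1) pairs, one per matched community in X.
    n_found: |X|, number of communities produced by the algorithm.
    n_ground_truth: |Y|, number of ground truth communities.
    """
    if not matches:
        return 0.0
    mean_f1 = sum(f1 for _, f1 in matches) / len(matches)
    matched_ids = {y_id for y_id, _ in matches}    # Y_id
    coverage = len(matched_ids) / n_ground_truth   # Eq. 5
    redundancy = n_found / len(matched_ids)        # Eq. 6
    return mean_f1 * coverage / redundancy         # Eq. 7


# Two communities matched to two distinct ground truth communities, perfectly.
perfect = nf1([("A", 1.0), ("B", 1.0)], n_found=2, n_ground_truth=2)

# Two communities both matched to "A": half coverage and redundancy 2.
redundant = nf1([("A", 0.8), ("A", 0.6)], n_found=2, n_ground_truth=2)
```

In the second case the average F1 of 0.7 is multiplied by a coverage of 0.5 and divided by a redundancy of 2, yielding 0.175: many-to-one mappings and uncovered ground truth communities both lower the final score.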

Experimental results

Synthetic benchmarks

In Fig. 4 we report the execution time and NF1 score for the compared CD approaches. Our experiments show that our approach considerably improves on the running times of its competitors as the network size increases. In particular, it is worth noticing that Fig. 4a reports execution times on a log scale: considering the average runtime of ANGEL on the generated 100k-node graphs, it registers a speedup of an order of magnitude w.r.t. its competitors. In Fig. 4b-c the NF1 score is used to compare the adherence of the partitions identified by the selected CD algorithms to the ground truth ones: we omitted NODEPERCEPTION’s results since its overall NF1 scores were always lower than 0.4. In particular, Fig. 4b compares the average NF1 scores obtained by each algorithm on LFR graphs of different sizes. To compute the NF1 mean value for the pair <algorithm,network_size> we considered the results provided by the optimal parameter configuration w.r.t. each network size instantiation (e.g., varying graph density and mixing coefficient). Among the compared methods, ANGEL always reaches the highest scores, often producing a perfect match for the planted communities. Figure 4c underlines the impact of the network mixing coefficient on the quality of the extracted communities once the network size is fixed. We observe that ANGEL and SLPA assure relatively stable performances while varying μ.

Evaluation on real world data

Table 2 shows the running times (expressed in seconds) of the compared CD approaches when applied to the selected networks. As already underlined in the synthetic scenario, ANGEL outperforms its competitors, often achieving execution times one or more orders of magnitude lower.

Table 2 Running times


Differently from the synthetic scenario, when it comes to assessing community resemblance (quantitative values in Table 3) we observe a relatively low quality for all the partitions produced by the compared algorithms. Such results are somewhat expected. Unlike the synthetic benchmark, where the planted communities were designed to follow specific topological characteristics, the semantic annotations provided for the analyzed real-world networks do not necessarily reflect structural properties (Hric et al. 2014). Such decoupling makes it difficult, if not impossible, for CD algorithms that do not leverage semantic information to capture the same partition identified by the ground truth. However, even in this more complex scenario, ANGEL communities are the ones that best approximate the provided ground truth node partitions. In order to provide a statistical significance bound to our experiment on real data we also applied a Friedman test (Friedman 1937) with Li post-hoc evaluation (Li 2008) to the results reported in Table 3. The null hypothesis was rejected for the NF1 scores at a significance level of 0.05, thus implying that the compared methods do actually behave differently when tested on multiple datasets. Moreover, the post-hoc analysis underlined that ANGEL significantly outperforms NODEPERCEPTION at the same significance level, and all the others when the significance level is set to 0.1.

Table 3 Community resemblance


ANGEL on dynamic networks

In this section, we propose an extension of ANGEL tailored to extract communities from dynamic network topologies so to observe their evolution as time goes by. From a modelling point of view, in the following sections, we represent a dynamic network by using snapshot graphs:

Definition 3

(Snapshot Graph) Let G be an attributed graph G=(V,E,T), where V is a set of nodes, E a set of edges and $T = \{0,1,\dots,n\}$ an ordered set of labels (associated to both nodes and edges) identifying different timestamps. Given a label i∈T, we call G_i=(V_i,E_i) the graph induced from G composed of the nodes and edges whose label is i. A Snapshot Graph is defined as the set $\mathcal {G}=\{G_{0}, G_{1},\dots, G_{n}\}$ composed of $n+1$ consecutive, non temporally overlapping partitions of G such that $G=\cup _{i=0}^{n}G_{i}$.
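To make Definition 3 concrete, the following Python sketch groups a list of timestamped edges into snapshots. The input format (triples `(u, v, t)`) and the function name `build_snapshots` are assumptions made for this illustration; moreover, the node set of each snapshot is derived here from its edges, whereas the definition allows nodes to carry their own timestamp labels.

```python
from collections import defaultdict


def build_snapshots(timestamped_edges):
    """Group (u, v, t) triples into a time-ordered list of snapshots.

    Returns a list of (t, V_t, E_t) tuples, where E_t is a set of undirected
    edges (frozensets) and V_t is the set of their endpoints.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for u, v, t in timestamped_edges:
        edges[t].add(frozenset((u, v)))  # undirected edge labelled with t
        nodes[t].update((u, v))
    # Sorting the labels yields consecutive, non temporally overlapping snapshots.
    return [(t, nodes[t], edges[t]) for t in sorted(edges)]


snapshots = build_snapshots([(1, 2, 0), (2, 3, 0), (1, 3, 1)])
```

Each edge falls in exactly one snapshot, so the union of the returned graphs reconstructs the original edge list, as the definition requires.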

Using such temporal discretization, in the “Instant-Optimal dynamic community discovery” section we extend ANGEL to handle dynamic networks and briefly frame the resulting approach within a specific subclass of DCD methodologies: Instant Optimal, Identify&Match algorithms. Finally, in the “Complexity” section we discuss the computational complexity of the proposed method.

Instant-Optimal dynamic community discovery

As previously discussed, ANGEL efficiently addresses the classical formulation of the overlapping community discovery problem: however, per se it is not designed to take into account the challenges that evolving network topologies generate. The natural way to enable ANGEL to perform dynamic community analysis is to extend it by applying an algorithmic schema known as Identify&Match (or, equivalently, “Two-steps” (Alhajj and Jon 2014)). Such a schema characterizes the vast majority of DCD approaches originating from the extension of static methods to dynamic topologies. As suggested by its name, such a strategy describes a two-step process:

Identify: detect static communities on each step of evolution;
Match: align the communities identified at step t with the ones at step t−1.

The main advantage of two-step solutions is that they allow reusing static CD techniques, avoiding the definition of novel, often context dependent, methodologies. One of the reasons for the abundance of methods belonging to such a family lies in the fact that the matching step can be derived from existing literature, since set matching is a widely studied problem. Moreover, Identify&Match makes it easy to describe parallelizable analytical workflows. As discussed in Rossetti and Cazabet (2018), “Two-step” approaches represent a specialization of a more general class of algorithms, called Instant-Optimal CD. Indeed, matching the communities found at different stages of network evolution might involve comparing several sets of temporally disjoint network partitions: however, Instant-Optimal CD approaches assume that the partition identified at t is optimal w.r.t. the topology of the network at t. DCD solutions falling in this class are, by definition, non-temporally smoothed and represent the best choice when the final goal is to provide communities which are as good as possible at each step of the evolution of the network.

Algorithm 3 details the pseudocode of the dynamic extension of ANGEL. The inputs required by the algorithm are (i) a set of snapshot graphs, and (ii) the ϕ threshold (as requested by ANGEL). To avoid introducing a third parameter (e.g., a similarity threshold for the matching phase), we do not use the Jaccard similarity, a widely adopted strategy in this kind of approach, while aligning community sets extracted from consecutive network snapshots. Instead, we adopt a matching criterion similar to the one used to merge communities in the second phase of ANGEL, thus providing a coherent merging/matching strategy. We assume that each node at time t carries three sets of labels: (i) the identifiers of the communities it currently belongs to; (ii) the identifiers of the communities it was part of at t−1; and (iii) the identifiers of the communities it will be associated to at t+1. Given two community sets, i.e., the ones at time t−1 and t, constructing the requested labelling has linear complexity in the number of nodes. Once the nodes belonging to temporally adjacent partitions are labelled, the following matching procedure is performed:

firstly, each community identified in G_t−1 is matched with the ones in G_t that maximize the precision score;
secondly, the same criterion is used to match each community in G_t with the most similar ones in G_t−1.

Indeed, the precision score (as defined in Equation 2) is not symmetric: performing the matching in both directions thus makes it possible to identify different evolutive patterns involving the observed communities.
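The two-way matching described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 3: the function names `best_by_precision` and `match_snapshots` and the dictionary-of-sets representation of community sets are assumptions, and the sketch compares communities directly rather than through the node labelling discussed in the text.

```python
def best_by_precision(sources, targets):
    """For each source community, list the target communities of maximal precision.

    sources, targets: dict {community id: set of nodes} (hypothetical format).
    Precision of target t for source s is |s ∩ t| / |s| (Eq. 2); since |s| is
    fixed for a given source, ranking by overlap size is equivalent.
    """
    result = {}
    for s_id, s in sources.items():
        overlaps = {t_id: len(s & t) for t_id, t in targets.items() if s & t}
        if overlaps:
            top = max(overlaps.values())
            result[s_id] = sorted(t_id for t_id, o in overlaps.items() if o == top)
    return result


def match_snapshots(previous, current):
    """Match communities of t-1 to those of t, and the other way round."""
    forward = best_by_precision(previous, current)    # t-1 -> t
    backward = best_by_precision(current, previous)   # t -> t-1
    return forward, backward


previous = {"p1": {1, 2, 3, 4, 5, 6}}
current = {"c1": {1, 2, 3}, "c2": {4, 5, 6}}
forward, backward = match_snapshots(previous, current)
```

In this toy example a community at t−1 divides into two halves at t. The forward pass links "p1" to both "c1" and "c2" with precision 0.5, while the backward pass links each of them to "p1" with precision 1: the asymmetry between the two directions is what exposes the split.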