A novel framework for community modeling and characterization in directed temporal networks

Bongiorno, Christian; Zino, Lorenzo; Rizzo, Alessandro

doi:10.1007/s41109-019-0119-2

Research
Open access
Published: 29 March 2019

Applied Network Science volume 4, Article number: 10 (2019)

Abstract

We deal with the problem of modeling and characterizing the community structure of complex systems. First, we propose a mathematical model for directed temporal networks based on the paradigm of activity driven networks. Many features of real-world systems are encapsulated in our model, such as hierarchical and overlapping community structures, heterogeneous attitude of nodes in behaving as sources or drains for connections, and the existence of a backbone of links that model dyadic relationships between nodes. Second, we develop a method for parameter identification of temporal networks based on the analysis of the integrated network of connections. Starting from any existing community detection algorithm, our method enriches the obtained solution by providing an in-depth characterization of the very nature of the role of nodes and communities in generating the temporal link structure. The proposed modeling and characterization framework is validated on three synthetic benchmarks and two real-world case studies.

Introduction

Many seminal studies have revealed that communities are ubiquitous in networked systems of diverse nature. In fact, community structures have been identified in social, financial, biological, and in many other networks (Girvan and Newman 2002; Newman 2006; Estrada 2011; Benson et al. 2016; Yang and Leskovec 2014). Such communities typically have a complex structure: they present a hierarchical and overlapping organization (Ravasz et al. 2002; Palla et al. 2005; Lancichinetti et al. 2009; Pons and Latapy 2011) and their components have heterogeneous attitudes in the link formation process (Palla et al. 2007), acting as sources, mainly generating connections, or as drains, on the contrary. Further challenges have to be tackled toward a comprehensive analysis of real-world systems. First, evidence suggests that the patterns of connections between the components of a real-world networked system evolve in time (Volz and Meyers 2008; Holme and Saramäki 2012; Pastor-Satorras et al. 2015; Latapy et al. 2018). Second, the nature of interactions in many biological and technological systems has an inherent direction and is thus nonsymmetrical (Leicht and Newman 2008). Finally, there exist connections that are generated by dyadic relationships between nodes, rather than by properties of the single node. These connections give rise to a link structure, which is often referred to as the irreducible backbone of temporal interactions, or the structure of strong ties (Onnela et al. 2007; Gemmetto et al. 2017). These factors often challenge the applicability of existing temporal network models and community detection algorithms (Lancichinetti et al. 2011; Fortunato and Hric 2016; Khan and Niazi 2017; Schaub et al. 2017; Zhang et al. 2018).

In this work, we propose a novel and extremely flexible model of temporal networks that encompasses many complex features of real-world systems. We refer to this model as routed activity driven networks (rADN). It extends the paradigm of activity driven networks (ADNs), which have emerged as a valuable framework to represent and study time-varying networks of interactions (Perra et al. 2012). The main strength of ADNs lies in their simplicity: the time-varying nature of the network is indeed encapsulated in a single parameter vector, called activity, which quantifies the propensity of each node to generate transitory connections with the others. Such an activity parameter vector can be easily inferred from empirical data (Perra et al. 2012; Karsai et al. 2014; Liu et al. 2014; Rizzo et al. 2016). Since their original formulation in (Perra et al. 2012), many features have been included into the ADN paradigm toward realistic modeling of complex networks of interactions. These features include a continuous-time formulation of the framework (Zino et al. 2016; 2017), the heterogeneous propensity of nodes to receive connections (Pozzana et al. 2017; Alessandretti et al. 2017), memory mechanisms in the link generation process (Karsai et al. 2014; Sun et al. 2015; Zino et al. 2018), the partition of nodes into a simple community structure (Nadini et al. 2018b), and the presence of an irreducible backbone of recurrent connections (Lei et al. 2016; Nadini et al. 2018a). The simple formulation of ADNs and their extensions are amenable to analytical treatment, allowing for the study of many phenomena on time-varying heterogeneous networks, including epidemic outbreaks (Rizzo et al. 2014, 2016; Petri and Barrat 2018), diffusion of innovation (Rizzo and Porfiri 2016), opinion dynamics (Li et al. 2017), and percolation problems (Starnini and Pastor-Satorras 2014).

Our model incorporates the following features: i) the heterogeneity in the propensity to form links with other network nodes; ii) the directionality of such links; iii) a hierarchical and overlapping time-invariant community structure; iv) the heterogeneous involvement of nodes within their communities, acting as sources or drains; and v) the presence of an irreducible backbone. The model relies on a relatively compact parameter set able to elicit the complexity of the system in an elegant and intelligible form, highlighting the community structure and its role in the network formation process.

The main goal of this work is to provide a comprehensive and effective means to describe a complex and heterogeneous system through a mesoscopic characterization at the community level. We believe that such a characterization is of great interest for several applications (e.g., epidemic containment, countering misinformation) for which, on the one hand, macroscopic mean-field approaches fail to account for the inherent diversity throughout the system, whereas, on the other hand, the microscopic characterization at the node level comprises an extremely large parameter set, hampering analytical tractability and making time- and resource-consuming Monte Carlo simulations the only way to shed light on the system properties. A preliminary version of this model was presented in (Bongiorno et al. 2018). Here, we extend such a preliminary work by adding the irreducible backbone of strong ties, a more flexible and comprehensive connection mechanism, and a more detailed analysis and assessment of the proposed model over synthetic benchmarks and two real-world datasets.

The identification of the model parameters from temporal link formation data poses a series of challenges, since the occurrence of a link cannot be unequivocally attributed to the mechanism that has generated it. Hence, differently from existing methods for community detection that tend to explain link formation as a sole consequence of the community structure, here we establish a probabilistic framework to quantify the belief in alternative link formation processes, accounting for three co-existing mechanisms of connection: i) communities, ii) community-free, and iii) backbone.

The proposed parameter identification and community detection strategies unfold over three main steps, starting from the observation of the links generated during a given time-window, where it is assumed that both the community structure and the backbone do not change in time. Using an integrated version of the observed temporal network of contacts, we apply an existing community detection method (Fortunato and Hric 2016; Khan and Niazi 2017) to infer a reference community structure for our rADN-based model. Then, we perform parameter identification of the rADN model solving a quadratic optimization program with linear constraints (Boyd and Vandenberghe 2004). Such an identification problem is naturally underdetermined, implying that a family of rADN models with the same community structure but different parameters may equivalently reproduce the link formation process. In fact, different parameter combinations lead to different probabilistic explanations of the link formation, relying on different probabilistic blends of the three mechanisms mentioned above. A free parameter vector, called community belief, is thus defined to quantify the belief in the role of communities in the network formation: when the components of the community belief approach one, we tend to assume that the connection mechanism is mostly governed by communities, otherwise, when they approach zero, the role of communities becomes negligible. Hence, a family of models may be obtained by running an identification procedure for each value of the community belief. Confidence intervals on the community belief can then be established, in order to obtain a family of models that is practically compatible with the available data. Thanks to the belief mechanism, the use of our method in conjunction with different preliminary community detection algorithms allows us to quantitatively compare different hypotheses on the community structure and the link formation process.

We successfully validate our approach on three synthetic benchmarks that exhibit different features, all generated through the proposed rADN model. We then apply our method to two different real-world case studies. The first is based on the Enron email corpus (Cohen), where no information is available on the community structure. Here, we propose our method as a tool to compare and assess the outputs of different community detection algorithms. The second case study uses data about face-to-face interactions in a primary school (SocioPatterns). Here, we use metadata on the partition of students in classes to provide a ground truth on the community structure. In this case, our method is used to improve the characterization of the community structure.

The rest of this paper is organized as follows. In “Model” section, we formalize the rADN model. In “Estimation of the model parameters from empirical data” section, we propose a method to estimate the parameters of a rADN from empirical data. In “Validation on synthetic networks” section, we validate the proposed method on three synthetic networks, exhibiting different features of real-world networks such as heterogeneity, the presence of a backbone, overlapping and hierarchical communities. “Case studies” section is devoted to the analysis of the two case studies. Finally, “Conclusion” section concludes the paper and outlines our future research.

Model

A rADN is a network composed of a set of n nodes $\mathcal V=\{1,\dots,n\}$ connected through a time-varying link structure $\mathcal G(t)=(\mathcal V,\mathcal E(t))$, where $\mathcal E(t)\subseteq \mathcal V\times \mathcal V$ denotes the time-varying link set. Links are generated according to a continuous-time mechanism, following the formalism proposed in (Zino et al. 2016; 2017). The continuous-time formulation allows for addressing some theoretical limitations posed by the original discrete-time formulation of ADNs (Perra et al. 2012) and is not subject to the issues related to the choice of the discrete time step (Ribeiro et al. 2013).

A positive (time-invariant) activity rate a_i>0 is assigned to each node $i\in \mathcal V$. The activity rate quantifies the node’s propensity to generate transitory connections with other nodes in the network, as detailed in the following. Activity rates are gathered in an n-dimensional vector a. We define the routing matrix $P\in [0,1]^{n\times n}$ as a stochastic matrix (i.e., an entry-wise nonnegative matrix such that all rows sum to one) with zero diagonal entries. The entry P_ij of the routing matrix measures the propensity of node i to generate connections toward node j, as detailed in the following.

Hence, the triple $(\mathcal V,a,P)$ identifies a rADN. Connections are generated according to a similar mechanism to the one of standard continuous-time ADNs (Zino et al. 2016), except for a nonuniform choice of the connection wirings, which are governed by the routing matrix P, whose construction will be detailed later in this section. The following algorithm summarizes the evolution of a rADN:

1. at t=0, the link set is set as empty ($\mathcal E(0)=\emptyset $) and a Poisson clock (Bailey 1990) with rate a_i (each one independent of the others) is initialized for each node $i\in \mathcal V$;
2. if at time t the clock associated with node i clicks, then node i activates and randomly selects a node $j\in \mathcal V$ to connect to with probability P_ij;
3. the directed link (i,j) is instantaneously added to the link set $\mathcal E(t)$; and
4. link (i,j) is immediately removed from the link set, the Poisson clock associated with node i is re-initialized, and the algorithm is resumed to item 2.
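
The four items above can be sketched as a short event-driven simulation. This is an illustrative sketch and not the authors' code: the function name `simulate_radn`, the 0-based node labels, and the toy values of `a` and `P` are assumptions. Instead of keeping one clock per node, the sketch uses the standard equivalent view in which the superposition of independent Poisson clocks is a single Poisson clock with rate equal to the sum of the activity rates.

```python
import random

def simulate_radn(a, P, T, seed=0):
    """Return the time-ordered ephemeral links (t, i, j) generated in [0, T]."""
    rng = random.Random(seed)
    nodes = range(len(a))
    total_rate = sum(a)
    events, t = [], 0.0
    while True:
        # Superposition of the n independent Poisson clocks: the next click
        # arrives after an exponential waiting time with rate sum(a) ...
        t += rng.expovariate(total_rate)
        if t > T:
            return events
        # ... and it belongs to node i with probability a[i] / sum(a).
        i = rng.choices(nodes, weights=a)[0]
        # Item 2: node i selects its target j with probability P[i][j].
        j = rng.choices(nodes, weights=P[i])[0]
        # Items 3-4: the link is ephemeral, so only its occurrence is logged.
        events.append((t, i, j))

# Hypothetical toy network with three nodes (values chosen for illustration).
a = [1.0, 2.0, 0.5]
P = [[0.0, 0.5, 0.5],
     [0.9, 0.0, 0.1],
     [1.0, 0.0, 0.0]]
events = simulate_radn(a, P, T=10.0)
```

Because the diagonal of `P` is zero, no self-loops are ever generated, and fixing `seed` makes the run reproducible.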

In their original formulation, links of continuous-time ADNs are ephemeral. Even though this could seem an oversimplification, many interactions in social and biological systems have a negligible duration with respect to the time scale of the network evolution and of the emerging phenomena of the system. Relevant examples are e-mails or messages exchanged in social networks or physical interactions between individuals. The model can be straightforwardly extended by including nonephemeral connections, by modifying item 4 of the previous algorithm. For instance, one could remove link (i,j) after a certain time-interval, which could be fixed or drawn at random from any distribution.

According to our mechanism, the occurrences of the directed link (i,j) are governed by a (split) Poisson process (Ross 2009), whose rate is equal to

$$ A_{ij}=a_{i}P_{ij}. $$

(1)

The link activation rates A_ij can be gathered into the activity rate matrix $A\in \mathbb R_{+}^{n\times n}$, which encapsulates the information both on the activity rate vector a, and on the routing matrix P. Therefore, a rADN is completely identified by the couple $(\mathcal V,A)$. Given A, the activity rate vector a and the routing matrix P can be retrieved as

$$ a_{i}=\sum_{j\in \mathcal V}A_{ij},\qquad P_{ij}=\frac{A_{ij}}{\sum_{h\in \mathcal V}A_{ih}},\qquad i,j\in \mathcal V. $$

(2)
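
As a quick illustration of (1) and (2), the sketch below splits a small activity rate matrix into its activity vector and routing matrix and then recombines them. The helper names `decompose` and `compose` and the numerical values are hypothetical, and every row of A is assumed to have a positive sum, as required by a_i>0.

```python
def decompose(A):
    """Eq. (2): activity rates are row sums, routing rows are normalized rows."""
    a = [sum(row) for row in A]
    P = [[x / a_i for x in row] for row, a_i in zip(A, a)]
    return a, P

def compose(a, P):
    """Eq. (1): A_ij = a_i * P_ij."""
    return [[a_i * p for p in row] for a_i, row in zip(a, P)]

# Hypothetical activity rate matrix (zero diagonal, positive row sums).
A = [[0.0, 1.5, 0.5],
     [0.3, 0.0, 0.3],
     [2.0, 0.0, 0.0]]
a, P = decompose(A)
A_back = compose(a, P)
```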

Fixing a time-window of duration T>0, we introduce the weighted integrated network over the time-window, represented by the pair $\mathcal G_{T}=(\mathcal V,W)$, where $\mathcal V$ is the node set, which coincides with the node set of the time-varying network, and $W\in \mathbb Z_{+}^{n\times n}$ is a weighted adjacency matrix. Specifically, W_ij counts the number of occurrences of directed links from node i to node j in the time-window of duration T. In rADNs, the entry W_ij of the weighted adjacency matrix is a Poisson distributed random variable with parameter A_ijT (Ross 2009). Hence, the probability of observing w occurrences of the link (i,j) in the integrated network $\mathcal G_{T}$ is equal to

$$ \mathbb P[W_{ij}=w]=\frac{\left(A_{ij}T\right)^{w}}{w!}\exp\{-A_{ij}T\}. $$

(3)
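
For concreteness, (3) can be evaluated numerically as follows. The function name `link_count_pmf` and the example rate are assumptions made for illustration only.

```python
import math

def link_count_pmf(w, rate, T):
    """Eq. (3): probability of w occurrences of a link with rate A_ij in a window T."""
    mean = rate * T
    return mean ** w / math.factorial(w) * math.exp(-mean)

# Hypothetical link with rate A_ij = 0.8 observed over a window T = 5.
probabilities = [link_count_pmf(w, 0.8, 5.0) for w in range(60)]
```

The probabilities sum to one (up to the truncation of the tail), and the most likely counts lie around the mean A_ijT.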

Figure 1 illustrates an example of the construction of an integrated network from link observations over a given time-window of duration T=1. In this example, the weighted adjacency matrix is equal to

$$ W=\left[\begin{array}{lll} 0 &2&2\\1&0&1\\0&1&0 \end{array}\right]. $$

(4)
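
The integration step can be sketched as a simple count over a list of timestamped directed links. The event list below is hypothetical: its timestamps are invented and were chosen only so that the resulting counts reproduce the matrix in (4), with nodes relabeled from 0.

```python
def integrate(events, n, T):
    """Count the occurrences of each directed link (i, j) within [0, T]."""
    W = [[0] * n for _ in range(n)]
    for t, i, j in events:
        if 0.0 <= t <= T:
            W[i][j] += 1
    return W

# Hypothetical observations (time, source, target) over a window T = 1.
events = [(0.05, 0, 1), (0.10, 1, 2), (0.20, 0, 2), (0.35, 2, 1),
          (0.50, 0, 1), (0.60, 1, 0), (0.80, 0, 2)]
W = integrate(events, n=3, T=1.0)
```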

Routing matrix P

The routing matrix P is constructed to encapsulate the information on i) the organization of nodes in nontrivial, hierarchical and overlapping communities; ii) the heterogeneous involvement of nodes in their communities, characterized through the presence of sources and drains; and iii) the existence of an irreducible backbone.

In order to distinguish the contribution of the community structure from that of the backbone to the link generation process, we define matrix P as a convex combination of two n×n stochastic matrices C and R. The convex combination is weighted by an n-dimensional nonnegative (entry-wise) vector λ∈[0,1]ⁿ, as

$$ P_{ij}=\lambda_{i}C_{ij}+(1-\lambda_{i})R_{ij},\qquad i,j\in \mathcal V, $$

(5)

where matrix C, named community matrix, encodes the information on the role of the community-based mechanism in the link generation process, while matrix R encodes the role of the backbone in the process and is called backbone matrix. Toward a compact formalization of the model, the community-free mechanism is considered as a special case of the community-based mechanism, through the inclusion into matrix C of the special all-to-all community, as detailed in “Community matrix C” section. In the following, unless specified differently, the term community-based will refer to both the community-based and the community-free mechanisms, thus leaving aside only the backbone mechanism. Specifically, entry C_ij is the probability that node i connects to j as a consequence of the community structure (including the all-to-all community), while entry R_ij is the probability that such a connection is generated as a consequence of a dyadic relationship (backbone) between i and j. The entry λ_i weights the two mechanisms by quantifying the strength of the community-based mechanism in the process of link formation from node i. In general, the entries of vector λ are nonuniform, to capture the heterogeneous influence of the two mechanisms for different nodes. The limit case λ_i=0 represents the scenario in which the community structure has no influence on the link generation process from node i and all its links are caused by the presence of the irreducible backbone, while the case λ_i=1 models the case in which node i wires its connections only driven by the community structure.

The mechanism governed by the convex combination in (5), illustrated in Fig. 2, has an immediate probabilistic interpretation. When node i activates, connections are driven by the community structure with probability λ_i, whereas they are driven by the backbone mechanism with probability 1−λ_i. In the following, we will detail the construction principles of matrices C and R.
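
A minimal sketch of the row-wise convex combination in (5) is given below. The matrices are small hypothetical examples and `routing_matrix` is an illustrative name; since each row of C and R sums to one, each row of the result does too.

```python
def routing_matrix(lam, C, R):
    """Eq. (5): P_ij = lam_i * C_ij + (1 - lam_i) * R_ij, row by row."""
    n = len(lam)
    return [[lam[i] * C[i][j] + (1.0 - lam[i]) * R[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical example: node 0 is purely community-driven, node 2 purely
# backbone-driven, and node 1 mixes the two mechanisms equally.
lam = [1.0, 0.5, 0.0]
C = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
R = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
P = routing_matrix(lam, C, R)
```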

Community matrix C

Here, matrix C is designed to model a time-invariant community structure. This simplifying assumption is reflected in many real-world systems, where the pace of evolution of the community structure is much slower than the link generation process, as in (Bao and Michailidis 2018). Different scenarios, where the community structure evolves in time, can be found in Rossetti and Cazabet (2018).

Given a time-invariant set of k≥0 nontrivial communities, we label them with positive integer numbers h∈{1,…,k}. Trivial communities are the empty set, singletons, and the whole node set $\mathcal V$. To model the community-free mechanism, we add an all-to-all trivial community that coincides with the whole system $\mathcal V$, labeled by index 0. Hence, the community set $\mathcal K=\{0,\dots,k\}$ comprises the trivial all-to-all community 0 and k nontrivial communities. Considering the hth community, we denote by $\mathcal V_{h}\subseteq \mathcal V$ the set of nodes that belong to it, while $n_{h}:=\left |\mathcal V_{h}\right |$ is its cardinality. On the other hand, considering the generic node $i\in \mathcal V$, we denote by $\mathcal C_{i}:=\{h:i\in \mathcal V_{h}\}$ the set of communities to which node i belongs, and by $c_{i}:=\left |\mathcal C_{i}\right |$ its cardinality.
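
The notation above maps directly onto simple set operations. In the sketch below, the overlapping community structure, the 0-based node labels, and the helper name `node_memberships` are illustrative assumptions; index 0 is reserved for the all-to-all community, as in the text.

```python
def node_memberships(members, n):
    """members[h] is the node set V_h; returns the list of membership sets C_i."""
    return [{h for h, V_h in enumerate(members) if i in V_h} for i in range(n)]

n = 4
# Community 0 is the whole node set; communities 1 and 2 overlap on node 1.
members = [{0, 1, 2, 3}, {0, 1}, {1, 2, 3}]
sizes = [len(V_h) for V_h in members]   # n_h
C_sets = node_memberships(members, n)   # C_i
c = [len(C_i) for C_i in C_sets]        # c_i
```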

We observe that the rADN paradigm allows each network node to belong to an arbitrary number of communities. This encompasses and generalizes the paradigm of modular ADNs (Nadini et al. 2018b), where each node belongs to exactly one nontrivial community. The heterogeneous attitude of nodes in their different communities (Palla et al. 2007) is modeled by defining a stochastic rectangular matrix $Q\in [0,1]^{n\times (k+1)}$, named community strength matrix, such that Q_ih>0 if and only if $i\in \mathcal V_{h}$. The entry Q_ih is the probability that a link from node i that is caused by a community-based mechanism is wired within the hth community. The entries of matrix Q can be thus interpreted as the importance that each node gives to each of the communities it belongs to. If Q_ih is large, node i acts as a source in community h, generating many intra-community links. On the other hand, if Q_ih is small, then node i will act as a drain, mostly receiving connections from other members. In this perspective, matrix Q quantifies the strength of the active involvement of nodes in each of their communities.

Formally, we define the community matrix C, entry-wise, as

$$ C_{ij}=\left\{\begin{array}{ll}0&\text{if}\ i=j\\ \sum\limits_{h\in \mathcal C_{j}}Q_{ih}\frac{1}{n_{h}-1}\qquad &\text{otherwise.}\end{array}\right. $$

(6)

Figure 3 illustrates the mechanisms that govern the formation of the community matrix C. Specifically, when node i generates a connection following the community-based mechanism, first, it randomly selects a community $h\in \mathcal K$ to which it belongs, according to the probabilities in the ith row of matrix Q. Then, it connects to a node j chosen uniformly at random among the n_h−1 nodes of the hth community (excluding node i).
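
The construction in (6) can be sketched as follows. The community structure and the entries of Q are hypothetical, `community_matrix` is an illustrative name, and every community is assumed to contain at least two nodes so that n_h−1 is positive.

```python
def community_matrix(Q, members):
    """Eq. (6): C_ij sums Q_ih / (n_h - 1) over the communities h containing j."""
    n = len(Q)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                C[i][j] = sum(Q[i][h] / (len(V_h) - 1)
                              for h, V_h in enumerate(members) if j in V_h)
    return C

# Hypothetical structure: community 0 is all-to-all, communities 1 and 2 overlap.
members = [{0, 1, 2, 3}, {0, 1}, {1, 2, 3}]
# Row i of Q is positive exactly on the communities that contain node i.
Q = [[0.4, 0.6, 0.0],
     [0.2, 0.3, 0.5],
     [0.5, 0.0, 0.5],
     [0.9, 0.0, 0.1]]
C = community_matrix(Q, members)
```

Since Q_ih vanishes whenever node i is outside community h, each row of the resulting matrix sums to one, so C is stochastic with a zero diagonal.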

Backbone matrix R

Like the community structure, the irreducible backbone is typically fixed in time or evolves much more slowly than the link formation process (Onnela et al. 2007; Gemmetto et al. 2017). Hence, here we hypothesize that it is constant for time-windows of reasonable duration.

The backbone is thus modeled by a time-invariant graph $G_{R}=(\mathcal V,\mathcal E_{R})$ and by a stochastic matrix $R\in [0,1]^{n\times n}$, such that R_ij>0 if and only if $(i,j)\in \mathcal E_{R}$. The entry R_ij measures the strength of the dyadic relationship between node i and node j in a probabilistic framework. In many real-world scenarios of social and biological systems, it is reasonable to assume matrix R to be sparse, as observed from many empirical data sources (Newman 2003; Ballerini et al. 2008).

We remark that, in the limit case where node i is not influenced by the backbone mechanism, i.e., λ_i=1, the entries of the ith row of the backbone matrix R have no influence on the link formation mechanism. Without any loss in generality, in this case we set the corresponding rows of matrix R equal to the corresponding rows of an n×n identity matrix, i.e., we set R_ii=1 and R_ij=0, for any j≠i.

The set of parameters that characterize a rADN is summarized in Table 1. To recapitulate, taking into account the construction mechanism of matrix P, we detail the algorithm that summarizes the evolution of a rADN as follows:

1. at t=0, the link set is set as empty ($\mathcal E(0)=\emptyset $) and a Poisson clock (Bailey 1990) with rate a_i (each one independent of the others) is initialized for each node $i\in \mathcal V$;
2. if at time t the clock associated with node i clicks, then node i activates and randomly chooses whether the connection is generated a) by the community-based mechanism (with probability λ_i), or b) by the backbone (with probability 1−λ_i). Then, depending on the previous choice, either a) or b) occurs, where
   - a) node i randomly selects a community $h\in \mathcal K$. Specifically, community h is selected with probability Q_ih. Then, node j is chosen uniformly at random among the n_h−1 nodes of the hth community excluding node i; or
   - b) node i randomly selects a node $j\in \mathcal V$. Specifically, node j is selected with probability R_ij;
3. the directed link (i,j) is instantaneously added to the link set $\mathcal E(t)$; and
4. link (i,j) is immediately removed from the link set, the Poisson clock associated with node i is re-initialized, and the algorithm is resumed to item 2.

Table 1 Parameters that characterize a rADN model
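
Item 2 of the algorithm above, i.e., the two-stage choice of the target node, can be sketched as follows. The function `sample_target`, its returned mechanism label, and the toy parameters are assumptions introduced for illustration.

```python
import random

def sample_target(i, lam, Q, members, R, rng):
    """Draw the target of a link generated by node i and report the mechanism used."""
    if rng.random() < lam[i]:
        # a) community-based mechanism: pick community h with probability Q_ih,
        #    then a node uniformly at random among its other n_h - 1 members.
        h = rng.choices(range(len(members)), weights=Q[i])[0]
        return rng.choice(sorted(members[h] - {i})), "community"
    # b) backbone mechanism: pick node j with probability R_ij.
    j = rng.choices(range(len(R)), weights=R[i])[0]
    return j, "backbone"

# Hypothetical parameters: community 0 is all-to-all, community 1 is {0, 1}.
lam = [1.0, 0.0, 0.5]
members = [{0, 1, 2}, {0, 1}]
Q = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
R = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
rng = random.Random(1)
draws = {i: [sample_target(i, lam, Q, members, R, rng) for _ in range(200)]
         for i in range(3)}
```

With these values, node 0 always uses the community-based mechanism, node 1 always follows its single backbone link toward node 2, and node 2 alternates between the two.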

We observe that our extended rADN modeling framework actually encompasses many variants of standard ADNs proposed in the recent literature. Some relevant examples are presented in the following.

Standard ADNs

Standard continuous-time ADNs (Zino et al. 2016) are obtained by setting $\lambda _{i}=1, i\in \mathcal V$, and $\mathcal K=\{0\}$. This choice yields P_ii=0, for any $i\in \mathcal V$, and

$$ P_{ij}=C_{ij}=\frac{Q_{i0}}{n-1}=\frac{1}{n-1},\qquad i\in\mathcal V,\,j\in\mathcal V\smallsetminus\{i\}. $$

(7)

Modular ADNs

Modular ADNs (Nadini et al. 2018b) can be derived as a particular case of rADN with $\lambda _{i}=1,\,\forall \,i\in \mathcal V$. Here, the set of communities $\mathcal K$ defines a partition of the node set where each node belongs to one and only one community. The notation used in the original incarnation defined in Nadini et al. (2018b) can be retrieved by setting $Q_{i0}=1-\mu, Q_{ih}=\mu, \forall \,i\in \mathcal V_{h}$. In this case, (6) reads P_ii=0, $\forall \,i\in \mathcal V$, and, for any j≠i,

$$ P_{ij}=C_{ij}=\left\{\begin{array}{ll}\frac{1-\mu}{n-1}+\frac{\mu}{n_{h}-1}&\text{if}\ i\in\mathcal V_{h},\,j\in\mathcal V_{h},\\\frac{1-\mu}{n-1}&\text{if}\ i\in\mathcal V_{h},\,j\notin\mathcal V_{h}.\end{array}\right. $$

(8)
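
The special case in (8) can be checked numerically. In this sketch, `modular_routing`, the partition, and the value of μ are illustrative assumptions; setting μ=0 recovers the uniform routing of standard ADNs in (7).

```python
def modular_routing(n, blocks, mu):
    """Eq. (8): routing matrix of a modular ADN; blocks is a partition of range(n)."""
    block_of = {i: V_h for V_h in blocks for i in V_h}
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Community-free contribution, shared by all targets.
            P[i][j] = (1.0 - mu) / (n - 1)
            # Extra contribution for targets in the same community as node i.
            if j in block_of[i]:
                P[i][j] += mu / (len(block_of[i]) - 1)
    return P

# Hypothetical partition of five nodes into two communities.
blocks = [{0, 1, 2}, {3, 4}]
P = modular_routing(5, blocks, mu=0.8)
P_uniform = modular_routing(5, blocks, mu=0.0)
```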

ADNs with attractiveness

Attractiveness has been added to ADNs to model the heterogeneous propensity of nodes to receive connections (Pozzana et al. 2017; Alessandretti et al. 2017). Specifically, given an attractivity vector b>0 (entry-wise), the probability that a node generates a link to node $j\in \mathcal V$ is proportional to b_j. In the framework of rADNs, this feature can be modeled by setting $\lambda _{i}=0,\,\forall \,i\in \mathcal V$, and all the entries of the backbone matrix R as

$$ R_{ij}=\frac{b_{j}}{\sum\limits_{k\in\mathcal V\smallsetminus\{ i\}}b_{k}},\qquad\forall\,i,j\in\mathcal V. $$

(9)
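
A small sketch of (9) follows. It assumes that the diagonal entries R_ii are set to zero, which is what makes each row sum to one given that the normalization excludes node i; the function name and the attractiveness values are hypothetical.

```python
def attractiveness_backbone(b):
    """Eq. (9): R_ij proportional to b_j, normalized over all nodes except i."""
    n = len(b)
    R = []
    for i in range(n):
        norm = sum(b[k] for k in range(n) if k != i)
        # Assumption: no self-loops, so the diagonal entry is zero.
        R.append([0.0 if j == i else b[j] / norm for j in range(n)])
    return R

# Hypothetical attractiveness vector.
b = [1.0, 2.0, 3.0]
R = attractiveness_backbone(b)
```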

Estimation of the model parameters from empirical data

Here, we develop a technique to identify the model parameters. Specifically, we estimate the activity rate vector a, the weight vector λ, the community strength matrix Q, and the backbone matrix R. The objective of the technique presented in this section is to devise a procedure to deepen the characterization of communities, shedding light on the diverse role of their members and their role in the link formation process.

Parameter identification procedure

As stated in the introduction, we preliminarily use an existing community detection algorithm, feeding it with the temporal sequence of the link formation, and obtaining as output i) the realization of an integrated version of the temporal network over a time-window of duration T, whose information is gathered in the weighted adjacency matrix W; and ii) the community set $\mathcal K$ and the sets $\mathcal V_{h}$, $h\in \mathcal K$, obtained as the output of the community detection algorithm. According to (Perra et al. 2012), the activity rate vector can be estimated as

$$ \hat a_{i} = \sum_{j\in\mathcal V} \frac{W_{ij}}{T},\qquad i\in\mathcal V. $$

(10)
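
The estimator in (10) amounts to dividing the out-strength of each node in the integrated network by the window length. A minimal sketch, reusing the matrix in (4) with T=1 as input (the function name is an assumption):

```python
def estimate_activity(W, T):
    """Eq. (10): estimated activity rate = number of generated links per unit time."""
    return [sum(row) / T for row in W]

# Integrated network of Eq. (4), observed over a window of duration T = 1.
W = [[0, 2, 2],
     [1, 0, 1],
     [0, 1, 0]]
a_hat = estimate_activity(W, T=1.0)
```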

The expected number of occurrences of the link (i,j), denoted by $\bar W_{ij}$, is computed following (3), and it is equal to

$$ \bar W_{ij}=\hat a_{i}T\sum_{h \in \mathcal{C}_{i}} \lambda_{i} Q_{ih} \frac{1}{n_{h}-1}+ \hat a_{i}T(1 - \lambda_{i}) R_{ij}. $$

(11)
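
The expected counts and the corresponding residuals can be sketched as below. The sketch evaluates the community term through the community matrix C defined in (6), so that the expected count equals the estimated activity times T times P_ij, in line with (1), (3), and (5). All numerical values other than the matrix in (4) are hypothetical, and the function names are illustrative.

```python
def expected_counts(a_hat, T, lam, C, R):
    """Expected link counts: a_hat_i * T * (lam_i * C_ij + (1 - lam_i) * R_ij)."""
    n = len(a_hat)
    return [[a_hat[i] * T * (lam[i] * C[i][j] + (1.0 - lam[i]) * R[i][j])
             for j in range(n)] for i in range(n)]

def residuals(W, W_bar):
    """Entry-wise difference between observed and expected counts."""
    return [[w - wb for w, wb in zip(row, row_bar)] for row, row_bar in zip(W, W_bar)]

# Observed counts from Eq. (4), with T = 1 and activities estimated via Eq. (10).
W = [[0, 2, 2], [1, 0, 1], [0, 1, 0]]
a_hat = [4.0, 2.0, 1.0]
# Hypothetical model parameters.
lam = [0.5, 1.0, 0.0]
C = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
R = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_bar = expected_counts(a_hat, 1.0, lam, C, R)
eps = residuals(W, W_bar)
```

In this toy example, the second and third rows are reproduced exactly, while the first row leaves residuals of −1 and +1 on its two links.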

In order to estimate the other model parameters, i.e., the community weight vector λ, the community strength matrix Q, and the backbone matrix R, we formulate the identification problem in terms of a constrained optimization program. We observe that the identification problem is naturally underdetermined. In fact, the observed data consists of an n×n matrix, while the set of parameters to be estimated comprises an n×n matrix, an n×(k+1) matrix, and an n-dimensional vector. Hence, except for unlikely particular cases, the number of parameters to be estimated exceeds the number of equations that can be written using the available data. To address this issue, we introduce a free parameter vector γ∈[0,1]ⁿ, named community belief, which measures our belief in the prominence of the role of the community-based mechanism in the link formation process. Tuning this parameter vector is the most delicate task in the application of our method. In “Confidence interval for the community belief parameter” section, we put forward a statistical procedure to assess a confidence interval for such a parameter vector. In particular, we identify the largest value for γ that is compatible with the observed data, which yields the characterization of the system with the highest belief in the community-based link formation mechanism. Such a model is often preferred, as it leads to a characterization at a mesoscopic level, whereby the system characteristics are captured with a good detail and an intermediate granularity, which ensures a good description of the system behavior without incurring the issues related to a microscopic, node-based representation. However, in “Validation on synthetic networks” section we show that when the dataset has a small size, smaller values of the parameter γ within the prescribed confidence interval may be more suitable to describe the system without overfitting the community structure.

The identification problem is formalized by writing a set of n disjoint minimization problems, one for each node $i\in \mathcal V$. Specifically, for node $i\in \mathcal V$, we want to minimize the function

$$ f(\varepsilon_{i\bullet},Q_{i\bullet},R_{i\bullet},\lambda_{i})=(1 - \gamma_{i})\sum_{j=1}^{n} \varepsilon_{ij}^{2} + \gamma_{i} (1 - \lambda_{i}) R_{ij}, $$

(12)

with respect to variable λ_i and the entries of the ith row of matrices ε, Q and R, written in compact form as ε_i∙, Q_i∙, and R_i∙, respectively. The minimization problem in (12) is subject to several constraints: we require that the number of occurrences of the link (i,j), i.e., W_ij, is equal to its expected value $\bar W_{ij}$, computed according to (11), up to some natural statistical fluctuation, modeled by the residual ε_ij; we also require the matrices Q and R to be stochastic, and the variable λ_i to be nonnegative and not greater than 1. These constraints are gathered as follows:

$$ \left\{ \begin{array}{ll} \hat a_{i}T\sum\limits_{h \in \mathcal{C}_{i}} \lambda_{i} Q_{ih} \frac{1}{n_{h}-1}+ \hat a_{i}T(1 - \lambda_{i}) R_{ij} + \varepsilon_{ij} = W_{ij}& \forall\,j\in\{1,\dots,n\}, \\ \sum\limits_{h \in \mathcal{C}_{i}} Q_{ih} = 1,&\\ \sum\limits_{j=1}^{n} R_{ij} = 1,&\\ 0\leq Q_{ih}\leq 1, & \forall \, h \in \mathcal{C}_{i},\\ 0\leq R_{ij}\leq 1, & \forall\,j \in\{1, \dots, n\},\\ 0\leq \lambda_{i} \leq 1.&\end{array}\right. $$

(13)

We observe that the objective function in (12) consists of the sum of two terms: the first summand is the sum of the squared residuals, whose minimization allows for obtaining a model that is compatible with the observed data; the second summand is a cost related to the contributions of the backbone-based mechanism. In the absence of the second term, a trivial solution would be λ_i=0 and $R_{ij}=W_{ij}/(\hat a_{i}T)$, that is, the whole link formation process is explained in terms of dyadic relationships between nodes. However, this is often not consistent with the empirical observation of a sparse backbone in systems of different nature (Ballerini et al. 2008; Newman 2003), and it fails to provide a description of the system at a mesoscopic scale.

Although the constraints are nonlinear, the change of variable

$$ \tilde{Q}_{ih} := \lambda_{i}Q_{ih}\qquad\tilde{R}_{ij} := (1 - \lambda_{i}) R_{ij}, $$

(14)

allows us to write (12) as a quadratic programming problem (Boyd and Vandenberghe 2004) with linear constraints, which can be solved with a reasonable computational effort. Specifically, the objective function reads

$$ f\left(\varepsilon_{i\bullet},\tilde Q_{i\bullet},\tilde R_{i\bullet},\lambda_{i}\right)=(1-\gamma_{i})\sum_{j=1}^{n} \varepsilon_{ij}^{2} + \gamma_{i}\tilde R_{ij}, $$

(15)

subject to the following constraints:

$$ \left\{\begin{array}{ll} \hat a_{i} T \sum\limits_{h \in \mathcal{C}_{i}}\tilde Q_{ih} \frac{1}{n_{h}-1}+ \hat a_{i} T \tilde R_{ij} + \varepsilon_{ij} = W_{ij}\quad& \forall\,j\in\{1,\dots,n\}, \\ \sum\limits_{h \in \mathcal{C}_{i}} \tilde Q_{ih}+\sum\limits_{j=1}^{n} \tilde R_{ij} = 1,&\\ 0\leq \tilde Q_{ih}\leq 1, & \forall \, h \in \mathcal{C}_{i},\\ 0\leq \tilde R_{ij}\leq 1, & \forall\,j \in\{1, \dots, n\}. \end{array}\right. $$

(16)
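Once the problem is solved in the transformed variables, the original parameters follow by inverting (14) and using the stochasticity of Q and R required in (13). The following is a minimal sketch of that inversion for a single node i; the function name `recover_parameters`, the dictionary/list data layout, and the handling of the degenerate cases λ_i=0 and λ_i=1 (where one of the rows is not identifiable and is returned as zeros) are assumptions made for illustration.

```python
def recover_parameters(q_tilde, r_tilde, tol=1e-12):
    """Invert the change of variable (14) for one node i.

    q_tilde: dict mapping community h -> Q~_ih (hypothetical layout)
    r_tilde: list of R~_ij for j = 1..n (hypothetical layout)
    """
    # Since sum_h Q_ih = 1, summing Q~_ih = lambda_i * Q_ih gives lambda_i.
    lam = sum(q_tilde.values())
    # Since sum_j R_ij = 1, summing R~_ij gives 1 - lambda_i.
    backbone_weight = sum(r_tilde)
    # Assumption: rows that are not identifiable are returned as zeros.
    q = {h: (v / lam if lam > tol else 0.0) for h, v in q_tilde.items()}
    r = [(v / backbone_weight if backbone_weight > tol else 0.0) for v in r_tilde]
    return lam, q, r
```

For instance, with $\tilde Q_{i\bullet}=(0.2,0.4)$ and $\tilde R_{i\bullet}=(0,0.3,0.1)$, the sketch returns λ_i=0.6, Q_i∙=(1/3,2/3), and R_i∙=(0,0.75,0.25).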

The computational complexity of the parameter identification method proposed here can be estimated as a function of the number n of nodes in $\mathcal V$, the number m of nonzero entries of the weighted adjacency matrix W, and the number k of communities. Specifically, we obtain that the computational complexity is equal to O(n²+nm+nk). Since it often holds that k≪n, we conclude that, for sparse integrated networks, where m=O(n), the computational complexity is O(n²), while for dense networks, where m=O(n²), it is O(n³).

Confidence interval for the community belief parameter

In this section, we develop a statistical method to identify a range of values for the community belief vector γ, compatible with the natural statistical fluctuations. Since each of the components of the vector γ is derived through a distinct minimization problem, here we focus on the generic ith component, independently of the others. In the previous section, we observed that, for γ_i=0, problem (15) admits the trivial solution in which the whole link formation process from node i is explained in terms of the backbone. Then, when γ_i grows, the role of communities in the process of link generation gains more importance. When γ_i is too large, however, the role of communities in the link formation process might be overestimated and the models obtained for these values of the community belief parameter are not statistically compatible with the data observed. Here, we put forward a technique to identify the largest value of the community belief γ_i such that the parameters of the rADN model identified for that value of γ_i are compatible with the available data.

Fixing a value of the parameter γ_i, the minimization problem (15) can be solved by means of a quadratic programming solver. We denote the corresponding solution as $Q^{(\gamma_{i})}$, $R^{(\gamma_{i})}$, and $\lambda^{(\gamma_{i})}$. Using the parameters estimated with this solution, and the expression for the probability of link formation in rADN in (3), we can determine the distribution of the ith row of the weighted adjacency matrix W. We denote such a row as $W_{i\bullet}^{(\gamma_{i})}$, to stress its dependence on the choice of the parameter γ_i. Specifically, each row entry $W_{ij}^{(\gamma_{i})}$ is a Poisson random variable, independent of the others, with expected value

$$\bar W^{(\gamma_{i})}_{ij}=\hat a_{i}T\left[\lambda_{i}^{(\gamma_{i})}\sum_{h\in\mathcal C_{i}}\frac{Q_{ih}^{(\gamma_{i})}}{n_{h}-1}+\left(1-\lambda_{i}^{(\gamma_{i})}\right)R_{ij}^{(\gamma_{i})}\right]. $$

(17)

Hence, the likelihood that the ith row of the observed weighted adjacency matrix W is a realization of the random variable $W^{(\gamma_{i})}$ is given by

$$ \mathcal{L}\left(\bar{W}^{(\gamma_{i})} | W\right) = \prod_{j=1}^{n} \frac{ \left(\bar W^{(\gamma_{i})}_{ij}\right)^{W_{ij}} e^{-\bar{W}^{(\gamma_{i})}_{ij}}}{W_{ij}!}, $$

(18)

which is the probability that the realizations of the n independent Poisson-distributed random variables with expected values computed according to (17) coincide with the ith row of the observed matrix W. Since the product of small probabilities is numerically unstable, it is convenient to work with the log-likelihood (Boyd and Vandenberghe 2004) instead of (18), which is

$$ \log \mathcal{L}\left(\bar{W}^{(\gamma_{i})} |W\right) = \sum_{j=1}^{n} \log \left(\frac{\left(\bar W^{(\gamma_{i})}_{ij}\right)^{W_{ij}} e^{-\bar{W}^{(\gamma_{i})}_{ij}}}{W_{ij}!} \right). $$

(19)
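As an illustration, the log-likelihood in (19) can be evaluated term by term without ever forming the product in (18). The sketch below is not part of the original method description: the function name and the list-based inputs are assumptions, and log(W_ij!) is computed through the log-gamma function to avoid overflow for large counts. Entries with zero expected value (such as the diagonal entry, since self-loops are absent) contribute zero when the observed count is also zero.

```python
import math


def poisson_log_likelihood(w_row, w_bar_row):
    """Log-likelihood (19) of one observed row under independent Poisson entries.

    w_row: observed counts W_ij (non-negative integers)
    w_bar_row: expected values for the same entries
    """
    total = 0.0
    for w, mu in zip(w_row, w_bar_row):
        if mu > 0.0:
            # log( mu^w * exp(-mu) / w! ), with log(w!) = lgamma(w + 1)
            total += w * math.log(mu) - mu - math.lgamma(w + 1)
        elif w > 0:
            # A positive count with zero expected value has probability zero.
            return float("-inf")
        # mu == 0 and w == 0: the entry has probability one, log(1) = 0.
    return total
```

The statistic D in the likelihood ratio test is then twice the difference between two such values, computed for the expected values obtained with γ_i=0 and with the tested γ_i.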

Unsurprisingly, the value γ_i that maximizes the log-likelihood function is γ_i=0, which yields the scenario where communities have no role in the link formation process and all the connections are explained at the microscopic scale of the irreducible backbone. This scenario is the result of data overfitting, which is a well known problem of the methods based on maximum likelihood (Bishop 2006). Specifically, in this case dyadic relationships are overfitted. We observe that this extreme scenario is not particularly interesting in our framework. In fact, the interest in community-based modeling and community detection algorithms is to describe the system at the mesoscopic level, that is, at a level higher than the individual, microscopic one. This implies that the information encapsulated by a usually high number of microscopic parameters is compressed in a much smaller number of mesoscopic parameters, that is, those related to communities. For this reason, we are interested in models that do include communities, that is, with γ greater than zero. In particular, we are interested in finding the model with the largest value of γ_i that produces an rADN model that is compatible with the available data observed in the weighted adjacency matrix W. To this aim, we perform a Likelihood Ratio (LR) test (Casella and Berger 2002). The LR test determines whether the null hypothesis that the observed ith row of the weighted adjacency matrix W_i∙ is obtained from a vector of independent Poisson variables with expected values $\bar W^{(\gamma _{i})}_{ij}$ from (17), for j=1,…,n, should be rejected. Specifically, fixing a significance coverage α∈[0,1], the LR test rejects the null hypothesis if the statistic

$$ D:= 2 \left(\log \mathcal{L}\left(\bar{W}^{(0)}|W\right) - \log \mathcal{L}\left(\bar{W}^{(\gamma_{i})} |W\right) \right) $$

(20)

is greater than a threshold $q_{1-\alpha}$, where $q_{1-\alpha}$ is the (1−α)-quantile of a chi-squared distribution with n−1 degrees of freedom (Casella and Berger 2002). We remark that the reduction of the degrees of freedom of the distribution from n to n−1 is due to the absence of self-loops.

It is worth noticing that the statistic D is a monotonically increasing function of γ_i. Hence, fixing a significance coverage α∈[0,1], the LR test ultimately identifies a threshold $\bar \gamma _{i}$, equal to the value for which the statistic D is equal to the (1−α)-quantile of a chi-squared distribution with n−1 degrees of freedom. All the values of $\gamma _{i}>\bar \gamma _{i}$ are rejected. Hence, our procedure establishes a confidence interval for the parameter γ_i of the form $\gamma _{i}\in [0, \bar \gamma _{i}]$. Unfortunately, the value $\bar \gamma _{i}$ that produces the significance coverage α cannot be derived analytically. However, since the statistic D is monotone in γ_i, several numerical methods can efficiently retrieve a good approximation of the threshold $\bar \gamma _{i}$ (Hammings 1973).
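One possible numerical scheme is a plain bisection, sketched below under stated assumptions. The callable `lr_statistic` is hypothetical: it is assumed to solve problem (15)-(16) for a given γ_i and return the statistic D in (20), and it is assumed to be nondecreasing. The search range [0,1] for γ_i is also an assumption, motivated by the weights (1−γ_i) and γ_i in (15). The threshold is passed in as a number; it could be obtained, for example, with `scipy.stats.chi2.ppf(1 - alpha, n - 1)`.

```python
def largest_compatible_gamma(lr_statistic, threshold, tol=1e-6):
    """Approximate the largest gamma_i in [0, 1] with D(gamma_i) <= threshold.

    lr_statistic: hypothetical callable gamma -> D, assumed nondecreasing
    threshold: the (1 - alpha)-quantile of the chi-squared distribution
    """
    lo, hi = 0.0, 1.0
    if lr_statistic(hi) <= threshold:
        return hi  # every value in the range is compatible with the data
    # Invariant: D(lo) <= threshold < D(hi); D(0) = 0 by definition (20).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lr_statistic(mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

Each bisection step requires one solution of the quadratic program, so the number of solver calls grows only logarithmically with the inverse of the tolerance.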

The technique described above identifies a range for the ith component γ_i of the community belief vector γ that is compatible with the empirical data. Implementing this procedure for all the nodes $i\in \mathcal V$, we obtain a confidence interval for the whole parameter vector γ, which is the n-dimensional hyperrectangle $\gamma \in [0,\bar \gamma _{1}]\times \dots \times [0,\bar \gamma _{n}]$. This yields a set of mathematical models that are compatible with the observed data. The parameter vector may be tuned within this hyperrectangle, depending on the user’s belief in the community structure, on the amount of data available (as we will discuss in “Validation on synthetic networks” section) and, possibly, on additional information available on the systems such as historical data, or measurements on similar systems. We refer to the model obtained with $\gamma _{i} = \bar \gamma _{i}$, for all $i\in \mathcal V$, as the model with the largest belief in the community structure, among those compatible with the observed data. The flow chart in Fig. 4 summarizes the whole procedure of our parameter identification method, from the data consisting of a sequence of temporal links, to the definition of an rADN model.

Validation on synthetic networks

In this section, we validate our procedure over three different benchmarks of temporal networks. In the first benchmark, described in “Exclusive heterogeneous communities” section, nodes are partitioned into six exclusive communities of different size, so that each node belongs to exactly one community (and to the trivial, all-to-all one). A time-invariant, irreducible backbone is also present. In the proposed benchmark, nodes present a high level of heterogeneity, both in their global activity and in their involvement in their community. The analysis of this benchmark suggests that our parameter identification method is able to identify the model parameters in the presence of heterogeneity in the network structure. In “Hierarchical communities” section, we propose a second benchmark where communities present a hierarchical structure. The third benchmark, characterized by overlapping communities, is discussed in “Overlapping communities” section. In these cases too, we successfully perform the parameter identification by means of the method proposed in this work.

Exclusive heterogeneous communities

We generate a network with an exclusive community structure, where each node belongs to one and only one nontrivial community. We partition n=100 nodes into k=6 nontrivial communities with heterogeneous size, as shown in Table 2. Thus, the community set $\mathcal K$ and the sets $\mathcal V_{h}, h\in \mathcal K$ are known.

Table 2 Benchmark with exclusive community structures

The community strength matrix Q is randomly constructed, in order to model heterogeneity in the nodes’ attitude to generate inter-community links. Specifically, the entries of the first column of Q, which represent the nodes’ involvement in the all-to-all community, are selected from independent beta distributions with mean 0.25 and variance 0.02. We remark that the selection of the entries from a beta distribution ensures Q_i0∈[0,1], for any $i\in \mathcal V$ (Ross 2009). Matrix Q is fully determined by its first column. In fact, each row of the matrix Q has only two nonzero entries (since each node belongs to a unique nontrivial community) and the matrix is stochastic. Hence, the other nonzero entry of the generic ith row is equal to 1−Q_i0. Matrix Q obtained according to this procedure is illustrated in Fig. 5a.
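A beta distribution with prescribed mean μ and variance σ² has shape parameters α=μc and β=(1−μ)c, with c=μ(1−μ)/σ²−1; for μ=0.25 and σ²=0.02 this gives α≈2.094 and β≈6.281. The sketch below illustrates how rows of Q could be sampled for the exclusive benchmark. The function names, the sparse dictionary layout, and the convention that label 0 denotes the all-to-all community are assumptions made for illustration, not the code used to produce the benchmark.

```python
import random


def beta_shape_parameters(mean, variance):
    """Method-of-moments shape parameters of a beta distribution."""
    common = mean * (1.0 - mean) / variance - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for the given mean")
    return mean * common, (1.0 - mean) * common


def sample_exclusive_q_rows(membership, mean=0.25, variance=0.02, seed=0):
    """Sample one sparse row of Q per node.

    membership: list with the nontrivial community label (1, 2, ...) of each
    node; label 0 is reserved here for the all-to-all community (assumption).
    """
    rng = random.Random(seed)
    alpha, beta = beta_shape_parameters(mean, variance)
    rows = []
    for community in membership:
        q_i0 = rng.betavariate(alpha, beta)  # involvement in all-to-all community
        rows.append({0: q_i0, community: 1.0 - q_i0})  # row sums to one
    return rows
```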

The backbone matrix R is defined as follows. First, we construct the graph $G_{R}=(\mathcal V,\mathcal E_{R})$, corresponding to the backbone, according to an Erdős-Rényi random graph model (Erdős and Rényi 1959) with parameter p=4/99. Such a choice of the parameter p produces a network with average degree equal to 4. Specifically, each link (i,j) belongs to $\mathcal E_{R}$ with probability p, independently of the others. Then, the entries R_ij, for $(i,j)\in \mathcal E_{R}$, are assigned uniformly at random, such that each row sums to 1, while all the other entries of the row are set to 0. In the extreme case in which node $i\in \mathcal V$ has no outgoing links, we set the diagonal entry R_ii=1, all other entries equal to 0, and the corresponding λ_i=1. The other entries of the vector λ (i.e., those corresponding to nodes i that have at least one outgoing link in the backbone) are realizations of independent beta-distributed random variables with mean 0.71 and variance 0.01. Finally, the activity potentials are selected as realizations of independent and identically power-law distributed random variables with exponent equal to −2.5 and lower cut-off a_min=0.01. The resulting routing matrix P is represented through a color-coded graph in Fig. 5b.
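The construction of the backbone can be sketched as follows. This is an illustrative reading of the procedure rather than the original benchmark code: in particular, "assigned uniformly at random, such that each row sums to 1" is interpreted here as drawing independent uniform weights and normalizing them, which is an assumption, as are the function name and the dense list-of-lists layout.

```python
import random


def sample_backbone(n, p, seed=0):
    """Sample a directed Erdos-Renyi backbone and a row-stochastic matrix R.

    Returns R and the list of nodes without outgoing backbone links, for
    which the text prescribes R_ii = 1 and lambda_i = 1.
    """
    rng = random.Random(seed)
    R = [[0.0] * n for _ in range(n)]
    isolated = []
    for i in range(n):
        # Each directed link (i, j), j != i, is present with probability p.
        targets = [j for j in range(n) if j != i and rng.random() < p]
        if not targets:
            R[i][i] = 1.0
            isolated.append(i)
            continue
        # Assumption: uniform weights in (0, 1], normalized to sum to one.
        weights = [1.0 - rng.random() for _ in targets]
        total = sum(weights)
        for j, w in zip(targets, weights):
            R[i][j] = w / total
    return R, isolated
```

With n=100 and p=4/99, each node has on average 99p=4 outgoing backbone links, consistent with the average degree stated above.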

The system is then simulated for a time-window of duration T and the data corresponding to the integrated network are stored in the weighted adjacency matrix W.

Initially, we test the capability of our technique to correctly identify the model parameters when a sufficiently large amount of data is available. Then, we study the performance of our method in a critical scenario in which the system is observed for a short time-window and the data vector of temporal links is much smaller, in order to appreciate the effectiveness of the approach even in this situation.

In our first analysis, we set T=10⁶ and we observe approximately 1,500,000 temporal connections during the time-window. We apply our parameter identification technique by setting the community belief parameter γ_i equal to the maximum value of the confidence interval identified through the procedure proposed in “Confidence interval for the community belief parameter” section, i.e., $\gamma _{i}=\bar \gamma _{i}$, $\forall i\in \mathcal V$. This choice is the one that allows us to explain the largest part of the link generation process in terms of communities, compatibly with the observed data. Figure 6 illustrates the accuracy of our method in the estimation of matrix Q. We observe that the nonzero entries of matrix Q are estimated with a high accuracy, and without any bias related to the size of the communities and the heterogeneity in the parameters. In fact, accuracy does not change over the different communities, associated with different colors in Fig. 6, and it is high both for small and for large values of Q_ij and R_ij. Specifically, the mean square error of the estimated entries of matrix Q varies from 0.011 to 0.020 over the communities, with no statistically significant difference between them, while for the entries of matrix R, it is equal to 0.012. In additional simulations, here omitted for brevity, we have observed that increasing the number of communities k has no significant effect on the performance of our parameter identification method.

In our second analysis, we reduce the size of the data vector, setting T=10⁵ and T=10⁴, and generating approximately 150,000 and 15,000 temporal connections, respectively. Figure 7 reveals that, when the system can be observed only for a short time-window and few temporal links are observed, the choice of the largest value of the community belief γ_i in its confidence interval may lead to an overestimation of the contribution of the communities in the link generation process. We observe from Fig. 7a-b that the entries of vector λ, which weight the contribution of communities in the link generation mechanism, tend to be overestimated for small values of T. In fact, the mean square error of the estimated entries of vector λ increases from 0.020 for T=10⁶, to 0.055 and 0.185, for T=10⁵ and T=10⁴, respectively. We believe that the overestimation of the community weights is a common phenomenon when the number of temporal links is small. To address this issue, we suggest reducing the value of the parameter γ_i, within the confidence interval established in “Confidence interval for the community belief parameter” section. Figure 8 shows that, even with few data available, an accurate estimation of the vector λ and a good estimation of the community strength matrix Q can be obtained by selecting smaller values for the community belief parameter vector. In Fig. 8a we can appreciate an excellent agreement between the estimated community weight vector and the actual one (mean square error equal to 0.003). In Fig. 8b we observe that, even though the accuracy in the estimation of the matrix Q is reduced with respect to the case with large T, there is still a satisfactory agreement between the estimated entries of the community strength matrix and the corresponding real quantities (mean square error equal to 0.091). In our simulations, we select the value for the community belief parameter vector γ by performing a bisection method in the range $\gamma _{i} \in [0,\bar {\gamma }_{i}]$, to minimize the absolute deviation between the estimated vector λ and the original benchmark community weight vector. This confirms our intuition that the best model estimation is within the confidence interval we have assessed. In our example, we observe that, when the temporal link set is reduced to 1% of the original amount (i.e., 15,000 links), the optimal value for γ is found to be on average 18% of the extreme value $\bar \gamma $. In real-world scenarios, where the real values of λ are unknown, the optimal selection of parameter γ remains an open problem, which will be tackled in our future research.

Hierarchical communities

Here, we assess our method on a second benchmark in which communities present a hierarchical structure. Specifically, we define a two-level hierarchy: nodes are first partitioned into two first-level communities, then, each of these communities is split into two second-level communities. Each node thus belongs to a first-level community and to a second-level one, besides the all-to-all community. Details are reported in Table 3.

Table 3 Benchmark with hierarchical community structures

Matrix Q is generated similarly to the previous benchmark. The entries of its first column are selected from a beta distribution with mean 0.25 and variance 0.02, each one independent of the others. The other two nonzero terms of each row are realizations of uniformly distributed random variables, normalized to obtain a stochastic matrix. Matrix Q obtained according to this procedure is illustrated, through color coding, in Fig. 9a. Then, the backbone matrix R, the weight vector λ, and the activity rate vector a are generated following the same procedure described in the previous section.

The system is simulated for a time-window of duration T=10⁶, obtaining approximately 1,500,000 temporal connections, the weighted adjacency matrix W of the integrated network is generated, and our technique is used to estimate the parameters. The results of our analysis, illustrated in Fig. 9b-c, suggest that our method is also able to deal with hierarchical community structures. We observe that we are able to identify the model parameters with a high accuracy and without any bias due to the different levels in the hierarchical community structure. In fact, in Fig. 9c, we observe that there is no significant difference in the accuracy of the estimation of the entries corresponding to the first-level communities (i.e., 1 and 2) and the second-level ones (i.e., 3–6). Specifically, the mean square error of the estimated entries of matrix Q varies over the communities from 0.015 to 0.025 (on average, it is equal to 0.021 for the first-level communities and 0.017 for the second-level ones, with no statistically significant difference between the two quantities). Finally, we observe that, also in this case where hierarchical communities are present, the problem of a reduced size of the data vector can be addressed by reducing the tradeoff parameter vector γ, in order to avoid data overfitting. Results are omitted for brevity.

Overlapping communities

Finally, we consider a benchmark in which communities have an overlapping structure. The n=100 nodes are divided into 7 communities, as detailed in Table 4.

Table 4 Benchmark with overlapping community structures

The community structure presents several overlaps, notably between communities 2, 3, 4, and 7. The community strength matrix Q (illustrated in Fig. 10a), the backbone matrix R, as well as the two vectors λ and a are defined following the procedure presented in the previous benchmarks. Then, the system is simulated for a time-window of duration T=10⁶, generating approximately 1,500,000 temporal connections, and our technique is used to identify the parameters from the weighted adjacency matrix W of the integrated network obtained from our simulations. Also in this scenario, as illustrated in Fig. 10b and c, our method is able to identify the model parameters with high accuracy and is free of any bias due to the presence of overlaps between the communities. In fact, in Fig. 10b we observe that the accuracy in the estimation of the entries of the community weight vector λ is not influenced by whether a node belongs to an overlapping community or not: the average mean square error for the nodes in overlapping communities is equal to 0.019, while for nodes not in overlapping communities it is equal to 0.015, with no statistically significant difference between the two quantities. Similarly, the outcome of the estimation of the community strength matrix Q is not influenced by the presence of overlaps between the communities, as can be observed in Fig. 10c, by comparing nodes that belong to different communities. In fact, the mean square error of the estimated entries of matrix Q varies over the communities from 0.013 to 0.023 (on average, it is equal to 0.015 for the communities with no overlaps and 0.017 for the overlapping ones, with no statistically significant difference between the two quantities). Also in this case, the tradeoff parameter vector γ can be set to a smaller value than the extreme $\bar \gamma $ to avoid overestimation of λ when little data is available, similarly to what was discussed in the case of exclusive communities.

Case studies

Enron email corpus

We use our method to enrich the results of community detection for a real-world case study: the Enron email corpus (Cohen). This is a dataset of more than 500,000 emails sent by the 158 employees of Enron company from 1979-12-31 to 2002-06-21, when the company failed. In order to deal with a uniform dataset, in which the community structure and the irreducible backbone can be assumed to be constant, we restrict the dataset to the portion of mails sent after 1998-11-13. We also remove self-sent emails and nodes that do not send or receive any email. After such a data cleaning procedure, we obtain a dataset with n=143 employees and 108,786 emails, which identify a temporal network where each employee is a node and each email determines a link from the sender to the receiver.
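The data cleaning steps described above can be sketched as follows. The record layout (sender, receiver, date) and the function name are assumptions for illustration; the actual corpus requires parsing that is not shown here. Nodes that neither send nor receive any retained email are dropped automatically, since only nodes appearing in a retained link are listed.

```python
from datetime import date


def integrate_emails(emails, start=date(1998, 11, 13)):
    """Build the integrated directed network from email records.

    emails: iterable of (sender, receiver, sent_on) tuples, with sent_on a
    datetime.date (hypothetical layout).
    Returns the sorted node list and a dict (sender, receiver) -> count,
    i.e., the nonzero entries of the weighted adjacency matrix W.
    """
    counts = {}
    for sender, receiver, sent_on in emails:
        # Keep only emails sent after the start date; drop self-sent emails.
        if sent_on <= start or sender == receiver:
            continue
        counts[(sender, receiver)] = counts.get((sender, receiver), 0) + 1
    nodes = sorted({node for link in counts for node in link})
    return nodes, counts
```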

We then use a community detection algorithm on the integrated network to identify the community structure. We observe that the application of different algorithms for community detection may lead to the identification of different community structures. As stated in the introduction, our method can be used to establish a criterion to discriminate among the outcomes of different community detection algorithms. In fact, for any community structure obtained by means of a different community detection algorithm, we can identify the model parameters using our method, and then compare the average entry of the community weight vector λ over the nodes, namely

$$ <\lambda>:=\frac1n\sum_{i\in\mathcal V}\lambda_{i}. $$

(21)

This quantity measures the fraction of links that can be statistically explained by means of a community-based mechanism. Therefore, the community structure that is able to produce the highest value of <λ> is the one that is able to explain the largest part of the link formation process.
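In code, the selection criterion amounts to computing (21) for each candidate community structure and keeping the one with the largest value. The short sketch below assumes that the vectors λ have already been estimated for each candidate; the function names and the dictionary layout are hypothetical.

```python
def mean_community_weight(lam):
    """Average entry of the community weight vector, as in (21)."""
    return sum(lam) / len(lam)


def best_structure(candidates):
    """Return the label of the candidate with the largest average weight.

    candidates: dict mapping a label (e.g., an algorithm name or a run
    index) to the list of estimated lambda_i for that structure.
    """
    return max(candidates, key=lambda label: mean_community_weight(candidates[label]))
```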

In our case study, we apply three different algorithms to detect the communities from the integrated network: Infomap (Rosvall and Bergstrom 2008), Louvain (Blondel et al. 2008), and OSLOM (Lancichinetti et al. 2011). Since the community detection algorithms are based on randomized techniques, we perform 100 runs of each algorithm. For each of these 300 outputs, we perform our parameter identification method. The community belief vector γ is chosen within its confidence interval by using a bisection method to maximize the value of <λ>. Then, for each of the three community detection algorithms, we select the run that yields the largest value of <λ>. The estimated matrices Q for the three different algorithms are illustrated and compared in Fig. 11. We observe that the three outputs are significantly dissimilar, since a different number of communities is originally detected. Specifically, we observe that the largest community identified by Louvain is split into two or more small communities by the other algorithms. Despite these differences, we can identify some common patterns. For instance, there is a first group of 17 nodes that belong to a “strong” community, where members have high tendency of generating inter-community links. This feature of the system emerges from all of the three outputs, as can be observed by comparing the different panels of Fig. 11.

When comparing the output of the different community detection algorithms used in this case study, we observe that, interestingly, Infomap produces a very stable result: in fact, in each of the runs it always retrieves the same community partition. Louvain, instead, identifies 6 different outcomes in the 100 runs, whereas OSLOM produces a different outcome in each run. In Fig. 12a, we plot the overlapping normalized mutual information (ONMI) evaluated between each pair of partitions produced by the same method (McDaid et al. 2011). This figure supports our claim that the output of OSLOM is strongly unstable, since each run produces a different outcome. In fact, the correlation between two different outputs can be small, as seen in the box-plot. Instead, the six different outputs of Louvain algorithm are strongly correlated. In Fig. 12b, the distribution of the value of <λ> in different runs of the algorithm is illustrated. We observe that Infomap outperforms the other two algorithms, while OSLOM and Louvain seem to have a similar performance. Finally, in Fig. 12c we plot the ONMI of an OSLOM partition with the Infomap partition as a function of the quantity <λ>. From this figure, we can observe a significant positive correlation between the two quantities, which supports our intuition that <λ> might be used as a performance index of the community detection algorithm.

Primary school

We apply our algorithm to a second real-world case study: the SocioPatterns primary school dataset (SocioPatterns). This dataset consists of a temporal network of face-to-face interactions between students and teachers in a French primary school, recorded via proximity sensors. The dataset comprises 77,602 interactions (sampled with a time resolution of 20 s) between n=242 individuals over a time-window of duration T=2 days. In this dataset, individuals are naturally partitioned: 232 of them are students, divided into 10 classes, and 10 are teachers (Stehlé et al. 2011; Gemmetto et al. 2014). These metadata provide a ground truth for the community structure. The main limitation of this dataset is that the direction of the interactions is not known, since it cannot be registered by the proximity sensors. In order to apply our method, in the absence of exact information on the link direction, we assume it to be homogeneously distributed. Hence, we perform a Monte Carlo parameter identification over 100 runs in which we randomize over the direction of each link in the dataset. Specifically, we interpret each undirected link {i,j} as a directed link from i to j with probability 1/2, and as a directed link from j to i otherwise, each one independently of the others. Since classes provide evidence of the ground truth of the community structure and the number of interactions in the dataset is sufficiently large, we set the belief parameter to the largest value within its confidence interval, i.e., $\gamma _{i}=\bar \gamma _{i}$, for all $i\in \mathcal V$. Then, we evaluate the average vector $\hat \lambda $ and the average matrices $\hat Q$ and $\hat R$ over the multiple realizations.
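The Monte Carlo randomization of link directions can be sketched as follows. The callable `estimate_lambda` is a hypothetical placeholder for the full parameter identification procedure, assumed to map a list of directed links to the estimated vector λ; only the averaging of λ is shown, and the matrices Q and R would be averaged in the same way.

```python
import random


def randomize_directions(undirected_contacts, rng):
    """Orient each undirected contact {i, j} as i -> j or j -> i with probability 1/2."""
    return [(i, j) if rng.random() < 0.5 else (j, i) for i, j in undirected_contacts]


def average_lambda(undirected_contacts, estimate_lambda, runs=100, seed=0):
    """Average the estimated community weight vector over randomized runs.

    estimate_lambda: hypothetical callable, directed links -> list of lambda_i
    """
    rng = random.Random(seed)
    total = None
    for _ in range(runs):
        lam = list(estimate_lambda(randomize_directions(undirected_contacts, rng)))
        total = lam if total is None else [t + x for t, x in zip(total, lam)]
    return [t / runs for t in total]
```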

The results illustrated in Fig. 13 show that our model is able to capture the community structure of the system, supporting the hypothesis that comes from the natural partition of students into their classes. In fact, the distribution of $\hat \lambda $ illustrated in Fig. 13a shows that the class-based community structure is able to describe a large part of the observed links for most of the nodes. This can also be observed from the large involvement of members in their communities, illustrated in Fig. 13b. Notably, the teachers, corresponding to the last row of Fig. 13b, are an exception: they have small values of $\hat Q$ within their community (last column), while they have a large involvement in the all-to-all community. This seems to reflect reality, since a teacher often interacts more with students (of several classes) than with other teachers. It is worth noticing that this is information that a traditional community detection algorithm can hardly reveal. In addition, a more detailed structure of dyadic relationships, both within and outside the classes, is revealed in the backbone matrix $\hat R$ represented in Fig. 13c. From its structure, one can infer the presence of strong relationships between students, mostly classmates. From these interactions one can infer the presence of subcommunities within each class and use this information to reconstruct a hierarchical community structure. It is also worth noticing that, for last-year students (the last rows before teachers), the dyadic relationships in the backbone are not limited to classmates, but inter-class nonzero entries are also present. This is consistent with other analyses performed on the same dataset, which show that last-year students are more active in generating out-of-class relationships (Stehlé et al. 2011; Gemmetto et al. 2014).

Conclusion

In this paper, we deal with the problem of modeling and characterizing the complex network structure of real-world systems. First, we present a mathematical model for temporal networks that generalizes the ADN paradigm, by including link directionality, the presence of a heterogeneous, hierarchical, and overlapping community structure, and the existence of an irreducible backbone of connections. Then, based on this model, we propose a technique to estimate the model parameters from empirical data and to express the effect of communities and the irreducible backbone on the link generation process in an intelligible form, providing a mesoscopic description of the system at the community level. The proposed technique is based on the introduction of a free parameter that can be calibrated within a confidence interval. This parameter models our belief in the role of communities in the link formation mechanism. We validate our method on three different synthetic networks, with satisfactory results. We also apply our method to two different real-world case studies. In the first one, the ground truth about the community structure is unknown and our method is used to establish a criterion to assess the performance of different community detection algorithms. In the second scenario, a ground truth about the community structure is instead provided by the partition of students and teachers in classes. In this case, we are able to i) retrieve the actual partition in classes and ii) reveal the different role of students and teachers in their classes.

The presented method is characterized by a reasonable computational effort. This property, together with the possibility of analytical treatment exhibited by ADNs, is essential to tackle real-world problems. For example, the possibility to detect the role of nodes and communities in interactions between individuals allows for the design of accurate targeted immunization strategies for the case of disease spreading (Masuda 2009; Salathè and Jones 2010; Gong et al. 2013), or for the detection of closed communities that drive the spread of misinformation and fake news, named echo chambers (Del Vicario et al. 2016). For these reasons, we believe that the possibility of unveiling the architecture of a complex system, through the characterization of its community structure, may play a fundamental role in the development of effective techniques to address real-world problems, with potentially invaluable benefits to society.

Abbreviations

ADN: activity driven network
LR: likelihood ratio
ONMI: overlapping normalized mutual information
OSLOM: order statistics local optimization method
rADN: routed activity driven network

References

Alessandretti, L, Sun K, Baronchelli A, Perra N (2017) Random walks on activity-driven networks with attractiveness. Phys Rev E 95(5):052318.
Bailey, NTJ (1990) The Elements of Stochastic Processes with Applications to the Natural Sciences. Wiley, New York.
Ballerini, M, Cabibbo N, Candelier R, Cavagna A, Cisbani E, Giardina I, Lecomte V, Orlandi A, Parisi G, Procaccini A, Viale M, Zdravkovic V (2008) Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proc Natl Acad Sci USA 105(4):1232–1237.
Bao, W, Michailidis G (2018) Core community structure recovery and phase transition detection in temporally evolving networks. Sci Rep 8(1):12938.
Benson, AR, Gleich DF, Leskovec J (2016) Higher-order organization of complex networks. Science 353(6295):163–166.
Bishop, C (2006) Pattern Recognition and Machine Learning. Springer, New York.
Blondel, VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):10008.
Bongiorno, C, Zino L, Rizzo A (2018) On unveiling the community structure of temporal networks. In: Proceedings of the 57th IEEE Conference on Decision and Control (CDC), 6210–6215.
Boyd, S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press, New York.
Casella, G, Berger RL (2002) Statistical Inference, vol. 2. Duxbury, Pacific Grove.
Cohen, WW. Enron Email Dataset. https://www.cs.cmu.edu/~./enron/. Accessed 27 Feb 2019.
Del Vicario, M, Bessi A, Zollo F, Petroni F, Scala A, Caldarelli G, Stanley HE, Quattrociocchi W (2016) The spreading of misinformation online. Proc Natl Acad Sci USA 113(3):554–559.
Erdős, P, Rényi A (1959) On random graphs. Publ Math Debrecen 6:290–297.
Estrada, E (2011) The Structure of Complex Networks: Theory and Applications. Oxford University Press, Oxford.
Fortunato, S, Hric D (2016) Community detection in networks: A user guide. Phys Rep 659:1–44.
Gemmetto, V, Cardillo A, Garlaschelli D (2017) Irreducible network backbones: unbiased graph filtering via maximum entropy. arXiv preprint arXiv:1706.00230.
Gemmetto, V, Barrat A, Cattuto C (2014) Mitigation of infectious disease at school: targeted class closure vs school closure. BMC Infect Dis 14(1):695.
Girvan, M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99(12):7821–7826.
Gong, K, Tang M, Hui PM, Zhang HF, Younghae D, Lai Y-C (2013) An efficient immunization strategy for community networks. PLoS ONE 8(12):1–11.
Hamming, R (1973) Numerical Methods for Scientists and Engineers, 2nd edition. Dover Publications, New York.
Holme, P, Saramäki J (2012) Temporal networks. Phys Rep 519(3):97–125.
Karsai, M, Perra N, Vespignani A (2014) Time varying networks and the weakness of strong ties. Sci Rep 4:4001.
Khan, BS, Niazi MA (2017) Network community detection: A review and visual survey. arXiv preprint arXiv:1708.00977.
Lancichinetti, A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015.
Lancichinetti, A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS ONE 6(4):1–18.
Latapy, M, Viard T, Magnien C (2018) Stream graphs and link streams for the modeling of interactions over time. Soc Netw Anal Min 8(1):61.
Lei, Y, Jiang X, Guo Q, Ma Y, Li M, Zheng Z (2016) Contagion processes on the static and activity-driven coupling networks. Phys Rev E 93(3):032308.
Leicht, EA, Newman MEJ (2008) Community structure in directed networks. Phys Rev Lett 100(11):118703.
Li, D, Han D, Ma J, Sun M, Tian L, Khouw T, Stanley HE (2017) Opinion dynamics in activity-driven networks. EPL 120(2):28002.
Liu, S, Perra N, Karsai M, Vespignani A (2014) Controlling contagion processes in activity driven networks. Phys Rev Lett 112(11):118702.
Masuda, N (2009) Immunization of networks with community structure. New J Phys 11(12):123018.
McDaid, AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515.
Nadini, M, Rizzo A, Porfiri M (2018a) Epidemic spreading in temporal and adaptive networks with static backbone. IEEE Trans Netw Sci Eng. https://doi.org/10.1109/TNSE.2018.2885483.
Nadini, M, Sun K, Ubaldi E, Starnini M, Rizzo A, Perra N (2018b) Epidemic spreading in modular time-varying networks. Sci Rep 8(1):2352.
Newman, MEJ (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256.
Newman, MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582.
Onnela, J-P, Saramäki J, Hyvönen J, Szabó G, Lazer D, Kaski K, Kertész J, Barabási A-L (2007) Structure and tie strengths in mobile communication networks. Proc Natl Acad Sci USA 104(18):7332–7336.
Palla, G, Barabási A-L, Vicsek T (2007) Quantifying social group evolution. Nature 446:664–667.
Palla, G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435:814–818.
Pastor-Satorras, R, Castellano C, Van Mieghem P, Vespignani A (2015) Epidemic processes in complex networks. Rev Mod Phys 87(3):925.
Perra, N, Gonçalves B, Pastor-Satorras R, Vespignani A (2012) Activity driven modeling of time varying networks. Sci Rep 2:469.
Petri, G, Barrat A (2018) Simplicial activity driven model. Phys Rev Lett 121(22):228301.
Pons, P, Latapy M (2011) Post-processing hierarchical community structures: Quality improvements and multi-scale view. Theor Comput Sci 412(8):892–900.
Pozzana, I, Sun K, Perra N (2017) Epidemic spreading on activity-driven networks with attractiveness. Phys Rev E 96(4):042310.
Ravasz, E, Somera AL, Mongru DA, Oltvai ZN, Barabási A-L (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555.
Ribeiro, B, Perra N, Baronchelli A (2013) Quantifying the effect of temporal resolution on time-varying networks. Sci Rep 3:3006.
Rizzo, A, Porfiri M (2016) Innovation diffusion on time-varying activity driven networks. Eur Phys J B 89:20.
Rizzo, A, Frasca M, Porfiri M (2014) Effect of individual behavior on epidemic spreading in activity driven networks. Phys Rev E 90(4):042801.
Rizzo, A, Pedalino B, Porfiri M (2016) A network model for ebola spreading. J Theor Biol 394:212–222.
Ross, SM (2009) Introduction to Probability Models. Academic Press, Cambridge.
Rossetti, G, Cazabet R (2018) Community discovery in dynamic networks: a survey. ACM Comput Surv 51(2):35.
Rosvall, M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci USA 105(4):1118–1123.
Salathè, M, Jones JH (2010) Dynamics and control of diseases in networks with community structure. PLOS Comput Biol 6(4):1–11.
Schaub, MT, Delvenne J-C, Rosvall M, Lambiotte R (2017) The many facets of community detection in complex networks. Appl Netw Sci 2(1):4.
SocioPatterns. SocioPatterns Primary School Temporal Network Data. http://www.sociopatterns.org/datasets/primary-school-temporal-network-data/. Accessed 27 Feb 2019.
Starnini, M, Pastor-Satorras R (2014) Temporal percolation in activity-driven networks. Phys Rev E 89(3):032807.
Stehlé, J, Voirin N, Barrat A, Cattuto C, Isella L, Pinton J, Quaggiotto M, Van den Broeck W, Régis C, Lina B, Vanhems P (2011) High-resolution measurements of face-to-face contact patterns in a primary school. PLoS ONE 6(8):23176.
Sun, K, Baronchelli A, Perra N (2015) Contrasting effects of strong ties on SIR and SIS processes in temporal networks. Eur Phys J B 88:326.
Volz, E, Meyers LA (2008) Epidemic thresholds in dynamic contact networks. J Royal Soc Interface 6(32):233–241.
Yang, J, Leskovec J (2014) Overlapping communities explain core–periphery organization of networks. Proc IEEE 102(12):1892–1902.
Zhang, X, Ma Z, Zhang Z, Sun Q, Yan J (2018) A review of community detection algorithms based on modularity optimization. J Phys Conf Ser 1069(1):012123.
Zino, L, Rizzo A, Porfiri M (2016) Continuous-time discrete-distribution theory for activity-driven networks. Phys Rev Lett 117(22):228302.
Zino, L, Rizzo A, Porfiri M (2017) An analytical framework for the study of epidemic models on activity driven networks. J Complex Netw 5(6):924–952.
Zino, L, Rizzo A, Porfiri M (2018) Modeling memory effects in activity-driven networks. SIAM J Appl Dyn Syst 17(4):2830–2854.

Acknowledgment

The authors acknowledge the SocioPatterns collaboration.

Funding

This publication is realized with the contribution of the Italian Ministry of Foreign Affairs and International Cooperation, within the project “Mac2Mic”, “Macro to Micro: uncovering the hidden mechanisms driving network dynamics,” funded under the bilateral agreement between Italy and Israel for scientific, technological, and industrial cooperation. It is also supported by Compagnia di San Paolo and by the Italian Ministry of Education, Universities and Research under grant Dipartimenti di Eccellenza 2018–2022 (E11G18000350001).

Availability of data and materials

The datasets generated during the current study are available upon request. The source code used to perform the parameter identification is available at https://github.com/cbongiorno/RADNFIT. The datasets analyzed in the case studies are available in the Enron Email Dataset (https://www.cs.cmu.edu/~./enron/) and in the SocioPatterns Primary School Temporal Network Data (http://www.sociopatterns.org/datasets/primary-school-temporal-network-data/), respectively.

Author information

Christian Bongiorno and Lorenzo Zino contributed equally to this work.

Authors and Affiliations

Department of Electronics and Telecommunications, Politecnico di Torino, Corso Duca degli Abruzzi, 24, Torino, 10129, Italy
Christian Bongiorno & Alessandro Rizzo
Department of Mathematical Sciences “G.L. Lagrange”, Politecnico di Torino, Corso Duca degli Abruzzi, 24, Torino, 10129, Italy
Lorenzo Zino
Office of Innovation, New York University Tandon School of Engineering, Brooklyn NY, 11201, US
Alessandro Rizzo

Authors

Christian Bongiorno
Lorenzo Zino
Alessandro Rizzo

Contributions

AR coordinated the research; CB performed the numerical studies and conceived the parameter identification method; LZ designed the model and wrote the initial draft; all authors contributed to data analysis and writing the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Alessandro Rizzo.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Bongiorno, C., Zino, L. & Rizzo, A. A novel framework for community modeling and characterization in directed temporal networks. Appl Netw Sci 4, 10 (2019). https://doi.org/10.1007/s41109-019-0119-2

Received: 06 December 2018
Accepted: 06 March 2019
Published: 29 March 2019
DOI: https://doi.org/10.1007/s41109-019-0119-2
