The block-constrained configuration model
Applied Network Science volume 4, Article number: 123 (2019)
Abstract
We provide a novel family of generative blockmodels for random graphs that naturally incorporates degree distributions: the block-constrained configuration model. Block-constrained configuration models build on the generalized hypergeometric ensemble of random graphs and extend the well-known configuration model by enforcing block-constraints on the edge-generating process. The resulting models are practical to fit even to large networks. These models provide a new, flexible tool for the study of community structure and for network science in general, where modeling networks with heterogeneous degree distributions is of central importance.
Introduction
Stochastic blockmodels (SBMs) are random models for graphs characterized by group, community, or block structures. They are a generalization of the classical G(n,p) Erdős-Rényi model (1959), where vertices are separated into B different blocks, and different probabilities to create edges are then assigned to each block. This way, higher probabilities correspond to more densely connected groups of vertices, capturing the structure of clustered graphs (Fienberg et al. 1985; Holland et al. 1983; Peixoto 2012).
SBMs are specified by defining a B×B block-matrix of probabilities \(\mathbf B\) such that each of its elements \(\omega _{b_{i}b_{j}}\) is the probability of observing an edge between vertices i and j, where b_{i} denotes the block to which vertex i belongs. Most commonly, block-matrices are used to encode community structures. This is achieved by defining a diagonal block-matrix, with the inclusion of small off-diagonal elements.
Thanks to its simple formulation, the edge-generating process of the standard SBM can retain the block structure of the graph that needs to be modeled (Karrer and Newman 2011). However, it fails to reproduce empirical degree sequences. The reason for this is that in the G(n,p) model and its extensions, edges are sampled independently from each other with fixed probabilities, generating homogeneous degree-sequences across blocks. This issue impairs the applicability of the standard SBM to most real-world graphs. Because of the lack of control on the degree distributions generated by the model, SBMs are not able to reproduce the complex structures of empirical graphs, resulting in poorly fitted models (Karrer and Newman 2011).
Different strategies have been formulated to overcome this issue. Among others, one approach is that of using exponential random graph models (Krivitsky 2012). These models are very flexible in terms of the kind of patterns they can incorporate. However, as soon as their complexity increases, they lose the analytical tractability that characterizes the standard SBM. This is due to the need for computing the partition function that defines the underlying probability distribution (Park and Newman 2004). Another, more prominent, approach taken to address the issue of uniform degree-sequences in SBMs is that of degree-corrected blockmodels (DCSBMs) (e.g. Peixoto (2014); Newman and Peixoto (2015); Karrer and Newman (2011); Peixoto (2015)). Degree-corrected blockmodels address this problem by extending standard SBMs with degree corrections, which serve the purpose of enforcing a given expected degree-sequence within the block structures. Moreover, this is achieved without hampering the simplicity of the standard SBM. For this reason, DCSBMs are widely used for community detection tasks (Newman and Reinert 2016; Peixoto 2015). Recently, they have further been extended to a Bayesian framework, allowing non-parametric model estimation (Peixoto 2017; Peixoto 2018).
One of the main assumptions of G(n,p) models, SBMs, and DCSBMs as well, is that the probabilities of creating edges for different pairs of vertices are independent of each other (Karrer and Newman 2011). While such a modeling assumption allows defining distributions whose parameters are in general easy to estimate, for many real-world graphs, this is a strong assumption that should be verified, and which is possibly unrealistic (Squartini et al. 2015). Many social phenomena studied through empirical graphs, such as triadic closure (Granovetter 1973), or balance theory (Newcomb and Heider 1958), are based on the assumption that edges between vertices are not independent. Similarly, for graphs arising from the observation of constrained systems, like financial and economic networks, it is unreasonable to assume that edge probabilities are independent of each other. This is because the observed edges in the graph, which are the representation of interactions between actors in a system, are driven by optimization processes characterized by limited resources and budget constraints, which introduce correlations among different edge probabilities (Caldarelli et al. 2013; Nanumyan et al. 2015).
Moreover, one of the consequences of the assumption of independence of edge probabilities is the fact that the total number of edges of the modelled graph is preserved only in expectation. In the case of SBMs and DCSBMs, the total number of edges is assumed to follow a Poisson distribution. For a Poisson process to be the appropriate model for an empirical graph, the underlying edge generating process needs to meet the following conditions (Consul and Jain 1973): (i) the observation of one edge should not affect the probability that a second edge will be observed, i.e., edges occur independently; (ii) the rate at which edges are observed has to be constant; (iii) two edges cannot be observed at precisely the same instant. However, it is often hard to evaluate whether these conditions are verified because the edge generating process may not be known, or these conditions are not met altogether.
Melnik et al. (2014) have proposed an alternative approach to the problem of preserving degree distributions and the independence of edges. Such an approach is a generalisation of the configuration model that allows constructing modular random graphs, characterised by heterogeneous degree-degree correlations between each block. The model, in particular, relies on specifying different values \(P_{k,k\prime }^{b_{i},b_{i\prime }}\) for the probability that a randomly chosen edge connects a degree-k node from block b_{i} to a degree-k′ node from block b_{i′}. The so-called \(P_{k,k\prime }^{i,i\prime }\)-model, though, only considers unweighted and undirected graphs (Melnik et al. 2014).
Similarly to the approach discussed in Melnik et al. (2014), we address the problem of incorporating degree distributions by generalising the configuration model. Doing so, we propose a family of blockmodels that preserves the number of edges exactly, instead of in expectation. This circumvents the issue of assuming a given model for the number of edges in the graph, treating it merely as an observed datum. The configuration model of random graphs (Chung and Lu 2002a; 2002b; Bender and Canfield 1978; Molloy and Reed 1995) is, in fact, the simplest random model that can reproduce heterogeneous degree distributions given a fixed number of edges. It achieves this by randomly rewiring edges between vertices and thus preserving the degree-sequence of the original graph. Doing so, it keeps the number of edges in the graph fixed.
Differently from what is proposed by Melnik et al. (2014), though, we extend the standard configuration model to reproduce arbitrary block structures by introducing block constraints on its rewiring process by means of the formalism provided by the generalised hypergeometric ensemble of random graphs. While this approach is not as general as the one proposed by Melnik et al. (2014) in terms of how degree-degree correlations can be incorporated, it allows us to deal with multi-edge, directed graphs. We refer to the resulting model as the block-constrained configuration model (BCCM). Significant advantages of our approach are (i) the natural degree-correction provided by BCCM, and (ii) the preservation of the exact number of edges.
Generalised hypergeometric ensembles of random graphs (gHypEG)
Our approach builds on the generalized hypergeometric ensemble of random graphs (gHypEG) (Casiraghi et al. 2016; Casiraghi and Nanumyan 2018). This class of models extends the configuration model (CM) (Molloy and Reed 1995; 1998) by encoding complex topological patterns, while at the same time preserving degree distributions. Block constraints fall into the larger class of patterns that can be encoded by means of gHypEG. For this reason, before introducing the formulation of the block-constrained configuration model, we provide a brief overview of gHypEG. More details, together with a more formal presentation, are given in Casiraghi et al. (2016); Casiraghi and Nanumyan (2018).
In the configuration model of random graphs, the probability of connecting two vertices depends only on their (out- and in-) degrees. In its most common formulation, the configuration model assigns to each vertex as many out-stubs (or half-edges) as its out-degree, and as many in-stubs as its in-degree. It then proceeds by connecting random pairs of vertices, joining out- and in-stubs. This is done by sampling uniformly at random one out-stub and one in-stub from the pool of all out- and in-stubs respectively and then connecting them, until all stubs are connected. The left side of Fig. 1 illustrates the case from the perspective of a vertex A. The probability of connecting vertex A with one of the vertices B, C or D depends only on the abundance of stubs, and hence on the in-degree of the vertices themselves. The higher the in-degree, the higher the number of in-stubs of the vertex. Hence, the higher the probability to randomly sample a stub belonging to the vertex.
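The stub-matching procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `configuration_model` and the example degree sequences are hypothetical, and self-loops and multi-edges are allowed, as in a directed multigraph.

```python
import random

def configuration_model(out_degrees, in_degrees, seed=None):
    """Directed multigraph by uniform stub matching; returns a list of (i, j) edges."""
    if sum(out_degrees) != sum(in_degrees):
        raise ValueError("out- and in-degree sequences must have the same sum")
    rng = random.Random(seed)
    # One out-stub per unit of out-degree, one in-stub per unit of in-degree.
    out_stubs = [v for v, k in enumerate(out_degrees) for _ in range(k)]
    in_stubs = [v for v, k in enumerate(in_degrees) for _ in range(k)]
    # Shuffling one side and pairing position by position is a uniform matching.
    rng.shuffle(in_stubs)
    return list(zip(out_stubs, in_stubs))

# Hypothetical degree sequences for four vertices.
edges = configuration_model([2, 1, 0, 1], [0, 2, 1, 1], seed=1)
print(edges)
```

Every realization reproduces the prescribed degree sequences exactly, because each stub is used once.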
Generalized hypergeometric ensembles of random graphs provide an expression for the probability distribution underlying this process, where the degrees of the vertices are preserved in expectation. This result is achieved by mapping the process described above to an urn problem. Edges are represented by balls in an urn, and sampling from the configuration model is described by sampling balls (i.e., edges) from an appropriately constructed urn. For each pair of vertices (i,j), we can denote with \(k^{\text {out}}_{i}\) and \(k^{\text {in}}_{j}\) their respective out- and in-degrees. The number of combinations of out-stubs of i with in-stubs of j which could be connected to create an edge is then given by \(k^{\text {out}}_{i}k^{\text {in}}_{j}\). To map this process to an urn, for each dyad (i,j) we should place exactly \(k^{\text {out}}_{i}k^{\text {in}}_{j}\) balls of a given colour in the urn (Casiraghi and Nanumyan 2018). The process of sampling m edges from the configuration model is hence described by sampling m balls from this urn, and the probability distribution of observing a graph \(\mathcal {G}\) under the model is given by the multivariate hypergeometric distribution with parameters \(\mathbf \Xi =\{k^{\text {out}}_{i}k^{\text {in}}_{j}\}_{i,j}\):
\[\Pr (\mathcal {G}\mid \mathbf \Xi)=\binom {\sum _{ij}\Xi _{ij}}{m}^{-1}\prod _{i,j}\binom {\Xi _{ij}}{A_{ij}}, \tag{1}\]
where A_{ij} denotes the element ij of the adjacency matrix of \(\mathcal {G}\), and the probability of observing \(\mathcal {G}\) is nonzero only if \(\sum _{ij}A_{ij}=m\).
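As a small illustration, the multivariate hypergeometric probability of a graph (the product of the binomial coefficients "Ξ_{ij} choose A_{ij}" divided by "sum of Ξ choose m") can be evaluated directly with the Python standard library. The helper names `combinatorial_matrix` and `log_prob_configuration`, and the example adjacency matrix, are assumptions made for this sketch.

```python
from math import comb, log

def combinatorial_matrix(out_degrees, in_degrees):
    """Xi_ij = k_i^out * k_j^in for every dyad (i, j)."""
    return [[ko * ki for ki in in_degrees] for ko in out_degrees]

def log_prob_configuration(adjacency, xi):
    """Log-probability of a multigraph under the multivariate hypergeometric urn."""
    m = sum(map(sum, adjacency))      # number of sampled edges (balls)
    total = sum(map(sum, xi))         # number of balls in the urn
    log_p = -log(comb(total, m))
    for a_row, xi_row in zip(adjacency, xi):
        for a, x in zip(a_row, xi_row):
            if a > x:                 # more edges than available stub combinations
                return float("-inf")
            log_p += log(comb(x, a))
    return log_p

# Hypothetical 3-vertex directed multigraph.
A = [[0, 2, 1],
     [1, 0, 0],
     [0, 1, 0]]
k_out = [sum(row) for row in A]
k_in = [sum(col) for col in zip(*A)]
Xi = combinatorial_matrix(k_out, k_in)
print(log_prob_configuration(A, Xi))
```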
Generalized hypergeometric ensembles of random graphs further extend this formulation. In gHypEG, the probability of connecting two vertices depends not only on the degree (i.e., number of stubs) of the two vertices but also on an independent propensity of the two vertices to be connected, which captures non-degree-related effects. Doing so allows constraining the configuration model such that given edges are more likely than others, independently of the degrees of the respective vertices. The right side of Fig. 1 illustrates this case, where A is most likely to connect with vertex D, belonging to the same group, even though D has only one available stub.
In generalized hypergeometric ensembles the distribution over multigraphs (denoted \(\mathcal {G}\)) is formulated such that it depends on two sets of parameters: the combinatorial matrix Ξ, and a propensity matrix Ω that captures the propensity of each pair of vertices to be connected. Each of these two matrices has dimensions n×n where n is the number of vertices in \(\mathcal {G}\). The contributions of the two matrices to the model are as follows. The combinatorial matrix Ξ encodes the configuration model as described above. The propensity matrix Ω encodes dyadic propensities of vertices that go beyond what is prescribed by the combinatorial matrix Ξ. The ratio between any two elements Ω_{ij} and Ω_{kl} of the propensity matrix is the odds-ratio of observing an edge between vertices i and j instead of k and l, independently of the degrees of the vertices.
As for the case of the configuration model, this process can be seen as sampling edges from an urn. Moreover, specifying a propensity matrix Ω allows biasing the sampling in specified ways, so that some edges are more likely to be sampled than others. The probability distribution over a graph \(\mathcal {G}\) given Ξ and Ω is then described by the multivariate Wallenius’ non-central hypergeometric distribution (Wallenius 1963; Chesson 1978).
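Denoting with A the adjacency matrix of the multigraph \(\mathcal {G}\) and with V its set of vertices, the probability distribution underlying a gHypEG \(\mathbb X(\mathbf \Xi,\mathbf \Omega,m)\) with parameters Ξ, Ω, and with m edges is defined as follows:
\[\Pr (\mathcal {G}\mid \mathbf \Xi,\mathbf \Omega)=\left [\prod _{i,j\in V}\binom {\Xi _{ij}}{A_{ij}}\right ]\int _{0}^{1}\prod _{i,j\in V}\left (1-z^{\frac {\Omega _{ij}}{S_{\mathbf \Omega }}}\right)^{A_{ij}}dz \tag{2}\]
with
\[S_{\mathbf \Omega }=\sum _{i,j\in V}\Omega _{ij}\left (\Xi _{ij}-A_{ij}\right). \tag{3}\]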
In Eq. 2, the first term on the right-hand side represents combinatorial effects encoding degrees, inherited from the configuration model. The second term, constituted by the integral, encodes the biases that need to be enforced on top of the process defined by the configuration model. Note that, if Ω_{ij}=c for all i,j and for any constant c, i.e., if no biases are enforced on the configuration model, Eq. 2 corresponds to Eq. 1 (Casiraghi and Nanumyan 2018). The probability distributions for undirected graphs and graphs without self-loops are defined similarly: by excluding the lower triangular entries of the adjacency matrix or by excluding its diagonal entries respectively (we refer to Casiraghi and Nanumyan (2018) for more details).
In the case of large graphs, sampling from an urn without replacement can be approximated by sampling with replacement from the same urn. Under this assumption, the probability given in Eq. 2 can be approximated by means of a multinomial distribution with parameters \(p_{ij}=\Xi _{ij}\Omega _{ij}/\sum _{kl}\Xi _{kl}\Omega _{kl}\).
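The multinomial approximation can be sketched as follows: each of the m edges is drawn independently, with replacement, with probability proportional to Ξ_{ij}Ω_{ij}. The function names and the small example matrices are hypothetical and serve only to illustrate the formula for p_{ij}.

```python
import random

def multinomial_edge_probabilities(xi, omega):
    """p_ij = Xi_ij * Omega_ij / sum_kl Xi_kl * Omega_kl."""
    weights = [[x * w for x, w in zip(xr, wr)] for xr, wr in zip(xi, omega)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def sample_edges_with_replacement(xi, omega, m, seed=None):
    """Draw m edges with replacement and return the resulting adjacency matrix."""
    rng = random.Random(seed)
    p = multinomial_edge_probabilities(xi, omega)
    n = len(p)
    dyads = [(i, j) for i in range(n) for j in range(n)]
    flat = [p[i][j] for i, j in dyads]
    adjacency = [[0] * n for _ in range(n)]
    for i, j in rng.choices(dyads, weights=flat, k=m):
        adjacency[i][j] += 1
    return adjacency

# Hypothetical inputs: three vertices, the first two sharing a block.
xi = [[4, 6, 2], [2, 3, 1], [2, 3, 1]]
omega = [[1.0, 1.0, 0.1], [1.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
probabilities = multinomial_edge_probabilities(xi, omega)
sampled = sample_edges_with_replacement(xi, omega, m=20, seed=0)
```

Unlike the urn without replacement, this approximation does not cap the number of edges on a dyad at Ξ_{ij}, which is why it is only appropriate when the urn is large relative to m.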
Block-constrained configuration model
The main modelling assumption that differentiates gHypEGs from SBMs is in the dependence/independence of edge probabilities. In particular, while SBMs assume independent edge probabilities, and specify a Poisson process for the edge-generating process, gHypEG fixes the total number of edges m in the model and removes the assumption of independence between edge probabilities. This has the conceptual advantage of not assuming an arbitrary edge-generating process, such as the Poisson process considered by DCSBMs.
We hence define the block-constrained configuration model (BCCM) building on the framework provided by generalized hypergeometric ensembles of random graphs. We achieve this by utilizing a particular form of the propensity matrix Ω. Specifically, we need to encode the block structure that we observe in the propensity matrix Ω. We do so by specifying a block-propensity matrix Ω^{(B)} where each of its elements \({\Omega ^{(B)}}_{ij}=\omega _{b_{i}}\) if the vertices i and j are in the same block b_{i}, and \({\Omega ^{(B)}}_{ij}=\omega _{b_{i}b_{j}}\) if the vertices i and j are in different blocks b_{i} and b_{j} respectively. Figure 2 shows a block-propensity matrix characterised by three blocks. Similarly to SBMs, in the presence of B blocks, we can specify a B×B block-matrix \(\mathbf B\) that captures the block structure through its parameters \(\omega _{b_{i}b_{j}}\). However, in the case of a BCCM, the entries \(\omega _{b_{i}b_{j}}\) capture the deviations in terms of edge propensities from the configuration model defined by the matrix Ξ, constraining edges into blocks.
The block-matrix B can be specified to generate various structures, extending those naturally produced by degrees only. For example, a diagonal block-matrix can model graphs with disconnected components. The inclusion of small off-diagonal elements gives rise to standard community structures, with densely connected clusters of vertices. By specifying different types of block-matrices, it is also possible to model core-periphery, hierarchical, or multipartite structures.
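Expanding a block-matrix into the n×n block-propensity matrix amounts to a lookup of the blocks of the two endpoints. The sketch below assumes a hypothetical partition of five vertices into three blocks and hypothetical propensity values; the function name `block_propensity_matrix` is likewise only illustrative.

```python
def block_propensity_matrix(blocks, block_matrix):
    """Omega^(B)_ij = block_matrix[b_i][b_j], where blocks[i] is the block of vertex i."""
    return [[block_matrix[bi][bj] for bj in blocks] for bi in blocks]

# Hypothetical community structure: strong diagonal, weak links between
# neighbouring blocks, no propensity between blocks 0 and 2.
blocks = [0, 0, 1, 1, 2]
B = [[1.0, 0.1, 0.0],
     [0.1, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
omega_B = block_propensity_matrix(blocks, B)
for row in omega_B:
    print(row)
```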
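The block-constrained configuration model \(\mathbb X(\mathbf \Xi,\mathbf B,m)\) with m edges is thus completely defined by the combinatorial matrix Ξ, and by the block-matrix B generating the propensity matrix Ω^{(B)}. We can then rewrite the general probability for a gHypEG given in Eq. 2 for BCCM:
\[\Pr (\mathcal {G}\mid \mathbf \Xi,\mathbf B)=\left [\prod _{i,j\in V}\binom {\Xi _{ij}}{A_{ij}}\right ]\int _{0}^{1}\prod _{i,j\in V}\left (1-z^{\frac {{\Omega ^{(B)}}_{ij}}{S_{\mathbf B}}}\right)^{A_{ij}}dz \tag{4}\]
with
\[S_{\mathbf B}=\sum _{i,j\in V}{\Omega ^{(B)}}_{ij}\left (\Xi _{ij}-A_{ij}\right). \tag{5}\]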
Table 1 summarises the differences between the distributions underlying the DCSBM and the BCCM.
Despite its complicated expression, the probability distribution in Eq. 4 allows computing probabilities for large graphs, without the need to resort to Monte-Carlo simulations (Fog 2008a). This permits the study of large graphs and provides simple model selection methods based on the comparison of likelihoods, such as likelihood-ratio tests, or those based on information criteria. In this article, we will consider model selection based on the comparison of information criteria.
We will adopt the two most commonly used ones: Akaike information criterion (AIC) (Akaike 1974), and Schwarz or Bayesian information criterion (BIC) (Schwarz et al. 1978). Both criteria depend on the likelihood function of the models to be compared and penalize for the number of parameters estimated by the model. The model with the lowest score is the preferred one, as it best fits the data without overfitting it. In particular, it is not the absolute size of the score, but the difference between values that matters for model selection.
Information-theoretic methods considered here provide a simple way to select the best-approximating model from a candidate set of models. The concept of information criterion has allowed major practical and theoretical advances in model selection and the analysis of complex data sets (Stone 1982; Bozdogan 1987; DeLeeuw 1992). In particular, AIC and BIC allow performing model selection without the need for simulations, or the assumption of specific asymptotic behaviors of the probability distribution of the model (although BIC assumes that the priors for the parameters estimated are asymptotically normal). Moreover, the aim of model selection by means of AIC and BIC is not to identify exactly the ‘true model,’ i.e., the actual process generating the data, but to propose simpler models that are good approximations of it (Kuha 2004). They only allow the selection of the best model among those within a specified set. This means that, if all models in the set are very poor, information criteria will select the best model, but even that relatively best model might be poor in the absolute sense (Burnham and Anderson 2004).
The Akaike information criterion for a model \(\mathbb X\) given a graph \(\mathcal {G}\) is formulated as follows:
\[\text {AIC}(\mathbb X\mid \mathcal {G})=2k-2\log \hat L(\mathbb X\mid \mathcal {G}), \tag{6}\]
where k is the number of parameters estimated by \(\mathbb X\) and \(\hat L(\mathbb X\mid \mathcal {G})\) is the likelihood of model \(\mathbb X\) given the graph \(\mathcal {G}\). AIC gives an estimate of the expected, relative Kullback-Leibler distance between the fitted model and the unknown true mechanism generating the observed data. Hence, the best model among a set of models is the one that has the minimal distance from the true process, and thus the one that minimizes AIC.
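The Bayesian information criterion for a model \(\mathbb X\) given a graph \(\mathcal {G}\) is given by:
\[\text {BIC}(\mathbb X\mid \mathcal {G})=k\log (m)-2\log \hat L(\mathbb X\mid \mathcal {G}), \tag{7}\]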
where k is the number of parameters estimated by \(\mathbb X\), m is the number of observations, i.e., edges, and \(\hat L(\mathbb X\mid \mathcal {G})\) is the likelihood of model \(\mathbb X\) given the graph \(\mathcal {G}\). Similarly to AIC, the best model in a set according to BIC is the one which minimizes the criterion. Because of the presence of a higher penalty for model size, BIC tends to select models with fewer parameters compared to AIC.
As mentioned above, what matters for model selection is the difference between the values of AICs or BICs and not their absolute values. For this reason, it is helpful to rank models in terms of their differences from the model which minimizes a given criterion. Suppose that there are R models, and we want to find the best one according to either AIC or BIC. Let AIC_{min} be the minimum AIC score among the R models for a given dataset. Then we can define the AIC differences \(\Delta _{i}^{\text {AIC}}\) as the difference AIC_{i}−AIC_{min} between the AIC score for model i∈{1,…,R} and that of the model which minimizes AIC. BIC differences are defined in a similar manner. While AIC and BIC differences are useful in ranking models, it is possible to quantify the plausibility of each model by defining relative likelihoods for the models. Specifically, the quantity \(e^{-\frac {1}{2}\Delta _{i}}\) defines the relative likelihood of model i given the data (Burnham and Anderson 2004). To better interpret relative likelihoods, statisticians usually normalize relative likelihoods to be a set of positive weights w_{i} defined as
\[w_{i}=\frac {e^{-\frac {1}{2}\Delta _{i}}}{\sum _{r=1}^{R}e^{-\frac {1}{2}\Delta _{r}}}. \tag{8}\]
In the case of AIC, such model weights are usually referred to as Akaike weights and are considered to be the weight of evidence in favor of model i being the best model. In the case of BIC, instead, the weights define the posterior model probabilities. The bigger Δ_{i} is, the smaller w_{i} and the less plausible is model i as being the actual best model based on the design and sample size used. These weights provide an effective way to scale and interpret the Δ_{i} values and hence select the best model (Burnham and Anderson 2004).
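The criteria and the derived model weights are simple to compute once the maximized log-likelihoods are available. The following sketch uses the standard definitions AIC = 2k − 2 log L and BIC = k log(m) − 2 log L; the function names and the three example scores are hypothetical.

```python
from math import exp, log

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, m):
    """Bayesian information criterion with m observations (edges)."""
    return k * log(m) - 2 * log_likelihood

def model_weights(scores):
    """Normalized relative likelihoods exp(-Delta_i / 2) for a list of AIC or BIC scores."""
    best = min(scores)
    relative = [exp(-0.5 * (s - best)) for s in scores]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical scores for three candidate models.
scores = [206.0, 210.0, 220.0]
deltas = [s - min(scores) for s in scores]
weights = model_weights(scores)
print(deltas, weights)
```

Subtracting the minimum score before exponentiating leaves the weights unchanged and avoids numerical underflow when the raw scores are large.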
In the next sections, we describe how BCCM can be used to generate graphs and how to fit its parameters to an observed graph. Because the absolute values of AIC and BIC are not important, and only relative Δ_{i}s matter, in the following we will usually report only the value of the relative differences.
Generating realizations from the BCCM. BCCM is a practical generative model that allows the creation of synthetic graphs with complex structures by drawing realizations from the multivariate Wallenius non-central hypergeometric distribution. The process of generating synthetic graphs can be divided into two tasks. First, we need to specify the degree sequences for the vertices. This can be accomplished by, e.g., sampling the degree sequences from power-law or exponential distributions. From the degree sequences we can generate the combinatorial matrix Ξ, specifying its elements \(\Xi _{ij}=k_{i}^{\text {out}}k_{j}^{\text {in}}\), where \(k_{i}^{\text {out}}\) is the out-degree of vertex i and \(k_{j}^{\text {in}}\) is the in-degree of vertex j. Second, we need to define a block-matrix B, whose elements specify the propensities of observing edges between vertices, between and within the different blocks.
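The block-matrix B takes the form given in Eq. 9:
\[\mathbf B=\begin {pmatrix}\omega _{1}&\omega _{12}&\cdots &\omega _{1B}\\ \omega _{21}&\omega _{2}&\cdots &\omega _{2B}\\ \vdots &\vdots &\ddots &\vdots \\ \omega _{B1}&\omega _{B2}&\cdots &\omega _{B}\end {pmatrix}. \tag{9}\]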
Elements ω_{kl}, with k,l∈{1,…,B}, should be specified such that the ratio between any two elements corresponds to the chosen odds-ratio of observing an edge in the block corresponding to the first element instead of the block corresponding to the second element, given that the degrees of the corresponding vertices are the same. For example, ω_{1}/ω_{32} corresponds to the odds-ratio of observing an edge between vertices in block 1 compared to an edge between block 2 and block 3. Note that in the case of an undirected graph, ω_{kl}=ω_{lk}∀k,l∈{1,…,B}. On the other hand, in the case of a directed graph, blocks may have a preferred directionality, i.e., edges between blocks may be more likely in one direction. In this case, we may choose ω_{kl}≠ω_{lk}.
Once the parameters of the model are defined, we sample graphs with m edges from the BCCM \(\mathbb X(\mathbf \Xi,\mathbf \Omega _{B},m)\) defined by the combinatorial matrix Ξ, and the block-propensity matrix Ω_{B} defined by B. As described in the previous section, sampling a graph from \(\mathbb X(\mathbf \Xi,\mathbf \Omega _{B},m)\) corresponds to sampling m edges according to the multivariate Wallenius non-central hypergeometric distribution.
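For small graphs, a realization can be drawn by simulating the biased urn directly: balls are removed one at a time, and at each draw a dyad is chosen with probability proportional to its propensity times the number of its balls still in the urn, which is the sampling scheme that defines the Wallenius distribution. The sketch below is a naive pure-Python substitute for a dedicated random number generator, not an efficient implementation; the function name, degree sequences, partition, and block-matrix are all hypothetical.

```python
import random

def sample_wallenius_graph(xi, omega, m, seed=None):
    """Draw m edges one by one, without replacement, from the biased urn."""
    rng = random.Random(seed)
    n = len(xi)
    dyads = [(i, j) for i in range(n) for j in range(n)]
    remaining = [xi[i][j] for i, j in dyads]   # balls left for each dyad
    adjacency = [[0] * n for _ in range(n)]
    for _ in range(m):
        # Weight of a dyad: propensity times the number of its remaining balls.
        weights = [omega[i][j] * r for (i, j), r in zip(dyads, remaining)]
        if sum(weights) <= 0:
            raise ValueError("urn exhausted before m edges were drawn")
        idx = rng.choices(range(len(dyads)), weights=weights)[0]
        remaining[idx] -= 1
        i, j = dyads[idx]
        adjacency[i][j] += 1
    return adjacency

# Hypothetical example: six vertices in two blocks, twelve edges.
k_out = [3, 2, 1, 3, 2, 1]
k_in = [2, 2, 2, 2, 2, 2]
blocks = [0, 0, 0, 1, 1, 1]
B = [[1.0, 0.1], [0.1, 1.0]]
xi = [[ko * ki for ki in k_in] for ko in k_out]
omega = [[B[bi][bj] for bj in blocks] for bi in blocks]
graph = sample_wallenius_graph(xi, omega, m=12, seed=42)
```

Each draw costs time proportional to the number of dyads, so this sketch is only meant for illustration on small examples. Self-loops could be excluded by setting the diagonal of `xi` to zero.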
Examples. We can specify different types of clustered graphs using this construction. As a demonstrative example, we define a block-matrix with five blocks connected in a ring. Each block is as dense as the others, and blocks are weakly connected with only their closest neighbors. The block-matrix quantifying these specifications is given as
According to the choice made in Eq. 10, edges within diagonal blocks are 10 times more likely than edges within off-diagonal blocks.
After fixing this block-matrix, we can define different degree sequences for the vertices. We highlight here the results obtained when fixing three different options in a directed graph without self-loops, with n=50 vertices and m=500 edges. We generate realizations by specifying the combinatorial matrix Ξ and the block-propensity matrix and exploiting the random number generator provided by Fog (2008b) in the R library BiasedUrn.
The first degree-sequence we can set is the most straightforward option, corresponding to the standard non-degree-corrected stochastic blockmodel. This model corresponds to setting each entry in the combinatorial matrix Ξ equal to m^{2}/(n(n−1)) (Casiraghi et al. 2016). If we assign the same number of vertices to each block, we expect the model to generate graphs with homogeneous blocks. Figure 3a shows a realisation from this model. The second degree-sequence we can set is defined such that the degrees of the vertices of each block are drawn from a power-law distribution. We expect that each block shows the same structure, with few vertices with high degrees, and many with low degrees. Because of this, we expect that most blocks are connected with directed edges starting from high-degree vertices. Figure 3b shows a realization from this model where this is visible. Finally, we set a degree sequence where the degrees of all vertices are drawn from a power-law distribution. Figure 3c shows a realization from this model. The combinatorial matrices corresponding to Fig. 3b and c are included with this article as additional files.
Instead of varying the degree sequences of the underlying configuration model, we can as well alter the strength of the block structure, changing the block-matrix B. Similarly to what we did above, we show three different combinations of parameters. First, we set the within-group parameters \(\omega _{b_{i}}\) equal to the between-group parameters \(\omega _{b_{i}b_{j}} \forall i,j\). Second, we set the parameter \(\omega _{b_{1}}=10\) so that more edges are concentrated in the first block. Third, we set the parameters to reconstruct a hierarchical structure. We modify the parameters \(\omega _{b_{1}b_{2}}=\omega _{b_{3}b_{4}}=\omega _{b_{4}b_{5}}=0.8\) to model graphs with two weakly connected macro clusters, where one is split into two strongly connected clusters and the other into three strongly connected clusters. Realizations drawn from each of these three models are shown in Fig. 4.
Fitting the block-matrix. In DCSBMs the numbers of edges between each pair (i,j) of vertices are assumed to be drawn from independent Poisson distributions, with parameters \(\theta _{i}\theta _{j}\omega _{b_{i}b_{j}}\). Let \(A_{b_{\alpha } b_{\beta }}=\sum _{i\in b_{\alpha },j\in b_{\beta }}A_{ij}\) denote the number of edges between all vertices i that are in block b_{α} and j in block b_{β}. We further denote with b_{i} the block of vertex i. Exploiting the independence of probabilities, the maximum likelihood estimates \(\widehat \theta _{i}\) and \(\widehat \omega _{b_{i}b_{j}}\) of the parameters θ_{i} and \(\omega _{b_{i}b_{j}}\) are given by \(\widehat \theta _{i}:=k_{i}/\kappa _{b_{i}}\) and \(\widehat \omega _{b_{i}b_{j}}:=A_{b_{i} b_{j}}\), where k_{i} is the degree of vertex i and \(\kappa _{b_{i}}\) is the sum of the degrees of the vertices in block b_{i} (Karrer and Newman 2011).
Because BCCM does not assume independence among edge probabilities, the parameter estimation is necessarily more complicated than that of DCSBMs. However, the formulation of the block-constrained configuration model allows for the fast estimation of the parameters of the block-matrix. Similarly to what is done with SBMs, we fit the BCCM by preserving in expectation the observed number of edges between and within different blocks. To estimate the entries ω_{b} of the block-matrix \(\hat {\mathbf {B}}\), we exploit the properties of the generalized hypergeometric ensemble of random graphs.
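In gHypEG, the entries of the expected adjacency matrix 〈A_{ij}〉 are obtained by solving the following system of equations (Casiraghi and Nanumyan 2018):
\[\left (1-\frac {\langle A_{11}\rangle }{\Xi _{11}}\right)^{\frac {1}{\Omega _{11}}}=\left (1-\frac {\langle A_{12}\rangle }{\Xi _{12}}\right)^{\frac {1}{\Omega _{12}}}=\dots =\left (1-\frac {\langle A_{nn}\rangle }{\Xi _{nn}}\right)^{\frac {1}{\Omega _{nn}}} \tag{11}\]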
with the constraint \(\sum _{i,j \in V} \langle A_{ij}\rangle = m\).
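Because to estimate the BCCM we need to fix the expectation of the number of edges between blocks and not between dyads, we proceed as described below. We denote with \(A_{b_{\alpha }}\) the observed number of edges within block b_{α} and with \(\Xi _{b_{\alpha }}=\sum _{i,j\in b_{\alpha }}\Xi _{ij}\) the sum of all the elements of the matrix Ξ corresponding to the dyads within that block; \(\Xi _{b_{\alpha } b_{\beta }}=\sum _{i\in b_{\alpha },j\in b_{\beta }}\Xi _{ij}\) is defined analogously for the dyads between blocks b_{α} and b_{β}. Then, we fix the expectations of the ensemble such that the numbers of edges between and within blocks are given by the \(A_{b_{\alpha } b_{\beta }}\)s and \(A_{b_{\alpha }}\)s. Hence, in the case of the block-constrained configuration model with B blocks we estimate the B·(B+1)/2 parameters \(\omega _{b_{\alpha } b_{\beta }}\) constituting the block-matrix \(\hat {\mathbf {B}}\) by solving the following set of independent equations, defined up to an arbitrary constant k:
\[\left (1-\frac {A_{b_{\alpha } b_{\beta }}}{\Xi _{b_{\alpha } b_{\beta }}}\right)^{\frac {1}{\omega _{b_{\alpha } b_{\beta }}}}=k\qquad \text {for all pairs of blocks } b_{\alpha },b_{\beta }. \tag{12}\]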
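Solving for \(\omega _{b_{\alpha } b_{\beta }}\), we find that the entries of the block-matrix \(\hat {\mathbf {B}}\) that preserve in expectation the observed number of edges between and within blocks are given by
\[\widehat \omega _{b_{\alpha } b_{\beta }}:=-\log \left (1-\frac {A_{b_{\alpha } b_{\beta }}}{\Xi _{b_{\alpha } b_{\beta }}}\right). \tag{13}\]
Since only the ratios between propensities matter, any common rescaling of these entries defines the same model.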
The estimation of the parameters scales quadratically only with the number of blocks.
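A minimal sketch of this estimation step follows. It assumes the closed form ω = −log(1 − A_b/Ξ_b) for each pair of blocks, which is obtained by requiring the quantity (1 − A_b/Ξ_b)^{1/ω} to take the same value for all blocks and fixing the arbitrary scale of the propensities. The function name `fit_block_matrix`, the example graph, and the partition are hypothetical.

```python
from math import log

def fit_block_matrix(adjacency, xi, blocks, n_blocks):
    """Estimate block propensities from block-level sums of A and Xi (directed case)."""
    a_b = [[0] * n_blocks for _ in range(n_blocks)]
    xi_b = [[0] * n_blocks for _ in range(n_blocks)]
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            a_b[bi][bj] += adjacency[i][j]
            xi_b[bi][bj] += xi[i][j]
    omega_hat = [[0.0] * n_blocks for _ in range(n_blocks)]
    for a in range(n_blocks):
        for b in range(n_blocks):
            if xi_b[a][b] == 0:
                continue                       # no possible edges: propensity left at 0
            if a_b[a][b] >= xi_b[a][b]:
                omega_hat[a][b] = float("inf")  # urn block fully used
            else:
                omega_hat[a][b] = -log(1 - a_b[a][b] / xi_b[a][b])
    return omega_hat

# Hypothetical directed multigraph with two planted blocks.
A = [[0, 3, 0, 0],
     [2, 0, 1, 0],
     [0, 0, 0, 2],
     [0, 1, 3, 0]]
blocks = [0, 0, 1, 1]
k_out = [sum(row) for row in A]
k_in = [sum(col) for col in zip(*A)]
Xi = [[ko * ki for ki in k_in] for ko in k_out]
omega_hat = fit_block_matrix(A, Xi, blocks, n_blocks=2)
print(omega_hat)
```

The double loop over vertices only accumulates the block sums; the estimation itself involves one logarithm per pair of blocks, in line with the quadratic scaling in the number of blocks.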
When the parameters of the BCCM are estimated as described here, the block-constrained configuration model has the advantageous property of asymptotic consistency. This means that, if the method described here is applied to synthetic graphs generated from a BCCM, it can correctly recover the original model.
Estimating the Ξ matrix. In the case of the configuration model defined by Eq. 1, the elements Ξ_{ij} of the combinatorial matrix are defined as \(k_{i}^{\text {out}}k_{j}^{\text {in}}\). This definition generates a model that preserves the degree sequences of the observed graph (Casiraghi and Nanumyan 2018). By generalizing the model according to Eq. 4, where the propensity matrix is estimated as in Eq. 13, we introduce constraints on the edge sampling process that allow preserving the observed number of edges in each block. The estimated parameters can hence be interpreted as the bias needed to modify the configuration model to reproduce block structures.
To preserve the degrees of the observed graph in the BCCM, we need to update the combinatorial matrix such that it defines the degree-sequences of the corresponding configuration model as if there were no block constraints. We achieve this by redefining the combinatorial matrix elements as \(\Xi _{ij}=k_{i}^{\text {out}}k_{j}^{\text {in}}\theta _{i}^{\text {out}}\theta _{j}^{\text {in}}\). The estimation of Ξ and Ω is then performed by an expectation-maximization algorithm that iteratively estimates Ξ and Ω such that degrees and blocks are preserved in expectation. A pseudocode for the algorithm estimating the parameters of a BCCM for directed graphs is provided in Algorithm 1. In the case of undirected graphs, the algorithm is adapted according to the fact that Ξ and Ω are upper-triangular matrices.
Case studies
We conclude the article with a case study analysis of synthetic and empirical graphs. We highlight the interpretability of the resulting block-constrained configuration models in terms of deviations from the classical configuration model. In particular, a weak community structure in a graph is reflected in a small contribution to the likelihood of the estimated block-matrix. On the other hand, a strong community structure is reflected in a substantial contribution to the likelihood of the estimated block-matrix. Here, we quantify this difference employing AIC or BIC. However, other information criteria may also be used. Moreover, studying the relative values of the estimated parameters in the block-matrices quantifies how much the configuration model has to be biased towards a block structure to fit the observed graph optimally. The more different the values of the parameters are, the stronger the block structure is compared to what is expected from the configuration model.
We start by analyzing synthetic graphs generated according to different rules, and we show that fitting the block-constrained configuration model parameters allows selecting the correct, i.e., planted, partition of vertices, among a given set of different partitions. We perform three experiments with large directed graphs with clusters of different sizes. Finally, we conclude by employing the BCCM to compare how well different partitions obtained by different clustering algorithms fit popular real-world networks.
Analysis of synthetic graphs.We generate synthetic graphs incorporating ‘activities’ of vertices in a classical SBM, to be able to plant different outdegree sequences in the synthetic graphs. First, we need to assign the given activity to each vertex. Higher activity means that the vertex is more likely to have a higher degree. Second, we need to assign vertices to blocks and assign a probability of sampling edges to each block. Densely connected blocks have a higher probability than weakly connected blocks. The graph is then generated by a weighted sampling of edges with replacement from the list containing all dyads of the graph. The product between the activity corresponding to the fromvertex and the weight corresponding to the block to which the dyad belongs gives samplingweights for each dyad. The probabilities of sampling edges correspond to the normalized weights so that their sum is 1.
For example, let us assume we want to generate a 3vertices graph with two clusters. We can fix the block weights as follows: edges in block 1 or 2 have weight w_{1} and w_{2} respectively; edges between block 1 and block 2 have weight w_{12}. Table 2 shows the list of dyads from which to sample together with their weights, where the activity of vertices is fixed to (a_{1},a_{2},a_{3}), and the first two vertices belong to the first block. Note that if the activities of the vertices were all set to the same value, this process would correspond to the original SBM. In the following experiments, we generate different directed graphs with N=500 vertices, m=40000 edges, and different planted block structures and vertex activities.
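The sampling scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the experiments: the function names `dyad_weights` and `sample_multigraph`, the numeric activities and block weights, and the choice to include self-loops among the dyads are all assumptions made for the example.

```python
import random
from collections import Counter


def dyad_weights(activities, blocks, block_weights):
    """Sampling weight of every ordered dyad (i, j), self-loops included.

    The weight is the activity of the from-vertex i multiplied by the
    weight of the block pair to which the dyad belongs.
    """
    n = len(activities)
    weights = {}
    for i in range(n):
        for j in range(n):
            pair = tuple(sorted((blocks[i], blocks[j])))  # (2, 1) -> (1, 2)
            weights[(i, j)] = activities[i] * block_weights[pair]
    return weights


def sample_multigraph(weights, m, rng):
    """Draw m edges with replacement; return edge counts and probabilities."""
    dyads = list(weights)
    total = sum(weights.values())
    probs = [weights[d] / total for d in dyads]  # normalised to sum to 1
    edges = rng.choices(dyads, weights=probs, k=m)
    return Counter(edges), dict(zip(dyads, probs))


# Hypothetical values for the 3-vertex example: vertices 0 and 1 are in
# block 1, vertex 2 is in block 2.
activities = [3.0, 2.0, 1.0]                              # (a_1, a_2, a_3)
blocks = [1, 1, 2]
block_weights = {(1, 1): 1.0, (2, 2): 3.0, (1, 2): 0.1}   # w_1, w_2, w_12

rng = random.Random(42)
weights = dyad_weights(activities, blocks, block_weights)
counts, probs = sample_multigraph(weights, m=1000, rng=rng)
```

Setting all activities to the same value makes each weight depend only on the block pair, which is the SBM-like case mentioned in the text.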
In the first experiment, we show the difference between estimating the parameters for an SBM and the BCCM when the block structure is given. To do so, we first generate the activities of vertices from an exponential distribution with parameter λ=N/m (such that the expected sum of all activities is equal to the number of edges m we want to sample). After sorting the activity vector in decreasing order, we assign it to the vertices. In this way, the first vertex has the highest activity, and hence the highest out-degree, and so on. In the first experiment, we do not assign block weights, so that the graphs obtained do not show any consistent cluster structure and have a skewed out-degree distribution according to the fixed vertex activity (correlation ∼1).
First, we assign the vertices randomly to two blocks. We proceed by estimating the parameters for an SBM and a BCCM, according to the blocks to which the vertices have been assigned. Since no block structure has been enforced and the vertices have been assigned randomly to blocks, we expect that the estimated parameters for the block matrices \(\hat {\mathbf {B}}_{\text {SBM}}\) and \(\hat {\mathbf {B}}_{\text {BCCM}}\) will all be close to 1 (when normalized by the maximum value), reflecting the absence of a block structure. The resulting estimated parameters for an exemplary realisation are reported in Eq. 14.
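As a minimal sketch of the activity generation step, assuming Python's standard `random` module in place of whatever tooling was actually used: an exponential distribution with rate λ=N/m has mean m/N, so N draws sum to m in expectation. The helper name `draw_activities` and the seed are hypothetical.

```python
import random


def draw_activities(n_vertices, n_edges, rng):
    """Draw one activity per vertex and sort them in decreasing order."""
    rate = n_vertices / n_edges  # lambda = N/m, so the mean activity is m/N
    acts = [rng.expovariate(rate) for _ in range(n_vertices)]
    return sorted(acts, reverse=True)


rng = random.Random(0)
activities = draw_activities(500, 40000, rng)
# Vertex 0 now carries the highest activity, vertex 499 the lowest.
total_activity = sum(activities)  # fluctuates around m = 40000
```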
As expected, the estimated values for both models are close to 1.
After changing the way vertices are assigned to blocks, we repeat the estimation of the two models. Now, we separate the vertices into two blocks such that the first 250 vertices ordered by activity are assigned to the first block and the last 250 to the second one. We expect that the SBM will assign different parameters to the different blocks because now the first block contains all vertices with high degree, and the second block all vertices with low degree. Hence, most of the edges are found between vertices in the first block or between the two blocks. Unlike the SBM, the BCCM corrects for the observed degrees. Hence, we expect that the parameters found for the block-matrix will all be close to 1 again, as no structure beyond the one generated by degrees is present. Thus, the block assignment does not matter for the estimated parameters. The block matrices for the two models, estimated for the same realisation used above, are provided in Eq. 15.
We observe that the SBM assigns different values to each block, impairing the interpretability of the result. In particular, the parameters of \(\hat {\mathbf {B}}_{\text {SBM}}\) show the presence of a core-periphery structure which cannot be distinguished from what is obtained naturally from skewed degree distributions. The estimation of \(\hat {\mathbf {B}}_{\text {BCCM}}\), on the contrary, highlights the absence of any block structure beyond the one generated by the degree sequence, and we can correctly conclude that the degree distributions entirely generate the core-periphery structure of the observed graph.
In the second synthetic experiment, we highlight the model selection features of the BCCM. Because we can compute the likelihood of the model directly, we can efficiently compute information criteria such as AIC or BIC to perform model selection. We generate directed graphs with self-loops with N=500 vertices, m=40000 edges, and two equally sized clusters. Again, we generate vertex activities from an exponential distribution with rate λ=N/m. We fix the block weights to be w_{1}=1, w_{2}=3, and w_{12}=0.1. Using this setup, we can generate synthetic graphs with two clusters, one of which is denser than the other. If we fit a BCCM to the synthetic graph with the correct assignment of vertices to blocks, we obtain the block-matrix \(\hat {\mathbf {B}}_{\text {BCCM}}\) reported in Eq. 16 for an exemplary realization.
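The effect discussed in this experiment can be illustrated with a crude stand-in for the two estimators. The sketch below is an assumption-laden simplification, not the estimation procedure of the article: it compares raw block densities (edges per dyad, an SBM-like quantity) with the ratio of observed block edge counts to the configuration-model expectation, computed as the sum over dyads of (out-degree of the source) × (in-degree of the target) / m. The actual BCCM parameters are estimated differently (Eq. 13), and the function name `block_summary` is hypothetical.

```python
import random


def block_summary(n=500, m=40000, seed=1):
    rng = random.Random(seed)
    # Activities ~ Exp(N/m), sorted so that vertex 0 is the most active.
    acts = sorted((rng.expovariate(n / m) for _ in range(n)), reverse=True)

    # No block weights: draw the from-vertex by activity, the to-vertex
    # uniformly (equivalent to weighting each dyad by the source activity).
    sources = rng.choices(range(n), weights=acts, k=m)
    edges = [(i, rng.randrange(n)) for i in sources]

    # Block 0 = the n/2 most active vertices, block 1 = the rest.
    block = [0 if v < n // 2 else 1 for v in range(n)]
    size = [block.count(0), block.count(1)]
    out_sum, in_sum = [0, 0], [0, 0]  # total out-/in-degree per block
    keys = [(0, 0), (0, 1), (1, 1)]
    observed = dict.fromkeys(keys, 0)
    for i, j in edges:
        out_sum[block[i]] += 1
        in_sum[block[j]] += 1
        observed[tuple(sorted((block[i], block[j])))] += 1

    dyads = dict.fromkeys(keys, 0)
    expected = dict.fromkeys(keys, 0.0)
    for a in (0, 1):
        for b in (0, 1):
            key = tuple(sorted((a, b)))
            dyads[key] += size[a] * size[b]
            # Sum of k_out[i] * k_in[j] / m over all dyads from block a to b.
            expected[key] += out_sum[a] * in_sum[b] / m

    density = {k: observed[k] / dyads[k] for k in keys}
    ratio = {k: observed[k] / expected[k] for k in keys}
    top_d, top_r = max(density.values()), max(ratio.values())
    return ({k: v / top_d for k, v in density.items()},
            {k: v / top_r for k, v in ratio.items()})


sbm_like, degree_corrected = block_summary()
```

Under these assumptions, the normalised densities in `sbm_like` should differ markedly between blocks, while the values in `degree_corrected` should stay close to 1, mirroring the qualitative contrast described in the text.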
We note that we approximately recover the original block weights used to generate the graph.
We can now compare the AIC obtained for the fitted BCCM, AIC_{BCCM}=662060, to that obtained from a simple configuration model (CM) with no block assignment, AIC_{CM}=693540. The CM is formulated in terms of a gHypEG where the propensity matrix Ω≡1. The AIC for the BCCM is considerably smaller, confirming that the model with block structure fits the observed graph better. In terms of AIC differences, \(\Delta ^{\text {AIC}}_{\text {BCCM}}=0\) and \(\Delta ^{\text {AIC}}_{\text {CM}}=31480\). This corresponds to model weights w_{BCCM}∼1 and w_{CM}∼0. That means that there is no evidence for the CM. As a benchmark, we compute the AIC for BCCMs where the vertices have been assigned randomly to the two blocks.
Table 3 reports the AIC differences obtained for 1000 random assignments of vertices to the blocks, computed on the same observed graph. We observe that this usually results in values close to that of the simple configuration model, as the block assignments do not reflect the structure of the graph. In a few cases, a small number of vertices are correctly assigned to blocks, showing a slight reduction in AIC, which is however far from that of the correct assignment.
The BCCM also allows comparing models with a different number of blocks. To do so, we separate the vertices in one of the blocks of the model above into two new blocks. Because we add more degrees of freedom, we expect an increase in the likelihood of the new BCCM with three blocks, but this should not be enough to give a considerable decrease in AIC. Since the synthetic graph has been built planting two blocks, the AIC should allow us to select as an optimal model the BCCM with two blocks. The resulting block-matrix \(\hat {\mathbf {B}}_{\text {BCCM}}^{(3)}\) with three blocks is reported in Eq. 17.
We see that the estimated model fits different parameter values for the two sub-blocks, since the added parameters can now accommodate random variations generated by the edge sampling process. However, as expected, there is no (statistical) evidence to support the more complex model. In fact, comparing the AIC values we obtain \(\text {AIC}_{\text {BCCM}}^{(3)} = 662065 > 662060 = \text {AIC}_{\text {BCCM}}\). This corresponds to \(\Delta ^{\text {AIC}}_{\text {BCCM}}=0\) and \(\Delta ^{\text {AIC}}_{\text {BCCM}^{(3)}}=5\). In terms of model weights, we get w_{BCCM}∼0.92 and \(w_{\text {BCCM}}^{(3)}\sim 0.08\). That means that there is strong evidence against the more complex model, as the probability that the more complex model is closer to the real process is only 0.08, given the data used to estimate the model.
To provide more evidence in support of this selection procedure, we can repeat this experiment on 100 samples from the same model used before. The results provide median AIC differences of Δ_{BCCM}=0 and \(\Delta _{\text {BCCM}}^{(3)}=4.32\). Moreover, out of the 100 samples only 7 have \(\text {AIC}_{\text {BCCM}}^{(3)} < \text {AIC}_{\text {BCCM}}\). This is aligned with the probability of 0.08 estimated employing model weights. We can thus successfully use the BCCM to perform model selection, both across different numbers of clusters and across different vertex assignments.
In the third experiment, instead of two clusters, we plant three clusters of different sizes (B_{1}=250, B_{2}=125, B_{3}=125). We choose the block parameters such that one of the smaller clusters is more densely connected with the bigger cluster, and the other smaller cluster is relatively denser than the others. To do so, we choose the block weights as follows: w_{1}=w_{2}=1, w_{3}=3, w_{13}=w_{23}=0.1, w_{12}=0.8. As before, we draw vertex activities from an exponential distribution with parameter λ=N/m. One exemplary realisation is plotted in Fig. 5. The plot clearly shows the separation into three clusters, with cluster 1 (purple) and 2 (orange) more densely connected to each other than to cluster 3 (green). Fitting BCCMs as before allows comparing the AIC of the 3-block BCCM to that of the 2-block BCCM. In this case, we expect that the model with three blocks will fit the graph considerably better. Results of the fitting for the realisation plotted in Fig. 5 give \(\text {AIC}_{\text {BCCM}}^{(3)} = 673585 < 699765 = \text {AIC}_{\text {BCCM}}^{(2)}\), correctly selecting the more complex model. This corresponds to \(\Delta ^{\text {AIC}}_{\text {BCCM}^{(2)}}=26180\) and \(\Delta ^{\text {AIC}}_{\text {BCCM}^{(3)}}=0\). In terms of model weights, we get \(w_{\text {BCCM}}^{(2)}\sim 0\) and \(w_{\text {BCCM}}^{(3)}\sim 1\). That means that there is strong evidence against the simpler model.
It is known that AIC does not punish model complexity as much as BIC. For this reason, we also compare the values of BIC obtained for the two models. Also in this case, with \(\text {BIC}_{\text {BCCM}}^{(3)} = 2822787 < 2848941 = \text {BIC}_{\text {BCCM}}^{(2)}\), the information criterion correctly selects the model with 3 blocks. Comparing posterior probabilities for the two models, we get again \(w_{\text {BCCM}}^{(2)}\sim 0\) and \(w_{\text {BCCM}}^{(3)}\sim 1\).
Finally, we can use AIC and BIC to evaluate and rank the goodness-of-fit of different block assignments that are obtained from various community detection algorithms. This allows choosing the best block assignment in terms of deviations from the configuration model, i.e., which of the detected block assignments better captures the block structure that goes beyond that generated by the degree sequence of the observed graph. We compare the results obtained from 5 different algorithms run using their igraph implementation for R. In the following we use: cluster_fast_greedy, a greedy optimisation of modularity (Clauset et al. 2004); cluster_infomap, the implementation of infomap available through igraph (Rosvall and Bergstrom 2008); cluster_label_prop, a label propagation algorithm (Raghavan et al. 2007); cluster_spinglass, which finds communities in graphs via a spin-glass model and simulated annealing (Reichardt and Bornholdt 2006); cluster_louvain, the Louvain multi-level modularity optimisation algorithm (Blondel et al. 2008). As the modularity maximization algorithms are implemented only for undirected graphs, we apply them to the undirected version of the observed graph. The results of the application of the 5 different algorithms on the realisation shown in Fig. 5 are reported in Table 4.
The five different community detection algorithms find three different block structures. Three of them are not able to detect the third block, while the other two algorithms split the vertices into too many blocks. AIC ranks infomap best, even though it detects one block too many. BIC penalizes the number of parameters more strongly, and therefore ranks the 2-block structure best. These results are consistent when repeating the experiment with different synthetic graphs generated from the same model. It is worth noting that none of the community detection algorithms was able to detect the planted block structure correctly. However, both the AIC and BIC of the BCCM fitted with the correct block structure are lower than those obtained with the block structures found by the different algorithms. This shows that information criteria computed using the BCCM have the potential to support the development of novel community detection algorithms that are particularly suited for applications where degree correction is crucial. However, the development of such algorithms is beyond the scope of this article and is left to future investigations.
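The model weights quoted above are consistent with the usual Akaike weights, in which each model receives a weight proportional to exp(−Δ/2), with Δ its AIC difference from the best model. The following Python sketch reproduces the arithmetic under that assumption; the function name `akaike_weights` is hypothetical and is not part of any package mentioned in the article.

```python
import math


def akaike_weights(aic):
    """Return AIC differences and normalised Akaike weights per model."""
    best = min(aic.values())
    delta = {name: value - best for name, value in aic.items()}
    # exp(-delta/2) underflows to 0.0 for very large differences,
    # which is the desired behaviour here.
    relative = {name: math.exp(-0.5 * d) for name, d in delta.items()}
    total = sum(relative.values())
    return delta, {name: r / total for name, r in relative.items()}


# AIC values reported in the text for the two fitted models.
delta, weights = akaike_weights({"BCCM": 662060, "CM": 693540})
```

With a difference of 31480 the weight of the configuration model is numerically indistinguishable from zero, which matches the statement that there is no evidence for it.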
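The quoted weights can be checked by hand, assuming they are the standard Akaike weights (each model weighted by the exponential of minus half its AIC difference, normalised over the two candidates):

```latex
\[
w_{\text{BCCM}} = \frac{e^{-0/2}}{e^{-0/2} + e^{-5/2}}
               = \frac{1}{1 + 0.0821} \approx 0.92,
\qquad
w_{\text{BCCM}^{(3)}} = \frac{e^{-5/2}}{e^{-0/2} + e^{-5/2}}
                      = \frac{0.0821}{1.0821} \approx 0.08 .
\]
```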
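For reference, the standard textbook definitions of the two criteria make the difference in penalty explicit. The symbols below are generic notation and are not taken from this article: \(\hat{L}\) is the maximized likelihood, \(k\) the number of estimated parameters, and \(n\) the number of observations (which quantity plays the role of \(n\) for a BCCM is not restated here, so it is left symbolic). The second line assumes equal prior probabilities for the two models and the usual approximation of posterior model probabilities by \(e^{-\Delta^{\text{BIC}}/2}\).

```latex
\[
\text{AIC} = 2k - 2\ln\hat{L},
\qquad
\text{BIC} = k\ln n - 2\ln\hat{L}.
\]
\[
\Delta^{\text{BIC}}_{\text{BCCM}^{(2)}} = 2848941 - 2822787 = 26154,
\qquad
\frac{w^{(2)}_{\text{BCCM}}}{w^{(3)}_{\text{BCCM}}} \approx e^{-26154/2} \approx 0 .
\]
```

The BIC penalty per parameter, \(\ln n\), exceeds the AIC penalty of 2 as soon as \(n > e^{2} \approx 7.4\), which is why BIC favours smaller models for any realistically sized data set.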
Analysis of empirical graphs. We conclude this article by providing a comparison of the BCCMs obtained by fitting the block structures detected by the five community detection algorithms described above on five different real-world networks. The results show that different algorithms perform better for different graphs, highlighting the non-trivial effect that degrees have on block structure and community detection in general.
We study five well-known graphs with heterogeneous characteristics and sizes. All graphs are multi-edge, and are freely available as datasets within the igraphdata R package. The first graph analyzed is rfid: hospital encounter network data. It consists of 32424 undirected edges between 75 individuals (Vanhems et al. 2013). The second graph analyzed is karate: Zachary’s Karate Club. It consists of 231 undirected edges between 34 vertices (Zachary 1977). The third graph analyzed is UKfaculty: Friendship network of a UK university faculty. It consists of 3730 directed edges between 81 vertices (Nepusz et al. 2008). The fourth graph is USairports: US airport network of December 2010. It consists of 23473 directed edges between 755 airports (Von Mering et al. 2002). It has self-loops. The graph is plotted in Fig. 6, using the forceatlas2 layout (Jacomy et al. 2014). The four different plots are colored according to the block structures detected by four of the five algorithms (cluster_spinglass cannot be applied as the graph is disconnected). They are ordered by increasing AIC. From the visualization, we can see that the best block structure is the one that can separate three different blocks within the largest cluster of vertices (top of the visualizations). In particular, it is essential to note that the largest cluster consists of high- and low-degree vertices. If these vertices belong to the same block, the configuration model predicts that high-degree vertices should be connected by many edges (similarly to the first synthetic experiment described above). However, we observe that some of these high-degree vertices are separated and mainly connected to low-degree vertices. For this reason, block structures that can separate these high-degree vertices into different blocks rank higher than others. The fifth graph analyzed is enron: Enron Email Network. It consists of 125409 directed edges between 184 individuals (Priebe et al. 2005). It has self-loops.
Each of these graphs has a clear block structure that could be detected. The different algorithms provide different results, both in the number of blocks detected and in the assignment of vertices. Ranking the different results by the goodness-of-fit of BCCMs fitted according to the different block partitions shows that the best results are not necessarily those with fewer or more blocks, nor those obtained from a specific algorithm, as the results change with the graph studied. The results of this analysis are provided in Table 5, where the smallest AICs and BICs for each graph are highlighted in bold, together with the algorithm that provides the smallest number of blocks. The algorithm that provides the largest number of blocks is highlighted in italic.
Conclusion
In this article we have presented a novel generative model for clustered graphs: the block-constrained configuration model. It generalizes the standard configuration model of random graphs by constraining edges within blocks while preserving degree distributions. The BCCM builds on the generalized hypergeometric ensemble of random graphs, by giving the propensity matrix Ω a block structure. The framework provided by gHypEG allows for a fast estimation of the parameters of the model. Moreover, because the probability distribution underlying gHypEG is known, it allows for the generation of random realizations, as well as for the effortless computation of likelihoods, and hence of various kinds of information criteria and goodness-of-fit measures, such as AIC and BIC.
The formulation highlighted above has many advantages. Firstly, the proposed model seamlessly applies to directed and undirected graphs with or without self-loops. Moreover, the BCCM preserves exactly the number of edges in the graph, avoiding the need for assuming an arbitrary edge generating process. This also allows dropping the assumption of independence between edge probabilities, which characterizes degree-corrected stochastic block models. Finally, model selection, facilitated by the gHypE framework, provides a natural method to quantify the optimal number of blocks needed to model a given real-world graph. The statistical significance of a block structure can be studied by performing likelihood-ratio tests (Casiraghi et al. 2016), or by comparing information criteria such as AIC, BIC, or the description length of the estimated models. Furthermore, within the framework of generalized hypergeometric ensembles, block-constrained configuration models can be extended to include heterogeneous properties of vertices or edges (see Casiraghi 2017).
The more complicated expression and estimation of the BCCM compared to DCSBMs arises from dropping the assumption of independence between edge probabilities. However, thanks to the formulation provided in this article, the BCCM is still practicable and can be applied to empirical graphs of various sizes. The BCCM opens new routes to develop community detection algorithms suitable for applications where degree correction is particularly valuable, and where the assumption of an arbitrary edge generating process is not acceptable.
Availability of data and materials
The datasets generated and/or analysed during the current study are available as a GitHub repository, https://github.com/gi0na/BCCMSupportingMaterial.git. The combinatorial matrices corresponding to Fig. 3b and c are included within the article (and its additional file(s)). A software implementation of the BCCM can be found as part of the R package ghypernet, available at https://github.com/gi0na/ghypernet.git.
Abbreviations
 AIC: Akaike information criterion
 BCCM: Block-constrained configuration model
 BIC: Bayesian information criterion
 DCSBM: Degree-corrected stochastic block model
 gHypEG: Generalised hypergeometric ensemble of random graphs
 SBM: Stochastic block model
References
Akaike, H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705.
Bender, EA, Canfield ER (1978) The asymptotic number of labeled graphs with given degree sequences. J Comb Theory Ser A 24(3):296–307. https://doi.org/10.1016/0097-3165(78)90059-6.
Blondel, VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):10008.
Bozdogan, H (1987) Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika 52(3):345–370. https://doi.org/10.1007/BF02294361.
Burnham, KP, Anderson DR (eds) (2004) Model Selection and Multimodel Inference. Springer, New York. https://doi.org/10.1007/b97636. http://link.springer.com/10.1007/b97636.
Caldarelli, G, Chessa A, Pammolli F, Gabrielli A, Puliga M (2013) Reconstructing a credit network. Nat Phys. https://doi.org/10.1038/nphys2580.
Casiraghi, G (2017) Multiplex Network Regression: How do relations drive interactions? arXiv preprint arXiv:1702.02048. http://arxiv.org/abs/1702.02048.
Casiraghi, G, Nanumyan V (2018) Generalised hypergeometric ensembles of random graphs: the configuration model as an urn problem. http://arxiv.org/abs/1810.06495.
Casiraghi, G, Nanumyan V, Scholtes I, Schweitzer F (2016) Generalized Hypergeometric Ensembles: Statistical Hypothesis Testing in Complex Networks. arXiv preprint arXiv:1607.02441. http://arxiv.org/abs/1607.02441.
Chesson, J (1978) Measuring Preference in Selective Predation. Ecology 59(2):211–215.
Chung, F, Lu L (2002) Connected Components in Random Graphs with Given Expected Degree Sequences. Ann Comb 6(2):125–145. https://doi.org/10.1007/PL00012580.
Chung, F, Lu L (2002) The average distances in random graphs with given expected degrees. Proc Natl Acad Sci 99(25):15879–15882. https://doi.org/10.1073/pnas.252631999.
Clauset, A, Newman ME, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):066111.
Consul, PC, Jain GC (1973) A Generalization of the Poisson Distribution. Technometrics 15(4):791–799. https://doi.org/10.1080/00401706.1973.10489112.
DeLeeuw, J (1992) Introduction to Akaike 1973 Information Theory and an Extension of the Maximum Likelihood Principle:599–609. https://doi.org/10.1007/9781461209195_37. http://link.springer.com/10.1007/9781461209195_37.
Erdös, P, Rényi A (1959) On random graphs I. Publ Math Debrecen 6:290–297.
Fienberg, SE, Meyer MM, Wasserman SS (1985) Statistical Analysis of Multiple Sociometric Relations. J Am Stat Assoc 80(389):51–67. https://doi.org/10.1080/01621459.1985.10477129.
Fog, A (2008) Calculation Methods for Wallenius’ Noncentral Hypergeometric Distribution. Commun Stat Sim Comput 37(2):258–273. https://doi.org/10.1080/03610910701790269.
Fog, A (2008) Sampling Methods for Wallenius’ and Fisher’s Noncentral Hypergeometric Distributions. Commun Stat Sim Comput 37(2):241–257. https://doi.org/10.1080/03610910701790236.
Granovetter, MS (1973) The Strength of Weak Ties. Am J Soc 78(6):1360–1380. https://doi.org/10.1086/225469.
Holland, PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: First steps. Social Networks 5(2):109–137. https://doi.org/10.1016/0378-8733(83)90021-7.
Jacomy, M, Venturini T, Heymann S, Bastian M (2014) Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PLoS ONE 9(6):98679.
Karrer, B, Newman MEJ (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):16107. https://doi.org/10.1103/PhysRevE.83.016107.
Krivitsky, PN (2012) Exponential-family random graph models for valued networks. Electron J Stat 6:1100–1128. https://doi.org/10.1214/12-EJS696.
Kuha, J (2004) AIC and BIC. Soc Methods Res 33(2):188–229. https://doi.org/10.1177/0049124103262065.
Melnik, S, Porter MA, Mucha PJ, Gleeson JP (2014) Dynamics on modular networks with heterogeneous correlations. Chaos Interdiscip J Nonlinear Sci 24(2):023106.
Molloy, M, Reed B (1995) A critical point for random graphs with a given degree sequence. Random Struct Algorithms 6(2–3):161–180. https://doi.org/10.1002/rsa.3240060204.
Molloy, M, Reed B (1998) The Size of the Giant Component of a Random Graph with a Given Degree Sequence. Comb Probab Comput 7(3):295–305.
Nanumyan, V, Garas A, Schweitzer F (2015) The Network of Counterparty Risk: Analysing Correlations in OTC Derivatives. PLoS ONE 10(9):0136638. https://doi.org/10.1371/journal.pone.0136638. http://arxiv.org/abs/1506.04663.
Nepusz, T, Petróczi A, Négyessy L, Bazsó F (2008) Fuzzy communities and the concept of bridgeness in complex networks. Phys Rev E 77(1):016107.
Newcomb, T, Heider F (1958) The Psychology of Interpersonal Relations. Am Soc Rev. https://doi.org/10.2307/2089062. http://arxiv.org/abs/arXiv:1011.1669v3.
Newman, MEJ, Peixoto TP (2015) Generalized Communities in Networks. Phys Rev Lett 115(8):088701. https://doi.org/10.1103/PhysRevLett.115.088701.
Newman, MEJ, Reinert G (2016) Estimating the Number of Communities in a Network. Phys Rev Lett 117(7):078301. https://doi.org/10.1103/PhysRevLett.117.078301.
Park, J, Newman MEJ (2004) Statistical mechanics of networks. Phys Rev E 70(6):066117. https://doi.org/10.1103/PhysRevE.70.066117.
Peixoto, TP (2012) Entropy of stochastic blockmodel ensembles. Phys Rev E 85(5):056122. https://doi.org/10.1103/PhysRevE.85.056122.
Peixoto, TP (2014) Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Phys Rev X 4(1):011047. https://doi.org/10.1103/PhysRevX.4.011047.
Peixoto, TP (2015) Model Selection and Hypothesis Testing for Large-Scale Network Models with Overlapping Groups. Phys Rev X 5(1):011033. https://doi.org/10.1103/PhysRevX.5.011033.
Peixoto, TP (2017) Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys Rev E 95(1):012317. https://doi.org/10.1103/PhysRevE.95.012317.
Peixoto, TP (2018) Reconstructing Networks with Unknown and Heterogeneous Errors. Phys Rev X 8(4):041011. https://doi.org/10.1103/PhysRevX.8.041011.
Priebe, CE, Conroy JM, Marchette DJ, Park Y (2005) Scan statistics on enron graphs. Comput Math Org Theory 11(3):229–247.
Raghavan, UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106.
Reichardt, J, Bornholdt S (2006) Statistical mechanics of community detection. Phys Rev E 74(1):016110.
Rosvall, M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123.
Schwarz, G, et al (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464.
Squartini, T, Mastrandrea R, Garlaschelli D (2015) Unbiased sampling of network ensembles. New J Phys 17(2):023052. https://doi.org/10.1088/1367-2630/17/2/023052. http://arxiv.org/abs/1406.1197.
Stone, CJ (1982) Local asymptotic admissibility of a generalization of Akaike’s model selection rule. Ann Inst Stat Math 34(1):123–133. https://doi.org/10.1007/BF02481014.
Vanhems, P, Barrat A, Cattuto C, Pinton JF, Khanafer N, Régis C, Kim Ba, Comte B, Voirin N (2013) Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PloS ONE 8(9):73970.
Von Mering, C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399.
Wallenius, KT (1963) Biased Sampling: the Noncentral Hypergeometric Probability Distribution. Ph.d. thesis. https://doi.org/10.21236/ad0426243.
Zachary, WW (1977) An Information Flow Model for Conflict and Fission in Small Groups. J Anthropol Res 33(4):452–473.
Acknowledgements
The author thanks Frank Schweitzer for his support and valuable comments, and Laurence Brandenberger, Giacomo Vaccario and Vahan Nanumyan for useful discussions.
Funding
Not applicable.
Author information
Contributions
The author read and approved the final manuscript.
Corresponding author
Correspondence to Giona Casiraghi.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Casiraghi, G. The block-constrained configuration model. Appl Netw Sci 4, 123 (2019). https://doi.org/10.1007/s41109-019-0241-1
Keywords
 Block model
 Community structure
 Random graphs
 Configuration model
 Network analysis
 gHypEG