
Testing biological network motif significance with exponential random graph models

Abstract

Analysis of the structure of biological networks often uses statistical tests to establish the over-representation of motifs, which are thought to be important building blocks of such networks, related to their biological functions. However, there is disagreement as to the statistical significance of these motifs, and there are potential problems with standard methods for estimating this significance. Exponential random graph models (ERGMs) are a class of statistical model that can overcome some of the shortcomings of commonly used methods for testing the statistical significance of motifs. ERGMs were first introduced into the bioinformatics literature over 10 years ago but have had limited application to biological networks, possibly due to the practical difficulty of estimating model parameters. Advances in estimation algorithms now afford analysis of much larger networks in practical time. We illustrate the application of ERGM to both an undirected protein–protein interaction (PPI) network and directed gene regulatory networks. ERGM models indicate over-representation of triangles in the PPI network, and confirm results from previous research as to over-representation of transitive triangles (feed-forward loop) in an E. coli and a yeast regulatory network. We also confirm, using ERGMs, previous research showing that under-representation of the cyclic triangle (feedback loop) can be explained as a consequence of other topological features.

Introduction

Molecular interactions in biological systems are often represented as networks (Winterbach et al. 2013). Some such networks are inherently undirected, such as protein–protein interaction (PPI) networks (De Las Rivas and Fontanillo 2010). Others may be directed, such as gene regulatory networks, where nodes represent operons, and arcs (directed edges) represent transcriptional interactions between them. Much research with such biological networks has concerned “motifs”, small subgraphs which occur more frequently than would be expected by chance. Motifs have been considered the building blocks of complex networks (Alon 2007; Ciriello and Guerra 2008; Milo et al. 2002; Shen-Orr et al. 2002). The biological significance of network motifs derives from their possible interpretation as signs of evolutionary events (Middendorf et al. 2005; Rice et al. 2005).

Two simple examples of motifs in undirected networks are triangles (three-cycles) and squares (four-cycles) (Rice et al. 2005). Directed networks allow for a larger set of potentially important motifs (Middendorf et al. 2005; Milo et al. 2002; Rice et al. 2005), which can be quite complicated, leading to problems of consistency in their definition (Konagurthu and Lesk 2008b).

It is worth noting that such (three-node) motifs are an idea with a long history in social network analysis, where the counts of all sixteen possible three-node directed graphs (triads) are known as the triad census (Davis and Leinhardt 1967; Holland and Leinhardt 1970, 1976; Wasserman and Faust 1994). A systematic naming convention has been developed that is based on the number of mutual, asymmetric, and null (M, A, and N) dyads in the triad, followed by a letter to distinguish the orientation if it is not unique (Fig. 1). For example, the transitive triangle is designated 030T, which distinguishes it from the cyclic triad 030C. Although in common usage in social network research, and cited by Milo et al. (2002) and Saul and Filkov (2007) in the context of biological networks, this naming convention is rarely used in discussions of motifs in the bioinformatics or biology literature. There are efficient algorithms for computing the triad census (Batagelj and Mrvar 2001; Moody 1998), implemented in widely used general purpose graph libraries such as igraph (Csárdi and Nepusz 2006) and NetworkX (Hagberg et al. 2008). The triad census has recently been extended to colored triads, that is, distinguishing the nodes in the triads based on a categorical attribute assigned to them (Lienert et al. 2019). It has long been noted in the social networks literature that the dyad census constrains the triad census, and yet empirical social networks often still have counts for some triads greater than expected given those constraints (Faust 2010).

Fig. 1

Triad census classes labeled with the MAN (mutual, asymmetric, null) dyad census naming convention. When the dyad census does not uniquely identify a triad, a letter designating “up”, “down”, “transitive”, or “cyclic” is appended
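As an illustration of the MAN naming convention (a minimal sketch, not the triad census algorithms cited above), the following Python fragment counts the mutual, asymmetric, and null dyads of a single triad and, for the 030 case only, appends the T or C suffix. The function names `man_code` and `label_030` are hypothetical, and the orientation letters of the other classes (such as 021D or 021U) are deliberately not resolved here.

```python
from itertools import combinations


def man_code(nodes, arcs):
    """Count mutual, asymmetric, and null dyads among three nodes."""
    arcs = set(arcs)
    mutual = asymmetric = null = 0
    for u, v in combinations(nodes, 2):
        forward, backward = (u, v) in arcs, (v, u) in arcs
        if forward and backward:
            mutual += 1
        elif forward or backward:
            asymmetric += 1
        else:
            null += 1
    return f"{mutual}{asymmetric}{null}"


def label_030(nodes, arcs):
    """Append T or C when the MAN code is 030; other codes are returned as is."""
    code = man_code(nodes, arcs)
    if code != "030":
        return code
    out_degree = {u: sum(1 for s, _ in set(arcs) if s == u) for u in nodes}
    # In a cycle every node sends exactly one arc; otherwise the triad is transitive.
    return "030C" if all(d == 1 for d in out_degree.values()) else "030T"


transitive = [("a", "b"), ("a", "c"), ("b", "c")]  # feed-forward loop
cyclic = [("a", "b"), ("b", "c"), ("c", "a")]      # feedback loop
print(label_030("abc", transitive), label_030("abc", cyclic))
```

Both example triads have the same dyad census (no mutual, three asymmetric, no null dyads), which is why the suffix letter is needed to tell them apart.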

To determine if a motif is over-represented, the count of the motif in an observed network is compared to the distribution of its counts in a set of simulated random networks (Ciriello and Guerra 2008). It is also possible to determine the significance of motif over-representation without simulation (Martorana et al. 2020; Picard et al. 2008). The simulation approach raises the problem of choosing the appropriate random networks (null model). Some supposed motifs have been found not to be significantly over-represented, but to occur with the observed frequencies simply due to topological properties of random networks (Konagurthu and Lesk 2008a) or correlations between motifs created by the randomization process (Ginoza and Mugler 2010), although such correlations can also occur even with uniform sampling (Fodor et al. 2020).

Estimating motif (triad census) significance by comparing the triad census of an empirical network to that of ensembles of random graphs also has a long history, for example the conditional uniform graph (CUG) distribution (Anderson et al. 1999; Butts 2008; Mayhew 1984), conditional on the dyad census (U|MAN) (Holland and Leinhardt 1976), or on the degree distribution (Snijders 1991). A more modern variation on a similar idea is the dk-series (Mahadevan et al. 2006; Orsini et al. 2015), a sequence of nested network distributions of increasing complexity, fitting in turn density, degree distribution, degree homophily, average local clustering, and clustering by degree (Orsini et al. 2015).

The recent work of Fodor et al. (2020) shows that the assumptions of mainstream methods for motif identification, specifically normally distributed motif frequencies and independence of motifs, do not always hold, and that, as a consequence, such methods cannot always correctly estimate the statistical significance of motif over-representation.

Aside from such intrinsic statistical limitations, it may be the case that the apparent statistical over-representation of motifs has no evolutionary or functional significance (Ingram et al. 2006; Mazurie et al. 2005; Payne and Wagner 2015), and the choice of null model is a critical factor in this lack of evident relationship between over-representation and evolutionary preservation (Beber et al. 2012; Mazurie et al. 2005). Alternatively, the apparent lack of functional significance (Payne and Wagner 2015) may be due to too narrow a definition of “function” (Ahnert and Fink 2016). Recently, it has also been suggested that elementary motifs are a lower level of structure than that which is most functionally relevant in gene regulatory networks characterizing different physiological states (Lesk and Konagurthu 2021).

It might also be the case that particular motifs are over-represented, not because they are evolutionarily selected for function, but because of spatial clustering (Artzy-Randrup et al. 2004). For example, in the context of PPI networks, we might expect that interactions would be over-represented between proteins that share a subcellular location, and under-represented between those that do not, since proteins known to interact usually have the same subcellular locations (von Mering et al. 2002). Indeed, PPI networks can be used as predictors of subcellular location (Kumar and Ranganathan 2010; Shin et al. 2009).

There are many algorithms for motif discovery in complex networks; for recent reviews, see Jazayeri and Yang (2020), Patra and Mohapatra (2020) and Yu et al. (2020). In the present work we are considering only static, not temporal, networks. Although they differ in many details, especially regarding computational efficiency and scalability, these motif discovery algorithms work fundamentally in the manner described above. That is, they count occurrences of a motif in the observed network, and compare this to the distribution of the motif’s frequency in an ensemble of randomized versions of the original network (typically preserving degree sequence). Therefore, these conventional methods all test the significance of one motif at a time, assuming independence of motifs, and are all potentially subject to the problems described by Fodor et al. (2020), mentioned above: the assumptions of independence and normally distributed motif frequencies may not hold, in which case these methods may not correctly estimate the statistical significance of motif over-representation.
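The conventional procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any of the published motif discovery algorithms: the graph is undirected and simple, the motif is the triangle, the null ensemble is generated by repeated double edge swaps (which preserve every node's degree), and significance is summarized by a z-score, which itself relies on the normality assumption questioned above. All function names and default values are hypothetical.

```python
import random
from statistics import mean, pstdev


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Rewire a-b, c-d into a-d, c-b, rejecting self-loops and multiple edges."""
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    done = attempts = 0
    # The attempt cap means fewer than n_swaps swaps may be made on dense graphs.
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a node
        new_1, new_2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new_1 in present or new_2 in present:
            continue  # would create a multiple edge
        present.difference_update((edge_list[i], edge_list[j]))
        present.update((new_1, new_2))
        edge_list[i], edge_list[j] = new_1, new_2
        done += 1
    return edge_list


def triangle_z_score(edges, n_random=200, swaps_per_edge=10, seed=0):
    """Compare the observed triangle count with a degree-preserving null ensemble."""
    rng = random.Random(seed)
    observed = count_triangles(edges)
    null_counts = [
        count_triangles(degree_preserving_shuffle(edges, swaps_per_edge * len(edges), rng))
        for _ in range(n_random)
    ]
    mu, sigma = mean(null_counts), pstdev(null_counts)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    return observed, mu, sigma, z


# Toy network: five disjoint triangles on nodes 0..14.
edges = [(3 * i + a, 3 * i + b) for i in range(5) for a, b in ((0, 1), (0, 2), (1, 2))]
observed, mu, sigma, z = triangle_z_score(edges)
print(observed, mu, sigma, z)
```

Note that the test is applied to one motif at a time and that only the degree sequence is held fixed; any other feature of the network that promotes triangles is absorbed into the apparent significance.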

In this work we describe a different approach to determining motif significance in complex networks, which can potentially overcome these problems. Rather than comparing the observed frequency of a candidate motif to its frequency in a set of randomized networks, we take a model-based approach. Specifically, we estimate parameters of a model (an exponential random graph model, abbreviated ERGM) of the observed network. These parameters correspond to substructures which resemble potential motifs of interest. This allows the significance of the candidate motifs to be tested simultaneously in a single model, in such a way that independence of the motifs is not assumed.

Once such a model is estimated, it can also be used to test for motif significance in the traditional way, using the ERGM to simulate an ensemble of random networks. Recently, this approach was used to test for motifs (dyads, triads, and tetrads; that is, two-, three-, and four-node motifs) in a collection of social (rather than biological) networks (Felmlee et al. 2021). Using ERGM rather than degree-preserving randomization “reduces the scope for misleading results by controlling for multiple, potential correlates in the same set of random models” (Felmlee et al. 2021, p. 2).

We demonstrate the ERGM approach in biological networks (both undirected (PPI) and directed gene regulatory networks) using some recently developed ERGM estimation methods (Borisenko et al. 2019; Byshkin et al. 2016, 2018; Stivala et al. 2020), which allow estimation of models for larger networks than was practical with earlier methods of ERGM parameter estimation.

The remainder of this article is organized as follows. First, we describe ERGMs, and review the literature on the application of ERGMs to biological networks. We then report the biological networks considered in this work, and the details of the ERGM configurations, estimation methods, and goodness-of-fit tests we used. Following that, we present and discuss new ERGM models of these networks, comparing the inferences as to motif significance with existing published results using conventional motif discovery methods. In the next section, we detail the limitations of this application of ERGMs, and indicate some potential future work. We conclude with a summary of the inferences drawn from the ERGM models of the networks considered.

Exponential random graph models

ERGMs are widely used in the social sciences, typically to model social networks (Amati et al. 2018; Koskinen 2020; Lusher et al. 2013; Robins et al. 2007a). Cimini et al. (2019) is a recent review of ERGMs for modeling real-world networks, from a statistical physics viewpoint.

An ERGM is a probability distribution with the form

$$\begin{aligned} \Pr (X =x) = \frac{1}{\kappa (\theta )}\exp \left( \sum _A \theta _A z_A(x)\right) \end{aligned}$$
(1)

where

  • \(X = [X_{ij}]\) is a 0–1 matrix of random tie variables,

  • x is a realization of X,

  • A is a “configuration”, a (small) set of nodes and a subset of ties between them,

  • \(z_A(x)\) is the network statistic for configuration A,

  • \(\theta _A\) is a model parameter corresponding to configuration A,

  • \(\kappa (\theta )\) is a normalizing constant to ensure a proper distribution.

Given an observed network x, we aim to find the parameter vector \(\theta\) which maximizes the probability of x under the model. Then for each configuration A in the model, its corresponding parameter \(\theta _A\) and its estimated standard error allow us to make inferences about the over- or under-representation of that configuration in the observed network. If \(\theta _A\) is significantly different from zero, then configuration A is over-represented when \(\theta _A > 0\) and under-represented when \(\theta _A < 0\), conditional on the other configurations in the model.
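To make Eq. (1) concrete, the following sketch evaluates the distribution exactly for a toy case by enumerating every undirected graph on a small number of nodes. It is an illustration only, assuming a model with just two configurations (edges and triangles) and hypothetical function names; it also shows why the normalizing constant \(\kappa (\theta )\) cannot be computed this way for real networks, since the number of graphs is \(2^{n(n-1)/2}\).

```python
from itertools import combinations
from math import exp


def all_graphs(n):
    """Yield every undirected simple graph on n nodes as a frozenset of edges."""
    dyads = list(combinations(range(n), 2))
    for mask in range(2 ** len(dyads)):
        yield frozenset(d for k, d in enumerate(dyads) if mask >> k & 1)


def stats(edges, n):
    """Network statistics z_A(x): (number of edges, number of triangles)."""
    triangles = sum(
        1 for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= edges
    )
    return (len(edges), triangles)


def ergm_distribution(n, theta):
    """Return Pr(X = x) for every graph x, normalizing by kappa(theta)."""
    weights = {
        g: exp(sum(t * z for t, z in zip(theta, stats(g, n))))
        for g in all_graphs(n)
    }
    kappa = sum(weights.values())
    return {g: w / kappa for g, w in weights.items()}


def expected_stats(n, theta):
    """Expected value of each statistic under the model."""
    dist = ergm_distribution(n, theta)
    return tuple(
        sum(p * stats(g, n)[i] for g, p in dist.items())
        for i in range(len(theta))
    )


# With all parameters zero, every one of the 64 graphs on 4 nodes is equally likely.
print(expected_stats(4, (0.0, 0.0)))
# A positive triangle parameter shifts probability toward graphs with more triangles.
print(expected_stats(4, (0.0, 1.0)))
```

Maximum likelihood estimation then amounts to finding the \(\theta\) for which the expected statistics equal the observed ones, which in practice has to be done by MCMC simulation rather than by enumeration.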

Note that a “configuration”, unlike a motif (in its most common usage) or the triad census classes, is not an induced subgraph. That is, it does not include every edge in the original graph of which it is a subgraph: a configuration is any occurrence of the substructure in question in the graph; it is defined only by its edges, not by its edges and non-edges. See Fig. 2 for an example based on one from Fodor et al. (2020, Fig. 5B).

Fig. 2

Motif examples. F, the transitive triangle (triad 030T) is not a special case of H, the out-star (triad 021D), when considered as motifs (or triad census classes): they are distinct induced subgraphs of three nodes. However, when considered as ERGM configurations, since H is a subgraph (but not an induced subgraph) of F (the transitive triangle is formed by “closing” the out-star with an additional arc), in their corresponding statistics both F and H are counted for an occurrence of F
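The distinction between a configuration count and an induced subgraph count can be checked directly on the two examples in Fig. 2. The sketch below is illustrative only, with hypothetical function names: it counts out-two-stars as an ERGM-style configuration (any pair of arcs leaving the same node) and as an induced subgraph (triad 021D, where those two arcs are the only arcs among the three nodes).

```python
from itertools import combinations
from math import comb


def out_two_star_configurations(arcs):
    """Count pairs of arcs sharing a source node, regardless of other arcs."""
    out_degree = {}
    for source, _ in set(arcs):
        out_degree[source] = out_degree.get(source, 0) + 1
    return sum(comb(d, 2) for d in out_degree.values())


def induced_out_stars(nodes, arcs):
    """Count node triples whose induced subgraph is exactly an out-star (021D)."""
    arcs = set(arcs)
    count = 0
    for triple in combinations(nodes, 3):
        inside = [(s, t) for s in triple for t in triple if s != t and (s, t) in arcs]
        # Exactly two arcs, both sent by the same node, and nothing else.
        if len(inside) == 2 and inside[0][0] == inside[1][0]:
            count += 1
    return count


F = [("a", "b"), ("a", "c"), ("b", "c")]  # transitive triangle (030T)
H = [("a", "b"), ("a", "c")]              # out-star (021D)

for name, arcs in (("F", F), ("H", H)):
    print(name, out_two_star_configurations(arcs), induced_out_stars("abc", arcs))
```

For F the configuration count is one while the induced count is zero, which is the sense in which an occurrence of F is also counted in the statistic for H.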

ERGMs address the need to correct for correlations between motif occurrences, and for other attributes such as subcellular location (functional and evolutionary significance is another matter entirely). Given an observed network, model parameters can be estimated by maximum likelihood. Hence parameters corresponding to candidate motifs such as triangles can be estimated, and a positive significant parameter would indicate triangles occurring more frequently than by chance, given the other parameters in the model (which would include parameters to control for density and degree distribution, for example). ERGMs allow different structural configurations to be incorporated, as well as configurations based on node attributes (such as physico-chemical properties, or spatial locality), and the significance of each configuration can then be assessed given all the other configurations, structural or attribute-based, included in the model.
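The inference step can be sketched as follows. This is a hedged illustration, assuming that significance is judged by whether an approximate 95% confidence interval (estimate plus or minus 1.96 standard errors) excludes zero; the function name, the critical value argument, and the numbers in the usage lines are hypothetical and are not estimates from any model in this work.

```python
def classify_parameter(estimate, std_error, z_crit=1.96):
    """Interpret an ERGM parameter from its estimate and standard error."""
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    if lower > 0:
        return "over-represented"
    if upper < 0:
        return "under-represented"
    return "not significant"


# Hypothetical values, for illustration only.
print(classify_parameter(1.2, 0.3))
print(classify_parameter(-0.5, 0.1))
print(classify_parameter(0.2, 0.3))
```

The interpretation is always conditional: the same configuration can change from significant to not significant when another, correlated configuration is added to the model.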

ERGMs fulfill all of the desirable criteria for improved network models listed by de Silva and Stumpf (2005, p. 427). They take into account that networks are finite. Indeed, far from requiring very large networks to fit the requirements of mean-field theories, they are dependent on network size and do not scale consistently to infinity (Rolls et al. 2013; Schweinberger et al. 2020; Shalizi and Rinaldo 2013)—a property that can be used to estimate population size from network samples (Rolls and Robins 2017). They can handle modular organization or community or block structure (Babkin et al. 2020; Fronczak et al. 2013; Gross et al. 2021; Schweinberger 2020; Schweinberger and Handcock 2015; Schweinberger and Luna 2018; Wang et al. 2019), samples from larger networks (An 2016; Handcock and Gile 2010; Pattison et al. 2013; Stivala et al. 2016), and missing data (Koskinen et al. 2013; Robins et al. 2004). And finally, they are flexible at incorporating additional information such as nodal attributes, including dyadic attributes, such as distances between nodes. ERGMs have also been extended to handle valued networks (Desmarais and Cranmer 2012; Krivitsky 2012) and dynamic (time-varying) networks (Krivitsky and Handcock 2014), and to use graphlets (Pržulj 2007) as the ERGM configurations (Yaveroǧlu et al. 2015).

Despite these potential advantages, however, ERGM parameter estimation is a computationally intractable problem, and in practice it is generally necessary to use Markov chain Monte Carlo (MCMC) methods (Hunter et al. 2012). A variety of algorithms for ERGM model fitting (Hummel et al. 2012; Hunter and Handcock 2006; Krivitsky 2017; Snijders 2002) are implemented in widely used software packages such as statnet (Handcock et al. 2008; Hunter et al. 2008; Morris et al. 2008) and PNet/MPNet (Wang et al. 2009), and Bayesian methods are also available (Caimo and Friel 2011, 2014). These packages also implement the so-called “alternating” or “geometrically weighted” configurations (Robins et al. 2007b; Snijders et al. 2006), which alleviate problems with model “near-degeneracy”, where the model’s probability mass is concentrated in a very small region of possible networks, which can occur when only simple configurations, such as stars and triangles, are used (Hunter et al. 2012).

Until recently, the computational difficulty of ERGM parameter estimation has limited its application to biological networks, which are often larger than the social networks (traditionally measured by observations and surveys, rather than online social networks) for which the techniques were developed. Now, however, advances such as snowball sampling and conditional estimation (Pattison et al. 2013; Stivala et al. 2016), improved ERGM distribution samplers such as the “improved fixed density” (IFD) sampler (Byshkin et al. 2016), and new estimation algorithms (Hummel et al. 2012), including the “Equilibrium Expectation” (EE) algorithm (Byshkin et al. 2018; Borisenko et al. 2019) and its implementation for large directed networks (Stivala et al. 2020), have reduced by orders of magnitude the time taken to estimate ERGM parameters.

Literature review of application of ERGMs to biological networks

ERGMs were first applied to biological networks by Saul and Filkov (2007), who estimated model parameters for Escherichia coli (Salgado et al. 2001) and yeast regulatory networks, and a collection of metabolic networks. As well as introducing the use of ERGMs to the field of bioinformatics for analyzing biological networks, Saul and Filkov (2007) used ERGM models to build topological profiles which they showed to be capable of classifying organisms into biological and functional groups. With the algorithms and implementations available at the time, the larger networks could only be estimated by maximum pseudo-likelihood (Strauss and Ikeda 1990), an approximation which is now considered problematic (van Duijn et al. 2009; Hunter et al. 2012; Robins et al. 2007b) and useful mostly for obtaining initial parameter estimates for a more accurate (but also more computationally expensive) method (Hummel et al. 2012; Hunter and Handcock 2006; Krivitsky 2017). Further, all the networks in Saul and Filkov (2007) were treated as undirected, thereby losing important directional information (and not, for example, being able to distinguish between cyclic and transitive triads) in regulatory networks. The E. coli regulatory network, treated as undirected, was also used as an example application of the “stepping” algorithm for ERGM estimation by Hummel et al. (2012).

Exponential random graph models for similar E. coli regulatory networks were described by Begum et al. (2014), leaving the networks directed rather than treating them as undirected. These models were very simple, however, including only Arc and In-star terms, and therefore model degree distribution, but not triangular motifs.

Bayesian estimation of an ERGM model of a human PPI network with 401 proteins was described by Bulashevska et al. (2010). This model used only very basic structural features (not including any triangular structures, for example), but made use of nodal attributes, specifically a binary variable indicating if the protein is disordered. This ERGM was not used to analyze network motifs, but rather the relationship between disordered proteins and their “sociality”, a measure of their importance in the PPI network, finding that intrinsically disordered proteins tend to be more “social” (Bulashevska et al. 2010). In their Conclusions, Bulashevska et al. (2010) suggest that “The ERGM modelling of networks offers a natural way of assessing importance of the network motifs” (Bulashevska et al. 2010, p. 13).

Similar techniques, that is, Bayesian estimation of ERGMs with only very simple structural terms, have also been used with gene–gene relationship networks to model mechanisms of gene dysregulation (Azad et al. 2017). These models were used to infer potential aberrant gene pairs, and suggested a novel pattern of aberrant signaling (Azad et al. 2017).

A mixture ERGM was introduced by Wang et al. (2019) and applied to a yeast gene interaction network with 424 genes (Schuldiner et al. 2005; Wang et al. 2019). The model included geometrically weighted in-degree and out-degree terms, but not any triangular terms; the interest is rather in the clusters it finds, which may be used to predict function (Wang et al. 2019).

An ERGM incorporating a directed form of the degree-corrected stochastic blockmodel (Karrer and Newman 2011) was introduced by Gross et al. (2021), and applied to the connectome of the C. elegans worm (279 nodes representing neurons), and an A. thaliana PPI network (4344 nodes representing proteins). These models assume dyadic independence, and hence triangular configurations could not be incorporated. The advantage of the mixture ERGM (Wang et al. 2019) or stochastic blockmodel ERGM generalizations (\(\beta\)-SBM and \(p_1\)-SBM (Gross et al. 2021)) is that they can capture heterogeneity in clusters found in the network, but we do not address cluster or community structure here.

ERGMs have been applied to neural networks with 90 nodes, representing brain regions (Simpson et al. 2011, 2012), finding that an ERGM approach outperforms conventional approaches for constructing group-based representative brain networks (Simpson et al. 2012). Bayesian ERGM techniques, with 96 nodes representing brain regions, have been used to model brain networks over the human lifespan (Sinke et al. 2016). Recently, Bayesian ERGMs, extended to multiple networks, were used to compare functional connectivity structure across groups of individuals (Lehmann et al. 2021).

ERGMs have also been used to model human brain networks inferred from electroencephalographic (EEG) signals; these networks have 56 (the number of EEG sensors) nodes (Obando and De Vico Fallani 2017). These models showed that clustering and node centrality (as reflected by over-representation of triangles and stars) better explained global properties of the brain networks than other graph metrics, supporting the view that segregated modules exchange information via hubs.

An enhanced version of the generalized (or valued) ERGM (Desmarais and Cranmer 2012) was used to model the human Default Mode Network (DMN) with 20 nodes, representing brain regions (Stillman et al. 2017). This model showed that the DMN appears to be organized in a “segregated highway” structure, that is, with fewer hubs and more triadic closure than expected, in contrast to “small world” structure of the whole-brain network (Stillman et al. 2017). This work is an example of an ERGM that incorporates spatial distances, in the form of three-dimensional Euclidean distances between nodes.

A Bayesian ERGM has been used to model transient structure in intrinsically disordered proteins, providing a means for identifying transient structures that differ in favorability across variants (Grazioli et al. 2019a). A specific family of ERGMs has been used to model amyloid fibril topologies, leading to the construction of a systemic nomenclature that can classify all known amyloid fibril structures, and a simulation technique that can explore the kinetics of fibril self-assembly (Grazioli et al. 2019b).

Simple ERGMs for undirected networks (A. thaliana, yeast, human, and C. elegans PPI networks, and undirected versions of E. coli regulatory and Drosophila optic medulla networks) were estimated in Byshkin et al. (2018, S.I.), demonstrating that the EE algorithm could be used to estimate in minutes a model that takes many hours or is practically impossible with earlier methods. In addition, a more complex model of the A. thaliana PPI network was estimated, showing not just the over-representation of the triangle motif, but also the tendency for plant-specific proteins to interact preferentially with each other, and for kinases to interact preferentially with phosphorylated proteins (Byshkin et al. 2018). However, that work dealt only with undirected networks. An implementation of the EE algorithm for directed networks was described in Stivala et al. (2020), but no biological networks were considered in that work.

Methods

Network data

We obtained a yeast PPI network (von Mering et al. 2002) from the igraph (Csárdi and Nepusz 2006) Nexus network repository (the repository is no longer available; we used the network downloaded on 10 November 2016). In the yeast PPI network, each protein is annotated with one of 12 functional categories (Mewes et al. 2002; Ruepp et al. 2004), or as “uncharacterized”, as described in the Supplementary Information of von Mering et al. (2002).

We obtained a human PPI network from the HIPPIE database (Alanis-Lobato et al. 2016; Schaefer et al. 2012, 2013; Suratanee et al. 2014), version 2.2, downloaded from http://cbdm.uni-mainz.de/hippie/ (accessed 12 June 2021). Edges in this network are labeled with a confidence score between zero and one. We built a binary “high confidence” network by selecting edges where the score is \(\ge 0.70\), the third quartile of the score distribution.

To annotate nodes in the human PPI network with their subcellular location using terms in the Gene Ontology (GO) (Ashburner et al. 2000), we used the Protein ANalysis THrough Evolutionary Relationships (PANTHER) database (Mi et al. 2019, 2021). We used the PANTHER database version 16.0 downloaded from http://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR16.0_human (accessed 21 June 2021). We used the R package GOxploreR (Manjang et al. 2020, 2021) to rank the GO terms for subcellular component in the PANTHER database, and annotated each node (representing a protein) in the network with the highest ranking term for that protein. This results in a cellular component GO term for 6131 of the 11,517 nodes (53%) in the human PPI network. The cellular component GO terms are treated as a categorical attribute, of which there are 271 unique values in the data. The nodes with no cellular component GO term assigned are given an “NA” category, which, when used in the “Match” statistic in ERGM modeling, does not match any category (including the NA category itself).
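The treatment of missing categories in the “Match” statistic can be illustrated with a short sketch. It assumes that the statistic is the number of edges whose two endpoints carry the same categorical value and that a missing value (represented here by `None`) never matches anything, including another missing value; the function name and the category labels in the example are hypothetical.

```python
def match_statistic(edges, category):
    """Count edges whose two endpoints share the same non-missing category."""
    count = 0
    for u, v in edges:
        cu, cv = category.get(u), category.get(v)
        if cu is not None and cu == cv:
            count += 1
    return count


# Hypothetical toy data: None plays the role of the "NA" category.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
category = {1: "nucleus", 2: "nucleus", 3: "cytosol", 4: None, 5: None}
print(match_statistic(edges, category))
```

Only the edge between nodes 1 and 2 contributes here; the edge between the two unannotated nodes 4 and 5 does not, so unannotated proteins cannot inflate the estimated tendency for co-located proteins to interact.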

The previously mentioned E. coli regulatory network (Salgado et al. 2001; Shen-Orr et al. 2002) was obtained via the statnet package (Handcock et al. 2008, 2016). Following Hummel et al. (2012), we removed the loops (self-edges) representing self-regulation, and considered self-regulation instead in a simplistic way by a binary node attribute designated “self” which is true when a self-loop was present and false otherwise. In some models, we use the original version of this network with self-edges retained, and when this is done it is noted in the results. We also obtained a Saccharomyces cerevisiae (yeast) regulatory network (Costanzo et al. 2001; Milo et al. 2002) (http://www.weizmann.ac.il/mcb/UriAlon/download/collection-complex-networks; accessed 29 April 2019) and processed it in the same way.

For all networks, we removed multiple edges and, unless noted otherwise, self-loops, where these are present.
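The preprocessing of the regulatory networks can be sketched as below. This is a minimal illustration under stated assumptions, not the code used to prepare the data: arcs are given as (regulator, target) pairs, multiple arcs are collapsed, self-loops are removed, and each node receives a binary “self” attribute recording whether it had a self-loop. The function name and the toy arc list are hypothetical.

```python
def clean_arcs(raw_arcs):
    """Drop self-loops and multiple arcs; record self-regulation as an attribute."""
    self_regulating = {u for u, v in raw_arcs if u == v}
    arcs = sorted({(u, v) for u, v in raw_arcs if u != v})
    nodes = {x for arc in raw_arcs for x in arc}
    self_attr = {node: node in self_regulating for node in sorted(nodes)}
    return arcs, self_attr


# Hypothetical toy input: one self-loop and one duplicated arc.
raw = [("x", "x"), ("x", "y"), ("x", "y"), ("y", "z")]
arcs, self_attr = clean_arcs(raw)
print(arcs)
print(self_attr)
```

Keeping self-regulation as a node attribute retains the information in a form that can enter the model as a covariate, at the cost of no longer treating it as a tie.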

Summary statistics of the networks are in Table 1 and the degree distributions of the networks are shown in Fig. 3. In this figure, \(\alpha\) is the exponent in the discrete power law distribution \(\Pr (X=x) = Cx^{-\alpha }\) (where C is a normalization constant), and \(\mu\) and \(\sigma\) are the parameters (respectively, mean and standard deviation of \(\log (x)\)) of the discrete log-normal distribution. Power law and log-normal distributions were fitted using the methods of Clauset et al. (2009) implemented in the poweRlaw package (Gillespie 2015).

Table 1 Summary statistics for the biological networks

Fig. 3

Degree distributions of the networks. Power law and log-normal distributions fitted to the CDF for degree distributions of the networks (in- and out-degree for directed networks, degree for undirected networks). All distributions apart from the E. coli in-degree distribution (for which a log-normal distribution could not be fitted), and the Human PPI (HIPPIE) degree distribution (which is not consistent with a power law distribution, \(p < 0.01\)), are consistent with both power law and log-normal distributions

ERGM configurations

The ERGM parameters used in the models for undirected networks are shown in Table 2, and those for directed networks in Table 3. Detailed descriptions of these parameters and their corresponding statistics can be found in Lusher et al. (2013), Robins et al. (2007a, 2007b, 2009), Snijders et al. (2006), and Stivala et al. (2020); two of the important ones used in this work are shown in Fig. 4.

Table 2 Parameters for undirected networks

Table 3 Parameters for directed networks

Fig. 4

Alternating two-paths and alternating transitive triangles ERGM configurations for directed networks. Unlike motifs, ERGM configurations are not induced subgraphs, so it is normal (and often required) for one to be a subgraph of another. So AltTwoPathsT and AltKTrianglesT are frequently included in a model together, with AltKTrianglesT consisting of the AltTwoPathsT configuration “closed” by the addition of an arc

The “alternating” statistics (Lusher et al. 2013; Robins et al. 2007b; Snijders et al. 2006) such as alternating k-stars involve sums of counts of configurations with alternating signs and a decay factor \(\lambda\), and, except where otherwise specified, we set \(\lambda = 2\) in accordance with common ERGM modeling practice.
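As a worked example of an “alternating” statistic, the sketch below computes the alternating k-star statistic for an undirected network in two ways. It assumes the form given, to our understanding, in Snijders et al. (2006): the sum over k from 2 to n-1 of \((-1)^k S_k / \lambda ^{k-2}\), where \(S_k\) is the number of k-stars, together with the equivalent expression in terms of node degrees obtained from the binomial theorem. The function names are hypothetical, and the other alternating statistics used in this work (such as AltKTrianglesT) are not covered.

```python
from math import comb


def alt_k_stars_by_definition(degrees, lam=2.0):
    """Alternating sum of k-star counts S_k, each damped by lam**(k - 2)."""
    n = len(degrees)
    total = 0.0
    for k in range(2, n):
        s_k = sum(comb(d, k) for d in degrees)  # number of k-stars
        total += (-1) ** k * s_k / lam ** (k - 2)
    return total


def alt_k_stars_closed_form(degrees, lam=2.0):
    """Same statistic written as a sum over node degrees."""
    return lam ** 2 * sum((1 - 1 / lam) ** d + d / lam - 1 for d in degrees)


# Star with one hub and four leaves: S_2 = 6, S_3 = 4, S_4 = 1.
degrees = [4, 1, 1, 1, 1]
print(alt_k_stars_by_definition(degrees), alt_k_stars_closed_form(degrees))
```

With \(\lambda = 2\) the hub contributes 6 - 4/2 + 1/4 = 4.25, so higher-order stars add progressively less; this damping is what helps avoid the near-degeneracy seen when raw star counts are used.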

ERGM parameter estimation

ERGM parameters for undirected networks were estimated using the EE algorithm (Byshkin et al. 2018) with the IFD sampler (Byshkin et al. 2016) implemented for undirected networks in the Estimnet software as described in Byshkin et al. (2018), with 20 estimations (run in parallel). ERGM parameters for directed networks were estimated using the simplified EE algorithm (Borisenko et al. 2019; Byshkin et al. 2018) with IFD sampler implemented for directed networks in the EstimNetDirected software (Stivala et al. 2020), with 64 estimations (run in parallel).

The Alon E. coli network does not contain any reciprocated arcs (directed loops of length two), and so estimation is made conditional on this by preventing the creation of reciprocated arcs in the MCMC procedure.

Convergence and goodness-of-fit tests

Convergence was tested as described in Byshkin et al. (2018) and Stivala et al. (2020), by requiring the absolute value of each parameter’s t-ratio to be no greater than 0.3, and by visual inspection of the parameter and statistic trace plots. For the directed networks estimated with EstimNetDirected, an additional heuristic convergence test was used, as described in Stivala et al. (2020). Observed graph statistics were plotted on the same plots as the distributions of those statistics in the networks simulated in the EE algorithm MCMC process, to check that they do not diverge. The statistics used are the same as those of the actual goodness-of-fit test described below, but note that this test is only for estimation convergence, not goodness-of-fit (Stivala et al. 2020).
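The t-ratio criterion can be sketched as follows. This is an illustration under an assumed definition, namely the difference between the mean simulated statistic and the observed statistic, divided by the standard deviation of the simulated statistic; the exact computation in the estimation software may differ, and the function names and numbers are hypothetical.

```python
from statistics import mean, stdev


def t_ratio(simulated, observed):
    """Standardized difference between simulated and observed statistic values."""
    # Assumes the simulated values are not all identical (nonzero spread).
    return (mean(simulated) - observed) / stdev(simulated)


def converged(simulated_by_stat, observed_by_stat, threshold=0.3):
    """True when every statistic has an absolute t-ratio within the threshold."""
    return all(
        abs(t_ratio(simulated_by_stat[name], observed_by_stat[name])) <= threshold
        for name in observed_by_stat
    )


# Hypothetical simulated triangle counts from a fitted model.
simulated = {"triangles": [98, 100, 102, 101, 99]}
print(converged(simulated, {"triangles": 100}))
print(converged(simulated, {"triangles": 110}))
```

A small t-ratio indicates only that the simulated networks reproduce the statistics included in the model, which is why it serves as a convergence check rather than a goodness-of-fit test.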

For the directed networks estimated with EstimNetDirected, a simulation-based goodness of fit procedure was used, similar to that used in statnet (Hunter et al. 2008). A set of networks was simulated from the estimated model (using the SimulateERGM program in the EstimNetDirected software), and the distribution of certain graph statistics compared with those of the observed network by plotting the observed network values on the same plots as the distribution of simulated values. The statistics used were the in- and out-degree distributions, reciprocity, giant component size, mean local and global clustering coefficients, triad census, geodesic distance (shortest path length) distribution, and edge-wise and dyad-wise shared partners distributions.
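The comparison in this work was made visually, but a rough numerical analogue is to ask whether the observed value of a statistic lies within the central 95% of the simulated values. The sketch below shows that check under this assumption; the function name is hypothetical and it is not part of the goodness-of-fit procedure described above.

```python
from statistics import quantiles


def within_central_95(observed, simulated):
    """True if `observed` lies between the 2.5% and 97.5% quantiles of `simulated`."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last
    cuts = quantiles(simulated, n=40)
    return cuts[0] <= observed <= cuts[-1]


# Toy values only: 100 simulated values of some graph statistic
toy_simulated = list(range(100))
fits = within_central_95(50, toy_simulated)
misfits = within_central_95(150, toy_simulated)
```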

Results and discussion

Table 4 shows the basic structural model for the yeast PPI network (Model 1), a model with the alternating k-two-paths (A2P) parameter added (Model 2), as well as a model (Model 3) incorporating a parameter for the propensity of interactions to occur between proteins in the same functional category (class). Model 1 reproduces a model of this network in a previous work (Byshkin et al. 2018, Table S3); Models 2 and 3 are new.

Table 4 Parameter estimates with 95% confidence interval for the yeast PPI network, from the EE algorithm

Each of these model estimations took approximately 7 minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30GHz processors using 20 parallel tasks.

We expect that proteins of the same functional category should preferentially interact with each other (von Mering et al. 2002), and this is confirmed by the significant positive parameter estimated for the “Match class” effect. The alternating k-triangle (AT) parameter is positive and significant in all models, showing an over-representation of triangles (which we might expect given the very high value of the clustering coefficient for this network, Table 1), even in models also including parameters for two-paths and preferential interaction of proteins in the same class.

Table 5 shows a basic structural model for the human PPI high confidence network (Model 1), and a model with a term to control for subcellular location by categorical matching on the cellular component GO term (Model 2).

Table 5 Parameter estimates with 95% confidence interval for the human PPI (HIPPIE high confidence) network, from the EE algorithm

Estimation of Model 1 took approximately 64 minutes elapsed time, and Model 2 approximately 73 minutes, on cluster nodes with Intel Xeon E5-2650 v3 2.30GHz processors using 20 parallel tasks.

As discussed in the Introduction, we expect that interactions would be over-represented between proteins that share a subcellular location, and this is confirmed by a statistically significant positive parameter estimate for categorical matching on cellular component (Model 2 in Table 5). The alternating k-triangle (AT) parameter is positive and statistically significant in both models. This indicates an over-representation of triangles, even when controlling for subcellular location (Model 2).

We estimated four different models of the Alon E. coli regulatory network (Table 6). In Models 1 and 2, following Hummel et al. (2012), we modeled self-regulation by using a nodal covariate “self” which is true exactly when the node had a self-edge (loop) in the original network. These ERGM models are new, in that previous work with ERGMs on these networks either treated them as undirected (Hummel et al. 2012; Saul and Filkov 2007), thereby ignoring the inherently directed nature of such a regulatory network; or, in the case where the network was left as directed, included only Arc and alternating k-in-stars terms, as the estimation methods used at the time could not find converged models when other terms, such as triangles, were included (Begum et al. 2014).

Table 6 Parameter estimates with 95% confidence interval for the Alon E. coli regulatory network

Each of these model estimations took approximately three minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30GHz processors using 64 parallel tasks.

In these models, the Sink and Source parameters are used to control, respectively, for the presence of genes that do not regulate any genes (have out-degree zero) and genes that are not regulated by any gene (have in-degree zero). The alternating k-in-stars (AltInStars) parameter is positive and significant in all models except Model 3, indicating significant skewness of the in-degree distribution, that is, the presence of “hubs” with higher in-degree than other nodes. There is no significant effect for (or against) such skewness of the out-degree distribution (see Figs. 3 and 5).

Fig. 5

Alon E. coli regulatory network. (a) Node size is proportional to in-degree. (b) Node size is proportional to out-degree. Self-regulating operons are depicted as filled (red) circles. In (a) there appears to be a small set of high in-degree nodes and a much larger set of smaller in-degree nodes, while in (b) the out-degree of the nodes appears to be much more evenly distributed. The hypothesis we might make from (a), that there is centralization on in-degree, is confirmed by the ERGM results. This same model finds no support for the hypothesis we might make from (b), that there is a tendency against centralization on out-degree

The only other parameter that is consistently significant (and positive) is path closure (AltKTrianglesT), which we can interpret as a significant tendency for the “feed-forward loop” to be over-represented, consistent with the results in Milo et al. (2002).

A goodness-of-fit plot for Model 1 (Table 6) is shown in Additional file 1: Fig. S1a, showing a good fit for the model. A goodness-of-fit plot for the triad census (Fig. 6a) shows that the model reproduces the triad census well, and specifically triad 030T, the transitive triad (three node feed-forward loop), giving additional confidence that the positive and statistically significant AltKTrianglesT parameter is evidence for over-representation of this motif, given the other parameters in the model.

Fig. 6

Goodness-of-fit plots for the triad census of (a) the Alon E. coli regulatory network, Model 1 (Table 6), and (b) the Alon yeast regulatory network, Model 1 (Table 7). The observed triad counts are plotted in red with the triad counts of 100 simulated networks plotted as black boxplots. Because the triad counts (y-axis) are on a log scale, values of zero are omitted (observed zero counts shown as a red point on the bottom of the graph). In (a), for triad census class 030C (cyclic triad), the “box plot” consisting of a single median line for the simulated count represents a single (out of 100 simulations) occurrence of a nonzero count (of 1) for 030C

Note that this E. coli regulatory network does not contain any instances of the three-cycle, or “three-node feedback loop” (Milo et al. 2002). Indeed the Alon E. coli network does not contain any loops greater than size one (Shen-Orr et al. 2002), and so the cyclic closure parameter (AltKTrianglesC) is not included in the models.
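To make the two triad census classes discussed here concrete, the following brute-force sketch counts induced transitive triads (030T, the three-node feed-forward loop) and cyclic triads (030C, the three-node feedback loop) in a small directed network. It is intended only for illustration on toy data; the function name is hypothetical and the approach is far too slow for large networks.

```python
from itertools import combinations


def count_030T_030C(nodes, arcs):
    """Return (number of 030T triads, number of 030C triads) as induced subgraphs."""
    arcset = set(arcs)
    transitive = cyclic = 0
    for triple in combinations(nodes, 3):
        # Arcs among the three nodes, ignoring self-edges
        induced = {(u, v) for u in triple for v in triple
                   if u != v and (u, v) in arcset}
        if len(induced) != 3:
            continue
        if any((v, u) in induced for (u, v) in induced):
            continue  # a mutual dyad rules out both 030T and 030C
        # Three asymmetric dyads: out-degrees distinguish the two classes
        out_degrees = sorted(sum(1 for (u, _) in induced if u == x) for x in triple)
        if out_degrees == [1, 1, 1]:
            cyclic += 1
        elif out_degrees == [0, 1, 2]:
            transitive += 1
    return transitive, cyclic


# Toy network: one feed-forward loop on nodes 1, 2, 3 plus a pendant arc
ffl_counts = count_030T_030C([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)])
cycle_counts = count_030T_030C([1, 2, 3], [(1, 2), (2, 3), (3, 1)])
```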

In Models 3 and 4 (Table 6), unlike the other models, self-edges (loops) are retained in the network, and self-edges are allowed in the modeling process, allowing the formation of loops to be modeled jointly with the other structural features in the model (see Note 1). In Model 4, the new parameter “Loop” is introduced, for which the corresponding statistic is the count of self-edges in the network. This parameter is statistically significant and positive, indicating that self-edges are over-represented, given the other effects included in the model. Goodness-of-fit plots for Models 3 and 4 (Table 6) are shown in Additional file 1: Fig. S4, showing that when the Loop parameter is not included in the model (Model 3 in Table 6), there is a poor fit for the number of loops (Additional file 1: Fig. S4a). However, when the Loop parameter is included (Model 4 in Table 6), there is a good fit for the number of loops (Additional file 1: Fig. S4b).

We found that it is also possible to estimate similar models of this relatively small network using the most recent version of the statnet ergm package (Handcock et al. 2021; Krivitsky et al. 2021), with the “stepping” algorithm (Hummel et al. 2012). These models are shown in Additional file 1: Table S1, and the goodness-of-fit plots in Additional file 1: Figs. S6 and S7. The results are consistent with those in Table 6. Specifically, there is a significant positive estimate for geometrically weighted edge-wise shared partners (GWESP, equivalent to AltKTrianglesT), and a significant negative estimate for geometrically weighted in-degree, indicating centralization in the in-degree distribution (see Note 2). The statnet model finds a significant tendency against centralization on out-degree, while the models in Table 6 did not have a significant estimate for the corresponding parameter (AltOutStars). Similarly, the statnet model (Model 2 in Additional file 1: Table S1) finds a significant negative parameter estimate for Matching on the “self-regulating” attribute, while no significant effect is found in Model 2 in Table 6. The statnet ergm package does not allow for the modeling of self-edges, however (Hummel et al. 2012).

Table 7 shows ERGM parameter estimates for the Alon yeast regulatory network. Each of these model estimations took approximately three minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30GHz processors using 64 parallel tasks. These ERGM models are also new; previously published ERGMs for similar networks have treated them as undirected (Saul and Filkov 2007).

Table 7 Parameter estimates with 95% confidence intervals for the Alon yeast regulatory network

In Model 1 (Table 7), estimation is conditional on no reciprocated arcs, just as was done for the E. coli regulatory network. However in this yeast regulatory network, there is actually a single reciprocated arc (two-cycle) in the data, and hence the fit of the model on statistics involving reciprocated arcs is poor. This is apparent, for example, in the poor fit for triad census class 102 (triad with only a mutual arc) in Fig. 6b, or for the reciprocity statistic in the goodness-of-fit plot (Additional file 1: Fig. S1b). The fit for other statistics, and in particular the degree and shared partner distributions, is acceptable (with the exception of poor fit on the giant component size). Importantly, the fit on the triad census class 030T (transitive triad) is good (Fig. 6b).

In order to better model reciprocity, a model (Model 2 in Table 7) was estimated without being conditional on there being no reciprocated arcs, but without a reciprocity term in the model. This model also has adequate goodness-of-fit, but this time including good fit on the reciprocity statistic (Additional file 1: Fig. S2a). It does, however, for some triads involving reciprocated arcs (120U for example), generate significantly more such triads than are observed in the data (Additional file 1: Fig. S2b). Therefore, a third model (Model 3 in Table 7) was estimated, including the Reciprocity parameter. However, probably due to the fact that the data contains only a single reciprocated arc, this model has a very large estimated standard error for the Reciprocity parameter. Further, it exhibits poor convergence with respect to the Reciprocity statistic, with a t-ratio greater than the maximum value of 0.3 we consider acceptable, since the data contains exactly one reciprocated arc, yet the model most frequently generates networks with none.

Model 1 and Model 2, therefore, are preferable. Nevertheless, in all three models, the sign and significance of estimated parameters (except Reciprocity) are the same. There is a positive and significant parameter for alternating k-out-stars (AltOutStars), indicating the presence of “hubs” with higher out-degree than other nodes. This is as we might expect from Fig. 3 and previous research (Balaji et al. 2006; Guelzim et al. 2002; Monteiro et al. 2020; Ouma et al. 2018), and contrasts with the E. coli regulatory network, which has in-degree hubs but not out-degree hubs.

Also in all three models, there is a positive and significant parameter estimate for transitive closure (AltKTrianglesT). Given this estimate, and the good fit for the transitive closure motif 030T (Fig. 6b) we can again interpret this as a significant over-representation of this motif (“feed-forward loop”), consistent with the results of Milo et al. (2002).

In all three models in Table 7, the decay parameter \(\lambda\) for the “alternating” statistics has been set to a value other than the default \(\lambda =2\) for alternating k-out-stars (AltOutStars), multiple two-paths (AltTwoPathsT), and transitive closure (AltKTrianglesT). This is because models initially estimated with the default \(\lambda =2\) value (Additional file 1: Table S2) showed poor goodness-of-fit on the out-degree distribution (Additional file 1: Fig. S3a) and triad census class 030T (Additional file 1: Fig. S3b). Therefore, new models were estimated with a higher value of \(\lambda\) for the alternating k-out-star parameter to assist with modeling the highly skewed out-degree distribution (Koskinen and Daraganova 2013), and also a higher value of \(\lambda\) for AltTwoPathsT and AltKTrianglesT (the same value of \(\lambda\) for both) to aid model convergence and fit for transitivity (Snijders et al. 2006).

As with the E. coli network, we also estimated a model of the yeast regulatory network in which self-edges are retained and self-edges (loops) are allowed in the model. This network (even leaving aside the presence of self-edges) is, however, not identical to the network used for the models shown in Table 7, having two additional nodes. Its graph summary statistics are, however, the same (to the precision shown) as those of the version shown in Table 1, other than it having 690 rather than 688 nodes. Since the network modeled is a slightly different network than that used for the models shown in Table 7, these models are presented separately, in Additional file 1: Table S3. The results are consistent with those in Table 7, with statistically significant positive parameter estimates for AltOutStars and AltKTrianglesT. The estimate for the Loop parameter is not statistically significant, however. Goodness-of-fit plots for the models in Additional file 1: Table S3 are shown in Additional file 1: Fig. S5. These figures show that the model which allows self-edges, but does not include the Loop parameter (Model 1 in Additional file 1: Table S3) does not fit the number of loops well, while the model that includes the Loop parameter (Model 2 in Additional file 1: Table S3) does fit the number of loops well.

The cyclic triangle structure has been suggested as an “anti-motif” (i.e. occurs less frequently than expected), but in some cases its apparent under-representation has been shown to be an expected consequence of other topological properties of biological networks (Konagurthu and Lesk 2008a). This closed-loop structure, also known as a “multicomponent loop”, can provide feedback control and potentially produce systems that can switch between two states (Ferrell 2002; Lee et al. 2002). In the examples used here, there were so few (or no) occurrences of this motif, that models including the corresponding parameter (in the form of the AltKTrianglesC parameter) would not converge. Yet the networks simulated from these models also contain no (or very few) occurrences of this candidate anti-motif. This is consistent with the lack of cyclic triangles not being due to cyclic triangles being an anti-motif as such, but rather as a consequence of the other topological features of the network, and specifically in these examples, the features described by the parameters included in the models. This is not a new finding, it having previously been noted that the lack of three-node feedback loops in the E. coli regulatory network (Lee et al. 2002; Shen-Orr et al. 2002) is reproduced in randomized networks (Shen-Orr et al. 2002).

The biological significance of the feed-forward loop (transitive triangle) is suggested to be that, by providing two pathways to affect the output, one direct, and one through an intermediate link, it can act as a logical “AND” gate, and filter out transient activation signals (Alon 2007; Lesk and Konagurthu 2021; Mangan and Alon 2003; Shen-Orr et al. 2002). Whether or not this is indeed the biological function of the feed-forward loop (Mazurie et al. 2005), this motif is found to be significantly over-represented in the transcriptional regulatory networks of several organisms (Alon 2007), including the yeast and E. coli networks studied here, and the feed-forward loop has been described as “highly favored during the evolution of transcriptional regulatory networks in yeast” (Lee et al. 2002, p. 801).

More recently, there has been interest in trying to understand the function of motifs by examining higher levels of structure. Gorochowski et al. (2018) examine the clustering of motifs, including the feed-forward loop, and find that a measure of motif clustering diversity can predict functionally important nodes in the E. coli metabolic network. Lesk and Konagurthu (2021) describe how the local structure of the yeast regulatory network is reconfigured in different physiological states.

So far we have only discussed results for three-node motifs, such as the feed-forward loop. We can test for the over-representation of other motifs, without including parameters for them in the model, by using the ERGM as the null model against which to compare the count of the motif in the observed network. This was the technique used by Felmlee et al. (2021), for example.

Fig. 7

(a) The bi-fan and bi-parallel four-node motifs. Goodness-of-fit plots for these motifs for (b) the Alon E. coli regulatory network Model 1 (Table 6), and (c) the Alon yeast regulatory network Model 1 (Table 7). The observed network statistics are plotted as a red diamond, with the statistics of 100 simulated networks plotted as black boxplots

Figure 7 shows the bi-fan and bi-parallel motifs, as defined by Milo et al. (2002), their counts in the E. coli and yeast regulatory networks, and their distribution in ERGM models of these networks. The motifs were counted with the NetMODE software (Li et al. 2012). Note that NetMODE was used only to count the motifs, not to simulate any networks, which are simulated from the ERGM models as described in the Methods section.
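For readers unfamiliar with the bi-fan, the following brute-force sketch counts it as an induced four-node subgraph: two source nodes, each with an arc to both of two target nodes, and no other arcs among the four. This is an illustrative assumption about the counting convention (induced subgraphs, as is usual for motifs) and does not reproduce the NetMODE algorithm; the function name is hypothetical and the enumeration is only practical for very small networks.

```python
from itertools import combinations


def count_bifans(nodes, arcs):
    """Count induced bi-fan subgraphs in a directed network (self-edges ignored)."""
    arcset = {(u, v) for (u, v) in arcs if u != v}
    count = 0
    for quad in combinations(nodes, 4):
        induced = {(u, v) for u in quad for v in quad if (u, v) in arcset}
        if len(induced) != 4:
            continue  # a bi-fan has exactly four arcs among its four nodes
        for sources in combinations(quad, 2):
            targets = [x for x in quad if x not in sources]
            # All four source -> target arcs present means the induced arcs are exactly the bi-fan
            if all((s, t) in induced for s in sources for t in targets):
                count += 1
                break
    return count


# Toy network: a single bi-fan, then the same nodes with one extra arc
bifan_arcs = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
one_bifan = count_bifans(["a", "b", "c", "d"], bifan_arcs)
no_bifan = count_bifans(["a", "b", "c", "d"], bifan_arcs + [("a", "b")])
```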

The bi-parallel motif occurs in neither of the observed networks, and nor does it occur in any of the networks simulated from the corresponding ERGMs. The bi-fan motif, however, clearly occurs far more frequently in both observed networks than it does in the corresponding simulated networks. Note that these networks are simulated from ERGMs that model not just degree distribution, but also the distribution of two-paths and transitive triangles. Therefore, this shows that the bi-fan motif appears to be over-represented in the observed networks, even given the over-representation of transitivity captured in the models, which also reasonably reproduce the triad census, geodesic distance distribution, and dyad-wise and edge-wise shared partner distributions. These results are consistent with the results of Milo et al. (2002), where only degree-preserving randomization was used.

Limitations

Finding a converged ERGM for a network is not always possible in practice. In particular, models which include Markov dependency assumption parameters such as triangles, corresponding directly to three-node motif candidates such as three-node feed-forward loops (transitive triangles) and three-cycles, for example, usually do not converge. For this reason it is normal practice in ERGM modeling to use geometrically weighted or “alternating” configurations to solve this problem (Hunter et al. 2012; Robins et al. 2007b; Snijders et al. 2006), as we did in this work. However, this means we are not answering precisely the same question as when we ask directly if a motif is over-represented or not. This is because ERGM is a model for tie (edge or arc) formation, not for motif formation: if we consider ERGM as a type of logistic regression, the outcome variable is the presence or absence of a network tie. The predictor variables are not independent of each other, but form a nested hierarchy of configurations: triangles are formed by “closing” a two-path with an additional edge, for example. So a positive estimate of the alternating k-triangle parameter does not directly mean that the transitive triangle (three-node feed-forward loop) motif is over-represented, but rather that there is a tendency (that is, it is more probable than chance given the other parameters in the model) for three nodes forming a directed two-path to be closed in a transitive triangle. This makes sense in the social network origins of the model: it might be assumed to be the result in the observed network of the tendency of a person’s friends to also be friends with each other, for example. In the context of biological networks, it might be interpreted as a sign of evolutionary events; however, this interpretation is very much open to question, as briefly discussed in the Introduction.
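The logistic-regression reading of an ERGM can be made concrete with a small sketch: conditional on the rest of the network, the log-odds of a tie being present is the sum of each parameter multiplied by its change statistic (the change in the corresponding network statistic when the tie is added). The parameter values and change statistics below are hypothetical numbers chosen for illustration, not estimates from the models in this work.

```python
from math import exp


def conditional_tie_probability(theta, change_stats):
    """P(tie present | rest of network) = logistic(sum_k theta_k * delta_k)."""
    log_odds = sum(theta[name] * change_stats[name] for name in theta)
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical parameters: a negative Arc (density) term and a positive closure term
theta = {"Arc": -4.0, "AltKTrianglesT": 1.0}

# The same candidate arc, when it closes no two-path versus when it closes one
p_no_closure = conditional_tie_probability(theta, {"Arc": 1, "AltKTrianglesT": 0.0})
p_closure = conditional_tie_probability(theta, {"Arc": 1, "AltKTrianglesT": 1.0})
```

Under these assumed values, an arc that closes a two-path is more probable than one that does not, which is the sense in which a positive closure parameter describes a tendency toward tie formation rather than a count of motifs.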

Even when the “alternating” configurations are used, it can be difficult or impossible to find a converged and well-fitting ERGM for a given network. For example, we were unable to fit an ERGM with triangular configurations (using either statnet or EstimNetDirected) to an example of a neural network, the whole-animal chemical connectome (a directed network with 579 nodes and 5246 arcs) of the male C. elegans worm (Cook et al. 2019).

Hence, in order to directly test motif significance without having to fit a parameterized model such as an ERGM, new methods, such as the “anchored motif” proposed by Fodor et al. (2020), are still required.

In some of the models presented here, we used values other than the usual default value \(\lambda =2\) for the decay parameter \(\lambda\) of the “alternating” statistics. We had to manually estimate appropriate values of \(\lambda\) based on trial and error, guided by knowledge of the observed network, convergence and goodness-of-fit of the models (or lack thereof), and the definitions of the relevant statistics (Koskinen and Daraganova 2013; Snijders et al. 2006). It is possible to instead estimate \(\lambda\) (or an equivalent parameter) directly from the data, as part of the model, using a “curved ERGM” (Hunter 2007; Hunter and Handcock 2006), and this is implemented in the statnet R package (Handcock et al. 2008, 2016, 2021; Hunter et al. 2008; Krivitsky et al. 2021; Morris et al. 2008). However, it is not currently possible to estimate curved ERGMs using the EstimNetDirected software (Stivala et al. 2020), and this is an area requiring further work. In the absence of such a principled way of estimating the decay parameters, an alternative to the heuristic (trial and error) approach used here is to estimate many models with systematically varying values of the \(\lambda\) decay parameter for each relevant “alternating” model parameter, and use a grid search to find the model with best fit (see Note 3). We applied this method to the Alon yeast regulatory network model (Additional file 1: Table S2), using the Mahalanobis distance between a vector of some of the observed network summary statistics used for goodness-of-fit (degree distributions, reciprocity, giant component size, global and average local clustering coefficient), and the corresponding vectors for networks simulated from the model, as the value to minimize. We used a two-dimensional grid, varying the \(\lambda\) value for AltOutStars as one dimension, and the value of \(\lambda\) for both AltTwoPathsT and AltKTrianglesT (these values should be the same, as described in Snijders et al. (2006)) as the other dimension. With both values varying from 1.5 to 5.0 in steps of 0.5, we found the minimum Mahalanobis distance was at \(\lambda = 4.5\) for the AltOutStars parameter, and \(\lambda = 1.5\) for the AltTwoPathsT and AltKTrianglesT parameters. The parameters estimated for this model are not substantively different from those in Table 7. The values of \(\lambda\) that we determined heuristically (Table 7) were at rank 15 (of 64) using this criterion. The model with the default \(\lambda = 2.0\) for all alternating statistic parameters, with subjectively poor goodness-of-fit on the out-degree distribution, is at rank 48 (of 64).
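The two ingredients of this grid search can be sketched as follows: constructing the two-dimensional grid of decay values, and computing a Mahalanobis distance from the observed statistics vector to the distribution of simulated vectors. The sketch assumes the distance is taken with respect to the mean and sample covariance of the simulated vectors, which is one natural reading of the description above; the function names are hypothetical, and model estimation and simulation are not shown.

```python
from itertools import product


def lambda_grid(start=1.5, stop=5.0, step=0.5):
    """All (lambda for out-stars, lambda for two-paths and triangles) pairs on the grid."""
    count = int(round((stop - start) / step)) + 1
    values = [start + i * step for i in range(count)]
    return list(product(values, values))


def mahalanobis(observed, simulated):
    """Distance from `observed` to the mean of `simulated` rows, scaled by their covariance."""
    n, p = len(simulated), len(observed)
    mu = [sum(row[j] for row in simulated) / n for j in range(p)]
    cov = [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in simulated) / (n - 1)
            for b in range(p)] for a in range(p)]
    diff = [observed[j] - mu[j] for j in range(p)]
    # Solve cov * y = diff by Gaussian elimination with partial pivoting
    # (a singular covariance matrix raises ZeroDivisionError)
    aug = [cov[i][:] + [diff[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * p
    for i in reversed(range(p)):
        y[i] = (aug[i][p] - sum(aug[i][c] * y[c] for c in range(i + 1, p))) / aug[i][i]
    return sum(d * v for d, v in zip(diff, y)) ** 0.5


grid = lambda_grid()
# Toy values only: four simulated two-statistic vectors and one observed vector
toy_distance = mahalanobis([3.0, 1.0], [[0, 0], [2, 0], [0, 2], [2, 2]])
```

In a full grid search, one model would be estimated and simulated per grid point, and the grid points ranked by this distance.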

As previously mentioned, the configurations available in an ERGM are determined by the dependence assumptions: although there is a lot of flexibility available in ERGM configurations, we cannot simply add arbitrary configurations without regard for the underlying dependency assumption (Koskinen 2020). The least restrictive assumption used in practice is the “social circuit” dependency assumption (Lusher et al. 2013; Robins et al. 2007b, 2009; Snijders et al. 2006) used in this work, which allows the use of the “alternating” configurations.

We also note that some recent work suggests that complex network structure, including heavy-tailed degree distributions, closure (clustering), large connected components, and short path lengths can arise simply from thresholding normally distributed data to generate the binary network (Cantwell et al. 2020). Hence inferences from ERGM modeling about network structure, just as with other techniques such as comparison to ensembles of random graphs, could be consequences of the way the binary network was constructed.

Valued ERGMs (Desmarais and Cranmer 2012; Krivitsky 2012) may be used to avoid this problem by removing the need to construct a binary network at all, and working directly with the network with valued edges. Parameter estimation for these models is even more computationally intensive than for binary networks, and hence is so far impractical to use for networks of the size considered here. Using new estimation techniques to improve the scalability of parameter estimation for valued ERGMs is another area requiring further research.

For the relatively small (on the order of one thousand nodes or fewer) directed networks considered here, it is possible to do simulation-based goodness-of-fit tests. Although it is possible to estimate ERGM parameters for far larger (over one million nodes) networks using the EstimNetDirected software, it is not practical to simulate such large networks from the model, and this is an area requiring further work (Stivala et al. 2020).

One further limitation to consider is the execution time of the ERGM technique. As discussed in the introductory sections, ERGM parameter estimation is a computationally difficult problem. Although recent advances allow the estimation in minutes of models that would have taken hours, or been infeasible to estimate, with earlier methods, it is still much more computationally difficult to do this than it is to run conventional motif finding methods. The networks used here took between three and 73 minutes to estimate, using multiple (up to 64) processor cores in parallel. However motif finding with MFinder (Kashtan et al. 2004) in these networks takes only seconds, and with the faster NetMODE method (Li et al. 2012), even less time, using only a single processor core.

Conclusion

We have re-examined the use of exponential random graph models for analyzing biological networks, an application first introduced in the bioinformatics literature by Saul and Filkov (2007). Advances in ERGM estimation methods since then have allowed more sophisticated models to be estimated for more and larger networks than was possible at the time, and they are now a more practical technique for making inferences about structural hypotheses in biological networks, potentially solving some of the problems inherent in conventional methods for testing motif over-representation. By using an ERGM, all configurations in the model are tested simultaneously, each conditional on all the others, rather than having to test one at a time with the other configurations fixed in a (more or less sophisticated, the choice of which is critical to the results) null model.

The ERGM models of the Alon E. coli network presented here are the first to retain the directed nature of the network and also include terms for triangular structures. They confirm the result of Milo et al. (2002) that path closure (feed-forward loop) is over-represented, even when we include other, related, parameters in the model.

We also presented the first ERGM models of a yeast regulatory network retaining its inherently directed nature (rather than treating it as undirected). We find statistically significant over-representation of the transitive closure motif, just as Milo et al. (2002) did in the same yeast regulatory network, using a simple randomization test.

The lack of the cyclic triangle (feedback loop) structure in the data, however, is reproduced by models that do not contain any parameter corresponding to this structure. This suggests that this structure is not an “anti-motif”, but rather that its lack is a consequence of the structural features of the networks, specifically degree distributions, two-paths, and transitive closure, that are included in the models.

Availability of data and materials

Source code, configuration files, and datasets are available from https://sites.google.com/site/alexdstivala/home/ergm_bionetworks.

Notes

  1. Modeling self-edges in this way was suggested by an anonymous reviewer, on the grounds that a node with a self-edge is a (very simple) motif.

  2. Note that a negative estimate of the geometrically weighted degree parameter in statnet has the same interpretation as a positive estimate for the alternating k-stars parameter as used here. This frequently leads to confusion (Levy 2016; Levy et al. 2016).

  3. This strategy was suggested by an anonymous reviewer.

Abbreviations

CDF: Cumulative distribution function

CUG: Conditional uniform graph

DMN: Default mode network

EE: Equilibrium expectation

EEG: Electroencephalography

ERGM: Exponential random graph model

GO: Gene ontology

HIPPIE: Human integrated protein–protein interaction reference

IFD: Improved fixed density

MAN: Mutual, asymmetric, null

MCMC: Markov chain Monte Carlo

PANTHER: Protein analysis through evolutionary relationships

PPI: Protein–protein interaction

SBM: Stochastic block model

References

  1. Ahnert SE, Fink T (2016) Form and function in gene regulatory networks: the structure of network motifs determines fundamental properties of their dynamical state space. J R Soc Interface 13(120):20160179

  2. Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH (2016) HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res 45(D1):D408–D414

  3. Alon U (2007) Network motifs: theory and experimental approaches. Nat Rev Genet 8:450–461

  4. Amati V, Lomi A, Mira A (2018) Social network modeling. Annu Rev Stat Appl 5:343–369

  5. An W (2016) Fitting ERGMs on big networks. Soc Sci Res 59:107–119. https://doi.org/10.1016/j.ssresearch.2016.04.019

  6. Anderson BS, Butts C, Carley K (1999) The interaction of size and density with graph-level indices. Soc Netw 21(3):239–267

  7. Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L (2004) Comment on “network motifs: simple building blocks of complex networks” and “superfamilies of evolved and designed networks.” Science 305(5687):1107c

  8. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29

    Google Scholar 

  9. Azad A, Lawen A, Keith JM (2017) Bayesian model of signal rewiring reveals mechanisms of gene dysregulation in acquired drug resistance in breast cancer. PLoS ONE 12(3):e0173331

    Google Scholar 

  10. Babkin S, Stewart J, Long X, Schweinberger M (2020) Large-scale estimation of random graph models with local dependence. Comput Stat Data Anal 152:107029

    MathSciNet  MATH  Google Scholar 

  11. Balaji S, Babu MM, Iyer LM, Luscombe NM, Aravind L (2006) Comprehensive analysis of combinatorial regulation using the transcriptional regulatory network of yeast. J Mol Biol 360(1):213–227

    Google Scholar 

  12. Batagelj V, Mrvar A (2001) A subquadratic triad census algorithm for large sparse networks with small maximum degree. Soc Netw 23(3):237–243

    Google Scholar 

  13. Beber ME, Fretter C, Jain S, Sonnenschein N, Müller-Hannemann M, Hütt MT (2012) Artefacts in statistical analyses of network motifs: general framework and application to metabolic networks. J R Soc Interface 9(77):3426–3435

    Google Scholar 

  14. Begum M, Bagga J, Saha S (2014) Network motif identification and structure detection with exponential random graph models. Netw Biol 4(4):155–169

    Google Scholar 

  15. Borisenko A, Byshkin M, Lomi A (2019) A simple algorithm for scalable Monte Carlo inference. arXiv preprint arXiv:1901.00533v3

  16. Bulashevska S, Bulashevska A, Eils R (2010) Bayesian statistical modelling of human protein interaction network incorporating protein disorder information. BMC Bioinform 11(1):46

    MATH  Google Scholar 

  17. Butts CT (2008) Social network analysis: a methodological introduction. Asian J Soc Psychol 11(1):13–41

    Google Scholar 

  18. Byshkin M, Stivala A, Mira A, Krause R, Robins G, Lomi A (2016) Auxiliary parameter MCMC for exponential random graph models. J Stat Phys 165(4):740–754

    MathSciNet  MATH  Google Scholar 

  19. Byshkin M, Stivala A, Mira A, Robins G, Lomi A (2018) Fast maximum likelihood estimation via equilibrium expectation for large network data. Sci Rep 8:11509

    Google Scholar 

  20. Caimo A, Friel N (2011) Bayesian inference for exponential random graph models. Soc Netw 33(1):41–55

    Google Scholar 

  21. Caimo A, Friel N (2014) Bergm: Bayesian exponential random graphs in R. J Stat Softw 61(2):1–25

    Google Scholar 

  22. Cantwell GT, Liu Y, Maier BF, Schwarze AC, Serván CA, Snyder J, St-Onge G (2020) Thresholding normally distributed data creates complex networks. Phys Rev E 101(6):062302

    Google Scholar 

  23. Cimini G, Squartini T, Saracco F, Garlaschelli D, Gabrielli A, Caldarelli G (2019) The statistical physics of real-world networks. Nat Rev Phys 1:58–71

    Google Scholar 

  24. Ciriello G, Guerra C (2008) A review on models and algorithms for motif discovery in protein–protein interaction networks. Brief Funct Genom 7(2):147–156

    Google Scholar 

  25. Clauset A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703

    MathSciNet  MATH  Google Scholar 

  26. Cook SJ, Jarrell TA, Brittin CA, Wang Y, Bloniarz AE, Yakovlev MA, Nguyen KC, Tang LTH, Bayer EA, Duerr JS et al (2019) Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature 571(7763):63–71

    Google Scholar 

  27. Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI (2001) YPD™, PombePD™ and WormPD™: model organism volumes of the BioKnowledge™ Library, an integrated resource for protein information. Nucleic Acids Res 29(1):75–79. https://doi.org/10.1093/nar/29.1.75

    Article  Google Scholar 

  28. Csárdi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Syst 1695:1–9

    Google Scholar 

  29. Davis JA, Leinhardt S (1967) The structure of positive interpersonal relations in small groups. In: Berger J (ed) Sociological theories in progress, vol 2. Houghton Mifflin, Boston, MA, pp 251–281

    Google Scholar 

  30. De Las Rivas J, Fontanillo C (2010) Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6(6):e1000807

    Google Scholar 

  31. Desmarais BA, Cranmer SJ (2012) Statistical inference for valued-edge networks: the generalized exponential random graph model. PLoS ONE 7(1):e30136

    Google Scholar 

  32. van Duijn MA, Gile KJ, Handcock MS (2009) A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Soc Netw 31(1):52–62

    Google Scholar 

  33. Faust K (2010) A puzzle concerning triads in social networks: graph constraints and the triad census. Soc Netw 32(3):221–233

    Google Scholar 

  34. Felmlee D, McMillan C, Whitaker R (2021) Dyads, triads, and tetrads: a multivariate simulation approach to uncovering network motifs in social graphs. Appl Netw Sci 6(1):63

    Google Scholar 

  35. Ferrell JE (2002) Self-perpetuating states in signal transduction: positive feedback, double-negative feedback and bistability. Curr Opin in Cell Biol 14(2):140–148

    Google Scholar 

  36. Fodor J, Brand M, Stones RJ, Buckle AM (2020) Intrinsic limitations in mainstream methods of identifying network motifs in biology. BMC Bioinform 21:165

    Google Scholar 

  37. Fronczak P, Fronczak A, Bujok M (2013) Exponential random graph models for networks with community structure. Phys Rev E 88(3):032810

    Google Scholar 

  38. Gillespie CS (2015) Fitting heavy tailed distributions: the poweRlaw package. J Stat Softw 64(2):1–16

    Google Scholar 

  39. Ginoza R, Mugler A (2010) Network motifs come in sets: correlations in the randomization process. Phys Rev E 82(1):011921

    Google Scholar 

  40. Gorochowski TE, Grierson CS, Di Bernardo M (2018) Organization of feed-forward loop motifs reveals architectural principles in natural and engineered networks. Sci Adv 4(3):eaap9751

    Google Scholar 

  41. Grazioli G, Martin RW, Butts CT (2019a) Comparative exploratory analysis of intrinsically disordered protein dynamics using machine learning and network analytic methods. Front Mol Biosci 6:42

    Google Scholar 

  42. Grazioli G, Yu Y, Unhelkar MH, Martin RW, Butts CT (2019b) Network-based classification and modeling of amyloid fibrils. J Phys Chem B 123(26):5452–5462

    Google Scholar 

  43. Gross E, Petrović S, Stasi D (2021) Random graphs with node and block effects: models, goodness-of-fit tests, and applications to biological networks. arXiv preprint arXiv:2104.03167v1

  44. Guelzim N, Bottani S, Bourgine P, Képès F (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 31(1):60–63

    Google Scholar 

  45. Hagberg A, Swart P, S Chult D (2008) Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux G, Vaught T, Millman J (eds) Proceedings of the 7th Python in science conference (SciPy 2008), pp 11–16

  46. Handcock MS, Gile KJ (2010) Modeling social networks from sampled data. Ann Appl Stat 4(1):5–25

    MathSciNet  MATH  Google Scholar 

  47. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2008) statnet: software tools for the representation, visualization, analysis and simulation of network data. J Stat Softw 24(1):1–11

    Google Scholar 

  48. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Bender-deMoll S, Morris M (2016) statnet: software tools for the statistical analysis of network data. The Statnet Project http://www.statnet.org, CRAN.R-project.org/package=statnet, R package version 2016.9

  49. Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M (2021) ergm: fit, simulate and diagnose exponential-family models for networks. The Statnet Project https://statnet.org, https://CRAN.R-project.org/package=ergm, R package version 4.1.2

  50. Holland PW, Leinhardt S (1970) A method for detecting structure in sociometric data. Am J Sociol 76(3):492–513

    Google Scholar 

  51. Holland PW, Leinhardt S (1976) Local structure in social networks. Sociol Methodol 7:1–45

    Google Scholar 

  52. Hummel RM, Hunter DR, Handcock MS (2012) Improving simulation-based algorithms for fitting ERGMs. J Comput Graph Stat 21(4):920–939

    MathSciNet  Google Scholar 

  53. Hunter DR (2007) Curved exponential family models for social networks. Soc Netw 29(2):216–230

    Google Scholar 

  54. Hunter DR, Handcock MS (2006) Inference in curved exponential family models for networks. J Comput Graph Stat 15(3):565–583

    MathSciNet  Google Scholar 

  55. Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M (2008) ergm: a package to fit, simulate and diagnose exponential-family models for networks. J Stat Softw 24(3):1–29

    Google Scholar 

  56. Hunter DR, Krivitsky PN, Schweinberger M (2012) Computational statistical methods for social network models. J Comput Graph Stat 21(4):856–882

    MathSciNet  Google Scholar 

  57. Ingram PJ, Stumpf MP, Stark J (2006) Network motifs: structure does not determine function. BMC Genom 7:108

    Google Scholar 

  58. Jazayeri A, Yang CC (2020) Motif discovery algorithms in static and temporal networks: a survey. J Complex Netw 8(4):cnaa031. https://doi.org/10.1093/comnet/cnaa031

    MathSciNet  Article  Google Scholar 

  59. Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107

    MathSciNet  Google Scholar 

  60. Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758

    Google Scholar 

  61. Konagurthu AS, Lesk AM (2008a) On the origin of distribution patterns of motifs in biological networks. BMC Syst Biol 2:73

    Google Scholar 

  62. Konagurthu AS, Lesk AM (2008b) Single and multiple input modules in regulatory networks. Proteins 73(2):320–324

    Google Scholar 

  63. Koskinen J (2020) Exponential random graph modelling. In: Atkinson P, Delamont S, Cernat A, Sakshaug J, Williams R (eds) SAGE research methods foundations. SAGE, London. https://doi.org/10.4135/9781526421036888175

    Chapter  Google Scholar 

  64. Koskinen J, Daraganova G (2013) Exponential random graph model fundamentals. In: Lusher D, Koskinen J, Robins G (eds) Exponential random graph models for social networks. Cambridge University Press, New York, pp 49–76

    Google Scholar 

  65. Koskinen JH, Robins GL, Wang P, Pattison PE (2013) Bayesian analysis for partially observed network data, missing ties, attributes and actors. Soc Netw 35(4):514–527

    Google Scholar 

  66. Krivitsky PN (2012) Exponential-family random graph models for valued networks. Electron J Stat 6:1100–1128

    MathSciNet  MATH  Google Scholar 

  67. Krivitsky PN (2017) Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput Stat Data An 107:149–161

    MathSciNet  MATH  Google Scholar 

  68. Krivitsky PN, Handcock MS (2014) A separable model for dynamic networks. J R Stat Soc B Met 76(1):29–46

    MathSciNet  MATH  Google Scholar 

  69. Krivitsky PN, Hunter DR, Morris M, Klumb C (2021) ergm 4.0: new features and improvements. arXiv preprint arXiv:2106.04997

  70. Kumar G, Ranganathan S (2010) Network analysis of human protein location. BMC Bioinform 11(7):S9

    Google Scholar 

  71. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298(5594):799–804

    Google Scholar 

  72. Lehmann B, Henson R, Geerligs L, White S et al (2021) Characterising group-level brain connectivity: a framework using Bayesian exponential random graph models. Neuroimage 225:117480

    Google Scholar 

  73. Lesk AM, Konagurthu AS (2021) Neighbourhoods in the yeast regulatory network in different physiological states. Bioinformatics 37(4):551–558

    Google Scholar 

  74. Levy M (2016) gwdegree: improving interpretation of geometrically-weighted degree estimates in exponential random graph models. J Open Source Softw 1(3):36

    Google Scholar 

  75. Levy M, Lubell M, Leifeld P, Cranmer S (2016) Interpretation of GW-degree estimates in ERGMs. https://doi.org/10.6084/m9.figshare.3465020.v1

  76. Li X, Stones RJ, Wang H, Deng H, Liu X, Wang G (2012) NetMODE: network motif detection without nauty. PLoS ONE 7(12):e50093

    Google Scholar 

  77. Lienert J, Koehly L, Reed-Tsochas F, Marcum CS (2019) An efficient counting method for the colored triad census. Soc Netw 58:136–142

    Google Scholar 

  78. Lusher D, Koskinen J, Robins G (eds) (2013) Exponential random graph models for social networks. Structural analysis in the social sciences. Cambridge University Press, New York

    Google Scholar 

  79. Mahadevan P, Krioukov D, Fall K, Vahdat A (2006) Systematic topology analysis and generation using degree correlations. ACM SIGCOMM Comput Commun 36(4):135–146

    Google Scholar 

  80. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA 100(21):11980–11985

    Google Scholar 

  81. Manjang K, Tripathi S, Yli-Harja O, Dehmer M, Emmert-Streib F (2020) Graph-based exploitation of gene ontology using GOxploreR for scrutinizing biological significance. Sci Rep 10(1):16672

    Google Scholar 

  82. Manjang K, Emmert-Streib F, Tripathi S, Yli-Harja O, Dehmer M (2021) GOxploreR: structural exploration of the gene ontology (GO) knowledge base. https://CRAN.R-project.org/package=GOxploreR, R package version 1.2.1

  83. Martorana E, Micale G, Ferro A, Pulvirenti A (2020) Establish the expected number of induced motifs on unlabeled graphs through analytical models. Appl Netw Sci 5(1):58

    Google Scholar 

  84. Mayhew BH (1984) Baseline models of sociological phenomena. J Math Sociol 9(4):259–281

    Google Scholar 

  85. Mazurie A, Bottani S, Vergassola M (2005) An evolutionary and functional assessment of regulatory network motifs. Genome Biol 6(4):R35

    Google Scholar 

  86. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399–403

    Google Scholar 

  87. Mewes HW, Frishman D, Güldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Münsterkötter M, Rudd S, Weil B (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res 30(1):31–34

    Google Scholar 

  88. Mi H, Muruganujan A, Huang X, Ebert D, Mills C, Guo X, Thomas PD (2019) Protocol update for large-scale genome and gene function analysis with the PANTHER classification system (v. 14.0). Nat Protoc 14(3):703–721

    Google Scholar 

  89. Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, Thomas PD (2021) PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res 49(D1):D394–D403

    Google Scholar 

  90. Middendorf M, Ziv E, Wiggins CH (2005) Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci USA 102(9):3192–3197

    Google Scholar 

  91. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827

    Google Scholar 

  92. Monteiro PT, Pedreira T, Galocha M, Teixeira MC, Chaouiya C (2020) Assessing regulatory features of the current transcriptional network of Saccharomyces cerevisiae. Sci Rep 10(1):17744

    Google Scholar 

  93. Moody J (1998) Matrix methods for calculating the triad census. Soc Netw 20(4):291–299

    Google Scholar 

  94. Morris M, Handcock M, Hunter D (2008) Specification of exponential-family random graph models: terms and computational aspects. J Stat Softw 24(4):1–24

    Google Scholar 

  95. Obando C, De Vico FF (2017) A statistical model for brain networks inferred from large-scale electrophysiological signals. J R Soc Interface 14(128):20160940

    Google Scholar 

  96. Orsini C, Dankulov MM, Colomer-de Simón P, Jamakovic A, Mahadevan P, Vahdat A, Bassler KE, Toroczkai Z, Boguná M, Caldarelli G et al (2015) Quantifying randomness in real networks. Nat Commun 6:8627

    Google Scholar 

  97. Ouma WZ, Pogacar K, Grotewold E (2018) Topological and statistical analyses of gene regulatory networks reveal unifying yet quantitatively different emergent properties. PLoS Comput Biol 14(4):e1006098

    Google Scholar 

  98. Patra S, Mohapatra A (2020) Review of tools and algorithms for network motif discovery in biological networks. IET Syst Biol 14(4):171–189

    Google Scholar 

  99. Pattison PE, Robins GL, Snijders TAB, Wang P (2013) Conditional estimation of exponential random graph models from snowball sampling designs. J Math Psychol 57(6):284–296

    MathSciNet  MATH  Google Scholar 

  100. Payne JL, Wagner A (2015) Function does not follow form in gene regulatory circuits. Sci Rep 5:13015

    Google Scholar 

  101. Picard F, Daudin JJ, Koskas M, Schbath S, Robin S (2008) Assessing the exceptionality of network motifs. J Comput Biol 15(1):1–20

    MathSciNet  Google Scholar 

  102. Pržulj N (2007) Biological network comparison using graphlet degree distribution. Bioinformatics 23(2):e177–e183

    Google Scholar 

  103. Rice JJ, Kershenbaum A, Stolovitzky G (2005) Lasting impressions: motifs in protein–protein maps may provide footprints of evolutionary events. Proc Natl Acad Sci USA 102(9):3173–3174

    Google Scholar 

  104. Robins G, Pattison P, Woolcock J (2004) Missing data in networks: exponential random graph (p*) models for networks with non-respondents. Soc Netw 26(3):257–283

    Google Scholar 

  105. Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2):173–191

    Google Scholar 

  106. Robins G, Snijders TAB, Wang P, Handcock M, Pattison P (2007) Recent developments in exponential random graph (p*) models for social networks. Soc Netw 29(2):192–215

    Google Scholar 

  107. Robins G, Pattison P, Wang P (2009) Closure, connectivity and degree distributions: exponential random graph (p*) models for directed social networks. Soc Netw 31(2):105–117

    Google Scholar 

  108. Rolls DA, Robins G (2017) Minimum distance estimators of population size from snowball samples using conditional estimation and scaling of exponential random graph models. Comput Stat Data Anal 116:32–48

    MathSciNet  MATH  Google Scholar 

  109. Rolls DA, Wang P, Jenkinson R, Pattision PE, Robins GL, Sacks-Davis R, Daraganova G, Hellard M, McBryde E (2013) Modelling a disease-relevant contact network of people who inject drugs. Soc Netw 35(4):699–710

    Google Scholar 

  110. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M et al (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32(18):5539–5545

    Google Scholar 

  111. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millán-Zárate D, Díaz-Peredo E, Sánchez-Solano F, Pérez-Rueda E, Bonavides-Martínez C, Collado-Vides J (2001) RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 29(1):72–74

    Google Scholar 

  112. Saul ZM, Filkov V (2007) Exploring biological network structure using exponential random graph models. Bioinformatics 23(19):2604–2611

    Google Scholar 

  113. Schaefer MH, Fontaine JF, Vinayagam A, Porras P, Wanker EE, Andrade-Navarro MA (2012) HIPPIE: integrating protein interaction networks with experiment based quality scores. PLoS ONE 7(2):e31826

    Google Scholar 

  114. Schaefer MH, Lopes TJ, Mah N, Shoemaker JE, Matsuoka Y, Fontaine JF, Louis-Jeune C, Eisfeld AJ, Neumann G, Perez-Iratxeta C et al (2013) Adding protein context to the human protein–protein interaction network to reveal meaningful interactions. PLoS Comput Biol 9(1):e1002860

    Google Scholar 

  115. Schuldiner M, Collins SR, Thompson NJ, Denic V, Bhamidipati A, Punna T, Ihmels J, Andrews B, Boone C, Greenblatt JF et al (2005) Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell 123(3):507–519

    Google Scholar 

  116. Schweinberger M (2020) Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli 26(2):1205–1233

    MathSciNet  MATH  Google Scholar 

  117. Schweinberger M, Handcock MS (2015) Local dependence in random graph models: characterization, properties and statistical inference. J Am Stat Assoc 77(3):647–676

    MathSciNet  MATH  Google Scholar 

  118. Schweinberger M, Luna P (2018) Hergm: hierarchical exponential-family random graph models. J Stat Softw 85(1):1–39

    Google Scholar 

  119. Schweinberger M, Krivitsky PN, Butts CT, Stewart JR (2020) Exponential-family models of random graphs: inference in finite, super and infinite population scenarios. Stat Sci 35(4):627–662

    MathSciNet  MATH  Google Scholar 

  120. Shalizi CR, Rinaldo A (2013) Consistency under sampling of exponential random graph models. Ann Stat 41(2):508–535

    MathSciNet  MATH  Google Scholar 

  121. Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68

    Google Scholar 

  122. Shin CJ, Wong S, Davis MJ, Ragan MA (2009) Protein–protein interaction as a predictor of subcellular location. BMC Syst Biol 3:28

    Google Scholar 

  123. de Silva E, Stumpf MP (2005) Complex networks and simple models in biology. J R Soc Interface 2(5):419–430

    Google Scholar 

  124. Simpson SL, Hayasaka S, Laurienti PJ (2011) Exponential random graph modeling for complex brain networks. PLoS ONE 6(5):e20039

    Google Scholar 

  125. Simpson SL, Moussa MN, Laurienti PJ (2012) An exponential random graph modeling approach to creating group-based representative whole-brain connectivity networks. Neuroimage 60(2):1117–1126

    Google Scholar 

  126. Sinke MR, Dijkhuizen RM, Caimo A, Stam CJ, Otte WM (2016) Bayesian exponential random graph modeling of whole-brain structural networks across lifespan. Neuroimage 135:79–91

    Google Scholar 

  127. Snijders TAB (1991) Enumeration and simulation methods for 0–1 matrices with given marginals. Psychometrika 56(3):397–417

    MathSciNet  MATH  Google Scholar 

  128. Snijders TAB (2002) Markov chain Monte Carlo estimation of exponential random graph models. J Soc Struct 3(2):1–40

    MathSciNet  Google Scholar 

  129. Snijders TAB, Pattison PE, Robins GL, Handcock MS (2006) New specifications for exponential random graph models. Sociol Methodol 36(1):99–153

    Google Scholar 

  130. Stillman PE, Wilson JD, Denny MJ, Desmarais BA, Bhamidi S, Cranmer SJ, Lu ZL (2017) Statistical modeling of the default mode brain network reveals a segregated highway structure. Sci Rep 7(1):11694

    Google Scholar 

  131. Stivala A, Robins G, Lomi A (2020) Exponential random graph model parameter estimation for very large directed networks. PLoS ONE 15(1):e0227804

    Google Scholar 

  132. Stivala AD, Koskinen JH, Rolls D, Wang P, Robins GL (2016) Snowball sampling for estimating exponential random graph models for large networks. Soc Netw 47:167–188

    Google Scholar 

  133. Strauss D, Ikeda M (1990) Pseudolikelihood estimation for social networks. J Am Stat Assoc 85(409):204–212

    MathSciNet  Google Scholar 

  134. Suratanee A, Schaefer MH, Betts MJ, Soons Z, Mannsperger H, Harder N, Oswald M, Gipp M, Ramminger E, Marcus G et al (2014) Characterizing protein interactions employing a genome-wide siRNA cellular phenotyping screen. PLoS Comput Biol 10(9):e1003814

    Google Scholar 

  135. Wang P, Robins G, Pattison P (2009) PNet: program for the estimation and simulation of p* exponential random graph models. Department of Psychology, The University of Melbourne, Parkville

    Google Scholar 

  136. Wang Y, Fang H, Yang D, Zhao H, Deng M (2019) Network clustering analysis using mixture exponential-family random graph models and its application in genetic interaction data. IEEE/ACM Trans Comput Biol Bioinform 16(5):1743–1752

    Google Scholar 

  137. Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge

    MATH  Google Scholar 

  138. Winterbach W, Van Mieghem P, Reinders M, Wang H, de Ridder D (2013) Topology of molecular interaction networks. BMC Syst Biol 7:90

    Google Scholar 

  139. Yaveroǧlu ON, Fitzhugh SM, Kurant M, Markopoulou A, Butts CT, Pržulj N (2015) ergm.graphlets: a package for ERG modeling based on graphlet statistics. J Stat Softw 65(12):1–29

    Google Scholar 

  140. Yu S, Feng Y, Zhang D, Bedru HD, Xu B, Xia F (2020) Motif discovery in networks: a survey. Comput Sci Rev 37:100267

    MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

This work was supported by Swiss National Science Foundation National Research Programme 75 [Grant No. 167326]; and Melbourne Bioinformatics at the University of Melbourne [Grant No. VR0261].

Author information

Contributions

AS conceived the work and conducted the analysis. AS and AL interpreted the results. AS drafted the original manuscript and both authors revised it. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Alex Stivala.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary tables and figures.

Additional models and goodness-of-fit plots.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Stivala, A., Lomi, A. Testing biological network motif significance with exponential random graph models. Appl Netw Sci 6, 91 (2021). https://doi.org/10.1007/s41109-021-00434-y

Keywords

  • Motif
  • Biological network
  • Exponential random graph model
  • ERGM