Testing biological network motif significance with exponential random graph models
Applied Network Science volume 6, Article number: 91 (2021)
Abstract
Analysis of the structure of biological networks often uses statistical tests to establish the overrepresentation of motifs, which are thought to be important building blocks of such networks, related to their biological functions. However, there is disagreement as to the statistical significance of these motifs, and there are potential problems with standard methods for estimating this significance. Exponential random graph models (ERGMs) are a class of statistical model that can overcome some of the shortcomings of commonly used methods for testing the statistical significance of motifs. ERGMs were first introduced into the bioinformatics literature over 10 years ago but have had limited application to biological networks, possibly due to the practical difficulty of estimating model parameters. Advances in estimation algorithms now afford analysis of much larger networks in practical time. We illustrate the application of ERGM to both an undirected protein–protein interaction (PPI) network and directed gene regulatory networks. ERGM models indicate overrepresentation of triangles in the PPI network, and confirm results from previous research as to overrepresentation of transitive triangles (feedforward loop) in an E. coli and a yeast regulatory network. We also confirm, using ERGMs, previous research showing that underrepresentation of the cyclic triangle (feedback loop) can be explained as a consequence of other topological features.
Introduction
Molecular interactions in biological systems are often represented as networks (Winterbach et al. 2013). Some such networks are inherently undirected, such as protein–protein interaction (PPI) networks (De Las Rivas and Fontanillo 2010). Others may be directed, such as gene regulatory networks, where nodes represent operons, and arcs (directed edges) represent transcriptional interactions between them. Much research with such biological networks has concerned “motifs”, small subgraphs which occur more frequently than would be expected by chance. Motifs have been considered the building blocks of complex networks (Alon 2007; Ciriello and Guerra 2008; Milo et al. 2002; Shen-Orr et al. 2002). The biological significance of network motifs derives from their possible interpretation as signs of evolutionary events (Middendorf et al. 2005; Rice et al. 2005).
Two simple examples of motifs in undirected networks are triangles (three-cycles) and squares (four-cycles) (Rice et al. 2005). Directed networks allow for a larger set of potentially important motifs (Middendorf et al. 2005; Milo et al. 2002; Rice et al. 2005), which can be quite complicated, leading to problems of consistency in their definition (Konagurthu and Lesk 2008b).
It is worth noting that such (three-node) motifs are an idea with a long history in social network analysis, where the counts of all sixteen possible three-node directed graphs (triads) are known as the triad census (Davis and Leinhardt 1967; Holland and Leinhardt 1970, 1976; Wasserman and Faust 1994). A systematic naming convention has been developed that is based on the number of mutual, asymmetric, and null (M, A, and N) dyads in the triad, followed by a letter to distinguish the orientation if it is not unique (Fig. 1). For example, the transitive triangle is designated 030T, which distinguishes it from the cyclic triad 030C. Although in common usage in social network research, and cited by Milo et al. (2002) and Saul and Filkov (2007) in the context of biological networks, this naming convention is rarely used in discussions of motifs in the bioinformatics or biology literature. There are efficient algorithms for computing the triad census (Batagelj and Mrvar 2001; Moody 1998), implemented in widely used general-purpose graph libraries such as igraph (Csárdi and Nepusz 2006) and NetworkX (Hagberg et al. 2008). The triad census has recently been extended to colored triads, that is, distinguishing the nodes in the triads based on a categorical attribute assigned to them (Lienert et al. 2019). It has long been noted in the social networks literature that the dyad census constrains the triad census, and yet empirical social networks often still have counts for some triads greater than expected given those constraints (Faust 2010).
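The M-A-N naming convention can be made concrete with a short sketch. The code below is only an illustration, not one of the cited census algorithms: the function names are hypothetical, and it resolves the orientation letter only for the 030 class (T versus C), leaving the other lettered classes unlabeled.

```python
from itertools import permutations

def man_code(arcs, a, b, c):
    """Return the (M, A, N) dyad counts for the triad on nodes a, b, c.

    arcs is a set of (source, target) pairs of a directed graph.
    """
    mutual = asymmetric = null = 0
    for u, v in ((a, b), (a, c), (b, c)):
        forward, backward = (u, v) in arcs, (v, u) in arcs
        if forward and backward:
            mutual += 1
        elif forward or backward:
            asymmetric += 1
        else:
            null += 1
    return mutual, asymmetric, null

def is_cyclic(arcs, a, b, c):
    """True if the three nodes form a directed three-cycle."""
    return any((x, y) in arcs and (y, z) in arcs and (z, x) in arcs
               for x, y, z in permutations((a, b, c)))

def triad_label(arcs, a, b, c):
    """M-A-N label; the orientation letter is added for the 030 class only."""
    label = "".join(str(count) for count in man_code(arcs, a, b, c))
    if label == "030":
        label += "C" if is_cyclic(arcs, a, b, c) else "T"
    return label

feedforward = {("x", "y"), ("y", "z"), ("x", "z")}  # transitive triangle
feedback = {("x", "y"), ("y", "z"), ("z", "x")}     # cyclic triangle
print(triad_label(feedforward, "x", "y", "z"))
print(triad_label(feedback, "x", "y", "z"))
```

For real networks, the triad census routines in igraph or NetworkX mentioned above are the practical choice; the sketch only shows how the label is derived from the dyad counts.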
To determine if a motif is overrepresented, the count of the motif in an observed network is compared to the distribution of its counts in a set of simulated random networks (Ciriello and Guerra 2008); it is also possible to determine the significance of motif overrepresentation without simulation (Martorana et al. 2020; Picard et al. 2008). This leads to the problem of choosing the appropriate random networks (null model). Some supposed motifs have been found not to be significantly overrepresented, occurring with the observed frequencies simply due to topological properties of random networks (Konagurthu and Lesk 2008a) or correlations between motifs created by the randomization process (Ginoza and Mugler 2010), although such correlations can also occur even with uniform sampling (Fodor et al. 2020).
Estimating motif (triad census) significance by comparing the triad census of an empirical network to that of ensembles of random graphs also has a long history, for example the conditional uniform graph (CUG) distribution (Anderson et al. 1999; Butts 2008; Mayhew 1984), conditional on the dyad census (UMAN) (Holland and Leinhardt 1976), or on the degree distribution (Snijders 1991). A more modern variation on a similar idea is the dk-series (Mahadevan et al. 2006; Orsini et al. 2015), a sequence of nested network distributions of increasing complexity, fitting in turn density, degree distribution, degree homophily, average local clustering, and clustering by degree (Orsini et al. 2015).
The recent work of Fodor et al. (2020) shows that the assumptions of mainstream methods for motif identification, specifically normally distributed motif frequencies and independence of motifs, do not always hold, and that, as a consequence, such methods cannot always correctly estimate the statistical significance of motif overrepresentation.
Aside from such intrinsic statistical limitations, it may be the case that the apparent statistical overrepresentation of motifs has no evolutionary or functional significance (Ingram et al. 2006; Mazurie et al. 2005; Payne and Wagner 2015), and the choice of null model is a critical factor in this lack of evident relationship between overrepresentation and evolutionary preservation (Beber et al. 2012; Mazurie et al. 2005). Alternatively, the apparent lack of functional significance (Payne and Wagner 2015) may be due to too narrow a definition of “function” (Ahnert and Fink 2016). Recently, it has also been suggested that elementary motifs are a lower level of structure than that which is most functionally relevant in gene regulatory networks characterizing different physiological states (Lesk and Konagurthu 2021).
It might also be the case that particular motifs are overrepresented, not because they are evolutionarily selected for function, but because of spatial clustering (Artzy-Randrup et al. 2004). For example, in the context of PPI networks, we might expect that interactions would be overrepresented between proteins that share a subcellular location, and underrepresented between those that do not, since proteins known to interact usually have the same subcellular locations (von Mering et al. 2002). Indeed, PPI networks can be used as predictors of subcellular location (Kumar and Ranganathan 2010; Shin et al. 2009).
There are many algorithms for motif discovery in complex networks; for recent reviews, see Jazayeri and Yang (2020), Patra and Mohapatra (2020) and Yu et al. (2020). In the present work we are considering only static, not temporal, networks. Although they differ in many details, especially regarding computational efficiency and scalability, these motif discovery algorithms work fundamentally in the manner described above. That is, they count occurrences of a motif in the observed network, and compare this to the distribution of the motif’s frequency in an ensemble of randomized versions of the original network (typically preserving degree sequence). Therefore these conventional methods all test the significance of one motif at a time, assuming independence of motifs, and are all potentially subject to the problems described by Fodor et al. (2020), mentioned above: the assumptions of independence and normally distributed motif frequencies may not hold, and so these methods might not correctly estimate the statistical significance of motif overrepresentation.
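The conventional procedure just described can be sketched in a few lines. This is a minimal illustration for triangles in an undirected network, assuming a simple edge-swap randomization and a z-score summary; the toy network and all function names are hypothetical, and none of the cited motif discovery tools is reproduced here.

```python
import random
from statistics import mean, pstdev

def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def degree_sequence(edges):
    """Sorted list of node degrees."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values())

def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize by swapping edge endpoints; every node keeps its degree."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a multiple edge.
        if len({a, b, c, d}) < 4 or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def motif_z_score(edges, n_random=200, seed=1):
    """Observed triangle count, null mean, and z-score against the ensemble."""
    rng = random.Random(seed)
    observed = count_triangles(edges)
    null_counts = [
        count_triangles(degree_preserving_shuffle(edges, 10 * len(edges), rng))
        for _ in range(n_random)
    ]
    spread = pstdev(null_counts)
    z = (observed - mean(null_counts)) / spread if spread > 0 else float("nan")
    return observed, mean(null_counts), z

# Toy network (hypothetical): four triangles joined in a ring.
toy = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
       (6, 7), (7, 8), (6, 8), (9, 10), (10, 11), (9, 11),
       (2, 3), (5, 6), (8, 9), (0, 11)]
print(motif_z_score(toy))
```

Note that interpreting the z-score as a significance level presumes approximately normal motif counts, which is exactly the assumption questioned by Fodor et al. (2020).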
In this work we describe a different approach to determining motif significance in complex networks, which can potentially overcome these problems. Rather than comparing the observed frequency of a candidate motif to its frequency in a set of randomized networks, we take a model-based approach. Specifically, we estimate parameters of a model (an exponential random graph model, abbreviated ERGM) of the observed network. These parameters correspond to substructures which resemble potential motifs of interest. This allows the significance of the candidate motifs to be tested simultaneously in a single model, in such a way that independence of the motifs is not assumed.
Once such a model is estimated, it can also be used to test for motif significance in the traditional way, using the ERGM to simulate an ensemble of random networks. Recently, this approach was used to test for motifs (dyads, triads, and tetrads; that is, two-, three-, and four-node motifs) in a collection of social (rather than biological) networks (Felmlee et al. 2021). Using ERGM rather than degree-preserving randomization “reduces the scope for misleading results by controlling for multiple, potential correlates in the same set of random models” (Felmlee et al. 2021, p. 2).
We demonstrate the ERGM approach in biological networks (both undirected (PPI) and directed gene regulatory networks) using some recently developed ERGM estimation methods (Borisenko et al. 2019; Byshkin et al. 2016, 2018; Stivala et al. 2020), which allow estimation of models for larger networks than was practical with earlier methods of ERGM parameter estimation.
The remainder of this article is organized as follows. First, we describe ERGMs, and review the literature on the application of ERGMs to biological networks. We then report the biological networks considered in this work, and the details of the ERGM configurations, estimation methods, and goodness-of-fit tests we used. Following that, we present and discuss new ERGM models of these networks, comparing the inferences as to motif significance with existing published results using conventional motif discovery methods. In the next section, we detail the limitations of this application of ERGMs, and indicate some potential future work. We conclude with a summary of the inferences drawn from the ERGM models of the networks considered.
Exponential random graph models
ERGMs are widely used in the social sciences, typically to model social networks (Amati et al. 2018; Koskinen 2020; Lusher et al. 2013; Robins et al. 2007a). Cimini et al. (2019) is a recent review of ERGMs for modeling real-world networks, from a statistical physics viewpoint.
An ERGM is a probability distribution with the form
\[\Pr (X = x) = \frac{1}{\kappa (\theta )} \exp \left( \sum _A \theta _A z_A(x) \right)\]
where

\(X = [X_{ij}]\) is a 0–1 matrix of random tie variables,

x is a realization of X,

A is a “configuration”, a (small) set of nodes and a subset of ties between them,

\(z_A(x)\) is the network statistic for configuration A,

\(\theta _A\) is a model parameter corresponding to configuration A,

\(\kappa (\theta )\) is a normalizing constant to ensure a proper distribution.
Given an observed network x, we aim to find the parameter vector \(\theta\) which maximizes the probability of x under the model. Then for each configuration A in the model, its corresponding parameter \(\theta _A\) and its estimated standard error allow us to make inferences about the over- or underrepresentation of that configuration in the observed network. If \(\theta _A\) is significantly different from zero, then if \(\theta _A > 0\) the configuration A is overrepresented, or underrepresented if \(\theta _A < 0\).
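For a very small network the ERGM distribution can be evaluated exactly, which may help make the roles of \(\theta _A\), \(z_A(x)\), and \(\kappa (\theta )\) concrete. The sketch below assumes a toy model with only two configurations (edges and triangles) on four nodes and enumerates all 64 possible graphs; the names and parameter values are illustrative only.

```python
from itertools import combinations, product
from math import exp

NODES = range(4)
DYADS = list(combinations(NODES, 2))  # the 6 possible undirected ties

def graph_statistics(edges):
    """Return z(x) = (edge count, triangle count) for a set of edges."""
    edge_set = set(edges)
    triangles = sum(
        1 for a, b, c in combinations(NODES, 3)
        if {(a, b), (a, c), (b, c)} <= edge_set
    )
    return len(edge_set), triangles

def all_graphs():
    """Yield every graph on NODES as a list of edges (2**6 = 64 graphs)."""
    for mask in product((0, 1), repeat=len(DYADS)):
        yield [dyad for dyad, bit in zip(DYADS, mask) if bit]

def weight(edges, theta):
    """Unnormalized weight exp(sum_A theta_A * z_A(x))."""
    return exp(sum(t * z for t, z in zip(theta, graph_statistics(edges))))

def ergm_probability(edges, theta):
    """Pr(X = x) with kappa(theta) computed by brute-force enumeration."""
    kappa = sum(weight(g, theta) for g in all_graphs())
    return weight(edges, theta) / kappa

# With all parameters zero, every graph is equally likely.
print(ergm_probability([], (0.0, 0.0)))
# A positive triangle parameter shifts probability towards the complete graph.
print(ergm_probability(DYADS, (0.0, 1.0)))
```

The enumeration covers \(2^{n(n-1)/2}\) graphs for n nodes, so this direct calculation is feasible only for toy examples.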
Note that a “configuration”, unlike a motif (in its most common usage) or the triad census classes, is not an induced subgraph. That is, it does not include every edge in the original graph of which it is a subgraph: a configuration is any occurrence of the substructure in question in the graph; it is defined only by its edges, not by its edges and nonedges. See Fig. 2 for an example based on one from Fodor et al. (2020, Fig. 5B).
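The distinction between a configuration and an induced subgraph can be illustrated by counting two-paths both ways. This is a small sketch with a hypothetical four-node graph (a triangle with one pendant node), not the example of Fig. 2.

```python
from itertools import combinations

def two_path_counts(edges):
    """Return (configuration count, induced count) of two-paths.

    A two-path configuration is any pair of edges sharing a node.
    An induced two-path additionally requires that its end nodes are
    NOT adjacent, that is, the non-edge is part of the definition.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    configurations = induced = 0
    for centre, neighbours in adj.items():
        for a, b in combinations(sorted(neighbours), 2):
            configurations += 1
            if b not in adj[a]:
                induced += 1
    return configurations, induced

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(two_path_counts(edges))  # (5, 2)
```

The triangle contributes three two-path configurations but no induced two-paths, which accounts for the difference between the two counts.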
ERGMs solve the problem of the need to correct for correlations between motif occurrences, and also other attributes such as subcellular location (functional and evolutionary significance is another matter entirely). Given an observed network, model parameters can be estimated by maximum likelihood. Hence parameters corresponding to candidate motifs such as triangles can be estimated, and a positive significant parameter would indicate triangles occurring more frequently than by chance, given the other parameters in the model (which would include parameters to control for density and degree distribution, for example). ERGMs allow different structural configurations to be incorporated, as well as configurations based on node attributes (such as physicochemical properties, or spatial locality), and the significance of the configurations can then be assessed given all the other structural and other configurations included in the model.
ERGMs fulfill all of the desirable criteria for improved network models listed by de Silva and Stumpf (2005, p. 427). They take into account that networks are finite. Indeed, far from requiring very large networks to fit the requirements of mean-field theories, they are dependent on network size and do not scale consistently to infinity (Rolls et al. 2013; Schweinberger et al. 2020; Shalizi and Rinaldo 2013), a property that can be used to estimate population size from network samples (Rolls and Robins 2017). They can handle modular organization or community or block structure (Babkin et al. 2020; Fronczak et al. 2013; Gross et al. 2021; Schweinberger 2020; Schweinberger and Handcock 2015; Schweinberger and Luna 2018; Wang et al. 2019), samples from larger networks (An 2016; Handcock and Gile 2010; Pattison et al. 2013; Stivala et al. 2016), and missing data (Koskinen et al. 2013; Robins et al. 2004). And finally, they are flexible in incorporating additional information such as nodal attributes, including dyadic attributes, such as distances between nodes. ERGMs have also been extended to handle valued networks (Desmarais and Cranmer 2012; Krivitsky 2012) and dynamic (time-varying) networks (Krivitsky and Handcock 2014), and to use graphlets (Pržulj 2007) as the ERGM configurations (Yaveroǧlu et al. 2015).
Despite these potential advantages, however, ERGM parameter estimation is a computationally intractable problem, and in practice it is generally necessary to use Markov chain Monte Carlo (MCMC) methods (Hunter et al. 2012). A variety of algorithms for ERGM model fitting (Hummel et al. 2012; Hunter and Handcock 2006; Krivitsky 2017; Snijders 2002) are implemented in widely used software packages such as statnet (Handcock et al. 2008; Hunter et al. 2008; Morris et al. 2008) and PNet/MPNet (Wang et al. 2009), and Bayesian methods are also available (Caimo and Friel 2011, 2014). These packages also implement the so-called “alternating” or “geometrically weighted” configurations (Robins et al. 2007b; Snijders et al. 2006), which alleviate problems with model “near-degeneracy”, where the model’s probability mass is concentrated in a very small region of possible networks, which can occur when only simple configurations, such as stars and triangles, are used (Hunter et al. 2012).
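To indicate what MCMC sampling from an ERGM involves, the following is a minimal sketch of a basic Metropolis tie-toggle sampler for an undirected model with edge and triangle parameters. It is a textbook-style illustration under those assumptions, not the sampler or estimation algorithm of any of the packages cited above, and the function names are hypothetical.

```python
import random
from math import exp

def metropolis_sample(n, theta_edge, theta_triangle, steps, seed=0):
    """Run a tie-toggle Metropolis chain and return the final adjacency sets."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}  # start from the empty graph
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        # Toggling tie i-j changes the edge count by 1 and the triangle
        # count by the number of neighbours that i and j share.
        shared = len(adj[i] & adj[j])
        sign = -1 if j in adj[i] else 1
        log_ratio = sign * (theta_edge + theta_triangle * shared)
        if log_ratio >= 0 or rng.random() < exp(log_ratio):
            if sign == 1:
                adj[i].add(j)
                adj[j].add(i)
            else:
                adj[i].discard(j)
                adj[j].discard(i)
    return adj

def edge_count(adj):
    """Number of undirected edges in the adjacency-set representation."""
    return sum(len(neighbours) for neighbours in adj.values()) // 2

sample = metropolis_sample(n=30, theta_edge=-2.0, theta_triangle=0.1, steps=20000)
print(edge_count(sample))
```

Only the change in the statistics is needed at each step, so the normalizing constant never has to be computed; this is what makes simulation practical even though the likelihood itself is intractable.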
Until recently, the computational difficulty of ERGM parameter estimation has limited its application to biological networks, which are often larger than the social networks (traditionally measured by observations and surveys, rather than online social networks) for which the techniques were developed. Now, however, advances such as snowball sampling and conditional estimation (Pattison et al. 2013; Stivala et al. 2016), improved ERGM distribution samplers such as the “improved fixed density” (IFD) sampler (Byshkin et al. 2016), and new estimation algorithms (Hummel et al. 2012), including the “Equilibrium Expectation” (EE) algorithm (Byshkin et al. 2018; Borisenko et al. 2019) and its implementation for large directed networks (Stivala et al. 2020), have reduced by orders of magnitude the time taken to estimate ERGM parameters.
Literature review of application of ERGMs to biological networks
ERGMs were first applied to biological networks by Saul and Filkov (2007), who estimated model parameters for Escherichia coli (Salgado et al. 2001) and yeast regulatory networks, and a collection of metabolic networks. As well as introducing the use of ERGMs to the field of bioinformatics for analyzing biological networks, Saul and Filkov (2007) used ERGM models to build topological profiles which they showed to be capable of classifying organisms into biological and functional groups. With the algorithms and implementations available at the time, the larger networks could only be estimated by maximum pseudolikelihood (Strauss and Ikeda 1990), an approximation which is now considered problematic (van Duijn et al. 2009; Hunter et al. 2012; Robins et al. 2007b) and useful mostly for obtaining initial parameter estimates for a more accurate (but also more computationally expensive) method (Hummel et al. 2012; Hunter and Handcock 2006; Krivitsky 2017). Further, all the networks in Saul and Filkov (2007) were treated as undirected, thereby losing important directional information (and not, for example, being able to distinguish between cyclic and transitive triads) in regulatory networks. The E. coli regulatory network, treated as undirected, was also used as an example application of the “stepping” algorithm for ERGM estimation by Hummel et al. (2012).
Exponential random graph models for similar E. coli regulatory networks were described by Begum et al. (2014), leaving the networks directed rather than treating them as undirected. These models were very simple, however, including only Arc and Instar terms, and therefore model degree distribution, but not triangular motifs.
Bayesian estimation of an ERGM model of a human PPI network with 401 proteins was described by Bulashevska et al. (2010). This model used only very basic structural features (not including any triangular structures, for example), but made use of nodal attributes, specifically a binary variable indicating if the protein is disordered. This ERGM was not used to analyze network motifs, but rather the relationship between disordered proteins and their “sociality”, a measure of their importance in the PPI network, finding that intrinsically disordered proteins tend to be more “social” (Bulashevska et al. 2010). In their Conclusions, Bulashevska et al. (2010) suggest that “The ERGM modelling of networks offers a natural way of assessing importance of the network motifs” (Bulashevska et al. 2010, p. 13).
Similar techniques, that is, Bayesian estimation of ERGMs with only very simple structural terms, have also been used with gene–gene relationship networks to model mechanisms of gene dysregulation (Azad et al. 2017). These models were used to infer potential aberrant gene pairs, and suggested a novel pattern of aberrant signaling (Azad et al. 2017).
A mixture ERGM was introduced by Wang et al. (2019) and applied to a yeast gene interaction network with 424 genes (Schuldiner et al. 2005; Wang et al. 2019). The model included geometrically weighted indegree and outdegree terms, but not any triangular terms; the interest is rather in the clusters it finds, which may be used to predict function (Wang et al. 2019).
An ERGM incorporating a directed form of the degree-corrected stochastic blockmodel (Karrer and Newman 2011) was introduced by Gross et al. (2021), and applied to the connectome of the C. elegans worm (279 nodes representing neurons), and an A. thaliana PPI network (4344 nodes representing proteins). These models assume dyadic independence, and hence triangular configurations could not be incorporated. The advantage of the mixture ERGM (Wang et al. 2019) or stochastic blockmodel ERGM generalizations (\(\beta\)-SBM and \(p_1\)-SBM (Gross et al. 2021)) is that they can capture heterogeneity in clusters found in the network, but we do not address cluster or community structure here.
ERGMs have been applied to neural networks with 90 nodes, representing brain regions (Simpson et al. 2011, 2012), finding that an ERGM approach outperforms conventional approaches for constructing group-based representative brain networks (Simpson et al. 2012). Bayesian ERGM techniques, with 96 nodes representing brain regions, have been used to model brain networks over the human lifespan (Sinke et al. 2016). Recently, Bayesian ERGMs, extended to multiple networks, were used to compare functional connectivity structure across groups of individuals (Lehmann et al. 2021).
ERGMs have also been used to model human brain networks inferred from electroencephalographic (EEG) signals; these networks have 56 (the number of EEG sensors) nodes (Obando and De Vico Fallani 2017). These models showed that clustering and node centrality (as reflected by overrepresentation of triangles and stars) better explained global properties of the brain networks than other graph metrics, supporting the view that segregated modules exchange information via hubs.
An enhanced version of the generalized (or valued) ERGM (Desmarais and Cranmer 2012) was used to model the human Default Mode Network (DMN) with 20 nodes, representing brain regions (Stillman et al. 2017). This model showed that the DMN appears to be organized in a “segregated highway” structure, that is, with fewer hubs and more triadic closure than expected, in contrast to the “small world” structure of the whole-brain network (Stillman et al. 2017). This work is an example of an ERGM that incorporates spatial distances, in the form of three-dimensional Euclidean distances between nodes.
A Bayesian ERGM has been used to model transient structure in intrinsically disordered proteins, providing a means for identifying transient structures that differ in favorability across variants (Grazioli et al. 2019a). A specific family of ERGMs has been used to model amyloid fibril topologies, leading to the construction of a systemic nomenclature that can classify all known amyloid fibril structures, and a simulation technique that can explore the kinetics of fibril self-assembly (Grazioli et al. 2019b).
Simple ERGMs for undirected networks (A. thaliana, yeast, human, and C. elegans PPI networks, and undirected versions of E. coli regulatory and Drosophila optic medulla networks) were estimated in Byshkin et al. (2018, S.I.), demonstrating that the EE algorithm could be used to estimate in minutes a model that takes many hours or is practically impossible with earlier methods. In addition, a more complex model of the A. thaliana PPI network was estimated, showing not just the overrepresentation of the triangle motif, but also the tendency for plant-specific proteins to interact preferentially with each other, and for kinases to interact preferentially with phosphorylated proteins (Byshkin et al. 2018). However, that work dealt only with undirected networks. An implementation of the EE algorithm for directed networks was described in Stivala et al. (2020), but no biological networks were considered in that work.
Methods
Network data
We obtained a yeast PPI network (von Mering et al. 2002) from the igraph (Csárdi and Nepusz 2006) Nexus network repository (this is no longer available; we used the network downloaded on 10 November 2016). The yeast PPI network has the proteins annotated with one of 12 functional categories (Mewes et al. 2002; Ruepp et al. 2004) (or “uncharacterized”), as described in the Supplementary Information of von Mering et al. (2002).
We obtained a human PPI network from the HIPPIE database (Alanis-Lobato et al. 2016; Schaefer et al. 2012, 2013; Suratanee et al. 2014), version 2.2, downloaded from http://cbdm.uni-mainz.de/hippie/ (accessed 12 June 2021). Edges in this network are labeled with a confidence score between zero and one. We built a binary “high confidence” network by selecting edges where the score is \(\ge 0.70\), the third quartile of the score distribution.
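The thresholding step amounts to a simple filter. The sketch below assumes scored interactions are available as (protein, protein, score) tuples; the protein identifiers and scores are made up for illustration and are not HIPPIE data, and the sketch also drops self-interactions and duplicate edges.

```python
def high_confidence_edges(scored_edges, threshold=0.70):
    """Keep undirected edges whose confidence score is at least `threshold`.

    Self-interactions are dropped and duplicate edges are collapsed.
    """
    kept = set()
    for u, v, score in scored_edges:
        if u != v and score >= threshold:
            kept.add(tuple(sorted((u, v))))
    return sorted(kept)

# Hypothetical records, for illustration only.
records = [
    ("P1", "P2", 0.90),
    ("P2", "P1", 0.72),  # same interaction listed in the other order
    ("P2", "P3", 0.69),  # below the threshold
    ("P3", "P3", 0.95),  # self-interaction
    ("P1", "P3", 0.70),  # exactly at the threshold, so it is kept
]
print(high_confidence_edges(records))
```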
To annotate nodes in the human PPI network with their subcellular location using terms in the Gene Ontology (GO) (Ashburner et al. 2000), we used the Protein ANalysis THrough Evolutionary Relationships (PANTHER) database (Mi et al. 2019, 2021). We used the PANTHER database version 16.0 downloaded from http://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR16.0_human (accessed 21 June 2021). We used the R package GOxploreR (Manjang et al. 2020, 2021) to rank the GO terms for subcellular component in the PANTHER database, and annotated each node (representing a protein) in the network with the highest ranking term for that protein. This results in a cellular component GO term for 6131 of the 11,517 nodes (53%) in the human PPI network. The cellular component GO terms are treated as a categorical attribute, of which there are 271 unique values in the data. The nodes with no cellular component GO term assigned are given an “NA” category, which, when used in the “Match” statistic in ERGM modeling, does not match any category (including the NA category itself).
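The treatment of missing categories in the “Match” statistic can be sketched as follows. This is an illustration of the counting rule described above, with `None` standing in for the “NA” category; the node names and category labels are hypothetical, and this is not the code of the estimation software.

```python
def match_statistic(edges, category):
    """Count edges whose two end nodes share the same categorical value.

    Nodes whose category is None (the "NA" category) never match,
    not even with another NA node.
    """
    count = 0
    for u, v in edges:
        cu, cv = category.get(u), category.get(v)
        if cu is not None and cu == cv:
            count += 1
    return count

# Hypothetical annotations, for illustration only.
category = {"a": "nucleus", "b": "nucleus", "c": "cytosol", "d": None, "e": None}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(match_statistic(edges, category))  # 1: only a-b matches
```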
The previously mentioned E. coli regulatory network (Salgado et al. 2001; Shen-Orr et al. 2002) was obtained via the statnet package (Handcock et al. 2008, 2016). Following Hummel et al. (2012), we removed the loops (self-edges) representing self-regulation, and instead represented self-regulation in a simple way, by a binary node attribute designated “self”, which is true when a self-loop was present and false otherwise. In some models, we use the original version of this network with self-edges retained, and when this is done it is noted in the results. We also obtained a Saccharomyces cerevisiae (yeast) regulatory network (Costanzo et al. 2001; Milo et al. 2002) (http://www.weizmann.ac.il/mcb/UriAlon/download/collection-complex-networks; accessed 29 April 2019) and processed it in the same way.
For all networks, we removed multiple edges and, unless noted otherwise, self-loops, where these are present.
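For a directed regulatory network, this preprocessing can be sketched as below: multiple arcs are collapsed, self-loops are removed, and the presence of a self-loop is recorded in a binary “self” node attribute. The function name and the gene labels are hypothetical.

```python
def simplify_directed(arcs):
    """Return (simple arcs, self attribute) for a list of (source, target) arcs.

    Multiple arcs are collapsed and self-loops are removed; the returned
    dictionary maps each node to True if it had a self-loop, else False.
    """
    self_attribute = {}
    simple = set()
    for source, target in arcs:
        self_attribute.setdefault(source, False)
        self_attribute.setdefault(target, False)
        if source == target:
            self_attribute[source] = True
        else:
            simple.add((source, target))
    return sorted(simple), self_attribute

# Hypothetical arcs, for illustration only.
arcs = [("g1", "g1"), ("g1", "g2"), ("g1", "g2"), ("g2", "g3"), ("g3", "g3")]
simple_arcs, self_attribute = simplify_directed(arcs)
print(simple_arcs)
print(self_attribute)
```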
Summary statistics of the networks are in Table 1 and the degree distributions of the networks are shown in Fig. 3. In this figure, \(\alpha\) is the exponent in the discrete power law distribution \(\Pr (X=x) = Cx^{-\alpha }\) (where C is a normalization constant), and \(\mu\) and \(\sigma\) are the parameters (respectively, mean and standard deviation of \(\log (x)\)) of the discrete lognormal distribution. Power law and lognormal distributions were fitted using the methods of Clauset et al. (2009) implemented in the poweRlaw package (Gillespie 2015).
ERGM configurations
The ERGM parameters used in the models for undirected networks are shown in Table 2, and those for directed networks in Table 3. Detailed descriptions of these parameters and their corresponding statistics can be found in Lusher et al. (2013), Robins et al. (2007a, 2007b, 2009), Snijders et al. (2006), and Stivala et al. (2020), but two of the important ones used in this work are shown in Fig. 4.
The “alternating” statistics (Lusher et al. 2013; Robins et al. 2007b; Snijders et al. 2006) such as alternating k-stars involve sums of counts of configurations with alternating signs and a decay factor \(\lambda\), and, except where otherwise specified, we set \(\lambda = 2\) in accordance with common ERGM modeling practice.
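As an illustration of how such a statistic is built, the sketch below computes the alternating k-star statistic for an undirected network in two ways: directly as the alternating, geometrically down-weighted sum of k-star counts \(S_k\), \(\sum _{k \ge 2} (-1)^k S_k / \lambda ^{k-2}\), and through the equivalent closed form in the node degrees that follows from the binomial theorem. This follows the standard definition in Snijders et al. (2006) as we understand it; the function names are hypothetical.

```python
from math import comb

def alternating_k_stars_direct(degrees, lam=2.0):
    """Sum over k >= 2 of (-1)**k * S_k / lam**(k - 2).

    S_k, the number of k-stars, is the sum over nodes of C(degree, k).
    """
    total = 0.0
    for k in range(2, len(degrees)):
        s_k = sum(comb(d, k) for d in degrees)
        total += (-1) ** k * s_k / lam ** (k - 2)
    return total

def alternating_k_stars_closed(degrees, lam=2.0):
    """Equivalent closed form: lam**2 * sum((1 - 1/lam)**d + d/lam - 1)."""
    return lam ** 2 * sum((1 - 1 / lam) ** d + d / lam - 1 for d in degrees)

# Degrees of a star with one hub and four leaves.
degrees = [4, 1, 1, 1, 1]
print(alternating_k_stars_direct(degrees))  # 6 - 4/2 + 1/4 = 4.25
print(alternating_k_stars_closed(degrees))  # 4.25
```

The alternating signs and the decay factor mean that each additional tie on a high-degree node adds less to the statistic than the previous one, which is how these statistics help to avoid near-degeneracy.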
ERGM parameter estimation
ERGM parameters for undirected networks were estimated using the EE algorithm (Byshkin et al. 2018) with the IFD sampler (Byshkin et al. 2016) implemented for undirected networks in the Estimnet software as described in Byshkin et al. (2018), with 20 estimations (run in parallel). ERGM parameters for directed networks were estimated using the simplified EE algorithm (Borisenko et al. 2019; Byshkin et al. 2018) with IFD sampler implemented for directed networks in the EstimNetDirected software (Stivala et al. 2020), with 64 estimations (run in parallel).
The Alon E. coli network does not contain any reciprocated arcs (directed loops of length two), and so estimation is made conditional on this by preventing the creation of reciprocated arcs in the MCMC procedure.
Convergence and goodness-of-fit tests
Convergence was tested as described in Byshkin et al. (2018) and Stivala et al. (2020), by requiring the absolute value of each parameter’s t-ratio to be no greater than 0.3, and by visual inspection of the parameter and statistic trace plots. For the directed networks estimated with EstimNetDirected, an additional heuristic convergence test was used, as described in Stivala et al. (2020). Observed graph statistics were plotted on the same plots as the distributions of those statistics in the networks simulated in the EE algorithm MCMC process, to check that they do not diverge. The statistics used are the same as those of the actual goodness-of-fit test described below, but note that this test is only for estimation convergence, not goodness-of-fit (Stivala et al. 2020).
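The t-ratio criterion can be sketched as follows. This assumes the conventional form of the t-ratio, the difference between the mean of a simulated statistic and its observed value divided by the standard deviation of the simulated values; the exact quantities used by the EE algorithm are defined in the cited papers and may differ in detail. The function names and numbers are hypothetical.

```python
from statistics import mean, stdev

def t_ratio(observed, simulated):
    """(mean of simulated values - observed value) / their standard deviation."""
    return (mean(simulated) - observed) / stdev(simulated)

def converged(observed_stats, simulated_stats, threshold=0.3):
    """True if every statistic has |t-ratio| no greater than the threshold."""
    return all(
        abs(t_ratio(observed_stats[name], simulated_stats[name])) <= threshold
        for name in observed_stats
    )

# Hypothetical values, for illustration only.
observed_stats = {"edges": 100, "triangles": 12}
simulated_stats = {"edges": [98, 100, 102], "triangles": [10, 12, 15]}
print(converged(observed_stats, simulated_stats))
```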
For the directed networks estimated with EstimNetDirected, a simulation-based goodness-of-fit procedure was used, similar to that used in statnet (Hunter et al. 2008). A set of networks was simulated from the estimated model (using the SimulateERGM program in the EstimNetDirected software), and the distribution of certain graph statistics compared with those of the observed network by plotting the observed network values on the same plots as the distribution of simulated values. The statistics used were the in- and outdegree distributions, reciprocity, giant component size, mean local and global clustering coefficients, triad census, geodesic distance (shortest path length) distribution, and edgewise and dyadwise shared partners distributions.
Results and discussion
Table 4 shows the basic structural model for the yeast PPI network (Model 1), a model with the alternating k-two-paths (A2P) parameter added (Model 2), as well as a model (Model 3) incorporating a parameter for the propensity of interactions to occur between proteins in the same functional category (class). Model 1 reproduces a model of this network in a previous work (Byshkin et al. 2018, Table S3); Models 2 and 3 are new.
Each of these model estimations took approximately 7 minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30 GHz processors using 20 parallel tasks.
We expect that proteins of the same functional category should preferentially interact with each other (von Mering et al. 2002), and this is confirmed by the significant positive parameter estimated for the “Match class” effect. The alternating k-triangle (AT) parameter is positive and significant in all models, showing an overrepresentation of triangles (which we might expect given the very high value of the clustering coefficient for this network, Table 1), even in models also including parameters for two-paths and preferential interaction of proteins in the same class.
Table 5 shows a basic structural model for the human PPI high confidence network (Model 1), and a model with a term to control for subcellular location by categorical matching on the cellular component GO term (Model 2).
Estimation of Model 1 took approximately 64 minutes elapsed time, and Model 2 approximately 73 minutes, on cluster nodes with Intel Xeon E52650 v3 2.30GHz processors using 20 parallel tasks.
As discussed in the Introduction, we expect that interactions would be overrepresented between proteins that share a subcellular location, and this is confirmed by a statistically significant positive parameter estimate for categorical matching on cellular component (Model 2 in Table 5). The alternating ktriangle (AT) parameter is positive and statistically significant in both models. This indicates an overrepresentation of triangles, even when controlling for subcellular location (Model 2).
We estimated four different models of the Alon E. coli regulatory network (Table 6). In Models 1 and 2, following Hummel et al. (2012), we modeled self-regulation by using a nodal covariate “self” which is true exactly when the node had a self-edge (loop) in the original network. These ERGM models are new, in that previous work with ERGMs on these networks either treated them as undirected (Hummel et al. 2012; Saul and Filkov 2007), thereby ignoring the inherently directed nature of such a regulatory network; or, in the case where the network was left as directed, included only Arc and alternating k-in-stars terms, as the estimation methods used at the time could not find converged models when other terms, such as triangles, were included (Begum et al. 2014).
Each of these model estimations took approximately three minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30 GHz processors using 64 parallel tasks.
In these models, the Sink and Source parameters are used to control, respectively, for the presence of genes that do not regulate any genes (have out-degree zero) and genes that are not regulated by any gene (have in-degree zero). The alternating k-in-stars (AltInStars) parameter is positive and significant in all models except Model 3, indicating significant skewness of the in-degree distribution, that is, the presence of “hubs” with higher in-degree than other nodes. There is no significant effect for (or against) such skewness of the out-degree distribution (see Figs. 3 and 5).
The only other parameter that is consistently significant (and positive) is path closure (AltKTrianglesT), which we can interpret as a significant tendency for the “feedforward loop” to be overrepresented, consistent with the results in Milo et al. (2002).
A goodness-of-fit plot for Model 1 (Table 6) is shown in Additional file 1: Fig. S1a, showing a good fit for the model. A goodness-of-fit plot for the triad census (Fig. 6a) shows that the model reproduces the triad census well, and specifically triad 030T, the transitive triad (three-node feedforward loop), giving additional confidence that the positive and statistically significant AltKTrianglesT parameter is evidence for overrepresentation of this motif, given the other parameters in the model.
Note that this E. coli regulatory network does not contain any instances of the three-cycle, or “three-node feedback loop” (Milo et al. 2002). Indeed the Alon E. coli network does not contain any loops greater than size one (Shen-Orr et al. 2002), and so the cyclic closure parameter (AltKTrianglesC) is not included in the models.
In Models 3 and 4 (Table 6), unlike the other models, self-edges (loops) are retained in the network, and self-edges are allowed in the modeling process, allowing the formation of loops to be modeled jointly with the other structural features in the model.^{Footnote 1} In Model 4, the new parameter “Loop” is introduced, for which the corresponding statistic is the count of self-edges in the network. This parameter is statistically significant and positive, indicating that self-edges are overrepresented, given the other effects included in the model. Goodness-of-fit plots for Models 3 and 4 (Table 6) are shown in Additional file 1: Fig. S4, showing that when the Loop parameter is not included in the model (Model 3 in Table 6), there is a poor fit for the number of loops (Additional file 1: Fig. S4a). However, when the Loop parameter is included (Model 4 in Table 6), there is a good fit for the number of loops (Additional file 1: Fig. S4b).
We found that it is also possible to estimate similar models of this relatively small network using the most recent version of the statnet ergm package (Handcock et al. 2021; Krivitsky et al. 2021), with the “stepping” algorithm (Hummel et al. 2012). These models are shown in Additional file 1: Table S1, and the goodness-of-fit plots in Additional file 1: Figs. S6, S7. The results are consistent with those in Table 6. Specifically, there is a significant positive estimate for geometrically weighted edgewise shared partners (GWESP, equivalent to AltKTrianglesT), and a significant negative estimate for geometrically weighted in-degree, indicating centralization in the in-degree distribution.^{Footnote 2} The statnet model finds a significant tendency against centralization on out-degree, while the models in Table 6 did not have a significant estimate for the corresponding parameter (AltOutStars). Similarly the statnet model (Model 2 in Additional file 1: Table S1) finds a significant negative parameter estimate for Matching on the “self-regulating” attribute, while no significant effect is found in Model 2 in Table 6. The statnet ergm package does not allow for the modeling of self-edges, however (Hummel et al. 2012).
Table 7 shows ERGM parameter estimates for the Alon yeast regulatory network. Each of these model estimations took approximately three minutes total elapsed time on cluster nodes with Intel Xeon E5-2650 v3 2.30 GHz processors using 64 parallel tasks. These ERGM models are also new; previously published ERGMs for similar networks have treated them as undirected (Saul and Filkov 2007).
In Model 1 (Table 7), estimation is conditional on no reciprocated arcs, just as was done for the E. coli regulatory network. However in this yeast regulatory network, there is actually a single reciprocated arc (two-cycle) in the data, and hence the fit of the model on statistics involving reciprocated arcs is poor. This is apparent, for example, in the poor fit for triad census class 102 (triad with only a mutual arc) in Fig. 6b, or for the reciprocity statistic in the goodness-of-fit plot (Additional file 1: Fig. S1b). The fit for other statistics, and in particular the degree and shared partner distributions, is acceptable (with the exception of poor fit on the giant component size). Importantly, the fit on the triad census class 030T (transitive triad) is good (Fig. 6b).
In order to better model reciprocity, a model (Model 2 in Table 7) was estimated without being conditional on there being no reciprocated arcs, but without a reciprocity term in the model. This model also has adequate goodness-of-fit, but this time including good fit on the reciprocity statistic (Additional file 1: Fig. S2a). It does, however, for some triads involving reciprocated arcs (120U for example), generate significantly more such triads than are observed in the data (Additional file 1: Fig. S2b). Therefore, a third model (Model 3 in Table 7) was estimated, including the Reciprocity parameter. However, probably due to the fact that the data contains only a single reciprocated arc, this model has a very large estimated standard error for the Reciprocity parameter. Further, it exhibits poor convergence with respect to the Reciprocity statistic, with a t-ratio greater than the maximum value of 0.3 we consider acceptable, since the data contains exactly one reciprocated arc, yet the model most frequently generates networks with none.
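The convergence check mentioned above can be sketched as follows. This assumes the conventional definition of the t-ratio as the difference between the mean simulated statistic and the observed statistic, divided by the standard deviation of the simulated statistic; the sign convention, the function names, and the example counts in the usage note are illustrative assumptions, while the 0.3 threshold is the one stated in the text.

```python
import math
from statistics import mean, stdev


def t_ratio(observed, simulated):
    """(mean of simulated values - observed value) / sd of simulated values."""
    diff = mean(simulated) - observed
    sd = stdev(simulated)
    if sd == 0:
        # Degenerate case: every simulated network has the same value.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / sd


def converged(observed, simulated, threshold=0.3):
    """True when the absolute t-ratio does not exceed the threshold."""
    return abs(t_ratio(observed, simulated)) <= threshold
```

For instance, with one reciprocated arc observed and hypothetical simulated counts of `[0, 0, 0, 0, 0, 0, 0, 0, 1, 1]`, the simulated mean is well below the observed value relative to its spread, so `converged` returns `False`, which mirrors the situation described for the Reciprocity statistic.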
Model 1 and Model 2, therefore, are preferable. Nevertheless, in all three models, the sign and significance of estimated parameters (except Reciprocity) are the same. There is a positive and significant parameter for alternating k-out-stars (AltOutStars), indicating the presence of “hubs” with higher out-degree than other nodes. This is as we might expect from Fig. 3 and previous research (Balaji et al. 2006; Guelzim et al. 2002; Monteiro et al. 2020; Ouma et al. 2018), and contrasts with the E. coli regulatory network, which has in-degree hubs but not out-degree hubs.
Also in all three models, there is a positive and significant parameter estimate for transitive closure (AltKTrianglesT). Given this estimate, and the good fit for the transitive closure motif 030T (Fig. 6b) we can again interpret this as a significant overrepresentation of this motif (“feedforward loop”), consistent with the results of Milo et al. (2002).
In all three models in Table 7, the decay parameter \(\lambda\) for the “alternating” statistics has been set to a value other than the default \(\lambda =2\) for alternating k-out-stars (AltOutStars), multiple two-paths (AltTwoPathsT), and transitive closure (AltKTrianglesT). This is because models initially estimated with the default \(\lambda =2\) value (Additional file 1: Table S2) showed poor goodness-of-fit on the out-degree distribution (Additional file 1: Fig. S3a) and triad census class 030T (Additional file 1: Fig. S3b). Therefore, new models were estimated with a higher value of \(\lambda\) for the alternating k-out-star parameter to assist with modeling the highly skewed out-degree distribution (Koskinen and Daraganova 2013), and also a higher value of \(\lambda\) for AltTwoPathsT and AltKTrianglesT (the same value of \(\lambda\) for both) to aid model convergence and fit for transitivity (Snijders et al. 2006).
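To make the role of \(\lambda\) concrete, the following sketch computes the alternating k-star statistic in the form given by Snijders et al. (2006), that is, the sum over \(k \ge 2\) of \((-1)^k S_k / \lambda^{k-2}\), where \(S_k\) is the number of k-stars, together with its equivalent closed form in terms of node degrees. Applying it to an out-degree sequence gives an alternating k-out-star statistic. Whether EstimNetDirected uses exactly this normalization is an assumption, and the function names are hypothetical.

```python
from math import comb


def alt_k_stars_direct(degrees, lam=2.0):
    """Alternating sum of k-star counts, with k-stars weighted by 1 / lam**(k - 2)."""
    total = 0.0
    for d in degrees:
        # A node of degree d is the centre of comb(d, k) k-stars.
        for k in range(2, d + 1):
            total += (-1) ** k * comb(d, k) / lam ** (k - 2)
    return total


def alt_k_stars_closed(degrees, lam=2.0):
    """Closed form: lam^2 * sum_i [ (1 - 1/lam)^d_i + d_i/lam - 1 ]."""
    return lam ** 2 * sum((1.0 - 1.0 / lam) ** d + d / lam - 1.0 for d in degrees)
```

The two functions agree for any degree sequence, and the closed form shows how each node's contribution depends on its degree and on \(\lambda\), which is what is being tuned when a value other than the default is chosen.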
As with the E. coli network, we also estimated a model of the yeast regulatory network in which self-edges are retained, allowing self-edges (loops) in the model. This network (even leaving aside the presence of self-edges) is, however, not identical to the network used for the models shown in Table 7, having two additional nodes. Its graph summary statistics are, however, the same (to the precision shown) as those of the version shown in Table 1, other than it having 690 rather than 688 nodes. Since the network modeled is slightly different from that used for the models shown in Table 7, these models are presented separately, in Additional file 1: Table S3. The results are consistent with those in Table 7, with statistically significant positive parameter estimates for AltOutStars and AltKTrianglesT. The estimate for the Loop parameter is not statistically significant, however. Goodness-of-fit plots for the models in Additional file 1: Table S3 are shown in Additional file 1: Fig. S5. These figures show that the model which allows self-edges, but does not include the Loop parameter (Model 1 in Additional file 1: Table S3) does not fit the number of loops well, while the model that includes the Loop parameter (Model 2 in Additional file 1: Table S3) does fit the number of loops well.
The cyclic triangle structure has been suggested as an “anti-motif” (i.e. it occurs less frequently than expected), but in some cases its apparent underrepresentation has been shown to be an expected consequence of other topological properties of biological networks (Konagurthu and Lesk 2008a). This closed-loop structure, also known as a “multi-component loop”, can provide feedback control and potentially produce systems that can switch between two states (Ferrell 2002; Lee et al. 2002). In the examples used here, there were so few (or no) occurrences of this motif, that models including the corresponding parameter (in the form of the AltKTrianglesC parameter) would not converge. Yet the networks simulated from these models also contain no (or very few) occurrences of this candidate anti-motif. This is consistent with the lack of cyclic triangles not being due to cyclic triangles being an anti-motif as such, but rather being a consequence of the other topological features of the network, and specifically in these examples, the features described by the parameters included in the models. This is not a new finding: it has previously been noted that the lack of three-node feedback loops in the E. coli regulatory network (Lee et al. 2002; Shen-Orr et al. 2002) is reproduced in randomized networks (Shen-Orr et al. 2002).
The biological significance of the feedforward loop (transitive triangle) is suggested to be that, by providing two pathways to affect the output, one direct, and one through an intermediate link, it can act as a logical “AND” gate, and filter out transient activation signals (Alon 2007; Lesk and Konagurthu 2021; Mangan and Alon 2003; Shen-Orr et al. 2002). Whether or not this is indeed the biological function of the feedforward loop (Mazurie et al. 2005), this motif is found to be significantly overrepresented in the transcriptional regulatory networks of several organisms (Alon 2007), including the yeast and E. coli networks studied here, and the feedforward loop has been described as “highly favored during the evolution of transcriptional regulatory networks in yeast” (Lee et al. 2002, p. 801).
More recently, there has been interest in trying to understand the function of motifs by examining higher levels of structure. Gorochowski et al. (2018) examine the clustering of motifs, including the feedforward loop, and find that a measure of motif clustering diversity can predict functionally important nodes in the E. coli metabolic network. Lesk and Konagurthu (2021) describe how the local structure of the yeast regulatory network is reconfigured in different physiological states.
So far we have only discussed results for three-node motifs, such as the feedforward loop. We can test for the overrepresentation of other motifs, without including parameters for them in the model, by using the ERGM as the null model against which to compare the count of the motif in the observed network. This was the technique used by Felmlee et al. (2021), for example.
Figure 7 shows the bi-fan and bi-parallel motifs, as defined by Milo et al. (2002), their counts in the E. coli and yeast regulatory networks, and their distribution in ERGM models of these networks. The motifs were counted with the NetMODE software (Li et al. 2012). Note that NetMODE was used only to count the motifs, not to simulate any networks, which are simulated from the ERGM models as described in the Methods section.
The bi-parallel motif occurs in neither of the observed networks, nor does it occur in any of the networks simulated from the corresponding ERGMs. The bi-fan motif, however, clearly occurs far more frequently in both observed networks than it does in the corresponding simulated networks. Note that these networks are simulated from ERGMs that model not just degree distribution, but also the distribution of two-paths and transitive triangles. Therefore, this shows that the bi-fan motif appears to be overrepresented in the observed networks, even given the overrepresentation of transitivity captured in the models, which also reasonably reproduce the triad census, geodesic distance distribution, and dyadwise and edgewise shared partner distributions. These results are consistent with the results of Milo et al. (2002), where only degree-preserving randomization was used.
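The comparison of an observed motif count with its distribution in ERGM-simulated networks can be summarized numerically as in the sketch below. This is only one possible summary, and it is an assumption that such a summary is appropriate here: the function name, the one-sided empirical p-value with the add-one correction, and the z-score are illustrative choices rather than the procedure used for Figure 7, which presents the comparison graphically.

```python
from statistics import mean, stdev


def motif_overrepresentation(observed_count, simulated_counts):
    """Compare an observed motif count with counts from null-model networks.

    Returns (p_upper, z): p_upper is the one-sided empirical p-value with the
    add-one correction; z is the z-score, or None if the simulated counts
    have zero variance.
    """
    n = len(simulated_counts)
    at_least = sum(1 for count in simulated_counts if count >= observed_count)
    p_upper = (at_least + 1) / (n + 1)
    sd = stdev(simulated_counts)
    z = (observed_count - mean(simulated_counts)) / sd if sd > 0 else None
    return p_upper, z
```

When the motif is absent from both the observed and the simulated networks, as for the bi-parallel motif, the function returns a p-value of 1 and no z-score, reflecting that there is nothing to distinguish the observation from the null model.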
Limitations
Finding a converged ERGM for a network is not always possible in practice. In particular, models which include Markov dependency assumption parameters such as triangles, corresponding directly to three-node motif candidates such as three-node feedforward loops (transitive triangles) and three-cycles, for example, usually do not converge. For this reason it is normal practice in ERGM modeling to use geometrically weighted or “alternating” configurations to solve this problem (Hunter et al. 2012; Robins et al. 2007b; Snijders et al. 2006), as we did in this work. However this means we are not answering precisely the same question as when we ask directly if a motif is overrepresented or not. This is because ERGM is a model for tie (edge or arc) formation, not for motif formation: if we consider ERGM as a type of logistic regression, the outcome variable is the presence or absence of a network tie. The predictor variables are not independent of each other, but form a nested hierarchy of configurations: triangles are formed by “closing” a two-path with an additional edge, for example. So a positive estimate of the alternating k-triangle parameter does not directly mean that the transitive triangle (three-node feedforward loop) motif is overrepresented, but rather that there is a tendency (that is, it is more probable than chance given the other parameters in the model) for three nodes forming a directed two-path to be closed in a transitive triangle. This makes sense in the social network origins of the model: it might be assumed to be the result in the observed network of the tendency of a person’s friends to also be friends with each other, for example. In the context of biological networks, it might be interpreted as a sign of evolutionary events; however, this interpretation is very much open to question, as briefly discussed in the Introduction.
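The logistic regression view of an ERGM can be illustrated with the standard conditional form of the model, in which the log-odds of a tie being present, given the rest of the network, equal the sum of each parameter multiplied by its change statistic (the change in the corresponding statistic when the tie is toggled from absent to present). In the sketch below, the function name is hypothetical and the parameter values in the usage note are invented purely for illustration; they are not estimates from any table in this work.

```python
import math


def conditional_tie_probability(theta, change_stats):
    """P(tie present | rest of network) = logistic(sum_k theta_k * delta_k).

    theta: dict mapping effect name to parameter value.
    change_stats: dict mapping effect name to the change in that statistic
    when the tie is added, with all other ties held fixed.
    """
    log_odds = sum(theta[name] * change_stats[name] for name in theta)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, with hypothetical values `theta = {"Arc": -4.0, "AltKTrianglesT": 1.0}`, an arc that would increase the transitive closure statistic (`{"Arc": 1, "AltKTrianglesT": 1.0}`) has a higher conditional probability than one that would not (`{"Arc": 1, "AltKTrianglesT": 0.0}`). This is the sense in which a positive closure parameter describes a tendency for two-paths to be closed, rather than a direct count of motifs.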
Even when the “alternating” configurations are used, it can be difficult or impossible to find a converged and well-fitting ERGM for a given network. For example, we were unable to fit an ERGM with triangular configurations (using either statnet or EstimNetDirected) to an example of a neural network, the whole-animal chemical connectome (a directed network with 579 nodes and 5246 arcs) of the male C. elegans worm (Cook et al. 2019).
Hence in order to directly test motif significance, without having to fit a parameterized model such as ERGM, new methods, such as the “anchored motif” proposed by Fodor et al. (2020) are still required.
In some of the models presented here, we used values other than the usual default value \(\lambda =2\) for the decay parameter \(\lambda\) of the “alternating” statistics. We had to manually estimate appropriate values of \(\lambda\) based on trial and error, guided by knowledge of the observed network, convergence and goodness-of-fit of the models (or lack thereof), and the definitions of the relevant statistics (Koskinen and Daraganova 2013; Snijders et al. 2006). It is possible to instead estimate \(\lambda\) (or an equivalent parameter) directly from the data, as part of the model, using a “curved ERGM” (Hunter 2007; Hunter and Handcock 2006), and this is implemented in the statnet R package (Handcock et al. 2008, 2016, 2021; Hunter et al. 2008; Krivitsky et al. 2021; Morris et al. 2008). However it is not currently possible to estimate curved ERGMs using the EstimNetDirected software (Stivala et al. 2020), and this is an area requiring further work. In the absence of such a principled way of estimating the decay parameters, an alternative to the heuristic (trial and error) approach used here is to estimate many models with systematically varying values of the \(\lambda\) decay parameter for each relevant “alternating” model parameter, and use a grid search to find the model with best fit.^{Footnote 3} We applied this method to the Alon yeast regulatory network model (Additional file 1: Table S2), using the Mahalanobis distance between a vector of some of the observed network summary statistics used for goodness-of-fit (degree distributions, reciprocity, giant component size, global and average local clustering coefficient), and the corresponding vectors for networks simulated from the model, as the value to minimize. We used a two-dimensional grid, varying the \(\lambda\) value for AltOutStars as one dimension, and the value of \(\lambda\) for both AltTwoPathsT and AltKTrianglesT (these values should be the same, as described in Snijders et al. (2006)) as the other dimension. With both values varying from 1.5 to 5.0 in steps of 0.5 (eight values each, hence 64 models), we found the minimum Mahalanobis distance was at \(\lambda = 4.5\) for the AltOutStars parameter, and \(\lambda = 1.5\) for the AltTwoPathsT and AltKTrianglesT parameters. The parameters estimated for this model are not substantively different from those in Table 7. The values of \(\lambda\) that we determined heuristically (Table 7) were at rank 15 (of 64) using this criterion. The model with the default \(\lambda = 2.0\) for all alternating statistic parameters, with subjectively poor goodness-of-fit on the out-degree distribution, is at rank 48 (of 64).
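The grid search described above can be sketched as follows. The sketch assumes that, for each pair of \(\lambda\) values, a model has already been estimated and networks simulated from it, so that a vector of summary statistics is available for each simulated network; the callable `simulate` that returns these vectors is hypothetical, as are the function names. The Mahalanobis distance is computed with a small pure-Python linear solver so that the example is self-contained.

```python
import itertools
import math


def mahalanobis(observed, samples):
    """Mahalanobis distance of `observed` from the distribution of `samples`."""
    n, p = len(samples), len(observed)
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    cov = [[sum((s[a] - means[a]) * (s[b] - means[b]) for s in samples) / (n - 1)
            for b in range(p)] for a in range(p)]
    diff = [observed[j] - means[j] for j in range(p)]
    # Solve cov * x = diff by Gaussian elimination with partial pivoting.
    aug = [row[:] + [diff[i]] for i, row in enumerate(cov)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (aug[i][p] - sum(aug[i][c] * x[c] for c in range(i + 1, p))) / aug[i][i]
    return math.sqrt(max(0.0, sum(d * xi for d, xi in zip(diff, x))))


def grid_search(observed, simulate, lambdas):
    """Rank all (lambda_stars, lambda_triangles) pairs by Mahalanobis distance.

    `simulate(lam_stars, lam_triangles)` is a hypothetical callable returning
    summary-statistic vectors for networks simulated from the fitted model.
    """
    scores = {}
    for lam_stars, lam_triangles in itertools.product(lambdas, repeat=2):
        samples = simulate(lam_stars, lam_triangles)
        scores[(lam_stars, lam_triangles)] = mahalanobis(observed, samples)
    return sorted(scores.items(), key=lambda item: item[1])
```

With `lambdas = [1.5 + 0.5 * i for i in range(8)]` the grid has 64 cells, matching the grid size described above; the first element of the returned list is the best-fitting pair under this criterion, and the position of any other pair in the list gives its rank.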
As previously mentioned, the configurations available in an ERGM are determined by the dependence assumptions: although there is a lot of flexibility available in ERGM configurations, we cannot simply add arbitrary configurations without regard for the underlying dependency assumption (Koskinen 2020). The least restrictive assumption used in practice is the “social circuit” dependency assumption (Lusher et al. 2013; Robins et al. 2007b, 2009; Snijders et al. 2006) used in this work, which allows the use of the “alternating” configurations.
We also note that some recent work suggests that complex network structure, including heavy-tailed degree distributions, closure (clustering), large connected components, and short path lengths can arise simply from thresholding normally distributed data to generate the binary network (Cantwell et al. 2020). Hence inferences from ERGM modeling about network structure, just as with other techniques such as comparison to ensembles of random graphs, could be consequences of the way the binary network was constructed.
Valued ERGMs (Desmarais and Cranmer 2012; Krivitsky 2012) may be used to avoid this problem by removing the need to construct a binary network at all, and working directly with the network with valued edges. Parameter estimation for these models is even more computationally intensive than for binary networks, and hence is so far impractical to use for networks of the size considered here. Using new estimation techniques to improve the scalability of parameter estimation for valued ERGMs is another area requiring further research.
For the relatively small (on the order of one thousand nodes or fewer) directed networks considered here, it is possible to do simulation-based goodness-of-fit tests. However, although it is possible to estimate ERGM parameters for far larger (over one million nodes) networks using the EstimNetDirected software, it is not practical to simulate such large networks from the model, and this is an area requiring further work (Stivala et al. 2020).
One further limitation to consider is the execution time of the ERGM technique. As discussed in the introductory sections, ERGM parameter estimation is a computationally difficult problem. Although recent advances allow the estimation in minutes of models that would have taken hours, or been infeasible to estimate, with earlier methods, it is still much more computationally difficult to do this than it is to run conventional motif finding methods. The networks used here took between three and 73 minutes to estimate, using multiple (up to 64) processor cores in parallel. However motif finding with MFinder (Kashtan et al. 2004) in these networks takes only seconds, and with the faster NetMODE method (Li et al. 2012), even less time, using only a single processor core.
Conclusion
We have reexamined the use of exponential random graph models for analyzing biological networks, an application first introduced in the bioinformatics literature by Saul and Filkov (2007). Advances in ERGM estimation methods since then have allowed more sophisticated models to be estimated for more and larger networks than was possible at the time, and they are now a more practical technique for making inferences about structural hypotheses in biological networks, potentially solving some of the problems inherent in conventional methods for testing motif overrepresentation. By using an ERGM, all configurations in the model are tested simultaneously, each conditional on all the others, rather than having to test one at a time with the other configurations fixed in a (more or less sophisticated, the choice of which is critical to the results) null model.
The ERGM models of the Alon E. coli network presented here are the first to retain the directed nature of the network and also include terms for triangular structures. They confirm the result of Milo et al. (2002) that path closure (feedforward loop) is overrepresented, even when we include other, related, parameters in the model.
We also presented the first ERGM models of a yeast regulatory network retaining its inherently directed nature (rather than treating it as undirected). We find statistically significant overrepresentation of the transitive closure motif, just as Milo et al. (2002) did in the same yeast regulatory network, using a simple randomization test.
The lack of the cyclic triangle (feedback loop) structure in the data, however, is reproduced by models that do not contain any parameter corresponding to this structure. This suggests that this structure is not an “anti-motif”, but rather that its lack is a consequence of the structural features of the networks, specifically degree distributions, two-paths, and transitive closure, that are included in the models.
Availability of data and materials
Source code, configuration files, and datasets are available from https://sites.google.com/site/alexdstivala/home/ergm_bionetworks.
Notes
Modeling self-edges in this way was suggested by an anonymous reviewer, on the grounds that a node with a self-edge is a (very simple) motif.
This strategy was suggested by an anonymous reviewer.
Abbreviations
CDF: Cumulative distribution function
CUG: Conditional uniform graph
DMN: Default mode network
EE: Equilibrium expectation
EEG: Electroencephalography
ERGM: Exponential random graph model
GO: Gene ontology
HIPPIE: Human integrated protein–protein interaction reference
IFD: Improved fixed density
MAN: Mutual, asymmetric, null
MCMC: Markov chain Monte Carlo
PANTHER: Protein analysis through evolutionary relationships
PPI: Protein–protein interaction
SBM: Stochastic block model
References
Ahnert SE, Fink T (2016) Form and function in gene regulatory networks: the structure of network motifs determines fundamental properties of their dynamical state space. J R Soc Interface 13(120):20160179
Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH (2016) HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res 45(D1):D408–D414
Alon U (2007) Network motifs: theory and experimental approaches. Nat Rev Genet 8:450–461
Amati V, Lomi A, Mira A (2018) Social network modeling. Annu Rev Stat Appl 5:343–369
An W (2016) Fitting ERGMs on big networks. Soc Sci Res 59:107–119. https://doi.org/10.1016/j.ssresearch.2016.04.019
Anderson BS, Butts C, Carley K (1999) The interaction of size and density with graph-level indices. Soc Netw 21(3):239–267
Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L (2004) Comment on “network motifs: simple building blocks of complex networks” and “superfamilies of evolved and designed networks.” Science 305(5687):1107c
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
Azad A, Lawen A, Keith JM (2017) Bayesian model of signal rewiring reveals mechanisms of gene dysregulation in acquired drug resistance in breast cancer. PLoS ONE 12(3):e0173331
Babkin S, Stewart J, Long X, Schweinberger M (2020) Large-scale estimation of random graph models with local dependence. Comput Stat Data Anal 152:107029
Balaji S, Babu MM, Iyer LM, Luscombe NM, Aravind L (2006) Comprehensive analysis of combinatorial regulation using the transcriptional regulatory network of yeast. J Mol Biol 360(1):213–227
Batagelj V, Mrvar A (2001) A subquadratic triad census algorithm for large sparse networks with small maximum degree. Soc Netw 23(3):237–243
Beber ME, Fretter C, Jain S, Sonnenschein N, Müller-Hannemann M, Hütt MT (2012) Artefacts in statistical analyses of network motifs: general framework and application to metabolic networks. J R Soc Interface 9(77):3426–3435
Begum M, Bagga J, Saha S (2014) Network motif identification and structure detection with exponential random graph models. Netw Biol 4(4):155–169
Borisenko A, Byshkin M, Lomi A (2019) A simple algorithm for scalable Monte Carlo inference. arXiv preprint arXiv:1901.00533v3
Bulashevska S, Bulashevska A, Eils R (2010) Bayesian statistical modelling of human protein interaction network incorporating protein disorder information. BMC Bioinform 11(1):46
Butts CT (2008) Social network analysis: a methodological introduction. Asian J Soc Psychol 11(1):13–41
Byshkin M, Stivala A, Mira A, Krause R, Robins G, Lomi A (2016) Auxiliary parameter MCMC for exponential random graph models. J Stat Phys 165(4):740–754
Byshkin M, Stivala A, Mira A, Robins G, Lomi A (2018) Fast maximum likelihood estimation via equilibrium expectation for large network data. Sci Rep 8:11509
Caimo A, Friel N (2011) Bayesian inference for exponential random graph models. Soc Netw 33(1):41–55
Caimo A, Friel N (2014) Bergm: Bayesian exponential random graphs in R. J Stat Softw 61(2):1–25
Cantwell GT, Liu Y, Maier BF, Schwarze AC, Serván CA, Snyder J, St-Onge G (2020) Thresholding normally distributed data creates complex networks. Phys Rev E 101(6):062302
Cimini G, Squartini T, Saracco F, Garlaschelli D, Gabrielli A, Caldarelli G (2019) The statistical physics of real-world networks. Nat Rev Phys 1:58–71
Ciriello G, Guerra C (2008) A review on models and algorithms for motif discovery in protein–protein interaction networks. Brief Funct Genom 7(2):147–156
Clauset A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703
Cook SJ, Jarrell TA, Brittin CA, Wang Y, Bloniarz AE, Yakovlev MA, Nguyen KC, Tang LTH, Bayer EA, Duerr JS et al (2019) Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature 571(7763):63–71
Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI (2001) YPD™, PombePD™ and WormPD™: model organism volumes of the BioKnowledge™ Library, an integrated resource for protein information. Nucleic Acids Res 29(1):75–79. https://doi.org/10.1093/nar/29.1.75
Csárdi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Syst 1695:1–9
Davis JA, Leinhardt S (1967) The structure of positive interpersonal relations in small groups. In: Berger J (ed) Sociological theories in progress, vol 2. Houghton Mifflin, Boston, MA, pp 251–281
De Las Rivas J, Fontanillo C (2010) Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6(6):e1000807
Desmarais BA, Cranmer SJ (2012) Statistical inference for valued-edge networks: the generalized exponential random graph model. PLoS ONE 7(1):e30136
van Duijn MA, Gile KJ, Handcock MS (2009) A framework for the comparison of maximum pseudolikelihood and maximum likelihood estimation of exponential family random graph models. Soc Netw 31(1):52–62
Faust K (2010) A puzzle concerning triads in social networks: graph constraints and the triad census. Soc Netw 32(3):221–233
Felmlee D, McMillan C, Whitaker R (2021) Dyads, triads, and tetrads: a multivariate simulation approach to uncovering network motifs in social graphs. Appl Netw Sci 6(1):63
Ferrell JE (2002) Self-perpetuating states in signal transduction: positive feedback, double-negative feedback and bistability. Curr Opin Cell Biol 14(2):140–148
Fodor J, Brand M, Stones RJ, Buckle AM (2020) Intrinsic limitations in mainstream methods of identifying network motifs in biology. BMC Bioinform 21:165
Fronczak P, Fronczak A, Bujok M (2013) Exponential random graph models for networks with community structure. Phys Rev E 88(3):032810
Gillespie CS (2015) Fitting heavy tailed distributions: the poweRlaw package. J Stat Softw 64(2):1–16
Ginoza R, Mugler A (2010) Network motifs come in sets: correlations in the randomization process. Phys Rev E 82(1):011921
Gorochowski TE, Grierson CS, Di Bernardo M (2018) Organization of feed-forward loop motifs reveals architectural principles in natural and engineered networks. Sci Adv 4(3):eaap9751
Grazioli G, Martin RW, Butts CT (2019a) Comparative exploratory analysis of intrinsically disordered protein dynamics using machine learning and network analytic methods. Front Mol Biosci 6:42
Grazioli G, Yu Y, Unhelkar MH, Martin RW, Butts CT (2019b) Network-based classification and modeling of amyloid fibrils. J Phys Chem B 123(26):5452–5462
Gross E, Petrović S, Stasi D (2021) Random graphs with node and block effects: models, goodness-of-fit tests, and applications to biological networks. arXiv preprint arXiv:2104.03167v1
Guelzim N, Bottani S, Bourgine P, Képès F (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 31(1):60–63
Hagberg A, Swart P, Schult D (2008) Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux G, Vaught T, Millman J (eds) Proceedings of the 7th Python in science conference (SciPy 2008), pp 11–16
Handcock MS, Gile KJ (2010) Modeling social networks from sampled data. Ann Appl Stat 4(1):5–25
Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2008) statnet: software tools for the representation, visualization, analysis and simulation of network data. J Stat Softw 24(1):1–11
Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Bender-deMoll S, Morris M (2016) statnet: software tools for the statistical analysis of network data. The Statnet Project http://www.statnet.org, CRAN.R-project.org/package=statnet, R package version 2016.9
Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M (2021) ergm: fit, simulate and diagnose exponential-family models for networks. The Statnet Project https://statnet.org, https://CRAN.R-project.org/package=ergm, R package version 4.1.2
Holland PW, Leinhardt S (1970) A method for detecting structure in sociometric data. Am J Sociol 76(3):492–513
Holland PW, Leinhardt S (1976) Local structure in social networks. Sociol Methodol 7:1–45
Hummel RM, Hunter DR, Handcock MS (2012) Improving simulation-based algorithms for fitting ERGMs. J Comput Graph Stat 21(4):920–939
Hunter DR (2007) Curved exponential family models for social networks. Soc Netw 29(2):216–230
Hunter DR, Handcock MS (2006) Inference in curved exponential family models for networks. J Comput Graph Stat 15(3):565–583
Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M (2008) ergm: a package to fit, simulate and diagnose exponential-family models for networks. J Stat Softw 24(3):1–29
Hunter DR, Krivitsky PN, Schweinberger M (2012) Computational statistical methods for social network models. J Comput Graph Stat 21(4):856–882
Ingram PJ, Stumpf MP, Stark J (2006) Network motifs: structure does not determine function. BMC Genom 7:108
Jazayeri A, Yang CC (2020) Motif discovery algorithms in static and temporal networks: a survey. J Complex Netw 8(4):cnaa031. https://doi.org/10.1093/comnet/cnaa031
Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107
Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758
Konagurthu AS, Lesk AM (2008a) On the origin of distribution patterns of motifs in biological networks. BMC Syst Biol 2:73
Konagurthu AS, Lesk AM (2008b) Single and multiple input modules in regulatory networks. Proteins 73(2):320–324
Koskinen J (2020) Exponential random graph modelling. In: Atkinson P, Delamont S, Cernat A, Sakshaug J, Williams R (eds) SAGE research methods foundations. SAGE, London. https://doi.org/10.4135/9781526421036888175
Koskinen J, Daraganova G (2013) Exponential random graph model fundamentals. In: Lusher D, Koskinen J, Robins G (eds) Exponential random graph models for social networks. Cambridge University Press, New York, pp 49–76
Koskinen JH, Robins GL, Wang P, Pattison PE (2013) Bayesian analysis for partially observed network data, missing ties, attributes and actors. Soc Netw 35(4):514–527
Krivitsky PN (2012) Exponential-family random graph models for valued networks. Electron J Stat 6:1100–1128
Krivitsky PN (2017) Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput Stat Data Anal 107:149–161
Krivitsky PN, Handcock MS (2014) A separable model for dynamic networks. J R Stat Soc B Met 76(1):29–46
Krivitsky PN, Hunter DR, Morris M, Klumb C (2021) ergm 4.0: new features and improvements. arXiv preprint arXiv:2106.04997
Kumar G, Ranganathan S (2010) Network analysis of human protein location. BMC Bioinform 11(7):S9
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298(5594):799–804
Lehmann B, Henson R, Geerligs L, White S et al (2021) Characterising group-level brain connectivity: a framework using Bayesian exponential random graph models. Neuroimage 225:117480
Lesk AM, Konagurthu AS (2021) Neighbourhoods in the yeast regulatory network in different physiological states. Bioinformatics 37(4):551–558
Levy M (2016) gwdegree: improving interpretation of geometrically-weighted degree estimates in exponential random graph models. J Open Source Softw 1(3):36
Levy M, Lubell M, Leifeld P, Cranmer S (2016) Interpretation of GW-degree estimates in ERGMs. https://doi.org/10.6084/m9.figshare.3465020.v1
Li X, Stones RJ, Wang H, Deng H, Liu X, Wang G (2012) NetMODE: network motif detection without nauty. PLoS ONE 7(12):e50093
Lienert J, Koehly L, Reed-Tsochas F, Marcum CS (2019) An efficient counting method for the colored triad census. Soc Netw 58:136–142
Lusher D, Koskinen J, Robins G (eds) (2013) Exponential random graph models for social networks. Structural analysis in the social sciences. Cambridge University Press, New York
Mahadevan P, Krioukov D, Fall K, Vahdat A (2006) Systematic topology analysis and generation using degree correlations. ACM SIGCOMM Comput Commun 36(4):135–146
Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA 100(21):11980–11985
Manjang K, Tripathi S, Yli-Harja O, Dehmer M, Emmert-Streib F (2020) Graph-based exploitation of gene ontology using GOxploreR for scrutinizing biological significance. Sci Rep 10(1):16672
Manjang K, Emmert-Streib F, Tripathi S, Yli-Harja O, Dehmer M (2021) GOxploreR: structural exploration of the gene ontology (GO) knowledge base. https://CRAN.R-project.org/package=GOxploreR, R package version 1.2.1
Martorana E, Micale G, Ferro A, Pulvirenti A (2020) Establish the expected number of induced motifs on unlabeled graphs through analytical models. Appl Netw Sci 5(1):58
Mayhew BH (1984) Baseline models of sociological phenomena. J Math Sociol 9(4):259–281
Mazurie A, Bottani S, Vergassola M (2005) An evolutionary and functional assessment of regulatory network motifs. Genome Biol 6(4):R35
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399–403
Mewes HW, Frishman D, Güldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Münsterkötter M, Rudd S, Weil B (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res 30(1):31–34
Mi H, Muruganujan A, Huang X, Ebert D, Mills C, Guo X, Thomas PD (2019) Protocol update for large-scale genome and gene function analysis with the PANTHER classification system (v. 14.0). Nat Protoc 14(3):703–721
Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, Thomas PD (2021) PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res 49(D1):D394–D403
Middendorf M, Ziv E, Wiggins CH (2005) Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci USA 102(9):3192–3197
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827
Monteiro PT, Pedreira T, Galocha M, Teixeira MC, Chaouiya C (2020) Assessing regulatory features of the current transcriptional network of Saccharomyces cerevisiae. Sci Rep 10(1):17744
Moody J (1998) Matrix methods for calculating the triad census. Soc Netw 20(4):291–299
Morris M, Handcock M, Hunter D (2008) Specification of exponentialfamily random graph models: terms and computational aspects. J Stat Softw 24(4):1–24
Obando C, De Vico FF (2017) A statistical model for brain networks inferred from large-scale electrophysiological signals. J R Soc Interface 14(128):20160940
Orsini C, Dankulov MM, Colomer-de-Simón P, Jamakovic A, Mahadevan P, Vahdat A, Bassler KE, Toroczkai Z, Boguñá M, Caldarelli G et al (2015) Quantifying randomness in real networks. Nat Commun 6:8627
Ouma WZ, Pogacar K, Grotewold E (2018) Topological and statistical analyses of gene regulatory networks reveal unifying yet quantitatively different emergent properties. PLoS Comput Biol 14(4):e1006098
Patra S, Mohapatra A (2020) Review of tools and algorithms for network motif discovery in biological networks. IET Syst Biol 14(4):171–189
Pattison PE, Robins GL, Snijders TAB, Wang P (2013) Conditional estimation of exponential random graph models from snowball sampling designs. J Math Psychol 57(6):284–296
Payne JL, Wagner A (2015) Function does not follow form in gene regulatory circuits. Sci Rep 5:13015
Picard F, Daudin JJ, Koskas M, Schbath S, Robin S (2008) Assessing the exceptionality of network motifs. J Comput Biol 15(1):1–20
Pržulj N (2007) Biological network comparison using graphlet degree distribution. Bioinformatics 23(2):e177–e183
Rice JJ, Kershenbaum A, Stolovitzky G (2005) Lasting impressions: motifs in protein–protein maps may provide footprints of evolutionary events. Proc Natl Acad Sci USA 102(9):3173–3174
Robins G, Pattison P, Woolcock J (2004) Missing data in networks: exponential random graph (p*) models for networks with nonrespondents. Soc Netw 26(3):257–283
Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2):173–191
Robins G, Snijders TAB, Wang P, Handcock M, Pattison P (2007) Recent developments in exponential random graph (p*) models for social networks. Soc Netw 29(2):192–215
Robins G, Pattison P, Wang P (2009) Closure, connectivity and degree distributions: exponential random graph (p*) models for directed social networks. Soc Netw 31(2):105–117
Rolls DA, Robins G (2017) Minimum distance estimators of population size from snowball samples using conditional estimation and scaling of exponential random graph models. Comput Stat Data Anal 116:32–48
Rolls DA, Wang P, Jenkinson R, Pattison PE, Robins GL, Sacks-Davis R, Daraganova G, Hellard M, McBryde E (2013) Modelling a disease-relevant contact network of people who inject drugs. Soc Netw 35(4):699–710
Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M et al (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32(18):5539–5545
Salgado H, Santos-Zavaleta A, Gama-Castro S, Millán-Zárate D, Díaz-Peredo E, Sánchez-Solano F, Pérez-Rueda E, Bonavides-Martínez C, Collado-Vides J (2001) RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 29(1):72–74
Saul ZM, Filkov V (2007) Exploring biological network structure using exponential random graph models. Bioinformatics 23(19):2604–2611
Schaefer MH, Fontaine JF, Vinayagam A, Porras P, Wanker EE, Andrade-Navarro MA (2012) HIPPIE: integrating protein interaction networks with experiment based quality scores. PLoS ONE 7(2):e31826
Schaefer MH, Lopes TJ, Mah N, Shoemaker JE, Matsuoka Y, Fontaine JF, Louis-Jeune C, Eisfeld AJ, Neumann G, Perez-Iratxeta C et al (2013) Adding protein context to the human protein–protein interaction network to reveal meaningful interactions. PLoS Comput Biol 9(1):e1002860
Schuldiner M, Collins SR, Thompson NJ, Denic V, Bhamidipati A, Punna T, Ihmels J, Andrews B, Boone C, Greenblatt JF et al (2005) Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell 123(3):507–519
Schweinberger M (2020) Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli 26(2):1205–1233
Schweinberger M, Handcock MS (2015) Local dependence in random graph models: characterization, properties and statistical inference. J R Stat Soc B Met 77(3):647–676
Schweinberger M, Luna P (2018) Hergm: hierarchical exponential-family random graph models. J Stat Softw 85(1):1–39
Schweinberger M, Krivitsky PN, Butts CT, Stewart JR (2020) Exponential-family models of random graphs: inference in finite, super and infinite population scenarios. Stat Sci 35(4):627–662
Shalizi CR, Rinaldo A (2013) Consistency under sampling of exponential random graph models. Ann Stat 41(2):508–535
Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68
Shin CJ, Wong S, Davis MJ, Ragan MA (2009) Protein–protein interaction as a predictor of subcellular location. BMC Syst Biol 3:28
de Silva E, Stumpf MP (2005) Complex networks and simple models in biology. J R Soc Interface 2(5):419–430
Simpson SL, Hayasaka S, Laurienti PJ (2011) Exponential random graph modeling for complex brain networks. PLoS ONE 6(5):e20039
Simpson SL, Moussa MN, Laurienti PJ (2012) An exponential random graph modeling approach to creating group-based representative whole-brain connectivity networks. Neuroimage 60(2):1117–1126
Sinke MR, Dijkhuizen RM, Caimo A, Stam CJ, Otte WM (2016) Bayesian exponential random graph modeling of whole-brain structural networks across lifespan. Neuroimage 135:79–91
Snijders TAB (1991) Enumeration and simulation methods for 0–1 matrices with given marginals. Psychometrika 56(3):397–417
Snijders TAB (2002) Markov chain Monte Carlo estimation of exponential random graph models. J Soc Struct 3(2):1–40
Snijders TAB, Pattison PE, Robins GL, Handcock MS (2006) New specifications for exponential random graph models. Sociol Methodol 36(1):99–153
Stillman PE, Wilson JD, Denny MJ, Desmarais BA, Bhamidi S, Cranmer SJ, Lu ZL (2017) Statistical modeling of the default mode brain network reveals a segregated highway structure. Sci Rep 7(1):11694
Stivala A, Robins G, Lomi A (2020) Exponential random graph model parameter estimation for very large directed networks. PLoS ONE 15(1):e0227804
Stivala AD, Koskinen JH, Rolls D, Wang P, Robins GL (2016) Snowball sampling for estimating exponential random graph models for large networks. Soc Netw 47:167–188
Strauss D, Ikeda M (1990) Pseudolikelihood estimation for social networks. J Am Stat Assoc 85(409):204–212
Suratanee A, Schaefer MH, Betts MJ, Soons Z, Mannsperger H, Harder N, Oswald M, Gipp M, Ramminger E, Marcus G et al (2014) Characterizing protein interactions employing a genome-wide siRNA cellular phenotyping screen. PLoS Comput Biol 10(9):e1003814
Wang P, Robins G, Pattison P (2009) PNet: program for the estimation and simulation of p* exponential random graph models. Department of Psychology, The University of Melbourne, Parkville
Wang Y, Fang H, Yang D, Zhao H, Deng M (2019) Network clustering analysis using mixture exponential-family random graph models and its application in genetic interaction data. IEEE/ACM Trans Comput Biol Bioinform 16(5):1743–1752
Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge
Winterbach W, Van Mieghem P, Reinders M, Wang H, de Ridder D (2013) Topology of molecular interaction networks. BMC Syst Biol 7:90
Yaveroğlu ÖN, Fitzhugh SM, Kurant M, Markopoulou A, Butts CT, Pržulj N (2015) ergm.graphlets: a package for ERG modeling based on graphlet statistics. J Stat Softw 65(12):1–29
Yu S, Feng Y, Zhang D, Bedru HD, Xu B, Xia F (2020) Motif discovery in networks: a survey. Comput Sci Rev 37:100267
Acknowledgements
Not applicable.
Funding
This work was supported by Swiss National Science Foundation National Research Programme 75 [Grant No. 167326]; and Melbourne Bioinformatics at the University of Melbourne [Grant No. VR0261].
Author information
Contributions
AS conceived the work and conducted the analysis. AS and AL interpreted the results. AS drafted the original manuscript and both authors revised it. Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Supplementary tables and figures.
Additional models and goodness-of-fit plots.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Stivala, A., Lomi, A. Testing biological network motif significance with exponential random graph models. Appl Netw Sci 6, 91 (2021). https://doi.org/10.1007/s41109-021-00434-y
DOI: https://doi.org/10.1007/s41109-021-00434-y
Keywords
 Motif
 Biological network
 Exponential random graph model
 ERGM