Hierarchical Bayesian adaptive lasso methods on exponential random graph models


The analysis of network data has become an increasingly prominent and demanding area across multiple research fields, including data science, health, and the social sciences, and it requires robust models and efficient computational methods. One well-established and widely employed modeling approach for network data is the Exponential Random Graph Model (ERGM). Despite its popularity, there is a recognized need for further advancements to enhance its flexibility and variable selection capabilities. To address this need, we propose a novel hierarchical Bayesian adaptive lasso model (BALERGM), which builds upon the foundations of the ERGM. The BALERGM leverages the strengths of the ERGM and incorporates the flexible adaptive lasso technique, thereby facilitating effective variable selection and tackling the inherent challenges posed by high-dimensional network data. The model improvements have been assessed through the analysis of simulated data as well as two real datasets: friendship networks and a respondent-driven sampling dataset on active and healthy lifestyle awareness programs.


Multiple disciplines such as sociology, political science, and biology have extensively employed network analysis and random graph studies to comprehend and represent relationships among entities, ranging from friendships and global trading partners to proteins and genes. Early models generating random graphs assumed equal probability among graphs of the same size or independence among edges, but these models had evident limitations (Erdös and Rényi 1959). Holland and Leinhardt presented the next advancement by introducing a model for directed graphs that solely employed independent dyads (Holland and Leinhardt 1981). Subsequent work overcame the limitations of independence assumptions and introduced Markov random graph models, establishing the foundation for ERGMs that have endured for decades (Frank and Strauss 1986). Traditional statistical methods, however, have limitations in effectively capturing the complexities of relational data. The ERGM has emerged as a valuable tool for quantifying such data, elucidating how local interactions shape the overall structure of a network. ERGMs acknowledge and capture the inherent interdependence embedded within network structures. The probability of an edge’s existence is influenced not only by the presence of other edges but also by various network configurations, such as triangles, and the characteristics of nodes throughout the entire network. This assumption of dependence aligns closely with our intuitive understanding of how networks are formed and operate. It is noteworthy that the development of ERGMs by Frank and Strauss (1986) was primarily motivated by the recognition of tie-dependence in networks.

Fundamentally, ERGMs are analogous to logistic regression when the dyads are independent, offering regression-like analysis on random networks. ERGMs estimate the probability of tie existence between pairs of nodes in a network. Since ERGMs share commonalities with logistic regression, let us recall the traditional lasso method in classical linear regression and discuss its development and relation to Bayesian theory, which hints at the potential problems in developing lasso estimates on exponential random networks. The lasso of Tibshirani (1996) is a method for simultaneous shrinkage and model selection in regression problems. In the context of linear regression, the lasso is a regularization technique for simultaneous estimation and variable selection: if \({\varvec{y}}={\varvec{X}}\varvec{\beta }+\varvec{\epsilon }\), where \({\varvec{y}}=(y_1,y_2,\cdots ,y_n)^{\top }\) is the response vector, \({\varvec{X}}=({\varvec{x}}_1,{\varvec{x}}_2,\cdots ,{\varvec{x}}_p)\) is an \(n \times p\) predictor matrix, \(\varvec{\beta }=(\beta _1,\beta _2,\cdots ,\beta _p)\) is the corresponding vector of regression coefficients, and \(\varvec{\epsilon }=(\epsilon _1,\cdots ,\epsilon _n)\) are independent, normally distributed errors, then the lasso estimates are defined as

$$\begin{aligned} {\hat{\beta }}(lasso)=\arg \min _{\varvec{\beta }}\Vert {\varvec{y}}-\sum \limits _{j=1}^p \varvec{x_j}\beta _j\Vert ^2+\lambda \sum \limits _{j=1}^{p}|\beta _j| \end{aligned}$$

where the second term in (1) is the so-called “\(l^1\) penalty”. The tuning parameter \(\lambda\) controls the amount of penalty. Fan and Li (2001) studied a class of penalized models including the lasso. They proved that the lasso can perform automatic variable selection because of the singularity of the \(l^1\) penalty at the origin. If certain conditions are not satisfied, however, the lasso estimates can be inconsistent. To overcome this issue, Zou (2006) and Wang and Leng (2008) proposed adaptive lasso methods that enjoy consistency and the oracle property: namely, the estimator performs as well as if the true underlying model were given in advance. Tibshirani (1996) suggested that lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace (i.e., double-exponential) priors. Aiming to find this mode, several authors subsequently studied the lasso in different Bayesian contexts (Yuan and Lin 2006; Park and Casella 2008; Leng et al. 2014; Alhamzawi and Ali 2018). However, all these studies concern linear regression and are not built on random networks.

In the context of ERGMs, estimation encounters computational challenges when there is dependence among dyads. These challenges are primarily attributed to the intractability of the normalizing constant and the issue of degeneracy (Chatterjee and Diaconis 2013). Intractability refers to the computational difficulty of calculating the normalizing constant, which ensures that the probability mass function sums to one. Degeneracy, on the other hand, refers to the phenomenon where a model assigns a significant proportion of its probability mass to a small subset of graphs. This leads to a cascading effect throughout the graph, with the model assigning most of its probability mass to very sparse or very dense graphs. Bayesian computational methods have proven instrumental in circumventing these challenges. Caimo and Friel were the first to develop a complete Bayesian framework for network models, enabling the incorporation of Bayesian analysis into real-world networks, which often exhibit large-scale, high-dimensional, and complex structures with numerous attribute variables associated with nodes (Caimo and Friel 2011). Subsequently, Caimo and Friel integrated a transdimensional reversible jump Markov chain Monte Carlo (RJMCMC) approach, initially introduced by Green (1995), with the exchange algorithm (Caimo and Friel 2013, 2014). This algorithm incorporates an independence sampler, using a proposal distribution that fits a parametric density approximation to the within-model posterior. The method is appealing for model selection since it relies exclusively on probabilistic considerations, but it is computationally challenging since it needs to estimate the posterior probability of each competing model. With a large number of variables, the set of competing models grows rapidly, making model selection both more challenging and more critical. This motivates the penalized exponential random graph model developed in this paper.

While penalized estimation methods have been discussed in the context of graphical models by various researchers, these studies either lack a specific focus on ERGMs or fail to fully account for the inherent dependencies present in network data, often transforming the problem into generalized penalized linear regression (Meinshausen and Bühlmann 2006; Shojaie et al. 2012; Shojaie and Michailidis 2010; Shojaie 2013; Fan et al. 2009). Motivated by the need to explore network model uncertainty and achieve parsimony in exponential random graphs, we propose a more flexible and adaptive lasso-type penalized model within the framework of the ERGM. This model aims to improve parameter estimation and prediction accuracy, enabling effective variable selection within high-dimensional network data. In evaluations and comparisons with existing methods, our model is more efficient and more effective at selecting significant variables. It thereby addresses the critical challenge of model selection in the analysis of high-dimensional network data.

In summary, the Bayesian adaptive lasso model offers two prominent advantages. (1) Enhanced convergence speed and improved parameter mixing: the adaptive lasso addresses a notable limitation of the conventional lasso regularization technique, which often exhibits sluggish convergence and difficulties in selecting significant variables within high-dimensional datasets. Consequently, it facilitates faster convergence and more effective mixing of parameters. This characteristic proves particularly advantageous in scenarios involving extensive datasets or a substantial number of predictors. (2) Effective variable selection: the Bayesian adaptive lasso exponential random graph model automatically identifies pertinent variables while concurrently shrinking or eliminating less relevant ones. The process is facilitated by multiple chains generated by a parallel direction sampling algorithm, which enhances the efficiency and accuracy of variable selection. These two benefits are the primary focus of this article.

This article is structured as follows. Section 2 provides a basic introduction to exponential random graph models, offering a foundation for the subsequent discussions. In Sect. 3, we introduce a Bayesian adaptive lasso model for the exponential random graph, which enhances the Monte Carlo maximum likelihood method proposed by Geyer and the Bayesian ERGM (BERGM) presented by Caimo and Friel (Geyer 1991; Caimo et al. 2022). Section 4 presents a derivation of the Gibbs sampling theory underlying the model, shedding light on the underlying theoretical framework. In Sect. 5, we introduce the adaptive parallel direction sampling algorithm, which is incorporated into the Gibbs sampling theory to improve the mixing of the Monte Carlo chains, thereby enhancing the overall performance of the model. Section 6 outlines the algorithm procedure and provides a comparative analysis with the BERGM method proposed by Caimo et al., highlighting the strengths and advantages of our proposed approach (Caimo and Friel 2013, 2014; Caimo et al. 2022). In Sect. 7, we describe the network dataset called Faux Dixon High, which is used to test the model, and present simulation results. Additionally, this section includes the results of applying the proposed model to data collected in a study conducted with the Prevention Research Center at USC and Sumter County Active Lifestyles (SCAL). In Sect. 8, we discuss the goodness of fit for the proposed Bayesian adaptive lasso method, providing an evaluation of its performance and suitability. Finally, in Sect. 9, we summarize the key findings and contributions of the paper and identify open problems and avenues for future research.

Exponential random graph models

Examples and context

Exponential Random Graph Models (ERGMs) are widely applicable to research questions in the social and health sciences. In psychology, researchers studied Romanian school children’s friendship networks and found that sex and mental health showed patterns of homophily, concluding that ERGMs are a “promising avenue for further research” (Baggio et al. 2017). Also in the social and health sciences, Becker et al. considered the friendship network of members of a sorority and the influence of disordered eating habits on friendship, finding that women tended to have disordered eating habits, unlike their friends (Becker et al. 2018). This unexpected result has implications for understanding the complex social dynamics that go into a serious health concern. Solo et al. note the utility and suitability of ERGMs for modeling connections within the brain compared to more traditional methods, though they also note the computational difficulty of ERGMs (Solo et al. 2018). On a much larger scale, ERGMs have been used to understand the influences of information sharing on tourism. The model helped answer questions about the existence of patterns in the network, including whether or not the network exhibited homophily and how organizations should understand their role in the network (Williams and Hristov 2018). In the biological world, Stivala et al. show that ERGMs can address some of the limitations that previous research had found in modeling biological processes (Stivala and Lomi 2021). These examples show the flexibility and significance of exponential random graph models.

Model structure

Any network can be expressed with an adjacency matrix. The connectivity of the network’s graph is described by an \(n\times n\) adjacency matrix \({\varvec{Y}}\). Its (i, j) entry is \(Y_{i,j} = 1\) if node i gives a referral to node j and \(Y_{i,j} = 0\) otherwise. Let \({\mathcal {Y}}\) be the set of all possible graphs on n nodes and let \({\varvec{y}}\) be a realization of \({\varvec{Y}}\). A given network \({\varvec{y}}\) consists of n nodes and m edges that define relationships between pairs of nodes, called dyads. The adjacency matrix of the network graph \({\varvec{Y}}\) allows for the analysis of the structural relationships in the observed network.

For general exponential random graph models, the network has the following exponential family type density (Lusher et al. 2013):

$$\begin{aligned} \pi ({\varvec{y}} | \varvec{\theta }) = \frac{1}{z(\varvec{\theta })} e^{\varvec{\theta }^{T}s({\varvec{y}})} \end{aligned}$$

where \({\varvec{y}}\) is the observed network, \(\varvec{\theta }\) is a vector of parameters, and \(s({\varvec{y}})\) is a vector of network statistics. Each i-th network statistic \(s_{i}(\cdot )\) has a corresponding parameter \(\theta _{i}\). A positive value of \(\theta _{i}\) indicates that the ties forming the configuration counted by \(s_{i}\) are more likely to be present. The normalizing constant \(z(\varvec{\theta })\) is the summation \(\sum \limits _{{\varvec{y}} \in {\mathcal {Y}}}e^{\varvec{\theta }^{T}s({\varvec{y}})}\), where \({\mathcal {Y}}\) is the set of all possible graphs with the same number of nodes as \({\varvec{y}}\). The number of possible undirected graphs with \(n\) nodes is \(2^{n(n-1)/2}\), which becomes very large for all but the smallest graphs (Lusher et al. 2013). Hence, direct computation of \(z(\varvec{\theta })\) is feasible only for small networks; it becomes intractable for large or even moderate-sized networks.
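To make the growth of the normalizing constant concrete, the following minimal Python sketch evaluates \(z(\varvec{\theta })\) by brute-force enumeration on a very small undirected graph. The choice of statistics (edge and triangle counts), the parameter values, and the function names `graph_stats`, `all_graphs`, and `normalizing_constant` are illustrative assumptions and are not taken from the paper.

```python
import itertools
import math

import numpy as np


def graph_stats(adj):
    """Return s(y) = (number of edges, number of triangles) for an undirected graph."""
    n_edges = int(np.triu(adj, k=1).sum())
    # trace(A^3) counts each triangle 6 times (3 starting nodes x 2 directions).
    n_triangles = int(round(np.trace(adj @ adj @ adj) / 6))
    return np.array([n_edges, n_triangles], dtype=float)


def all_graphs(n):
    """Yield the adjacency matrix of every undirected simple graph on n nodes."""
    dyads = list(itertools.combinations(range(n), 2))
    for ties in itertools.product([0, 1], repeat=len(dyads)):
        adj = np.zeros((n, n), dtype=int)
        for (i, j), y_ij in zip(dyads, ties):
            adj[i, j] = adj[j, i] = y_ij
        yield adj


def normalizing_constant(theta, n):
    """Compute z(theta) by summing exp(theta^T s(y)) over all graphs on n nodes."""
    return sum(math.exp(theta @ graph_stats(adj)) for adj in all_graphs(n))


n = 4
theta = np.array([-1.0, 0.5])  # hypothetical values for (edges, triangles)
z = normalizing_constant(theta, n)
print(f"{2 ** (n * (n - 1) // 2)} graphs on {n} nodes, z(theta) = {z:.4f}")
```

With n = 4 the sum has only 64 terms, but the number of terms doubles with every additional dyad, so this direct approach stops being practical after a handful of nodes.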

Let \(\varvec{\delta }=s({\varvec{y}}_{ij}^{+})-s({\varvec{y}}_{ij}^{-})\) be the vector of changes in the statistics \({\varvec{s}}\) when the edge \(y_{ij}\) between nodes i and j in the graph \({\varvec{y}}\) changes from 0 to 1 while the rest of the graph, \({\varvec{y}}_{ij}^c\), stays the same. Conditioned on the state of the rest of the graph, denoted \({\varvec{Y}}_{-ij}\), the \(\log\) odds of a tie existing between nodes i and j are:

$$\begin{aligned} \log \frac{P(Y_{ij}= 1| {\varvec{Y}}_{-ij} = {\varvec{y}}_{-ij}, \varvec{\theta })}{P(Y_{ij}= 0| {\varvec{Y}}_{-ij} = {\varvec{y}}_{-ij}, \varvec{\theta })} = \varvec{\theta }^{T}\varvec{\delta } \end{aligned}$$

These network statistics can be counts of overlapping subgraph configurations, such as edges, mutual edges, and triangles, as well as attribute-based terms such as uniform homophily. The representation above gives an intuitive interpretation of the model parameters \(\varvec{\theta }\) in terms of their effect on the probability of an edge between nodes i and j.
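As a small illustration of the conditional log-odds formula, the Python sketch below computes the change statistics \(\varvec{\delta }\) for one dyad and the implied conditional tie probability. It assumes an undirected graph with edge and triangle counts as the statistics; the helper names and parameter values are hypothetical.

```python
import numpy as np


def graph_stats(adj):
    """Return s(y) = (number of edges, number of triangles) for an undirected graph."""
    n_edges = int(np.triu(adj, k=1).sum())
    n_triangles = int(round(np.trace(adj @ adj @ adj) / 6))
    return np.array([n_edges, n_triangles], dtype=float)


def change_statistics(adj, i, j):
    """delta = s(y_ij^+) - s(y_ij^-), with the rest of the graph held fixed."""
    y_plus, y_minus = adj.copy(), adj.copy()
    y_plus[i, j] = y_plus[j, i] = 1
    y_minus[i, j] = y_minus[j, i] = 0
    return graph_stats(y_plus) - graph_stats(y_minus)


def conditional_tie_probability(theta, adj, i, j):
    """P(Y_ij = 1 | rest of graph) = logistic(theta^T delta)."""
    log_odds = theta @ change_statistics(adj, i, j)
    return 1.0 / (1.0 + np.exp(-log_odds))


# Path 0-1-2 plus an isolated node 3: adding the tie (0, 2) closes one triangle.
adj = np.zeros((4, 4), dtype=int)
adj[0, 1] = adj[1, 0] = 1
adj[1, 2] = adj[2, 1] = 1

theta = np.array([-1.0, 0.5])          # hypothetical (edges, triangles) parameters
delta = change_statistics(adj, 0, 2)   # one extra edge and one extra triangle
p = conditional_tie_probability(theta, adj, 0, 2)
print(delta, p)
```

Here the tie (0, 2) has log-odds \(\theta _1+\theta _2\), whereas a tie to the isolated node would only involve the edge parameter, which is how a positive triangle parameter favors closing open two-paths.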

Classical inference for ERGMs

Estimation methods

The inferential goal is to find an estimate of \(\varvec{\theta }\) such that the statistics of networks generated from the model are, on average, centered on those of the observed network. That is, we want to solve the moment equation:

$$\begin{aligned} {\mathbb {E}}_{\varvec{\theta }}(s({\varvec{y}})) = s({\varvec{y}}_{\text {obs}}) \end{aligned}$$

where \({\varvec{y}}_{\text {obs}}\) is the observed network, \(s({\varvec{y}})\) is the vector of network statistics of a graph generated from the model, and \(s({\varvec{y}}_{\text {obs}})\) is the vector of network statistics of the observed graph. However, in most cases, the moment equation cannot be solved analytically. This challenge leads to two mainstream approximation approaches: maximum pseudolikelihood estimation and Monte Carlo maximum likelihood estimation.

Maximum pseudolikelihood estimation

Direct maximum likelihood estimation of ERGMs is complicated since the likelihood function is difficult to compute for models and networks of moderate or large size. Strauss and Ikeda (1990) proposed a standard approximation, maximum pseudolikelihood estimation (MPLE). Instead of conditioning each tie on the state of the entire graph, MPLE assumes that the dependence among dyads is weak. In particular, the MPLE estimates can be obtained by assuming independence among the values of \(Y_{ij}\):

$$\begin{aligned} P(Y_{ij} = 1 | {\varvec{Y}}_{-ij} = {\varvec{y}}_{-ij}) = P(Y_{ij} = 1) \end{aligned}$$

This yields the pseudolikelihood function, which can be estimated quickly but has been shown not to provide reliable estimates (van Duijn et al. 2009; Friel et al. 2009):

$$\begin{aligned} \pi ({\varvec{y}} | \varvec{\theta }) \approx \pi _{pseudo}({\varvec{y}} | \varvec{\theta })&= \prod _{i \ne j} \pi (y_{ij}|{\varvec{y}}_{-ij}, \varvec{\theta }) \end{aligned}$$
$$\begin{aligned}&= \prod _{i \ne j} \pi (y_{ij} =1 |{\varvec{y}}_{-ij}, \varvec{\theta })^{y_{ij}} \left[ 1 - \pi (y_{ij} = 1 |{\varvec{y}}_{-ij}, \varvec{\theta })\right] ^{1-y_{ij}} \end{aligned}$$

This provides the true maximum likelihood estimate only for ERGMs with dyadic independence, that is, when the change statistics of a tie can be computed without knowing the rest of the graph. van Duijn et al. (2009) compared the maximum pseudolikelihood and maximum likelihood estimates; their study shows that pseudolikelihood estimation is biased and that MPLE can only approximate the transitivity pattern in the network well.
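The pseudolikelihood can be evaluated with a few lines of code once change statistics are available. The Python sketch below is a minimal illustration under stated assumptions: the graph is undirected (so the product runs over dyads \(i<j\) rather than ordered pairs), the statistics are edge and triangle counts, and all function names are hypothetical. For an edges-only model the dyads are independent, and the log pseudolikelihood coincides with the exact log-likelihood \(\theta \, s({\varvec{y}}) - N\ln (1+e^{\theta })\), where N is the number of dyads.

```python
import math

import numpy as np


def graph_stats(adj):
    """Return s(y) = (number of edges, number of triangles) for an undirected graph."""
    n_edges = int(np.triu(adj, k=1).sum())
    n_triangles = int(round(np.trace(adj @ adj @ adj) / 6))
    return np.array([n_edges, n_triangles], dtype=float)


def change_statistics(adj, i, j):
    """delta = s(y_ij^+) - s(y_ij^-), with the rest of the graph held fixed."""
    y_plus, y_minus = adj.copy(), adj.copy()
    y_plus[i, j] = y_plus[j, i] = 1
    y_minus[i, j] = y_minus[j, i] = 0
    return graph_stats(y_plus) - graph_stats(y_minus)


def log_pseudolikelihood(theta, adj):
    """Sum over dyads of log P(Y_ij = y_ij | rest of graph, theta)."""
    n = adj.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            eta = float(theta @ change_statistics(adj, i, j))
            # Bernoulli log-probability with log-odds eta: y * eta - log(1 + e^eta).
            total += adj[i, j] * eta - math.log1p(math.exp(eta))
    return total


# Path 0-1-2 plus an isolated node 3 (2 edges among 6 dyads).
adj = np.zeros((4, 4), dtype=int)
adj[0, 1] = adj[1, 0] = 1
adj[1, 2] = adj[2, 1] = 1

theta_edges_only = np.array([-0.7, 0.0])   # triangle parameter switched off
exact_loglik = 2 * (-0.7) - 6 * math.log1p(math.exp(-0.7))
print(log_pseudolikelihood(theta_edges_only, adj), exact_loglik)
```

Once the triangle parameter is nonzero the two quantities no longer agree in general, which is the source of the bias discussed above.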

Monte Carlo maximum likelihood estimation

ERGMs are log-linear models, so, as in regression, the maximum likelihood estimate is found at the roots of the derivative of the \(\log\)-likelihood. Setting this derivative to zero gives the moment equation \(s({\varvec{y}}_{\text {obs}}) - {\mathbb {E}}_{\varvec{\theta }}(s({\varvec{y}})) = 0\) found earlier. Monte Carlo maximum likelihood estimation for ERGMs relies on the following ratio of normalizing constants (van Duijn et al. 2009):

$$\begin{aligned} \frac{z(\varvec{\theta })}{z(\varvec{\theta }_{0})} = {\mathbb {E}}_{{\varvec{y}}|\varvec{\theta }_{0}} \left[ \frac{e^{\varvec{\theta }^{T}s({\varvec{y}})}}{e^{\varvec{\theta }_{0}^{T}s({\varvec{y}})}} \right] . \end{aligned}$$

The log-likelihood equation, however, is not directly solvable without computing the normalizing constant. As previously mentioned, this is computationally intensive for all but the smallest graphs. The ratio of normalizing constants, though, can be estimated by generating \(m\) graphs \({\varvec{y}}_1,\cdots ,{\varvec{y}}_m\) from the density \(\pi ({\varvec{y}}|\varvec{\theta }_0)\), computing \(e^{(\varvec{\theta }-\varvec{\theta }_{0})^{T}s({\varvec{y}}_{i})}\) for each graph, and averaging, which is an importance sampling technique. The estimate of \(\varvec{\theta }\) is then obtained by maximizing the log-likelihood ratio, approximated as follows:

$$\begin{aligned} \ell (\varvec{\theta }) - \ell (\varvec{\theta }_{0}) \approx (\varvec{\theta } -\varvec{\theta }_{0})^{T}s({\varvec{y}}_{\text {obs}}) - \ln \left[ \frac{1}{m} \sum _{i = 1}^{m} e^{(\varvec{\theta }-\varvec{\theta }_{0})^{T}s({\varvec{y}}_{i})}\right] \end{aligned}$$

However, in this method, the choice of the initial \(\varvec{\theta }_0\) is delicate: it should be near the maximum likelihood estimate of \(\varvec{\theta }\). A poor choice of \(\varvec{\theta }_0\) can cause the maximization of the approximate log-likelihood to fail and can lead to degeneracy problems (van Duijn et al. 2009; Handcock 2003).
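The importance sampling approximation can be checked on a case where the answer is known. The Python sketch below is an illustration only: it assumes an edges-only model, for which graphs at \(\varvec{\theta }_0\) can be drawn exactly (the edge count is binomial) and \(z(\theta )=(1+e^{\theta })^{N}\) is available in closed form. For dyad-dependent models the draws would instead come from an MCMC graph sampler, which is not shown. The function name `approx_loglik_ratio` and all numerical values are hypothetical.

```python
import numpy as np


def approx_loglik_ratio(theta, theta0, s_obs, s_samples):
    """Approximate l(theta) - l(theta0) from statistics of graphs drawn at theta0.

    s_samples has one row s(y_i) per simulated graph, i = 1..m.
    """
    diff = np.atleast_1d(theta - theta0)
    w = s_samples @ diff                      # (theta - theta0)^T s(y_i)
    # log[(1/m) sum_i exp(w_i)], computed stably with the log-sum-exp trick.
    log_ratio_z = np.max(w) + np.log(np.mean(np.exp(w - np.max(w))))
    return float(np.atleast_1d(s_obs) @ diff - log_ratio_z)


rng = np.random.default_rng(0)
n_dyads, m = 10, 50_000
theta0, theta = np.array([-0.5]), np.array([-0.2])

# Edges-only model: each dyad is an independent Bernoulli with logit theta0,
# so the edge count of a simulated graph is Binomial(n_dyads, p0).
p0 = 1.0 / (1.0 + np.exp(-theta0[0]))
s_samples = rng.binomial(n_dyads, p0, size=(m, 1)).astype(float)
s_obs = np.array([4.0])                       # hypothetical observed edge count

approx = approx_loglik_ratio(theta, theta0, s_obs, s_samples)
exact = float(
    s_obs @ (theta - theta0)
    - n_dyads * (np.log1p(np.exp(theta[0])) - np.log1p(np.exp(theta0[0])))
)
print(approx, exact)
```

The approximation degrades as \(\varvec{\theta }\) moves away from \(\varvec{\theta }_0\), because a few simulated graphs then dominate the average; this is one way to see why the starting value matters.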

Bayesian adaptive lasso exponential random graph model

This work is motivated by the need to explore model uncertainty and flexibility. With these objectives, we consider the following exponential random graph model, which belongs to a particular class of discrete exponential families that represent the probability distribution of the adjacency matrix \({\varvec{Y}}\in {\mathcal {Y}}\), where \({\mathcal {Y}}\) is the set of all possible graphs on n nodes. Let \({\varvec{y}}\) be a realization of \({\varvec{Y}}\). The likelihood function of an ERGM is the probability density of the random network and can be expressed as:

$$\begin{aligned} \pi ({\varvec{y}}|\varvec{\theta })=\frac{q({\varvec{y}}|\varvec{\theta })}{z(\varvec{\theta })}=\frac{e^{\varvec{\theta }^{T} s({\varvec{y}})}}{z(\varvec{\theta })} \end{aligned}$$

where \(q({\varvec{y}}|\varvec{\theta })=e^{\varvec{\theta }^{T} s({\varvec{y}})}\) is the unnormalized likelihood.

We consider the following adaptive lasso estimator on the exponential random network:

$$\begin{aligned} \hat{\varvec{\theta }}&=\arg \max _{\theta }l(\varvec{\theta }|{\varvec{y}})-P(\varvec{\theta }), \end{aligned}$$
$$\begin{aligned} P(\varvec{\theta })&=\sum \limits _{j=1}^p\lambda _{j}|\theta _{j}| \end{aligned}$$

where \(l(\varvec{\theta }|{\varvec{y}})=\ln (\pi ({\varvec{y}}|\varvec{\theta }))\) is the log-likelihood function of \(\varvec{\theta }\) and each \(\lambda _j\) is a separate penalty parameter for the corresponding coefficient. In dyadic independence ERGMs, maximizing the penalized log-likelihood function (10) is equivalent to maximizing the following penalized log pseudo-likelihood function:

$$\begin{aligned} l(\varvec{\theta }|{\varvec{y}})-P(\varvec{\theta })=\sum \limits _{i \ne j} y_{ij}\ln (\pi _{ij})+\sum \limits _{i \ne j}(1-y_{ij})\ln (1-\pi _{ij})-\sum \limits _{k=1}^{p} \lambda _{k}\vert \theta _{k}\vert \end{aligned}$$

where \(\pi _{ij}=P(Y_{ij}=1|{\varvec{y}}_{ij}^c)=P(Y_{ij}=1)\). In this case, the network estimation problem reduces to the classical adaptive lasso logistic regression model. For example, the coordinate descent algorithm implemented in the glmnet package for R (Tay et al. 2023; Friedman et al. 2010) can estimate the parameters \(\theta _j\), \(j=1,2,3,\cdots ,p\), under penalties that include the lasso, ridge, and the elastic net. However, unlike in generalized linear regression models, the challenge of estimation for dyadic dependent ERGMs lies in the intractable normalizing constant appearing in the log-likelihood function. As the review of likelihood-based methods for ERGMs in Sect. 2 shows, solving (10) faces similar obstacles. To get around those obstacles, we study this problem with an adaptive Bayesian estimate obtained from the lasso penalized method on random networks.
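For the dyadic independence case, the penalized problem can be sketched directly. The Python example below is a hedged illustration rather than the glmnet algorithm: it swaps coordinate descent for a plain proximal gradient (ISTA) iteration with a separate penalty \(\lambda _j\) per coefficient. The design matrix `X` is a synthetic stand-in for dyad-level change statistics, and the function names, penalties, and sample size are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the weighted l1 penalty (elementwise)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def adaptive_lasso_logistic(X, y, lam, n_iter=2000):
    """Maximize sum_i [y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i)] - sum_j lam_j |theta_j|
    by proximal gradient ascent (ISTA), where pi_i = logistic(x_i^T theta)."""
    theta = np.zeros(X.shape[1])
    # Step size 1/L, where L = ||X||_2^2 / 4 bounds the curvature of the logistic loss.
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (prob - y)               # gradient of the negative log-likelihood
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # synthetic stand-in for change statistics
theta_true = np.array([1.5, 0.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

lam = np.array([0.5, 100.0, 0.5])             # heavy penalty on the second coefficient
theta_hat = adaptive_lasso_logistic(X, y, lam)
print(theta_hat)
```

The coefficient with the large penalty is set exactly to zero while the lightly penalized ones are only shrunk, which is the per-coefficient behavior the adaptive penalty is designed to produce.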

Assume that a prior distribution \(\pi (\varvec{\theta })\) is placed on \(\varvec{\theta }\), and we are interested in the posterior distribution

$$\begin{aligned} \pi (\varvec{\theta }|{\varvec{y}}) \propto \pi ({\varvec{y}}|\varvec{\theta })\pi (\varvec{\theta }) \end{aligned}$$

We consider a conditional Laplace prior specification of a form similar to the classical Bayesian lasso for linear regression developed in Park and Casella (2008), but with a separate penalty term \(\lambda _j\) for each coefficient, \(j=1, 2, 3, \cdots , p\):

$$\begin{aligned} \pi (\varvec{\theta }|\sigma ^2)=\prod \limits _{j=1}^p\frac{\lambda _j}{2\sqrt{\sigma ^2}}e^{-\lambda _j |\theta _j|/\sqrt{\sigma ^2}} \end{aligned}$$

We can now formulate a hierarchical model on the exponential random graph, which we can use to implement this version of the Bayesian lasso with a Gibbs sampler, by representing the Laplace distribution as a scale mixture of Gaussians: when the mixing distribution is exponential, the resulting distribution is Laplace (Andrews and Mallows 1974).

$$\begin{aligned} \frac{a}{2}e^{-a|z|}=\int _0^{\infty } \frac{1}{\sqrt{2\pi s}}e^{-\frac{z^2}{2s}}\frac{a^2}{2}e^{-\frac{a^2s}{2}}ds,\,\,\,\,a>0 \end{aligned}$$
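The scale-mixture identity can be verified numerically. The Python sketch below estimates the right-hand side by Monte Carlo, drawing the mixing variable \(s\) from an exponential distribution with rate \(a^2/2\) and averaging the normal density at \(z\), and compares it with the Laplace density on the left-hand side. The function names, the values of \(a\) and \(z\), and the number of draws are illustrative assumptions.

```python
import numpy as np


def laplace_density(z, a):
    """Left-hand side: (a / 2) * exp(-a |z|)."""
    return 0.5 * a * np.exp(-a * np.abs(z))


def mixture_density_mc(z, a, n_draws=200_000, seed=0):
    """Right-hand side by Monte Carlo: E_s[N(z; 0, s)] with s ~ Exponential(rate a^2 / 2)."""
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the exponential by its scale, i.e. 1 / rate.
    s = rng.exponential(scale=2.0 / a**2, size=n_draws)
    return float(np.mean(np.exp(-z**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)))


a, z = 1.5, 0.7
print(laplace_density(z, a), mixture_density_mc(z, a))
```

The two printed values should agree up to Monte Carlo error, which is what allows the Laplace prior to be replaced by a normal prior with a latent exponential variance in the Gibbs sampler.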

Now we use latent parameters \(\tau _j^2\) to express the prior (14) as a scale mixture of normal distributions (15). We can regard the \(\tau _j\) as additional parameters that assign different variances to the priors of the components of \(\varvec{\theta }\). When \(\tau _j\rightarrow 0\), the coefficient of \(s_j({\varvec{y}})\) is shrunk to zero.

Assume \(\varvec{\theta } = ( \theta _{1}, \theta _{2},..., \theta _{p})\) follows normal distributions centered at zero with variance defined below.

$$\begin{aligned} \varvec{\theta } | \sigma ^{2}, \tau _{1}^{2},\tau _{2}^{2},..., \tau _{p}^{2} \sim {\mathcal {N}}(0_{p},\,\sigma ^{2}{\varvec{D}}_{\tau } ) \end{aligned}$$

where \(\sigma ^2>0\) and \({\varvec{D}}_{\tau }=diag(\tau _1^2,\tau _2^2,\cdots ,\tau _p^2)\) is a matrix that allows each parameter to come from a normal distribution with a different variance.

Unlike the basic Bayesian lasso model proposed by Park and Casella (2008), in which every \(\tau _j^2\) shares a single penalty \(\lambda\) and follows

$$\begin{aligned} \pi ( \tau _{j}^{2} ) = \frac{\lambda ^{2}}{2} e^{- \frac{\lambda ^{2} \tau _{j}^{2}}{2}}, \quad j=1,2,\cdots ,p, \end{aligned}$$

our Bayesian adaptive lasso ERGM sets up different shrinkage parameters for different coefficients. This motivates us to define a more adaptive penalty in the hierarchical structure:

$$\begin{aligned} \pi (\sigma ^2,\tau _1,\tau _2,\cdots , \tau _p) \propto \pi (\sigma ^2)\prod \limits _{j=1}^{p}\frac{\lambda _{j}^{2}}{2} e^{- \frac{\lambda _{j}^{2} \tau _{j}^{2}}{2}} \end{aligned}$$

and an independent non-informative scale-invariant marginal prior \(\pi (\sigma ^2)\propto \displaystyle \frac{1}{\sigma ^2}\) on \(\sigma ^2\), as suggested by Park and Casella (2008). Conditioning the prior of \(\varvec{\theta }\) on \(\sigma ^2\) guarantees a unimodal full posterior distribution for \(\varvec{\theta }\) on the network (see Appendix A). A unimodal posterior helps the Gibbs sampling algorithm converge quickly and makes a single point estimate of \(\varvec{\theta }\) meaningful.

Finally, the simplest prior for the penalty term \(\lambda _j\), for \(j=1,2,3,\cdots ,p\), would be a uniform distribution, but this proved to be problematic with complex networks, particularly when a model has many parameters. Thus, following the notation of Park and Casella (2008), we propose a prior such that \(\lambda _{j}^{2}\) follows a Gamma distribution with shape parameter r and rate parameter \(\delta _j\):

$$\begin{aligned} \pi (\lambda _{j}^{2}) = \frac{\delta _j^{r}}{\Gamma (r)}\left( \lambda _{j}^{2}\right) ^{r - 1}e^{-\delta _j \lambda _{j}^{2}} \hspace{1cm} \text { for } \lambda _{j}, r, \delta _j > 0. \end{aligned}$$

This prior mixes well with the other choices for the Gibbs sampling and, as Park and Casella (2008) note, its density can approach 0 as \(\lambda _j \rightarrow \infty\) while still concentrating probability near the MLE.

In summary, the hierarchical formulation of the Bayesian adaptive lasso model on the exponential random graph is as follows:

$$\begin{aligned} \pi ( {\varvec{y}} | \varvec{\theta }) = \frac{1}{z(\varvec{\theta })} e^{\varvec{\theta }^{T}s({\varvec{y}})} \end{aligned}$$
$$\begin{aligned} \varvec{\theta } | \sigma ^{2} , \tau _{1}^{2} ,\tau _{2}^{2} , . . . , \tau _{p}^{2} \sim {\mathcal {N}}(0_{p},\,\sigma ^{2}{\varvec{D}}_{\tau } )\end{aligned}$$
$$\begin{aligned} {\varvec{D}}_{\tau }=diag(\tau _1^2,\cdots ,\tau _p^2)\end{aligned}$$
$$\begin{aligned} \pi (\sigma ^2,\tau _1,\tau _2,\cdots , \tau _p | \lambda _{j} )\propto \pi (\sigma ^2)\prod \limits _{j=1}^{p}\frac{\lambda _{j}^{2}}{2} e^{- \frac{\lambda _{j}^{2} \tau _{j}^{2}}{2}} \end{aligned}$$
$$\begin{aligned} \pi ( \lambda _{j}^{2}) = \frac{\delta _j^{r}}{\Gamma (r)}\left( \lambda _{j}^{2}\right) ^{r - 1}e^{-\delta _j \lambda _{j}^{2}}\end{aligned}$$
$$\begin{aligned} \pi (\sigma ^{2} ) \propto \frac{1}{\sigma ^{2}} \end{aligned}$$

for \(\sigma ^{2}, r, \delta _j,j=1,2,3,\cdots ,p \text { and } \tau _1^2,\tau _2^2,\cdots ,\tau _p^2>0\).
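The prior layers of this hierarchy are straightforward to simulate forward, which gives a feel for how each \(\lambda _j\) controls shrinkage. The Python sketch below is illustrative only: it holds \(\sigma ^2\) fixed, because the improper prior \(\pi (\sigma ^2)\propto 1/\sigma ^2\) cannot be sampled from, and the function names and hyperparameter values are assumptions. For fixed \(\lambda _j\) and \(\sigma ^2\), the marginal prior of \(\theta _j\) is Laplace with mean absolute value \(\sigma /\lambda _j\), which the simulation reproduces.

```python
import numpy as np


def sample_lambda2(r, delta, rng):
    """Draw lambda_j^2 ~ Gamma(shape = r, rate = delta_j) for each j."""
    delta = np.asarray(delta, dtype=float)
    return rng.gamma(shape=r, scale=1.0 / delta)   # NumPy uses scale = 1 / rate


def sample_theta_prior(lam2, sigma2, n_draws, rng):
    """Draw theta from the two-stage prior for fixed lambda_j^2 and sigma^2.

    tau_j^2 ~ Exponential(rate = lambda_j^2 / 2)
    theta_j | tau_j^2 ~ N(0, sigma^2 * tau_j^2)
    """
    lam2 = np.asarray(lam2, dtype=float)
    tau2 = rng.exponential(scale=2.0 / lam2, size=(n_draws, lam2.size))
    return rng.normal(loc=0.0, scale=np.sqrt(sigma2 * tau2))


rng = np.random.default_rng(0)

# One draw of the penalties from their Gamma prior (hypothetical r and delta_j).
print(sample_lambda2(r=1.0, delta=[0.1, 0.1, 0.1], rng=rng))

# Fixed penalties: lambda_1 = 2 (strong shrinkage), lambda_2 = 0.5 (weak shrinkage).
draws = sample_theta_prior(lam2=[4.0, 0.25], sigma2=1.0, n_draws=200_000, rng=rng)
print(np.abs(draws).mean(axis=0))   # close to sigma / lambda_j = (0.5, 2.0)
```

A larger \(\lambda _j\) pulls the prior mass of \(\theta _j\) toward zero, while a smaller one leaves the coefficient comparatively free.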

This formulation differs from the Bayesian lasso in Park and Casella (2008) in two major ways. First, the Bayesian lasso method in Park and Casella (2008) is applied to the linear regression model \({\varvec{y}}=\mu {\varvec{1}}_n+\varvec{X\beta }+\varvec{\epsilon }\) without any network structure. In other words, \({\varvec{y}}\) in Park and Casella (2008) follows the normal distribution \({\mathcal {N}}(\mu {\varvec{1}}_n+\varvec{X\beta }, \sigma ^2{\varvec{I}}_n)\), where \({\varvec{y}}\) is an \(n\times 1\) vector of responses that does not involve a random graph. Second, our model allows different penalty variables \(\lambda _j\), one for each parameter. In this case, each \(\tau _{j}^{2}\) can have its own distribution, and thus the variance of each normal distribution can be different. With this flexibility, a larger penalty is applied to the parameters of less important network statistics in the exponential random graph, and a smaller penalty is applied to those of important ones. Compared with the existing Bayesian adaptive lasso models (Leng et al. 2014; Alhamzawi and Ali 2018), our model is built on the random network. Compared with the Bayesian Exponential Random Graph Model (BERGM) of Caimo and Friel (2011), our Bayesian Adaptive Lasso Exponential Random Graph Model (BALERGM) gives more accurate estimates, and its structure is more flexible and adaptive at the level of network statistics because it adopts distinct shrinkage and penalties for different network statistics. The estimate \(\hat{\theta _j}\) of \(\theta _j \text { for } j=1,2,3, \cdots , p\) will be small and close to 0 if the corresponding statistic does not provide much improvement in predicting the random network \({\varvec{Y}}\). The model therefore naturally leads to an estimator with an automatic variable selection property. The value of \(\lambda _j\) affects the estimate of \(\theta _j\).
The larger \({\hat{\lambda }}_j\) is, the sparser \(\varvec{\theta }\) will be (namely, more coefficients are small and near 0), whereas a smaller \({\hat{\lambda }}_j\) leads to a less sparse \(\varvec{\theta }\). Sparsity is a common assumption in high-dimensional statistics because we anticipate that only a few covariates are actually related to the response and most covariates are irrelevant. BALERGM is very powerful in this scenario because it leads to a sparse estimator on the network (many coefficients are near 0). Note that high-dimensional problems in network science are very common. For example, in genetics there are many genes per individual but often few patients in a study, and in neuroscience an fMRI machine produces many voxels per person at a given time.

Gibbs sampler implementation

Now we will implement the model with a Gibbs sampler. The Gibbs sampling method is a Markov Chain Monte Carlo (MCMC) algorithm. In our case, the joint distribution is difficult to sample from directly, but the conditional distribution of each variable is known and is easier to sample from. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditioned on the current values of the other variables. The construction of the hierarchical model (20) makes the derivation of the full conditional distributions for each component of the estimates feasible.
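As a generic illustration of the Gibbs sampling idea described above, and not of the BALERGM sampler itself, the Python sketch below alternates between two full conditional distributions of a standard bivariate normal with correlation \(\rho\). Both conditionals are univariate normals, so each update is a single draw. The target distribution, the function name, and the run length are assumptions chosen to keep the example self-contained.

```python
import numpy as np


def gibbs_bivariate_normal(rho, n_iter, rng):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
    draws = np.empty((n_iter, 2))
    x1, x2 = 0.0, 0.0
    cond_sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        # Update each variable in turn, conditioning on the current value of the other.
        x1 = rng.normal(rho * x2, cond_sd)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_sd)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        draws[t] = (x1, x2)
    return draws


rng = np.random.default_rng(0)
draws = gibbs_bivariate_normal(rho=0.6, n_iter=20_000, rng=rng)
kept = draws[1_000:]                         # discard an initial burn-in period
print(np.corrcoef(kept.T)[0, 1])             # should be close to rho
```

The sampler for the hierarchical model follows the same pattern, except that the blocks being updated are the model parameters and latent scales, each drawn from its own full conditional.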

Thus we can write the joint density as the product of the conditional density of \({\varvec{y}} |\varvec{ \theta }\) and the density of \(\varvec{\theta }\). Using the pieces of the hierarchical formulation of the model from (20) we can substitute in each piece that we have already chosen to find the joint distribution.

$$\begin{aligned} \begin{aligned} \pi ( {\varvec{y}}, \varvec{\theta }, \sigma , \varvec{\lambda }, \varvec{\tau } )&= \pi ({\varvec{y}}|\varvec{\theta })\pi (\varvec{\theta }) \\&= \pi ({\varvec{y}}|\varvec{\theta }) \left[ \prod _{j=1}^{p} \pi (\theta _{j} | \tau _{j}^{2}, \sigma ^{2}) \pi (\tau _{j}^{2}|\lambda _{j}) \pi (\lambda _{j}^{2}) \right] \pi (\sigma ^{2})\\&= \frac{1}{z(\varvec{\theta })}e^{\varvec{\theta }^{T}s({\varvec{y}})} \left[ \prod _{j=1}^{p} \frac{1}{(2\pi \sigma ^{2}\tau _{j}^2)^{1/2}} e^{-\frac{1}{2\sigma ^{2}\tau _{j}^2}\theta _{j}^{2}} \frac{\lambda _{j}^{2}}{2} e^{- \frac{\lambda _{j}^{2} \tau _{j}^{2}}{2}} \frac{\delta _j^{r}}{\Gamma (r)}\left( \lambda _{j}^{2}\right) ^{r - 1}e^{-\delta _j \lambda _{j}^{2}} \right] \frac{1}{\sigma ^{2}} \end{aligned} \end{aligned}$$

To implement the Gibbs sampling, we require the distribution of each parameter \(\tau _{j}, \lambda _{j}, \sigma ^{2}\) to update in turn. From the joint distribution (26), we consider all parts of that joint distribution that depend on each variable. As summarized in Table 1, we consider the full conditional distributions for \(\tau _{j}, \lambda _{j},\text { and } \sigma ^{2}\) respectively.

Table 1 Sampling distributions from joint distribution for each variable

Sample \(\tau _{j}\)

For each \(\tau _{j}\) we have the following distribution.

$$\begin{aligned} \pi (\tau _{j} | {\varvec{y}}, \varvec{\theta }, \sigma , \varvec{\lambda } ) \propto (\tau _{j}^{2})^{\frac{-1}{2}} e^{-\frac{1}{2} \left( \frac{\theta _{j}^{2} / \sigma ^{2}}{\tau _{j}^{2}} + \lambda _{j}^{2} \tau _{j}^{2} \right) } \end{aligned}$$

To find the distribution that each \(\tau _{j}\) follows, we begin by considering the following transformation (Chhikara and Folks 1988). If a random variable \(x \sim \text {Inverse Gaussian}(\mu , \lambda ')\), that is

$$\begin{aligned} f( x, \mu , \lambda ') = \left( \frac{\lambda '}{2\pi x^{3}}\right) ^{\frac{1}{2}} e^{- \frac{\lambda '(x-\mu )^2}{2\mu ^2 x}}, \end{aligned}$$

then with the change of variable \(w = x^{-1}\), we can find the density \(f'\) of \(w\) as

$$\begin{aligned} f'( w, \mu , \lambda ') = \left( \frac{\lambda '}{2\pi w}\right) ^{\frac{1}{2}} e^{- \frac{\lambda '(1- \mu w)^2}{2\mu ^{2}w}}, \end{aligned}$$

which can be written in terms of the Inverse Gaussian density \(f\) as

$$\begin{aligned} f'(w, \mu , \lambda ') = \mu w f(w, \mu ^{-1}, \lambda '\mu ^{-2}). \end{aligned}$$

Then, changing variables from \(\tau _{j}^{2}\) to \(1/\tau _{j}^{2}\), we can rewrite equation 27 as the kernel of an Inverse Gaussian density in \(1/\tau _{j}^{2}\):

$$\begin{aligned} \left( \frac{1}{\tau _{j}^{2}}\right) ^{-\frac{3}{2}} \exp \left\{ - \frac{1}{2} \left( \frac{\theta _{j}^{2}}{\sigma ^{2}\tau _{j}^{2}} + \frac{\lambda _{j}^{2}}{1/\tau _{j}^{2}}\right) \right\} \propto \left( \frac{1}{\tau _{j}^{2}}\right) ^{-\frac{3}{2}} \exp \left\{ - \frac{\theta _{j}^{2} \left( \frac{1}{\tau _{j}^{2}} - \sqrt{\frac{\lambda _{j}^{2} \sigma ^{2}}{\theta _{j}^{2}}} \right) ^{2} }{2\sigma ^{2} \frac{1}{\tau _{j}^{2}}} \right\} \end{aligned}$$

Thus \(\displaystyle \frac{1}{\tau _j^2}\) follows an Inverse Gaussian distribution with mean parameter \(\displaystyle \sqrt{\frac{\lambda _{j}^{2} \sigma ^{2}}{\theta _{j}^{2}} }\) and shape parameter \(\lambda _{j}^{2}\):

$$\begin{aligned} \frac{1}{\tau _j^2}\sim \text {Inverse Gaussian} \left( \sqrt{\frac{\lambda _{j}^{2} \sigma ^{2}}{\theta _{j}^{2}} }, \lambda _{j}^{2} \right) \end{aligned}$$
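
As a rough illustration of this step, the sketch below draws \(1/\tau _j^2\) from the Inverse Gaussian distribution above and returns its reciprocal. It is written in plain Python and uses the transformation method of Michael, Schucany and Haas as the Inverse Gaussian sampler, which is a choice made here and not necessarily the one used in the BALERGM code. The function names are hypothetical, and the sketch assumes \(\theta _j \ne 0\).

```python
import math
import random


def sample_inverse_gaussian(mu, lam, rng=random):
    """Draw from Inverse Gaussian(mean mu, shape lam).

    Uses the Michael-Schucany-Haas transformation method (an assumption of
    this sketch, not a method taken from the paper).
    """
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    # Smaller root of the quadratic that relates a chi-square(1) draw to x.
    x = mu + (mu * mu * y) / (2.0 * lam) - (mu / (2.0 * lam)) * math.sqrt(
        4.0 * mu * lam * y + mu * mu * y * y
    )
    # Keep the smaller root with probability mu / (mu + x), else use mu^2 / x.
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x


def sample_tau2(theta_j, lambda2_j, sigma2, rng=random):
    """Draw tau_j^2 by sampling 1/tau_j^2 ~ IG(sqrt(lambda^2 sigma^2 / theta^2), lambda^2).

    Assumes theta_j != 0; a zero coefficient would make the mean undefined.
    """
    mu = math.sqrt(lambda2_j * sigma2 / (theta_j * theta_j))
    inv_tau2 = sample_inverse_gaussian(mu, lambda2_j, rng)
    return 1.0 / inv_tau2
```

Sampling the reciprocal and inverting it keeps the draw consistent with the derivation, which identifies the distribution of \(1/\tau _j^2\) and not that of \(\tau _j^2\) itself.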

Sample \(\sigma ^{2}\)

Similar to the other parameters, we now look at \(\sigma ^{2}\) with the following conditional distribution:

$$\begin{aligned} \pi ( \sigma ^{2} | {\varvec{y}}, \varvec{\theta }, \varvec{\lambda }, \varvec{\tau }) \propto (\sigma ^{2} )^{ -1 - \frac{p}{2}} e ^{- \frac{1}{2\sigma ^{2}} \varvec{\theta }^{T} \text {D}_{\varvec{\tau }}^{-1} \varvec{\theta } }. \end{aligned}$$

If \(x\sim \text {Inverse Gamma }(\alpha ,\beta )\) with the shape parameter \(\alpha\) and scale parameter \(\beta\), then it has the following density function:

$$\begin{aligned} f(x,\alpha ,\beta )=\frac{\beta ^{\alpha }}{\Gamma (\alpha )}x^{-\alpha -1}e^{-\frac{\beta }{x}}. \end{aligned}$$

We can compare the conditional density (33) with (34) to find:

$$\begin{aligned} \pi ( \sigma ^{2} | {\varvec{y}}, \varvec{\theta }, \varvec{\lambda }, \varvec{\tau }) \propto \text {Inverse Gamma} \left( \frac{p}{2}, \frac{1}{2} \varvec{\theta }^{T} D_{\varvec{\tau }}^{-1} \varvec{\theta } \right) . \end{aligned}$$
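
A minimal Python sketch of this draw is given below. It assumes that \(D_{\varvec{\tau }}\) is the diagonal matrix with entries \(\tau _1^2, \ldots , \tau _p^2\), as in Park and Casella (2008), so that \(\varvec{\theta }^{T} D_{\varvec{\tau }}^{-1} \varvec{\theta } = \sum _j \theta _j^2/\tau _j^2\). The function name is illustrative. The sketch relies on the fact that the reciprocal of a Gamma variable with shape \(\alpha\) and rate \(\beta\) is Inverse Gamma with shape \(\alpha\) and scale \(\beta\).

```python
import random


def sample_sigma2(theta, tau2, rng=random):
    """Draw sigma^2 ~ Inverse Gamma(shape = p/2, scale = 0.5 * sum(theta_j^2 / tau_j^2)).

    Assumes D_tau = diag(tau_1^2, ..., tau_p^2) and at least one nonzero theta_j.
    """
    p = len(theta)
    shape = p / 2.0
    scale = 0.5 * sum(t * t / s for t, s in zip(theta, tau2))
    # random.gammavariate(alpha, beta) treats beta as a SCALE parameter,
    # so a Gamma variable with rate `scale` needs beta = 1 / scale.
    g = rng.gammavariate(shape, 1.0 / scale)
    return 1.0 / g
```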

Sample \(\lambda _{j}^{2}\)

To sample the penalty term \(\varvec{\lambda }\), we have developed three methods providing flexibility depending on the network model requirements (Table 2).

Table 2 Methods for Sampling \(\varvec{\lambda }\)

Method A: The simplest prior for the penalty term \(\lambda _j\), for \(j=1,2,3,\cdots ,p\) would be a uniform distribution, but this proved to be problematic with complex networks, particularly when a model has many parameters. Thus, following the notation of Park and Casella (2008), we propose an adaptive prior such that \(\lambda _{j}^{2} \sim \text { Gamma } (r, \delta _{j})\). That is,

$$\pi (\lambda _{j}^{2} ) = \frac{{\delta _{j}^{r} }}{{\Gamma (r)}}\left( {\lambda _{j}^{2} } \right)^{{r - 1}} e^{{ - \delta _{j} \lambda _{j}^{2} }} {\text{ for }}\lambda _{j} ,r,\delta _{j} > 0$$

For the full Bayes estimation of \(\lambda _{j}^{2}\), we have to find the following distribution.

$$\begin{aligned} \pi (\lambda _{j}^{2} | {\varvec{y}}, \varvec{\theta }, \sigma , \varvec{\tau })&\propto \frac{\lambda _{j}^{2}}{2} e^{- \frac{\lambda _{j}^{2} \tau _{j}^{2}}{2}} \left( \lambda _{j}^{2}\right) ^{r - 1}e^{-\delta _{j} \lambda _{j}^{2}} \\&= \frac{\left( \lambda _{j}^{2} \right) ^{r}}{2}\exp \left\{ - \lambda _{j}^{2} \left( \frac{\tau _{j}^{2}}{2} + \delta _{j} \right) \right\} \end{aligned}$$

This shows that the full conditional density of \(\lambda _j^{2}\) is proportional to a gamma density with shape \(\alpha = r + 1\) and rate \(\beta = \frac{\tau _{j}^{2}}{2} + \delta _{j}\), since the gamma probability density function is \(\displaystyle f(x) = \frac{\beta ^{\alpha }}{\Gamma (\alpha )} x^{\alpha - 1}e^{-\beta x}\).

Therefore we can conclude:

$$\begin{aligned} \pi (\lambda _{j}^{2} | {\varvec{y}}, \varvec{\theta }, \sigma , \varvec{\tau }) \propto \text {Gamma} \left( r + 1, \frac{\tau _{j}^{2}}{2} + \delta _{j} \right) \end{aligned}$$

where \(r \text { and } \varvec{\delta }\) are chosen constants/vectors of constants.
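
The Method A draw can be sketched in a few lines of Python. The function name is hypothetical. The one detail worth noting is the parameterization: the conditional above is stated with a rate \(\beta\), whereas Python's `random.gammavariate` expects a scale, so the sketch passes \(1/\beta\).

```python
import random


def sample_lambda2(tau2, r, delta, rng=random):
    """Draw each lambda_j^2 ~ Gamma(shape = r + 1, rate = tau_j^2 / 2 + delta_j)."""
    draws = []
    for tau2_j, delta_j in zip(tau2, delta):
        rate = tau2_j / 2.0 + delta_j
        # gammavariate takes (shape, scale); scale is the reciprocal of the rate.
        draws.append(rng.gammavariate(r + 1.0, 1.0 / rate))
    return draws
```

Because a larger \(\tau _j^2\) raises the rate, it lowers the typical draw of \(\lambda _j^2\), so weakly shrunk coefficients tend to receive smaller penalties on the next sweep.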

Method B: In contrast to the previous Method A, where the parameters \(\delta _j\), for \(j=1,2,\cdots ,p\), were treated as fixed constants, the proposed method incorporates an empirical update of the hyperparameter vector \(\varvec{\delta }\) using the Monte Carlo Expectation-Maximization (E-M) algorithm (Levine and Casella 2001). The empirical update of the parameters \(\delta _j\) is performed using the following formula:

$$\begin{aligned} \delta _{j} = \frac{r}{{\textbf{E}}_{\delta _{j}^{(k - 1)}} \left[ \lambda _{j}^{2} | \delta _{j}^{(k -1)}, \varvec{y}^{(k -1)} \right] }. \end{aligned}$$

The full derivation of this method is presented in Appendix B.

The empirical update of the parameters \(\delta _j\) using the E-M algorithm brings several advantages to the estimation process. First, it eliminates the need to specify hyperparameter values manually, because the values are estimated directly from the observed data. This data-driven approach tunes the hyperparameters to the characteristics of the data, which makes the model more flexible and adaptive. Second, it allows the model to capture features of the data that Method A, which relies on fixed hyperparameters, may not adequately account for, which can improve estimation accuracy and model performance.

Method C: This method uses a full empirical Bayes approach that estimates \(\varvec{\lambda }\) directly from the observed data without assuming any specific distribution or model. The full derivation is in Appendix B; the resulting update of \(\lambda _{j}^{2}\) is

$$\begin{aligned} \lambda _{j}^{2} = \frac{r}{ {\textbf{E}}_{\varvec{\lambda }_{j}^{(k - 1)}} \left[ \frac{\tau _{j}^{2}}{2} | \lambda _{j}^{(k -1)}, y^{(k -1)} \right] + \delta _{j} } \end{aligned}$$
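
Both empirical updates (Method B for \(\delta _j\) and Method C for \(\lambda _j^2\)) replace a sampling step with a plug-in ratio in which an expectation is estimated from recent draws. A minimal Python sketch follows. It assumes the expectation is approximated by the average of the most recent draws (the algorithm listing later in this section uses the last five). The function names and the `window` argument are illustrative and are not taken from the BALERGM package.

```python
def _recent_means(history, window):
    """Column-wise mean of the last `window` draws; each draw is a list of length p."""
    recent = history[-window:]
    p = len(recent[0])
    return [sum(draw[j] for draw in recent) / len(recent) for j in range(p)]


def update_delta(lambda2_history, r, window=5):
    """Method B: delta_j = r / E[lambda_j^2], with E estimated from recent draws."""
    return [r / m for m in _recent_means(lambda2_history, window)]


def update_lambda2(tau2_history, r, delta, window=5):
    """Method C: lambda_j^2 = r / (E[tau_j^2 / 2] + delta_j)."""
    means = _recent_means(tau2_history, window)
    return [r / (m / 2.0 + d) for m, d in zip(means, delta)]
```

A short window reacts quickly to the current state of the chain but gives a noisier estimate of the expectation; a longer window is smoother but slower to adapt.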

While Method C offers several advantages, such as adapting to the data and improving exploration of the parameter space, it also has certain disadvantages that should be considered.

One of the primary disadvantages of full empirical MCMC is its computational cost. Empirical MCMC methods typically require additional iterations and computations compared to traditional MCMC algorithms. The empirical updates of parameters or proposal distributions can be computationally intensive, particularly when dealing with large datasets or complex models. This can result in longer execution times, limiting the scalability of the method.

Another disadvantage is the potential for bias or inefficiency in the estimation process. Empirical updates rely on the observed network data to estimate the parameters and the proposal distribution of the network. If the nodal sufficient statistics are not fully representative, or the observations of nodal random variables contain outliers, the empirical estimates may introduce biases or inefficiencies in the MCMC sampling. Additionally, the convergence of Method C requires careful tuning of the other hyperparameters to achieve optimal performance. Tuning these hyperparameters can be nontrivial and may need expert knowledge or extensive experimentation.

Adaptive parallel direction sampling algorithm

There have been considerable developments in approaches to the problem of sampling from a distribution with a doubly intractable normalizing constant. One example is the easy-to-implement and more direct single variable exchange algorithm proposed by Murray et al. (2012). However, if there is strong temporal dependence in the state process and strong correlation between model parameters, the exchange algorithm mixes slowly. Caimo and Friel (2011) and Caimo and Mira (2015) build on the ideas in Murray et al. (2012) to increase MCMC sampling efficiency by combining delayed rejection and adaptive Monte Carlo techniques. First, a collection of \(H\) parallel Markov chains is generated. Then the next element of the current chain \(h\) is found using the estimates from two other chains \(h_{1}\) and \(h_{2}\), as described below.

Algorithm: Parallel Adaptive Sampling Algorithm

while \(i = 1, . . . , N\) do

Define a scalar ADS move factor \(\gamma\), for each chain \({\varvec{h}}\in \{1,2,3,\cdots , H\}\):

1. Sample two other chains \(h_1,h_2\) such that \(h_1\ne h_2\ne h\).

2. Sample the error term from a symmetric normal distribution: \(\varvec{\epsilon }\sim N({\varvec{0}},\varvec{\sigma }_{\epsilon }^2)\).

3. The sampling of \(\varvec{\theta }_h\) performs a simple random walk: \(\varvec{\theta }_h^{'}=\varvec{\theta }_h+\gamma (\varvec{\theta }_{h_1}-\varvec{\theta }_{h_2})+\varvec{\epsilon }\).

4. Sample \({\varvec{y}}'\) from \(\pi (\cdot |\varvec{\theta }_h^{'})\).

5. Accept \(\varvec{\theta }_h^{'}\) with probability \(\min (1,\frac{q({\varvec{y}}|\varvec{\theta }'_h )\pi (\varvec{\theta }'_h)q({\varvec{y}}'|\varvec{\theta }_h)}{q({\varvec{y}}|\varvec{\theta }_h)\pi (\varvec{\theta }_h)q({\varvec{y}}'|\varvec{\theta }'_h)})\)            (42)

         where \(q({\varvec{y}}|\varvec{\theta })=e^{\varvec{\theta }^{T} s({\varvec{y}})}\) is the unnormalized likelihood.

end while
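
Steps 1 to 3 of the listing above can be sketched in Python as follows. This covers only the proposal; simulating \({\varvec{y}}'\) and the acceptance test are left out. The function name, the list-of-lists layout for the chains, and the use of a single scalar standard deviation `sigma_eps` for every component of \(\varvec{\epsilon }\) are assumptions of this sketch.

```python
import random


def ads_proposal(thetas, h, gamma, sigma_eps, rng=random):
    """Propose theta_h' = theta_h + gamma * (theta_h1 - theta_h2) + eps.

    thetas    : list of H parameter vectors, one per chain (needs H >= 3)
    h         : index of the chain being updated
    gamma     : scalar ADS move factor
    sigma_eps : standard deviation of the normal jitter (assumed common
                to all components in this sketch)
    """
    others = [k for k in range(len(thetas)) if k != h]
    h1, h2 = rng.sample(others, 2)  # two distinct chains, both different from h
    return [
        thetas[h][j]
        + gamma * (thetas[h1][j] - thetas[h2][j])
        + rng.gauss(0.0, sigma_eps)
        for j in range(len(thetas[h]))
    ]
```

Because the direction comes from the current spread of the other chains, the proposal scale adapts to the posterior without a hand-tuned proposal covariance.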

The move of \(\varvec{\theta }\) is illustrated in Figure 1. Here, two other chains \(h_{1}\) and \(h_{2}\) are chosen at random. The difference between the corresponding estimates \(\varvec{\theta }_{h_1}\) and \(\varvec{\theta }_{h_2}\) in those two chains is used to determine the direction and distance of the move away from \(\varvec{\theta }_{h}\). A normal perturbation with a very small variance then slightly adjusts the proposed \(\varvec{\theta }\).

Fig. 1

The parallel ADS move of \(\varvec{\theta }_h\) is generated based on the difference of the states \(\varvec{\theta }_{h_1}\) and \(\varvec{\theta }_{h_2}\) in other Markov chains and \(\varvec{\epsilon }\) is a random error term

Bayesian adaptive lasso algorithm

In this section, we list the algorithm of the Bayesian Exponential Random Graph Model (BERGM) by Caimo et al. (2017) and the algorithm of our Bayesian Adaptive Lasso Exponential Random Graph Model (BALERGM) for easy comparison. Caimo et al. (2017) set up the exchange algorithm with a Gibbs update of \(\varvec{\theta }'\) and then \({\varvec{y}}'\) using Markov Chain Monte Carlo iterations without penalty terms. The algorithm can be written concisely as follows:

Algorithm: Bayesian Exponential Random Graph Model

while \(i = 1, . . . , N\) do

      while \(h = 1, . . ., H\) do

1. generate \(h_{1}\) and \(h_{2}\) such that \(h_{1} \ne h_{2} \ne h\)

2. generate \(\varvec{\theta }_{h}'\) from \(\gamma (\varvec{\theta }_{h_{1}} - \varvec{\theta }_{h_{2}}) + \epsilon ( \cdot |\varvec{\theta }_{h})\)

3. simulate \({\varvec{y}}'\) from \(\pi ( \cdot |\varvec{\theta }_{h}')\)

4. update \(\varvec{\theta }_{h} \rightarrow \varvec{\theta }_{h}'\) with the log of the probability

\(\text {min} \left( 0, [\varvec{\theta }_{h} - \varvec{\theta }_{h}']^{T} [s({\varvec{y}}') - s({\varvec{y}})] + \log \left[ \frac{\pi (\varvec{\theta }_{h}')}{\pi (\varvec{\theta }_{h})} \right] \right)\)

      end while

end while

Here \(s({\varvec{y}})\) and \(s({\varvec{y}}')\) are the vectors of network statistics of the observed and simulated networks, respectively.
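
The acceptance step can be sketched on the log scale as below. The log prior is passed in as a function so that the same code serves a flat prior or the normal prior \({\mathcal {N}}(0_{p}, \sigma ^{2}{\varvec{D}}_{\varvec{\tau }})\) used by BALERGM. The function names are hypothetical, and \({\varvec{D}}_{\varvec{\tau }}\) is assumed to be diagonal with entries \(\tau _j^2\).

```python
import math
import random


def normal_log_prior(sigma2, tau2):
    """Return log pi(theta) up to a constant for theta ~ N(0, sigma^2 * diag(tau2))."""
    def log_prior(theta):
        return -sum(t * t / (2.0 * sigma2 * s) for t, s in zip(theta, tau2))
    return log_prior


def log_accept_prob(theta, theta_new, s_obs, s_sim, log_prior):
    """min(0, (theta - theta')^T (s(y') - s(y)) + log pi(theta') - log pi(theta))."""
    inner = sum(
        (t - tn) * (ss - so)
        for t, tn, so, ss in zip(theta, theta_new, s_obs, s_sim)
    )
    return min(0.0, inner + log_prior(theta_new) - log_prior(theta))


def accept(log_prob, rng=random):
    """Accept with probability exp(log_prob), compared on the log scale."""
    # 1 - random() lies in (0, 1], which avoids taking log(0).
    return math.log(1.0 - rng.random()) <= log_prob
```

Working with logarithms avoids overflow when the statistics \(s({\varvec{y}})\) are large counts, and the intractable normalizing constants never appear because they cancel in the exchange ratio.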

For the new Bayesian Adaptive Lasso model, we use the parallel adaptive direction sampler employed by BERGM and combine it with Gibbs sampling to generate the samples from which we estimate \(\varvec{\theta }\).

Algorithm: Bayesian Adaptive Lasso Exponential Random Graph Model Algorithm

Require: Set the initial values for \(\varvec{\lambda }, \sigma ^{2}, \gamma\). Use ERGM to find the MPLE (the maximizer of the pseudolikelihood function) as the initial value for \(\varvec{\theta }\). Denote the samples of \(\varvec{\theta }\) in the \(h\)th chain as \(\varvec{\theta }_{h}\).

while \(i = 1, . . . , N\) do

      while \(h = 1, . . ., H\) do

1. sample \(\varvec{\theta }_{h}\) with Parallel Adaptive Direction Sampler:

         a. generate \(h_{1}\) and \(h_{2}\) such that \(h_{1} \ne h_{2} \ne h\)

         b. update \({\varvec{D}}_{\varvec{\tau }}^{-1}\)

         c. generate \(\varvec{\theta }_{h}'\) from \(\gamma (\varvec{\theta }_{h_{1}} - \varvec{\theta }_{h_{2}}) + \epsilon ( \cdot |\varvec{\theta }_{h})\)

         d. simulate \({\varvec{y}}'\) from \(\pi ( \cdot |\varvec{\theta }_{h}')\)

         e. update \(\varvec{\theta }_{h} \rightarrow \varvec{\theta }_h'\) with the log of the probability

\(\text {min} \left( 0, [\varvec{\theta }_h - \varvec{\theta }_h']^{T} [s({\varvec{y}}') - s({\varvec{y}})] + \log \left[ \frac{\pi (\varvec{\theta }_h')}{\pi (\varvec{\theta }_h)} \right] \right)\)

            where \(\pi (\varvec{\theta }) \sim {\mathcal {N}}(0_{p},\,\sigma ^{2}{\varvec{D}}_{\varvec{\tau }} )\)

2. sample \(\sigma ^{2} \text { from Inverse Gamma } ( \frac{p}{2}, \frac{1}{2}\varvec{\theta }^{T} {\varvec{D}}_{\varvec{\tau }}^{-1} \varvec{\theta } )\)

3. sample \(\frac{1}{\tau _{j}^{2}} \text { for } j = 1, 2, 3, . . ., p \text { from Inverse Gaussian } \left( \sqrt{\frac{\lambda _{j}^{2} \sigma ^{2}}{\theta _{j}^{2}} }, \lambda _{j}^{2} \right)\) and set \(\tau _{j}^{2}\) to its reciprocal

4a. full Bayes update of \(\varvec{\lambda }\)

         1. sample \({\lambda }_{j}^{2} \text { for } j = 1, 2, 3, . . ., p \text { from Gamma } ( r + 1, \frac{\tau _{j}^{2}}{2} + \delta _{j} )\)


4b. empirical update of \(\varvec{\delta }\) and update of \(\varvec{\lambda }\)

         1. update \(\delta _{j} \text { for } j = 1, 2, 3, . . ., p\) with the mean of the last five \(\varvec{\lambda }\) samples estimating the expected value.

\(\delta _{j} = \frac{r}{{\textbf{E}}_{\delta _{j}^{(k - 1)}} \left[ \lambda _{j}^{2} | \delta _{j}^{(k -1)}, y^{(k -1)} \right] }\)

         2. sample \({\lambda }_{j}^{2} \text { for } j = 1, 2, 3, . . ., p \text { from Gamma } ( r + 1, \frac{\tau _{j}^{2}}{2} + \delta _{j} )\)


4c. full empirical update of \(\varvec{\lambda }\)

         1. update \(\varvec{\lambda }\) with the mean of the last five \(\varvec{\tau }\) estimating the expected value

\(\lambda _{j}^{2} = \frac{r}{ {\textbf{E}}_{\varvec{\lambda }_{j}^{(k - 1)}} \left[ \frac{\tau _{j}^{2}}{2} | \lambda _{j}^{(k -1)}, y^{(k -1)} \right] + \delta _{j} }\)

   end while

end while

This code was built with R version 4.1.1 (2021-08-10) (R Core Team 2021). The following package versions were also used: coda 0.19-4, mcmc 0.9-7, Bergm 5.0.3, ergm.count 4.0.2, ergm 4.1.2, mvtnorm 1.1-3.

This BALERGM package is shared on Github:xxxx. (The link will be provided upon the acceptance of this paper).

Simulation and application

In this section, we show the strength of BALERGM in three ways. The first uses the Faux Dixon High School data set to simulate 100 graphs and compare BERGM and BALERGM. The results of these trials show that BALERGM is a stable model with accurate estimation and that it improves on BERGM with a higher acceptance rate, a larger effective sample size, and a lower MSE. The other two showcase the parameter selection abilities of BALERGM: the first on a simulated parameter, and the second on network data collected in a study by the Prevention Research Center at USC and Sumter County Active Lifestyles (SCAL).


The network object Faux Dixon High represents a friendship network among junior high and high school students based on data gathered by the National Longitudinal Study of Adolescent Health; see Resnick et al. (1997) for details. This study, first conducted in 1994–1995, considered more than 90,000 American students. Students were asked to list friends, and a tie is formed between them in the network if both students claimed friendship (Goodreau et al. 2008).

The final network has 248 nodes with 1,197 directed edges. Each node has three characteristics: grade, sex, and race. The grades range from 7th to 12th. Race is first delineated as Hispanic or non-Hispanic and then further split into Asian, Black, Native American, Other, and White. Figure 2 shows the network with nodes colored by grade, which makes the homophily visible.

Fig. 2

Generated in R, this plot shows the clustering of student friendship with students that have the same grade

Executing the BALERGM algorithm requires choosing network statistics with both nodal and edge attributes and structural features such as triangles and triads (Morris et al. 2008). These network statistics are counted from the adjacency matrix realization \({\varvec{y}}\) of \({\varvec{Y}}\), whose \((i,j)\) entry is \(y_{ij}\). For a directed network, the following summations demonstrate the counting procedure.

$$\begin{aligned} \text {Edges: } \sum _{i \ne j } y_{i j} \hspace{1cm} \text {Mutual Edges: } \sum _{i \ne j} y_{ij} y_{ji} \hspace{1cm} \text {Cyclic Triads: } \sum _{i \ne j \ne k } y_{ij} y_{jk} y_{ki} \end{aligned}$$
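
The counting procedure can be illustrated with a small Python sketch that evaluates these sums on a 0/1 adjacency matrix stored as a list of lists. The function name is hypothetical, and the cyclic triad is taken here to be the directed 3-cycle \(i \rightarrow j \rightarrow k \rightarrow i\). The sums run over ordered index tuples, so each mutual dyad is counted twice and each 3-cycle three times; dividing by 2 and 3 gives the numbers of distinct dyads and cycles.

```python
def network_statistics(y):
    """Edge, mutual-edge, and cyclic-triad sums for a 0/1 adjacency matrix y."""
    n = len(y)
    edges = sum(y[i][j] for i in range(n) for j in range(n) if i != j)
    mutual = sum(
        y[i][j] * y[j][i] for i in range(n) for j in range(n) if i != j
    )
    cyclic = sum(
        y[i][j] * y[j][k] * y[k][i]
        for i in range(n)
        for j in range(n)
        for k in range(n)
        if i != j and j != k and i != k
    )
    return edges, mutual, cyclic


# A single directed 3-cycle 0 -> 1 -> 2 -> 0
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(network_statistics(cycle))  # (3, 0, 3)
```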

A natural network statistic for these data is the number of instances of homophily between students in the same grade, since, as seen in Fig. 2, nodes with the same attribute (in this case grade) appear visually to have more connections. As seen in Table 3, with the diagonal entries of the mixing matrix from Grade \(i\) to Grade \(i\) for \(i \in \{7, 8, 9, 10, 11, 12\}\), most of the connections are between students in the same grade. This feature can be included in network models with the ergm term nodematch.

Table 3 Summary table for the connections among different grades. The \(i-j\) position in this table shows the number of connections from Grade i to Grade j, \(i=7,8,9,10, j=7,8,9,10\)


To demonstrate the overall effectiveness of BALERGM, we conducted a comparative analysis between BALERGM and BERGM (Caimo and Friel 2014). Our evaluation involved the generation of 100 independent exponential random graphs using the Faux Dixon High dataset, with known and fixed parameters \(\varvec{\theta }\). Specifically, we focused on two selected network statistics: the count of edges in the network ( \(\theta _{1}:\) edges) and the count of occurrences of homophily, where students of the same grade have a friendship connection ( \(\theta _{2}:\) nodematch.Grade). Without loss of generality, we fixed the parameter values \(\varvec{\theta }=(-4.8, 2.3)\) and generated 100 independent exponential random graphs based on the Faux Dixon High dataset, considering them as new instances with associated node attributes. This approach allowed us to create 100 distinct opportunities to estimate the parameter vector \(\varvec{\theta }\) using both the BALERGM and BERGM algorithms, enabling a comprehensive performance comparison against the true parameter values \(\varvec{\theta }=(-4.8, 2.3)\).

In each run of BERGM and BALERGM, the main chain for either model consists of 2000 iterations with a burn-in of 50 iterations. Across the 100 simulations, each model generates a sequence of sampled values for each component of \(\varvec{\theta }\) in each simulation. To confirm the stability of the model, the MCMC output in Fig. 3 shows the strength and stability of the BALERGM algorithm after relatively few iterations. The unimodal distribution of the samples is on the left, and the center column shows the trace of the samples, indicating a stable estimation process. The final column shows the autocorrelation plot, in which the autocorrelation decreases quickly with lag; by lag 50 it is minimal.

Fig. 3

MCMC output: Distribution of samples on the left, the trace of samples in the center, autocorrelation plot on the right

Using both the mean and the median of these samples as estimates, we can compare several metrics. Table 4 compares the acceptance rate of generated samples for each run, the mean effective sample size, and the mean squared error (MSE) of the mean-based and median-based estimates relative to the chosen true values, where \(MSE=\frac{1}{n}{\varvec{e}}^{\top }{\varvec{e}}\) and \({\varvec{e}}\) is the error vector, that is, \({\varvec{e}}=\hat{\varvec{\theta }}-(-4.8, 2.3)\).

Table 4 shows that, whether the mean or the median of the MCMC samples is used as the estimate of \(\varvec{\theta }\): (1) BALERGM has a better overall acceptance rate and effective sample size on average than BERGM. The acceptance rate, that is, the percentage of generated samples that are accepted in the MCMC process, is increased. This implies that BALERGM adjusts to the true parameter for each single variable faster than BERGM. (2) BALERGM offers an improvement over BERGM with a lower mean squared error (MSE). The MSE is dramatically lower with BALERGM regardless of whether the mean or the median of the MCMC samples is used as the estimate of \(\varvec{\theta }\). This can be seen in the quantiles for each estimate of \(\varvec{\theta }\): the true values are \(\varvec{\theta } = (-4.8, 2.3)\), and the BALERGM estimates are much closer to them.

Table 4 Results of both BERGM and BALERGM using formula y \(\sim\) edges + nodematch(“Grade”)

In Table 5, the true known value of each \(\varvec{\theta }\) is estimated by either the mean or median of the generated samples. The quantiles for estimates of \(\varvec{\theta }\) show the spread of each estimate.

Table 5 Results of simulating 100 graphs and comparing results for BERGM and BALERGM using means as the estimates of \(\varvec{\theta }\)

Once results are generated, the estimates of \(\varvec{\theta }\) can be used to calculate the probability of a tie. Using the previous example, if no new tie is created, so that the change statistic for \(\theta _{1}\) is zero, then the probability of a tie between students of the same grade can be calculated as follows.

$$\begin{aligned} P(Y_{ij} = 1 | \theta _{2} = 2.8550 ) = \frac{e^{2.8550}}{1 + e^{2.8550 }} = 0.945577. \end{aligned}$$

That means the conditional probability of observing an edge (not involved in the creation of other network statistics included in the model) is about \(94.56\%\).
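
The calculation above is the logistic (inverse logit) transform of the estimated coefficient. A short Python check, with a hypothetical helper name, reproduces the figure:

```python
import math


def tie_probability(log_odds):
    """Inverse logit: conditional tie probability implied by a log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p = tie_probability(2.8550)
print(round(p, 4))  # about 0.9456, matching the value in the text
```

More generally, the log-odds of a tie is the inner product of \(\varvec{\theta }\) with the vector of change statistics for that tie, so the same function applies once that sum has been formed.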

Variable selection

BALERGM not only improves sampling efficiency compared to previous models but also demonstrates strong performance in variable selection through its adaptive lasso component. This indicates the ability of the model to identify and highlight parameters that are either more or less significant to the network structure. The example using the following simulated dataset showcases the effectiveness of BALERGM in terms of variable selection.

For this example, we again use the Faux Dixon High School dataset. The chosen network statistics are the count of the edges in the network (edges), the count of the occurrences of homophily where students of the same grade have a friendship connection (nodematch.Grade), and a third, artificially created term: the count of the occurrences of homophily for a generated nodal attribute representing the wealth of a parent (nodematch.Wealth). This additional nodal attribute “Wealth” is generated from the uniform distribution between 20 and 75. Because this nodal variable is generated uniformly at random, it is intentionally designed to have no impact on the network structure. Our objective is to test whether BALERGM can effectively identify and exclude this artificially created insignificant nodal variable. Running the BALERGM algorithm produces the following results, which show that the model estimates \(\theta _3\) to be close to zero. This outcome aligns with our expectations, as the variable has no meaningful influence on the network structure (Table 6, Fig. 4).

Fig. 4

MCMC output: Distribution of samples on the left, the trace of samples in the center, autocorrelation plot on the right


Table 6 Results of BALERGM using formula y \(\sim\) edges + nodematch(“Grade”) + nodematch(“Wealth”)

Sumter county active lifestyles (SCAL) network analysis

With reports by The US Burden of Disease Collaborators (2018) of worsening metrics of American health, communities are working on addressing and understanding the factors that might improve health outcomes. To this end, the University of South Carolina Prevention Research Center and Sumter County Active Lifestyles (SCAL) based in Sumter County, South Carolina conducted a respondent-driven sampling study in 2014 to better understand the dynamics of social networks and health outcomes.

In this study, community ambassadors chosen for their history of community involvement were given a set compensation for their participation. Each ambassador was instructed to share the survey with those in their social network. Each of these respondents was also compensated for both completion of the survey and sharing the survey with others that completed the survey. Using referral codes, a network can be created with nodes representing survey respondents, edges formed by survey sharing, and nodal characteristics from the results of the survey. The final network has 80 nodes with the data for 30 questions for each respondent. Figure 5 is the network plot labeled with one of the 30 questions: “Have you heard of a group called Sumter County Active Lifestyles (SCAL)?”

The survey was intended to be a brief but broad look at self-reported health benchmarks. Questions cover demographic characteristics revealing that the respondents are primarily white (87%), female (78%), likely to be older than 50 (44%), and more educated with 46% being college graduates. Other questions focused on self-reported health outcomes and activities including exercise habits, eating habits, and social support dynamics. The question forms included qualitative questions about physical activities and opportunities for physical activities in the community. For the purposes of this network, network attributes were assigned using the answers to only multiple-choice questions.

Fig. 5

Generated in R, this plot shows results of asking “Have you heard of a group called Sumter County Active Lifestyles (SCAL)?”

The resulting network contains many nodal attributes where ERGM and BERGM cannot be applied effectively. This motivates a model like BALERGM which enables understanding which of these network statistics contribute less to the network structure.


Using the SCAL data set from the previous section, we use the network statistics in Table 7 to analyze this model, where we use the ergm terms “nodematch”, “nodefactor” and “nodecov”. These three terms relate nodal attributes to tie formation. “nodematch” measures homophily: it counts the edges whose two endpoints share the same value of a given attribute. “nodecov” is a main effect for a quantitative attribute: it sums the attribute values of the two endpoints over all edges. “nodefactor” is a main effect for a categorical attribute: it creates a network statistic for each level of the attribute and counts the number of times a node with that level appears in an edge. For more details, see Morris et al. (2008).

Table 7 shows the BALERGM output on the SCAL social network. Here the sparsity of the network can be seen in the large negative values for the network statistics for edges and the out-degree of the nodes. While the standard deviations vary with each estimate, the MCMC outputs show stable estimation with symmetric distributions, as the quantile values indicate. Table 7 provides valuable insights into the relationships between different variables in the network analysis. One interesting observation is that individuals who maintain a healthy diet (\(\theta _6\) to \(\theta _{13}\) are all positive) tend to have positive connections with each other. This suggests a clustering effect among individuals with similar dietary habits, indicating a potential influence of shared health-conscious behaviors on network connections.

Furthermore, the result highlights that participation in a walking program (variables \(\theta _{18}\) to \(\theta _{19}\)) is positively associated with network connections. This implies that individuals who engage in walking programs are more likely to know each other within the network. This finding suggests a potential social bonding effect among individuals who actively participate in health-promoting activities, leading to the formation of connections and social ties.

Table 7 Results from BALERGM with variable selection on SCAL data

The adaptive lasso penalty in BALERGM shrinks the \(\varvec{\theta }\) values of network statistics that are less significant to the network structure. Depending on the model and network conditions, a parameter estimate might not reach exactly zero. For example, the estimates \(\theta _{26} = -.001\) and \(\theta _{20} = -.076\) are both small, but the mean of the generated samples alone does not allow a nuanced ranking of how significant each parameter is. Using the distribution of each \(\theta _j\) found in the Gibbs sampling process, we can instead measure how far the probability that \(\theta _j\) is negative lies from one half, and denote this distance by \(\alpha\):

$$\begin{aligned}| P(\varvec{\theta } < 0 ) - 0.5 | = \alpha \end{aligned}$$

A larger value of \(\alpha\) indicates a higher importance of the variable in the context of the model, which makes it possible to rank variables. Table 8 shows the parameters that are less significant to the structure of the network at various significance levels. For example, the network statistic for the age of the participant (\(\theta _{26}\)) is less significant than the network statistic for having heard of the SCAL program (\(\theta _{20}\)). While neither is a primary factor, BALERGM gives researchers insights into the social dynamics of Sumter County, allowing for targeted programs to improve health outcomes.
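
The ranking can be sketched in Python as follows, assuming the posterior draws of each \(\theta _j\) are available as a list. The function names and the dictionary layout are illustrative, and \(P(\theta _j < 0)\) is estimated by the fraction of draws below zero.

```python
def selection_alpha(samples):
    """alpha = |P(theta < 0) - 0.5|, with P estimated from posterior draws."""
    below = sum(1 for s in samples if s < 0)
    return abs(below / len(samples) - 0.5)


def rank_parameters(draws_by_name):
    """Return (name, alpha) pairs sorted from least to most important."""
    alphas = {name: selection_alpha(d) for name, d in draws_by_name.items()}
    return sorted(alphas.items(), key=lambda item: item[1])
```

A coefficient whose draws straddle zero evenly has \(\alpha\) near 0, while one whose draws fall almost entirely on one side has \(\alpha\) near 0.5, so thresholding \(\alpha\) at a chosen level yields the kind of selection summarized in Table 8.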

Table 8 Variable Selection with Different Tolerance Levels
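The ranking rule above can be sketched in a few lines. This is a minimal illustration, assuming the Gibbs draws for each parameter are available as plain lists; the function names `importance` and `rank_parameters`, the parameter labels, and the simulated draws are hypothetical and are not taken from the authors' R code.

```python
import random

def importance(draws):
    """Estimate alpha = |P(theta < 0) - 0.5| from posterior draws."""
    below = sum(1 for d in draws if d < 0)
    return abs(below / len(draws) - 0.5)

def rank_parameters(samples):
    """Return (name, alpha) pairs sorted from least to most important."""
    scores = {name: importance(draws) for name, draws in samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Hypothetical draws: one parameter centred at zero, one clearly negative.
rng = random.Random(0)
samples = {
    "theta_near_zero": [rng.gauss(0.0, 0.05) for _ in range(5000)],
    "theta_negative": [rng.gauss(-3.0, 0.5) for _ in range(5000)],
}
ranking = rank_parameters(samples)
for name, alpha in ranking:
    print(f"{name}: alpha = {alpha:.3f}")
```

A parameter whose draws straddle zero gets \(\alpha\) close to 0, while one whose draws sit almost entirely on one side of zero gets \(\alpha\) close to 0.5. Parameters below a chosen tolerance can then be reported as less significant.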

This example highlights the powerful functionalities of BALERGM, particularly in the context of variable selection and importance ranking in network analysis. In network studies, the presence of numerous network variables is common. The identification of the most relevant variables is crucial as it enables researchers to concentrate their analysis and interpretation on the factors that significantly influence the network’s structure and behavior. By focusing on these key variables, we can gain a deeper understanding of the underlying mechanisms that drive network dynamics.

Goodness of fit

To assess the performance and goodness of fit of ERGMs, various diagnostics can be employed. These diagnostics compare key statistics calculated from the observed network with those obtained from networks simulated from the estimated parameters. In the Bayesian framework, evaluating goodness of fit involves posterior predictive assessment: the observed network is compared with a collection of networks simulated from the posterior distribution of the model parameters, as described by Caimo and Friel (2011).

The set of statistics used for the comparison contains the degree distributions, the minimum geodesic distance, and the number of edge-wise shared partners. Since the SCAL network is a directed graph, the degree distributions for both in-degree and out-degree are included. Because the graph includes isolated nodes and clusters, there is no path between some pairs of nodes. For those pairs the minimum geodesic distance, that is, the minimum number of edges needed to connect two nodes, is infinite, which produces the spike in the plot for minimum geodesic distance in Fig. 6. Finally, the edge-wise shared partners are concentrated in the lower values, since connected nodes have few partners in common.
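To make the infinite-distance spike concrete, the following sketch tabulates minimum geodesic distances in a small directed graph using breadth-first search. It is an illustration only: the function name `geodesic_distances` and the five-node edge list are hypothetical, and unreachable ordered pairs are recorded under `math.inf`.

```python
from collections import Counter, deque
import math

def geodesic_distances(n, edges):
    """Count ordered node pairs by shortest directed path length.

    Pairs with no connecting path are counted under math.inf.
    """
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
    counts = Counter()
    for source in range(n):
        # Breadth-first search from this source node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target in range(n):
            if target != source:
                counts[dist.get(target, math.inf)] += 1
    return counts

# Hypothetical sparse network: a short chain plus a separate pair.
edges = [(0, 1), (1, 2), (3, 4)]
counts = geodesic_distances(5, edges)
for length in sorted(counts):
    print(length, counts[length])
```

In a sparse network with disconnected clusters, most ordered pairs fall into the infinite category, which is the pattern described for the SCAL data.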

Fig. 6 Bayesian goodness of fit diagnostic for the estimated parameter posterior distribution for the BALERGM model on the SCAL dataset

The Bayesian goodness of fit diagnostic in Fig. 6 evaluates the implemented model in section 8. The observed network is compared with 300 randomly simulated network samples drawn from the estimated posterior distribution using 50 auxiliary iterations for the network simulation step. Figure 6 illustrates the summary results of these 300 generated graphs in black and gray, alongside the original network represented in red. The comparison reveals a strong alignment in the high-level characteristics that are not explicitly modeled. This indicates that the posterior mean obtained through BALERGM accurately generates networks with corresponding structures.
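The logic of this posterior predictive comparison can be sketched as follows. The sketch makes a strong simplifying assumption: it uses an edges-only ERGM, in which ties are independent with probability logistic(\(\theta\)), as a stand-in for the full BALERGM, so that networks can be simulated exactly without the auxiliary MCMC step. The posterior draws, the "observed" network, and all function names are hypothetical.

```python
import math
import random

def out_degree_distribution(n, edges):
    """Return counts[k] = number of nodes with out-degree k, k = 0..n-1."""
    deg = [0] * n
    for u, _ in edges:
        deg[u] += 1
    counts = [0] * n
    for d in deg:
        counts[d] += 1
    return counts

def simulate_edges_only(n, theta, rng):
    """Draw a directed graph from an edges-only ERGM (independent ties)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < p]

def predictive_band(stat_vectors, lower=0.025, upper=0.975):
    """Pointwise empirical quantile band across simulated statistic vectors."""
    band = []
    for values in zip(*stat_vectors):
        ordered = sorted(values)
        lo = ordered[int(lower * (len(ordered) - 1))]
        hi = ordered[math.ceil(upper * (len(ordered) - 1))]
        band.append((lo, hi))
    return band

rng = random.Random(1)
n = 30
# Hypothetical posterior draws for the edges parameter of a sparse network.
theta_draws = [rng.gauss(-2.5, 0.1) for _ in range(300)]
observed = simulate_edges_only(n, -2.5, rng)  # stand-in for the observed network
obs_stat = out_degree_distribution(n, observed)
sim_stats = [out_degree_distribution(n, simulate_edges_only(n, t, rng))
             for t in theta_draws]
band = predictive_band(sim_stats)
covered = sum(lo <= s <= hi for s, (lo, hi) in zip(obs_stat, band))
print(f"{covered} of {n} out-degree counts fall inside the 95% band")
```

When most observed statistics fall inside the band built from the simulated networks, the fitted model reproduces that feature of the data, which is the kind of alignment the red line and the black and gray summaries display in the figure.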


The Bayesian adaptive lasso exponential random graph model (BALERGM) offers several notable advantages in the field of network analysis. One key advantage is its ability to perform automatic variable selection, facilitated by the integration of the Lasso regularization technique. By employing the Lasso penalty, BALERGM effectively identifies and emphasizes the most relevant network parameters while shrinking less significant ones towards zero. This feature streamlines the modeling process and extracts valuable insights from intricate network data, enhancing the interpretability of the results. Moreover, the Lasso penalty promotes sparsity in parameter estimates, resulting in a more parsimonious model that aids in discerning the influential factors governing network behavior.

Another compelling advantage of BALERGM is its superior adaptive estimation performance. Through the adaptive adjustment of penalties for each parameter, the model swiftly adapts to the data, allowing it to focus on the most relevant network parameters and capture underlying patterns and relationships more effectively. Researchers can readily select key network factors based on their significance levels, providing valuable insights and actionable knowledge.

We have shown the effectiveness of the proposed algorithm in simulation and compared its performance against the currently popular BERGM method. One promising direction for future work involves the development of more generalized penalized forms within the context of network analysis. While the Lasso penalty has demonstrated its efficacy in variable selection and sparsity promotion, the incorporation of ridge penalty distributions could offer additional benefits. Combining the strengths of both Lasso and ridge penalties would strike a balance between model complexity and over/underfitting issues, leading to more robust parameter estimation.

Availability of data and materials

The Faux Dixon High dataset is available in the ERGM R package (Hunter et al. 2008). The SCAL dataset can be requested from the University of South Carolina Prevention Research Center by filling out an interest form. The R code used during the current study is available in the GitHub repository




This work was not supported by any funding.

Author information

Authors and Affiliations



RP and DH conceived the research. MF collected the data. DH developed the mathematical model and designed the MCMC algorithm. DH and VM generated the code and analyzed the data. DH and VM drafted the first version of the manuscript with input from all authors. All authors contributed to the critical revision of the manuscript for important intellectual content. All authors have seen and approved the final version and agreed to its publication. DH and VM had full access to all the data in the study and take responsibility for the accuracy of the mathematical analysis.

Corresponding author

Correspondence to Dan Han.

Ethics declarations

Ethics approval and consent to participate

Not Applicable.

Competing interests

The authors declare no competing financial or non-financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix A: proof of unimodal posterior

The chosen prior needs to result in a unimodal posterior, both for faster convergence of the Gibbs sampler and for confidence that the estimates found correspond to the global mode.


The joint posterior distribution is unimodal for typical choices of \(\pi (\sigma ^{2})\) and any choice of \(\lambda \ge 0\).


We begin by representing the joint distribution of \(\theta\) and \(\sigma ^{2} > 0\) using distributions already defined.

$$\begin{aligned} \pi (\varvec{\theta }, \sigma ^{2} )&\propto \pi (\varvec{\theta }| \sigma ^{2})\pi (\sigma ^{2}) \end{aligned}$$
$$\begin{aligned}&= \prod \limits _{j=1}^p\frac{\lambda _{j}}{2\sqrt{\sigma ^2}} e^{-\lambda _{j} |\theta _j|/\sqrt{\sigma ^2}} \frac{1}{\sigma ^{2}} \end{aligned}$$

We have chosen the prior such that \(\pi (\sigma ^{2}) \propto \frac{1}{\sigma ^{2}}\), following the recommendation in the literature (Park and Casella 2008).

We wish to show that the posterior is unimodal in the sense that every upper-level set \(\{(\varvec{\theta }, \sigma ^{2}) | \pi ( \varvec{\theta }, \sigma ^{2})> x, \sigma ^{2}>0 \}\), for \(x > 0\), is connected. We will show this under a continuous transform with continuous inverse, since the continuous image of a connected set is connected (Munkres and Davis 2018).

The posterior is shown here

$$\begin{aligned} \pi (\varvec{\theta }, \sigma ^{2} | {\varvec{y}})&\propto \pi ({\varvec{y}}|(\varvec{\theta }, \sigma ^{2}))\pi (\varvec{\theta }, \sigma ^{2}) \end{aligned}$$
$$\begin{aligned}&= \pi ({\varvec{y}}|(\varvec{\theta }, \sigma ^{2}))\pi (\varvec{\theta }| \sigma ^{2})\pi (\sigma ^{2}) \end{aligned}$$
$$\begin{aligned}&= \frac{1}{z(\varvec{\theta })}e^{\varvec{\theta }^{T}s({\varvec{y}})} \prod \limits _{j=1}^p\frac{\lambda _j}{2\sqrt{\sigma ^2}} e^{-\lambda _j |\theta _j|/\sqrt{\sigma ^2}} \frac{1}{\sigma ^{2}} \end{aligned}$$
$$\begin{aligned}&=\frac{e^{\varvec{\theta }^{T}s({\varvec{y}})}}{z(\varvec{\theta })} \frac{1}{\sigma ^{2}} \frac{1}{2^{p}\sqrt{\sigma ^2}^{p}} \prod \limits _{j=1}^p \lambda _{j}e^{-\lambda _j |\theta _j|/\sqrt{\sigma ^2}} \end{aligned}$$

We now take the natural log of the equation above.

$$\begin{aligned} \ln \pi (\varvec{\theta }, \sigma ^{2} | {\varvec{y}}) = -\ln (\sigma ^{2}) + \varvec{\theta }^{T} s({\varvec{y}}) - \sum _{j = 1}^{p} \lambda _{j} |\theta _{j}|\frac{1}{\sqrt{\sigma ^{2}}} + \sum _{j = 1}^{p} \ln (\lambda _{j}) - p\ln (2) - \frac{p}{2} \ln (\sigma ^{2}) \end{aligned}$$

The following transform allows for easier calculation.

$$\begin{aligned} \phi _{j} \leftrightarrow \frac{\theta _{j}}{\sqrt{\sigma ^{2}}} \qquad \rho \leftrightarrow \frac{1}{\sqrt{\sigma ^{2}}} \qquad j = 1, 2, 3,...p \end{aligned}$$

This is continuous with a continuous inverse when \(0< \sigma ^{2} < \infty\), so the upper-level sets for the original parameters correspond under the transformation to upper-level sets for the transformed parameters.

Let \(\varvec{\phi } = (\phi _{1}, \phi _{2}, \ldots , \phi _{p})^{T}\) be the column vector for ease of notation. Because the transform is one-to-one and continuous for \(0< \sigma ^{2} < \infty\), the unimodality of the transformed expression is equivalent to the unimodality of the original expression.

Using the transform and algebra we get the following expression

$$\begin{aligned} \begin{aligned} h(\varvec{\phi }, \rho )&= \ln \rho ^{2} + (\sqrt{\sigma ^{2}} \varvec{\phi })^{T}s({\varvec{y}}) - \sum _{j = 1}^{p} \lambda _{j}|\phi _{j}| + \frac{p}{2}\ln (\rho ^{2})\\&= (p + 2)\ln (\rho ) + \frac{\varvec{\phi }^{T}s({\varvec{y}})}{\rho } - \sum _{j = 1}^{p} \lambda _{j}|\phi _{j}| \end{aligned} \end{aligned}$$
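The algebra behind this expression can be spelled out term by term. Under the transform, \(\sigma ^{2} = 1/\rho ^{2}\) and \(\theta _{j} = \phi _{j}/\rho\), so each term of the log posterior becomes

$$\begin{aligned} -\ln (\sigma ^{2})&= 2\ln (\rho ), \qquad -\frac{p}{2}\ln (\sigma ^{2}) = p\ln (\rho ), \\ \varvec{\theta }^{T}s({\varvec{y}})&= \frac{\varvec{\phi }^{T}s({\varvec{y}})}{\rho }, \qquad \sum _{j = 1}^{p} \lambda _{j}|\theta _{j}|\frac{1}{\sqrt{\sigma ^{2}}} = \sum _{j = 1}^{p} \lambda _{j}|\phi _{j}|. \end{aligned}$$

Adding the two logarithmic terms gives the coefficient \((p + 2)\) on \(\ln (\rho )\). The constants \(\sum _{j = 1}^{p} \ln (\lambda _{j}) - p\ln (2)\) do not depend on \((\varvec{\phi }, \rho )\) and are therefore omitted from \(h\).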

We can show that \(h(\varvec{\phi }, \rho )\) in (A8) is unimodal by showing that it is a concave function of \((\varvec{\phi }, \rho )\). We will do that by considering each term of the expression in turn.

$$\begin{aligned} h_{1} = \ln (\rho ) \qquad h_{2} = \frac{\varvec{\phi }^{T}s({\varvec{y}})}{\rho } \qquad h_{3} = - \sum _{j = 1}^{p} \lambda _{j}|\phi _{j}| \end{aligned}$$

We will determine the concavity of the first two functions by checking the spectral property of the corresponding Hessian matrix.

$$\begin{aligned} H_{h_i} = \begin{bmatrix} \frac{\partial ^{2}h_i}{\partial \phi ^{2}} &{} \frac{\partial ^{2}h_i}{\partial \varvec{\phi } \partial \rho } \\ \frac{\partial ^{2}h_i}{\partial \rho \partial \varvec{\phi }} &{} \frac{\partial ^{2}h_i}{\partial \rho ^{2}} \end{bmatrix}, i=1,2. \end{aligned}$$

For the first term and the second term \(h_{1} = \ln (\rho )\) and \(h_{2} = \frac{\varvec{\phi }^{T}s({\varvec{y}})}{\rho }\), the corresponding Hessian matrix \(H_{h_1}\) and \(H_{h_2}\) are both negative semi-definite and thus \(h_{1}\) and \(h_{2}\) are both concave in \((\varvec{\phi }, \rho )\).

For the third term \(h_{3} = - \sum _{j = 1}^{p} \lambda _{j}|\phi _{j}|\), we see that this is a sum of negative constants times absolute value functions. This is a concave function of \((\varvec{\phi }, \rho )\), since the \(j\)-th term in \(h_3\) is \(h_{3}(j) = - \lambda _{j}|\phi _{j}|\), which is a concave function of \(\phi _{j}\), and the sum of concave functions is a concave function.

By the same reasoning, since the sum of concave functions is concave, (A8) is concave, and hence the log posterior is concave in the transformed parameters.

Therefore, we can conclude that our posterior distribution is unimodal.

Appendix B: empirical Bayes

The Monte Carlo expectation-maximization algorithm for empirical Bayes estimation of hyperparameters proposed by Levine and Casella (2001) essentially treats the parameters as missing data and then uses the E-M algorithm to iteratively approximate the hyperparameters, substituting Monte Carlo estimates for any expected values that cannot be computed explicitly. For BALERGM, the Gibbs sampler is used to estimate the expected values.

Method B: Estimating \(\delta _{j}\)

To begin this process, we consider the part of the joint distribution that depends on \(\varvec{\delta }\), since when taking the derivative all other terms will become zero.

$$\begin{aligned} \pi (\varvec{y}, \varvec{\theta }, \varvec{\delta }) = \frac{\delta _{j}^{r}}{\Gamma (r)} \left( \lambda _{j}^{2} \right) ^{(r -1)} e^{-\delta _{j}\lambda _{j}^{2}} \times \text { terms not involving } \delta _{j}. \end{aligned}$$

We then take the natural log of the resulting equation.

$$\begin{aligned} \ln \pi (\delta _{j} | \varvec{y}, \varvec{\theta }) = r\ln (\delta _{j}) - \delta _{j}\lambda _{j}^{2} + \text { terms not involving } \delta _{j}. \end{aligned}$$

1. Expectation step

$$\begin{aligned} Q(\delta _{j} | \delta _{j}^{(k - 1)} , \varvec{y}^{(k-1)})&= {\mathbb {E}}_{\delta ^{(k-1)}} \left[ \ln \pi (\delta _{j} |\varvec{y}, \varvec{\theta }) | \delta _{j}^{(k - 1)} , \varvec{y}^{(k-1)}\right] \\&= r \ln (\delta _{j} ) - \delta _{j} {\mathbb {E}}\left[ \lambda _{j}^{2} | \delta _{j}^{(k - 1)} , \varvec{y}^{(k-1)} \right] + \text { terms not involving } \delta _{j} \end{aligned}$$

2. Maximization step

$$\begin{aligned} \delta _{j}^{(k)}&= \arg \max _{\delta _{j}} Q(\delta _{j} | \delta _{j}^{(k-1)}, \varvec{y}^{(k-1)}). \end{aligned}$$


Setting the derivative with respect to \(\delta _{j}\) to zero,

$$\begin{aligned} \frac{\partial Q}{\partial \delta _{j}}&= \frac{r}{\delta _{j}} - {\mathbb {E}}\left[ \lambda _{j}^{2} | \delta _{j}^{(k - 1)} , \varvec{y}^{(k-1)}\right] = 0 , \end{aligned}$$

we get

$$\begin{aligned} \delta _{j}^{(k)} = \frac{r}{{\mathbb {E}}\left[ \lambda _{j}^{2} | \delta _{j}^{(k - 1)}, \varvec{y}^{(k-1)}\right] }. \end{aligned}$$
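In practice the expectation in this update is replaced by a sample average over Gibbs draws. The following is a minimal sketch of that Monte Carlo M-step for a single \(\delta _{j}\); the function name `update_delta` and the example draws are hypothetical, and \(r\) is treated as a fixed, user-supplied shape value.

```python
def update_delta(r, lambda_sq_draws):
    """M-step for one delta_j: r divided by the Monte Carlo mean of lambda_j^2."""
    if not lambda_sq_draws:
        raise ValueError("need at least one Gibbs draw of lambda_j^2")
    # Sample average stands in for E[lambda_j^2 | previous hyperparameters].
    expected_lambda_sq = sum(lambda_sq_draws) / len(lambda_sq_draws)
    return r / expected_lambda_sq

# Hypothetical Gibbs draws of lambda_j^2 for a single parameter.
draws = [0.8, 1.1, 1.3, 0.9, 0.9]
delta_new = update_delta(r=1.0, lambda_sq_draws=draws)
print(delta_new)
```

Calling this once per parameter after each Gibbs run gives the next set of hyperparameter values, which are then used for the following run.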

Method C: Estimating \(\lambda _{j}\)

The empirical process of estimating \(\lambda _{j}\) begins with the joint distribution terms that depend on \(\lambda _{j}\).

$$\begin{aligned}&\pi ( \varvec{\theta }, \varvec{\lambda }, \sigma ^{2}, \varvec{\tau }| \varvec{y}, s(\varvec{y})) \propto \pi (\varvec{y}| \varvec{\theta }) \pi (\varvec{\theta }| \sigma ^{2}, \varvec{\tau }) \prod _{j=1}^{p} \pi ( \tau _{j}^{2} | \lambda _{j}^{2}) \pi (\lambda _{j}^{2})\pi(\sigma^2) \end{aligned}$$
$$\begin{aligned}= \frac{e^{\varvec{\theta }^{T}s(\varvec{y})}}{z(\varvec{\theta })} \prod _{j=1}^{p}\frac{1}{\sqrt{2\pi \tau _{j}^{2}}} \exp \left\{ -\frac{1}{2\tau _{j}^{2}}\theta _{j}^{2}\right\} \frac{\lambda _{j}^{2}}{2} \exp \left\{ -\frac{\lambda _{j}^{2}\tau _{j}^{2}}{2}\right\} \frac{\delta _{j}^{r}}{\Gamma (r)} \left( \lambda _{j}^{2}\right) ^{r-1} e^{-\delta _{j}\lambda _{j}^{2}}\frac{1}{\sigma^2} \end{aligned}$$

Next, we take the natural log

$$\begin{aligned} \ln \pi (\varvec{\theta }, \varvec{\lambda }, \sigma ^{2}, \varvec{\tau }| \varvec{y}, s(\varvec{y})) = \sum _{j=1}^{p} \left[ r\ln (\lambda _{j}^{2}) - \lambda _{j}^{2}\left( \frac{\tau _{j}^{2}}{2} + \delta _{j} \right) \right] + \text { terms not involving } \varvec{\lambda }. \end{aligned}$$

1. Expectation step

$$\begin{aligned} \begin{aligned} Q(\varvec{\lambda } | \varvec{\lambda }^{(k-1)}, \varvec{y}^{(k-1)})&= {\mathbb {E}}_{\varvec{\lambda }^{(k-1)}}\left[ \ln \pi (\varvec{\theta }, \varvec{\lambda }, \sigma ^{2}, \varvec{\tau }| \varvec{y}, s(\varvec{y})) | \varvec{\lambda }^{(k-1)}, \varvec{y}^{(k-1)}\right] \\&= \sum _{j=1}^{p} r\ln (\lambda _{j}^{2}) - \sum _{j=1}^{p}\lambda _{j}^{2} \left( \frac{1}{2}{\mathbb {E}}_{\varvec{\lambda }^{(k-1)}} \left[ \tau _{j}^{2}| \varvec{y}^{(k-1)}, \varvec{\lambda }^{(k-1)}\right] +\delta _{j} \right) + \text { terms not involving } \varvec{\lambda }. \end{aligned} \end{aligned}$$

2. Maximization step

$$\begin{aligned} \varvec{\lambda }^{(k)} = \arg \max _{\varvec{\lambda }} Q(\varvec{\lambda } | \varvec{\lambda }^{(k-1)}, \varvec{y}^{(k-1)}). \end{aligned}$$

Setting the derivative with respect to \(\lambda _{j}\) to zero,

$$\begin{aligned} \frac{\partial Q}{\partial \lambda _{j}}&= \frac{2r}{\lambda _{j}} - 2\lambda _{j}\left( \frac{1}{2}{\mathbb {E}}_{\varvec{\lambda }^{(k-1)}} \left[ \tau _{j}^{2}| \varvec{y}^{(k-1)}, \varvec{\lambda }^{(k-1)}\right] + \delta _{j} \right) = 0, \end{aligned}$$

we get

$$\begin{aligned} \lambda _{j}^{2} = \frac{r}{\frac{1}{2}{\mathbb {E}}_{\varvec{\lambda }^{(k-1)}} \left[ \tau _{j}^{2}| \varvec{y}^{(k-1)}, \varvec{\lambda }^{(k-1)}\right] + \delta _{j}}. \end{aligned}$$

These conditional expectations are the posterior expectations under the hyperparameter \(\varvec{\lambda }^{(k-1)}\), so they can be estimated using sample averages from a run of the Gibbs sampler.
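As an illustration of how the sample averages enter the update, here is a minimal sketch for a single \(\lambda _{j}^{2}\). It assumes the per-parameter objective \(r\ln (\lambda _{j}^{2}) - \lambda _{j}^{2}({\mathbb {E}}[\tau _{j}^{2}]/2 + \delta _{j})\), obtained by taking the expectation of the log density term \(r\ln (\lambda _{j}^{2}) - \lambda _{j}^{2}(\tau _{j}^{2}/2 + \delta _{j})\) given earlier in this appendix. The function name `update_lambda_sq` and the example draws are hypothetical.

```python
def update_lambda_sq(r, tau_sq_draws, delta):
    """M-step for one lambda_j^2 given Gibbs draws of tau_j^2.

    Maximizes r*ln(lambda_sq) - lambda_sq * (E[tau_sq]/2 + delta),
    with E[tau_sq] replaced by the sample average of the draws.
    """
    if not tau_sq_draws:
        raise ValueError("need at least one Gibbs draw of tau_j^2")
    expected_tau_sq = sum(tau_sq_draws) / len(tau_sq_draws)
    return r / (expected_tau_sq / 2.0 + delta)

# Hypothetical draws: a heavily shrunk parameter has small tau_j^2,
# a strong effect has large tau_j^2.
lam_shrunk = update_lambda_sq(r=1.0, tau_sq_draws=[0.01, 0.02, 0.03], delta=0.5)
lam_strong = update_lambda_sq(r=1.0, tau_sq_draws=[4.0, 6.0, 5.0], delta=0.5)
print(lam_shrunk, lam_strong)
```

The update assigns a larger penalty \(\lambda _{j}^{2}\) to parameters whose \(\tau _{j}^{2}\) draws are small and a smaller penalty to those with large \(\tau _{j}^{2}\), which is the adaptive behavior of the penalty.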

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Han, D., Modisette, V., Forthofer, M. et al. Hierarchical Bayesian adaptive lasso methods on exponential random graph models. Appl Netw Sci 9, 9 (2024).
