Hierarchical Bayesian Adaptive Lasso Methods on Exponential Random Graph Models

The analysis of network data has become an increasingly prominent and demanding field across multiple disciplines, including data science, health, and the social sciences, requiring the development of robust models and efficient computational methods. One well-established and widely employed modeling approach for network data is the Exponential Random Graph Model (ERGM). Despite its popularity, there is a recognized need for further advancements to enhance its flexibility and variable selection capabilities. To address this need, we propose a novel hierarchical Bayesian adaptive lasso model (BALERGM), which builds upon the foundations of the ERGM. The BALERGM leverages the strengths of the ERGM and incorporates the flexible adaptive lasso technique, thereby facilitating effective variable selection and tackling the inherent challenges posed by high-dimensional network data. The model improvements have been assessed through the analysis of simulated data as well as two real datasets: a friendship network and a respondent-driven sampling dataset on active and healthy lifestyle awareness programs.


Introduction
Multiple disciplines such as sociology, political science, and biology have extensively employed network analysis and random graph studies to comprehend and represent relationships among entities, ranging from friendships and global trading partners to proteins and genes. Early models generating random graphs assumed equal probability among graphs of the same size or independence among edges, but these models had evident limitations (Erdös and Rényi 1959). Holland and Leinhardt presented the next advancement by introducing a model for directed graphs that employed only independent dyads (Holland and Leinhardt 1981). Subsequent work overcame the limitations of independence assumptions and introduced Markov random graph models, establishing the foundation for ERGMs that has endured for decades (Frank and Strauss 1986). Traditional statistical methods, however, have limitations in effectively capturing the complexities of relational data. The ERGM has emerged as a valuable tool for quantifying such data, elucidating how local interactions shape the overall structure of a network. ERGMs acknowledge and capture the inherent interdependence embedded within network structures. The probability of an edge's existence is influenced not only by the presence of other edges but also by various network configurations, such as triangles, and by the characteristics of nodes throughout the entire network. This assumption of dependence aligns closely with our intuitive understanding of how networks form and operate. It is noteworthy that the development of ERGMs by Frank and Strauss (1986) was primarily motivated by the recognition of tie-dependence in networks.
Fundamentally, ERGMs are analogous to logistic regression when the dyads are independent, offering regression-like analysis on random networks. ERGMs estimate the probability of tie existence between pairs of nodes in a network. Since ERGMs share commonalities with logistic regression, let us recall the traditional lasso method in classical linear regression and discuss its development and relation to Bayesian theory, which hints at the potential problems in developing lasso estimates on the exponential random network. The lasso of Tibshirani (1996) is a method for simultaneous shrinkage and model selection in regression problems. In the context of linear regression, the lasso is a regularization technique for simultaneous estimation and variable selection. If y = Xβ + ε, where y = (y_1, y_2, ..., y_n)^⊤ is the response vector, X = (x_1, x_2, ..., x_p) is an n × p predictor matrix, β = (β_1, β_2, ..., β_p)^⊤ is the corresponding vector of regression coefficients, and ε = (ε_1, ..., ε_n)^⊤ are independent normally distributed errors, then the lasso estimates are defined as

β̂ = argmin_β { ||y − Xβ||² + λ Σ_{j=1}^p |β_j| },    (1)

where the second term in (1) is the so-called "l_1 penalty". The tuning parameter λ controls the amount of penalty. Fan and Li (2001) studied a class of penalized models including the lasso. They proved that the lasso can perform automatic variable selection because of the singularity of the l_1 penalty at the origin. If certain conditions are not satisfied, however, the lasso estimates can be inconsistent. To overcome these issues, Zou (2006) and Wang et al.
proposed to use an adaptive lasso that enjoys consistency and the oracle property: namely, it performs as well as if the true underlying model were given in advance (Zou 2006; Wang and Leng 2008). Tibshirani (1996) suggested that lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace (i.e., double-exponential) priors. Aiming to find this mode, several other authors subsequently studied different Bayesian formulations (Yuan and Lin 2006; Park and Casella 2008; Leng et al. 2014; Alhamzawi and Ali 2018). However, all these studies concern linear regression and are not built on random networks.
In the context of ERGMs, estimation encounters computational challenges when there is dependence among dyads. These challenges are primarily attributed to the intractability of the normalizing constant and the issue of degeneracy (Chatterjee and Diaconis 2013). Intractability refers to the computational difficulty of calculating the normalizing constant, which ensures that the probability mass function sums to one. Degeneracy, on the other hand, refers to the phenomenon where the model assigns a significant proportion of its probability mass to a small subset of graphs. This leads to a cascading effect throughout the graph, with the model placing most of its probability mass on very sparse or very dense graphs. Bayesian computational methods have proven instrumental in circumventing these challenges. Caimo and Friel were the first to develop a complete Bayesian framework for network models, enabling the incorporation of Bayesian analysis into real-world networks, which often exhibit large-scale, high-dimensional, and complex structures with numerous attribute variables associated with nodes (Caimo and Friel 2011). Subsequently, Caimo et al.
integrated a transdimensional reversible jump Markov chain Monte Carlo (RJMCMC) approach, initially introduced by Green (1995), with the exchange algorithm (Caimo and Friel 2013, 2014). This algorithm incorporates an independence sampler, using a parametric density approximation to the within-model posterior. The method is appealing for model selection since it relies exclusively on probabilistic considerations, but it is computationally challenging since it must estimate the posterior probability of each competing model. In scenarios with many variables, the number of potential models grows rapidly: the increased dimensionality leads to a larger set of competing models, making model selection more challenging and more critical. This motivates the penalized exponential random graph model developed in this paper.
While penalized estimation methods have been discussed in the context of graphical models by various researchers (Meinshausen and Bühlmann 2006; Shojaie et al. 2012; Shojaie and Michailidis 2010; Shojaie 2013; Fan et al. 2009), these studies either lack a specific focus on ERGMs or fail to fully account for the inherent dependencies present in network data, often transforming the problem into generalized penalized linear regression. Motivated by the need to explore network model uncertainty and achieve parsimony in exponential random graphs, we propose a more flexible, adaptive lasso-type penalized model within the ERGM framework. This model aims to improve parameter estimation and prediction accuracy, enabling effective variable selection within high-dimensional network data. Through comprehensive evaluations and comparisons with existing methods, our model demonstrates its efficiency and effectiveness in selecting significant variables. It promises substantial improvements in the field by addressing the critical challenge of model selection in the analysis of high-dimensional network data.
In summary, the Bayesian adaptive lasso model offers two prominent advantages. (1) Enhanced convergence speed and improved parameter mixing: the adaptive lasso addresses a notable limitation of the conventional lasso regularization technique, which often exhibits sluggish convergence and difficulty selecting significant variables in high-dimensional datasets. Consequently, it facilitates faster convergence and more effective mixing of parameters. This characteristic proves particularly advantageous in scenarios involving extensive datasets or a substantial number of predictors.
(2) Effective variable selection: the Bayesian adaptive lasso exponential random graph model demonstrates exceptional proficiency in this task by automatically identifying pertinent variables while shrinking or eliminating less relevant ones. The process is facilitated by multiple chains generated with a parallel direction sampling algorithm, which enhances the efficiency and accuracy of variable selection. These benefits are the primary focus of this article.
This article is structured as follows. Section 2 provides a basic introduction to exponential random graph models, offering a foundation for the subsequent discussions. In Sect. 3, we introduce a Bayesian adaptive lasso model for the exponential random graph, which enhances the Monte Carlo maximum likelihood method proposed by Geyer (1991) and the Bayesian ERGM (BERGM) presented by Caimo and Friel (Caimo et al. 2022). Section 4 presents a derivation of the Gibbs sampling theory underlying the model, shedding light on the theoretical framework. In Sect. 5, we introduce the adaptive parallel direction sampling algorithm, which is incorporated into the Gibbs sampler to improve the mixing of the Monte Carlo chains, thereby enhancing the overall performance of the model. Section 6 outlines the algorithm procedure and provides a comparative analysis with the BERGM method (Caimo and Friel 2013, 2014; Caimo et al. 2022), highlighting the strengths and advantages of our proposed approach. In Sect. 7, we describe the network dataset called Faux Dixon High, which is used to test the model, and present simulation results. This section also includes the results of applying the proposed model to data collected in a study conducted with the Prevention Research Center at USC and Sumter County Active Lifestyles (SCAL). In Sect. 8, we discuss the goodness of fit of the proposed Bayesian adaptive lasso method, providing an evaluation of its performance and suitability. Finally, in Sect. 9, we summarize the key findings and contributions of the paper and identify open problems and avenues for future research.

Examples and context
Exponential Random Graph Models (ERGMs) are widely applicable to research questions in the social and health sciences. In psychology, researchers studied Romanian school children's friendship networks, finding that sex and mental health showed patterns of homophily and concluding that ERGMs are a "promising avenue for further research" (Baggio et al. 2017). Also in the social and health sciences, Becker et al. considered the friendship network of members of a sorority and the influence of disordered eating habits on friendship, finding that women tended to have disordered eating habits unlike those of their friends (Becker et al. 2018). This unexpected result has implications for understanding the complex social dynamics behind a serious health concern. Solo et al. note the utility and suitability of ERGMs for modeling connections within the brain compared to more traditional methods, though they also note their computational difficulty (Solo et al. 2018). On a much larger scale, ERGMs have been used to understand the influences of information sharing on tourism: the model helped answer questions about patterns in the network, including whether the network exhibited homophily and how organizations should understand their role in it (Williams and Hristov 2018). In the biological world, Stivala et al. show that ERGMs can address some of the limitations that previous research had found in modeling biological processes (Stivala and Lomi 2021). These examples show the remarkable flexibility and significance of exponential random graph models.

Model structure
Any network can be expressed with an adjacency matrix. The connectivity of the network's graph is described by an n × n adjacency matrix Y, with i-j entry Y_{i,j} = 1 if node i gives a referral to node j and Y_{i,j} = 0 otherwise. Let Y be the set of all possible graphs on n nodes and let y be a realization of Y. A given network y consists of n nodes and m edges that define relationships between pairs of nodes called dyads. The adjacency matrix of the network graph allows for the analysis of the structural relationships in the observed network.
For general exponential random graph models, the network has the following exponential-family density (Lusher et al. 2013):

π(y|θ) = e^{θ^⊤ s(y)} / z(θ),

where y is the observed network, θ is a vector of parameters, and s(y) is a vector of network statistics. Each i-th network statistic s_i(·) has a corresponding parameter θ_i. A positive value of θ_i indicates that the edges involved in the formation of the network statistic s_i are more likely to be connected with each other. The normalizing constant z(θ) is the summation z(θ) = Σ_{y∈Y} e^{θ^⊤ s(y)}, where Y is the set of all possible graphs with the same number of nodes as y. The number of possible graphs with n nodes is 2^{n(n−1)/2}, which becomes very large for all but the smallest graphs (Lusher et al. 2013). Hence, the calculation of z(θ) is computationally feasible only for small networks; it becomes challenging to find this normalizing constant for large or even moderate-sized networks.
Let δ = s(y⁺_{ij}) − s(y⁻_{ij}) be the vector of change statistics when the edge y_{ij} between nodes i and j changes from 1 to 0, with the rest of the graph y^c_{ij} held fixed. Conditioned on the state of the rest of the graph, denoted Y_{−ij}, the log-odds of a tie existing between nodes i and j is

logit(P(Y_{ij} = 1 | Y_{−ij} = y_{−ij})) = θ^⊤ δ.

These network statistics can be overlapping subgraph configurations such as the number of edges, mutual edges, triangles, uniform homophily, etc. This representation gives an intuitive interpretation of the model parameters θ in terms of their effect on the probability of an edge between nodes i and j.
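The change-statistic representation can be sketched numerically. Below is a minimal illustration (not part of the paper's software) for a toy undirected model with two statistics, edges and triangles; the function name and toy graph are ours for illustration:

```python
import numpy as np

def change_stats(y, i, j):
    """Change in (edges, triangles) when toggling edge (i, j) on,
    holding the rest of the undirected graph fixed."""
    d_edges = 1.0
    # Toggling (i, j) creates one triangle for every common neighbor k.
    d_triangles = float(np.sum(y[i] * y[j]))
    return np.array([d_edges, d_triangles])

# Toy 4-node undirected graph (symmetric adjacency matrix, no self-loops).
y = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])

theta = np.array([-1.0, 0.5])        # (edge, triangle) parameters
delta = change_stats(y, 0, 3)        # change statistics for dyad (0, 3)
log_odds = theta @ delta             # conditional log-odds of the tie
p = 1.0 / (1.0 + np.exp(-log_odds))  # conditional tie probability
```

Here nodes 0 and 3 share no neighbors, so δ = (1, 0) and the conditional log-odds reduce to the edge parameter alone.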

Estimation methods
The inferential goal is to find an estimate of θ such that the corresponding generated networks have a probability distribution centered, on average, on the observed network. That is, we want to solve the moment equation

E_θ[s(Y)] = s(y_obs),    (2)

where y_obs is the observed network, s(y) is the vector of network statistics of a proposed graph, and s(y_obs) is the vector of network statistics of the observed graph. In most cases, however, the moment equation cannot be solved analytically. This challenge leads to two mainstream simulation-based approaches: maximum pseudolikelihood estimation and Monte Carlo maximum likelihood estimation.

Maximum pseudolikelihood estimation
Direct maximum likelihood estimation of ERGMs is complicated since the likelihood function is difficult to compute for models and networks of moderate or large size. Strauss and Ikeda (1990) proposed a standard approximation, maximum pseudolikelihood estimation (MPLE). Instead of conditioning each tie on the state of the entire graph, the assumption is that the dependence among dyads is weak. In particular, the MPLE estimates can be obtained by assuming independence among the values of Y_ij:

π_pseudo(y|θ) = Π_{i≠j} P(Y_{ij} = y_{ij} | Y_{−ij} = y_{−ij}).

This pseudolikelihood function has the strength of quick estimation but has been shown not to provide reliable estimates (van Duijn et al. 2009; Friel et al. 2009). It yields the true maximum likelihood estimate only for ERGMs with dyadic independence, i.e., when the change statistics can be computed from a single tie without knowing the rest of the graph. Research by van Duijn et al. (2009) compares maximum pseudolikelihood and maximum likelihood estimates; their study shows that pseudolikelihood estimation is biased and does not capture the transitivity pattern in the network well.
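Under dyadic independence, MPLE reduces to logistic regression of the dyad indicators on their change statistics. A minimal numpy sketch (ours, for illustration) under an edges-only model, where the change statistic is identically 1 and the MPLE equals the log-odds of the observed density:

```python
import numpy as np

def mple_edges(y):
    """MPLE for an edges-only ERGM on an undirected graph:
    logistic regression of each dyad on its change statistic (= 1),
    which reduces to the log-odds of the observed density."""
    n = y.shape[0]
    iu = np.triu_indices(n, k=1)
    ties = y[iu].astype(float)         # dyad-level responses
    theta = 0.0
    # Newton-Raphson for logit(P(Y_ij = 1)) = theta * 1
    for _ in range(50):
        p = 1.0 / (1.0 + np.exp(-theta))
        grad = np.sum(ties - p)        # score
        hess = -ties.size * p * (1 - p)
        theta -= grad / hess
    return theta

y = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
theta_hat = mple_edges(y)   # log(3 / (6 - 3)) = 0 for this graph
```

With 3 ties among 6 dyads the density is 0.5, so the fitted edge parameter is exactly 0.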

Monte Carlo maximum likelihood estimation
Similar to methods in linear regression, ERGMs are log-linear, and a typical method for finding the maximum likelihood requires finding the roots of the derivative of the log-likelihood. This yields the moment condition s(y_obs) − E_θ[s(Y)] = 0 found earlier. Monte Carlo maximum likelihood estimation for ERGMs needs the following important ratio (van Duijn et al. 2009):

z(θ)/z(θ_0) = E_{θ_0}[ e^{(θ−θ_0)^⊤ s(Y)} ].

The log-likelihood equation, however, is not directly solvable without computing the normalizing constant, which, as previously mentioned, is computationally intensive for all but the smallest graphs. With this approximation, the ratio of normalizing constants can be estimated by generating m graphs y_1, ..., y_m from the density π(y|θ_0) and computing e^{(θ−θ_0)^⊤ s(y_i)}
for each graph, using the importance sampling technique. The estimates of θ can be obtained by maximizing the log-likelihood ratio approximated as

l(θ) − l(θ_0) ≈ (θ − θ_0)^⊤ s(y_obs) − ln[ (1/m) Σ_{i=1}^m e^{(θ−θ_0)^⊤ s(y_i)} ].

However, in this method the choice of the initial θ_0 is tricky and should be near the maximum likelihood estimate of θ. A poor choice of θ_0 can lead to failure of the maximization of the log-likelihood function and to the degeneracy problem (van Duijn et al. 2009; Handcock 2003).
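The importance-sampling approximation above can be sketched as follows. This is an illustration only: the simulated statistics are stand-in random draws rather than draws from an actual ERGM sampler at θ_0.

```python
import numpy as np

def loglik_ratio(theta, theta0, s_obs, s_sim):
    """Importance-sampling approximation of l(theta) - l(theta0).
    s_sim: (m, p) array of statistics of m graphs simulated at theta0."""
    d = theta - theta0
    w = s_sim @ d
    # log-mean-exp for numerical stability
    m = np.max(w)
    log_z_ratio = m + np.log(np.mean(np.exp(w - m)))
    return d @ s_obs - log_z_ratio

rng = np.random.default_rng(0)
s_sim = rng.normal(loc=10.0, scale=2.0, size=(1000, 1))  # stand-in draws
s_obs = np.array([11.0])
r = loglik_ratio(np.array([0.1]), np.array([0.0]), s_obs, s_sim)
```

Note that the approximation is exact at θ = θ_0, where the ratio is 0 by construction.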

Bayesian adaptive lasso exponential random graph model
This work is motivated by the need to explore model uncertainty and flexibility. With these objectives, we consider the exponential random graph model, a particular class of discrete exponential families representing the probability distribution of the adjacency matrix Y ∈ Y, where Y is the set of all possible graphs on n nodes. Let y be a realization of Y. The likelihood function of an ERGM stands for the probability density of a random network and can be expressed as

π(y|θ) = q(y|θ) / z(θ),    (8)

where q(y|θ) = e^{θ^⊤ s(y)} is the unnormalized likelihood. We consider the following adaptive lasso estimator on the exponential random network:

θ̂ = argmax_θ { l(θ|y) − Σ_{j=1}^p λ_j |θ_j| },    (10)

where l(θ|y) = ln(π(y|θ)) is the log-likelihood function of θ and each λ_j is a separate penalty parameter for the corresponding coefficient. In dyadic independence ERGMs, maximizing the penalized log-likelihood (10) is equivalent to maximizing the corresponding penalized log pseudolikelihood. In this case, the network estimation problem is transformed into a classical adaptive lasso logistic regression model. For example, the coordinate descent algorithm implemented in the glmnet package for R (Tay et al. 2023; Friedman et al. 2010) can estimate the parameters θ_j, j = 1, 2, ..., p, with penalties including the lasso, ridge, and elastic net. However, unlike generalized linear regression models, the challenge of estimation for dyadic dependent ERGMs lies in the intractable normalizing constant appearing in the log-likelihood (8). Given the review of likelihood-based ERGM methods in Sect. 2, solving equation (10) faces similar obstacles. To get around them, we study this problem with an adaptive Bayesian estimate obtained from the lasso-penalized method on random networks.
Assume that a prior distribution π(θ) is placed on θ, and we are interested in the posterior distribution

π(θ|y) ∝ π(y|θ) π(θ).    (13)

We consider a conditional Laplace prior specification similar to the classical Bayesian lasso linear regression developed in Park and Casella (2008), but with a separate penalty term λ_j for each j = 1, 2, ..., p:

π(θ|σ²) = Π_{j=1}^p (λ_j / (2√σ²)) e^{−λ_j |θ_j| / √σ²}.    (14)

We can now formulate a hierarchical model on the exponential random graph, which we use to implement this version of the Bayesian lasso with a Gibbs sampler, exploiting the representation of the Laplace distribution as a scale mixture of Gaussians: when the mixing distribution is exponential, the resulting distribution is Laplace (Andrews and Mallows 1974). We thus use latent parameters τ_j² to write the prior (14) as a scale mixture of normal distributions (15). The τ_j² can be viewed as additional parameters that assign different variances to the prior of each θ_j. When τ_j → 0, the coefficient of s_j(y) is shrunk to zero.
where σ² > 0 and D_τ = diag(τ_1², ..., τ_p²) is a matrix that allows each parameter to come from a normal distribution with a different variance.
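The scale-mixture representation can be checked numerically. The sketch below (ours, with σ² fixed at 1 and an arbitrary λ) draws τ² from an exponential distribution with rate λ²/2 and then draws θ | τ² ~ N(0, τ²); marginally θ is Laplace with density (λ/2)e^{−λ|θ|}, whose variance is 2/λ²:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0
n = 200_000

# Mixing distribution: tau^2 ~ Exponential with rate lam^2 / 2
# (numpy's exponential is parameterized by scale = 1 / rate).
tau2 = rng.exponential(scale=2.0 / lam**2, size=n)

# Conditionally normal: theta | tau^2 ~ N(0, tau^2).
theta = rng.normal(0.0, np.sqrt(tau2))

# Marginally theta ~ Laplace(lam), with variance 2 / lam^2 = 0.5 here.
print(theta.var())   # close to 0.5
```

The heavy tails of the Laplace relative to the normal come entirely from the random variance τ².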
Unlike the basic Bayesian lasso model proposed by Park and Casella (2008), in which the τ_j² share a single shrinkage parameter λ, our adaptive lasso model sets up different shrinkage parameters for different coefficients. This motivates a more adaptive penalty in the hierarchical structure:

τ_j² ~ Exponential(λ_j²/2), independently for j = 1, ..., p,

and an independent non-informative scale-invariant marginal prior π(σ²) ∝ 1/σ² on σ², as suggested by Park and Casella (2008). The conditional distribution (13) guarantees a unimodal full posterior distribution for the estimate of θ on the network (see Appendix A). The unimodal posterior ensures quick convergence of the Gibbs sampling algorithm and a meaningful single point estimate of θ.
Finally, the simplest prior for the penalty terms λ_j, j = 1, 2, ..., p, would be a uniform distribution, but this proved problematic with complex networks, particularly when a model has many parameters. Thus, following the notation of Park and Casella (2008), we propose a prior such that λ_j² follows a Gamma distribution with shape parameter r and rate parameter δ_j:

λ_j² ~ Gamma(r, δ_j).

This prior mixes well with the other choices in the Gibbs sampler and, as Park and Casella (2008) note, it can approach 0 as λ_j → ∞ while concentrating probability near the maximum likelihood estimate.
In summary, the hierarchical formulation of the Bayesian adaptive lasso model on the exponential random graph is given in (20), for λ_j, r, δ_j > 0. The major differences of this formulation from the Bayesian lasso of Park and Casella (2008) are as follows. First, the Bayesian lasso of Park and Casella (2008) is applied to the linear regression model y = µ1_n + Xβ + ε without any network structure; that is, y there follows the normal distribution N(µ1_n + Xβ, σ²I_n), where y is an n × 1 vector of responses with no random graph involved. Second, our model allows a different penalty variable λ_j for each parameter. Each τ_j² then has its own distribution, so the variance of each normal component can differ. With this flexibility of the penalties, the lasso estimates of parameters for less important variables on the exponential random graph receive a larger penalty, and a smaller penalty is applied to the important ones. Compared with existing Bayesian adaptive lasso models (Leng et al. 2014; Alhamzawi and Ali 2018), our model is built on the random network. Compared with the Bayesian Exponential Random Graph Model (BERGM) of Caimo and Friel (2011), our Bayesian Adaptive Lasso Exponential Random Graph Model (BALERGM) gives more accurate estimates, and its structure is more flexible and adaptive at the level of network statistics by adopting distinct shrinkage penalties for different network statistics. The estimate θ̂_j of θ_j, j = 1, 2, ..., p, will be small and close to 0 if the j-th statistic does not provide much improvement in predicting the random network Y, which naturally leads to an estimator with an automatic variable selection property. The value of λ_j affects the estimate θ̂_j.
The larger λ̂_j is, the sparser θ̂ will be (namely, more coefficients are small and near 0), whereas a smaller λ̂_j leads to a less sparse θ̂. Sparsity is a common assumption in high-dimensional statistics because we anticipate that only a few covariates are actually related to the response while most are irrelevant. BALERGM is very powerful in this scenario because it leads to a sparse estimator on the network (many coefficients near 0). Note that high-dimensional problems are very common in network science. For example, in genetics there are many genes per individual but often few patients in a study, and in neuroscience the fMRI machine produces many voxels per person at a given time.

Gibbs sampler implementation
We now implement the model with a Gibbs sampler. Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm. In our case, the joint distribution is difficult to sample from directly, but the conditional distribution of each variable is known and easier to sample from. The Gibbs sampling algorithm generates a draw from the distribution of each variable in turn, conditioned on the current values of the other variables. The construction of the hierarchical model (20) makes the derivation of the full conditional distributions for each component of the estimates feasible.
Thus we can write the joint density as the product of the conditional density of y|θ and the density of θ. Using the pieces of the hierarchical formulation of the model in (20), we can substitute in each piece that we have already chosen to find the joint distribution.
To implement the Gibbs sampling, we require the distribution of each parameter τ_j, λ_j, and σ² to update in turn. From the joint distribution (26), we collect all parts of that joint distribution that depend on each variable. As summarized in Table 1, we consider the full conditional distributions for τ_j, λ_j, and σ², respectively.

Sample τ j
For each τ_j, collecting the terms of the joint density (26) that involve τ_j² gives its full conditional distribution; as in the standard Bayesian lasso, the reciprocal 1/τ_j² follows an inverse Gaussian distribution, here with mean √(λ_j²σ²/θ_j²) and shape parameter λ_j².
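Assuming this standard Bayesian-lasso form of the full conditional (with the adaptive λ_j in place of a shared λ), a draw of 1/τ_j² can be sketched with numpy's Wald (inverse Gaussian) generator:

```python
import numpy as np

def sample_inv_tau2(theta_j, lam_j, sigma2, rng):
    """Draw 1/tau_j^2 from its inverse-Gaussian full conditional,
    following the standard Bayesian lasso (Park & Casella 2008) with
    an adaptive penalty lam_j in place of a shared lambda."""
    mu = np.sqrt(lam_j**2 * sigma2 / theta_j**2)   # IG mean
    shape = lam_j**2                               # IG shape
    return rng.wald(mu, shape)

rng = np.random.default_rng(1)
draws = np.array([sample_inv_tau2(0.5, 1.5, 1.0, rng)
                  for _ in range(50_000)])
# The inverse-Gaussian mean is lam_j * sqrt(sigma2) / |theta_j| = 3.0 here.
print(draws.mean())   # close to 3.0
```

Small |θ_j| pushes the mean of 1/τ_j² up, shrinking the prior variance of θ_j, which is exactly the adaptive-shrinkage mechanism.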
The derivation uses the transformation from τ_j² to 1/τ_j² (Chhikara and Folks 1988).

Sample σ²

Similar to the other parameters, we now look at σ² with the following conditional distribution:

If x ~ Inverse-Gamma(α, β) with shape parameter α and scale parameter β, then it has the density

f(x; α, β) = (β^α / Γ(α)) x^{−α−1} e^{−β/x},  x > 0.    (34)

Comparing the conditional density (33) with (34) identifies the shape and scale parameters of the inverse gamma full conditional for σ². To sample the penalty terms λ_j, we have developed three methods, providing flexibility depending on the network model requirements (Table 2).
Method A: The simplest prior for the penalty terms λ_j, j = 1, 2, ..., p, would be a uniform distribution, but this proved problematic with complex networks, particularly when a model has many parameters. Thus, following the notation of Park and Casella (2008), we propose an adaptive prior λ_j² ~ Gamma(r, δ_j). For the full Bayes estimation of λ_j², we derive its full conditional distribution. Collecting the terms of the joint density involving λ_j² shows that λ_j² is proportional to a gamma density with α = r + 1 and β = τ_j²/2 + δ_j, since a standard gamma probability density function is

f(x; α, β) = (β^α / Γ(α)) x^{α−1} e^{−βx},  x > 0.

Therefore we can conclude

λ_j² | τ_j² ~ Gamma(r + 1, τ_j²/2 + δ_j),

where r and δ_j are chosen constants (or vectors of constants).

Method B:
In contrast to Method A, where the parameters δ_j, j = 1, 2, ..., p, were treated as fixed constants, this method incorporates an empirical update of the hyperparameter vector δ using the Monte Carlo Expectation-Maximization (E-M) algorithm (Levine and Casella 2001). The empirical update of each δ_j is performed by approximating the required expectation with the Monte Carlo average of the λ_j² samples from the previous iteration of the chain. The full derivation of this method is presented in Appendix B.
The empirical update of the parameters δ_j using the E-M algorithm brings several advantages to the estimation process. Firstly, it eliminates the need to manually specify appropriate hyperparameter values, as the parameter values are estimated directly from the observed data. This data-driven approach enables the selection of hyperparameters based on the characteristics of the data, enhancing the flexibility and adaptability of the model. Additionally, the empirical update of the parameters δ_j allows the model to capture intricate nuances and complexities that may not be adequately accounted for by Method A, which relies on fixed hyperparameters. By updating the parameters based on the observed data, the model can better capture the variability present in the data, leading to improved estimation accuracy and model performance.
Method C: This method uses a full empirical Bayes approach that estimates λ_j² directly from the observed data without assuming any specific prior distribution; the full derivation and update formula are given in Appendix B. While Method C offers several advantages, such as adapting to the data and improving exploration of the parameter space, it also has certain disadvantages that should be considered.
One of the primary disadvantages of full empirical MCMC is its computational cost.Empirical MCMC methods typically require additional iterations and computations compared to traditional MCMC algorithms.The empirical updates of parameters or proposal distributions can be computationally intensive, particularly when dealing with large datasets or complex models.This can result in longer execution times, limiting the scalability of the method.
Another disadvantage is the potential for bias or inefficiency in the estimation process. Empirical updates rely on the observed network data to estimate the parameters and the proposal distribution of the network. If the nodal sufficient statistics are not fully representative, or if the observations of the nodal random variables contain outliers, the empirical estimates may introduce biases or inefficiencies into the MCMC sampling. Additionally, the convergence of Method C requires careful tuning of the other hyperparameters to achieve optimal performance; this optimization can be nontrivial and may require expert knowledge or extensive experimentation.

Adaptive parallel direction sampling algorithm
There have been considerable developments in approaches to the problem of sampling from a distribution with a doubly intractable normalizing constant, for example the easy-to-implement and more direct single-variable exchange algorithm proposed by Murray et al. (2012). However, if there is strong temporal dependence in the state process and strong correlation between model parameters, the exchange algorithm mixes slowly (Caimo and Friel 2011; Caimo and Mira 2015). The parallel direction sampler proceeds as follows for the current chain h:

1. Select two chains h_1 and h_2 at random, with h_1 ≠ h_2 ≠ h.
2. Sample the error term from a symmetric normal distribution, ε ~ N(0, σ_ε²).
3. Update θ_h with the parallel direction move

θ'_h = θ_h + γ(θ_{h_1} − θ_{h_2}) + ε,

where q(y|θ) = e^{θ^⊤ s(y)} is the unnormalized likelihood used in the exchange acceptance ratio.

The move of θ is illustrated in Figure 1. Two other chains, h_1 and h_2, are chosen at random. The difference between the corresponding estimates θ_{h_1} and θ_{h_2} in these two chains is used to determine the distance to move away from θ_h. A normal distribution with a very small variance is then used to slightly perturb the new θ.
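The move above can be sketched as follows. The tuning constants γ and σ_ε, and the function name, are illustrative assumptions, not values prescribed by the paper:

```python
import numpy as np

def parallel_direction_proposal(thetas, h, gamma, sigma_eps, rng):
    """Propose a move for chain h using the difference between two other
    randomly chosen chains, plus a small symmetric normal perturbation.
    gamma and sigma_eps are tuning constants (assumed here)."""
    H = thetas.shape[0]
    h1, h2 = rng.choice([k for k in range(H) if k != h],
                        size=2, replace=False)
    eps = rng.normal(0.0, sigma_eps, size=thetas.shape[1])
    return thetas[h] + gamma * (thetas[h1] - thetas[h2]) + eps

rng = np.random.default_rng(3)
thetas = rng.normal(size=(6, 2))       # 6 parallel chains, 2 parameters
proposal = parallel_direction_proposal(thetas, h=0, gamma=0.5,
                                       sigma_eps=0.01, rng=rng)
```

The proposal would then be accepted or rejected with the exchange-algorithm ratio built from the unnormalized likelihood q(y|θ).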

Bayesian adaptive lasso algorithm
In this section, we list the algorithm of the Bayesian Exponential Random Graph Model (BERGM) by Caimo et al. (2017) and the algorithm of our Bayesian Adaptive Lasso Exponential Random Graph Model (BALERGM) for easy comparison. Caimo et al. (2017) set up the exchange algorithm with a Gibbs update of θ′ and then y′ using Markov chain Monte Carlo iteration without penalty terms. The algorithm can be written in the following concise way:

where s(y) and s(y′) are the observed and simulated vectors of network statistics, respectively.
For the new Bayesian adaptive lasso model, we use the parallel adaptive direction sampler suggested for BERGM, combined with Gibbs sampling, to generate samples and obtain estimates of θ.
The BALERGM package is shared on GitHub: xxxx (the link will be provided upon acceptance of this paper).

Simulation and application
In this section, we demonstrate the strengths of BALERGM in three ways. First, we use the Faux Dixon High School data set to simulate 100 graphs and compare BERGM and BALERGM; the trials show that BALERGM is a stable model with accurate estimation that improves on BERGM with a higher acceptance rate, a larger effective sample size, and a lower MSE. The remaining two ways showcase the parameter-selection abilities of BALERGM: first on a simulated parameter, and then on network data collected in a study by the Prevention Research Center at USC and Sumter County Active Lifestyles (SCAL).

Data
The network object Faux Dixon High represents a friendship network among junior high and high school students, based on data gathered by the National Longitudinal Study of Adolescent Health; see Resnick et al. (1997) for details. This study, first conducted in 1994-1995, considered more than 90,000 American students. Students were asked to list friends, and a tie is formed between two students in the network if both claimed the friendship (Goodreau et al. 2008).
The final network has 248 nodes with 1,197 directed edges. Each node has three characteristics: grade, sex, and race. The grades include 7th-12th, and race is first delineated as Hispanic or non-Hispanic, with the latter further split into Asian, Black, Native American, Other, and White. Figure 2 shows the network plotted with nodes colored by grade, illustrating the homophily.
Executing the BALERGM algorithm requires choosing network statistics built from nodal and edge attributes and from structural features such as triangles and triads (Morris et al. 2008). The counts of these network statistics are computed from the adjacency-matrix realization y of Y, with i-j entry denoted y_ij. For a directed network, the following summations demonstrate the counting procedure:

Edges: ∑_{i≠j} y_ij,  Mutual edges: ∑_{i<j} y_ij y_ji,  Cyclic triads: ∑_{i≠j≠k} y_ij y_jk y_ki.

A natural network statistic for this data is the number of instances of homophily between students in the same grade, since, as seen in Fig. 2, nodes with the same attribute (in this case grade) appear visually to have more connections. As seen in Table 3, whose diagonal entries give the number of ties from Grade i to Grade i for i ∈ {7, 8, 9, 10, 11, 12}, most of the connections are between students in the same grade. This feature can be included in network models with the R term nodematch.
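The counting procedure above can be sketched directly from the adjacency matrix. This is an illustrative helper, not the ergm package's implementation; it uses matrix-trace identities in place of the explicit sums and assumes a zero diagonal (no self-ties).

```python
import numpy as np

def directed_stats(Y):
    """Counts of three directed network statistics from an adjacency
    matrix Y with zero diagonal: total edges sum_{i!=j} y_ij, mutual
    dyads sum_{i<j} y_ij y_ji, and cyclic triads (each directed 3-cycle
    appears in trace(Y^3) once per starting node, hence the /3)."""
    Y = np.asarray(Y)
    edges = int(Y.sum())
    mutual = int(np.trace(Y @ Y)) // 2
    ctriads = int(np.trace(Y @ Y @ Y)) // 3
    return edges, mutual, ctriads
```

For example, a three-node directed cycle has 3 edges, no mutual dyads, and exactly one cyclic triad.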

Simulation
To demonstrate the overall effectiveness of BALERGM, we conducted a comparative analysis between BALERGM and BERGM (Caimo and Friel 2014). Our evaluation involved generating 100 independent exponential random graphs from the Faux Dixon High dataset with known, fixed parameters θ. Specifically, we focused on two network statistics: the count of edges in the network (θ1: edges) and the count of occurrences of homophily in which students of the same grade share a friendship connection (θ2: nodematch.Grade). Without loss of generality, we fixed the parameter values θ = (−4.8, 2.3) and generated 100 independent exponential random graphs based on the Faux Dixon High dataset, treating them as new instances with associated node attributes. This gave 100 distinct opportunities to estimate the parameter vector θ with both the BALERGM and BERGM algorithms, enabling a comprehensive performance comparison against the true values θ = (−4.8, 2.3). In each run of BERGM and BALERGM, the main chain consists of 2000 iterations with a burn-in of 50 iterations. Across the 100 simulations, each model generates a sequence of values estimating each θ. To confirm the stability of the model, the MCMC results in Fig. 3 show the strength and stability of the BALERGM algorithm after relatively few iterations: the unimodal distribution of estimates is on the left, the center column shows the trace of the estimates indicating a stable estimating process, and the final column shows the autocorrelation plot with the autocorrelation decreasing quickly; by lag 50, it has dropped to a minimal level.
Using both the mean and the median of these estimates, we can compare several metrics. Table 4 compares the acceptance rate of generated estimates for each run, the mean effective sample size, and the mean and median squared error (MSE) of the estimates relative to the chosen true values, where MSE = (1/n) eᵀe and e is the error vector, e = θ̂ − (−4.8, 2.3).
Table 4 shows that, using either the mean or the median of the generated MCMC estimates of θ: (1) BALERGM has a better overall acceptance rate and effective sample size on average than BERGM. The acceptance rate, i.e. the percentage of generated samples accepted in the MCMC process, is increased, implying that BALERGM adjusts to the true parameter for each variable faster than BERGM. (2) BALERGM offers an improvement over BERGM with a lower mean squared error. The MSE is dramatically lower with BALERGM whether the mean or the median of the MCMC output is used as the estimate of θ. This can also be seen in the quantiles for each estimate of θ: since the true values are θ = (−4.8, 2.3), the BALERGM estimates are much closer to these true values.
In Table 5, the true known value of each θ is estimated by either the mean or the median of the generated samples. The quantiles for the estimates of θ show the spread of each estimate.
Once results are generated, the produced estimates can be used to calculate the probability of a tie using θ̂. Continuing the previous example, if no new tie is created, so that the change statistic for θ1 is zero, then the probability that a tie forms between students of the same grade can be calculated as

e^2.8550 / (1 + e^2.8550) = 0.945577.

That is, the conditional probability of observing such an edge (one not involved in the creation of other network statistics included in the model) is about 94.56%.
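The calculation above is the logistic transform of the θ-weighted sum of change statistics; a one-line sketch:

```python
import math

def tie_probability(weighted_change_stats):
    """Conditional probability of a tie given the sum of change
    statistics weighted by the estimated theta (the log-odds)."""
    return math.exp(weighted_change_stats) / (1.0 + math.exp(weighted_change_stats))
```

Evaluating it at the estimate 2.8550 reproduces the 0.9456 figure from the text.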

Variable selection
BALERGM not only improves sampling efficiency compared to previous models but also demonstrates strong performance in variable selection through its adaptive lasso component. This reflects the model's ability to identify and highlight parameters that are more or less significant to the network structure. The following simulated example showcases the effectiveness of BALERGM in terms of variable selection.
For this example, we again use the Faux Dixon High School dataset. The chosen network statistics are the count of edges in the network (edges), the count of occurrences of homophily in which students of the same grade share a friendship connection (nodematch.Grade), and a third, artificially created term: the count of occurrences of homophily on a generated nodal attribute for parental wealth (nodematch.Wealth). This additional nodal attribute "Wealth" is generated from the uniform distribution on [20, 75]. Because this nodal variable is generated uniformly at random, it is intentionally designed to have no impact on the network structure. Our objective is to test whether BALERGM can effectively identify and exclude this artificially created, insignificant nodal variable. Running the BALERGM algorithm produces the following results, demonstrating that the model accurately estimates θ3 to be close to zero. This outcome aligns with our expectations, as the variable has no meaningful influence on the network structure (Table 6, Fig. 4).

Sumter county active lifestyles (SCAL) network analysis
With the US Burden of Disease Collaborators (2018) reporting worsening metrics of American health, communities are working to address and understand the factors that might improve health outcomes. To this end, the University of South Carolina Prevention Research Center and Sumter County Active Lifestyles (SCAL), based in Sumter County, South Carolina, conducted a respondent-driven study in 2014 to better understand the dynamics of social networks and health outcomes.
In this study, community ambassadors chosen for their history of community involvement were given a set compensation for their participation. Each ambassador was instructed to share the survey with those in their social network. Each of these respondents was also compensated both for completing the survey and for sharing it with others who completed it. Using referral codes, a network can be created from these referral relationships. The respondents are primarily white (87%), female (78%), likely to be older than 50 (44%), and more educated, with 46% being college graduates. Other questions focused on self-reported health outcomes and activities, including exercise habits, eating habits, and social-support dynamics. The questionnaire included qualitative questions about physical activities and opportunities for physical activity in the community. For the purposes of this network, nodal attributes were assigned using the answers to the multiple-choice questions only.
The resulting network contains many nodal attributes, a setting where ERGM and BERGM cannot be applied effectively. This motivates a model like BALERGM, which makes it possible to identify which of these network statistics contribute less to the network structure.

Results
Using the SCAL data set from the previous section, we analyze this model with the network statistics in Table 7, built from the ergm terms "nodematch", "nodefactor", and "nodecov". These three terms all provide measures of homophily: "nodematch" counts the instances of connected nodes sharing the same value of a given attribute; "nodecov" performs a similar function for continuous variables; "nodefactor" creates a network statistic for each discrete level of a nodal attribute and counts the occurrences of connected nodes with that attribute level. For more details, see Morris et al. (2008).
Table 7 shows the BALERGM output on the SCAL social network. The sparsity of the network can be seen in the large negative values of the network statistics for edges and the out-degree of the nodes. While the standard deviations vary across estimates, the MCMC outputs show stable estimation with symmetric distributions, as the quantile values indicate. Table 7 also provides valuable insights into the relationships between different variables in the network analysis. One interesting observation is that individuals who maintain a healthy diet (θ6 through θ13 are all positive) tend to have positive connections with each other. This suggests a clustering effect among individuals with similar dietary habits, indicating a potential influence of shared health-conscious behaviors on network connections.
Furthermore, the results highlight that participation in a walking program (variables θ18 and θ19) is positively associated with network connections. This implies that individuals who engage in walking programs are more likely to know each other within the network, suggesting a potential social-bonding effect among individuals who actively participate in health-promoting activities, leading to the formation of connections and social ties.
The adaptive lasso penalty in BALERGM is useful for shrinking the θ values of network statistics that are less significant to the network structure. Depending on the model and network conditions, a parameter estimate might not reach exactly zero. For example, the estimates θ26 = −0.001 and θ20 = −0.076 are both small, but the mean of the generated samples alone does not allow a nuanced ranking of how significant each parameter is. Using the distribution of θ obtained from the Gibbs sampling process, we can instead find the probability that a parameter lies below zero and assess it at some significance level α.
A larger value of α indicates a higher importance of the variable in the context of the model, which creates the ability to rank variables. Table 8 shows the parameters less significant to the construction of the network at various significance levels. For example, the network statistic for the age of the participant (θ26) is less significant than the network statistic for having heard of the SCAL program (θ20). While neither is a primary factor, BALERGM gives researchers insight into the social dynamics of Sumter County, allowing for targeted programs to improve health outcomes. This example highlights the powerful functionality of BALERGM for variable selection and importance ranking in network analysis. In network studies, the presence of numerous network variables is common, and identifying the most relevant ones is crucial: it enables researchers to concentrate their analysis and interpretation on the factors that significantly influence the network's structure and behavior. By focusing on these key variables, we can gain a deeper understanding of the underlying mechanisms that drive network dynamics.
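One simple way to implement the ranking idea above, given the Gibbs samples, is to score each parameter by how much of its posterior mass falls on the dominant side of zero; scores near 0.5 mean the posterior straddles zero and the variable is less significant. This is a sketch of the idea, not the package's exact criterion, and the parameter names are illustrative.

```python
import numpy as np

def rank_by_sign_certainty(samples, names):
    """Given Gibbs samples of shape (n_iter, p) and a name per column,
    score each parameter by the posterior mass on the dominant side of
    zero and return (name, score) pairs, most certain sign first."""
    samples = np.asarray(samples, dtype=float)
    p_pos = (samples > 0).mean(axis=0)          # P(theta_j > 0) per column
    score = np.maximum(p_pos, 1.0 - p_pos)      # mass on the dominant side
    order = np.argsort(-score)
    return [(names[j], float(score[j])) for j in order]
```

A parameter whose samples are all positive scores 1.0; one whose samples split evenly around zero scores 0.5 and would be flagged as insignificant at most tolerance levels.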

Goodness of fit
To assess the performance and goodness of fit of exponential random graph models, various diagnostics can be employed. These involve comparing key statistical measures calculated from the observed network with those obtained from networks simulated under the estimated parameters. In the Bayesian framework, evaluating goodness of fit involves posterior predictive assessments: the observed network is compared to a collection of networks simulated from the posterior distribution of the model's parameter estimates, as in Caimo and Friel (2011).
The set of statistics used for the comparison contains the degree distributions, the minimum geodesic distance, and the number of edge-wise shared partners. Since the SCAL network is a directed graph, the distributions of both in-degrees and out-degrees are included. Because the graph includes isolated nodes and disconnected clusters, the minimum geodesic distance (the minimum number of edges needed to connect two nodes) is infinite for some pairs, leading to the spike in the corresponding panel of Fig. 6. Finally, the edge-wise shared partners are concentrated at low values, since the number of nodes in common for any pair of connected nodes is small.
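A posterior predictive check of this kind reduces to comparing summary-statistic distributions of the observed network with the spread across simulated networks. The sketch below does this for the out-degree distribution only, with hypothetical helper names; the in-degree, geodesic, and shared-partner versions follow the same pattern.

```python
import numpy as np

def degree_gof(Y_obs, simulated):
    """Posterior-predictive check on the out-degree distribution.
    Y_obs is the observed adjacency matrix; `simulated` is a list of
    adjacency matrices drawn from the posterior.  Returns the observed
    per-degree counts and a 95% band from the simulated networks."""
    n = Y_obs.shape[0]

    def out_degree_counts(Y):
        # histogram of out-degrees, padded so all networks align
        return np.bincount(np.asarray(Y).sum(axis=1).astype(int), minlength=n)

    obs = out_degree_counts(Y_obs)
    sims = np.stack([out_degree_counts(Y) for Y in simulated])
    low, high = np.percentile(sims, [2.5, 97.5], axis=0)
    return obs, low, high
```

A well-fitting model keeps the observed counts inside the simulated band at most degrees, which is what the black/gray-versus-red comparison in Fig. 6 displays graphically.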
The Bayesian goodness-of-fit diagnostic in Fig. 6 evaluates the model implemented in the previous section. The observed network is compared with 300 network samples randomly drawn from the estimated posterior distribution, using 50 auxiliary iterations for the network-simulation step. Figure 6 illustrates the summary results of these 300 generated graphs in black and gray, alongside the original network in red. The comparison reveals a strong alignment in high-level characteristics that are not explicitly modeled, indicating that the posterior mean obtained through BALERGM accurately generates networks with corresponding structures.

Discussion
The Bayesian adaptive lasso exponential random graph model (BALERGM) offers several notable advantages for network analysis. One key advantage is its ability to perform automatic variable selection, facilitated by the integration of the lasso regularization technique. By employing the lasso penalty, BALERGM effectively identifies and emphasizes the most relevant network parameters while shrinking the influence of less significant ones toward zero. This streamlines the modeling process, extracts valuable insights from intricate network data, and enhances the interpretability of the results. Moreover, the lasso penalty promotes sparsity in the parameter estimates, yielding a more parsimonious model that aids in discerning the influential factors governing network behavior. Another compelling advantage of BALERGM is its superior adaptive estimation performance: by adaptively adjusting the penalty for each parameter, the model adapts quickly to the data, focusing on the most relevant network parameters and capturing underlying patterns and relationships more effectively. Researchers can readily select key network factors based on their significance levels, providing valuable insights and actionable knowledge. We have shown the effectiveness of the proposed algorithm in simulation and compared its performance against the currently popular BERGM method. One promising direction for future work is the development of more general penalized forms for network analysis. While the lasso penalty has demonstrated its efficacy in variable selection and sparsity promotion, incorporating ridge-type penalty distributions could offer additional benefits: combining the strengths of the lasso and ridge penalties would strike a balance between model complexity and over- or underfitting, leading to more robust parameter estimation.
If a random variable x ∼ Inverse-Gaussian(µ, λ), that is,

f(x) = √(λ / (2π x³)) exp( −λ(x − µ)² / (2µ² x) ),  x > 0,

then with the change of variable w = x⁻¹, so that x = 1/w and |dx/dw| = w⁻², we can find the density f′ of w = x⁻¹ as

f′(w) = f(1/w) w⁻² = √(λ / (2π w)) exp( −λ(1 − µw)² / (2µ² w) ),  w > 0.

Hence we can rewrite equation 27 as the reciprocal of the Inverse Gaussian distribution.

Fig. 1
Fig. 1 The parallel ADS move of θ_h is generated based on the difference of the states θ_h1 and θ_h2 in other Markov chains; ε is a random error term

Fig. 2
Fig. 2 Generated in R, this plot shows the clustering of friendships among students in the same grade

Fig. 3
Fig. 3 MCMC output: Distribution of samples on the left, the trace of samples in the center, autocorrelation plot on the right

Fig. 4

Fig. 5
Fig. 5 Generated in R, this plot shows results of asking "Have you heard of a group called Sumter County Active Lifestyles (SCAL)?"

Fig. 6
Fig. 6 Bayesian goodness of fit diagnostic for the estimated parameter posterior distribution for BALERGM model on SCAL dataset

Table 2
Methods for sampling:
Method A: Full Bayes with λ_j ∼ Gamma(r, δ_j), j = 1, 2, …, p
Method B: Partial empirical Bayes with an empirical update of δ = (δ_1, δ_2, …, δ_p)
Method C: Full empirical Bayes with an empirical update of λ = (λ_1, λ_2, …, λ_p)

Table 3
Summary table for the connections among different grades. The i-j position in this table shows the number of connections from Grade i to Grade j, i, j = 7, 8, …, 12

Table 5
Results of simulating 100 graphs and comparing results for BERGM and BALERGM using means as the estimates of θ. (a) Chosen true value of the parameter for each simulated graph; (b) mean of MCMC outputs; (c) quantiles from MCMC output; (d) median of MCMC outputs

Table 6
Results of BALERGM using formula y ∼ edges + nodematch("Grade") + nodematch("Wealth")

The SCAL network has nodes representing survey respondents, edges formed by survey sharing, and nodal characteristics taken from the survey results. The final network has 80 nodes with data for 30 questions per respondent. Figure 5 is the network plot labeled with one of the 30 questions: "Have you heard of a group called Sumter County Active Lifestyles (SCAL)?" The survey was intended as a brief but broad look at self-reported health benchmarks, with questions covering the demographic characteristics of the respondents.

Table 7
Results from BALERGM with variable selection on SCAL data

Table 8
Variable selection with different tolerance levels