
Approximate inference for longitudinal mechanistic HIV contact network

Abstract

Network models are increasingly used to study infectious disease spread. Exponential Random Graph models have a history in this area, with scalable inference methods now available. An alternative approach uses mechanistic network models. Mechanistic network models directly capture individual behaviors, making them suitable for studying sexually transmitted diseases. Combining mechanistic models with Approximate Bayesian Computation allows flexible modeling using domain-specific interaction rules among agents, avoiding network model oversimplifications. These models are ideal for longitudinal settings as they explicitly incorporate network evolution over time. We implemented a discrete-time version of a previously published continuous-time model of evolving contact networks for men who have sex with men and proposed an ABC-based approximate inference scheme for it. As expected, we found that a two-wave longitudinal study design improves the accuracy of inference compared to a cross-sectional design. However, the gains in precision in collecting data twice, up to 18%, depend on the spacing of the two waves and are sensitive to the choice of summary statistics. In addition to methodological developments, our results inform the design of future longitudinal network studies in sexually transmitted diseases, specifically in terms of what data to collect from participants and when to do so.

Introduction

Networks are used to study a range of systems with interactions or dependencies among their agents, such as the behavior in supply chains and the stock market (Macal et al. 2004), protein-protein interactions in biological systems (Scholtens and Gentleman 2005), and disease transmission on local and global scales (Le et al. 2022). In the study of disease transmission dynamics, the contact structure of a population can be naturally represented as a network, and this representation is especially useful if the contacts persist over time, as is often the case for sexual interactions. Disease dynamics are then driven by interactions (represented by edges) among susceptible and infectious individuals (represented by nodes). More generally, many of these systems arise from stochastic processes forming or dissolving interactions over time that must be accounted for when doing inference.

There are (at least) two main paradigms of network models: statistical and mechanistic. Statistical network models prioritize tractable likelihoods to facilitate inference at the expense of model flexibility. For example, the Erdős–Rényi graph, also known as the Bernoulli random graph, assumes that each node pair is connected independently and with identical probability. Hence, the likelihood of the number of edges is the standard binomial likelihood with a fixed number of nodes, and inference readily follows because a graph is completely identified by its node and edge sets. It also follows that the Erdős–Rényi graph has a binomial (approximately Poisson) degree distribution.
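To make the binomial likelihood concrete, the following is a short worked form for an undirected graph. The symbols p (the common edge probability), E (the observed number of edges), and N (the number of node pairs) are introduced here for illustration only; this is a standard textbook result rather than something derived in this paper.

$$\begin{aligned} N = \binom{n}{2}, \qquad L(p) = \binom{N}{E} p^{E} (1-p)^{N-E}, \qquad \hat{p} = \frac{E}{N}. \end{aligned}$$

Under the same assumptions, the degree of any single node is Binomial\((n-1, p)\), which is close to a Poisson distribution with mean \((n-1)p\) when n is large and p is small.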

This generative mechanism however clearly does not map well to most real-world networks. This is easily seen in the World Wide Web (WWW). In this scenario, each website is represented by a node and a directed connection (hyperlink) between websites occurs when one links to the other. Unlike Erdős–Rényi networks, where the degree distribution follows a binomial distribution, the degree distribution here follows a power-law where more successful websites tend to grow their connections faster than others (Adamic and Huberman 2000). Exponential Random Graph Models (ERGMs) are generalizations of the Erdős–Rényi model. They represent a probability distribution of graphs on a fixed node set, where the probability of observing a graph is dependent on the presence of the various configurations specified by the model (Robins et al. 2007). A typical graph in this distribution can be interpreted as the aggregate of the local configurations, and slight errors in estimating the local configuration counts can alter beliefs about the distribution (Goyal and Onnela 2020).

Mechanistic models assume that the observed network is generated by a small set of mechanistic rules. The canonical example is the Barabási-Albert (BA) model. Nodes are added one by one to a growing network and each node connects to m previously existing nodes with probability proportional to the nodes’ current degree (Albert and Barabási 2002). This so-called preferential attachment mechanism readily generates power-law degree distributions, which are a type of broad-tailed degree distribution that are characteristic of many empirical networks, including that of the WWW. Apart from the target number of nodes, n, the classic BA model only has one free parameter, m. In this case, the fully grown graph has approximately nm edges, and m can be inferred by dividing the number of edges by the number of nodes n, i.e., m is approximately equal to the average degree of the network. However, even for moderately complex models, the likelihood of the full network becomes intractable due to the fact that the insertion order of the nodes is (usually) not known. Because the graph is sequentially dependent on the previous iteration as it grows, the number of possible graph realizations grows exponentially with the number of added nodes.
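The edge-count estimate of m described above can be checked with a small simulation. The sketch below is a minimal preferential attachment generator written for illustration; the function name `barabasi_albert_edges` and the seeding choice (starting from m isolated nodes) are assumptions made here, not details taken from Albert and Barabási (2002).

```python
import random


def barabasi_albert_edges(n, m, seed=0):
    """Grow a BA-style graph on n nodes and return its edge list.

    Assumption for this sketch: the graph starts from m isolated nodes and
    the first new node connects to all of them.
    """
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # nodes the next new node will attach to
    ends = []                 # every node appears once per incident edge end
    for new in range(m, n):
        for t in targets:
            edges.append((new, t))
        ends.extend(targets)
        ends.extend([new] * m)
        # A uniform draw from `ends` is a degree-proportional draw over nodes.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(ends))
        targets = list(chosen)
    return edges


n, m = 2000, 3
edges = barabasi_albert_edges(n, m)
m_hat = len(edges) / n  # edges per node, close to m when n is much larger than m
print(len(edges), round(m_hat, 4))
```

Because each of the n - m added nodes contributes exactly m edges in this sketch, the estimate equals m(n - m)/n, which approaches m as n grows.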

Networks have provided insights to major public health problems such as the spread of HIV (Wertheim et al. 2011), the opioid crisis (Aroke et al. 2022), and interventions with people who inject drugs (Rolls et al. 2013). Wertheim et al. (2011) noted that HIV is an evolving disease and constructed a disease transmission network using gene sequencing by tracking the evolutionary path of the virus and inferring edges by measuring the similarity of the virus within different individuals. Using an inferred transmission network, they developed a network statistic that was able to detect community level effects of HIV in a clinical trial setting that could help thwart future infections. Aroke et al. (2022) showed the benefit of peer influence and concluded that individuals who have a diagnosis of opioid use disorder or use many prescribers may help promote positive health behaviors in an opioid prescription network due to the influence of their direct peers on the network structure. They came to this conclusion by showing the type of opioid that an individual uses and their number of prescribers were identified as significant predictors of high betweenness centrality giving them influence over the network at large. Rolls et al. (2013) model network data involving people who inject drugs, using validation techniques, so that these networks can be simulated and intervention strategies could be explored.

There are several mechanistic models for studying contact networks of men who have sex with men (MSM) and their impact on HIV transmission (Birkett et al. 2015; Mei et al. 2010; Hansson et al. 2019). Birkett et al. (2015) used a data-driven simulation model to understand the impact of network-level mechanisms and STI infections on the spread of HIV among Young Men who have Sex with Men (YMSM). Mei et al. (2010) introduced the concept of a Complex Agent Network (CAN) to model HIV epidemics by combining agent-based modelling and complex networks. An especially interesting model was introduced by Hansson et al. (2019) to study the role of casual contacts in the HIV epidemic in Stockholm, Sweden. Their research was used to recommend interventions to reduce transmission rates. Padeniya (2021) notes Hansson and others’ contribution to intervention strategies as they sought to mathematically model the role of female-sex-worker-client interactions for gonorrhoea transmission. Vajdi et al. (2020) noted Hansson’s choice to model instantaneous casual relationships, and investigated a dynamic model for casual relationships, a two-layer temporal network model, and SIS mean-field equations. A common approach for inference in these papers is to propose mechanisms for contact formation, simulate the spread of disease on the network, and modify parameter values to match disease prevalence to that observed in their respective populations, without directly validating the proposed mechanisms.

Most scientific studies involving human subjects can be divided into cross-sectional and longitudinal. In cross-sectional studies, measurements are obtained at only a single point in time. The distinguishing feature of longitudinal studies is that the study participants are measured repeatedly (at least twice) throughout the duration of the study, thereby permitting the direct assessment of changes in the response variable over time (Fitzmaurice et al. 2012). To illustrate, participants in a cross-sectional study likely vary in age; however, this type of design cannot be used to study the effect of aging because the effect of aging is potentially confounded with cohort effects. It is important to note here that although we are sampling the evolving network at multiple time points, we are only asking participant information that can maintain privacy.

One example of a longitudinal network study is the work by Birkett et al. (2015). The authors studied the impact of network-level mechanisms and STI infections on the spread of HIV and found that both play a significant role in the spread of HIV and in racial disparities among YMSM. Their work shows HIV prevention efforts should target YMSM across race, and interventions focusing on YMSM partnerships with older MSM might be highly effective. In general, one would expect observing a network multiple times to provide more information, and therefore improve accuracy of inference, compared to observing the network just once. In addition, one may address questions that can only be interrogated in a longitudinal study. When growing a network in a simulation, we can track every iteration of the dynamic network and have arbitrarily many observations at our disposal. In an actual study, one is of course constrained by resources and logistics. If the data are obtained from self-administered or staff-administered surveys, too frequent reporting may lead to participant burden and reduce participants' willingness to continue, whereas too infrequent reporting may lead to recall bias and to participants being lost to follow-up. For example, a person may not remember each individual whom they dated over a 5-year period and may not be able to reliably recall the timing of the relationships. Collecting data at different time points that are optimally spaced helps alleviate recall bias while still maintaining an avenue for accurate inference.

Our goal in this paper is to implement a discrete-time version of the mechanistic network model introduced by Hansson et al. and use the model to identify optimal spacing between two data collection points (waves) in a longitudinal network study such that we can achieve the dual goal of accurate inference (learning model parameters as precisely as possible) while minimizing participant burden (using network features that in practice could be elicited from participants with a minimal number of survey questions). These results have implications for study design for HIV and other sexually transmitted diseases, and more broadly they can inform other research questions involving (longitudinal) network data.

This paper is structured as follows. We discuss the discrete-time mechanistic network model in Sect. 2.1 and explain our ABC-based approach to approximate parameter inference in Sect. 2.2. We show our results in Sect. 3 and conclude with a discussion in Sect. 4.

Methods

Mechanistic network model

As noted in the Introduction, there are several mechanistic models for MSM contact formation in specific populations. We focus on the mechanistic model introduced to study MSM contact networks in Stockholm, Sweden (Hansson et al. 2019). The model incorporates specific behaviors that guide the formation and dissolution of sexual contacts as well as migration of individuals in and out of the population. While the original model was formulated in continuous time, we consider a discrete time version of the model. This means that rates in the original formulation correspond to probabilities in ours. We note that as the number of the potential discrete time events tends to infinity and the event probabilities tend to zero, our formulation of the model converges to the original. Throughout this paper, each discrete model time step iteration is taken to correspond to one calendar month, and all events are recorded at the end of each iteration. While a constant number of individuals enter the population at each iteration, each individual leaves the population with a fixed probability at each iteration. The size of the network therefore fluctuates around n nodes, where n is the initial number of nodes in the network.
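The convergence claim can be illustrated with a standard limit. The symbols \(\lambda\) (a monthly event rate), k (the number of equal sub-steps per month), and T (the waiting time until the event, in months) are introduced here for illustration and are not part of the original model description. If the event occurs with probability \(\lambda /k\) in each sub-step, then for t a multiple of 1/k

$$\begin{aligned} P(T > t) = \left( 1 - \frac{\lambda }{k} \right) ^{kt} \longrightarrow e^{-\lambda t} \quad \text{as } k \rightarrow \infty . \end{aligned}$$

With k = 1, which matches the one-month step used here, T is geometric with mean \(1/\lambda\) months (for \(\lambda \le 1\)), and in the limit it is exponential with the same mean.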

The model incorporates two types of partnerships: steady and casual. Casual relationships are defined to last only one iteration, while steady relationships have the potential to last longer. An individual can have at most one steady partner at any given time. The probability that a single person enters a steady relationship at a given iteration is \(\rho P_0\), where \(P_0\) is the proportion of single individuals in the present iteration. In the original model, where \(\rho\) is a rate of steady partnership formation, \(P_0\) fluctuates around an equilibrium; in our version, we fix this parameter and absorb it into \(\rho\) for simplicity and to improve identification of model parameters. Our modified probability of a single person entering a steady relationship at a given iteration is therefore \(\rho\). While the number of people willing to form relationships varies from iteration to iteration, the probability of a single person joining a relationship stays the same. In Hansson et al. (2019), the differential equation formulation of the model explicitly considers the fluctuation of the likelihood of new relationships, while we do it implicitly as the number of singles changes. The probability of a steady relationship dissolving at each iteration is \(\sigma\).

In addition to a steady relationship, an individual may also have one casual partnership at each iteration. These casual relationships may occur alongside steady partnerships or during times when the person is single. A single individual enters a casual relationship with probability \(\omega _0\) while an individual who is currently in a steady relationship forms a casual relationship with probability \(\omega _1\). For any partnership to form, both individuals must be willing to join that relationship. In the scenario where an odd number of individuals would like to form a relationship, one of them (chosen at random) is left out. Each person migrates from the population with probability \(\mu\), and individuals enter the population at constant rate \(n\mu\). The migration of an individual and the formation and dissolution of sexual contacts are all determined by the outcome of independent Bernoulli trials. In the original continuous time formulation of the model, the duration of steady relationships and the time spent in the population both follow exponential distributions. In contrast, for our discrete time formulation both are geometrically distributed. Starting from an empty graph with n nodes, we first run the model until we are confident that it has converged to the target distribution. We set our migration probability to 0 to ensure we are sampling individuals longitudinally and to maintain a closed cohort design. We note that the ’constant’ number of nodes being added depends only on \(\mu\) and n and is easily recoverable. The model is described in Algorithm 1 and a few graph realizations from the model are illustrated in Fig. 1. We chose 1000 iterations to ensure we are past the burn-in (Hansson et al. 2019).

Fig. 1

Network visualizations containing cumulative (from iteration 1) steady (red dashed) and casual (blue solid) edges for iterations 1 (left), 6 (middle), and 12 (right). We used the following parameter values: \(\mu\) = 0, \(\rho\) = 0.3, \(\sigma\) = 0.1, \(\omega _1\) = 0.2, \(\omega _0\) = 0.4

Algorithm 1

Hansson MSM model (Hansson et al. 2019)
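The verbal description of the model can be sketched in code as follows. This is a minimal illustration and not a reproduction of Algorithm 1: the function name `simulate`, the order of events within an iteration (dissolution, then steady formation, then casual formation), the population size in the usage example, and the omission of migration (consistent with \(\mu\) = 0) are all assumptions made here. As a further simplification, casual pairing does not check whether two people are already steady partners.

```python
import random


def simulate(n, rho, sigma, omega0, omega1, steps, seed=0):
    """Discrete-time sketch of steady and casual partnership dynamics (mu = 0)."""
    rng = random.Random(seed)
    partner = [None] * n  # steady partner of each node, or None if single
    casual_log = []       # casual pairs formed at each iteration
    for _ in range(steps):
        # 1. Each existing steady partnership dissolves with probability sigma.
        for i in range(n):
            j = partner[i]
            if j is not None and i < j and rng.random() < sigma:
                partner[i] = partner[j] = None
        # 2. Each single person is willing to go steady with probability rho;
        #    willing people are paired at random, an odd one is left out.
        willing = [i for i in range(n)
                   if partner[i] is None and rng.random() < rho]
        rng.shuffle(willing)
        for a, b in zip(willing[0::2], willing[1::2]):
            partner[a], partner[b] = b, a
        # 3. Casual contacts: probability omega0 if single, omega1 if in a
        #    steady relationship; each contact lasts only this iteration.
        willing = [i for i in range(n)
                   if rng.random() < (omega0 if partner[i] is None else omega1)]
        rng.shuffle(willing)
        casual_log.append(list(zip(willing[0::2], willing[1::2])))
    return partner, casual_log


# Parameter values from the Fig. 1 caption; n = 200 and 120 steps are arbitrary.
partner, casual_log = simulate(200, rho=0.3, sigma=0.1, omega0=0.4,
                               omega1=0.2, steps=120, seed=1)
in_steady = sum(p is not None for p in partner) / len(partner)
print(f"fraction in a steady relationship: {in_steady:.2f}")
```

Under this event order a person whose steady relationship dissolves can re-partner in the same iteration; a different ordering would be an equally plausible reading of the text.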

Inference of model parameters

In Bayesian inference, complete knowledge of the model parameters, given the observed data, is contained in the posterior distribution. In mechanistic models, the complexity of the model typically means that the likelihood and the corresponding posterior distribution are not available in closed form. One can nevertheless forward simulate data from the model given parameter values, and these parameter values may be drawn from a prior distribution. Approximate Bayesian Computation (ABC) is an inference framework that has been developed to deal with models that have intractable likelihoods. There are several ABC methods to generate samples from an approximate posterior distribution. For clarity of our objective, we use the simple accept/reject algorithm. ABC accept/reject operates by generating simulated data sets using parameter values drawn from a prior distribution and comparing the simulated data to the observed data using a distance metric. If the discrepancy between simulated and observed data falls below a predefined threshold, the parameter values corresponding to the simulated data are accepted as samples from the approximate posterior distribution. Otherwise, they are rejected. The lower the threshold, the better the theoretical approximation. If we only kept parameter values that reproduced the observed data exactly, this approach would recover the exact posterior for discrete data. This is however impossible for continuous data because the probability of sampling a continuous value exactly is 0. Summary statistics are used to capture key features of the data and simplify the comparisons of high dimensional data. If the summary statistics are sufficient for the parameters, then the accepted parameter values still act as samples from the posterior distribution of the parameters governing the model. Sufficiency of the summary statistics can be difficult to achieve; however, for a reasonable approximation the summary statistics must be informative of the parameters (Csilléry et al. 2010).
To ensure that the support of our prior distribution is realistic for MSM relationship characteristics, we place uniform priors on the inverse parameters, which can be interpreted as average durations: the average time for people to be open to joining a relationship, \(\frac{1}{\rho }\), on [1 month, 50 months]; the average time a steady relationship lasts, \(\frac{1}{\sigma }\), on [1 month, 90 months]; the average time for a single individual to partake in a casual relationship, \(\frac{1}{\omega _0}\), on [1 month, 40 months]; and the average time for an individual in a relationship to partake in a casual relationship, \(\frac{1}{\omega _1}\), on [1 month, 61 months]. Figure 2 shows the prior distributions on these inverse parameters and the corresponding implied prior distributions on the parameters themselves. We recognize a variety of definitions for steady and casual relationships in MSM contact networks, as well as a variety of estimates for the support of each duration (Malone et al. 2018; Down et al. 2017; de Vroome et al. 2000; Wall et al. 2013; Davidovich 2006; Weiss et al. 2020; Bavinton et al. 2016; Myers et al. 1999). We chose our support to be consistent with the data the model was originally trained on (Hansson et al. 2019), and calculate the reciprocal of each value sampled from the prior as an input to our model.
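Sampling from these priors amounts to drawing each average duration uniformly and taking its reciprocal. The sketch below shows one way to do this; the names `INVERSE_PRIOR_MONTHS` and `sample_parameters` are hypothetical, while the interval endpoints are the ones stated in the text. By a change of variables, a uniform prior on a duration over [a, b] implies the density \(1/((b-a)\theta ^2)\) for the probability \(\theta\) on [1/b, 1/a], so the implied priors place most of their mass on small probabilities.

```python
import random

# Prior support for the inverse parameters, in months (endpoints from the text).
INVERSE_PRIOR_MONTHS = {
    "rho": (1.0, 50.0),
    "sigma": (1.0, 90.0),
    "omega0": (1.0, 40.0),
    "omega1": (1.0, 61.0),
}


def sample_parameters(rng):
    """Draw each mean duration uniformly, then invert it to a monthly probability."""
    return {
        name: 1.0 / rng.uniform(lo, hi)
        for name, (lo, hi) in INVERSE_PRIOR_MONTHS.items()
    }


rng = random.Random(42)
draw = sample_parameters(rng)
print(draw)
```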

Fig. 2

Prior distributions on the inverse parameters and the corresponding implied prior distributions on parameters themselves. The top row shows the distributions of the inverse parameters, which can be interpreted as distributions of the average values of geometric distributions. The bottom row shows the distributions of the parameter values themselves for our discrete time mechanistic network model for the following parameters: \(\rho\), \(\sigma\), \(\omega _1\), \(\omega _0\)

There are at least three major considerations in the ABC accept/reject framework: summary statistics, distance measure, and similarity threshold (Sisson et al. 2018). Given the mechanistic network model of interest, we manually chose a set of summary statistics needed for inference, which renders the network space more manageable (Sisson et al. 2018). The choice of network summary statistics was guided by the principle that it should be possible to obtain this information from study participants using a questionnaire and they should be informative of the model parameters. At a minimum, one needs at least as many summary statistics as there are parameters to be inferred (Sisson et al. 2018). Although the model has five parameters (six if one counts n), as previously mentioned, we opted to fix one of them, the migration probability \(\mu =0\). This leaves us with four parameters to recover: probability of a single person entering a steady relationship \(\rho\), probability of dissolving a steady relationship \(\sigma\), probability of a single individual to enter a casual relationship \(\omega _0\), and probability of an individual in steady relationship to enter a casual relationship \(\omega _1\). More information on parameters can be found in Table 1. We chose the four summaries, denoted \(s_1\) through \(s_4\), as listed in Table 2. We chose summaries that could be elicited by asking participants to consider their sexual history in the past year only. Longer histories could potentially be more informative, but longer look-back periods would likely increase recall bias.

Since the mechanistic network model is outside the exponential family, we have no guarantee of sufficiency, i.e., that our summary statistics fully summarize our network. However, we still require the summary statistics to be informative of the model parameters. An informal way to assess the extent of informativeness is to investigate plots of network summary statistics against model parameters. We denote these relationships as \(s_i(\theta )\) where \(i \in \{1, 2, 3, 4\}\) and \(\theta \in \{\rho , \sigma , \omega _0, \omega _1 \}\). We refer to these relationships as mapping functions, and we estimate them using simulations where \(s_{i, k}(\theta )\) represents the value of summary statistic \(s_i\) with respect to generative parameter \(\theta\) in simulation run k. The value of \(s_i(\theta )\) is given as the median value of \(s_{i, k}(\theta )\) taken across all simulations k.

We measured the distance between a simulated network and the observed network by calculating the Euclidean distance within the normalized summary statistic space. The normalized summary statistic value is obtained by first subtracting the mean of the summary statistic from each value and then dividing each value by the standard deviation of the summary statistic. We populated an ABC reference table for each lag by generating 10,000 graphs by sampling parameters from their joint priors and varying the lag between the two network observations between zero and 150 iterations. Next, taking a sample per lag from our joint prior density and its corresponding graph as our ground truth, we simulated samples from the corresponding approximate posterior distribution. We retained the parameters associated with the 100 (top 1%) smallest distances in the normalized summary statistic space between the observed and generated graphs.
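The accept/reject step with a normalized Euclidean distance can be sketched as below. The function `abc_reject` and the toy two-parameter simulator are hypothetical stand-ins used only to make the example runnable; the toy simulator is not the contact network model, and the use of the population standard deviation for normalization is an assumption, since the text does not say which variant was used.

```python
import math
import random
import statistics


def abc_reject(ref_params, ref_summaries, obs_summary, keep):
    """Keep the `keep` reference rows whose normalized summaries are closest
    to the observed summaries; returns (parameters, distance) pairs."""
    dims = range(len(obs_summary))
    # Column-wise mean and standard deviation of the reference summaries.
    means = [statistics.mean([row[d] for row in ref_summaries]) for d in dims]
    sds = [statistics.pstdev([row[d] for row in ref_summaries]) for d in dims]

    def normalize(summary):
        return [(summary[d] - means[d]) / sds[d] for d in dims]

    z_obs = normalize(obs_summary)
    scored = sorted(
        (math.dist(normalize(s), z_obs), i) for i, s in enumerate(ref_summaries)
    )
    return [(ref_params[i], dist) for dist, i in scored[:keep]]


# Toy reference table: 10,000 prior draws and their noisy summaries.
rng = random.Random(0)
ref_params = [(rng.random(), rng.random()) for _ in range(10_000)]
ref_summaries = [
    (a + rng.gauss(0, 0.05), b + rng.gauss(0, 0.05)) for a, b in ref_params
]
obs_summary = (0.3, 0.7)

accepted = abc_reject(ref_params, ref_summaries, obs_summary, keep=100)  # top 1%
```

Fixing the number of retained draws, rather than fixing a distance threshold, means the effective threshold adapts to each observed data set.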

Finally, we performed a regression adjustment on samples from the approximate posterior distribution (Beaumont 2019; Beaumont et al. 2002). The goal of the regression adjustment is to improve our ABC posterior’s convergence to our target posterior. The basis of the method is that we can obtain an estimate of our expected parameter values given the summaries using linear regression in the localized neighborhood around our observed data that we get from the approximate posterior. Then, we can use this relationship to adjust our approximate posterior distribution (Beaumont 2019). For clarity, when investigating the parameters’ impact on our summary statistics, the relationship is often complex across the entire range of the prior space. After implementing the accept/reject method, we reduce the range of the potential parameter values to a localized region in the parameter space consistent with the observed data. This restriction removes some of the complexity associated with the overall relationship. We then implement our regression model to train the simpler relationship between the parameters and summaries within this localized region using the results from the accept/reject algorithm. Using this trained model, we then utilize the summary statistics from the observed data to center the approximate posterior samples on the expected observed parameters given the summaries. Next, we alter this centering by adding each approximate posterior’s parameter and corresponding summary statistic’s prediction error, which it contributed to the training model, to the expected centering. If the relationship between the posterior parameters and summary statistics is perfectly trained, there is no error to add, and the expected observed parameter given the observed summary statistics is taken as the entire posterior. 
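The adjustment described above can be sketched for the simplest case of one parameter and one summary statistic. The function `regression_adjust` is a hypothetical name, and this sketch uses an unweighted least squares fit on the accepted draws, omitting any kernel weighting and the multivariate case. Adding each draw's residual to the fitted value at the observed summary is algebraically the same as shifting each draw by the slope times its summary's offset from the observation.

```python
def regression_adjust(thetas, summaries, s_obs):
    """Linear regression adjustment of accepted ABC draws (scalar case).

    thetas:    accepted parameter values
    summaries: the summary statistic simulated for each accepted value
    s_obs:     the observed summary statistic
    """
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_s = sum(summaries) / n
    sxx = sum((s - mean_s) ** 2 for s in summaries)
    sxy = sum((s - mean_s) * (t - mean_t) for s, t in zip(summaries, thetas))
    slope = sxy / sxx
    # fitted(s_obs) + residual_i simplifies to theta_i - slope * (s_i - s_obs).
    return [t - slope * (s - s_obs) for t, s in zip(thetas, summaries)]


# Perfectly linear toy relationship theta = 2 * s + 1: every adjusted draw
# collapses onto the fitted value at s_obs, as the text describes.
adjusted = regression_adjust([1.0, 3.0, 5.0], [0.0, 1.0, 2.0], s_obs=1.5)
print(adjusted)
```

Because the adjustment is linear, it can move draws outside the valid range of a probability, so adjusted values may need to be checked against the interval [0, 1].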
We then normalized the parameters in our reference table, and utilized the root mean squared error (RMSE) of the approximated posteriors for a fixed set of 500 ground truth parameters to measure accuracy of inference. Then, we averaged over all 500 parameter sets for an estimate of the RMSE, for a given lag, over our prior space (Fearnhead and Prangle 2012). For clarity, consider \(\theta _{i}\) as the ith ground truth parameter and \(\hat{\theta }_{i,k}\) as the kth sample from the approximate posterior estimating \(\theta _{i}\). Our estimate of RMSE is then

$$\begin{aligned} \widehat{\textrm{RMSE}} = \frac{1}{500} \sum _{i = 1}^{500} \sqrt{\frac{1}{100} \sum _{k = 1}^{100} (\theta _{i} - \hat{\theta }_{i, k})^2}. \end{aligned}$$
(1)

Finally, we fitted a locally weighted regression (loess) with a 95% confidence interval to the data. We note that while the values of the parameters in the reference table are normalized, the resulting approximate posterior distributions are not. Individually normalizing the reference table parameters is useful because it places all parameters on the same scale when calculating the RMSE. However, the posteriors are displayed on the original scale for ease of interpretation.
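Equation (1) can be sketched for a single scalar parameter as follows. The function name `average_rmse` and the tiny input are illustrative assumptions; in the study the outer average runs over 500 ground truth values and the inner one over 100 posterior samples, and how the four normalized parameters are combined into a total is not specified here.

```python
import math


def average_rmse(truths, posterior_samples):
    """Average, over ground truths, of the RMSE of each approximate posterior.

    truths:            list of true parameter values theta_i
    posterior_samples: list of lists; posterior_samples[i] holds the samples
                       theta_hat_{i,k} estimating truths[i]
    """
    per_truth = [
        math.sqrt(sum((t - s) ** 2 for s in samples) / len(samples))
        for t, samples in zip(truths, posterior_samples)
    ]
    return sum(per_truth) / len(per_truth)


# First posterior is off by 1 on every sample, second is exact.
value = average_rmse([0.0, 1.0], [[1.0, -1.0], [1.0, 1.0]])
print(value)
```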

Table 1 Parameters and their interpretation
Table 2 Summary statistic descriptions

Results

We evaluated the mapping functions on a grid along the unit interval by generating 100 graphs per parameter value and using box plots to summarize the results. We investigated mapping functions in two different scenarios. First, we varied each parameter in turn while keeping all others fixed at the values reported in (Hansson et al. 2019), i.e., we fixed \(\rho\) = 0.3, \(\sigma\) = 0.1, \(\omega _0\) = 0.4, and \(\omega _1\) = 0.2, and we also set \(\mu\) = 0. Since our primary objective is to introduce a new technique for analyzing network data collected over multiple waves, rather than to offer recommendations regarding HIV, we chose to set the migration parameter to zero. Nonetheless, in this context, we can confidently disregard the migration parameter because we are employing a longitudinal design in which we assume that all individuals are traceable. If we were to incorporate migration into the study, we would need to reevaluate the summary statistics we utilize and would likely need to shorten the optimal lag period, as turnover can lead to the loss of informative data. Second, we sampled each free parameter from its respective prior distribution. These plots were used to ensure that the chosen summary statistics are informative about the model parameters as can be seen in Figs. 3 and 4. We chose to fix our lag at 15 iterations to illustrate the general relationship between the parameters and summaries in an efficient manner. The relationship between the summaries and parameters, however, does change as a function of the lag.

While more summaries could be included, that would increase the computational burden and likely would not significantly increase accuracy. We considered several extra summaries during discovery, such as the size of the largest connected component, average path length, clustering coefficient, and transitivity. Since we would like to obtain the data from questionnaires, one also needs to consider participant burden: all else equal, we would like to ask as few questions as needed to address the scientific question at hand. It is also worth emphasizing that each of the listed summaries can be obtained using only privacy preserving questions in data collection, i.e., participants do not need to disclose their identity nor the identity of their steady or casual partners. This arguably improves the quality of the collected data as respondents would be expected to be more likely to report their behavior accurately. We note that while a regression adjustment generally improves the results, it can at times generate functionally impossible values, such as negative probabilities, or worsen our inference when the summaries do not accurately represent the network. On the rare occasion that the regression adjustment proposes a negative number, we opt to take a conservative approach and set the value at 0. In this study, we did not see any adjusted proposal probabilities above 1.

We visualized the regression adjusted approximate posteriors when looking at the graph once or twice with a lag of 50 iterations in Fig. 5. Furthermore, as expected, and as shown in Fig. 6, observing a network twice results in a smaller average error compared to observing a network only once. The improvement is largely driven by our ability to recover the \(\sigma\) and \(\rho\) parameters as shown in Fig. 7. We also see that the average error steadily decreases with the lag between the two network observations until about 40 to 50 iterations. This lag between the two network observations (data collection waves) is optimal in the sense that extending the gap further does not greatly increase accuracy of inference but does lengthen the duration of the study. In a closed cohort study, all else equal, the longer the duration of the study, the greater the expected attrition of study participants. Attrition of study participants in a setting like ours would lead to incomplete ascertainment of network structure and therefore introduce an additional error to network summary statistics. We also see that implementing a regression adjustment reduces our average error by nearly an additional 2.6%, while maintaining the overall trend and optimal lag. Finally, after a regression adjustment, collecting data twice at the optimal lag reduces the average error by roughly 62% relative to the average prior error and by roughly 18% relative to collecting data only once.
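As a check on these percentages, the relative reductions can be recomputed from the average RMSE values reported in the caption of Fig. 6 (2.22 for the prior, 1.03 for one observation, and 0.84 for two observations with a lag of 50 iterations):

$$\begin{aligned} \frac{2.22 - 0.84}{2.22} \approx 0.62, \qquad \frac{1.03 - 0.84}{1.03} \approx 0.18. \end{aligned}$$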

Fig. 3

Pairwise relationships between the model parameters (horizontal axes) and the summary statistics (vertical axes) used in our ABC inference scheme. Free parameters are fixed at \(\mu\) = 0, \(\rho\) = 0.3, \(\sigma\) = 0.1, \(\omega _0\) = 0.4, \(\omega _1\) = 0.2. The lag between two consecutive network observations is fixed at 15 iterations. Each box plot consists of 100 samples

Fig. 4

Pairwise relationships between the model parameters (horizontal axes) and the summary statistics (vertical axes) used in our ABC inference scheme. Free parameters are sampled from the prior distributions. The lag between two consecutive network observations is fixed at 15 iterations. Each box plot consists of 100 samples

Fig. 5

Approximate marginal posterior distributions of model parameters obtained by retaining the top 1% of proposed prior samples in our ABC accept/reject inference scheme. Different rows correspond to comparing the prior (top), observing the graph once (middle), and observing the graph twice with a lag of 50 iterations (bottom). All posteriors include a regression adjustment. The blue solid lines represent the 95% credible intervals and the red dotted lines represent the true parameter values

Fig. 6

Estimated average RMSE, where the average is taken across multiple network realizations, as a function of the lag between the two network observations. We also include a loess curve with a 95% confidence interval (shaded areas). The average prior RMSE is 2.22 (not shown), whereas the corresponding regression adjusted error is 1.03 for a network observed only once and 0.84 for a network observed twice with a lag of 50 iterations

Fig. 7

Estimated regression adjusted average RMSE for the total error (top curve) and separately for the four parameters considered in our study (bottom four curves). These results show that when observing a network twice, the reduction in total RMSE is mainly due to the reduction of RMSE for \(\rho\) and \(\sigma\)

Discussion

In this paper, we investigated the accuracy of an approximate inference scheme applied to an evolving mechanistic network model in a setting where the network, representing sexual contacts among people in a closed population, is observed at two different time points. As expected, observing the network twice improves the accuracy of inference, but this reduction in inferential error depends on the time lag between the two observations. Given that collection of real-world sexual network data is expensive and logistically challenging, it pays off to optimize the gap between the two time points to maximize accuracy of inference. If the two network observations are too close in time, there may have been only minimal changes in the network structure, and therefore the second observation adds little information. However, if the two network observations are too far apart in time, the study may be logistically difficult to carry out in practice and the population is likely to experience significant churn.

There are a total of six parameters in the model, but we fixed two of them to focus on a closed, fixed-size cohort. When considering the contribution of the remaining four parameters to inferential error, we observed that the \(\sigma\) (probability of dissolving a steady relationship) and \(\rho\) (probability of a single person entering a steady relationship) parameters benefited the most from the lag between the two network observations. This finding is intuitive as these two parameters influence multiple relationship iterations. However, \(\omega _{0}\) (probability of a single individual entering a casual relationship) and \(\omega _{1}\) (probability of an individual in a steady relationship entering a casual relationship) both correspond to one-time events and do not benefit as much from a lag. In particular, \(\omega _{1}\) is relatively accurate at all lags, while \(\omega _{0}\) would likely see more relative improvement through the consideration of another summary statistic.
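The intuition that \(\sigma\) and \(\rho\) act over multiple iterations can be illustrated with a deliberately simplified toy, which is not the model studied in this paper. We assume that each individual's steady-relationship status follows an independent two-state Markov chain, with probability \(\rho\) of entering and probability \(\sigma\) of leaving a steady relationship at each iteration. This ignores partner availability and casual contacts, so its numbers should not be compared with the lags reported here. Under that assumption, the correlation between an individual's status at two observations separated by a lag of k iterations is \((1 - \rho - \sigma)^k\). The function names below are hypothetical.

```python
import numpy as np


def steady_status_correlation(rho: float, sigma: float, lag: int) -> float:
    """Lag-k autocorrelation of a stationary two-state chain: (1 - rho - sigma) ** lag."""
    return (1.0 - rho - sigma) ** lag


def empirical_correlation(rho, sigma, lag, n_people=100_000, seed=0):
    """Simulate the toy chain and estimate the correlation between the two waves."""
    rng = np.random.default_rng(seed)
    # Start each person in the stationary distribution: P(steady) = rho / (rho + sigma).
    steady = rng.random(n_people) < rho / (rho + sigma)
    first_wave = steady.copy()
    for _ in range(lag):
        u = rng.random(n_people)
        # Steady individuals stay steady with probability 1 - sigma;
        # single individuals become steady with probability rho.
        steady = np.where(steady, u >= sigma, u < rho)
    return np.corrcoef(first_wave.astype(float), steady.astype(float))[0, 1]


# Illustrative values taken from the Fig. 3 caption (rho = 0.3, sigma = 0.1).
for lag in (1, 5, 15):
    print(lag, steady_status_correlation(0.3, 0.1, lag), empirical_correlation(0.3, 0.1, lag))
```

In this toy, a single cross-section only informs the ratio \(\rho / (\rho + \sigma)\), whereas the change between two waves also carries information about \(\rho + \sigma\). At very short lags the two waves are nearly identical, so the second wave adds little.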

The set of summary statistics that may be considered in inference depends on the information obtained from subjects through study questionnaires. The informativeness of the questions themselves depends on the mechanisms that drive contact formation in the study population. Depending on the mechanisms, it is possible that any set of individual-level questions (giving rise to so-called egocentric samples of the network) may be inadequate for network inference, and one may instead need information about the full network structure. While this type of network-level information could be obtained using a sociocentric design, doing so is very challenging, and we are aware of only one study that has implemented it in practice. The Likoma Network Study was based on a sociocentric survey of sexual partnerships that aimed to investigate the population-level structure of sexual networks connecting the young adult population of several villages on Likoma Island, Malawi (Helleringer and Kohler 2007). We stress that this notable study is cross-sectional and therefore corresponds to a one-time observation of the network (even if the data collection in this study occurred in two stages for logistical reasons). Obtaining two observations of the network would be logistically nearly impossible, and doing so in larger populations is not feasible.

Our results highlight the importance of using simulation to investigate the hypothesized generative mechanisms of network formation to inform future study designs, here specifically (1) what questions to ask so that maximally informative network summary statistics may be constructed and (2) how to space the two (or possibly more) data collection waves. For example, in our setting, introducing extensive migration in the population leads to a shorter optimal lag between the two network observations. Our approach is compatible with the recommended paradigm of using simulations for designing and interpreting intervention trials in infectious diseases, particularly with regard to emerging infectious diseases (Halloran et al. 2017). One of the main goals of such simulations is to more accurately reflect the dynamics of the transmission process. For sexually transmitted diseases, learning about the mechanisms of network formation is an important step in that direction.

In this paper, we have used basic ABC and basic regression adjustment techniques because our goal here is to see whether the ABC approach is effective in its simplest and most interpretable form. More refined variants of these methods, which can substantially improve computational performance, can be studied later on. Finally, at the time of writing, we came across related work on how design choices for egocentric network studies impact statistical estimation and inference for ERGMs (Krivitsky et al. 2022). This investigation is relevant to ours, although our focus is specifically on multiple observations of the evolving network. For a suitably chosen ERGM, i.e., an ERGM with reasonably simple dependence assumptions, it is possible to obtain sufficient summary statistics from egocentric network samples. This allows for exact statistical inference, but at the cost of making distributional assumptions that may not hold. For that reason, it is valuable for investigators to have various methods at their disposal so that they may choose the tool that best fits the scientific problem at hand.
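For readers unfamiliar with the two basic techniques, the following minimal sketch shows rejection ABC followed by a linear regression adjustment on a toy problem. It is not the implementation used in this paper. The one-parameter simulator, the uniform prior, the single summary statistic, and all function names are hypothetical, and the adjustment is an unweighted simplification of the local-linear approach of Beaumont et al. (2002). Only the 1% retention rate mirrors the caption of Fig. 5.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_summary(theta, n_pairs=200):
    """Hypothetical toy simulator: fraction of n_pairs that are 'active'."""
    return rng.binomial(n_pairs, theta) / n_pairs


def abc_rejection(observed, n_draws=20_000, keep_frac=0.01):
    """Rejection ABC: keep the prior draws whose simulated summaries are closest."""
    thetas = rng.uniform(0.0, 1.0, size=n_draws)   # draws from an assumed uniform prior
    summaries = simulate_summary(thetas)           # one simulated summary per draw
    distances = np.abs(summaries - observed)
    n_keep = int(keep_frac * n_draws)
    keep = np.argsort(distances)[:n_keep]          # indices of the closest draws
    return thetas[keep], summaries[keep]


def regression_adjust(thetas, summaries, observed):
    """Shift each retained draw along a fitted line to the observed summary."""
    centered = summaries - observed
    design = np.column_stack([np.ones_like(centered), centered])
    coef, *_ = np.linalg.lstsq(design, thetas, rcond=None)
    return thetas - coef[1] * centered             # remove the fitted linear trend


observed = 0.3  # hypothetical observed summary statistic
accepted, accepted_summaries = abc_rejection(observed)
adjusted = regression_adjust(accepted, accepted_summaries, observed)
print(accepted.mean(), accepted.std(), adjusted.mean(), adjusted.std())
```

Because the adjustment removes the fitted linear dependence of the retained parameters on the summary statistic, the adjusted sample can never be more dispersed than the unadjusted one. In a setting with several parameters and summaries, the same regression would be fitted for each parameter on the vector of summaries.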

Availability of data and materials

Data were simulated using a mechanistic model introduced to study MSM contact networks in Stockholm, Sweden (Hansson et al. 2019). Our code is accessible at: https://github.com/onnela-lab/longitudinal-inference.

References

  • Adamic LA, Huberman BA (2000) Power-law distribution of the world wide web. Science 287(5461):2115–2115

  • Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47

  • Aroke H, Katenka N, Kogut S, Buchanan A (2022) Network-based analysis of prescription opioids dispensing using exponential random graph models (ERGMs). In: Complex networks & their applications X: vol 2, proceedings of the tenth international conference on complex networks and their applications complex networks 2021 10, pp 716–730. Springer

  • Bavinton BR, Duncan D, Grierson J, Zablotska IB, Down IA, Grulich AE, Prestage GP (2016) The meaning of ‘regular partner’ in HIV research among gay and bisexual men: implications of an Australian cross-sectional survey. AIDS Behav 20(8):1777–1784

  • Beaumont MA (2019) Approximate Bayesian computation. Ann Rev Stat Appl 6:379–403

  • Beaumont MA, Zhang W, Balding DJ (2002) Approximate Bayesian computation in population genetics. Genetics 162(4):2025–2035

  • Birkett M, Armbruster B, Mustanski B (2015) A data-driven simulation of HIV spread among young men who have sex with men: the role of age and race mixing, and STIs. J Acquir Immune Defic Syndr 70(2):186

  • Csilléry K, Blum MG, Gaggiotti OE, François O (2010) Approximate Bayesian computation (ABC) in practice. Trends Ecol Evolut 25(7):410–418

  • Davidovich E (2006) Liaisons dangereuses: HIV risk behavior and prevention in steady gay relationships

  • Down I, Ellard J, Bavinton BR, Brown G, Prestage G (2017) In Australia, most HIV infections among gay and bisexual men are attributable to sex with ‘new’ partners. AIDS Behav 21(8):2543–2550

  • Fearnhead P, Prangle D (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J R Stat Soc Ser B (Stat Methodol) 74(3):419–474

  • Fitzmaurice GM, Laird NM, Ware JH (2012) Applied longitudinal analysis

  • Goyal R, Onnela J (2020) Framework for converting mechanistic network models to probabilistic models. arXiv:2001.08521

  • Halloran ME, Auranen K, Baird S, Basta NE, Bellan SE, Brookmeyer R, Cooper BS, DeGruttola V, Hughes JP, Lessler J (2017) Simulations for designing and interpreting intervention trials in infectious diseases. BMC Med 15(1):1–8

  • Hansson D, Leung KY, Britton T, Strömdahl S (2019) A dynamic network model to disentangle the roles of steady and casual partners for HIV transmission among MSM. Epidemics 27:66–76

  • Helleringer S, Kohler H-P (2007) Sexual network structure and the spread of HIV in Africa: evidence from Likoma Island, Malawi. AIDS 21(17):2323–2332

  • Krivitsky PN, Morris M, Bojanowski M (2022) Impact of survey design on estimation of exponential-family random graph models from egocentrically-sampled data. Soc Netw 69:22–34

  • Le T-M, Raynal L, Talbot O, Hambridge H, Drovandi C, Mira A, Mengersen K, Onnela J-P (2022) Framework for assessing and easing global COVID-19 travel restrictions. Sci Rep 12(1):1–13

  • Macal C, Sallach D, North M (2004) Emergent structures from trust relationships in supply chains. In: Proceedings of agent 2004: conference on social dynamics, pp 7–9

  • Malone J, Syvertsen JL, Johnson BE, Mimiaga MJ, Mayer KH, Bazzi AR (2018) Negotiating sexual safety in the era of biomedical HIV prevention: relationship dynamics among male couples using pre-exposure prophylaxis. Culture Health Sex 20(6):658–672

  • Mei S, Sloot PM, Quax R, Zhu Y, Wang W (2010) Complex agent networks explaining the HIV epidemic among homosexual men in Amsterdam. Math Comput Simul 80(5):1018–1030

  • Myers T, Allman D, Calzavara L, Morrison K, Marchand R, Major C (1999) Gay and bisexual men’s sexual partnerships and variations in risk behaviour

  • Padeniya SMTN (2021) Mathematical modelling to explore the role of the female-sex-worker-client interaction for gonorrhoea transmission and prevention among Australian heterosexuals. Ph.D. thesis, UNSW Sydney

  • Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2):173–191

  • Rolls DA, Wang P, Jenkinson R, Pattison PE, Robins GL, Sacks-Davis R, Daraganova G, Hellard M, McBryde E (2013) Modelling a disease-relevant contact network of people who inject drugs. Soc Netw 35(4):699–710

  • Scholtens D, Gentleman R (2005) Making sense of high-throughput protein-protein interaction data. Stat Appl Genet Mol Biol 3(1):39

  • Sisson SA, Fan Y, Beaumont M (2018) Handbook of approximate Bayesian computation

  • Vajdi A, Juher D, Saldaña J, Scoglio C (2020) A multilayer temporal network model for STD spreading accounting for permanent and casual partners. Sci Rep 10(1):1–12

  • Vroome EM, Stroebe W, Sandfort TG, Wit JB, Griensven GJ (2000) Safer sex in social context: individualistic and relational determinants of AIDS-preventive behavior among gay men. J Appl Soc Psychol 30(11):2322–2340

  • Wall KM, Stephenson R, Sullivan PS (2013) Frequency of sexual activity with most recent male partner among young, internet-using men who have sex with men in the United States. J Homosex 60(10):1520–1538

  • Weiss KM, Goodreau SM, Morris M, Prasad P, Ramaraju R, Sanchez T, Jenness SM (2020) Egocentric sexual networks of men who have sex with men in the United States: results from the ARTnet study. Epidemics 30:100386

  • Wertheim JO, Kosakovsky Pond SL, Little SJ, De Gruttola V (2011) Using HIV transmission networks to investigate community effects in HIV prevention trials. PloS ONE 6(11):27775

Acknowledgements

We would like to acknowledge John Quackenbush and Marcello Pagano for their thoughtful insights on summary statistic exploration.

Funding

NIH Award #R01AI138901.

Author information

Contributions

All authors conceived the study as well as drafted and revised the manuscript; OS implemented the method in code and carried out data analyses. TH and JP supervised.

Corresponding author

Correspondence to Octavious Smiley.

Ethics declarations

Competing interests

The authors have no conflict of interest to report.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Smiley, O., Hoffmann, T. & Onnela, JP. Approximate inference for longitudinal mechanistic HIV contact network. Appl Netw Sci 9, 12 (2024). https://doi.org/10.1007/s41109-024-00616-4

  • DOI: https://doi.org/10.1007/s41109-024-00616-4

Keywords