Epidemic risk assessment from geographic population density

In the presence of an epidemic outbreak, the availability of instruments to guide containment measures may have an enormous impact on public health, the economy, and society. Mathematical modeling and computer simulations provide powerful tools to understand the dynamics of epidemic outbreaks, design strategies to control and miti-gate them, and project decisions and scenarios into potential alternative outcomes. The classical compartmental models of disease propagation (Kermack and McKendrick 1991; Hethcote 2000) describe the system at the population level by grouping individuals in sub-populations according to their health states relevant for transmission and track-ing changes in the sizes of the sub-populations, without specifying which individuals are involved. The movements in and out of the compartments are governed by constant transition rates in ordinary differential equations, corresponding to exponentially distributed waiting times in the compartments. This approach assumes a fully-mixed population, where an infectious individual is equally likely to spread the disease to any other Abstract The geographic distribution of the population on a region is a significant ingredient in shaping the spatial and temporal evolution of an epidemic outbreak. Heterogeneity in the population density directly impacts the local relative risk: the chances that a specific area is reached by the contagion depend on its local density and connectedness to the rest of the region. We consider an SIR epidemic spreading in an urban territory subdivided into tiles (i.e., census blocks) of given population and demographic profile. We use the relative attack rate and the first infection time of a tile to quantify local severity and timing: how much and how fast the outbreak will impact any given area. Assuming that the contact rate of any two individuals depends on their household distance, we identify a suitably defined geographical centrality that measures the average connectedness of an area as an efficient indicator for local riskiness. We simulate the epidemic under different assumptions regarding the socio-demographic factors that influence interaction patterns, providing empirical evidence of the effectiveness and soundness of the proposed centrality measure. Data driven, Urban system, Geographic

individual, and it is equivalent to a mean-field. The modeling of epidemic contagion has evolved from such simple compartmental schemes to sophisticated data-informed models using synthetic populations whose demographics are statistically indistinguishable from the census data. Socio-demographic and geographic factors, such as age and household distance, are essential in determining contact patterns and disease spreading, especially on an urban scale (Illenberger et al. 2013;Walsh and Pozdnoukhov 2011;Del Valle et al. 2007; Mossong et al. 2008).
In the last decades, lots of work has focused on incorporating heterogeneous contact patterns in the models. In individual-based models, the population-level behavior emerges from the microscopic interactions between agents that can carry a set of attributes such as age, spatial location, and social behavior. Socio-demographic attributes collected from census data and surveys or integrated with mobile, traffic, or wearable sensor data and online tools (Mossong et al. 2008;Cattuto et al. 2010;Mistry et al. 2020), allow the construction of realistic contact matrices describing the mixing patterns in typical social settings (e.g., households, schools, workplaces) (Mossong et al. 2008;Eubank et al. 2004;Merler and Ajelli 2009;Chang et al. 2021). Metapopulation models, built using data from inter-population mobility (Hufnagel et al. 2004;Colizza et al. 2006;Chang et al. 2021;Tizzoni et al. 2014), allowed remarkable progress in describing how the disease travels from one city/nation to the other and estimating epidemic paths predictability (Colizza et al. 2007). Compartmental models with stochastic and time-varying disease transmission rates allow for incorporating control measures and environmental variations due to unobserved processes (Cazelles et al. 2018;Gray et al. 2011;Bonaccorsi and Ottaviano 2016). By representing individuals in the population as vertices of a graph, network-based models aim to understand how the topological structure of the contact network affects the dynamics spreading upon them (Gupta et al. 1989;Newman 2002;Keeling and Eames 2005;Chakrabarti et al. 2008;Pastor-Satorras et al. 2015). Neal (2002, 2008) incorporated some randomness in the classic stochastic Susceptible-Infected-Removed (SIR) model in a closed population, adding to the local contacts among neighbors on a social network, also global contacts among pairs of individuals chosen at random in the population. This model is also suited to mimic multiple transmission routes (Kiss et al. 2006). In Celestini et al. (2022), the authors address similar questions for epidemics at urban scale by simulating an agent-based SIR model in a synthetic urban population. As in Ball and Neal (2008), their model incorporates casual contacts on top of a structured set of social relations. In addition, they use a data-informed model to obtain an age-structured population connected by a geographically referenced social network and allow for global casual contacts to be dependent on distance and the age class of an agent. The model assumes that an individual has daily contact with her household members, frequent contact with a small network of acquaintances (e.g., friends or coworkers), and fortuitous contact with the rest of the population. As data about the population density are often available in a discretized form, the model used in Celestini et al. (2022) considers a regular tiling of the urban territory that induces a partition of the population based on their tile of residence and permits analysis of the geographic diffusion of the disease. According to this model, urban epidemic outbreaks are not predictable (as defined in Colizza et al. 2006) because of the many possible epidemic pathways. However, the authors highlight that the first infection time and the final fraction of the population infected within a tile correlate to the logarithm of the tile population.
Building on these findings, in this paper, we investigate more in detail how the geographical distribution of the population influences the spatial and temporal evolution of an epidemic disease outbreak at an urban scale, where density patterns may impact the epidemic. We focus in particular on two local observables: the tile first infection time τ and relative attack rate α , defined as the attack rate within the tile divided by the total attack rate. Assuming that the frequency of contacts decays as an inverse power of the distance (Liben-Nowell et al. 2005;Illenberger et al. 2013;Kowald et al. 2013;Herrera-Yagüe et al. 2015;Büchel and Ehrlich 2020), we find that the expected value of τ and α can be described efficiently through a metric that we call geographic harmonic centrality ( HC ). We provide qualitative insights on the relations binding τ and α to HC , and support our findings through simulations. We find that these relations still hold when the model is generalized to include age-based mixing, vertex-intrinsic sociability fitness, and multiple levels of mixing with contacts that recur with different frequencies.

SIR epidemics on a real territory: Model and methods
Let us consider a Susceptible-Infectious-Removed (SIR) epidemic model where the individuals of the population belong to one of three disjoint compartments: susceptible ( S ), if they never caught the disease; infectious ( I ), if they are currently infected and contagious; removed ( R ), if they already recovered from or died due to the disease. Only two transitions are possible between the three compartments: a susceptible individual may become infected ( S → I ) due to an interaction with an infectious individual; an infectious individual may spontaneously recover or die ( I → R ). We consider a discrete-time synchronous cellular automaton in which the dynamic follows a reactive process (de Arruda et al. 2018). For each time step t = 0, 1, . . . , we denote with S(t) , I(t) and R(t) the sets of, respectively, susceptible, infected and recovered individuals at time t. The size of these compartments is, respectively, S(t) = |S(t)| , I(t) = |I(t)| and R(t) = |R(t)| . The interactions between the individuals of the population are modeled by a sequence of temporal contact networks G t = (V , E t ) , where E t is the set of contacts occurring at time t and (u, v) ∈ E t with probability p u,v independently of all other edges. If v ∈ S(t) , for each u ∈ I(t) such that (u, v) ∈ E t , the disease is transmitted from u to v with fixed probability β-i.e., the probability that v ∈ I(t + 1) given v ∈ S(t) is 1 − u∈I(t) (1 − p u,v β) . If u ∈ I(t) , u recovers with a fixed rate µ-i.e., the probability that u ∈ R(t + 1) given u ∈ I(t) is µ . We assume that a single individual, called index case, is infectious at time 0, i.e., I(0) = 1.
We simulate epidemic outbreaks in the Municipality of Viterbo, Italy, represented as a rectangular bounding box inhabited by a synthetic population V of N ≈ 60K agents. The territory of Viterbo is partitioned into a grid of n square tiles of fixed side l = 500 m and we disregard all tiles having less than 10 residents (see Fig. 1). Each agent v has the following three attributes: a tile of residence i v , inferred from density estimates provided by the WorldPop Project (Bondarenko et al. 2020); an age-tag g v -child (0-17), young (18-34), adult (35-64), or elder (65+)-drawn based on census data aggregated at the provincial level, provided by the Italian Institute of Statistics (ISTAT) (all details can be found in ); a social fitness score f u , drawn from a Lognormal distribution. The sociability score f u measures u's tendency to interact with other people; using a Lognormal distribution guarantees that the degree distribution of the interaction graphs has a heavy-tail and an adjustable mode . In particular, we set the distribution of f u to have "small" skewness and variance, so that most vertices have degree close to the average and the hubs are limited in both number and size (Kertész et al. 2021). Lognormally distributed data occur across several domains (Mitzenmacher 2004)-including social networks (Liben-Nowell et al. 2005;Illenberger et al. 2013)-and Lognormal distributions often fit the statistical properties of real-world network better than power laws (Broido and Clauset 2019). More details on the choice of the fitness distribution can be found in .
In the following, we will initially consider two cases: homogeneous mixing (HM), characterized by p u,v = p for constant p > 0 ; distance-based mixing (DM), where p u,v ∝ d −1 i,j for all u ∈ V i and v ∈ V j only depends on the distance d i,j between V i and V j . We will then assess, through extensive simulations, whether the predictions obtained for the DM model remain valid when p u,v also depends on the social fitness, on the age and/or on the underlying social fabric, encoded into a static urban social network (USN) that combines synthetic households and acquaintances. Intra-household (e.g., kinship) edges are defined by drawing a clique for each household, where the breakdown of the population into households is entirely inferred from the available data. Acquaintance edges are instead drawn based on the fundamental assumption that the probability of two individuals being "friends" is ultimately governed by their age group, their geographical distance, and their sociability. More details on our urban social network model can be found in  and .
In total, we consider five configurations, summarized in Table 1 and described more in details as needed. All configurations are fully dynamic-i.e., the temporal contact networks G t are mutually independent-except for the multi-layer mixing (MLM), in which part of the interactions that occur at each time t are selected from the USN static social network of "strong ties".
The configurations are characterized by a fixed expected volume of daily contacts, i.e., they all have the same average interaction probability p u,v . In line with Biggerstaff et al. (2014) and Liu et al. (2018), all the experimental results presented in this paper are obtained calibrating the SIR model on a common Influenza Like Illness Fig. 1 The territory of the Municipality of Viterbo, tessellated into tiles. The color of each tile show the number of individuals living in the area, only tiles having 10 or more residents are considered Celestini et al. Applied Network Science (2022) 7:39 (ILI) by setting µ = 1/3 , so that the average time to recovery is 3 days, and p u,v β such that the expected number of infections caused by a single case in a fully susceptible population is R index 0 = 1.3. The tiling of the city's territory induces a partition of the population V into n blocks With a slight abuse of notation, V i will be used to indicate both the ith tile and the population living therein. I i (t) and I i (t) will denote, respectively, the set and the number of infected individuals in block i at time t. The partition of the population into tiles mirrors the fact that the spatial density of the population always comes at some given resolution. The tiling, however, has no meaningful interpretation: at urban scale, in fact, there are no clear boundaries and truly distinct means of transportation (e.g., air vs. road traffic) that permit to identify meaningful sub-populations. The probabilistic rule used to establish if u and v come into contact is the same regardless of whether u and v belong to the same or to different tiles. Albeit the tiling produces a discretization of distances-and, hence, a stratification of contact frequencies-our model significantly differs from typical meta-population models, where the local and global dynamics of contagion are well separated.
On large scale territories, the high number of possible spreading channels may be compensated by the presence of dominant connections in human traffic flow, making the geographic spread of the disease predictable (Colizza et al. 2006), in the sense that different outbreak realizations present similar patterns of spatial epidemic diffusion over time. As shown in Celestini et al. (2022), in our case the predictability of the geographic patterns of diffusion of the epidemic is instead limited-at least far from the epidemic peak-due to the absence of backbones capable of defining preferred epidemic pathways. Still, the stratification of contact frequencies may impact on the "riskiness" of different blocks with respect to epidemic outbreaks. As proxies of riskiness we will use the (expected) relative attack rate and local first infection time, defined as follows: (1) Table 1 The interpersonal contact models used throughout the paper p u,v is the probability that u and v are connected at each step of the simulation f u is the Lognormally distributed social fitness of u s gu,gv is the data-driven age-based social mixing for u and v's age groups g u and g v E H and E A are the edge sets of two static networks, respectively used to represent households and acquaintances (see Sect. 4 and  for more details)

Model
Contact probability p u,v

Name Acronym
Homogeneous mixing HM constant Distance and fitness-based mixing Celestini et al. Applied Network Science (2022) 7:39 where R = R(+∞) and R i = R i (+∞) denote the final epidemic size, in V and V i , respectively, and E[·] is the expected value. For all i = 1, . . . , n , α i is the expected attack rate of V i divided by the total expected attack rate, a measure of the chances for an individual resident in V i to be infected with respect to the average population. τ i is the expected time of the first infection in V i , a measure of how quickly V i must be isolated in order to protect its sub-population. The expected value in the definition of α i and τ i will be replaced by the empirical mean in the experimental analysis presented in Sect. 4. In all cases, the averages will be computed over 100 independent epidemic outbreaks, simulated with identical configuration parameters, that involve at least 25% of the population. We empirically verified that, with a single index case, the epidemic size follows a bimodal distribution with a minimum at around 25%N , so all simulations resulting in R/N < 0.25 have been considered cases of early extinction and, as such, discarded.

Geographic distributions of individuals and the spread of epidemics
In the simplest scenario, the geographic distribution of the population impacts the relative risk of an infection event occurring in a specific local area in two ways: (i) through the local population density, and (ii) through the connectedness of the local area to the rest of the territory. If we assume that the contact rate of any two individuals depends on their household distance, we can quantify the average connectedness of an area in terms of its geographical centrality. Thus, the chances for a given area to be reached by the contagion ultimately depend on its local density and centrality. To formalize this idea, we focus on the relative attack rate α and on the first infection time τ , defined in (1) and (2), starting from two simple models: homogeneous mixing (HM) and distance-based mixing (DM).

Homogeneous mixing
At this level of description, the geography of an outbreak of size R can be described by means of a vector of indices O = (j 0 , j 1 , ..., j R−1 ) such that j h = i if and only if the hth infection occurs in V i (simultaneous infection events are sorted in random order). If we assume that the population is homogeneously mixed, all individuals have, on average, the same chances of contracting the disease. The probability that j h = i is thus S i (t h )/S(t h ) , i.e., the probability that, at the time t h when the hth infection occurs, a susceptible individual chosen uniformly at random belongs to V i . Under homogeneous mixing, for fixed t h , the expected value of the ratio S i (t h )/S(t h ) can be approximated by the local population density ρ i = N i /N. Under these assumptions, V i 's epidemic size R i follows a binomial distribution of parameters ρ i and R, hence, its expected value is ρ i E[R] . This gives: Let W i be the number of infection events needed to observe the first infection in V i , i.e. W i = h i if h i = min{h : j h = i} . W i follows the geometric distribution ρ i (1 − ρ i ) h i −1 , and the expected value of W i is 1/ρ i . In the exponential phase of the epidemic the number of contagion events grows exponentially in time and we can assume that the hth contagion event occurs at time t h ∝ ln(h) , yielding the following estimate for the first infection time:

Connection flow to a sub-population
In the HM case, the local density ρ i = N i /N is a good approximation of the probability of an infection reaching the ith tile, and it is therefore the key ingredient in determining both α i and τ i . In many cases, it is possible to partition the population into sub-populations {V i } n i=1 and to assume that, at least with good approximation, p u,v ≈ p i,j for all u ∈ V i and v ∈ V j , i.e., that p u,v mostly depends on the sub-populations of u and v. A simple example is the geographic partitioning into subpopulations achieved by dividing a territory with a tiling. More generally, one can partition a population based on a combination of geographic and demographic attributes, such as location and age.
In this case, the probability that at least one individual of V i comes into contact with an infectious individual is intuitively proportional to the quantity k i can be interpreted as a per node connection flow to the ith tile: a randomly chosen node u ∈ V has on average k i daily contacts with the nodes v ∈ V i -i.e., the dynamic network G t contains, on average, k i edges incident to u and V i . If we know that none of the individuals in V i is infectious, a better estimate is provided by which only differs from k i for the fact that the average does not involve the members of V i itself.
The same rationale used for the HM case now brings us to the following general estimate for α i in a partitioned population: Along the same line, if the infection reaches V i during the phase of exponential growth of the epidemic, τ i can be estimated as:

Distance-based mixing
Let d i,j be the normalized geographic distance between the center of tiles V i and V j . The normalization is obtained through a division by l 2 , where l is the tile side. Additionally, we impose d i,i = 1 , so that individuals in the same tile are at half the distance of individuals living in neighboring tiles. Finally, we set d u,v = d i,j for all u ∈ V i and v ∈ V j . This approximation is often inevitable, at least for small values of l, because population density data are generally discrete and we may only have access to the residence tile of a vertex.
In line with previous empirical findings ( Liben-Nowell et al. 2005;Illenberger et al. 2013;Kowald et al. 2013;Herrera-Yagüe et al. 2015;Büchel and Ehrlich 2020), to model distance-based mixing (DM) we assume that the probability that u ∈ V i and v ∈ V j come into contact decays as a power of the distance: (5) and (6) gives The right hand sides of (9) and (10) are, actually, measures of geographic centrality. This leads us to introduce the following metrics:  (7) and (8), respectively, we obtain the following estimates for the first infection time and the relative attack rate under the DM model:

Experimental analysis
The analysis presented so far suggests that, under the widely accepted hypothesis that the frequency of contacts decays with the distance, the riskiness of different areas of a city is tied to their local density and geographic centrality.
To support our thesis and to assess whether these findings remain valid under more realistic models, in this section we evaluate empirically the distribution of α i and τ i for five different models. We consider the HM model and the DM model (with γ = 1 ), and three other models that include additional factors influencing the patterns of interpersonal contacts: • In the distance and fitness-based mixing (DFM) model, p u,v ∝ d −1 u,v f u f v , where f u is a measure of u's social fitness (Caldarelli et al. 2002), drawn from a Lognormal distribution. • In the distance and age-based mixing (DAM) model, p u,v ∝ s g u ,g v d −1 u,v , where s i,j is the age-based social mixing for age groups i and j, deduced from aggregated contact data (Mossong et al. 2008) through the SOCRATES Data Tool (Willem et al. 2020). • Finally, in the multi-layer mixing (MLM) model, different sets of edges recur with different probabilities. We first construct two edge-disjoint static networks: the household graph G H = (V , E H ) , composed of a clique for each synthetic household drawn based on census data ; the acquaintance graph G A = (V , E A ) , where, for each u, v ∈ V , (u, v) ∈ E A independently of all others with probability extracted independently of f u , but from the same distribution). Household relations induce a static frame of daily contacts, i.e., p u,v = 1 for all (u, v) ∈ E H , while acquaintance relations induce frequent contacts, with Thus, in this model part of the connections-all (u, v) ∈ E H -are static.
To allow for a fair comparison, in all configurations p u,v is normalized so that u<v p u,v = |E| . This guarantees that the expected density of G t and, thus, the expected number of potential contagion events u<v p u,v β are fixed. The choice of 1 as the distance dependence exponent, and of a Lognormal fitness distribution is justified on an empirical basis in  to best represent social contacts at the urban level. In the MLM model, a fixed social network instance is considered, so that all fluctuations are due to epidemic dynamics.
All the figures presented in the current section are scatterplots fitted by a linear regression model, with a dot for each tile of the territory of Viterbo. Since each epidemic simulation is inherently stochastic, each run reaches a possibly different subset of tiles. Grey dots in the plots represent tiles that are reached by only a fraction of the simulations and that are ignored by the linear regression due to their inherent bias. Configurations with more constrains generally reach fewer tiles, but, in all cases, the tiles in grey in the plots account for less than 3% of the population of Viterbo, in total. More details about the impact of the configuration parameters upon the outcomes of the spreading process are given in Celestini et al. (2022).  (15) is the first confirmation that HC 1 i , weighted by N /N i , is indeed a good candidate to predict the epidemic spread when the frequency of contacts decays as d −1 . Adding fitness to the model (DFM configuration, panel (c)) yields more heterogeneity to the degree of nodes and apparently weakens the role of HC γ i as a predictor of the relative attack rate. For DAM and MLM (panels (d) and (e), respectively), however, we see again a good agreement with (15). Let us remind that in the MLM model both the static acquaintance graph and the fully dynamic part of the interaction graph are built relying on a social fitness score that guarantees a heavy-tailed degree distribution. Nonetheless, N N i HC γ i is a good predictor of α i for the MLM configuration (Fig. 2e), contrary to the DFM (Fig. 2c). The insight is that the approximation works even for heterogeneous networks, provided that a significant volume of interactions tend to recur over time. Finally, in all panels (except (a)) we see that the relative attack rate (and thus the final epidemic size in each tile) grows linearly with HC γ i : the more central the tile is, the greater the fraction of its population that is infected.  First infection time Figure 3 shows the plot of τ i versus H C γ i . Despite the range of τ i being highly dependent on the configuration-with more heterogeneous models corresponding to faster epidemics-all panels show a very good agreement between the average time of first infection in the simulations and the value predicted by (16). For all the configurations tested the time of first infection decreases linearly with the logarithm of H C γ i . In particular, as expectable, the most centrally located tiles are infected first.

Characterizing epidemic front waves
Based on (16), the epidemic front waves, i.e., the sets of tiles having equal τ , can be characterized by the condition for some constant H. Given the spatial distribution of the population, we would like to predict the shape of these contours on the city map, but the geographic interpretation of (17) is not straightforward. In this section, we study these epidemic front waves to see if they can be described using some simple features of V i , ideally, its population and its location in the territory. To this end, we observe that i.e., H C γ i is N i times the average of H C γ (v) over the elements of V i , which disentangles the dependance of τ i on the population and the position of V i . Condition (17) can now be reformulated as  Arguably, in most real world cities �HC γ (v)� i is correlated, to some extent, with the distance d C,i of V i from the center of the city. The actual relation between �HC γ (v)� i and d C,i depends on the spatial density of the considered population, but the intuition that these two quantities are correlated is validated empirically in Fig. 4, where a relation of the type �HC γ (v)� i ∝ e −d C,i seems to emerge, regardless of whether we consider the real, uniform or shuffled population density for Viterbo-here, we identified C as the tile having max HC 1 . As a consequence of the estimated relation �HC γ (v)� i ∝ e −d C,i , the epidemic front waves can be characterized by a simpler condition of the type for a, b > 0 . This relation can be clearly seen in Fig. 5, where we grouped the tiles having similar τ and we included a linear regression for each class. While in the HM model τ only depends on the population, as per equation (4), in the DM model (Fig. 5a, c) the epidemic front waves lie along parallel non-horizontal lines in the d C,i , ln(N i ) plane-albeit with variable slope according to the configuration parameters. The different shapes that these epidemic front waves take on the map show the role that the (generally, positive) correlation between the centrality of a tile and its population have in shaping the riskiness map of urban epidemics.

Conclusions
Thanks to the current availability of data regarding human mobility flows, the largescale spreading of a disease is, to some extent, predictable (Colizza et al. 2006). To the best of our knowledge, it has not been ascertained whether major backbones of disease propagation also exist at the urban scale. In our previous work (Celestini et al. 2022), we explored the consequences of the widely-acknowledged assumption that the frequency of contacts decays as an inverse power of the distance (Liben-Nowell et al. 2005;Illenberger et al. 2013;Kowald et al. 2013;Herrera-Yagüe et al. 2015;Büchel and Ehrlich 2020). We found that this distance-based stratification of contacts is not enough to make the geographic spreading of urban epidemics predictable. Yet, we also showed that heterogeneity in population distribution reflects onto a local variation of epidemic timing and severity. Building on Celestini et al. (2022), in this paper we formalize the relative riskiness of living in different areas of a city by means of a suitable metric of geographic centrality.  To measure the riskiness of a specific portion of the considered territory, we consider the relative attack rate and the first infection time. We find two closed-form expressions that approximate these two quantities when the rate of interpersonal contacts only depends on household distances. Both expressions can be formulated in terms of a new metric of geographic harmonic centrality, that quantifies the average connectedness of a local area with the rest of the territory. By means of simulations, we provide empirical evidence that our results can be generalized to more heterogeneous contact patterns.
The risk evaluation obtained in this way is independent of the actual evolution of the epidemic and can be done from geographic and demographic information only. In addition, under reasonable assumptions, the epidemic front waves can be characterized by a simple condition that depends on the local population density and on the distance from the city center. Our results can thus potentially inform containment strategies at a local scale.
It would be interesting to extend this analysis to the regional scale and compare our predictions and simulation results with actual epidemic data, where available. Another possibility we intend to explore is to replace the distance with the traveling time between different blocks, to include urban environments where physical barriers, such as a river or a railway constrain mobility within the city.