Short- and long-term temporal network prediction based on network memory

Temporal networks are networks whose topology changes over time. Two nodes in a temporal network are connected at a discrete time step only if they have a contact/interaction at that time. The classic temporal network prediction problem aims to predict the temporal network one time step ahead based on the network observed over a given duration in the past. This problem has been addressed mostly via machine learning algorithms, at the expense of high computational costs and limited interpretability of the underlying mechanisms that form the networks. Hence, we propose to predict the connection of each node pair one step ahead based on the past connections of this node pair itself and of node pairs that share a common node with it. The concrete design of our two prediction models is based on the analysis of the memory property of real-world physical networks, i.e., to what extent two snapshots of a network at different times are similar in topology (or overlap). State-of-the-art prediction methods that allow interpretation are considered as baseline models. In seven real-world physical contact networks, our methods are shown to outperform the baselines in both prediction accuracy and computational complexity. They perform better in networks with stronger memory. Importantly, our models reveal how the past connections of different types of node pairs contribute to the connection estimation of a target node pair. Predicting temporal networks such as physical contact networks in the long-term future, beyond the short term (i.e., one step ahead), is crucial to forecast and mitigate the spread of epidemics and misinformation on the network. This long-term prediction problem has seldom been explored. Therefore, we propose basic methods that adapt each aforementioned short-term prediction model to the long-term network prediction task. The prediction quality of all adapted models is evaluated via the accuracy in predicting each network snapshot and in reproducing key network properties. The prediction based on one of our models tends to have the highest accuracy and lowest computational complexity.


Introduction
Complex systems can be represented as networks, where nodes represent the components of a system and links denote the interaction or relation between the components.
The interactions are, in many cases, not continuously active. For example, individuals connect via email, phone call, or physical contact at specific times instead of constantly. Temporal networks (Holme and Saramäki 2012; Masuda and Lambiotte 2016; Holme 2015) could represent these systems more realistically with time-varying network topology.
The classic temporal network prediction problem aims to predict the interactions (or equivalently the network) one step ahead based on the network observed in the previous $L$ steps. This problem is also equivalent to problems in recommender systems, e.g., predicting which user will purchase which product, or which individuals will become acquaintances at the next time step (Kumar et al. 2019; Dhote et al. 2013). The temporal network prediction problem is more challenging than the static network prediction problem, which aims to predict the missing links or future links based on the links observed (Lü and Zhou 2011; Kumar et al. 2020; Cui et al. 2017; Zou et al. 2021; Zhan et al. 2020). Recently, machine learning algorithms have been developed to predict temporal networks. Embedding algorithms embed each node in a low-dimensional space based on the network observed. If the learned representations of two nodes are closer in the vector space, a contact between this node pair is more likely one time step ahead (Kumar et al. 2019; Kazemi et al. 2020; Zhou et al. 2018; Wang et al. 2021; Rahman et al. 2018; Xu et al. 2020). Restricted Boltzmann machine (RBM) based methods (Li et al. 2014) and graph neural networks (Pareja et al. 2020; Wu et al. 2022; Ma and Tang 2021) have also been developed for this prediction task and they can achieve high prediction accuracy. These methods, however, come at the expense of high computational costs and are limited in providing insights regarding which mechanisms enable the prediction and thus could possibly form temporal networks.
Network-based methods have been proposed to predict new links, i.e., the node pairs that will have contact in the future but have not had any contact in the past, instead of predicting all contacts at a specific future time step. These network-based methods consider a network property, also called similarity, of a node pair as the tendency that a new link will appear between the node pair (Liben-Nowell and Kleinberg 2003; Ahmed et al. 2016; Xu and Zhang 2013). Network-based methods tell directly which mechanisms or properties are used for the prediction and tend to have a low computational complexity. Recently, network properties of a node pair have been combined with learning algorithms to address the classic temporal network prediction problem (Li et al. 2019).
In this work, we aim to design network-based methods to solve the classic temporal network prediction problem and to unravel which mechanisms and network properties enable the prediction.
A temporal network measured at discrete times can be represented as a sequence of network snapshots $G = \{G_1, G_2, \ldots, G_T\}$, where $T$ is the duration of the observation window and $G_t = (V, E_t)$ is the snapshot at time step $t$, with $V$ and $E_t$ being the sets of nodes and contacts, respectively. If nodes $j$ and $k$ have a contact at time step $t$, then $(j, k) \in E_t$. We assume all snapshots share the same set $V$ of nodes. The time-aggregated network $G^w$ contains the same set $V$ of nodes and the set of links $E = \bigcup_{t=1}^{T} E_t$. That is, a pair of nodes is connected with a link in the aggregated network if at least one contact occurs between them in the temporal network. We give each link in the aggregated network an index $i$, where $i \in [1, M]$ and $M = |E|$ is the total number of links. The temporal connection or activity of link $i$ over time can then be represented by a $T$-dimensional vector $x_i$ whose elements are $x_i(t)$, where $t \in [1, T]$, $x_i(t) = 1$ when link $i$ has a contact at time $t$ and $x_i(t) = 0$ if no contact occurs at $t$. A temporal network can thus be equivalently represented by its aggregated network, where each link $i$ is further associated with its activity time series $x_i$.
Specifically, the prediction task is to predict the activation/connection tendency of each link $i$ at the next time step $t+1$ based on the network observed in the previous $L$ steps $[t-L+1, t]$, where $1 \le t-L+1 < t \le T$. The aggregated network $G^w$ is assumed to be known in the prediction problem because it represents social relationships and varies relatively slowly in time compared to contacts. The prediction accuracy is evaluated via the area under the precision-recall curve (AUPR), which compares the predicted activity tendency and the ground-truth connection of each link at the prediction time step.
Firstly, we explore the structural similarity between two network snapshots at any two time steps with a given time lag. We find that the similarity, or so-called network memory, is relatively high when the time lag is small and decays as the time lag increases in the seven real-world physical contact networks considered. Based on this observed time-decaying memory in temporal networks, we design two network-based temporal network prediction models.
The self-driven (SD) model assumes that the activation tendency of a link at a prediction step is solely influenced by its past activity states, with a stronger influence from more recent states. This concept is not new (Li et al. 2019; Jo et al. 2015). The SD model is emphasized as one model here because we will explore in depth the choice and interpretation of its parameter and because it is the basis of our self- and cross-driven (SCD) model (Zou et al. 2023). In the SCD model, the activity tendency is first derived for each link at the prediction time step based on the SD model. SCD assumes that the connection tendency of a target link at a prediction step depends not only on the SD activity tendency of the link itself at the same prediction step but also on that of the neighboring links (links that share a common end node with the target link) in the aggregated network. State-of-the-art models that allow interpretation are considered as baselines: Common Neighbor, Lasso Regression, the Correlated Discrete Auto-regression model, and the Markov model. In several real-world contact networks, we find that SCD outperforms the SD model and that both SCD and SD models perform better than the baselines. Both SD and SCD perform better in networks with a stronger memory. Additionally, the SCD model allows us to understand how different types of neighboring links (depending on whether they form a triangle with the target link or not) contribute to the prediction of a target link's future activity.
It is essential to predict the contact network in the long-term future, instead of one step ahead, in order to develop strategies to mitigate the epidemic or information spreading on the network. However, this long-term prediction problem on temporal networks remains unexplored. Hence, we further propose basic methods that adapt the aforementioned models for short-term network prediction to solve the long-term network prediction problem. Specifically, the long-term temporal network prediction problem is to predict the network (activities of all links) at each time step within the prediction period $[t+1, t+L^*]$ based on the network observed within $[t-L+1, t]$. Moreover, the aggregated network $G^w$ and the total number of contacts at each time step within the prediction period are assumed to be given. The latter assumption aims to simplify the problem, also because the number of contacts can be influenced by factors like weather and policy other than the network observed in the past. The prediction quality is evaluated via whether the predicted network within the prediction period is precise and can reproduce key network properties. We find that, in general, the adapted SD model performs the best among all models in all data sets. Its prediction accuracy decays as the prediction step is further ahead in time, and this decay speed is positively correlated with the decay speed of network memory. Finally, the networks predicted by the various models within the prediction period have a heterogeneous distribution of inter-event times of contacts along a link, similar to real-world networks.
The rest of the paper is organized as follows. We will introduce the real-world temporal networks used to design and evaluate temporal network prediction methods in section "Empirical data sets". Key temporal network properties will be analyzed in section "Memory in temporal networks" to motivate our network-based models (section "Short-term network prediction methods") for the classic short-term network prediction problem. The proposed models will be evaluated and interpreted in section "Performance analysis in short-term prediction". Finally, our network-based models and baseline models will be further developed and evaluated for the long-term prediction problem in sections "Long-term prediction methods" and "Performance analysis in long-term prediction", respectively.

Empirical data sets
To design and evaluate temporal network prediction methods, we consider seven empirical physical contact networks: Hospital (Vanhems et al. 2013), Workplace (Génois and Barrat 2018), PrimarySchool (Stehlé et al. 2011), HighSchool (Mastrandrea et al. 2015), LH10 (Génois and Barrat 2018), SFHH (Rossi and Ahmed 2015) and Hypertext2009 (Isella et al. 2011). The basic properties of these data sets are given in Table 1. The time steps at which there is no contact in the whole network have been deleted.
Table 1 The number of nodes ($N = |V|$), the number of node pairs that have contact(s) ($M$), the length of the observation time window ($T$), the time resolution ($\delta$, in seconds), the type of contacts and the location where the data is collected

Memory in temporal networks
In this section, we explore whether a temporal network has memory, i.e., whether the network observed at different times shares a certain similarity. Such a memory property may inspire the design of network-based temporal network prediction methods and influence prediction quality.
Auto-correlation. Firstly, we explore the correlation of the activity of a link at two times separated by a given interval $\Delta$, called the time lag, via the auto-correlation of the activity series of each link. The auto-correlation of a time series is the Pearson correlation between the given time series and its lagged version. We compute, for each link $i$, the Pearson correlation coefficient $R_{x_i x_i}(\Delta)$ between $\{x_i(t)\}_{t=1,2,\ldots,T-\Delta}$ and $\{x_i(t)\}_{t=\Delta+1,\Delta+2,\ldots,T}$ as its auto-correlation coefficient. Figure 1a shows that the average auto-correlation coefficient over all links decays with the time lag in every real-world network. The decay becomes slower as the time lag increases.
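The lagged Pearson correlation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the function names `pearson` and `autocorrelation` are hypothetical, and returning NaN for a constant series is a convention assumed here.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # assumed convention: undefined for a constant series
    return cov / sqrt(var_a * var_b)


def autocorrelation(x, lag):
    """Correlate x(1..T-lag) with x(lag+1..T) for one link's 0/1 activity series."""
    return pearson(x[: len(x) - lag], x[lag:])


activity = [1, 1, 0, 0, 1, 1, 0, 0]   # toy periodic activity series
print(autocorrelation(activity, 4))   # lag equal to the period
print(autocorrelation(activity, 2))   # lag equal to half the period
```

Averaging `autocorrelation(x_i, lag)` over all links $i$ would give one point of a curve like the one reported for Figure 1a.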
Jaccard similarity. Furthermore, the similarity of the network at two times with a given time lag is examined via the Jaccard similarity (JS). JS measures how similar two sets are by considering the fraction of shared elements between them. Given two snapshots $G_t$ and $G_{t+\Delta}$ of a temporal network, their Jaccard similarity is defined as the size of the intersection of their contact sets divided by the size of the union of their contact sets, that is,

$$JS(G_t, G_{t+\Delta}) = \frac{|E_t \cap E_{t+\Delta}|}{|E_t \cup E_{t+\Delta}|}.$$

A large JS means a large overlap/similarity between the two snapshots of the temporal network. Figure 1b shows the average Jaccard similarity over all possible pairs of temporal network snapshots that have a time lag $\Delta$. Similar to the auto-correlation in link activity, the similarity between temporal snapshots decays with their time lag in all empirical data sets, manifesting the time-decaying memory of real-world temporal networks.
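As a small illustration of the snapshot similarity defined above, the sketch below averages the Jaccard similarity over all snapshot pairs at a given lag. It assumes each snapshot is stored as a Python set of contacts, with each contact written as a tuple whose smaller node id comes first; the function names and the value returned for two empty snapshots are assumptions made for this example.

```python
def jaccard_similarity(contacts_a, contacts_b):
    """|A ∩ B| / |A ∪ B| for two snapshots given as sets of contacts."""
    union = contacts_a | contacts_b
    if not union:
        return 0.0  # assumed convention when both snapshots are empty
    return len(contacts_a & contacts_b) / len(union)


def average_jaccard(snapshots, lag):
    """Average JS over all pairs (G_t, G_{t+lag}) in a list of snapshots."""
    pairs = [(snapshots[t], snapshots[t + lag])
             for t in range(len(snapshots) - lag)]
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy temporal network with four snapshots.
snapshots = [{(1, 2), (2, 3)}, {(1, 2)}, {(1, 2), (3, 4)}, {(3, 4)}]
print(average_jaccard(snapshots, 1))  # 0.5
print(average_jaccard(snapshots, 2))  # smaller overlap at the longer lag
```

In this toy example the similarity drops from 0.5 at lag 1 to 1/6 at lag 2, which is the qualitative behavior the text calls time-decaying memory.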

Short-term network prediction methods
In this section, we will propose two network-based prediction models and four baseline models for the classic short-term temporal prediction problem, that is, predicting the activation tendency of each link in the aggregated network $G^w$ ($G^w$ is given) at the next time step $t+1$ based on the network observed in the previous $L$ steps within $[t-L+1, t]$.

Our network-based models
Inspired by the time-decaying memory of temporal networks, we propose two network-based temporal link prediction models. In our previous work (Zou et al. 2022), which uses Lasso Regression for short-term prediction as explained in the "Lasso regression" section, it has been found that a link's state at the next step is largely determined by the current state of the link itself and of the neighboring links that share a common node with the target link in the aggregated network. Hence, our two network-based models will estimate a link's activity tendency one step ahead based on the past activities of the link itself and of its neighboring links, respectively, by taking the memory effect into account.

Self-driven (SD) model
The self-driven (SD) model predicts the tendency $w_i(t+1)$ of the link $i$ to be active at the prediction time $t+1$ as:

$$w_i(t+1) = \sum_{k=t-L+1}^{t} x_i(k)\, e^{-\tau (t-k)}, \tag{1}$$

where the decay factor $\tau$ controls the rate of the memory decay and $x_i(k)$ is the state of link $i$ at time step $k$. A large $\tau$ corresponds to a fast decay of memory, such that only a small number of previous states affect the tendency of connection. When $\tau = 0$, all past states have equal influence on the future connection tendency, and $w_i(t+1)$ reduces to the total number of contacts of link $i$ during the past $L$ steps. Such exponential decay has also been considered in Li et al. (2019) and Yu et al. (2017). In the "Model evaluation" section, we will show that the SD model performs well for a common, wide range of the decay factor $\tau$ among all real-world networks considered and that we do not need to learn $\tau$ from the temporal network observed in the past.
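A minimal sketch of the SD tendency follows, assuming the exponentially decaying weight $e^{-\tau(t-k)}$ for the state at time step $k$ (the form referred to later in the "Duration L of past observation" section). The function name `sd_tendency` and the list-based input format are illustrative assumptions.

```python
from math import exp


def sd_tendency(history, tau):
    """SD tendency w_i(t+1) from the last L states of one link.

    history: 0/1 states x_i(t-L+1), ..., x_i(t), oldest first.
    tau: decay factor; tau = 0 weights all past states equally.
    """
    t = len(history) - 1  # position of the most recent state
    return sum(x * exp(-tau * (t - k)) for k, x in enumerate(history))


print(sd_tendency([1, 0, 1, 1], tau=0.0))  # 3.0, the number of past contacts
print(sd_tendency([1, 0, 0], tau=1.0))     # one contact two steps ago: exp(-2)
print(sd_tendency([0, 0, 1], tau=5.0))     # contact at the latest step: 1.0
```

With $\tau = 0$ the tendency reduces to a plain contact count, while a large $\tau$ makes it depend almost entirely on the most recent states.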

Self-and cross-driven (SCD) model
Furthermore, we generalize the SD model to a self- and cross-driven (SCD) model. The SCD model assumes that the activity tendency of a target link one step ahead depends on the SD connection tendency, defined in Eq. (1), of the link itself and also of the neighboring links that share an end node with the target link in the aggregated network. The union of the target link and its neighboring links is also called the ego-network centered at the target link, exemplified in Fig. 2. Furthermore, we differentiate three types of links in an ego-network, colored differently in Fig. 2: the target link itself (in grey), links that form a triangle with the target link (in blue), and the remaining links (in green). We assume that the previous states of these three types of links may contribute differently to the connection tendency of the target link. This is motivated by (a) our finding that, when Lasso Regression is used to estimate connection tendency (Zou et al. 2022), the previous activity of the target link itself contributes more than that of the neighboring links, (b) the common neighbor similarity method in static network prediction and (c) the observation of temporal motifs (e.g., three contacts that happen within a short duration with a specific ordering in time, and form a triangle in topology) in temporal networks (Paranjape et al. 2017; Saramäki and Moro 2015).

Fig. 2 An illustrative example of an ego-network centered at a target link $i$. The target link itself, links that form a triangle with the target link, and the other neighboring links are colored in grey, blue and green, respectively
Specifically, our SCD model assumes that the tendency $h_i(t+1)$ for link $i$ to be active at time step $t+1$ is a linear function of the contributions of the link itself $w_i(t+1)$ as defined in Eq. (1), the neighboring links that form a triangle with the target link $u_i(t+1)$ and the other neighboring links $g_i(t+1)$:

$$h_i(t+1) = \beta^*_0 + \beta^*_1\, w_i(t+1) + \beta^*_2\, u_i(t+1) + \beta^*_3\, g_i(t+1). \tag{2}$$

The latter two factors $u_i(t+1)$ and $g_i(t+1)$ are defined below as functions of the SD tendency at $t+1$ of links in the ego-network.
The contribution $u_i(t+1)$ of the neighboring links that form a triangle with the target link $i$ is defined as follows. For each pair of neighboring links $j$ and $k$ that form a triangle with the target link $i$, the geometric mean $\sqrt{w_j(t+1) \cdot w_k(t+1)}$ indicates the strength with which the two end nodes of link $i$ interact with the corresponding common neighbor. We define $u_i(t+1)$ as the average geometric mean over all link pairs that form a triangle with the target link. This design of $u_i(t+1)$ aims to capture a weighted version of the common neighbor similarity. The contribution $g_i(t+1)$ of the other links in the ego-network is defined as the average of their SD activity tendency. For each prediction time step $t+1$, a set of coefficients $\beta^*_0$, $\beta^*_1$, $\beta^*_2$, and $\beta^*_3$ in Eq. (2) is learned through Lasso Regression from the temporal network observed in the past $L$ steps for all possible target links.
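The three SCD features of a target link can be sketched as below. This is an illustrative reading of the definitions above, not the authors' implementation: links are assumed to be stored as tuples with the smaller node id first, `w` is a hypothetical dictionary mapping each link of the aggregated network to its SD tendency, and returning 0 when a link has no triangle pairs or no other neighbors is an assumption. Learning the coefficients of the linear combination is not shown.

```python
from math import sqrt


def link_key(j, k):
    """Canonical undirected link id (smaller node id first)."""
    return (j, k) if j < k else (k, j)


def scd_features(target, w):
    """Return (w_i, u_i, g_i) for a target link.

    target: (a, b) link of the aggregated network.
    w: dict mapping every aggregated link to its SD tendency at t+1.
    """
    a, b = target
    neighbors = {}
    for j, k in w:  # adjacency of the aggregated network
        neighbors.setdefault(j, set()).add(k)
        neighbors.setdefault(k, set()).add(j)
    common = neighbors[a] & neighbors[b]  # third nodes closing a triangle
    triangle = [sqrt(w[link_key(a, c)] * w[link_key(b, c)]) for c in common]
    others = [w[link_key(a, c)] for c in neighbors[a] - common - {b}]
    others += [w[link_key(b, c)] for c in neighbors[b] - common - {a}]
    u = sum(triangle) / len(triangle) if triangle else 0.0  # assumed default
    g = sum(others) / len(others) if others else 0.0        # assumed default
    return w[link_key(a, b)], u, g


w = {(1, 2): 2.0, (1, 3): 4.0, (2, 3): 1.0, (2, 4): 0.5, (1, 5): 1.5}
print(scd_features((1, 2), w))  # (2.0, 2.0, 1.0)
```

In the toy ego-network, node 3 closes the only triangle, so $u_i = \sqrt{4.0 \cdot 1.0} = 2.0$, and the two remaining neighboring links give $g_i = (1.5 + 0.5)/2 = 1.0$.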

Baseline models
Our goal is to develop network-based models for predicting temporal networks. This is because they usually have low computational complexity and allow us to understand the underlying mechanism that enables the prediction, and thus the mechanism that potentially forms temporal networks. Hence, as baselines, we introduce four models that are relatively interpretable in their mechanisms of prediction.

Common neighbor
We generalize the common neighbor method from static network prediction (Liben-Nowell and Kleinberg 2003) to the temporal network prediction problem. The number of common neighbors of a target node pair can be computed for each of the previous $L$ snapshots. The total number of common neighbors (CN) over the past $L$ snapshots is used as the target node pair's tendency of connection at the prediction time step $t+1$. Scholz et al. (2013) and Tsugawa et al. (2013) have used the number of common neighbors (CN$_{agg}$) of a target node pair in the unweighted aggregated network over the past $L$ snapshots to estimate whether there will be a new link between this node pair at the prediction time step. Later, we will show that the CN$_{agg}$ method performs overall worse than the CN method in all real-world networks.
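The difference between the CN and CN$_{agg}$ scores can be made concrete with a small sketch. It assumes snapshots are sets of undirected contacts stored as node-id tuples; the function name `common_neighbor_score` is hypothetical.

```python
def common_neighbor_score(snapshots, j, k):
    """Sum, over the given snapshots, of the number of common neighbors of j and k."""
    total = 0
    for contacts in snapshots:
        nbr_j = {b if a == j else a for a, b in contacts if j in (a, b)}
        nbr_k = {b if a == k else a for a, b in contacts if k in (a, b)}
        total += len((nbr_j & nbr_k) - {j, k})
    return total


past = [
    {(1, 3), (2, 3)},                  # node 3 is a common neighbor of 1 and 2
    {(1, 3)},                          # no common neighbor
    {(1, 3), (2, 3), (1, 4), (2, 4)},  # nodes 3 and 4 are common neighbors
]
cn = common_neighbor_score(past, 1, 2)                         # per-snapshot sum
cn_agg = common_neighbor_score([set().union(*past)], 1, 2)     # aggregated network
print(cn, cn_agg)  # 3 2
```

The per-snapshot sum counts a common neighbor only when both contacts occur in the same snapshot, and counts it again each time this happens, whereas the aggregated variant ignores when and how often the contacts occurred.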

Lasso regression
Lasso Regression (Zou et al. 2022) assumes that the activity of link $i$ at time $t+1$ is a linear function of the activities of all the links at time $t$, i.e.,

$$x_i(t+1) = \sum_{j=1}^{M} \beta_{ij}\, x_j(t) + c_i.$$

The objective is

$$\min_{\beta_i,\, c_i} \left\{ \sum_{k=t-L+1}^{t-1} \Big( x_i(k+1) - \sum_{j=1}^{M} \beta_{ij}\, x_j(k) - c_i \Big)^2 + \alpha \sum_{j=1}^{M} |\beta_{ij}| \right\},$$

where $M$ is the number of features, as well as the number of links in the aggregated network, $c_i$ is the constant coefficient, and $\beta_i = (\beta_{i1}, \beta_{i2}, \ldots, \beta_{iM})$ are the regression coefficients of all the features for link $i$. The coefficients are learned from the temporal network observed in the past $L$ steps for each link. We use L1 regularization, which adds a penalty proportional to the sum of the magnitudes of the coefficients, $\sum_{j=1}^{M} |\beta_{ij}|$. The parameter $\alpha$ controls the penalty strength. The regularization forces some of the coefficients to be zero and thus leads to models with few non-zero coefficients (relevant features). The optimal $\alpha$ that achieves the best prediction is chosen by searching 50 logarithmically spaced points within $[10^{-4}, 10]$.

CDARN model
The correlated Discrete Auto-Regression Network (CDARN) model has been shown to be able to capture the non-Markovian evolution of temporal networks and also the correlation between links in their activities (Williams et al. 2022). It assumes that the state of a link at each time step $t$ is either a copy of a previous state of the link itself or of another link, or is a Bernoulli random variable. The dynamics of each link $i$ is governed by the process

$$x_i(t) = Q_i(t)\, x_{C_i(t)}\big(t - Z_i(t)\big) + \big(1 - Q_i(t)\big)\, Y_i(t),$$

where $Q_i(t)$ is a Bernoulli variable with average $q_1$ that decides whether the state of link $i$ at $t$ is a copy of a previous state, and $Y_i(t)$ is a Bernoulli variable with average $q_2$ controlling the density of the network.
$Z_i(t)$ is a discrete random variable that is uniformly distributed within $\{1, 2, \ldots, P\}$, which means that the states of the previous $P$ steps have equal probability to be chosen as $x_i(t)$. The random variable $C_i(t)$ encodes which link's state is copied by link $i$ and is distributed as

$$\Pr[C_i(t) = j] = \begin{cases} 1 - c, & j = i, \\ c / |\Gamma_i|, & j \in \Gamma_i, \\ 0, & \text{otherwise}, \end{cases}$$

where $\Gamma_i$ is the set of neighboring links of link $i$ in the aggregated network $G^w$. Two links in the aggregated network are neighboring links if they share a common end node.
At any time $t$, $Q_i(t)$ is an independent and identically distributed random variable. The same holds for $C_i(t)$, $Z_i(t)$ and $Y_i(t)$. The parameters $q_1$, $q_2$, and $c$ can be estimated via Maximum Likelihood Estimation as described in Williams et al. (2022) based on the network topology observed in the previous $L$ steps. As for the Lasso Regression model, we confine ourselves to the CDARN model with memory length $P = 1$, where a link's current state is determined probabilistically by the states of the link itself or its neighboring links at the previous time step. This choice is also motivated by the high computational cost of CDARN.
Based on the estimated parameters ($q_1$, $q_2$ and $c$), the CDARN tendency for each link $i$ to be active at $t+1$ has been derived in Williams et al. (2022) for the link prediction task, as

$$p_i(t+1) = q_1 \big[ (1-c)\, \hat{D}_i(t+1) + c\, \hat{C}_i(t+1) \big] + (1 - q_1)\, q_2,$$

with

$$\hat{D}_i(t+1) = \delta\big(x_i(t), 1\big), \qquad \hat{C}_i(t+1) = \frac{1}{|\Gamma_i|} \sum_{j \in \Gamma_i} \delta\big(x_j(t), 1\big),$$

where $\delta(a, b)$ is the Kronecker delta, equal to 1 if $a = b$ and 0 otherwise, and $\Gamma_i$ is the set of neighboring links of link $i$ in the aggregated network. The term $\hat{C}_i(t+1)$ represents the fraction of active links among all the neighboring links of link $i$ at $t$. The term $(1-c)\, \hat{D}_i(t+1) + c\, \hat{C}_i(t+1)$ is the probability that the state of link $i$ at $t+1$ is active given that it is a copy of a previous state of the link itself or of its neighboring links.
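To make the copy-or-random mechanism concrete, the sketch below computes a one-step activation probability for memory length $P = 1$. It assumes the tendency takes the form $q_1[(1-c)\hat{D}_i + c\,\hat{C}_i] + (1-q_1)q_2$, which follows from the process described above (copy with probability $q_1$, otherwise draw a Bernoulli variable with average $q_2$); the exact expression in Williams et al. (2022) may differ in notation. The function name, the argument names, and the zero fallback for a link without neighbors are assumptions, and the parameters are taken as already estimated.

```python
def cdarn_tendency(x_self, x_neighbors, q1, q2, c):
    """Probability that a link is active at t+1 under the assumed P = 1 form.

    x_self: state (0/1) of the link at time t.
    x_neighbors: states (0/1) of its neighboring links at time t.
    q1: probability of copying a past state; q2: mean of the random draw;
    c: probability that a copy comes from a neighboring link.
    """
    d_hat = 1.0 if x_self == 1 else 0.0
    # Fraction of active neighboring links (assumed 0 if there are none).
    c_hat = sum(x_neighbors) / len(x_neighbors) if x_neighbors else 0.0
    copied_active = (1 - c) * d_hat + c * c_hat
    return q1 * copied_active + (1 - q1) * q2


print(cdarn_tendency(1, [1, 0, 0, 1], q1=0.5, q2=0.1, c=0.2))
```

For the example call, the copy branch contributes $0.5 \times (0.8 \times 1 + 0.2 \times 0.5)$ and the random branch $0.5 \times 0.1$, giving approximately 0.5.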

Markov model
The Markov model (Kemeny and Snell 1976; Tang et al. 2020) assumes that the activity time series of a link in a temporal network is independent of that of other links and that a link's activity at the current time step depends only on its state at the previous time step. For each link, we can obtain a $2 \times 2$ transition matrix, where each element represents the transition probability from each possible state (either 0 or 1) at any time step to each possible state at the next consecutive time step, based on the states of the link observed in the last $L$ steps. The Markov tendency for each link being active at $t+1$ is the transition probability from its state at $t$ to an active state.
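The Markov tendency can be estimated directly from the observed transitions of a single link, as in the following sketch. The function name `markov_tendency` is hypothetical, and returning 0 when the current state was never observed earlier in the window is an assumption of this example.

```python
def markov_tendency(history):
    """Estimated probability that the link is active at t+1.

    history: 0/1 states of one link over the past L steps, oldest first.
    Only the row of the 2x2 transition matrix for the current state is needed.
    """
    current = history[-1]
    # States that followed every earlier occurrence of the current state.
    followers = [nxt for prev, nxt in zip(history, history[1:]) if prev == current]
    if not followers:
        return 0.0  # assumed fallback when no transition from this state was seen
    return sum(followers) / len(followers)


print(markov_tendency([0, 1, 1, 0, 1]))  # state 1 was followed by 1 and 0: 0.5
print(markov_tendency([0, 0, 0, 1, 0]))  # state 0 was followed by 0, 0, 1
```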

Performance analysis in short-term prediction
In this section, we will evaluate and interpret the performance of these short-term network prediction models in the aforementioned set of real-world physical contact networks.

Model evaluation
We first introduce the method to evaluate the prediction accuracy of a model. Secondly, we explore how to choose the decay factor in the SD model. Thirdly, we compare the performance of all the models.

Temporal network prediction accuracy of short-term prediction
Each model predicts the activation tendency of each link at time step $t+1$ based on the temporal network observed in the past $L$ steps. The prediction step $t+1$ is sampled 1000 times from $[T/2+1, T]$ with equal spacing.
The average proportion of the $M$ links that are active at a time step is lower than 1% in all the real-world networks we considered. The classification labels (active versus inactive links per time step) are therefore imbalanced. Hence, we evaluate the prediction accuracy via the area under the precision-recall curve (AUPR) (Davis and Goadrich 2006). An AUPR can be derived for the prediction of each network snapshot, using the connection tendency of each link derived by a given model and the actual network snapshot. AUPR provides an aggregated accuracy across all possible classification thresholds. The average AUPR of a model over the 1000 prediction snapshots quantifies the prediction accuracy of the model. A high AUPR means high prediction accuracy.
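For one prediction snapshot, the area under the precision-recall curve can be approximated by the average precision of the ranked links, as sketched below. Average precision is one common estimator of AUPR and is used here as an assumption; it is not necessarily the interpolation used by the authors. Ties between scores are broken by input order in this sketch.

```python
def average_precision(scores, labels):
    """Average precision of a ranking, a common estimator of AUPR.

    scores: predicted connection tendency of each link.
    labels: ground-truth state (1 = active, 0 = inactive) of each link.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return float("nan")  # undefined when no link is active
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


print(average_precision([0.9, 0.8, 0.1], [1, 1, 0]))          # perfect ranking: 1.0
print(average_precision([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0]))  # one inactive link ranked too high
```

Averaging this value over the sampled prediction steps would give one accuracy figure per model and data set.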

Choice of decay factor
The choice of the decay factor $\tau$ is motivated by comparing two possibilities. We first consider a simple case where $\tau$ is a control parameter and does not vary over time, i.e., it remains the same for the 1000 samples of the prediction time step $t+1$. Given a $\tau$, the tendency $w_i(t+1)$ ($i \in [1, 2, \ldots, M]$) can be obtained at each prediction step based on Eq. (1). Figure 3 shows that the decay factor $\tau$ indeed affects the prediction accuracy AUPR of the SD model. A universal pattern is that the optimal performance is obtained by a common and relatively broad range of $\tau \in [0.5, 5]$ in all networks. This implies that our real-world physical contact networks measured at school, hospital, workplace, etc., may be formed by a universal class of time-decaying memory. Hence, $\tau$ can be chosen arbitrarily within $[0.5, 5]$. In the second method of choosing $\tau$, a $\tau(t+1)$ for each prediction step $t+1$ is learned from the network observed in the past $L$ steps. The $\tau(t+1)$ is chosen as the one that allows the SD model to best predict the temporal network at $t$ based on the network observed in the past $L$ steps. The prediction accuracy achieved by the first (second) method of choosing $\tau$ is 0.63 (0.61), 0.68 (0.67), 0.69 (0.63), 0.75 (0.74), 0.68 (0.67), 0.34 (0.33) and 0.65 (0.63), for the seven data sets, respectively.
Hence, $\tau$ can be chosen arbitrarily from $[0.5, 5]$, which has lower computational complexity and better prediction accuracy than learning $\tau$ dynamically over time. We consider $\tau = 0.5$ to derive the SD tendency and the SCD tendency in the rest of the analysis.

Comparison of models
We further compare the prediction accuracy of all models. As shown in Fig. 4, both SD and SCD models perform better than the baselines. The SCD model, which predicts a link's connection utilizing the SD tendency of the neighboring links and of the link itself, indeed performs better than the SD model, which uses only the SD tendency of the link itself. Moreover, the SD and SCD models perform the best (worst) in LH10 (PrimarySchool), in line with the strongest (weakest) memory/similarity of LH10 (PrimarySchool) observed in Fig. 1.
Previous studies have shown that the number of common neighbors (CN$_{agg}$) in the unweighted aggregated network over the past observation period can relatively accurately predict new links to appear in the aggregated network (Scholz et al. 2013; Tsugawa and Ohsaki 2013). However, our CN method, though it performs better than CN$_{agg}$, performs poorly in the short-term network prediction problem. This is likely because the timing of the contacts of the neighboring links that form a triangle with the target link is crucial for the short-term network prediction problem, but is largely ignored by the CN and CN$_{agg}$ methods. The SCD model, in contrast, gives less weight to events that happened earlier in time and implicitly estimates the chance that those two neighboring links have contacts at the same time, giving rise to its superior performance. The CN method uses the sum of the number of common neighbors over the past $L$ snapshots to estimate a target node pair's tendency of connection at the prediction time step. In this case, every two contacts of a node with the target node pair respectively at the same

Model interpretation
In this subsection, we first interpret the SCD model, to understand how the past states of different types of links in the ego-network (neighborhood) of a target link contribute to the activation tendency of the target link. Afterwards, we interpret the decay factor $\tau$ and the duration $L$ of past observation to understand how past contacts over time contribute to the prediction.

Interpretation of SCD model
As defined in Eq. (2), the SCD model predicts a link's future connection based on the SD tendency of the link itself, of the links that form a triangle with the link, and of the rest of the links that share a common node with the link. The contributions of these three types of links are reflected in the learned coefficients in Eq. (2). The average of each coefficient over all prediction steps is given in Table 2. In all networks except for PrimarySchool, $\beta^*_1 > \beta^*_2 > \beta^*_3$. This means that the activity of a target link in the future is mainly influenced by the past activity of the link itself, slightly influenced by the activity of the neighboring links that form a triangle with the target link, and seldom affected by the activity of the other neighboring links. The predictive power of neighboring links that form a triangle with the target link may come from the nature of physical contact networks: contacts are often determined by physical proximity; two people that are close to a third but not yet close to each other are likely to already be in relatively close proximity.
One exception is PrimarySchool, where $\beta^*_2 > \beta^*_1$. Table 2 shows that the aggregated network of PrimarySchool has the largest clustering coefficient among all data sets. In general, we find that the contribution $\beta^*_2$ of links that form a triangle with the target link tends to be more significant in temporal networks with a larger clustering coefficient $cc$.

Duration L of past observation
According to the definition of the SD tendency of connection in Eq. (1), only the coefficients/contributions $e^{-\tau(t-k)}$ of the previous 24 steps (3 steps) are larger than $10^{-5}$ when $\tau = 0.5$ ($\tau = 5$), out of the $L = T/2 > 1000$ previous steps observed. We wonder whether considering only a few previous steps instead of $L = T/2$ steps would be sufficient for a good prediction. As shown in Fig. 4, the prediction accuracy of the SD model when $L = 3$ and $\tau = 0.5$ is worse than that when $L = T/2$ and $\tau = 0.5$. This suggests that although the contribution of each early state of a target link is small, the accumulated contribution of many early states improves the prediction accuracy. The prediction accuracy of the SD model when $L = 3$ and $\tau = 0.5$, whose computational complexity is extremely low, is still better than or similar to that of Lasso Regression, reflecting the predictive power of the recent states of a link. The choice of $L$ may influence the prediction accuracy of all models. Hence, we further compare the prediction accuracy of all models when $L = T/4$ in Fig. 5. We find that the same conclusion holds as when $L = T/2$: both SD and SCD models perform better than the baselines, and SCD performs better than the SD model. Additionally, we observe a significant decrease in prediction accuracy when $L$ decreases from $L = T/2$ to $L = T/4$ for both the Lasso Regression and CDARN models, likely because these learning models need a sufficiently long period of observation for training. In contrast, the prediction accuracy remains relatively stable for the SD, SCD, Markov, and CN models, despite the change in $L$. This suggests that the SD, SCD, Markov, and CN models are more resilient to variations in $L$ than the Lasso Regression and CDARN models.
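The step counts quoted above can be checked numerically. The short sketch below counts how many of the most recent steps carry a weight $e^{-\tau d}$ above $10^{-5}$, where $d = t - k$ is the distance to the latest observed step; the function name and the search bound `max_steps` are illustrative assumptions.

```python
from math import exp


def steps_above(tau, threshold=1e-5, max_steps=10_000):
    """Number of recent steps whose weight exp(-tau * d) exceeds the threshold."""
    return sum(1 for d in range(max_steps) if exp(-tau * d) > threshold)


print(steps_above(0.5))  # 24
print(steps_above(5.0))  # 3
```

Equivalently, the weight exceeds $10^{-5}$ for $d < \ln(10^{5})/\tau \approx 11.51/\tau$, which gives $d \le 23$ for $\tau = 0.5$ and $d \le 2$ for $\tau = 5$.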

Long-term prediction methods
Strategies to mitigate epidemics or information spreading are supposed to be carried out for a relatively long period instead of only one time step. Hence, predicting the temporal network in the long-term future is essential for the development of mitigation strategies.
The long-term prediction problem is to predict the temporal network in the long-term future within $[t+1, t+L^*]$ based on the network topology observed in the past $L$ time steps within $[t-L+1, t]$. The number of contacts $m(t+\Delta t)$ at each prediction step and the aggregated network over the whole time window $[1, T]$ of each data set are known. We introduce two basic methods that adapt each short-term network prediction model for the long-term prediction task: recursive long-term prediction and repeated long-term prediction. The common neighbor model is not considered in view of its low performance in short-term prediction.
Fig. 5 Link prediction accuracy AUPR of all models when $L = T/4$ and $\tau = 0.5$ in seven data sets. The prediction accuracy is averaged over 1000 prediction snapshots for all models, except that the accuracy of the CDARN model is averaged over 100 prediction snapshots due to its computational complexity

Recursive long-term prediction
In short-term prediction, the SD model differs from all the other models in the sense that the SD model has only one parameter $\tau$, which can be chosen arbitrarily from $[0.5, 5]$ to achieve approximately the optimal performance, whereas the parameters of the other models (Eq. 2 for the SCD model, Eq. 3 for Lasso Regression, Eq. 7 for the CDARN model, and the transition matrix for the Markov model) need to be trained from the network observed in the past. Hence, we will explain how to make recursive long-term predictions using these two kinds of models respectively.
For the SD model, the SD connection tendency of each link at $t+1$ can be obtained according to Eq. 1 based on the network observed in the past $[t-L+1, t]$ and $\tau = 0.5$. Since the number of contacts $m(t+1)$ is known, we predict the temporal network by considering the $m(t+1)$ links with the highest connection tendency as contacts. The predicted network $G'_{t+1}$ at $t+1$ can be represented by the predicted state of each link $\{x'_1(t+1), x'_2(t+1), \ldots, x'_M(t+1)\}$ at $t+1$. The predicted network $G'_{t+1}$ and the network observed within $[t-L+2, t]$ are then used to compute the SD connection tendency of each link at $t+2$ and to derive the predicted network $G'_{t+2}$ at $t+2$, equivalently its $m(t+2)$ contacts. The connection tendency of each link at each future step $t+\Delta t$, where $2 \le \Delta t \le L^*$, is derived recursively using the network observed within $[t-L+\Delta t, t]$ and the networks predicted within $[t+1, t+\Delta t-1]$. The SD connection tendency of each link and the given number of contacts at each future step $t+\Delta t$ are used to predict the temporal network at that time step.
For the SCD model, Lasso Regression, the CDARN model, and the Markov model, we train each model only once based on the network observed in the past $L$ steps within $[t-L+1, t]$ to obtain its parameters. Each trained model is used to derive the activation tendency of each target link at $t+1$ using the network observed at $t$. Then we predict the temporal network at $t+1$ by considering the $m(t+1)$ links with the highest connection tendency to be active. The same trained model is applied recursively to derive the activation tendency at $t+\Delta t$ using the network $G'_{t+\Delta t-1}$ predicted at $t+\Delta t-1$, and to predict the network $G'_{t+\Delta t}$ as the set of contacts along the $m(t+\Delta t)$ links with the highest connection tendency.
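The recursive procedure for the SD model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that Eq. 1 has the form $\sum_k e^{-\tau(t-k)} x_i(k)$ over the observation window, and it breaks ties in connection tendency by link index, which the text does not specify. The function names are hypothetical.

```python
import numpy as np

def sd_tendency(window, tau=0.5):
    """ASSUMED form of Eq. 1: sum over past steps of exp(-tau * lag) * x(k).

    window: (L, M) 0/1 array, rows ordered from oldest to most recent step.
    """
    L = window.shape[0]
    lags = np.arange(L - 1, -1, -1)        # most recent row has lag 0
    return np.exp(-tau * lags) @ window    # one tendency value per link

def top_m(tendency, m):
    """Return a 0/1 vector marking the m links with the highest tendency."""
    state = np.zeros(tendency.shape[0], dtype=int)
    # Stable sort: ties are broken by link index (an assumption).
    order = np.argsort(-tendency, kind="stable")
    state[order[:m]] = 1
    return state

def recursive_sd(observed, m_future, tau=0.5):
    """Predict len(m_future) snapshots, feeding each prediction back in."""
    window = np.asarray(observed, dtype=float).copy()
    predicted = []
    for m in m_future:
        snapshot = top_m(sd_tendency(window, tau), m)
        predicted.append(snapshot)
        # Slide the window: drop the oldest step, append the prediction.
        window = np.vstack([window[1:], snapshot])
    return np.array(predicted)

# Toy example: 3 observed steps, 3 links, known m(t+1) = 1 and m(t+2) = 2.
observed = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 0]])
print(recursive_sd(observed, m_future=[1, 2]).tolist())
# [[0, 1, 0], [1, 1, 0]]
```

The window keeps a fixed length $L$, so each predicted snapshot replaces the oldest observed one, matching the description of the step from $t+1$ to $t+2$.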

Repeated long-term prediction
The repeated long-term prediction based on each of the aforementioned models is defined as follows. We first derive the connection tendency of each link at $t+1$ in the same way as in short-term prediction, based on the network observed within $[t-L+1, t]$ using a given model. Then the connection tendency of each link at any prediction step $t+\Delta t$, where $\Delta t \in [2, L^*]$, is assumed to be the same as the connection tendency of that link at $t+1$. Given the connection tendency of each link at any prediction step $t+\Delta t$, where $\Delta t \in [1, L^*]$, we predict the network $G'_{t+\Delta t}$ by considering the $m(t+\Delta t)$ links with the highest connection tendency to be active.
Our short-term prediction models can thus be applied either recursively or repeatedly to predict the network in the long-term future. The repeated long-term prediction assumes that the connection tendency of each link remains the same over the long-term prediction period. In contrast, the recursive long-term prediction uses both the observed network and the predicted network to predict the network further in time. Hence, it possibly captures the evolving nature of the network over time, but its prediction error accumulates over time.
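For comparison, the repeated method can be sketched in a few lines. The example assumes that a connection tendency vector for $t+1$ has already been produced by any of the models; the function name and the index-based tie-breaking are illustrative assumptions.

```python
import numpy as np

def repeated_prediction(tendency, m_future):
    """Reuse the tendency computed for t + 1 at every later prediction step."""
    # The ranking of links is fixed once; ties go to the lower link index.
    order = np.argsort(-np.asarray(tendency, dtype=float), kind="stable")
    predicted = np.zeros((len(m_future), len(order)), dtype=int)
    for step, m in enumerate(m_future):
        predicted[step, order[:m]] = 1   # the m top-ranked links are active
    return predicted

# Toy example: tendencies of 3 links at t + 1 and known contact counts.
print(repeated_prediction([0.97, 1.61, 0.0], m_future=[1, 2, 3]).tolist())
# [[0, 1, 0], [1, 1, 0], [1, 1, 1]]
```

Because the ranking never changes, the predicted snapshots are nested: the contacts predicted at a step with fewer contacts are always a subset of those predicted at a step with more.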

Performance analysis in long-term prediction
In this section, we explore the performance of different models applied either repeatedly or recursively in long-term prediction and its relation with the memory property of temporal networks.
The prediction quality of any method is evaluated via the accuracy in (1) predicting the network at each prediction step within the prediction period $[t+1, t+L^*]$, (2) predicting the weighted aggregated network over the prediction period, and (3) reproducing the inter-event time distribution of contacts along a link within the prediction period. Accuracy from these three perspectives is, in general, desirable for long-term network prediction, since the network per snapshot, the aggregated network, and the distribution of inter-event times of contacts along a link evidently affect spreading processes unfolding on the network (Scholtes et al. 2014; Vazquez et al. 2007; Newman 2003; Horváth and Kertész 2014).
The prediction length $L^*$ is chosen as $10\%T$, and the starting point $t+1$ of each prediction period $[t+1, t+L^*]$ is sampled 1000 times from $[T/2+1, 90\%T]$ at equal spacing, to illustrate our method.

Model evaluation
Since the number of contacts $m(t+\Delta t)$ at each prediction step $t+\Delta t \in [t+1, t+L^*]$ is given, the number of contacts in the predicted network $G'_{t+\Delta t}$ at time step $t+\Delta t$ is the same as that of the real-world network (ground truth) $G_{t+\Delta t}$. Hence, we evaluate the accuracy of the network predicted at each time step $t+\Delta t \in [t+1, t+L^*]$ via recall: the number of contacts that exist in both the predicted network snapshot $G'_{t+\Delta t}$ and the real-world network $G_{t+\Delta t}$, divided by $m(t+\Delta t)$. Because both networks contain the same number of contacts, recall equals precision in this setting.
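As a small illustration of this measure, the sketch below computes the recall of one predicted snapshot, with each snapshot stored as a 0/1 vector over links. Returning NaN for a step without real contacts is an assumption made here, since the text does not discuss that case.

```python
import numpy as np

def snapshot_recall(predicted, actual):
    """Fraction of real contacts at one step that are also predicted."""
    predicted = np.asarray(predicted, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    m = actual.sum()
    if m == 0:
        return float("nan")   # undefined without real contacts (assumption)
    return float((predicted & actual).sum() / m)

# Two contacts in each network, one of them shared.
print(snapshot_recall([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```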
Firstly, we compare the recall of each model using the recursive and repeated prediction methods respectively, as a function of the prediction time gap $\Delta t$, the number of time steps that the prediction step $t+\Delta t$ is ahead of the observation window $[t-L+1, t]$. For each $\Delta t \in [1, L^*]$, the recall is averaged over the 1000 samples of the prediction period. From Fig. 6 (for the Hospital network) and Fig. 11 in the Appendix (for the other networks), we could not recognize any difference between the repeated and recursive methods for the SD and SCD models, while the Lasso Regression, CDARN, and Markov models tend to perform better using the repeated prediction method.
The similar performance of the SD model when it is applied recursively and repeatedly can be explained as follows. When the SD model is applied recursively to predict the network at step $t+1$, the $m(t+1)$ links with the highest connection tendency are predicted to have contacts, which in turn makes their connection tendency at $t+2$ higher than that of the other links. In this way, the ranking of links in connection tendency at each prediction time step within $[t+1, t+L^*]$ remains nearly the same as in the SD model applied repeatedly. At any prediction step, it is the ranking of links in connection tendency that decides the predicted network.
The prediction accuracy of Lasso Regression, the CDARN model, and the Markov model is low in short-term prediction. When we use the network predicted by any of these models at $t+\Delta t$ to predict the network at $t+\Delta t+1$ using the recursive prediction method, the prediction error accumulates. This is likely why these three models tend to perform better using the repeated prediction method. We consider the repeated prediction method in the remaining analysis of this section.
Secondly, we compare the performance of all models using the repeated prediction method in each data set. Figure 7 shows that the average recall decreases as the prediction gap $\Delta t$ increases for all models in all temporal networks. In general, SCD performs the best among all models in all data sets when $\Delta t = 1$, as observed in the short-term prediction in the "Model evaluation" section. When $\Delta t > 2$, SD achieves roughly the best prediction accuracy.

Prediction accuracy in relation to network memory
As SD achieves roughly the best prediction accuracy among all models, we further explore the relation between the prediction accuracy of the SD model and the memory property of temporal networks, aiming to understand in which kind of temporal networks the SD model predicts better. The recall of the SD model in general decreases as the prediction time gap $\Delta t$ increases. The decrease is faster when $\Delta t$ is smaller, as shown in Fig. 7. We will focus on the prediction gap within $[1, 1\%T]$ since the prediction accuracy is too low when $\Delta t > 1\%T$.
Fig. 7 The prediction accuracy, Recall, of the SD, SCD, Lasso Regression, CDARN, and Markov models applied repeatedly in seven temporal networks at each prediction step $t+\Delta t$. In the Random model, the $m(t+\Delta t)$ predicted contacts are randomly chosen at the prediction step $t+\Delta t$
Intuitively, it is probably difficult to predict a temporal network if the network has a weak memory, i.e., the network observed at different times shares low similarity, especially for models like SD and SCD that utilize network memory in network prediction. Hence, we explore the relation between the prediction accuracy of the SD model and the memory property of the temporal network. Figure 8a shows the recall of the SD model as a function of the normalized prediction time gap $\Delta t/(1\%T)$. Figure 8b illustrates the average Jaccard similarity of two snapshots of a temporal network when their time lag equals $\Delta t/(1\%T)$. We observe that the prediction accuracy tends to be better in networks with stronger memory (Jaccard similarity). For example, the recall is the largest (smallest) in LH10 (PrimarySchool, Workplace, and Hospital), whose Jaccard similarity is also the largest (smallest). Furthermore, the decay rate of the prediction accuracy with the prediction time gap $\Delta t/(1\%T)$ in Fig. 8a seems to be related to the decay rate of Jaccard similarity with the time lag $\Delta t/(1\%T)$ in Fig. 8b. We measure the decay rate of recall within the interval $[1/(1\%T), \Delta t/(1\%T)]$, and the decay rate of Jaccard similarity is defined in the same way. Figure 8c and d show the decay rate of recall and of JS respectively within $[1/(1\%T), \Delta t/(1\%T)]$ as a function of $\Delta t/(1\%T)$. We find that the ranking of the real-world networks in the decay rate of recall approximates that in the decay rate of JS at any $\Delta t/(1\%T)$. This means that the prediction accuracy of the SD model decays fast in networks with fast-decaying memory.
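The memory measure used in this comparison can be illustrated with a short sketch that averages the Jaccard similarity over all pairs of snapshots separated by a given time lag. The array layout (one row per time step, one column per link) and the choice to skip pairs in which both snapshots are empty are assumptions made for this example.

```python
import numpy as np

def average_jaccard(snapshots, lag):
    """Mean Jaccard similarity of all snapshot pairs separated by `lag` steps.

    snapshots: (T, M) 0/1 array; lag must be an integer with 1 <= lag < T.
    Pairs in which both snapshots are empty are skipped (an assumption).
    """
    x = np.asarray(snapshots, dtype=bool)
    a, b = x[:-lag], x[lag:]
    inter = (a & b).sum(axis=1)     # contacts present in both snapshots
    union = (a | b).sum(axis=1)     # contacts present in at least one
    valid = union > 0
    if not valid.any():
        return float("nan")
    return float((inter[valid] / union[valid]).mean())

toy = np.array([[1, 1, 0],
                [1, 0, 0],
                [1, 0, 1]])
print(average_jaccard(toy, lag=1))  # 0.5
```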

Aggregated network
We further evaluate the precision of the predicted aggregated network $G'_w(t+1, t+L^*)$, which is the network predicted per time step aggregated within the prediction period $[t+1, t+L^*]$. The aggregated network $G_w(t+1, t+L^*)$ of the real-world network is constructed as follows. Two nodes are connected by a link in $G_w(t+1, t+L^*)$ if the two nodes have at least one contact in the real-world network within $[t+1, t+L^*]$. Moreover, the weight of each link is defined as the number of contacts along the link within $[t+1, t+L^*]$. The weighted aggregated network $G_w(t+1, t+L^*)$ can be represented by a weighted adjacency matrix $A_{G_w(t+1,t+L^*)}$ whose element in row $i$ and column $j$ is the number of contacts between node $i$ and node $j$ within $[t+1, t+L^*]$. Similarly, we can construct the predicted aggregated network $G'_w(t+1, t+L^*)$, a weighted network, based on the network predicted within $[t+1, t+L^*]$ and represent it by the weighted adjacency matrix $A'_{G_w(t+1,t+L^*)}$. Note that the number of contacts in the predicted network is the same as that in the real-world network at any prediction step.
We evaluate the accuracy in predicting the aggregated network via the generalized recall ($\mathrm{Recall}_{wei}$), which measures the extent to which the two weighted aggregated networks $G'_w(t+1, t+L^*)$ and $G_w(t+1, t+L^*)$ overlap.
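To make the notion of overlap between two weighted aggregated networks concrete, the sketch below builds link weights from per-step snapshots and computes one plausible overlap measure: the sum of the element-wise minimum of the two weight vectors divided by the total real weight. This formula is an assumption chosen for illustration and may differ from the exact definition of the generalized recall; the function names are hypothetical.

```python
import numpy as np

def aggregate(snapshots):
    """Weight of each link = number of contacts over the prediction period.

    snapshots: (L*, M) 0/1 array; returns a length-M weight vector, i.e. the
    weighted adjacency matrix flattened to one entry per node pair.
    """
    return np.asarray(snapshots).sum(axis=0)

def generalized_recall(pred_snapshots, real_snapshots):
    """ASSUMED overlap measure: sum of min(w', w) divided by sum of w."""
    w_pred = aggregate(pred_snapshots)
    w_real = aggregate(real_snapshots)
    total = w_real.sum()
    if total == 0:
        return float("nan")   # undefined without real contacts (assumption)
    return float(np.minimum(w_pred, w_real).sum() / total)

real = [[1, 1, 0], [1, 0, 0]]   # link weights 2, 1, 0
pred = [[1, 0, 1], [1, 0, 0]]   # link weights 2, 0, 1
print(generalized_recall(pred, real))  # 2/3
```

With unit weights and a single snapshot, this measure reduces to the recall per snapshot, which is why it is a natural candidate for a weighted generalization.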
We first compare the generalized recall of each model using the recursive and repeated prediction methods respectively, as a function of the prediction period $L^*$. Instead of considering $L^* = 10\%T$ as in the previous sections, we consider the general scenario where $L^*$ is a variable, $L^* \in [1, 10\%T]$. We observe the same pattern for the aggregated network as for the prediction accuracy per snapshot. We could not recognize any difference in prediction accuracy between the repeated and recursive methods for both the SD and SCD models, while the Lasso Regression, CDARN, and Markov models tend to perform better using the repeated prediction method (see Fig. 12 in the Appendix). When all models are applied repeatedly, SD achieves roughly the best prediction accuracy (see Fig. 13 in the Appendix). The SD model tends to predict better in networks with a strong memory, and the prediction accuracy of the SD model decays fast in networks with fast-decaying memory, as shown in Fig. 14 in the Appendix.

The distribution of inter-event time
The inter-event time is the time between two consecutive contacts of a link. Firstly, we derive the inter-event time distribution of a real-world (predicted) temporal network within $[t+1, t+L^*]$ from the inter-event times collected from all links that have at least two contacts within $[t+1, t+L^*]$. The objective is to explore whether the predicted network and the corresponding real-world network during the prediction period have a similar inter-event time distribution. Figure 9 (Fig. 15 in the Appendix) shows the inter-event time distribution in each real-world network and the corresponding predicted networks when each model is applied repeatedly (recursively). In each data set, the networks predicted by the various models possess almost the same heterogeneous inter-event time distribution, which can be explained as follows.
For repeated prediction, the tendency of each link to be active at each prediction time step $t+\Delta t$ ($\Delta t \in [2, L^*]$) is the same as its activation tendency at $t+1$, and the $m(t+\Delta t)$ links with the highest connection tendency are considered to have contacts at $t+\Delta t$. When no links have the same rank in tendency, the total number of contacts in the network at each time step decides the distribution of inter-event time. The link whose connection tendency is the $r$th largest among all links will be active at a prediction step if the total number of contacts at that prediction step is no less than $r$. Links with high (low) connection tendency are likely to be active (inactive) at each time step, leading to many (few) small (large) inter-event times and hence to a heterogeneous distribution of inter-event time. The distribution of inter-event time observed in predicted networks roughly approximates the distribution in the corresponding real-world network.
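The mechanism described above can be reproduced with a toy example. The sketch assumes no ties in connection tendency, so the link ranked $r$ is active exactly when the number of contacts at a step is at least $r$; the contact counts are made-up values for illustration only.

```python
import numpy as np

def inter_event_times(activity):
    """Gaps between consecutive active steps of one link (0/1 sequence)."""
    active_steps = np.flatnonzero(activity)
    return np.diff(active_steps)

# Hypothetical number of contacts at six consecutive prediction steps.
m_per_step = np.array([3, 1, 2, 1, 3, 2])

# Under repeated prediction without ties, the link ranked r is active
# at a step exactly when m at that step is at least r.
for r in (1, 2, 3):
    activity = (m_per_step >= r).astype(int)
    print(r, inter_event_times(activity).tolist())
# 1 [1, 1, 1, 1, 1]
# 2 [2, 2, 1]
# 3 [4]
```

The top-ranked link produces many short gaps while lower-ranked links produce fewer and longer ones, which is the source of the heterogeneous distribution.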
Fig. 9 The probability density function of inter-event time in each real network and the networks predicted by various models applied repeatedly
Still, this does not mean that predicted networks have the burstiness of inter-event time observed in real-world networks, where contacts between a pair of nodes usually occur in bursts of many contacts close in time followed by a long period of inactivity. To systematically explore the burstiness of inter-event times along a link, we group links based on their total number of contacts within $[t+1, t+L^*]$, following the method of Goh and Barabási (2008), and derive the probability density function of the inter-event times collected from all links in each group. Figure 10 shows, for each group of links, this probability density function scaled by the average inter-event time of the group, as a function of the inter-event time divided by that average. As shown in Fig. 10, the distributions of inter-event time of all groups in both the real-world network and the network predicted by the SD model follow a similar heavy-tailed distribution. Networks predicted by our SD model approximately reproduce the burstiness of inter-event times observed in real-world networks.
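The grouping and rescaling step can be sketched as follows. This is only an illustration under stated assumptions: the bin edges are log-spaced between the smallest and largest contact counts, which is one way to obtain groups of logarithmically increasing width, and the density estimation itself is omitted. The function name is hypothetical.

```python
import numpy as np

def rescaled_inter_event_times(link_gaps, n_groups=10):
    """Group links by contact count (log-spaced bins) and rescale their gaps.

    link_gaps: list of 1-D arrays, the inter-event times of each link.
    A link with g gaps has g + 1 contacts.
    Returns {group index: gaps divided by the group's mean gap}.
    """
    contacts = np.array([len(g) + 1 for g in link_gaps])
    edges = np.logspace(np.log10(contacts.min()), np.log10(contacts.max()),
                        n_groups + 1)
    # Use interior edges only, so group indices run from 0 to n_groups - 1.
    group_of = np.digitize(contacts, edges[1:-1])
    rescaled = {}
    for g in np.unique(group_of):
        members = np.flatnonzero(group_of == g)
        gaps = np.concatenate([np.asarray(link_gaps[i]) for i in members])
        rescaled[int(g)] = gaps / gaps.mean()
    return rescaled

gaps = [np.array([1, 3]), np.array([2]), np.ones(9, dtype=int)]
print(rescaled_inter_event_times(gaps, n_groups=2)[0].tolist())
# [0.5, 1.5, 1.0]
```

After rescaling, every group has mean 1, so the shapes of the distributions of different groups can be compared on a common axis.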

Conclusion
In this work, we propose two network-based models to solve the short-term temporal network prediction problem. The design of these models is motivated by the time-decaying memory observed in temporal networks. The proposed self-driven (SD) model predicts a link's future activity based on the past activities of the link itself, and the self- and cross-driven (SCD) model additionally uses the past activities of the neighboring links. Both models perform better than the baseline models. Interestingly, we find that the SD and SCD models perform better in temporal networks with a stronger memory.
The SCD model reveals that a link's future activity is mainly determined by (the past activities of) the link itself, moderately by neighboring links that form a triangle with the target link, and hardly by other neighboring links. However, if the temporal network has a high clustering coefficient in its aggregated network, the contribution of the neighboring links that form a triangle with the target link tends to be significant and possibly dominant.
Fig. 10 The probability density function of the inter-event times derived from each group of links, scaled by the average inter-event time of the group, as a function of the inter-event time divided by that average, in each real temporal network (circle) and the network predicted by the SD model (asterisk). Different colors indicate the distributions derived from different groups of links. Links are sorted into 10 groups with logarithmically increasing width based on their number of contacts
We further apply these short-term network prediction models either recursively or repeatedly to make long-term network predictions, that is, to predict the temporal network in the long-term future based on the network topology observed in the past and given the number of contacts at each prediction step. The accuracy of long-term prediction is evaluated from the perspective of the network predicted per snapshot and the predicted aggregated network. The repeated method performs, in general, better for all prediction models. This is likely because the recursive method uses both the observed network and the predicted network, which is not precise enough, to predict the network further in time. In general, the SD model performs the best among all models in all data sets. It predicts better in networks with a stronger memory. The prediction accuracy decays as the prediction step is further ahead in time, and this decay speed is positively correlated with the decay speed of network memory. Finally, networks predicted by the various models have a heterogeneous distribution of inter-event time similar to that of real-world networks, and also the burstiness of inter-event times of a link.
Our work is a starting point to explore network-based temporal network prediction methods. Our findings may shed light on the modeling of the formation of temporal networks, which is crucial in understanding and controlling the dynamics of and on temporal networks. Our finding that the activities of neighboring links that form a triangle with a target link have predictive power for the connection of the target link may suggest that higher-order events (Ceria and Wang 2023; Benson et al. 2018), like triangles in each network snapshot, may contribute to the prediction of (pairwise and higher-order) temporal networks. It is also interesting to evaluate the prediction accuracy of network-based prediction methods in comparison with state-of-the-art machine learning methods that target high accuracy.

Appendix: The prediction accuracy in long-term network prediction
See Figs. 11, 12, 13, 14 and 15.
Fig. 11 The prediction accuracy, Recall, of the SD, SCD, Lasso Regression, CDARN, and Markov models applied recursively (blue curve) or repeatedly (yellow curve) in each real-world network at each prediction step that is $\Delta t$ steps ahead of the training/observed network
Fig. 12 The accuracy $\mathrm{Recall}_{wei}$ in predicting the aggregated network within $[t+1, t+\Delta t]$, for the SD, SCD, Lasso Regression, CDARN, and Markov models applied recursively or repeatedly, as a function of $\Delta t$
Fig. 13 The accuracy $\mathrm{Recall}_{wei}$ in predicting the aggregated network within $[t+1, t+\Delta t]$, of the SD, SCD, Lasso Regression, CDARN, and Markov models applied repeatedly in seven temporal networks

Fig. 1 a The average auto-correlation coefficient $R_{xx}$ over all links as a function of the time lag and b the average Jaccard similarity of two snapshots of a temporal network with a given time lag in each of the seven data sets

Fig. 3 Link prediction accuracy AUPR of the SD model as a function of the decay factor $\tau$ in seven data sets

Fig. 4 Temporal network prediction accuracy AUPR of the Common Neighbor model (CN and CN$_{agg}$), Lasso Regression (LR), the CDARN model, the Markov model, the SD model, and the SCD model. All methods consider $L = T/2$ and $\tau = 0.5$ except for SD ($\tau = 0.5$, $L = 3$), which is needed only for the "Duration L of past observation" section

Fig. 6 The prediction accuracy, Recall, of the SD, SCD, Lasso Regression, CDARN, and Markov models applied recursively or repeatedly in Hospital at each prediction step $\Delta t$ steps ahead

Fig. 8 a The prediction accuracy, recall, of the SD model applied repeatedly, b the Jaccard similarity (JS), c the decay rate of recall and d the decay rate of JS as a function of the normalized prediction time gap $\Delta t/(1\%T)$

Fig. 14 a The accuracy, $\mathrm{Recall}_{wei}$, of the SD model applied repeatedly in predicting the aggregated network over $[t+1, t+\Delta t]$, b the Jaccard similarity (JS), c the decay rate of $\mathrm{Recall}_{wei}$ and d the decay rate of JS as a function of the normalized prediction time gap $\Delta t/(1\%T)$

Table 2 The learned coefficients $\beta_1^*$, $\beta_2^*$, and $\beta_3^*$ in the SCD model averaged over 1000 prediction steps and the clustering coefficient (cc) of the aggregated network in each empirical network
The clustering coefficient of a network is the probability that two neighbors of a node are connected.