Predicting variable-length paths in networked systems using multi-order generative models

Apart from nodes and links, for many networked systems, we have access to data on paths, i.e., collections of temporally ordered variable-length node sequences that are constrained by the system’s topology. Understanding the patterns in such data is key to advancing our understanding of the structure and dynamics of complex systems. Moreover, the ability to accurately model and predict paths is important for engineered systems, e.g., to optimise supply chains or provide smart mobility services. Here, we introduce MOGen, a generative modelling framework that enables both next-element and out-of-sample prediction in paths with high accuracy and consistency. It features a model selection approach that automatically determines the optimal model directly from data, effectively making MOGen parameter-free. Using empirical data, we show that our method outperforms state-of-the-art sequence modelling techniques. We further introduce a mathematical formalism that links higher-order models of paths to transition matrices of random walks in multi-layer networks.


Introduction
Network models have provided us with important insights into the structure and dynamics of complex systems that consist of many interacting elements.In these models, we represent direct interactions between elements by means of dyadic or, increasingly, polyadic relationships, e.g., directed or undirected links connecting pairs of nodes, hyperedges connecting groups of nodes, or abstract simplicial complexes simultaneously capturing multiple (sub)sets of interacting nodes (Benson et al. 2021;Torres et al. 2021).
A key contribution of this topological perspective is that it tells us which of a system's elements can possibly influence each other either directly via a relationship, or indirectly via sequences of relationships that constitute paths. 1 While the topology of a network tells us which paths are possible in principle, we nowadays often directly observe paths, i.e., we have access to data capturing time-ordered sequences of nodes that interact with each other or that are traversed by a process.Examples include time-stamped interactions capturing how information (Dai et al. 2014;West and Leskovec 2012;Arlitt and Jin 2000;Cadez et al. 2000) or infectious diseases (Chai and Pavlou 2016;Wang and Wu 2018;Jo et al. 2021) propagate along sequences of social actors, data about sequences of proteins, genes, or cells that lead to a specific biological function (Olson 2006;Shapira et al. 2009;Karlebach and Shamir 2008), or data on the flow of goods in logistics networks or supply chains (Kim et al. 2015;Li and Zobel 2020;Pavlov et al. 2019).Methods that help us to model and understand sequential patterns in such data, especially those that cannot be explained by the topology of the underlying network, bear the promise to fundamentally advance our understanding of complex systems (Schwarze and Porter 2021).As such, path-centric models constitute an interesting class of higher-order networks that model both the temporal and topological dimension of complex systems (Lambiotte et al. 2019).Moreover, the ability to accurately predict paths in networks such as travel itineraries in transportation systems (Hackl et al. 2018;RITA 2014;Transport for London 2014) or user navigation on eCommerce websites (Montgomery et al. 2004;Bollen et al. 2009;Hui et al. 2009;Brodley and Kohavi 2000) can help us to more effectively manage travel disruptions, break infection paths, recommend related products, or predict online purchases.
Despite this importance, the modelling of patterns in data on paths in networks is still an open challenge.To address it, we must simultaneously account for three characteristics of real-world data: (i) we typically have large collections of (short) paths with variable lengths, (ii) each of those paths consists of a temporally ordered node sequence, and (iii) possible node sequences are constrained by an underlying network topology, i.e. two subsequent nodes u, v can only occur if a link (u, v) exists.Previously used modelling frameworks have, at most, accounted for two of these three characteristics at the same time.Variable length sequences are typically studied by means of sequence modelling algorithms (Tax et al. 2020;Fournier-Viger et al. 2017).Examples are Petri nets (Weijters and Ribeiro 2011;Augusto et al. 2017), Markov models (Deshpande and Karypis 2004;Gündüz and Özsu 2003;Bernhard et al. 2016;Pitkow and Pirolli 1999;Cleary and Witten 1984;Chierichetti et al. 2012;Singer et al. 2014;Benson et al. 2017), decision trees (Gueniche et al. 2015;Buijs et al. 2012;Leemans et al. 2018), spectral learning (Balle et al. 2014), as well as neural networks and automata (Tax et al. 2020).Such algorithms do not account for the underlying network constraining these sequences.Instead, network models are used to capture topological patterns (Padmanabhan and Mogul 1996;Laird and Saul 1994;Peixoto and Rosvall 2017;West and Leskovec 2012;Brockmann and Helbing 2013) but neglect patterns in the sequence of traversed nodes (Xu et al. 2016;Scholtes 2017).Recent works on higher-order network models have taken a first step towards combining the advantages of sequence modelling algorithms and standard network models.They have improved our ability to cluster and rank nodes (Rosvall et al. 2014;Xu et al. 2016;Scholtes 2017), detect anomalies (LaRock et al. 2019) and open new perspectives for network embedding (Saebi et al. 2019;Belth et al. 2019).However, they still do not account for the finite and variable lengths of paths, which limits our ability to accurately predict variable-length paths in networks.
Addressing this gap, our work makes the following contributions: First, we introduce MOGen, a method to learn compact models of large collections of paths that account for both the temporal ordering and variable-length of paths, as well as underlying network constraints.Second, we show how the models generated by MOGen can be represented in terms of block adjacency and transition matrices, using a previously unknown relation between higher-order network models and matrix representations of multilayer networks.Third, we propose an efficient model selection algorithm that yields an optimal model to encode sequential patterns in a given set of paths.We evaluate this algorithm in empirical data and show that it yields generalisable models that neither under-nor overfit sequential patterns.Finally, we show that MOGen accurately predicts paths in networks.Unlike existing algorithms, our method supports both predicting the next node based on a prefix of varying length (next-element prediction) and the out-ofsample prediction of full paths based on a training set.We apply our method to six data sets on human clickstreams and passenger itineraries in transportation networks and demonstrate its superior performance compared to state-of-the-art sequence modelling algorithms and higher-order network models.An overview of our method is shown in Fig. 1.

Results
We now present the four key results of the article.

Data sets
To obtain these results, we evaluate MOGen in six empirical data sets containing (i) user clickstreams on the Web, and (ii) travel itineraries of passengers in a train and an airline network.Here, BMS1 records clickstreams of customers of the web retailer Gazelle.com (Brodley and Kohavi 2000).Similarly, FIFA captures clickstreams of users of the '98 FIFA World Cup website (Arlitt and Jin 2000), and MSNBC contains clickstreams of page categories on the MSNBC news website (Cadez et al. 2000).Our final clickstream data set WIKI records navigation paths of users in the Wikipedia article graph, who Fig. 1 a Five colour-coded paths between a set of six nodes (A-F) in a networked system.We split the paths into a training set ( , , ) and a validation set ( , ).The topology of the networked system-depicted in grey-limits which transitions can exist.1.

Next-element prediction performance
First, we assess MOGen's performance for the task of next-element prediction.Next-element prediction seeks to predict the next element(s) in a given ordered sequence based on observed data.This problem, typically addressed with sequence modelling techniques, occurs in a number of applications, e.g., recommender systems (Frias-Martinez et al. 2002), speculative Web caching (Bestavros 1995), or purchase prediction in eCommerce (Kraus and Feuerriegel 2019).Uniquely, MOGen considers both the underlying network and the sequential patterns in path data.As a result, MOGen outperforms state-of-the-art sequence modelling algorithms in next-element prediction tasks.Given a prefix of n nodes (v t−n−1 , ..., v t−1 ) traversed by a path, we predict the next step of the path.This next step can either be the next visited node v t , or the end of the path † .We include the prediction of path ends in our evaluation as the prediction of terminal nodes is crucial in variable-length path data, e.g., to predict where users exit a website, where passengers end their itinerary, or whether a purchase is made in an online shop.
We compare the performance of MOGen to state-of-the-art sequence modelling algorithms.Specifically, we compare against the higher-order Markov chain based method AKOM (Pitkow and Pirolli 1999), which has shown the highest performance in the review paper (Tax et al. 2020).In addition, as the authors of (Gueniche et al. 2015) show that CPT+ outperforms AKOM in terms of prediction accuracy, we also include CPT+ in our set of baselines methods.Further, we consider predictions based on a higher-order network with a single layer (NET), the multi-order graphical model (MOM) introduced in Scholtes (2017), and a naive random baseline (RND) that predicts the next node based on the relative frequency of nodes in the training data.To compare the models, we create a 90%/10% train-validation split for each set of path data.We fit the models based on the training paths.From the validation paths, we generate all possible prefix-target combinations up to prefix length N = 6 .Specifically, given a target v t on path (v 1 , ..., v l , †) in the validation data, we enumerate all possible prefixes {(v t−n , ..., v t−1 )} n∈ [1,6] , where n is the length of the prefix.Repeating this procedure for all targets on all paths in the validation data yields a set of prefix-target combinations against which we evaluate next-element prediction performance in terms of the crossentropy loss function.We provide further details on the experimental setup for the nextelement prediction, including additional information regarding the baseline models, the parameter selection, and the cross-entropy loss function in the Materials and Methods section.
In Table 2a, we report the mean cross-entropy loss (in bits) for the six empirical data sets and the six prediction methods described above averaged over five train-validation splits.For each algorithm, we report the performance for the parameters that yield the smallest cross-entropy loss.The results show that MOGen outperforms all other methods.In most of the data sets, we further observe a considerable difference in the crossentropy loss function.We conjecture that this large difference in performance is due to the fact that, different from other methods, MOGen (i) explicitly models varying-length paths in a network, and (ii) is able to predict the termination of paths.This highlights the main contribution of our work, which is a modelling framework that accounts specifically for the characteristics of data on paths in networks.We observe the smallest cross-entropy loss for MSNBC, AIR, TUBE, which we obtain thanks to the-relative to the size of the underlying network-large number of observed paths.FIFA, WIKI, and BMS1 generally yield the largest cross-entropy loss across all methods, which come from the large network topologies and a relatively small number of observations, which hinders a reliable detection of generalisable sequential patterns.Table 2 reports the prediction performance of methods across multiple prefix lengths.
To better understand where the prediction performance originates, we provide additional results comparing the performance separately for each individual prefix lengths n.In Fig. 2, we show a ranking of methods for the different prefix lengths used for the MSBNC data set.For MOGen, we show the performance of models with maximum order K between 1 and 3.The best performing model ( K = 2 , printed in bold) is the one identified as most suitable by our AIC-based model selection discussed below.We report the model parametrisation results for the remaining methods that yield the best performance (shown in brackets).MOGen outperforms the other methods not only across all prefix lengths, but also for each prefix length individually.

Selection of optimal maximum order from data
Second, we assess MOGen's model selection approach (see the Materials and Methods section).The aim of model selection is to identify the model that provides the optimal balance between model complexity and explanatory power, allowing such a model to generalise to unseen data.Existing works use either heuristic techniques (Xu et al. 2016;Rosvall et al. 2014;Pitkow and Pirolli 1999;Gueniche et al. 2015) or analytical methods that require additional model characteristics like, e.g., a nested structure or convergence properties (Scholtes 2017;Casiraghi 2021).MOGen's model selection approach, instead, avoids computationally expensive parameter search algorithms that are used by competing methods.
In the case of MOGen, the only free parameter is the maximum order K of the multiorder transition matrix.Theorem 1 provides an AIC-based model selection approach that enables us to estimate the optimal value K of this free parameter directly from the data.Using this approach, MOGen balances the explanatory power of the resulting model with its complexity, captured by the degrees of freedom.
In Table 2b, we report the output of this model selection for the six empirical data set (top row), comparing it to the maximum order K that yields the best prediction performance (bottom row).For the clickstream data BMS1, FIFA, and WIKI, we obtain an optimal maximum order of K = 1 .For MSNBC, AIR, and TUBE, we obtain optimal values of K = 2 and K = 6 , respectively, highlighting the need for higher-order models.For all analysed data sets, the optimal order detected by the model selection provides the best prediction performance in cross-validation.With this, we confirm that MOGen yields models neither underfitting nor overfitting sequential patterns in paths, effectively making it parameter-free.

Out-of-sample prediction of paths
Third, we assess MOGen's ability to predict full variable-length paths.Existing works have focused on modelling patterns in the sequence of traversed nodes, rather than generative models allowing the out-of-sample prediction of paths (Lambiotte et al. 2019).
MOGen constrains only those transition sequences for which the underlying path data provides sufficient statistical evidence.The model allows all other transitions, enabling us to generate new unseen paths.In other words, given a set of paths, MOGen allows us to predict which other paths are likely to be observed and at which frequency.
Path prediction opens new opportunities and applications, e.g., in the management of travel disruptions, product recommendation, or online purchase prediction.We show that MOGen allows for the accurate out-of-sample prediction of full variable-length paths, even when trained on small samples of observations.We evaluate MOGen's performance in an out-of-sample path prediction task for the following scenario.We aim to predict passenger itineraries in the London Tube based on a training set consisting of 1000 observed paths as we would obtain, e.g., from a survey study.Based on the training set, we fit a MOGen with K = 6 , which we then use to gen- erate a new set S ′ of 100,000 paths of variable lengths.We rank paths in the generated set S ′ based on their frequencies and use the top-N paths to predict the top 10% most fre- quent paths in the validation set.We interpret this prediction as a binary classification and repeat the prediction for different discrimination thresholds N. We then compute true-positive/false-positive rates for different values of N and compute the area under a ROC (receiver operating characteristic) curve.
In Fig. 3, we show ROC curves (and averages) for the out-of-sample path prediction experiment repeated for 100 train-validation splits.We report prediction results for MOGen and a model-less baseline based directly on the training paths.The baseline assumes that the path frequencies in the validation data are identical to the path frequencies in the training data.These two frequencies converge as the size of the training set grows.To make the baseline predictions, we resample 100,000 paths from the training set according to the observed frequencies.
Predicting the top-N paths using MOGenyields an AUC of 0.87 ± 0.01 .In contrast, predicting directly from the training set based on their observed frequencies results in a significantly lower average AUC score with a higher standard deviation ( 0.76 ± 0.02 ).The ROC curve for the baseline contains linear segments because most paths are observed only once or twice, i.e., the training path counts alone do not provide enough information to compute relative rankings between them.MOGen is able to recover the information about paths' probabilities from transition-and subpath-frequencies observed in the training data resulting in the higher AUC scores seen in Fig. 3. Thus, using MOGen yields a better and more consistent prediction.
In an additional experiment, we compare the path frequencies generated from MOGen and the baseline against the 'true' path frequencies observed in the validation set.We do so by computing Kendall's τ rank correlation coefficients for the two generated path sets against the validation set.Kendall's τ correlation measures the similarity of the rankings obtained by ordering the paths according to the generated frequencies with the ranking obtained according to the 'true' frequencies in the validation data.The average baseline's correlation coefficient is 0.11 ± 0.02 .A low correlation is expected as the training set contains only 1000 out of over 4 million total paths.Notably, the correlation based on MOGen's generated paths is significantly higher, averaging 0.40 ± 0.01 .This result shows that the new paths generated by MOGen and their distribution are compatible with the validation data.Thus, the remarkable improvement in predictive performance originates from MOGen generating new paths not present in the training data but compatible with the observed transitions and the underlying network topology.

Scalability
Fourth, we show that MOGen is highly scalable to large sets of path data both in terms of time and memory requirements.The scalability results from the modular multi-layer formulation of the method (see the Materials and Methods section), which allows for an efficient sparse representation and a parallel implementation.
Our implementation parallelises the training of a multi-order model by splitting paths into multiple sets and simultaneously computing entries of the multi-order adjacency matrix A (K ) and transition matrix T (K ) for each of the sets, and finally aggregating those entries.We speed up the computation of the model likelihood (cf.Eq. 4) by computing the probabilities of paths in parallel.The left panel of Fig. 4 shows the time (in seconds) that is required to (i) train multi-order models for multiple parameters K, and (ii) select the optimal parameter K using the method described in Eq. 2 for different numbers of processing cores.The results show that the time required to fit an optimal model primarily depends on the number of unique paths as well as their length (cf.Table 1).We also observe a near-linear reduction of the training time as we increase the number of processing cores (grey line refers to an optimal linear speedup).The deviation from the linear reduction originates from the non-parallel implementation of the matrix powers of A , required to compute the degrees of freedom.The size of the models generally increases superlinearly with the maximum order K.The scaling of model size depends on the size and algebraic connectivity of the underlying topology, which determines the growth of the degrees of freedom with the parameter K.This is illustrated by the TUBE data set, where the increase in model size is lowest as the order increases due to a sparsely connected network topology.All models selected by MOGen are below 1 MB in size.

Discussion
Path data, i.e., time-ordered ordered sequences of nodes that are traversed, e.g., by passengers in a transportation network, users navigating in a network of hyperlinked articles, or processes propagating in a complex network are increasingly available in a wide range of scientific disciplines.Modelling and understanding sequential patterns in such data, especially those that cannot be explained by the underlying network's topology, promises to fundamentally advance our understanding of complex systems.Such patterns not only capture how the elements of a system indirectly influence each other, they also enable us to predict paths in networks, with applications, e.g., in the design of recommender systems, smart mobility services, or eCommerce.To do so, we must simultaneously account for three characteristics of real-world path data: (i) we typically have large collections of (short) paths with variable lengths, (ii) each of those paths consists of a time-ordered node sequence, and (iii) possible node sequences are constrained by Closing this gap, we introduced MOGen, a generative modelling framework to learn compact higher-order models of sequential patterns in paths that account for both the temporal ordering and variable-length of paths, as well as underlying network constraints.Addressing the challenge of overfitting, MOGen provides an integrated model selection that yields compact and generalisable graphical models for patterns in variable-length paths.Our results confirmed that the selected models are indeed those that generalise best to unseen data, effectively making our method parameterfree.Using empirical data, we showed that MOGen enables both next-element prediction and out-of-sample prediction of full variable-length paths with high accuracy and consistency.Moreover, it outperforms state-of-the-art sequence modelling algorithms and higher-order network models when predicting the next visited element, as we showed for six empirical path data sets.We finally showed that we can use it to generate surrogate data, i.e., new unseen paths that are compatible with patterns observed in the data.MOGen's formalisation allows for a highly efficient and scalable representation as a sparse multi-order transition matrix.By simultaneously modelling sequential and topological patterns in empirical data on paths in networks, the resulting models further advance our understanding of complex systems.We provide a parallel implementation of MOGen as an Open Source project.
In future work, we will further evaluate the novel opportunities that path generation and multi-order network analysis with MOGen open in real-world applications.We will further explore the link between MOGen and related methods from the domain of language modelling.

MOGen formalisation
We first formally define path data and present the generative modelling framework and model selection algorithm that are the main contributions of our work.We assume that we are given a multi-set of m paths S = {t 1 , . . ., t m } on a network G = (V , E) with nodes V and (directed) links . We assume that paths in S have varying lengths l i , and we denote the maximum path length as l max .If we were to use a simple network-based model for the sequences of nodes traversed by paths, we would assume that paths were generated by a memoryless random walk corresponding to a first-order Markov chain.The nodes v ∈ V correspond to the state space of such a Markov chain, while links (v, w) ∈ E are associated with transition probabilities capturing the probability of the transition from v to w.This assumes that the next node v i+1 traversed by a path only depends on (i) the currently visited node v i , and (ii) the links that determine which state tran- sitions are possible.To capture higher-order dependencies in the sequence of traversed nodes that are not due to the network topology, we can generalise this model to a k-th order Markov chain, where the next node v i+i depends on the k previously visited nodes v i−k , ..., v i .We can view this model as a memoryless random walk in a k-th order network G k = (V k , E k ) , where (i) each k-th order node (v 1 , ..., v k ) ∈ V k is a sequence of k nodes where consecutive nodes are connected by links, and (ii) each k-th order link connects two higher-order nodes (v 1 , ..., v k ) and (w 1 , ..., w k ) such that

Multi-order models of paths
The modelling approach above has several limitations with respect to the unique characteristics of variable-length paths in S. (i) Using a k-th order model, we can neither use the first k nodes in each path to fit the transition probabilities of the model nor can we use the model to perform a next-element prediction for the first k nodes on a path.(ii) The fact that the number of used to fit the transition probabilities of a k-th order model differs across different orders k complicates model selection, i.e., the selection of the optimal order k to model a data set.(iii) The model neglects information on the start-and endpoint of paths, hindering its application to an out-of-sample prediction of full variable-length paths, i.e., from start to end.(iv) Finally, a model with a single order k is limited to capture sequential patterns of a single length k, which is why recent works argued for variable-and multi-order modelling techniques (Scholtes 2017;Xu et al. 2016;Pitkow and Pirolli 1999).
To overcome those limitations, we introduce a generative modelling framework that combines (i) transitions between higher-order models of multiple orders up to a maximum order K and (ii) special transitions that capture the start and endpoints of paths.For this, we introduce a special initial state * , with transitions * → v to first-order nodes v ∈ V that mark the start of a path in node v.For all orders 1 ≤ k ≤ K , we further add transitions from k-th order nodes (v 1 , ..., v k ) → † to a special terminal state † , which cap- tures the termination of a path.With this extension, a path v 1 → v 2 → • • • → v l corre- sponds to l + 2 transitions in a multi-order network with maximum order K, which gives rise to the following sequence of l higher-order nodes with two additional transitions from/to the special states * and † to model the start and end of the path.Considering the coding of memory prefixes in the extended state space of higher-order Markov chains, we note that each transition in this multi-order model with maximum order K increases the memory prefix by one node, up to a maximum memory of length K.In addition to transitions within a K-th order model, we obtain a multi-order model that includes transitions between the nodes of a k-th order model and the nodes of a k + 1-th order model for all orders k < K .Transitions from/to the initial state * and terminal state † explicitly model the start-and endpoints of paths, which is the basis for an end-to-end, out-of-sample prediction of full paths.
We can mathematically represent such a multi-order network with maximum order K through a multi-order adjacency matrix A (K ) , whose overall structure is shown in Fig. 5a.This representation is inspired by so-called supra-adjacency matrices, which have recently been proposed to represent the interconnected topologies of multi-layer networks (Kivelä et al. 2014).A multi-order adjacency matrix consists of blocks A k−1,k , whose entries count the observed transitions between k − 1-th and k-th order nodes in a given multi-set S of paths.The entries in the top-left block A 0,1 ∈ R 1×n count transitions ( * , v) for all nodes v ∈ V , i.e., the distribution of start nodes for all paths.The entries of the bottom-right block A † count terminal transitions (v 1 , ..., v k ) → † for all orders 1 ≤ k ≤ K .For paths with lengths l i > K , the entries in the bottom matrix A K ,K count the transitions between the nodes in a K-th order network model.Figure 5a shows an example for a multi-order adjacency matrix that corresponds to a toy example where each path shown in Fig. 1a is observed ten times.We note that structural zeros in A (K )  (greyed out in Fig. 5b) represent transitions that are impossible based on the underlying network topology, which are different from zero entries that represent possible but unobserved transitions (see zeros in highlighted blocks in Fig. 5b).Considering these structural zeros allows for a sparse memory-efficient representation of our method.The only model parameter K determines the maximum order of the multi-order model, which resembles the maximum memory length in a Markov chain.We note that, under the assumption that the paths in a multi set S = {t 1 , . . ., t m } with maximum path length l max are generated independently of each other, the multi-order adjacency matrix A (K ) is a lossless representation of S iff K ≥ l max .To facilitate prediction tasks, we use the multi- order adjacency matrix to define a probabilistic generative model of variable-length paths.This model is fully described by a multi-order transition matrix T (K ) , which we obtain by a row normalisation of A (K ) .This yields a stochastic matrix T (K ) , whose entries give the transition probabilities in a probabilistic model for paths.
The transition matrix of a generative multi-order model with maximum order K = 3 that corresponds to the toy example in Fig. 1a is shown in Fig. 5c.To illustrate our method in the toy example shown in Fig. 1a, we present here results for a simple experiment.Varying the observed frequencies of the five (colour-coded) paths shown in Fig. 1a, we report the different estimated optimal orders K .Due to the topology of the network shown in Fig. 1a, paths that start in node A or B and continue to C and D either terminate in node E or node F. In order to model paths whose probability to terminate in E or F depends on whether they start in A or B, a MOGen with maximum order three is needed.Such a dependency exists if the paths 5d shows that our model selection algorithm yields the cor- rect optimal order.If, however, the terminal node of a path is independent of the start node, a model with order two is sufficient.This is the case if the frequencies of all four paths are similar, in which case our model selection algorithm yields the correct value for K = 2 (cf. Figure 5d).This is illustrated by the identical values in T (3)  3,3 in Fig. 5c.These results merely illustrate our method.We perform a thorough cross-validation experiment in the Results section.With this, we confirm that the best prediction performance is obtained for the parameter K estimated according to Theorem 1, which we introduce in the next section.This implies that, different from existing techniques, the parameter K can be directly estimated from the data, which effectively turns our method into a parameter-free approach.

Optimal maximum order detection
The unique formulation of MOGen allows the comparison of likelihoods across different maximum orders K. Thanks to this, we can utilise Akaike's Information Criterion (AIC) (Akaike 1974) to choose MOGen's optimal maximum order K .AIC balances the explanatory power of a model, given by its likelihood function L(T (K ) |S) , with its com- plexity, measured in terms of the model's degrees of freedoms d(T (K ) ) .Given a multi set S of m independent paths, L(T (K ) |S) is simply the product of all path probabilities P(t|T (K ) ) .The probability P(t|T (K ) ) of a single observed path t in S is instead given by the probability of observing all its transitions in sequence, as specified by T (K ) .This for- malisation is easily parallelisable.Therefore, as we show in our results, MOGen is highly scalable and memory efficient.
The degrees of freedom d(T (K ) ) need to account for (i) structural zeros in T (K ) result- ing from paths being restricted to a network, (ii) the multi-layer structure of MOGen combining multiple orders k, and (iii) additional model parameters used to model transitions from/to the initial and terminal states.Thus, d(T (K ) ) depends on the topology of the underlying network, 4 encoded by A .Possible paths of length k in such a network are encoded by the non-zero entries of A k .These specify which transition probabilities need to be estimated from the data S and which are structural zeros.The degrees of freedom of a MOGen/ with maximum order K are then: Since the best model among a set of candidate models is the one that minimises the AIC, we can compute the optimal maximum order of a MOGen model given a set of paths as follows: Theorem 1 Let S = {t i } be a multi-set consisting of m independent paths.Let T (K ) be the multi-order transition matrix defining the MOGen of maximum order K constructed from S. Let A be the binary adjacency matrix where A vw = 1 iff the transition v → w appears at least once in S. Then the optimal maximum order K according to AIC is where P(t|T (K ) ) denotes the probability of a path t ∈ S.
Formal Proof of Theorem 1 We split the formal proof of Theorem 1 in a series of small lemmas.Doing so, we go through all the steps needed to obtain Eq. 2. The first step consists of specifying the probability of observing a path t according to a multi-order model defined by a multi-order transition matrix T (K ) .
Lemma 1 (Probability of a path) Let T (K ) be a multi-order transition matrix defining a multi-order model of maximum order K.Then, the probability P(t|T (K ) ) according to corresponds to l + 2 transitions along the edges of a multi-order network with maximum order K, which gives rise to the following sequence of l higher-order nodes: The probability of each of these transitions is (a) independent by definition and (b) completely specified by the multi-order transition matrix T (K ) .Thus, the probability P(t|T (K ) ) of observing a path t is given by the product of the l + 2 probabilities given in T (K ) of observing each higher-order transition defining the path.
Employing the results of Lemma 1, we can compute the probability of a multi-set S of independent paths. (1) Lemma 2 (Probability of a paths' collection) Let T (K ) be a multi-order transition matrix defining a multi-order model of maximum order K. Let S = {t i } be a multi-set consisting of m independent paths.Then, the probability of observing S according to T (K ) is where P(t|T (K ) ) is given by Eq. 3.
Proof All paths t in S are assumed to be independent.Thus, the probability T (K ) of observing S according to T (K ) is simply the product of the probabilities P(t|T (K ) ) of observing each path t ∈ S .
These first two lemmas allow computing the likelihood of a multi-order model given some observed data in the form of a (multi-)set of paths.The final lemma allows defining the number of degrees of freedom of a multi-order model constrained by a graph G(V, E).

Lemma 3 (Number of possible transitions)
Let A be the first-order binary adjacency matrix specifying a graph G(V, E) over which a multi-order model T (K ) of maximum order K is defined.Then, the maximum number N τ of possible transitions in T (K ) is given by where V k denotes the set of connected k-th order nodes in T (K ) .
Proof Because the multi-order model is defined over a graph G(V, E), to all multiorder transition probabilities in T (K ) correspond first-order edges in E. Similarly, for all edges e ∈ E , there is at least one non-zero multi-order transition probability in T (K ) that encompasses e.Thus, K k=1 i,j A k ij gives the number of all transitions up to order K that are possible according to G(V, E).Furthermore, there possibly exist |V | transitions from the initial state * to any node in V. Finally, there possibly exist | K k=1 V k | transitions from any higher-order node in K k=1 V k to the terminal state † .Summing over all possi- ble transitions, we get to Eq. 5.
With the results provided by the above lemmas, we can proceed with a proof for Theorem 1.

Proof of of Theorem 1
For a multi-order model, we can calculate the AIC as follows: where d(T (K ) ) denotes the number of degrees of freedom of the model identified by T (K ) , and L(T (K ) |S) its likelihood given the data S.
(4) P(S|T (K ) ) = t∈S P(t|T (K ) ), (5) The number of degrees of freedom d(T (K ) ) of a multi-order model corresponds to the number of multi-order transition probabilities in T (K ) that have to be estimated from the data.The elements of T (K ) can be divided into two classes: structural zeros and possible transitions.Structural zeros are those multi-order transitions that cannot be observed, and thus their probabilities are set to zero and do not contribute to the number of degrees of freedom of the model.The values of the remaining transition probabilities have to be estimated from the data.Thus, they contribute to the degrees of freedom of the model.For this reason, d(T (K ) ) can be computed by counting how many transitions are possible according to the observed data S.We denote with A the binary adjacency matrix whose elements A vw = 1 iff a transition v → w has been observed at least once in S. Given A , we can find the maximum number of possible transitions N τ using Lemma 3. Furthermore, we note that the multi-order transition matrix T (K ) is row-stochastic.The number of degrees of freedom d(T (K ) ) is then given by N τ minus the number of rows r = | K k=1 V k | + 1 of T (K ) .The reason for this is that by accounting for the row- stochasticity of T (K ) one transition probability per row need not be estimated: This provides us with an explicit expression for the first term in Eq. 6.
To get to Eq. 2, we further need to compute the likelihood L(T (K ) |S) of the model defined by T (K ) .As L(T (K ) |S) = P(S|T (K ) ) , we can use Theorem 1,2 to express it in terms of the probabilities P(t|T (K ) ) of the paths t ∈ S. Finally, we highlight that the best model among a set of candidate models is the one that minimises the AIC.Thus, plugging Eq. 7,4 in Eq. 6 and taking the argmin of the resulting expression gives us Eq. 2 (after ignoring constant terms), concluding this proof.

Selection of baseline models
We base our selection of baseline methods for next-element prediction on the recent review (Tax et al. 2020).The authors of Tax et al. (2020) find that AKOM (Pitkow and Pirolli 1999) outperforms all other algorithms in three out of four tested data sets.Conversely, in Gueniche et al. (2015) the authors of CPT+ show that their algorithm outperforms AKOM in terms of the accuracy of the next element prediction.We thus use both CPT+ and AKOM as baselines, utilising the implementations provided by the authors of Fournier-Viger et al. (2014).To enable a fair comparison for empty prefixes, we modified the implementation of AKOM to return a prediction based on the frequency of elements in the training data.This is described in Pitkow and Pirolli (1999).However, the implementation in Fournier-Viger et al. (2014) returns an empty prediction instead.CPT+ predicts the next element in a sequence based on an internal assignment of weights to ( 7) candidate elements, i.e., the prediction is not based on probabilities.We thus normalise the weights assigned by CPT+ to obtain a probabilistic prediction.While this allows us to compare the performance of CPT+ with the other (probabilistic) methods, we highlight that the evaluation in Gueniche et al. (2015) is based on prediction accuracy.It is crucial noting that the authors of Tax et al. (2020) report that AKOM and state-of-theart RNNs achieved the highest and, at the same time, very similar levels of performance on four datasets.As our method is based on a Markov chain model, we thus opt to use the comparable method AKOM as a benchmark for our method and not RNNs.Based on the findings in Tax et al. (2020), we expect these results to hold also for RNNs.
In addition to those sequence modelling techniques, we compare our method to a probabilistic prediction derived from (i) a random walk in a higher-order network with a single order k (NET), and (ii) the multi-order graphical model (MOM) introduced in Scholtes (2017).For k = 1 , NET is identical to a prediction generated by a random walk in the underlying network.Note that, despite the similar name, the method proposed in the present work considerably differs from MOM.First, MOM not model the endpoint of paths, limiting its ability to predict variable-length paths.Second, it uses a model-fitting approach that estimates transition probabilities based on sub-path frequencies rather than the transition probabilities used in our approach.In addition to those network-based models, we use a naive (parameter-free) random baseline (RND) that simply predicts the next node based on the relative frequency of nodes in the training set, i.e., a prediction that neither uses a Markov chain nor the network topology.

Selection of model parameters
AKOM, MOM, NET, and MOGen are characterised by a single parameter, i.e., their (maximum) order.We search for the optimal order in terms of predictive performance.We obtain this optimal order by iteratively increasing the model order from one until a maximum in predictive performance is reached.For CPT+, we use the parameters proposed in Gueniche et al. (2015).Our naive random baseline (RND) is parameter-free.We report the optimal (maximum) orders for all models and data sets in Table 3.

Cross-entropy loss
Following (Bishop and Nasrabadi 2006), we define the cross-entropy loss function for next-element prediction as (8) H (p, q) := − v∈V p(v) log q(v) where for a given prefix (v 1 , ..., v k ) , q(v) denotes the probability of the next element (i.e., target node) v obtained from a model trained on the training set and p(v) is the true distribution underlying the data.Since the true distribution is unknown for the empirical data set, we compute p(v) from the distribution of actual next nodes in the validation set.This yields the log-loss function commonly used in the cross-validation of probabilistic multi-class prediction (Bishop and Nasrabadi 2006).We evaluate predictions for multiple prefix lengths up to a maximum length of six.Those are considered as multiple samples and assign equal sample weights such that sample weights for any given target node with different prefix lengths sum to one.For the resulting loss function, a value of zero corresponds to a perfect prediction, which would allow us to replace the true distribution in the data with the prediction without information loss.Non-zero values quantify the information loss in bits, i.e., larger values correspond to worse prediction performance.

Fig. 2
Fig. 2 Prediction performance on different prefix lengths for MSNBC 3 The resulting model sizes (in MegaBytes) for different orders K are shown in the right panel of Fig.4.

Fig. 3
Fig. 3 Out-of-sample path prediction with MOGen.ROC for 1000 training and 4,294,731 validation paths for TUBE

Fig. 4
Fig. 4 The left panel shows the mean and standard deviations of the times required for training model selection (five iterations).The grey line in the left panel indicates a linear speedup.The right panel shows the resulting model size for all datasets.Highlighted models in the right panel are the ones picked through model selection

Fig. 5
Fig. 5 Multi-order matrix representation of MOGen.a shows the block structure of the multi-order adjacency matrix.b and c show A and T for MOGen with K = 3 on the example from Fig. 1a, where all paths occur 10 times.d shows the K detected by model selection for varying observation counts.The observation from (b) and (c) is highlighted in white (West and Leskovec 2012)B.They are of variable length as they end either after a single transition in C, or after three transitions in E or F. Consider aiming to predict if a path arriving in C continues to D or ends.Only paths starting in A can end in C, whereas all paths starting in B continue to D. Hence, to make an informed prediction, we need to account for the temporal ordering of transitions, i.e., we require memory recording the sequence of visited nodes before arriving at C. b To fit MOGen, we estimate a set of models encoding the observed transitions with different maximum memory lengths.cWeapplyMOGen'smodel selection algorithm to determine the optimal maximum order.dWefind the optimal model for the given data.eWeusethisoptimal model to predict the paths in the validation data were playing the game Wikispeedia(West and Leskovec 2012).For our second category, AIR contains flight itineraries of passengers travelling on routes between US airports in 2001 (RITA 2014), and TUBE captures itineraries of London Tube passengers (Transport for London 2014).The raw data for all data sets are freely available online.2We provide extended summary statistics in Table

Table 1
Summary statistics for the six data sets used to validate MOGen.

Table 2
Next-element prediction performance of all models (a) Mean and standard deviation of the cross-entropy loss over five train-validation splits.For each data set, the result of the best performing model is highlighted in bold.(b) Maximum order of the best performing MOGen model and the order detected by model selection

Table 3
Best performing model parameters for all data sets