Node embeddings in dynamic graphs

In this paper, we present algorithms that learn and update temporal node embeddings on the fly for tracking and measuring node similarity over time in graph streams. Recently, several representation learning methods have been proposed that are capable of embedding nodes in a vector space in a way that captures the network structure. Most of the known techniques extract embeddings from static graph snapshots. By contrast, modeling the dynamics of the nodes in temporal networks requires evolving node representations. In order to update node representations that reflect the temporal changes in the local graph structure, we rely on ideas from data stream algorithms. For example, we assess neighborhood overlap by a MinHash fingerprint-based algorithm. To evaluate our methods, in addition to the standard link prediction task, we provide dynamic ground truth data for the quantitative evaluation of similarity search by using online updated node embeddings. In our experiments, we constructed tennis tournament Twitter mention graphs as edge streams and compiled dynamic ground truth by using the tournament schedule as an external source. Our new algorithms outperformed snapshot-based batch methods for both link prediction and similarity search.


Introduction
The need for machine learning over data streams is motivated by a rapidly growing number of industrial applications of graph algorithms (Wang et al. 2017;Nie et al. 2017;Zhou et al. 2017;Wei et al. 2017) and online machine learning (Bifet et al. 2010;De Francisci Morales et al. 2016;Zhu and Shasha 2002;Žliobaite et al. 2012). In graph streams, the combination of the two areas, edges arrive continuously over time from a large network and have no duration (McGregor 2014). We intend to apply online machine learning (Bifet et al. 2010) for link prediction and similarity search by learning and updating node feature representations on the fly from graph streams.
The principal task of online machine learning is to learn a concept incrementally by processing data immediately after creation (Widmer and Kubat 1996), for example, after each mention in a Twitter mention graph. Traditional, batch learners build static models from finite, static data sets, which do not change over time. By contrast, stream learners build models that evolve over time. For example, in graphs, more recent edges can form a more relevant picture of the current network structure than older ones. The final model will strongly depend on the order of examples generated from a continuous, non-stationary flow of data. Modeling is therefore affected by potential concept drifts or changes in distribution (Gama et al. 2013). Online learning seems more restricted than batch learning, which can iterate over the data set several times, and thus one could expect inferior results from online methods. By contrast, in some cases (Frigó et al. 2017), online methods perform surprisingly strongly.
To track node properties in a graph stream, we adapt the highly successful technique of node representations. Representation learning methods on graphs encode the nodes of the network to points in a low-dimensional vector space. In general, representations in the embedded space should reflect the structure of the original graph. The research area of node embeddings has been recently catalyzed by the Word2Vec algorithm, developed for natural language processing. Several node embedding methods have been proposed recently (Perozzi et al. 2014;Tang et al. 2015;Grover and Leskovec 2016;Qiu et al. 2018) and applied successfully for multi-label classification and link prediction in a variety of real-world networks from diverse domains.
In order to generate node embeddings, we have to solve the challenge of maintaining node embeddings for tracking and measuring node properties and similarities as the edges arrive. Most graph algorithms are difficult to update online. For example, to compute random walk-based embeddings (Grover and Leskovec 2016), we have to be able to maintain not just the embedding but also the set of walks whenever a new edge appears in the stream.
Time-aware relevance evaluation also becomes troublesome in the presence of fast changes in network structure. For link prediction (Liben-Nowell and Kleinberg 2007), in an edge stream we can update our model immediately after the new edge arrives, and predict a completely new list in the next step. For this so-called predictive sequential (abbreviated as prequential) evaluation (Dawid 1984), we have to define new evaluation metrics. The same difficulties for evaluating streaming recommenders were first observed in (Lathia et al. 2009).
To fully utilize the power of graph embedding, our main focus for evaluation is similarity search, in which we assess the information encoded in the embedding about node pairs rather than just a global property required for predicting links. For similarity search, the prime source of difficulty lies in ground truth compilation. Static relevance measures such as precision, recall, or NDCG already require ground truth labeling, which itself often requires tedious human effort such as the effort that has been made for TREC topics (Clarke et al. 2004). In a dynamic graph stream, depending on time granularity, the same human data curation may be required in each time step.
Algorithms for temporal graphs have already started to emerge in publications; however, very few graph learning algorithms are capable of immediately updating their models from edge streams. Similarly, in the literature we rarely find real graph streaming methods where node labels are highly dynamic: even link prediction tasks are evaluated in batches for sets of edges that appear over a longer period in time. Any embedding method can be applied in dynamic graphs by considering graph snapshots in time. However, such solutions not only react slowly, but also build new representations for every snapshot; hence they require an entire model retraining for downstream machine learning tasks (Hamilton et al. 2017a). The more natural part of the task is updating the embedding: gradient descent is a commonly used optimization procedure, which naturally lends itself to online learning algorithms (Juang and Lin 1998) as well. However, walk-based embedding methods published so far could only efficiently rebuild the walks for snapshots or larger batches of insertions and deletions, and were not able to update the set of walks for every single new edge in the stream.
Present work. StreamWalk, our first algorithm, updates the node embedding online to track and measure node properties and similarity from a graph stream (Rozenshtein and Gionis 2016). StreamWalk is based on the recent concept of temporal walks containing edges ordered in time. As illustrated in Fig. 1, the StreamWalk algorithm picks node samples for a single node from its temporal neighborhood, using temporal walks ending in the node. Given the sample, we optimize to make the embedding of the sample similar to the source node. Our algorithm performs online machine learning (Bifet et al. 2010) by continuously updating a model as we read the graph stream. Its key ingredients are online gradient descent optimization (Juang and Lin 1998) and time respecting temporal walks (Rozenshtein and Gionis 2016).
StreamWalk improves over static node representations as applied to graph streams in three key ways:
• It accounts for the ordering of edges by sampling only from time respecting random walks, capturing richer information about graph structure.
• It includes an efficient data structure to sample random walks online without storing the entire edge set.
• The node representations evolve over time to reflect changes in network structure.
Our second algorithm directly learns the neighborhood similarity of node pairs in the graph stream, which we call second order similarity. By MinHash fingerprinting (Fogaras and Rácz 2005), we efficiently approximate the neighborhood Jaccard similarity of any two nodes at a given time. Then we optimize the embedding to make pairs similar, proportional to the overlap of their neighborhood.
Our main results are twofold:
• We design two algorithms that can update their node representations quickly in large graph streams, and outperform the baselines, among others, in link prediction tasks.
• We design a quantitative experiment for assessing the quality of temporal node embeddings based on the Twitter tennis tournament mention graphs of (Béres et al. 2018), which include temporally changing node labels. In a supervised experiment, we show that online updateable embeddings capture node similarities better than static embeddings.
Fig. 1 Concept of StreamWalk. When edge uv arrives in the stream, vertices from the temporal neighborhood of u are sampled via temporal random walks. The method optimizes for the similarity of v and the sampled node w
The rest of this paper is organized as follows. First we summarize the related works in "Related works" section. In "Dynamic vector space embedding methods in edge streams" section we introduce StreamWalk and our method for learning neighborhood similarities directly from graph streams. In "Similarity search experiments" section we first examine the quality of dynamic node embeddings on the RG17 and UO17 Twitter data sets. Finally, in "Online link prediction" section we consider the online link prediction problem as another evaluation of our methods.

Related works
Temporal networks. A large variety of temporal network algorithms have appeared for connectivity, spanning trees, matchings, and many more, which are surveyed, for example, in (Holme and Saramäki 2012; Aggarwal and Subbian 2014). The usual approach for analyzing temporal graphs is to use timestamps to create a series of static graph snapshots (Kumar et al. 2010). High temporal granularity networks are considered in the edge or graph stream model (McGregor 2014) where edges must be processed once they arrive in the stream; for example, a random walk algorithm is described in (Sarma et al. 2011).
The concept of time respecting paths, in which adjacent edges must be ordered in time, is key in our results and directly used in one of our embedding models. The concept was perhaps introduced in (Moody 2002) for analyzing diffusion in networks. In another terminology, temporal walks were used to construct time-aware centrality metrics in (Rozenshtein and Gionis 2016;Béres et al. 2018).
Online machine learning. The area of online machine learning covers algorithms that work from data streams with only a limited possibility to store past data (Bifet et al. 2010). We define our models over graph streams where the data stream consists of the edges of the graph. Our models are online updateable, hence they are capable of adapting to concept drift.
Link prediction from graph streams is one of our tasks where the goal is to predict the next edge appearing in the edge stream. The problem is closely related to recommender systems where the strength of online machine learning has been observed recently (Ling et al. 2012;Frigó et al. 2017). Note that for link prediction we use the prequential evaluation, which was published more than thirty years ago (Dawid 1984) but has only recently come into widespread use (Gama et al. 2013) for streaming algorithms.
Representation learning on graphs. Embedding methods on graphs encode the nodes of the network to vectors in a low-dimensional vector space. In general, representations in the embedded space should reflect the structure of the original graph. Perhaps the most well-known method is Laplacian eigenmaps (Belkin and Niyogi 2002). Another class of models is based on the adjacency matrix of the graph; one popular example is graph factorization (Ahmed et al. 2013). Recently, random walk-based approaches have been proposed, like Node2Vec (Grover and Leskovec 2016), LINE (Tang et al. 2015), and DeepWalk (Perozzi et al. 2014). These methods sample node pairs that co-occur in random walks, and then optimize for their similarity in the embedded space. Walk sampling is motivated by the skip-gram model from natural language processing. Furthermore, the aforementioned techniques can be unified under a matrix factorization framework (Qiu et al. 2018).
We briefly review the methodology of the above approaches by following (Hamilton et al. 2017b). Static embedding methods learn an embedding vector $q_u$ for each node u in the graph. Usually the objective is to learn vectors that are similar for neighboring nodes. Let s(u) denote the neighborhood of u; then our goal is to satisfy $q_v \approx q_u$ for $v \in s(u)$. Shallow embedding approaches for static graphs differ in the objective function they use to ensure the similarity of the embeddings, and in the definition of the network neighborhood s(u).
Graph factorization (Ahmed et al. 2013), GraRep (Cao et al. 2015), and HOPE (Ou et al. 2016) optimize for the squared error (SE) over node pairs in the neighborhood:
$$\sum_{u \in V} \sum_{v \in s(u)} \bigl(q_u^{\top} q_v - \mathrm{sim}(u, v)\bigr)^2, \tag{1}$$
where sim(u, v) is the similarity of two nodes measured from the graph structure. The definition of the neighborhood is based on the adjacency matrix. Graph factorization calculates with adjacent neighbors, while GraRep uses higher powers of the adjacency matrix, for example, two-hop neighbors.
As a different method, random walk-based approaches (Grover and Leskovec 2016;Perozzi et al. 2014) sample vertices from the neighborhood of a node. Sampling is done by initiating random walks from node u. Instead of SE, these approaches optimize for cross-entropy loss:
$$\sum_{u \in V} \sum_{v \in s^{*}(u)} -\log \frac{\exp(q_u^{\top} q_v)}{\sum_{v' \in V} \exp(q_u^{\top} q_{v'})}, \tag{2}$$
where $s^{*}(u)$ is a random sample from the neighborhood of u. Many of the above mentioned algorithms use the Word2Vec model as an underlying abstraction by training the model, using sampled walks analogously to sentences, and using the learned embeddings as node embeddings. We follow this approach, and also investigate the use of either the input ($W_1$) or the output embedding ($W_2$) of the model (Press and Wolf 2016) as the vector space representation of the graph.
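To make the two objectives concrete, the following minimal sketch evaluates one squared-error term and one cross-entropy term for a single node pair. It assumes an inner-product decoder and a softmax over all candidate nodes, which is the usual reading of these losses but is not spelled out in the text; the function names are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_error(q_u, q_v, sim_uv):
    """Squared error between the embedding inner product and a target similarity."""
    return (dot(q_u, q_v) - sim_uv) ** 2

def cross_entropy(q_u, q_v, all_vectors):
    """Negative log softmax probability of v given u over all candidate nodes."""
    scores = [dot(q_u, q) for q in all_vectors]
    m = max(scores)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - dot(q_u, q_v)
```

In practice the normalizing sum over all nodes is too expensive, which is why Word2Vec-style training replaces it with negative sampling.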
The above models learn static embeddings on graph snapshots; however, they mention extensions towards online learning from graph streams. In DeepWalk (Perozzi et al. 2014), the possibility of an online incremental update is proposed but not analyzed. An incremental update for LINE with a batch of edge insertions and deletions is described in (Yu et al. 2018), but no attempt is made to analyze the online, single edge insertion behavior. Closest to our work is the continuous-time dynamic network embedding result (Nguyen et al. 2018), which does not learn online but computes an embedding for a single point in time. Similarly, the HTNE algorithm (Zuo et al. 2018) produces temporal node embeddings, but training is done in batch instead of executing online updates. A promising direction for computing the embedding dynamically involves recurrent neural networks, for example, Long Short-Term Memory networks; however, the applicability for graphs is not yet explored.

Dynamic vector space embedding methods in edge streams
We describe two node embedding approaches that are applicable in edge streams. The input of both algorithms consists of an edge stream (u, v, t) ordered by time t in which each edge can occur multiple times. As required by the data stream algorithmic model, we process the edges in the order of arrival without storing the entire input.
Our goal is to dynamically learn node representations by reflecting the current node similarity structure of the evolving graph as we dynamically change the location of the nodes in the vector space. To this end, we give two embedding methods in the next two subsections. Two nodes are required to be mapped close in the vector space whenever they lie on short paths formed by recent edges in the first model, and whenever the set of their recent neighbors is similar in the second model.

Similarity based on reachability through short temporal walks
In our first algorithm, our goal is to enforce that the embedding of node v be similar to the embedding of nodes with the ability to reach v across edges that appeared recently, as shown in Fig. 1. In other words, the embedding of a node should be similar to the embedding of nodes in its temporal neighborhood. We define time respecting temporal walks (Rozenshtein and Gionis 2016) in order to sample, for each node u at any time t, nodes from its temporal neighborhood. As seen in Fig. 2, a temporal walk consists of adjacent edges ordered in time:
$$z = \langle (u_1, u_2, t_1), (u_2, u_3, t_2), \ldots, (u_j, u_{j+1}, t_j) \rangle, \quad t_1 < t_2 < \cdots < t_j. \tag{3}$$
For example, there are three temporal walks leading to node v in Fig. 4: $(e_1)$, $(e_3)$, and $(e_2, e_3)$. Since edges can appear multiple times, we consider the edge set as a multiset and distinguish the walk $(e_2, e_3)$, which is a temporal walk, from $(e_2, e_1)$, which is not, since $e_1$ comes earlier than $e_2$.
To define the similarity, we want to give more weight to shorter walks and more weight to fresh edges. Towards this end, for a temporal walk z where edges appeared at $(t_1, t_2, \ldots, t_j)$, we define the probability of the walk at time t as
$$p(z, t) = \beta^{|z|} \prod_{i=1}^{j} \gamma(t_{i+1} - t_i), \tag{4}$$
where $\beta \le 1$ is an exponential decay on the length $|z| = j$ of the walk, $t = t_{j+1}$, and $\gamma(\tau)$ is a time-aware weighting function that is based on the delay $\tau$ between adjacent edges. The concept of (4) is that a walk is more likely if edges along the walk appeared close to each other in time. We use the exponential time weight
$$\gamma(\tau) = \exp(-c\,\tau). \tag{5}$$
The notation is summarized in Table 1.
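The walk weight can be computed directly from the edge timestamps. The sketch below assumes the weight is β raised to the walk length times the product of γ over consecutive delays, with γ(τ) = exp(−cτ) and the observation time t acting as the final timestamp; the function name and default parameter values are illustrative.

```python
import math

def walk_weight(edge_times, t, beta=0.9, c=0.5):
    """Weight of a temporal walk whose edges appeared at edge_times, observed at time t."""
    times = list(edge_times) + [t]
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        raise ValueError("timestamps must be non-decreasing")
    weight = beta ** len(edge_times)          # length decay
    for earlier, later in zip(times, times[1:]):
        weight *= math.exp(-c * (later - earlier))  # delay between adjacent edges
    return weight
```

With the exponential weight the product telescopes, so the result equals β^j · exp(−c(t − t_1)): only the walk length and the age of its first edge matter.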

Temporal walk sampling from edge stream
Given a node v, a naive idea would be to compute the walk weight $\sum \{p(z, t) : z \text{ is a temporal walk from } w \text{ to } v\}$ for all other nodes w and set the embedding of w close to that of v proportional to the walk weight. The problem with this approach is that it requires a time consuming walk enumeration procedure at each time instance, and has no ability to update the similarity measure by focusing only on the new edges as they arrive. Given the new edge uv that arrives at time t, we would like to only consider walks from any w that reach v by the new edge uv. Towards this end, we propose a sampling update procedure for temporal walks as follows. We select a start node w of a random temporal walk z ending in u with probability proportional to p(z, t) in (4); see Fig. 3. We generate the walks by taking steps backwards from u. To make sure the walks are temporal, we always use edges that appeared before the previous one. Among the possible edges entering the current node, we select proportional to the time-aware weighting function $\gamma$. For example, in Fig. 3, we select $t_5$ backwards from u, and then $t_3$ backwards from the next node. Finally, we also compute a stopping probability corresponding to the length decay $\beta$ so that we select no new edge from w in the example; the actual formula (10) is explained later.
The actual implementation is somewhat tricky in that we have to handle multi-sets of edges. A way to illustrate the implementation is to consider an edge uv that appears before another wu and then reappears, see Fig. 4. The second instance can form a temporal walk w, u, v, while the same walk is not temporal with the first instance of uv. However, the second instance of uv has a higher edge weight γ , hence we have to store the weight of the first instance as well to be able to correctly compute the weight of all temporal walks that reach node v.

The implementation of the StreamWalk algorithm
In Algorithm 1, we describe StreamWalk, our implementation of temporal walk sampling. Recall that the notation is summarized in Table 1. For every edge uv in the multi-set of edges arriving in the stream, we maintain the total weight of all walks ending at v at time t(uv):
$$p(v, t(uv)) = \sum_{z} p(z, t(uv)), \tag{6}$$
where we sum over all temporal walks z ending in v using edges arriving no later than t(uv). The actual computation in procedure UPDATEWALKS accumulates the weight of the walks seen in Fig. 5. There is a new single edge temporal walk uv with weight $\beta$. Furthermore, we can continue each temporal walk z that ended in u before t(uv) with uv.
The total weight of these walks is $\beta \cdot p(u, t_u) \cdot \exp(-c\,(t(uv) - t_u))$, where $t_u$ is the most recent timestamp for which $p(u, t_u)$ is known. In other words, $t_u$ denotes the last time an edge entering u arrived in the edge stream. The exponential term accounts for the time decay of temporal walk weights since the arrival of this last edge entering u. Finally, we add all the walks that terminated at v before, with exponential time decay. The final formula becomes
$$p(v, t(uv)) = \beta \cdot \bigl(p(u, t_u) \cdot \exp(-c\,(t(uv) - t_u)) + 1\bigr) + p(v, t_v) \cdot \exp(-c\,(t(uv) - t_v)), \tag{7}$$
where $t_v$ is the most recent timestamp for which $p(v, t_v)$ is known. The update rule is illustrated in the last step in Figs. 4 and 5.
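The incremental update can be sketched in a few lines. The class below is an illustration rather than the authors' implementation: it assumes the update p(v, now) = β · (p(u, t_u) · exp(−c(now − t_u)) + 1) + p(v, t_v) · exp(−c(now − t_v)), stores only the last known value and timestamp per node, and uses hypothetical names (`WalkWeights`, `decayed`, `update`).

```python
import math

class WalkWeights:
    """Tracks p(v, t): the decayed total weight of temporal walks ending in v."""

    def __init__(self, beta=0.9, c=0.5):
        self.beta, self.c = beta, c
        self.weight = {}     # last known p(v, t_v)
        self.last_time = {}  # t_v for each node

    def decayed(self, node, now):
        """p(node, now), obtained by decaying the last stored value."""
        if node not in self.last_time:
            return 0.0
        return self.weight[node] * math.exp(-self.c * (now - self.last_time[node]))

    def update(self, u, v, now):
        """Process edge (u, v) arriving at time `now` and return p(v, now)."""
        p_u = self.decayed(u, now)  # walks that ended in u and continue via uv
        p_v = self.decayed(v, now)  # walks that already ended in v
        self.weight[v] = self.beta * (p_u + 1.0) + p_v
        self.last_time[v] = now
        return self.weight[v]
```

For the three-edge example of Fig. 4 (uv, then wu, then uv again), the value returned after the third edge equals the sum of the weights of the three temporal walks ending in v, which can be verified by enumerating them by hand.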
For each edge uv in the stream, we finally update the embedding of v by sampling a fixed number of temporal walks ending in u; we do this by calling procedure SAMPLEWALKS k times as described at the end of this section. Given the start node w of a walk in the sample, we optimize for the similarity of the embedding pair $(q_v, q_w)$ with stochastic gradient descent. As the loss function, we use either MSE or cross-entropy as in Eqs. (1) and (2). In the case of MSE, for each w we apply online negative sampling (Pálovics et al. 2014) by selecting pairs vw proportional to the popularity of w in the edge stream up to the current timestamp. We refer to (Kaji and Kobayashi 2017) for online incremental updates for cross-entropy based loss.
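A single stochastic gradient step for the squared-error variant might look as follows. This is a sketch under assumptions: an inner-product decoder, a scalar target similarity, and a hypothetical learning rate `lr`; negative sampling is omitted and only the positive pair update is shown.

```python
def sgd_step(q_v, q_w, target, lr=0.05):
    """One gradient step on (q_v . q_w - target)^2, updating both vectors in place.

    Returns the loss measured before the step.
    """
    err = sum(a * b for a, b in zip(q_v, q_w)) - target
    for i in range(len(q_v)):
        # Both gradients are computed from the old values before either is changed.
        g_v, g_w = 2 * err * q_w[i], 2 * err * q_v[i]
        q_v[i] -= lr * g_v
        q_w[i] -= lr * g_w
    return err ** 2
```

Calling `sgd_step` once per sampled start node w, immediately after each edge arrives, gives the online behaviour described above: the model is never retrained from scratch.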

Fig. 4 Computation of p(v, t) for the arrival times $t_1 < t_2 < t_3$ of the three edges $e_1$, $e_2$, and $e_3$. The bottom right cell illustrates the update formula (7)
Fig. 5 Whenever a new edge uv appears, a new walk starts from u (red), and each temporal walk ($z_1$, $z_2$, $z_3$) that ended in u up to time $t_u$ continues via uv (blue). We get p(v, t(uv)) by summing up the contribution of the previous two types of walks (red and blue) with the decayed weight of walks that have already reached node v (purple) before time t(uv)
Since we train by sampling k walks per edge, time complexity is affected by the cost of sampling temporal walks. To reduce storage, we can work over a sliding window of the stream and periodically remove the oldest edges; these edges will already have a very small $\gamma$ value.
Finally, we describe the algorithm to sample temporal walks as implemented in Procedure SAMPLEWALKS of Algorithm 1. Our goal is to sample proportional to $p(y, \tau)$ at a given time $\tau$. We define a random walk backwards from y. We select a backward edge with probability proportional to the weight of walks ending with that edge, which we define as
$$p(xy) = \sum_{z} p(z, t(xy)), \tag{8}$$
where z are temporal walks ending with the given instance of the edge xy that appeared at time t(xy). Recall that the edges are taken from a multi-set. The value of p(xy) can be calculated as follows. From the total temporal walk weight ending in y at time t(xy), we have to subtract the total weight of all walks ending in y before t(xy); the difference contains the weight of only those paths that use the edge instance xy of timestamp t(xy):
$$p(xy) = p(y, t(xy)) - p(y, \bar{t}) \cdot \exp(-c\,(t(xy) - \bar{t})), \tag{9}$$
where $\bar{t} < t(xy)$ is the timestamp of the last edge in the stream entering y before t(xy). The exponential term corresponds to the time decay of the walk weight since time $\bar{t}$.
We also define the termination for the walk, which is based on the contribution of the single node y as a zero-edge walk relative to all other walks that end at y. At any time of observation $\tau$, the weight of the zero-edge walk is 1, and the total weight of the remaining walks is p(y, t) for the last recorded time $t \le \tau$, decayed proportional to the elapsed time, $\tau - t$. Hence with the probability below, we take no further steps but stop the walk:
$$\frac{1}{1 + p(y, t) \cdot \exp(-c\,(\tau - t))}. \tag{10}$$
The steps of Procedure SAMPLEWALKS are summarized as follows.
1 We start the random walk from y ← u and set τ = now.
2 With probability such as in Eq. 10, we stop the walk and return the current node y.
3 Optionally, we can also terminate the walk if its length reaches a predefined limit.
4 Else, we select an edge xy with t(xy) < τ with probability proportional to the time-decayed total weight of walks ending with xy, which is p(xy) · exp(−c(τ − t(xy))) by definition.
5 We repeat from step 2 by setting y ← x and τ ← t(xy).

Algorithm 1 StreamWalk.
procedure UPDATEWALKS(u, v)    ▷ Update the weight for all walks ending at v
    t_u, t_v ← last timestamps such that p(u, t_u) and p(v, t_v) are known
    p(v, now) ← β · (p(u, t_u) · exp(−c(now − t_u)) + 1) + p(v, t_v) · exp(−c(now − t_v))
end procedure
procedure SAMPLEWALKS(y, τ)    ▷ Recursively sample a temporal walk ending at y
    t ← most recent timestamp with t ≤ τ such that p(y, t) is known
    p(y, τ) ← p(y, t) · exp(−c(τ − t))
    With probability 1/(1 + p(y, τ)) do return y
    else
        for all xy multi-edges with t(xy) < τ do
            Select x with probability p(xy) · exp(−c(τ − t(xy)))/p(y, τ)
        end for
        return SAMPLEWALKS(x, t(xy))
end procedure
procedure STREAMWALK(u, v)    ▷ Update embedding for v
    call UPDATEWALKS(u, v)
    repeat k times
        w ← SAMPLEWALKS(u, now)
        Optimize the representations q_w and q_v by Eqs. (1) or (2)
end procedure
As the final implementation details, we can sample by selecting a random value between zero and p(y, τ ) and binary search in the multi-set of xy edges ordered by t(xy). For a given edge xy, we compute p(xy) by Eq. 9 and continue the binary search based on the time-decayed value p(xy) · exp(−c(τ − t(xy))). Lastly, it can happen that sampling intends to select a very old edge that was already deleted from the sliding window. This happens when binary search does not terminate at the oldest t still kept in the records. In this case, we can repeat the sampling with a new random value.
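The backward sampling procedure can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes strictly increasing timestamps in the stream, keeps every edge instance (no sliding window), and replaces the binary search by a linear scan for readability. The class and method names are hypothetical.

```python
import math
import random
from collections import defaultdict

class StreamWalkSampler:
    def __init__(self, beta=0.9, c=0.5, seed=0):
        self.beta, self.c = beta, c
        self.rng = random.Random(seed)
        # in_edges[y] holds (t, x, p(y, t)) for every edge instance xy, in arrival order
        self.in_edges = defaultdict(list)

    def _p(self, y, tau):
        """p(y, tau), using only edges that entered y strictly before tau."""
        for t, _, p in reversed(self.in_edges[y]):
            if t < tau:
                return p * math.exp(-self.c * (tau - t))
        return 0.0

    def add_edge(self, u, v, t):
        """Record edge uv at time t and store the updated p(v, t)."""
        p_v = self.beta * (self._p(u, t) + 1.0) + self._p(v, t)
        self.in_edges[v].append((t, u, p_v))

    def sample_start(self, y, tau, max_len=20):
        """Walk backwards from y; return the start node of the sampled temporal walk."""
        for _ in range(max_len):
            total = self._p(y, tau)
            if self.rng.random() < 1.0 / (1.0 + total):
                return y  # stop: the zero-edge walk was selected
            r = self.rng.random() * total
            prev_t, prev_p, chosen = None, 0.0, None
            for t, x, p in self.in_edges[y]:
                if t >= tau:
                    break
                # weight of walks ending with this edge instance
                carried = 0.0 if prev_t is None else prev_p * math.exp(-self.c * (t - prev_t))
                w = (p - carried) * math.exp(-self.c * (tau - t))
                chosen = (x, t)
                r -= w
                if r <= 0:
                    break
                prev_t, prev_p = t, p
            y, tau = chosen
        return y
```

The `max_len` argument corresponds to the optional length limit; replacing the inner loop with a binary search over prefix weights would give the logarithmic sampling cost described above.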

Online learning of second order node similarity
Our next online algorithm optimizes the embedding to match the neighborhood similarity of the nodes, which we call second order proximity by following (Tang et al. 2015). Our goal is to optimize for (1) online, by considering sim(u, x) as a time-aware Jaccard similarity of the neighborhood of u and x, as illustrated in Fig. 6. We consider the neighbors y of u as a multi-set N(u, t) in which we use the decayed weight of edge uy as the weight of y:
$$w(y) = \exp(-c\,(t - t(uy))), \tag{11}$$
where t(uy) is the time the corresponding instance of edge uy appeared in the stream. Whenever we add a new edge to u, we discard elements $y \in N(u, t)$ with probability $1 - w(y)$. This way we emphasize the importance of new edges and also limit the size of N(u, t) by discarding old edges with low weight that have little effect on similarity values.
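The probabilistic pruning of old neighbors can be sketched as below. The sketch assumes the weight w(y) = exp(−c(t − t(uy))) and represents the multi-set N(u, t) as a list of (node, edge_time) pairs; both the representation and the function name are illustrative.

```python
import math
import random

def prune_neighbors(neighbors, now, c=0.5, rng=random):
    """Keep each (node, edge_time) pair with probability w = exp(-c * (now - edge_time))."""
    kept = []
    for node, edge_time in neighbors:
        weight = math.exp(-c * (now - edge_time))
        if rng.random() < weight:  # discard with probability 1 - weight
            kept.append((node, edge_time))
    return kept
```

Because pruning runs each time a new edge is added, an old edge has to survive repeated draws, so it becomes increasingly unlikely to remain in the neighborhood.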
In order to design a streaming algorithm to compute second order similarity, we face the same problems as in the StreamWalk algorithm: we want to focus on the increase of similarity when we add a new edge uv, and we want to avoid the costly full computation of similarities of u with all neighbors x of v. Note that the similarity of x and u depends on their neighborhoods, which means that all nodes of distance two from v should be enumerated for the full computation. In the next subsection, we describe a randomized approximation method for neighborhood similarities based on (Fogaras and Rácz 2005), which will be used in our final algorithm.

Approximation by fingerprinting
Our algorithm relies on MinHash fingerprinting (Broder et al. 2000) to approximate the Jaccard similarity. The notations are summarized in Table 2. Let $\pi_i$, for $i = 1, \ldots, k$, be k independent random permutations over the nodes. We define the k fingerprints of a set A as
$$h_i(A) = \arg\min_{y \in A} \pi_i(y), \quad i = 1, \ldots, k. \tag{12}$$
For short, we write
$$h_i(u) = h_i(N(u, t)) \tag{13}$$
for the i-th fingerprint of u at time t.
We maintain k fingerprints defined in (12) for the neighborhood of each node where the weights of the elements are defined by (11). We approximate the time-aware Jaccard similarity of any node pair with the fraction of common fingerprint values:
$$\mathrm{sim}(u, x, t) \approx \frac{1}{k} \sum_{i=1}^{k} I\bigl(h_i(u) = h_i(x)\bigr), \tag{14}$$
where $I(\cdot)$ is 1 if the condition holds and 0 otherwise.
We illustrate the fingerprinting idea in Fig. 7 for k = 2. The two fingerprints of u, $h_1(u)$ and $h_2(u)$, are defined based on two permutations $\pi_1$ and $\pi_2$ of the entire vertex set. The permutations are fixed, but the fingerprints change in time as new edges arrive and past edges become too old and get removed from N(u, t).
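The fingerprint estimate can be checked on a small static example. The sketch below ignores edge weights and time decay and only illustrates the core MinHash idea: the fraction of matching fingerprints approximates the Jaccard similarity of two neighbor sets. Function names and the use of shuffled rankings as permutations are illustrative choices.

```python
import random

def make_permutations(nodes, k, seed=0):
    """k random rankings of the node set; ranks[i][node] plays the role of pi_i(node)."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(k):
        order = list(nodes)
        rng.shuffle(order)
        ranks.append({node: pos for pos, node in enumerate(order)})
    return ranks

def fingerprints(neighbors, ranks):
    """h_i = the neighbor with minimum rank under permutation i."""
    return [min(neighbors, key=rank.__getitem__) for rank in ranks]

def estimated_similarity(fp_a, fp_b):
    """Fraction of positions where two fingerprint vectors agree."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)
```

For two sets sharing 30 of 90 distinct elements, the estimate concentrates around 1/3 as k grows; the standard error shrinks proportionally to 1/sqrt(k).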
Next, we show how the similarity of u and a neighbor x of v can be approximated in the example of Fig. 7. Assume that $h_1(x) = v$ and $h_2(x) = v_1$. By using formula (14), before edge uv arrives, the similarity approximation is $\mathrm{sim}(u, x, t_3) \approx (0 + 1)/2$, as $h_2(x) = h_2(u) = v_1$ at time $t_3$. When edge uv arrives, the similarity will on the one hand increase, since $h_1(u)$ gets assigned v. On the other hand, the similarity can decrease as edges become too old. For example, if we drop edge $uv_1$, the equation $h_2(x) = h_2(u) = v_1$ will no longer hold. However, since we want to avoid the cost of updating $h_i(x)$ for all i and all neighbors x of v, we heuristically only consider the increase of similarity, which can be caused by adding v as a new fingerprint of u.
Fig. 7 Illustration of how the fingerprints of node u change when adding the new edge uv. Neighbors of u are ordered in time as $t_1 < t_2 < t_3 < t$. Two fixed random permutations $\pi_1$ and $\pi_2$ define the fingerprints $h_1(u)$ and $h_2(u)$. In $\pi_1$ (red), v has minimum value, hence the previous $h_1(u)$ will be reassigned to v. In $\pi_2$ (purple), the minimum is the oldest node $v_1$, which becomes too old and gets removed from N(u, t). The correct value for $h_2(u)$ would be $v_2$. Instead we heuristically set $h_2(u) = v$ after the removal of $v_1$

Algorithm 2 Online learning second order similarity
procedure UPDATEFINGERPRINTS(u, v)
    for i = 1, …, k do
        if h_i(u) is too old or π_i(v) < π_i(h_i(u)) then h_i(u) ← v
    end for
end procedure
procedure GETSIMILARITYDELTA(u, v, x)
    Δ ← 0
    for i = 1, …, k do
        if h_i(u) = h_i(x) = v then Δ ← Δ + 1
    end for
    return Δ
end procedure
On arrival of a new edge uv:
    call UPDATEFINGERPRINTS(u, v)
    for all in-neighbors x of v do
        Δ ← GETSIMILARITYDELTA(u, v, x)
        Optimize the representations q_u and q_x repeated Δ times, by using Eqs. (1) or (2)
    end for
    Repeat with v and u swapped and edge directions reversed

As a final heuristic, in our implementation we always replace fingerprints corresponding to pruned neighbors by v, since obtaining the $\pi_i$ values of the entire neighborhood is computationally costly. In the example of Fig. 7, we drop edge $uv_1$ as $t_1 \ll t$. The correct new value of $h_2(u)$ would be the next oldest vertex $v_2$; however, this can only be calculated by enumerating all neighbors of u. Instead, in our implementation we heuristically assign $h_2(u) \leftarrow v$.

Algorithm for online learning second order similarity
Our method is described in Algorithm 2 by using the notations in Table 2. Our goal is to approximate the change of similarity between u and the in-neighbors x of v, and modify the embedding vectors whenever a certain x gets more similar to u after adding the new edge uv. Note that x becomes more similar if the edge xv also appeared recently; in terms of fingerprints, this means that for some fingerprint index i, both x and u have v as fingerprint node. We perform the steps below to update the fingerprints of u and check for v as fingerprint in the in-neighbors x of v:
1 In Procedure UPDATEFINGERPRINTS, fingerprint $h_i(u)$ can take the new value v for the new edge uv if it is too old or if $\pi_i(v)$ becomes the new MinHash value for permutation i. In the former case, we can either heuristically replace $h_i(u)$ with the new neighbor v or compute the true MinHash value $\arg\min\{\pi_i(y) : y \in N(u)\}$.
2 Finally, for each in-neighbor $x \in N(v)$, we compute the number Δ of fingerprints that match those of u and have value v in Procedure GETSIMILARITYDELTA, and optimize the representations $q_u$ and $q_x$ Δ times by using Eqs. (1) or (2).
Symmetrically, we also check for the similarity increase of v with the out-neighbors of u by performing the same steps, replacing u and v on the reverse direction graph.
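The two fingerprint procedures can be sketched compactly. The code below is an illustration built from the prose description only: fingerprints are plain lists, `ranks[i][node]` stands for π_i(node), and `stale` is a hypothetical set of fingerprint indices whose neighbor was pruned as too old.

```python
def update_fingerprints(fp_u, v, ranks, stale=()):
    """Set h_i(u) to v when slot i is empty or stale, or when v ranks lower under pi_i."""
    for i, rank in enumerate(ranks):
        current = fp_u[i]
        if current is None or i in stale or rank[v] < rank[current]:
            fp_u[i] = v
    return fp_u

def similarity_delta(fp_u, fp_x, v):
    """Number of fingerprint slots in which both u and x hold v."""
    return sum(1 for a, b in zip(fp_u, fp_x) if a == v and b == v)
```

The returned count is then used as the number of gradient steps applied to the pair (u, x), so node pairs that gain more common fingerprints are pulled together more strongly.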

Similarity search experiments
In this section, we describe our main evaluation, in which we assess how well the closeness of two nodes in the embedding reflects their similarity against an external ground truth. Towards this end, we first describe a network enriched with time dependent external similarity ground truth information. Then, at a time instance, we compute the list of nodes closest to selected ones in the embedding, and compare these lists against the similarity ground truth. We analyze node embedding methods for similarity search over the Twitter tennis tournament collections of (Béres et al. 2018). For the quantitative analysis, we use the annotation of the nodes for the accounts of the tennis players that participate in a game on a given day. In this sense, we expect that the players of the same day are more similar than other players and non-player accounts, as we will describe in "Evaluation metrics" section. We compare the performance of StreamWalk and online second order similarity with online and static baseline methods, which we will describe in "Baseline models" section.

Tennis tournament Twitter collection data
In (Béres et al. 2018), we compiled two separate tweet collections: RG17 for Roland-Garros, the French Open Tennis Tournament, and UO17 for US Open, the United States Open Tennis Championships, which we use in our first experiment. We use the mention graphs extracted from the last 15 and 14 days of RG17 and UO17, respectively. Based on the approximate time of the games, we consider a Twitter account n active on the given day, if it belongs to a tennis player who participated in a completed, canceled, or resumed game.

Evaluation metrics
We evaluate similarity search by a supervised experiment in which for each active account we consider the other similar active accounts on the given day. For each embedding algorithm, we generate 128-dimensional node representations every six hours (6:00, 12:00, 18:00, 24:00). For online methods, we perform continuous updates over the edge stream. For the static methods, we build the corresponding graph snapshots.
We use NDCG (Al-Maskari et al. 2007) to evaluate how similar the other active accounts are to a selected one. NDCG is a measure for ranked lists that assigns a higher score if active accounts appear at higher ranks in the similarity list. In our experiments, we compute the average of the NDCG@100 for the active accounts as query nodes to measure the performance of a single model in any given snapshot.
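As an illustration of the metric, the following Python sketch computes NDCG@k for one query node. It assumes binary relevance (an account is either active on the given day or not) and the common log2 rank discount; the text above does not spell out the gain function, so both choices are assumptions, as is the function name ndcg_at_k. The ranked list is expected to exclude the query node itself.

```python
import math


def ndcg_at_k(ranked, relevant, k=100):
    """NDCG@k with binary relevance (illustrative sketch).

    ranked:   nodes ordered by decreasing similarity to the query node
    relevant: set of active accounts for the given day (query excluded)
    """
    # Discounted gain of the relevant nodes that appear in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, node in enumerate(ranked[:k], start=1)
        if node in relevant
    )
    # Ideal ordering: all relevant nodes occupy the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this value over all active accounts as query nodes gives the score of one model in one snapshot.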

Baseline models
We compare StreamWalk and online second order similarity to online (or time-aware) and static (or batch) embedding methods. Online models are updated after the arrival of each edge. By contrast, static representations are only updated once every six hours when the graph snapshot ends. At hour t, a static model is computed on the graph constructed from edges arriving in time window [t − T, t] from the edge stream. For each batch baseline, we experimentally select the best value of T.
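The snapshot construction for the static baselines can be sketched as follows. The function name snapshot_edges and the representation of the stream as a time-sorted list of (u, v, timestamp) triples are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right


def snapshot_edges(edges, t, T):
    """Return the edges with timestamps in the closed window [t - T, t].

    edges: list of (u, v, timestamp) triples, sorted by timestamp.
    """
    times = [ts for _, _, ts in edges]
    lo = bisect_left(times, t - T)   # first edge with timestamp >= t - T
    hi = bisect_right(times, t)      # one past the last edge with timestamp <= t
    return edges[lo:hi]
```

A batch model would then be trained from scratch on the returned edge list at each six-hour boundary, while an online model would instead consume every edge as it arrives.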
We consider four static centrality measures as baseline:
• DeepWalk (
• Decayed indegree, defined for node u at time t as where E(t) is the multi-set of edges that occurred up to time t with edge activation time t_zu.
We use the 128-dimensional representations of StreamWalk, second order similarity, DeepWalk, Node2Vec, and LINE to measure node similarity over time. For the two degree methods, we rank by degree without reference to the query node in the NDCG@100 formula.

Results
In our experiments, we measure how the similarity of node representations evolves over time by a supervised evaluation in which the active nodes should be similar to each other. We show two different ways to describe the performance of a single model:
1. For each day, we present the mean NDCG@100 of the snapshots evaluated at 6:00, 12:00, 18:00, and 24:00.
2. As a single global value (NDCG@100), we take the average of NDCG@100(u) for each daily player u in every snapshot.
For a given parametrization of every embedding-based method, we always show the average performance of ten independent instances. During our experiments, we found that the following parameters had a great impact on the quality of online node embeddings, see Table 3:
• W1 and W2: In Word2Vec, we have the option to optimize node representations for the input (W1) or the output (W2) matrices (Press and Wolf 2016). It is application dependent whether W1 or W2 yields the better representation. For SW, we achieved the best results with W2.
• Initialization: We experimented with Xavier (Glorot and Bengio 2010) and uniform random initialization of W1 and W2.
• Mirror: In our algorithms, the input to Word2Vec consists of node pairs. Given a training instance (x, y), we mirror if we feed both (x, y) and (y, x), not just (x, y).
• Decay: We heuristically map the representations of nodes with no recent activity to the null vector.
• Negative sampling rate and past positive samples used: Key parameters of Word2Vec analyzed separately in Figs. 8-9. Past positive samples are edges that appeared a longer time ago; using such edges for negative training helps to forget the past.
We also combine the output of StreamWalk and second order similarity by using the weighted average of the corresponding inner products as similarity. This method, denoted SW+SO, outperforms both SW and SO, as seen in Fig. 12. The optimal weight of SO in the combination is 0.3 for both RG17 and UO17.
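The SW+SO combination can be sketched in a few lines of Python. The function name combined_similarity and the dictionary layout of the embeddings are hypothetical, and the sketch assumes that the raw inner products of the two models are averaged without any normalization, which the description above does not specify.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combined_similarity(emb_sw, emb_so, query, weight_so=0.3):
    """Weighted average of SW and SO inner products for one query node.

    emb_sw, emb_so: dicts mapping node -> embedding vector (list of floats)
    Returns a dict mapping every other node to its combined similarity score.
    """
    scores = {}
    for node in emb_sw:
        if node == query or node not in emb_so:
            continue
        s_sw = dot(emb_sw[query], emb_sw[node])
        s_so = dot(emb_so[query], emb_so[node])
        scores[node] = (1.0 - weight_so) * s_sw + weight_so * s_so
    return scores
```

Sorting the returned scores in decreasing order yields the similarity list that is then evaluated with NDCG@100.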
In Table 4 we present the best global mean performance for each model. Fig. 13 shows the daily mean performance of the best models.
For illustration, in Table 5 we present the 20 accounts most similar to that of Rafael Nadal for different node embeddings on 2017-May-31 18:00. Since Rafael Nadal played on this day, the active accounts (yellow) belong to tennis players who participated in a game on this day. The combined model SW+SO has the highest number of active player accounts. Furthermore, accounts present in both SW and SO columns (e.g. BMATTEK, DjokerNole, GrigorDimitrov, etc.) typically achieve a higher position with SW+SO than with SW. It is interesting to see that SW and SO find different active accounts, which explains why the combination SW+SO achieves superior performance. While static LINE and Node2Vec have fewer relevant hits than our online methods, most of the irrelevant accounts still belong to tennis players (e.g. andy_murray, stanwawrinka, etc.). The main difference is that the daily active players are found better by the online than by the static methods.

Online link prediction
Next, we address the online, time-aware variant of the link prediction problem, in which we give a prediction for the next edge as we process the stream edge by edge. Our goal is to predict a new link at a given time based on all events that appeared before, including the arrival of the most recent edges. Compared to the predictions given at time t, at time t + 1 we can potentially reconfigure our model based on the edges that appeared at time t and give a very different prediction for the next set of links. By contrast, a traditional model based on graph snapshots will output the exact same ranked list of links between two snapshots. While our modeling technique provides much stronger time awareness, it poses a challenge for evaluation, since we cannot compare just a single prediction against a larger set of edges, but a large set of potentially very different predictions ordered in time.
To evaluate online link prediction methods, we use the prequential evaluation framework (Dawid 1984). As explained in Fig. 14, before a new edge (u, v, t) arrives in the graph stream, we first attempt to predict this edge, then reveal the edge and update the model using the new edge. In this way, we can incorporate information on the most recent edges in our model and evaluate potentially completely different predictions, coming from modified models, at every new time tick.
Fig. 11 The effect of sampled walks (top) and hash functions (bottom) on the global mean performance (NDCG@100) for SW and SO, respectively. These parameters control the number of sampled node pairs we feed to online Word2Vec at every edge arrival
For evaluation, we use a single-point variant of NDCG (Pálovics et al. 2014) in which there is always exactly one relevant item, the actual new edge, and the higher the rank of the relevant edge, the higher the score. The overall evaluation of the model is the average of the single-point DCG@20 values over all events in the graph stream. We can also assess performance trends by computing daily or weekly averages of the DCG. We note that we ignore reappearing edges and only evaluate the prediction for those edges that appear for the first time in the stream.
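The prequential loop with single-point DCG@20 can be sketched as follows. The names prequential_dcg and DegreePredictor are hypothetical, the log2 rank discount is an assumption, and the toy predictor simply ranks nodes by the indegree observed so far; it only serves to make the example self-contained.

```python
import math
from collections import Counter


class DegreePredictor:
    """Toy model: recommends the nodes with the highest indegree seen so far."""

    def __init__(self):
        self.indegree = Counter()

    def top_k(self, u, k):
        # Rank by decreasing indegree; ties are broken by node name for determinism.
        return sorted(self.indegree, key=lambda n: (-self.indegree[n], str(n)))[:k]

    def update(self, u, v, t):
        self.indegree[v] += 1


def prequential_dcg(stream, model, k=20):
    """Average single-point DCG@k over the first occurrences of edges in the stream."""
    seen = set()
    scores = []
    for u, v, t in stream:
        if (u, v) not in seen:
            # Predict first: the model has not seen this edge yet.
            ranked = model.top_k(u, k)
            if v in ranked:
                rank = ranked.index(v) + 1
                scores.append(1.0 / math.log2(rank + 1))
            else:
                scores.append(0.0)
            seen.add((u, v))
        # Then reveal the edge; reappearing edges still update the model.
        model.update(u, v, t)
    return sum(scores) / len(scores) if scores else 0.0
```

Any model exposing the same two methods could be plugged into the loop; daily or weekly trends follow by grouping the per-edge scores by timestamp instead of averaging them globally.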

Data sets
We experiment on three standard network data sets from KONECT (Kunegis 2013). We selected networks from the collection with timestamped edges evenly distributed in time. For example, we discarded networks that were crawled for several weeks, but a significant part of their edges appeared within one day due to anomalies in the crawl. We discarded self-loops and similar edges with the same timestamps from each data set and processed the links in temporal order.
Enron: The Enron email network consists of emails sent between employees of Enron. Nodes in the network are individual employees and edges are individual emails. The data has 308,708 edge events in 365 days between 27,972 nodes with 90,177 unique edges.
Fig. 14 The online link prediction problem. For each edge in the stream, first we query the model to predict the next interaction. Then we train the model on the observed edge. Whenever node u interacts with another node, the model may generate different, updated top-k predictions
Linux kernel: The communication network of the Linux kernel mailing list. The nodes are people, and each directed edge represents a reply from a user to another. The data has 487,355 edge events in 1380 days between 16,449 nodes with 88,855 unique edges.
Facebook: The nodes of this network are Facebook users, and each edge represents one post, linking the user writing a post to the user whose wall the post is written on. The data has 16,868 edge events in 658 days between 16,868 nodes with 61,582 unique edges.
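The preprocessing of the edge lists can be illustrated with the short sketch below. It interprets "similar edges with the same timestamps" as exact duplicates of a (source, target, timestamp) triple, which is an assumption, and the function name clean_stream is hypothetical.

```python
def clean_stream(edges):
    """Drop self-loops and duplicate (u, v, timestamp) triples, then order by time.

    edges: iterable of (u, v, timestamp) triples in arbitrary order.
    """
    seen = set()
    cleaned = []
    # sorted() is stable, so edges with equal timestamps keep their input order.
    for u, v, ts in sorted(edges, key=lambda e: e[2]):
        if u == v:
            continue  # self-loop
        if (u, v, ts) in seen:
            continue  # same edge repeated with the same timestamp
        seen.add((u, v, ts))
        cleaned.append((u, v, ts))
    return cleaned
```

Edges that reappear with a later timestamp are kept, since they are still used to update the models even though they are not evaluated.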

Baseline methods
We compare our methods to three batch baselines: Node2Vec, DeepWalk (DW), and Graph Factorization (GF). For these three models, we periodically retrained the model over all past data. We set the periodicity to one day on the Enron data set. We trained models with weekly batch updates on the Linux and Facebook data sets. Furthermore, we give two online baselines. Besides simply updating the degree of each node and using it as a predictor, we experimented with the online version of Graph Factorization. This corresponds to the version of StreamWalk where we do not take any samples but optimize only for the similarity of the node pairs along the edges in the stream.
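To make the online Graph Factorization baseline more concrete, the following sketch performs one stochastic gradient step per arriving edge with a logistic loss and negative sampling. It is a simplified illustration, not the implementation used in the experiments: it keeps a single vector per node instead of separate input and output matrices, and the class name OnlineEdgeEmbedding, the learning rate, and the uniform sampling of negatives are all assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class OnlineEdgeEmbedding:
    """Per-edge SGD on node vectors with negative sampling (illustrative sketch)."""

    def __init__(self, dim=8, lr=0.05, negatives=2, seed=0):
        self.dim = dim
        self.lr = lr
        self.negatives = negatives
        self.rng = random.Random(seed)
        self.vectors = {}  # node -> list of floats

    def _vector(self, node):
        if node not in self.vectors:
            self.vectors[node] = [self.rng.uniform(-0.5, 0.5) for _ in range(self.dim)]
        return self.vectors[node]

    def score(self, u, v):
        return dot(self._vector(u), self._vector(v))

    def _sgd_step(self, u, v, label):
        # Gradient step on the logistic loss of the inner product q_u . q_v.
        qu, qv = self._vector(u), self._vector(v)
        g = self.lr * (label - 1.0 / (1.0 + math.exp(-dot(qu, qv))))
        for i in range(self.dim):
            qu[i], qv[i] = qu[i] + g * qv[i], qv[i] + g * qu[i]

    def update(self, u, v):
        # Positive example: the edge that just arrived in the stream.
        self._sgd_step(u, v, 1.0)
        # Negative examples: uniformly sampled known nodes other than u and v.
        candidates = [n for n in self.vectors if n != u and n != v]
        for n in self.rng.sample(candidates, min(self.negatives, len(candidates))):
            self._sgd_step(u, n, 0.0)
```

Because every step pulls the embedding towards the newest edge while the negative steps push other pairs apart, older structure is gradually forgotten without ever revisiting past edges.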

Results
We analyze the results for online link prediction for the best parameter setting of each algorithm. The results are summarized in Table 6 and in Fig. 15. Our key observation is that the online learning methods, online graph factorization (GF), StreamWalk (SW), and SecondOrder (SO), show very strong performance compared to the batch methods. Note that batch methods such as GF read the list of edges several times to perform stochastic gradient descent, while online methods must process the edges in the order of arrival, without the possibility to access past edges again.
The surprisingly good performance of the online methods is due to the fact that they put emphasis on the more recent edges. For gradient descent with negative sampling, the embedding is always optimized towards the freshly arrived edges, while negative sampling has the effect of forgetting the past. The advantage of model freshness is strikingly strong for the link prediction experiments, where the prequential evaluation operates at the fine time granularity of individual edges. Online methods also outperform batch embeddings for the Twitter tennis data sets. Note that ground truth data is only available daily in our collections. A larger performance difference might be obtained with labels of finer time granularity, as seen for link prediction.
When comparing the methods that update the embeddings from an edge stream, online graph factorization (GF), StreamWalk (SW), and SecondOrder (SO), we observe that they perform very similarly and their relative order depends on the data set. Also note that GF is a special case of SW with walks of length one.

Conclusion
We introduced two online machine learning algorithms to extract temporal node representations from graph streams. The StreamWalk algorithm optimizes for the similarity of node pairs extracted along temporal walks from the data stream, whereas online second order similarity efficiently learns neighborhood similarity over graph streams by MinHash fingerprinting.
We measured the quality of these models in two tasks. In the RG17 and UO17 Twitter collections, we analyzed the similarity of node representations over time for both online and static node embedding algorithms. Our methods SW and SO significantly outperformed static Node2Vec, LINE, and simple degree related baselines. The combination of SW and SO achieved superior performance in the supervised evaluation task that we implemented using daily changing node relevance labels.
In a second experiment, we addressed the temporal link prediction task in three commonly used network data sets. We observed that online learning methods are superior to the snapshot-based batch algorithms. For some graphs, our walk-based embedding methods performed better than online matrix factorization.
Abbreviations
DCG: Discounted cumulative gain; DW: DeepWalk (Perozzi et al. 2014); GF: Graph factorization (Ahmed et al. 2013); NDCG: Normalized discounted cumulative gain; RG17: Our Twitter data set about Roland-Garros 2017, the French Open Tennis Tournament; SO: Online second order similarity, our model based on temporal node neighborhoods; SW: StreamWalk, our proposed model based on temporal walks; SW+SO: The combination of StreamWalk and online second order similarity; TREC: Text retrieval conference; UO17: Our Twitter data set about US Open 2017, the United States Open Tennis Tournament