Relative Hausdorff Distance for Network Analysis

Similarity measures are used extensively in machine learning and data science algorithms. The newly proposed graph Relative Hausdorff (RH) distance is a lightweight yet nuanced similarity measure for quantifying the closeness of two graphs. In this work we study the effectiveness of RH distance as a tool for detecting anomalies in time-evolving graph sequences. We apply RH to cyber data with given red team events, as well to synthetically generated sequences of graphs with planted attacks. In our experiments, the performance of RH distance is at times comparable, and sometimes superior, to graph edit distance in detecting anomalous phenomena. Our results suggest that in appropriate contexts, RH distance has advantages over more computationally intensive similarity measures.


Introduction
Similarity measures play a crucial role in many machine learning and data science algorithms such as image classification and segmentation, community detection, and recommender systems. A good deal of effort has gone into developing similarity measures for graphs, in particular, since they often provide a natural framework for representing unstructured data that accompanies many real-world applications. Some popular graph similarity measures currently used are graph edit distance [42], iterative vertex-neighborhood identification [9,33], and maximum common subgraph based distance [18]. However, as graph datasets grow larger and more complex, the need for tools that can both capture meaningful differences and scale well is becoming more critical. In this respect, a number of sophisticated yet costly graph similarity measures, such as those listed above, fall short.
The recently proposed graph Relative Hausdorff (RH) distance [45] is a promising measure for quantifying similarity between graphs via their degree distributions. Inspired by the Hausdorff metric from topology [25], RH distance was devised to capture degree distribution closeness at all scales, and hence is well-suited for comparing the heavy-tailed degree distributions frequently exhibited by real-world graphs. Furthermore, as recent work has shown [5], RH distance is extremely lightweight, with time complexity linear in the maximum degrees of the graphs being compared. However, as this metric is relatively new, it has not yet been extensively vetted. In particular, current research has not addressed its potential as an anomaly detection method for time-evolving graphs.
In this work, we conduct a statistical and experimental study of RH distance in the context of dynamic graphs. While RH distance may be applied to arbitrary pairs of networks, we focus our attention on sequences of time-evolving networks arising from cyber-security applications. We begin by first applying RH distance to cyber-security logs recently released by Los Alamos National Laboratory, and investigate the extent to which it detects identified "red team" events. Then we follow up by studying RH distance in the more general and controlled context of random dynamic graph models. Here we generate sequences of correlated Chung-Lu random graphs using a simplified cyber-security model proposed by Hagberg, Lemons, and Mishra [24], and test the extent to which RH distance detects several planted attack profiles. Throughout our analysis, we compare the performance of RH to that of more well-known graph similarity measures, such as edit distance and Kolmogorov-Smirnov distance of degree distributions. With this work, we better clarify the range of differences captured by RH, and also highlight its practical advantages and disadvantages over other methods.

Preliminaries
2.1. Graph similarity measures. Below we define the graph similarity measures we consider for anomaly detection in time-evolving graphs. We begin with the graph Relative Hausdorff distance, the primary focus of our study.  [45], the Relative Hausdorff (RH) distance between graphs is a numerical measure of closeness between their complementary cumulative degree histograms (ccdh). More precisely, the (discrete) ccdh of a graph G is defined as (N (k)) Despite being referenced as a "distance", RH distance does not satisfy the triangle inequality and hence is best viewed as a similarity measure rather than a bona-fide distance metric. However, to match existing literature we will still use the term "RH distance." As shown in [5], RH(F, G) is extremely lightweight, and can be computed with run time O(∆(F ) + ∆(G)). Lastly, we note that while RH distance values exceeding one have been considered indicative of a "large" dissimilarity between ccdhs, see [45], it was shown in [5] that the RH distance between graphs can be as large as O(m) when comparing graphs on m and n vertices with m ≥ n.

Other measures.
While RH distance will be the focus of the present work, we also consider several other graph similarity measures in order to provide relevant context for its performance. First, we consider another comparable lightweight, ccdh-based measure called Kolmogorov-Smirnov (KS) distance. KS distance is a widely-used statistical measure of similarity between distributions, and serves as the test statistic for the two-sample KS hypothesis test [51]. In what follows, we will not only compute KS distance directly between graph degree distributions, but also between distributions of graph similarity values, such as RH values. To avoid confusion, below we define both KS distance as well as the two-sample KS hypothesis test (for which KS distance is a test statistic) for general empirical distributions.
Definition 3 (KS distance and two-sample KS test). Let F ∈ R n and G ∈ R m be empirical cumulative distribution functions. The Kolmogorov-Smirnov distance is The null hypothesis that F and G are samples of two identical probability density functions is rejected at the α confidence level if For clarity, we emphasize that smaller values of KS distance indicate greater similarity between distributions, whereas smaller p-values permit one to reject the null hypothesis of identical underlying distributions at a higher confidence level, thereby presenting stronger evidence the empirical distributions were drawn from different underlying distributions. For the special case that F and G are the ccdhs of two graphs on n and m vertices, respectively, KS distance is given by n · F and G = 1 m · G . As argued in [37,46], KS distance between graph ccdhs can sometimes be large in graphs that are intuitively similar; furthermore, KS distance may also be insensitive to certain important differences between graph ccdhs, particularly in the tails of ccdhs (which correspond to high-degree vertices).
On the other end of the computational spectrum, we also consider graph edit distance (GED). Arguably one of the most well-known graph similarity measures, GED has been widely used throughout machine learning, particularly in computer vision and pattern recognition contexts [20]. An edit operation on a graph consists of either an insertion, deletion, or substitution of a single vertex or edge. 1 An edit path of length k between F and G is a sequence of edit operations P = (e 1 , . . . , e k ) that takes F to a graph that is isomorphic to G. Edit distance is the total weight of the minimum-cost edit path, i.e. Definition 4 (Graph edit distance). Let F, G be graphs. The graph edit distance between F and G, denoted GED(F, G) is given by where Υ(F, G) denotes the set of possible edit paths from F to G, and c(e i ) ≥ 0 denotes a cost-function measuring the weight of edit operation e i .
In what follows, we simply take c(e i ) = 1 for any edit operation, in which case GED(F, G) is the minimum number of edit operations needed to transform F to G.

Related
Literature. While in this work we focus on the graph Relative Hausdorff distance, we note that a wide variety of graph similarity measures have been utilized for anomaly detection in time-evolving graphs. In [28], the authors propose detecting anomalies in communication network traffic data by measuring cosine similarity between the principal eigenvectors of graph adjacency matrices. In [3], Akoglu and Faloutsos also take an eigenvector-based approach for measuring graph anomalousness. Matrix-analytic graph similarity measures have also been based on eigenvalue residuals [22], non-negative matrix factorization [48], and tensor decompositions [43]. Other popular approaches for graph-based anomaly detection are via distance metrics, such as those based on edit distance, maximum common subgraph distance, or mean vertex eccentricity [21], or take a community-detection approach towards identifying anomalies by tracking changes between clusters of well-connected vertices [2,50]. For a broader survey of graph-based anomaly detection techniques see [4,41,44] and the references contained therein. Lastly, we note that applications of graph similarity functions extend far beyond anomaly detection. Graph similarity functions are also ubiquitous in inexact graph matching and graph classification problems. For instance, graph edit distance is a key tool for error-tolerant pattern recognition and computer vision techniques [20]. While in this work we explore Relative Hausdorff distance through the lens of anomaly detection, we note its application as a graph similarity measure in other contexts such as these remains unexplored.
As our focus in this paper will be on cyber anomaly detection 2 in network flow data, we also mention some of the existing graph-based methods specifically for cyber anomaly detection. In this domain, some researchers focus on detecting anomalies edge-by-edge or target specific types of behavior, e.g., [19,39], while others look at the graph more globally or structurally and are agnostic to the type of anomalous behavior being detected, e.g., [12]. In [39] Noble and Adams describe a real-time unsupervised framework for detecting anomalies in network data. They consider "edge activity" as the sequence of flows on a single edge and compute correlations between event inter-arrival times and other edge data (e.g., byte count or protocol). Statistically significant changes in those correlations are flagged as anomalies. Groups of adjacent anomalies can be combined to form larger anomalies perhaps indicating coordinated behavior. The authors of [19] use PageRank to perform linkage analysis followed by clustering techniques to identify groups of IPs with similar behavior. These groups are then compared with known bot behavior to detect botnets within the network. In the category of more structural and behavior-agnostic algorithms [12] introduces multi-centrality graph PCA and multi-centrality graph dictionary learning which use structural properties of a graph, e.g., walk statistics and centrality measures, to learn normal structure and thus detect abnormal structure. This method is not tailored to the cyber use case, but the authors use network flow as one of their examples. Our work is similarly not targeted towards a specific cyber use case and is focused on detecting structural perturbations rather than clustering behavioral patterns.

Los Alamos National Laboratory (LANL) Cybersecurity Data
To begin our study of RH distance as an anomaly detection method for dynamic graphs, we will first consider a dataset recently released by LANL with known red team events from their internal corporate computer network [30,31]. The dataset represents 58 consecutive days of de-identified event data collected from four sources, namely: • Windows-based authentication events from both individual computers and centralized active directory domain controller servers, • Process start and stop events from individual Windows computers, • Domain Name Service (DNS) lookups collected by internal DNS servers, and • Network flow data collected at several key router locations. In total, the data set is approximately 87.4 gigabytes, spread across the four modalities, including 1,648,275,307 events coming from 12,425 users, 17,684 computers, and 62,974 processes. Ground truth for the red team events is given as a set of authentication events that are known red team compromise events. In this section, we will demonstrate that Relative Hausdorff distance is effectively able to identify anomalous behavior around the red team events in the LANL data.
3.1. Data Source. As stated above, LANL captured network evolution in four different modalities, namely authentication, process, network flow, and DNS events. In order to apply the RH distance, we first must convert these network event files into a time series of graphs. To do so, we consider 60 second moving windows that advance 20 seconds at a time. For each window we use the events in that window to construct a graph. For the 58 consecutive days, this yields a time sequence of 250,560 graphs for each modality. Further details for constructing each type of graph are given below.
• Authentication Graphs. The authentication data is a record of authentication events collected from individual Windows-based desktop computers, servers, and Active Directory servers. Each line of the data file reports a separate authentication event in the form time, sourceUser@domain, destUser@domain, source computer, dest computer, auth type, logon type, auth orientation, pass/fail. For a given window, we construct an unweighted graph with edges {sourceUser, destUser} for each user pair present in the logs within the window.
• Authentication Failure Graphs. These are constructed in the same manner as the Authentication Graphs, except we restrict the edge set to those corresponding to failed authentications only. • Process Graphs. The process data is a record of process start and stop events collected from individual Windows-based desktop computers and servers. Each line of the data file reports a separate process start/stop in the form time, user@domain, computer, process name, start/end. For a given window, we construct an unweighted graph with edges {computer, process name} for each computer-process pair present in the logs within the window.
• DNS Graphs. The DNS data is a record of DNS lookup events collected from the central DNS servers within the network. Each line of the data file reports a separate lookup event in the form time, source computer, computer resolved, representing a DNS lookup at the given time by the source computer for the resolved computer. For a given window, we construct an unweighted graph with edges {source computer, computer resolved} for each source-resolved computer pair present in the logs within the window.
• Flow Graphs. The flow data is a record of the network flow events collected from central routers within the network. Each line of the data file reports a separate network flow event in the form time, duration, source computer, source port, dest computer, dest port, protocol, packet count, byte count. For a given window, we construct an unweighted graph with edges {source computer, dest computer} for each source-destination computer pair that communicate during that time.

3.2.
Limitations. While working with real-world data often presents challenges, testing graph-based anomaly detection methods on the LANL dataset is particularly difficult for several reasons. First and foremost, the data only provides red team authentication attempt time stamps and does not specify the nature, extent or duration of the red team events. This makes it difficult to segregate benign from anomalous time periods. Additionally, without knowing the specific red team actions, it is difficult to determine which (if any) of the aforementioned modalities a red team signature may appear in. Finally, it is worth noting the data exhibited large periods of time in which no events occurred that did not correspond to regular lulls such as weekends and nighttime. In particular, the flow data has records from only the first 37 of the 58 days. To address some of these limitations, in Section 4 we extend our analyses to a generalized dynamic network model [24] proposed by LANL scientists Hagberg, Mishra, and Lemons. While no synthetic model is a perfect substitute for real data, this model's conception and design was directly informed by direct access to the LANL cyber data [29,32] and provides a framework under which we may draw more certain and rigorous conclusions regarding the behavior of RH distance. First, we present our analysis of the real LANL data.
3.3. Experiment and results. As a first-pass approach towards studying the sensitivity of RH distances to red team events in the LANL dataset, we test whether the distribution of pairwise RH distance values before a red team event differs significantly from the post red team event distribution. In this way, we assess whether there is statistical evidence to support that red team events demarcate "change-points" in  RH distance distribution. To that end, for each red team event at time r, we associate a time window w of length centered at r, which we denote w (r). Each such window can be naturally partitioned into a "before" period (i.e. the time interval (r − /2, r)) and "after" period, (r, r + /2). To avoid overlapping windows and ensure the "before" periods are in fact devoid of red team events, we restrict attentions to windows in which no red team event occurs in the time interval (r − /2, r). Put equivalently, we consider the set W = {w (r) : red team event occurs at r, no red team events occur in (r − /2, r)}.
We note that it is possible for the after period of a window in W to contain additional red team events. For each window in W , we compute the RH distances between pairs of graphs separated by δ seconds in the before period, as well as such pairs belonging to the after period. We then aggregate the RH distances over all before periods and all after periods. More precisely, if G 0 , G 1 , . . . denotes the time-ordered sequence of graphs for a particular mode in the LANL data, we compute the aggregate before and after distributions as respectively. Recalling that we processed the LANL graph sequence for each modality by generating graphs for windows shifted by 20 seconds, we may choose the parameter δ controlling the granularity of pairwise RH measurements to be as small as 20 seconds and and as large as /2 − 20 seconds. Finally, we assess whether these aggregated before and after RH distance distributions differ significantly by conducting a two-sample Kolmogorov-Smirnov test. Table 1 presents the resulting p-values for δ = 20, 40, 60, 120, 240 seconds, under window lengths = 30, 60, 120 minutes, for each LANL modality. The p-values in Table 1 suggest that whether the aggregated distribution of RH values before red team events differs significantly from the post red team events depends crucially on the cyber modality, window length and granularity parameter δ. In the case of a 30 minute window, almost none of the parameter settings for any modality result in statistical significance, while for a 2-hour window, a majority of parameter settings are significant at a level of 0.05. In this case, the before and after RH distance distributions over longer time windows surrounding red team events more frequently show significant differences, which is perhaps unsurprising. On the other hand, the changes in significance levels as the granularity parameter δ varies are more difficult to interpret. Even for a fixed window length and modality, the significance levels neither consistently increase nor decrease in δ.
One plausible hypothesis for this experiments sensitivity to δ is that RH distance values exhibit periodic behavior both within and across modalities, reflecting the natural circadian rhythms one might expect from temporal cyber data. If this were the case, the choice of δ may skew the RH values sampled when constructing the representative before and after distributions. To check whether such periodicity is indeed present in the RH distance measurements on LANL, we constructed heatmaps of RH distances between all pairs of graphs over given time windows. As this requires a quadratic number of comparisons, it is worth noting this analysis is crucially facilitated by the lightweight computational complexity of RH distance. We examined heatmaps not only for windows surrounding anomalies, but also for time windows away from red team events. Figure 2 (left column) presents sample heatmaps for the Authentication, Flow and Process modalities spanning a 2-hour time period. In an effort to select a representative window for short-term nominal RH behavior within each modality, this time period was selected so as to not include any red team events nor be preceded or followed by any red team events for 20 hours. It is also worth pointing out that the RH distance between pairs of flow graphs regularly exceeds one, indicating that the rough guide for detecting anomalous behavior given in [45] is inappropriate for the cyber-security context. We also transformed each heatmap of pairwise RH distance values into a similarity matrix by applying the Gaussian kernel with σ = 1, and performed normalized Laplacian spectral clustering 3 , as described by Ng, Jordan and Weiss [38]. Under the corresponding heatmap, Figure 2 (right column) plots the pairs of graphs belonging to common clusters, using a different color for each cluster (and white for different clusters).
A cursory examination of the heatmaps and their clustering suggests that the RH distance values for a given modality exhibit persistent and striking periodic patterns. Furthermore, in comparing the plots for Flow, Authentication, and Process, the differences in the periodic behavior of RH values are also apparent. While these periodicities are not entirely unexpected they are likely network-and data-dependent. Detecting and visualizing periodic behavior in network data is an active area of research, e.g., [23,27,40]. As a consequence of this experiment and examination of the heatmaps in Figure 2, it is clear that a single choice of granularity parameter δ is likely insufficient in establishing a representative distribution of RH values within a time window for any given modality. Accordingly, next we refine our experiment to better account for the inherent multi-scale and multi-modal nature of the LANL data.
One of the many difficulties with investigating the effectiveness of RH distance in detecting anomalous behavior associated with red team events in the LANL data sets is that the red team process is inherently multi-modal and multi-scale. That is, the red team events identified in the data are simply the first step of the red team intrusion process which could potentially affect all of the data modalities (process, DNS, flow, and authentication) and occur over multiple time scales. To attempt to deal with this issue, we craft an indicator for each time that considers the RH distance between graphs with multiple time differences and in multiple modalities. More concretely, let S be the set of potential data sources and let G = {G t,s } be the collection of observed graphs indexed by the time t and data source s ∈ S. For any fixed timestamp t and collection of differences D, we will define the profile vector at time t, v (t) , as the vector given by (RH(G t,s , G t−δ,s )) s∈S,δ∈D . Ideally, this profile could be used to aggregate the behavior across multiple modalities and multiple time scales and give a clearer picture of the overall state of system. However, there is a further complication with this approach in that the LANL data represents a system which has a naturally evolving behavior based on various temporal patterns of human activity (i.e. weekday vs. weeknight, circadian rhythms, etc.). To adjust for these temporal patterns, for every time t, source s ∈ S, and difference δ ∈ D, we define a baseline behavior random variable B t,s,δ which is the random variable which represents the "typical" behavior of RH(G t,s , G t−δ,s ). This baseline behavior can then be combined with the profile vector v (t) to generate a temporal profile vectorv (t) where for any (s, δ) ∈ S × D we havev s,δ + . We define the temporal score of the time t as the geometric mean of the entries ofv (t) 4 . In what follows, we will calculate the temporal scores of time periods before and after a red team authentication event and show that there is a statistically significant difference between the behaviors. In fact, we will show that this temporal scoring methodology is more sensitive than using raw RH scores evaluated in Figure 1 indicating that there are significant potential gains to be found by considering multi-modal and multi-scale indicators for anomalous behaviors.
Before applying our results to the LANL data sets, it remains to address how to estimate the distribution of the random variable B t,s,δ and how the term defines the temporal profile vector is chosen. For a fixed t we estimate the empirical distribution for B t,s,δ , we consider the RH distance between all pairs of graphs (G t * ,s , G t * −δ,s ) where t * ranges over all times that differ from t by a multiple of a week, plus or minus 10 minutes. In order to avoid biasing this empirical estimate we exclude times t * where there is a red team event in the interval [t * − δ, t * ] as well as those that are within 10 minutes of t. As we see in Figure 2 the typical variation of a RH distance changes significantly based on the modality of the observation, both in source and elapsed time between graphs. Thus, rather than fixing a particular value of , we choose as 3 The prescribed number of clusters was chosen to coincide with the first observed gap in Laplacian eigenvalues, as in [49]. 4 As a practical matter, for those entries (t, s, δ) where there is insufficient or no data to estimatev (t) s,δ , the entry is dropped from the vector and ignored in the calculation of the temporal score. one twentieth of the range of the empirical distribution for B t,s,δ . Finally, for each of the 712 red team times provided in the LANL data we calculate the temporal scores for each graph in the 30 minutes before and after the red team time and apply the two-sample Kolmogorov-Smirnov test, see Table 2. We further segregate this data by whether or not additional red team events occur during the 30 minutes prior to the red team time. It is clear from Table 2 that the temporal scores is far from a perfect indicator, as a non-negligible fraction of the changes associated with a red team event are not detected. Nonetheless, it is also apparent that for a relatively lightweight measure the RH distance exhibits reasonable effectiveness in distinguishing between nominal and anomalous behaviors. However, our conclusions must be somewhat tempered by the challenging p ≤ 0.10 p ≤ 0.05 p ≤ 0. 01 Total   intervals with no prior red team events  30  30  21  48  intervals with prior red team events  478  460  370  664   Table 2. Aggregate behavior of temporal scoring on a per event basis nature of real world data and the LANL data in particular. Specifically, the lack of clear demarkation between anomalous and non-anomalous behavior as well as the limited time-scope of the investigation are significant caveats to any conclusions we make about the effectiveness of RH distance. In following section, we attempt to address these caveats by analyzing a synthetic temporal graph model inspired by the LANL cyber data.

Simulated Evolving Networks
The study of temporal networks is concerned with the analysis and modeling of time-ordered sequences of graphs. In order to better understand temporal network dynamics, researchers have proposed a plethora of abstract models for their simulation (for a survey, see [26]). In the present work, we consider a temporal graph model that belongs to the broader class of Markovian Evolving Graphs (MEGs) [7]. Given a probability distribution over the set of all graphs on a fixed vertex set, MEGs have the defining property that the distribution at time t is completely determined by that at t − 1, thereby forming a sequence of random variables which satisfy the Markov property. Because of their generality and flexibility, MEGs have been popularly used to study information spreading processes, such as file sharing on peer-to-peer networks, social network memes, and disease spreading [16,17].
In [24], Hagberg, Lemons and Mishra proposed a new MEG model, the design of which was informed by their study of LANL centralized authentication system cyber data [29,32]. In particular, they observed that these sequences of graphs exhibit certain stable global properties, such as skewed degree distributions, while local dynamics such as individual vertex neighborhoods change rapidly. To capture these dynamics, they designed a temporal model that can be used to preserve certain random graph structure while affording tunable control over the rate of dynamics. We refer to their model as the HLM model. Ultimately, Hagberg et. al utilized the HLM model to study temporal reachability; that is, the expected time (number of evolutions or transitions) before a constant fraction of the vertices are reachable from an arbitrary vertex. We note that although the HLM model was developed to capture abstract dynamics exhibited by cyber data, the HLM model need not be limited to simulating cyber phenomena. Although, the experiment that follows is driven by cyber-security structures and data, the is no a priori reason that a similar experiment could not be applied across the variety of domains for which the evolving nature of the HLM model is appropriate, such as communication networks, social networks, and (on a much slower time scale) transportation networks. In the remainder of this section, we study the sensitivity of RH distance in detecting several planted attack profiles, utilizing the HLM model to simulate the natural time evolution of a generic cyber network graph topology. Before describing our experimental methodology, we first begin by defining and briefly discussing the HLM model.
The fact that G i+1 D = G i follows by induction and the observation that We note that there is a natural trivial generalization of the HLM model where the edge probability wuwv /ρ is replaced with arbitrary values in p uv ∈ [0, 1]. In this case, at each time step the network is distributed over graphs like G(P ), the generic independent edge graph model with parameter P . Similarly, the evolution parameter α can be generalized to a symmetric matrix A ∈ [0, 1] n×n . We note that several well studied models fall into this framework, including the stochastic block model, stochastic Kronecker graphs [34,36], random dot product graphs [52,53,54], and the inhomogeneous random graph model [10,47]. In order to maintain consistent notation, we will specify all of the experiments in this work in terms of this generalized HLM model even though most of generative matrices P come from the Chung-Lu model. Further, with the aim of having the minimum number of free-parameters we will only consider HLM evolutions where A uv = A xy for all u = v and x = y. We will further slightly abuse notation and refer to this common value as α.
Finally, we note that this generalized framework can be further expanded by allowing the parameter matrix P to depend on the time step t. In particular, we have It is worth mentioning that in this case G t+1 is not distributed like G(P (t+1) ) because of the possibility of edges being present from earlier timesteps. In fact, it is an easy exercise to show that the edges of G t are distributed according to (

Experimental Setup.
In our experimental setup, in keeping with the lightweight nature of the RH calculation, we focus on the detection of small anomalies in extremely sparse graphs, such as we observed in small time windows for the LANL data set and other proprietary network flow data. For the sparse graphs we consider two different fixed degree distributions. Both of these degree distributions are formed by choosing 5000 samples from some fixed probability distribution. For the first degree distribution, we estimate a degree density function using a smoothed median estimator from a selection of one minute graphs in the LANL network flow data set, see Figure 3a. The resulting degree density function results in a "power-law" like degree distribution with exponent approximately 3.5. Although the resulting degree density function is not truly a power-law distribution, we will abuse notation and refer to it as a "power-law" degree distribution. For a discussion of the difficulties and appropriateness of the power-law degree distribution for real data the interested reader is referred to the recent work [11]. The resulting distribution has 4,742 edges in expectation as well as maximum expected degree 961. The second distribution represents what we call a "bump power-law," that is, a power-law distribution coupled with an approximately binomially distributed "bump" at higher degrees. This can be thought of as a more hub-and-spoke style network where the degree of the spoke vertices are approximately power-law distributed while the degree of the hub vertices are approximately binomially distributed. For the bump power-law distribution, the degree probabilities were explicitly estimated from a collection of several thousand graphs generated from a proprietary enterprise boundary network flow data set (see Figure 3b). This resulting distribution has 6,067 edges in expectation as well as maximum expected degree 327. 5 We will also consider two different styles of anomalies involving between 10 and 50 edges. The first anomaly involves three randomly chosen vertices adding some number of edges to the rest of the network uniformly at random. We view this as behavior consistent with a probe or scan of the network structure. For the second anomaly a random collection of vertices are chosen and a random spanning tree is added 5 It is worth mentioning that both of these degree distributions violate the standard assumption for the Chung   . log-log Degree Distribution among those vertices. We view this as behavior consistent with lateral movement scenario where an attacker is exploring the network by moving from machine to machine. They may backtrack and try different routes (thus a tree rather than just a path) as needed.
For each of these 420 scenarios (two different degree distributions, two different anomaly types, five different anomaly sizes, and 21 different values of α) we produce 1,000 different pairs of graphs (G, G ) where G is a random instance of the Chung-Lu model with the chosen degree distribution and G is formed from G by performing one step of the HLM evolution with the chosen parameter α, and then adding a random instance of the chosen anomaly of the chosen size. In this way, the anomaly occurs concurrently with the natural evolution of the network, as might be typical of real-world data. For each of our 420 scenarios, and 1,000 pairs of graphs within the scenario, we compute the RH distance between G and G to get a distribution of RH distances for the anomalous transition.

Anomalous versus Nominal Relative Hausdorff Distance.
In this section we consider whether the anomalous transitions in the HLM model result in a different distribution of RH distances than a nominal transition. To this end, for each degree distribution and choice of α, we simulate 10,000 different HLM transitions to develop a baseline distribution of RH distances, see Figure 4. For each of the 420 anomalous scenarios we calculate the 2-sample Kolmogorov-Smirnov p-value [51] between the previously calculated anomalous distribution and this baseline distribution. For each of the 420 different anomaly scenarios the KS test significance value is less that 0.01, indicating that we can reject the null hypothesis that the distribution of RH distances for an anomalous HLM transition is the same as the distribution for non-anomalous transitions. In particular, this means that in a statistical sense the RH distance is able to pick up on anomalous evolution of the degree distribution, even when the anomaly only consists of 10 edges. 6 In the next subsection we will consider the effectiveness of the RH distance in detecting anomalous behavior directly, rather than statistically.

Anomaly Detection.
In this section we consider how RH distance could be used to detect anomalous behavior in a streaming environment and compare to the performance with a similarly lightweight measure (KS distance) as well as a "ideal" measure (graph edit distance). To compare between these three methods in a non-parametric way (i.e. without introducing a "anomaly threshold") we introduce the idea of an anomaly score of an observation with respect to a theoretically or empirically observed baseline distribution.   Specifically, let the random variable Z have theoretical or empirical cumulative distribution function f Z : R → [0, 1]. We will then say that a particular observation z (not necessarily distributed as Z) has an anomaly score relative to f Z of 2 f Z (z) − 1 2 . Note that this score takes on values from [0, 1] with values closer to one being more "anomalous." This score can be thought of as measuring the deviation of the observation z from the bulk of the distribution of Z.
Before turning to a direct comparison between anomaly scores for RH distance, KS p-values 7 , and edit distance, we consider the performance of each of these anomaly scores in isolation via ROC-like curves, presented for a subset of our 20 scenarios (2 distributions, 2 anomaly types, 5 levels of each anomaly) in Figures 5, 6, 7, and 8. Note that as the relative frequency of anomalous and non-anomalous behavior is unknown, these are not truly ROC curves but rather implicit plots (x(t), y(t)) where t is some threshold value. Specifically, y(t) is the fraction of the anomalous transitions that have anomaly score at least t, i.e. "true positives", where x(t) is the fraction of non-anomalous transitions that have anomaly score at least t, i.e. "false positives". At this point it is worth pointing out that if z is identically distributed with the random variable Z, then the anomaly score for z is uniformly distributed over [0, 1]. As a consequence, we can explicitly define x(t) = 1 − t. To compute the ROC curve for one scenario we used the previously computed cumulative distribution function for the 10,000 non-anomalous transitions as f Z . Then, for each of the 1,000 anomalous transitions we use the RH distance as z and compute the anomaly score for that value in the context of f Z .
Overall we can see that, for detecting anomalies, edit distance would be preferred to RH distance, which would in turn be preferred to the KS statistic. However, for the bump power-law, under the lateral movement anomaly with 10 edges, we see that the RH distance outperforms the edit distance, see Figure 7.

Kolmogorov-Smirnov (KS).
For this section we will compare the anomaly score of the RH distance with the anomaly score of the KS p-value (significance value) between successive degree distributions both for the 420 anomalous scenarios and the 40 baseline distributions. We note that since the degree distributions are discrete valued, the application of the KS test for hypothesis testing is not necessarily appropriate, however as we are interested in statistical behavior of the significance test and KS is widely used in the network analysis literature (see, for instance, [6,11,45]) we will ignore these technical issues. Figure 9 gives the relative performance of the KS and RH anomaly scores across all 420 anomaly scenarios. The y value counts how many of the 1,000 cases RH outperforms KS. We can see that the RH distance outperforms KS the most for the scenario where there is a 50-edge scan anomaly on the power-law distribution 7 From this point on in this work, although we will be using the KS p-value we will be treating it simply as a distance measure rather than a statistical quantity. In particular, we will make no assumptions about the meaning of large or small values of the p-value other than as a means of measuring the "closeness" between two degree distributions.    with an evolution rate of 0.23. In this case, the RH anomaly score is larger in 924 of the 1,000 different trials. We see that overall, excepting cases with a low-evolution rate and larger, lateral movement anomalies, RH distance is clearly superior, especially for the power-law degree distribution. It is worth noting that relative performance of the KS statistic improves when considering the lateral movement anomaly rather than the  . Relative Performance of RH and KS Anomaly Scores scan anomaly. Since the degree change caused by lateral movement is spread across many vertices (as opposed to scan where the primary change is spread across only three vertices), this result can be explained by the well known sensitivity of the KS test to variation away from the a tails of the distribution [45]. We note that a direct binary comparison between the two measures may not tell the whole story of their relative performance. For instance, in an extreme case, one can imagine one of the two measures taking on a fixed large value (indicating an anomaly) while the other takes on both small values and, more frequently, a value that is slightly larger than the other anomaly score. We separate the anomalous pairs of graphs into two sets according to whether their RH or KS anomaly score is higher. Then, for each of these two classes we report in Figure 10 (across all 420 anomaly scenarios) the mean difference between the scores with error bars representing one standard deviation of range around this mean. We note that the RH anomaly score typically exceeds the KS anomaly score by about 0.4, while the KS anomaly score typically exceeds the RH anomaly score by between 0.2 and 0.3. Further, the standard deviation across all cases the average gap between the RH and KS anomaly score is fairly consistently in the range [0.2, 0.3] essentially independent of all parameters. Together, the data in Figures 9 and 10, indicates that the RH distance is significantly more sensitive than KS distance to the broad range of anomalies we have investigated. In this section we compare the sensitivity of RH distance to a "perfect information" aggregate measure, in particular graph edit distance. Recall that the edit distance between two graphs F and G is the minimum "weight" of a sequence of edge/vertex additions/deletions needed to transform F to G. In general this quantity is N P-complete to compute (see [55]) and likely impractical to even approximate [35]. This complexity is driven by the difficulty in finding the optimal alignment between the vertices of F and G which maximizes the edge overlap between F and G. For the HLM model, this problem is mitigated by the natural alignment between the graphs generated at consecutive time steps. Thus, for purposes of this section we will approximate the graph edit distance as the number of edges that "flip" during each evolution of the HLM model. The following lemma allows us to significantly simplify the calculation of the anomaly score for edit distance by approximating the baseline distribution with the large n limit. Lemma 1. Let G be an random graph distributed according to G(P ) and let G be the graph formed by one iteration of the Hagberg-Lemons-Mishra evolution with evolution parameter α and probability matrix P . Let X be the random variable that counts the number of edges that differ between G and G . If Var(X) → ∞, then X is asymptotically normally distributed.
Proof. Let X ij be the indicator function for the random variable that edge {i, j} is present in precisely one of G and G and observe that X = i<j X ij . We recall that by the Lyapunov Central Limit Theorem [8, p. 362], we have that Var(X) if there is some δ > 0 such that Fixing δ = 1, we note that Thus, if Var(X) → ∞, then lim n→∞ 1 Var(X) and X is normally distributed.
It is worth mentioning that this, in principle, allows for an explicit formula for the distribution of the anomaly score for edit distance in a wide range of baseline and anomalous behaviors, namely where (µ, σ) and (µ A , σ A ) are the mean and distribution of the baseline and anomalous evolutions, respectively, and Φ is the cumulative distribution function of the standard normal distribution. However, given the correlated nature of the anomalies the calculation of σ A is tedious, so we will empirically estimate this distribution. Figure 9b again presents the relative performance of the anomaly scores, this time for edit distance and RH distance, for all 420 anomaly trials. Again the y value counts how many of the 1,000 cases RH outperforms edit distance. We note that in the best case (bump power-law degree distribution, lateral movement anomaly, 10 edges, α = 0.24), the RH distance anomaly score is larger than the edit distance anomaly scores 646 times. However, for the power-law case the RH anomaly scores essentially never outperform the edit distance anomaly scores. This failure is mitigated by the fact that, as mentioned earlier, in many cases the edit distance is computationally infeasible, while the RH distance requires minimal computational overhead. It is also worth mentioning that we can see a clear degradation of performance for edit distance as the size of the anomaly decreases and the evolution rate increases. This phenomenon can be explained by observing that the anomaly score for edit distance is driven by a z-score of the anomaly, which is linearly correlated with the anomaly size and inversely correlated with the standard deviation of the baseline distribution. Additionally, the variance baseline distribution of edit distance is linear related to the evolution rate, resulting in significantly decreased sensitivity at high evolution rates.
We further compare the relative behavior of the edit distance anomaly scores and the RH distance anomaly scores, in the same way as we did for KS above, by considering the average difference between the anomaly scores in the cases where the RH anomaly score is larger (positive values) and in the case the edit distance anomaly score is larger (negative values). As the RH distance anomaly score is essentially never larger than the edit distance anomaly score for the power-law distribution, we restrict our attention here to the bump  Figure 11. Difference between edit distance and RH distance power-law distribution. In Figure 11, we again report the relative magnitude of the differences with the error bars representing an interval one standard deviation away from the mean. Again we can see a clear stratification of the behavior with the RH anomaly scores performing better as the size of the anomaly decreases. We also note the mild improvement in the performance of RH distance as the evolution rate increases, likely reflecting the decreased sensitivity of edit distance (due to larger variance). Interestingly, the standard deviation is essentially constant over all choices of degree distribution, anomaly type, anomaly size, and evolution rate and is also roughly equal to the standard deviations shown in Figure  10. Furthermore, the magnitude of the standard deviation is close to the minimal possible standard deviation given by the generalization of Bhatia-Davis inequality for the variance of a bounded random variable [1]. As the extremal distribution is given by point masses at the end points of the distribution, this indicates that there are three essentially distinct outcomes: the RH distance anomaly score is significantly larger than the edit distance anomaly score, the RH and edit distance anomaly scores are essentially the same, and the edit distance anomaly score is significantly larger than the RH distance anomaly score. Furthermore, this holds regardless of the size and nature of anomaly or evolution rate and also holds when replacing edit distance with KS distance (for both degree distributions).

Conclusions
In this work, we conducted an experimental and statistical study of Relative Hausdorff distance in the context of time-evolving sequences of graphs. Applying RH distance as an anomaly detection tool, we first tested its detection of red team events across multiple modalities in real cyber security data. We found evidence that RH distance values register statistical change-points at red team events, although these results were sensitive to window length, the granularity of pairwise RH measurements, and subject to the limitations of the data. In order to test RH distance in a more controlled and rigorous manner, we then turned our attention to a temporal graph model inspired by cyber-data.
Using this temporal graph model to generate synthetic sequences of evolving graphs, we experimentally tested the sensitivity of RH distance to two attack profiles. To broaden our tests scope, we considered a multitude of parameter settings in which we varied the input degree distribution, temporal evolution rate, and intensity of the attack signal. In its own right, RH distance performed respectably, yielding ROC curves above the line of no-discrimination for every scenario tested. Compared with other similarity measures, RH distance consistently outperformed another lightweight similarity measure based on Kolmogorov-Smirnov distance, while its performance against the computationally-intensive edit distance was more mixed: while edit distance clearly outperformed RH distance under scenarios featuring the power law degree distribution, RH distance was better able to detect the low-intensity lateral movement attack under the bump power law degree distribution.
Anomaly detection generally, and even specifically in cyber security, is not amenable to a "one method to rule them all" mentality. Indeed, there are many types of anomalies and one does not expect them to all be caught by the same detector. It is important to recognize that our analysis does not use all of the available information pertinent to real cyber data. Distilling a time interval of data down to a single graph and removing all metadata is likely to introduce many false positives. It could be that a graph is anomalous given the recent context, but the behavior is fully expected by cyber security operations analysts (e.g. a daily backup may appear to be an exfiltration if the IP addresses involved aren't considered). In the other direction, if the graph does not contain the metadata that would flag an anomaly, this may similarly introduce false negatives. By integrating metadata into our analysis, it is possible that as anomalies are discovered, this metadata could be used to help classify them as benign and nefarious anomalies. Lastly, it is also worth noting that our analyses considered the entire data in a given period, as opposed to an online approach. Whether and how RH distance might be utilized in online anomaly detection frameworks remains another open topic for future research.