Selective network discovery via deep reinforcement learning on embedded spaces

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.

a sequential, closed-loop manner. In particular, we will leverage Reinforcement Learning (RL) and its mathematical formalism, Markov Decision Processes (MDP): a general decision-theoretic model that allows us to treat network discovery as an interactive, sequential learning and planning problem. MDP approaches have been successfully used in many other application settings (Mnih et al. 2015; Heess et al. 2017; Silver et al. 2017). However, the use of decision-theoretic approaches in the context of discovery of complex networks is novel and presents very interesting research opportunities. In particular, it requires learning effective models of reward that can capture properties of network structure at various topological scales and learning contexts. The network science community has defined many such topological and task quality metrics but, to date, they have not been leveraged to guide the process of network discovery. We consider the task of selective harvesting on graphs (Murai et al. 2017), where the learning objective is to maximize the collection of nodes of a particular type under budget constraints. We make the following contributions:
• We introduce a deep RL framework for task-driven discovery of incomplete networks. This formulation allows us to train models of environment dynamics and reward offline.
• We show that, for a variety of complex learning scenarios, the added feature of learning from closely related scenarios leads to substantial performance improvements relative to existing online discovery methods.
• We show that network embedding can play an important role in the convergence properties of the RL algorithm. It does so by imposing structure on the network state space and prioritizing navigation over this space.
• Among a class of embedding algorithms, we identify personalized PageRank (PPR) as a suitable network embedding algorithm for the selective harvesting task. Our combined approach of PPR embedding and offline planning achieves substantial reductions in training time.
• Leveraging several evaluation metrics, we delineate learning regimes where embedding alone stops being effective and planning is required.
• Our approach is able to generalize well to unseen real network topologies and new downstream tasks. Specifically, we show that policies discovered by training on synthetically generated networks translate well to detection of anomalous nodes in real-world networks.

Related work
Our learning task falls under the category of finding the largest number of nodes of a particular type under budget constraints. The node type can be specified by node attributes (for example, males on a social network), or it can be determined by a node's participation in a particular class of behavior (for example, accounts that belong to dense bipartite subgraphs in a communication network). Unlike the problem setting in Wang et al. (2013), we do not assume access to the full topology of the network and therefore have to perform the learning task with partial information.
Discovering incomplete networks with limited resources has received a lot of attention in recent literature. The primary learning objective in these works is to increase the visibility of the network topology by increasing the number of discovered nodes (LaRock et al. 2018, 2020; Soundarajan et al. 2015, 2016), increasing the number of discovered nodes of a given type (Murai et al. 2017), or increasing network coverage (Avrachenkov et al. 2014). Our problem setting is the closest to Murai et al. (2017). However, while Murai et al. (2017) leverage supervised learning to infer discovery heuristics, our approach leverages an MDP formulation of RL to estimate offline models of network discovery strategies (a.k.a. policy) and node utility (a.k.a. reward) that are network state-aware. More specifically, our approach explicitly connects the utility of a discovery choice to the network state when that choice was made. We will illustrate in later sections that learning state-dependent discovery strategies allows our approach to stay robust in learning scenarios where nodes of interest are sparsely observed. LaRock et al. (2018, 2020) frame their network discovery task as an MDP, but they only consider online training of their policy. The online-only training, we will show, suffers from tunnel vision and is not able to generalize well. Reinforcement learning for tasks on complex networks is a relatively new perspective. Work in Ho et al. (2015) and Goindani and Neville (2019) leverages RL to engineer diffusion processes in networks assumed to be fully observed, while the authors of Mofrad et al. (2019) focus on the problem of graph partitioning. You et al. (2018) leverage RL to generate novel molecular graphs with desired domain-specified properties. There are connections to our problem setting. 
The graph generation is approached as an MDP, in a similar fashion to our network discovery problem, by iteratively expanding a seed graph via defined actions. There are, however, some important differences with our work. Their definition of reward and environment dynamics is tailored to the biochemical domain and molecular design application, which is characterized by a comparatively small state space and fixed-sized action space.
In our problem setting, the agent deals with a much larger network space and a larger and variable-sized action space. To address this increased complexity, we introduce a network embedding step that enables a more efficient navigation of the decision space. Our notion of reward is also more general, in that we do not utilize domain-specific properties to guide the learning process. Since we cannot rely on added information provided by the domain-specific properties (e.g., biochemistry), we had to carefully model ways of introducing topological diversity in our policy training data. Others (Dai et al. 2017;Mittal et al. 2019) have leveraged deep RL techniques to learn a class of greedy optimization heuristics on fully observed networks. We summarize key differences between our method and related methods in Table 1.

Problem definition
We start with the assumption that a network contains a target subnetwork representing a set of relevant nodes. The decision-making agent can initially observe only part of the original network, $G_0 = (N_0, E_0)$, where some of its nodes have their relevance status $C_0$ revealed, with 0 denoting non-target nodes and 1 denoting target nodes.
The agent has a pre-specified way by which it can interact with the partially observed network: at each step, it is allowed to query the label of one selected observed node whose label is unknown. We call this type of node a boundary node. The environment responds by revealing the label of the queried node as well as its neighboring nodes (but not their labels). An immediate reward is given if the selected (i.e., queried) node is a target node. The agent's overall objective is then to strategically grow the original network over a sequence of steps, so that it can discover as many relevant nodes as possible under the query budget constraints. This problem may be stated as a Markov Decision Process (MDP). An MDP is defined by the tuple $\langle S, A, T, r, \gamma \rangle$:
• State space: Let $\{M_i\}$ be a set of random graph models, each defining a set of network instances $\{G^i\}$ over a set of nodes $V$. We define the state space as $S = \bigcup_{M_i} \{s_t = G^i_t\}$, the set of partially observed network instances $G^i_t$ with node set $V^i_t$, in which each node $v \in V^i_t$ has a label $C(v) \in \{0, 1, *\}$, representing non-target, target and unobserved node states, respectively. $E^i_t$ is the set of edges induced by $V^i_t$. The state space includes the initial state $G^i_0$, which contains at least one target node, as well as the terminal state, which is the fully observed network instance $G^i$.
• Action space: For each network instance $G^i$, we define the action space as $A = \{A_t\}$, where $A_t = \{a = v\}$ is the set of boundary nodes $v$ observed at timestep $t$: $\{v \in V_t, C(v) = *\}$.
• Transition function: $T(s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$ encodes the transition probability from state $s_t$ to $s_{t+1}$ given that action $a_t$ was taken. Let $v$ be the node selected by action $a_t$. Then $s_{t+1} = s_t \cup C(v) \cup N_v \cup E_{N_v}$, where $N_v$ are the neighbors of node $v$ and $E_{N_v}$ are all the edges incident to $N_v$. For each network instance $G^i$, the transition function is deterministic: $T(s_t, a_t, s_{t+1}) = 1$.
• Reward function: $r(s_t, a_t)$ returns the reward gained by executing action $a_t$ in state $s_t$ and is defined as $r(s_t, a_t) = 1$ if $C(a_t) = 1$, and 0 otherwise.
The total cumulative, action-specific reward, also referred to as the action-value function $Q$, is defined as

$$Q(s, a) = \mathbb{E}\Big[\sum_{t=0}^{h} \gamma^{t} r_{t+1} \,\Big|\, s, a\Big], \tag{1}$$

with $\gamma \in [0, 1]$ representing a discount factor that captures the utility of exploring future graph states, and $h$ the horizon length. Figure 1 gives a simple illustration of how this cumulative reward is computed over a network topology. In the next section, we describe in detail our deep reinforcement learning algorithm.

Network actor critic (NAC) algorithm
Our network discovery algorithm has two main components, as illustrated in Fig. 2. The first component, "Compress State Space", is concerned with effective ways of representing the large network state space so that policy learning can happen efficiently relative to our selective harvesting task. The second component, "Plan", utilizes the reinforcement learning framework and offline training to learn task-driven discovery strategies. We discuss both components in detail in the rest of this section.

Fig. 1 Illustration of the estimation of the cumulative reward of the current state $s = s_0$ over a horizon of length $h = 3$, with discount factor $\gamma = 0.5$. The current state $s$ comprises 3 types of nodes: unknown (grey), target nodes (red) and non-target nodes (black); red nodes represent the node type we would like to discover. The figure shows an instantiation of policy $\pi$, starting at state $s$, corresponding to a path of length $h = 3$. We calculate the cumulative discounted reward of state $s$ based on taking action $a_1$ at $t = 0$ and following the highlighted path as follows: $Q(s, a_1) = \gamma^{t} r_{t+1} + \gamma^{t+1} r_{t+2} + \gamma^{t+2} r_{t+3} = 1 \cdot 0 + \tfrac{1}{2} \cdot 0 + \tfrac{1}{4} \cdot 1 = \tfrac{1}{4}$
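The query dynamics of the MDP and the discounted return of the Fig. 1 example can be illustrated with a minimal sketch. The function names `step` and `discounted_return`, the dictionary-based graph representation, and the four-node path graph are assumptions made for illustration only; they are not part of the NAC implementation.

```python
UNKNOWN = "*"  # label of an observed node whose class has not been queried yet


def step(labels, graph, truth, v):
    """Query boundary node v: reveal its label and its neighbors (without their labels).

    labels: dict node -> 0, 1 or "*" for the currently observed nodes
    graph:  dict node -> set of neighbors in the hidden, fully observed network
    truth:  dict node -> 0 or 1, the hidden ground-truth labels
    Returns the next observed labels and the immediate reward.
    """
    if labels.get(v) != UNKNOWN:
        raise ValueError("only observed, unlabeled (boundary) nodes can be queried")
    new_labels = dict(labels)
    new_labels[v] = truth[v]                   # reveal the label of the queried node
    for u in graph[v]:
        new_labels.setdefault(u, UNKNOWN)      # newly seen neighbors enter the boundary
    reward = 1 if truth[v] == 1 else 0
    return new_labels, reward


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_{t+1} over the rewards collected along one path."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Toy path graph 0 - 1 - 2 - 3, where nodes 0 and 3 are targets (hypothetical data).
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
truth = {0: 1, 1: 0, 2: 0, 3: 1}
labels = {0: 1, 1: UNKNOWN}                    # seed state: one known target node

rewards = []
for action in (1, 2, 3):                       # follow the path toward the far target
    labels, r = step(labels, graph, truth, action)
    rewards.append(r)

q = discounted_return(rewards, gamma=0.5)      # 1*0 + 0.5*0 + 0.25*1
```

With rewards (0, 0, 1) and a discount factor of 0.5, the sketch reproduces the value 1/4 computed in the Fig. 1 caption.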

Fig. 2
Schematic approach of NAC algorithm. In the first component, NAC uses a network embedding and truncation step to avoid an explosion in the state-action space as the network grows. The truncation block ensures a constant size input into the learned policy. In the second component, NAC uses reinforcement learning to learn a policy offline

Compression of network state space
Training an effective network discovery agent implies exploration over an extremely large network space. However, not all observations contribute to learning better discovery policies. In fact, for the task of selective harvesting, we can identify three representative, higher-level abstractions of the network states illustrated in Fig. 3.
In the first canonical case, discovery starts within the region of interest, where many of the relevant nodes we need to discover are nearby. In networks whose states are similar to this canonical case, the optimal discovery agent would follow localized paths and primarily exploit rather than explore new regions. In the second canonical case, discovery starts outside the region of interest and the agent now has to explore longer, deeper paths in order to reach the target region. Finally, there is a hybrid canonical case, where discovery can start on the boundary of the target region and the agent has to decide more carefully when to exploit and when to explore.
In order to map network states into canonical representations, we consider various network embedding approaches. Specifically, we look at popular walk-based algorithms (Grover and Leskovec 2016; Taher 2003; Murai et al. 2017) and matrix-factorization algorithms (Pearson 1901; Torres et al. 2020; Belkin and Niyogi 2003). The embedding step learns a new similarity function between nodes in a network. Since our downstream task is selective harvesting from a seed node, we reorder the rows of the original adjacency matrix based on the new learned distance from the seed node, with closer nodes being ranked higher. The reordering step makes sure the discovery algorithm observes a prioritized set of boundary nodes. For additional efficiency gains, we truncate the reordered adjacency matrix and only retain the network defined by the top $k$ nodes. Here $k$ is a user-selected hyperparameter; it defines the supporting network for computing potential discovery trajectories and long-term reward. For training our algorithm, we found $k = 256$ to perform well after incrementally lowering its value from the number of nodes in the network. In practice, larger values of $k$ can be utilized during training, but they incur a higher computational cost with no substantial increase in algorithmic performance.
In the "Role of network embedding" section, we present a detailed evaluation of various embedding algorithms and identify the role that they play in supporting the planning component of NAC. Among the embedding algorithms we study, we identify personalized PageRank (PPR) (Taher 2003) as performing the best in supporting policy learning for selective harvesting. For the rest of the paper, we assume that the planning agent only sees the compressed state representation, that is, the state that has gone through the sequence $e$ of operations embed, re-order and truncate: $s_t = e(s_t)$.
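The embed, re-order and truncate sequence can be sketched as follows, under several assumptions: PPR is computed by plain power iteration with restart at the seed node, the damping parameter is read as the probability of following an edge, and states with fewer than $k$ nodes are zero-padded to a $k \times k$ matrix. The function names and the toy graph are hypothetical and are not taken from the NAC code.

```python
def personalized_pagerank(adj, seed, alpha=0.8, iters=100):
    """Power iteration for PPR scores with restart at `seed`.

    adj: dict node -> set of neighbors (undirected graph).
    alpha: probability of following an edge; 1 - alpha is the restart probability.
    """
    score = {v: 0.0 for v in adj}
    score[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                share = alpha * score[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share
            else:
                nxt[seed] += alpha * score[v]  # isolated node: send its mass back to the seed
        nxt[seed] += 1.0 - alpha               # restart mass
        score = nxt
    return score


def compress_state(adj, seed, k, alpha=0.8):
    """Embed (PPR), re-order (highest score first) and truncate to the top-k nodes."""
    score = personalized_pagerank(adj, seed, alpha)
    order = sorted(adj, key=lambda v: -score[v])[:k]
    index = {v: i for i, v in enumerate(order)}
    matrix = [[0] * k for _ in range(k)]       # constant-size input, zero-padded if needed
    for v in order:
        for u in adj[v]:
            if u in index:
                matrix[index[v]][index[u]] = 1
    return order, matrix


# Toy observed network (hypothetical): a path 0 - 1 - 2 - 3 plus the edge 0 - 4.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {0}}
order, matrix = compress_state(adj, seed=0, k=3)
```

On this toy graph the seed and its two direct neighbors are retained, so the policy input is the 3 x 3 adjacency matrix induced by those nodes.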

Offline learning and policy optimization
In our setting, learning of discovery strategies happens offline over a training set of possible discovery paths. "NAC performance results" section describes how we generate these paths.
Each path $\tau_h$ represents an alternating sequence of discovered subnetworks and actions $\{\langle s_0, a_0 \rangle, \langle s_1, a_1 \rangle, \ldots, \langle s_h, a_h \rangle\}$, taken over $h$ steps. Since in this setting we have access to the ground-truth node labels, we can map each discovery path to the corresponding cumulative reward value using Eq. 1. An illustration is given in Fig. 1.
Given the sampled trajectories, one of our learning objectives becomes to approximate the action-value function by minimizing the loss $L_Q(\phi)$ (Eq. 2). We formulate this objective over input tuples of discovered subnetworks $s_t$, boundary nodes $a_t$ and corresponding cumulative reward values $Q_t$. The approximated reward function $Q_\phi$ is subsequently used to estimate the policy function $\pi_\theta(s) = P(a \mid s)$, which defines the probability of selecting action $a$ at state $s$. This is achieved by training a convolutional neural network with the network embedding as input and a softmax probability over the action space as output. Actions are selected via an argmax over the output probabilities. The parameters of the convolutional neural network are updated via gradient ascent on the objective function defined in Eq. 5.
We estimate the advantage of choosing one node versus another at state $s_t$ (Eq. 3). This advantage is used to scale the policy gradient estimator, in which the expectation is taken as the empirical average over a finite batch of samples.
We utilize a proximal policy optimization (PPO) method (Schulman et al. 2017) in order to compute this gradient. PPO methods are widely utilized for policy network optimization and have been demonstrated to achieve state-of-the-art performance on graph tasks (You et al. 2018). The objective function utilized is defined in Eq. 4. Here, $\epsilon$ is used to bound the objective function and help with convergence. The function clip(·) keeps the ratio $\pi_\theta / \pi_{\theta_{old}}$ within $[1 - \epsilon, 1 + \epsilon]$ (Pascanu et al. 2013). Note that when the estimated expectation in Eq. 4 is expanded, $\pi_\theta / \pi_{\theta_{old}}$ becomes $\pi_\theta(s, a) / \pi_{\theta_{old}}(s, a)$, where $\pi_\theta(s, a)$ is the probability of the current policy selecting action $a$ when in state $s$, and $\pi_{\theta_{old}}(s, a)$ is the probability of the previous policy selecting the same action $a$ when in state $s$. During offline training, we modify the objective in Eq. 4 to encourage exploration and reduce the number of training epochs required to converge to a solution (Eq. 5). In Eq. 5, $S(\pi_\theta(s_t))$ denotes the entropy of policy $\pi_\theta$ over actions $a$ in state $s_t$, and $c$ is a fixed constant that captures the weight of the entropy term. Both learning objectives (Eqs. 2 and 5) are jointly optimized via an actor critic training framework. This framework is detailed further below in the description of the Network Actor Critic (NAC) algorithm. To help with training times, multiple instantiations of agents are run in parallel. Collected $\{s_t, a_t, Q_t\}$ values are gathered from each agent and are stored in a buffer $\beta$, which is used to compute the losses for the value function and policy networks after a fixed number of iterations $T$.
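The two learning objectives can be sketched numerically under explicit assumptions: the value loss is taken to be a mean squared error between predicted and sampled cumulative rewards, the policy objective is taken to be the standard clipped surrogate of Schulman et al. (2017), and the exploration variant adds the entropy of the policy weighted by $c$. The advantages are supplied as plain numbers, and all function names are hypothetical; the exact forms used by NAC may differ.

```python
import math


def value_loss(q_pred, q_target):
    """Assumed form of the critic loss: mean squared error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(q_pred, q_target)) / len(q_pred)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_objective(new_probs, old_probs, advantages, policies, eps=0.1, c=0.2):
    """Clipped surrogate objective with an entropy bonus, averaged over a batch.

    new_probs / old_probs: probability of the taken action under the current / previous policy
    advantages: estimated advantage of each taken action
    policies: full action distribution of the current policy at each visited state
    """
    total = 0.0
    for p_new, p_old, adv, dist in zip(new_probs, old_probs, advantages, policies):
        ratio = p_new / p_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # keep the ratio in [1 - eps, 1 + eps]
        total += min(ratio * adv, clipped * adv) + c * entropy(dist)
    return total / len(new_probs)


# One sample in which the new policy doubles the probability of the taken action.
gain_pos = ppo_objective([0.8], [0.4], [1.0], [[0.8, 0.2]], eps=0.1, c=0.0)
gain_neg = ppo_objective([0.8], [0.4], [-1.0], [[0.8, 0.2]], eps=0.1, c=0.0)
```

With a positive advantage the clip caps the gain at 1 + eps, while with a negative advantage the unclipped, more pessimistic term is kept. Setting c = 0 switches off the entropy bonus, which mirrors the online evaluation setting described in the text.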

Training and network details
The NAC algorithm is updated differently during offline training versus online evaluation. During offline training, the Adam optimizer (Kingma and Ba 2014) is used to update network parameters θ and φ for the policy and value function networks. In offline training, eight agents are run in parallel, carrying out the anomaly discovery task each on a unique network realization generated using one of the network models outlined in Table 2. Each network model is chosen with equal probability and their parameters are drawn uniformly at random from the parameter ranges defined in Table 2. Each independent agent contributes data to a common buffer, later used for learning the parameters of the value and policy functions. The use of multiple agents helps generate more training data faster. More agents will lead to further reductions of training time.
Hyper-parameter searches were performed via grid search for both offline and online values. During offline training, the hyper-parameters used are: $T = 32$, $h = 4$, $c = 0.2$, $\epsilon = 0.1$, $\gamma = 0.1$, and learning rate = 1e−4. With the exception of $h$, these parameters generally did not affect the overall final performance of the network, but did alter the convergence rate. We found that going beyond $h = 5$ reduced overall performance of the learned policy. For online evaluation, we used a single agent and parameters $T = 1$, $h = 1$, $\gamma = 1$, $c = 0.0$, and = 1e−3. The change in hyper-parameters, specifically removing the exploration components, reflects moving from the training setting, which benefits from sufficient exploration of the search space, to a fully greedy online policy that favors exploitation. The policy and value function networks each consist of 3 convolutional layers with 64 hidden channels and a final fully connected layer. The value function is regressed using the loss function described in Eq. 2, while the policy network is trained using the objective function described in Eq. 5.

NAC performance results
We evaluate our algorithm against several learning scenarios for both synthetic and realistic datasets. The NAC agent always starts the exploration from a seed subnetwork that contains 1 target node. Next we describe our datasets and baselines used for comparison.

Synthetic datasets
We generate synthetic graphs by modeling background networks (i.e., networks that do not contain any target nodes) and foreground networks (i.e., networks that contain only target nodes). There are edges that connect the foreground and background nodes to each other. We use two models to generate samples of background networks. The Stochastic Block Model (SBM) (Holland et al. 1983) is a commonly used random graph model, which allows us to model community structure as dense subgraphs sparsely connected with the rest of the network. The Lancichinetti-Fortunato-Radicchi (LFR) model (Lancichinetti et al. 2008) is another frequently used random graph model, which allows us to simulate network samples with skewed degree distributions and skewed community sizes, and is therefore able to capture more realistic and complex properties of real networks. We use the Erdős-Rényi (ER) model (Holland et al. 1983) to simulate the foreground network. ER is a random graph model where nodes are connected with equal probability $p_f$. This allows us to control the density of the foreground networks. Table 2 lists the parameter choices for all the aforementioned models.

Table 2 Detailed list of parameter values used for synthetic networks. The number of nodes is $N = 4000$. SBM parameters are: $k$, the number of communities; $p_i$, the edge probability within community $i$; and $r$, the across-community edge probability, such that $p_i > r$. LFR parameters are: $\tau_1$, $\tau_2$, the skewness parameters for the degree and cluster size distributions, respectively; $d$, the average network degree; $d_{min}$, $d_{max}$, the minimum and maximum values of the degree distribution; and $min_c$, $max_c$, the sizes of the smallest and largest clusters. Finally, $n_f$, $k_f$, $p_f$ represent the size of the foreground subnetwork, the number of foreground subnetworks and its edge probability, respectively.
In order to create a background plus foreground network sample, we select a subset of the nodes from the background network that will represent the target nodes. We then simulate an ER subnetwork on these nodes and replace their background-induced subnetwork with the ER subnetwork, while maintaining the edges from the target nodes to the rest of the background nodes.
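The background plus foreground construction can be sketched with the standard library alone. This is a simplified illustration, not the generator used in the experiments: it assumes equal-sized SBM communities, a single within-community probability `p_in` instead of one $p_i$ per community, and a much smaller network than $N = 4000$. The function names are hypothetical.

```python
import random
from itertools import combinations


def sbm_background(n, k, p_in, r, rng):
    """Sample an SBM background: n nodes in k equal communities (assigned round-robin).

    Node pairs are linked with probability p_in inside a community and r across communities.
    """
    community = {v: v % k for v in range(n)}
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        p = p_in if community[u] == community[v] else r
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def plant_er_foreground(adj, n_f, p_f, rng):
    """Pick n_f target nodes and replace their induced subnetwork with an ER graph.

    Edges between target and non-target nodes are left untouched.
    Returns a dict node -> 1 (target) or 0 (background).
    """
    targets = rng.sample(sorted(adj), n_f)
    target_set = set(targets)
    for u in targets:
        adj[u] -= target_set                   # drop background edges among target nodes
    for u, v in combinations(targets, 2):
        if rng.random() < p_f:                 # ER edge with probability p_f
            adj[u].add(v)
            adj[v].add(u)
    return {v: int(v in target_set) for v in adj}


rng = random.Random(0)
adj = sbm_background(n=60, k=3, p_in=0.3, r=0.02, rng=rng)
labels = plant_er_foreground(adj, n_f=8, p_f=1.0, rng=rng)
targets = [v for v, c in labels.items() if c == 1]
```

Setting `p_f = 1.0` plants a clique, which corresponds to the clique foreground of the first learning scenario; lower values of `p_f` give sparser, harder-to-detect foregrounds.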

Real datasets
We analyzed two Facebook datasets (Rozemberczki et al. 2018) representing pages of different categories as nodes and mutual likes as edges. For both cases, we study the discovery of a target set of nodes, where we control how we generate and embed them in the background network. In particular, we embed a synthetic foreground subnetwork consisting of a denser (anomalous) ER graph with size n f = 80 and density p f = 0.003 . We also consider the Livejournal dataset (Murai et al. 2017). This dataset represents an online social network with users representing nodes, and their self-declared friendships representing edges. For each user, there is also information on the groups they have joined. Similar to Murai et al. (2017), we use one of the listed groups as the target class. The Livejournal dataset represents a departure from the two Facebook datasets, both in terms of its much larger size, but also because the target class does not represent an anomaly. Table 3 describes a few topological characteristics of the real networks described here, as well as details on their target classes.

Baselines
We evaluate NAC by comparing its performance to two top-performing online network discovery approaches. The Network Online Learning (NOL) algorithm (LaRock et al. 2018, 2020) learns an online regression function that maximizes discovery of previously unobserved nodes for a given number of queries. We modify the objective of NOL to match our problem setting by requiring the discovery of previously unobserved nodes of a particular type. A second baseline we consider is the Diversity Dynamic Thompson Sampling ($D^3TS$) approach (Murai et al. 2017). $D^3TS$ is a stochastic multi-armed bandit approach that leverages different node classifiers and Thompson sampling to diversify the selection of a boundary node. We also compare to a simple fixed node selection heuristic referenced in Murai et al. (2017), called Maximum Observed Degree (MOD). At every decision step, MOD selects the node with the highest number of observed neighbors that have the desired target label. Finally, we compare to the heuristic that at each step selects the node with the highest PPR network-embedding score.
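The MOD heuristic can be written in a few lines. The sketch below assumes a dictionary-based view of the observed network, with "*" marking boundary nodes; the function name `mod_select` and the toy data are hypothetical.

```python
def mod_select(adj_observed, labels):
    """Return the boundary node with the most observed neighbors labeled as target.

    adj_observed: dict node -> set of observed neighbors
    labels: dict node -> 1 (target), 0 (non-target) or "*" (boundary, label unknown)
    Ties are broken in favor of the boundary node listed first.
    """
    boundary = [v for v, c in labels.items() if c == "*"]
    if not boundary:
        raise ValueError("no boundary node left to query")
    return max(boundary, key=lambda v: sum(labels.get(u) == 1 for u in adj_observed[v]))


# Toy observed network: x touches two known targets, y touches only one.
labels = {"t1": 1, "t2": 1, "n": 0, "x": "*", "y": "*"}
adj_observed = {
    "t1": {"x", "y"},
    "t2": {"x"},
    "n": {"y"},
    "x": {"t1", "t2"},
    "y": {"t1", "n"},
}
choice = mod_select(adj_observed, labels)
```

Because the score depends only on immediate labeled neighbors, the heuristic has no notion of long-term reward, which is the property the learning scenarios below are designed to stress.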

Learning scenarios
In the first learning scenario, the goal is to detect a set of distributed anomalous nodes. They are represented by two cliques, each containing 40 nodes, that are 2 to 3 hops away from each other. The training instances are networks generated by the SBM model, while the test cases are network instances generated by the LFR model. In this scenario, the discovery agent has to figure out (1) how to value longer exploration paths over the cost of including nodes not in the target set, and (2) how to adjust to topological differences between training and testing instances. In Fig. 4a, we consider a test case where detectability of the two cliques with complete network information is relatively easy (average background density around the cliques is comparatively low). We observe that all the methods are able to find the first clique, yet all the baselines struggle once they enter the region where no clique nodes are present. The baselines eventually find some of the second clique's nodes, but they are unable to fully retrieve the entire second clique. NAC is able to leverage estimation of long-term reward and access to the offline policy to fully recover both cliques. Furthermore, NAC is able to generalize to the more complex LFR topology.
In Fig. 4b, we consider a much harder case. The foreground networks are two disjoint denser subnetworks, 2-3 hops away from each other, each with density 0.2 in a background of density 0.05. Even though these foreground networks are denser than the background network, they are still much sparser than the cliques embedded in the first learning scenario. In fact, their relative density parameters are close to the undetectability bound (Nadakuditi and Newman 2012). In this case, none of the baselines learns how to recover the second foreground network. NAC goes through a longer exploration phase, but eventually learns how to grow the network to identify the second foreground network.
In Fig. 5, we illustrate how our model trained on synthetic background networks generalizes to realistic background topologies. For this scenario, we trained with instances from both the LFR and SBM models. We observe that NAC generalizes very well to the Facebook network topologies and is able to fully discover the target nodes (Fig. 5a, b). Note the substantial performance improvement when we compare to the other network discovery baselines. The Livejournal network in Fig. 5c presents a much more complex discovery challenge, both in terms of the size and the realism of the network (see Table 3 for details). In particular, the target network is not synthetically generated as was the case with the Facebook networks, and we do not know the underlying process that generated this target network. Note that NAC is able to overtake the most competitive baseline, $D^3TS$, at about iteration 900, when it has seen a bit more than a third of the target nodes (600 nodes). After this point, NAC is able to recover the remaining target nodes at a substantially faster rate than all the baselines.

Fig. 4 Confidence intervals are generated by testing NAC's performance on 200 background network instances generated using 5 different models from Table 2. We move the location of the two foreground subnetworks around, but maintain their relative distance of 2-3 hops.

Role of network embedding
In this section, we systematically explore the role of the embedding algorithm in supporting better network discovery for selective harvesting. We consider two broad classes of embedding methods: walk-based methods (Grover and Leskovec 2016;Murai et al. 2017;Taher 2003), and matrix factorization methods (Pearson 1901;Torres et al. 2020;Belkin and Niyogi 2003).
As introduced earlier, Maximum Observed Degree (MOD) (Murai et al. 2017) is a heuristic embedding which ranks nodes by the number of edges shared with a target node. Personalized PageRank (PPR) (Taher 2003) is a random-walk method which ranks nodes by their estimated random-walk distance to the observed target nodes. We used a damping parameter $\alpha = 0.8$. Node2vec (Grover and Leskovec 2016) is a deep-walk based method which attempts to learn a neighborhood-preserving representation for each node in a given network instance. We used the following Node2vec parameters: number of random walks = 5, length of each random walk = 40, and embedding dimension = 64. We ranked embedded nodes by estimating the Euclidean distance between each node and the observed target nodes. For Principal Component Analysis (PCA) (Pearson 1901), we compute the eigen-decomposition of the input adjacency matrix. We estimate the node ranking by looking at the average Euclidean distance between a node and the observed target nodes.
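The distance-based ranking used for the Node2vec and PCA embeddings can be sketched as follows, assuming the embedding vectors have already been computed by some embedding method. The function name `rank_by_distance` and the two-dimensional toy vectors are hypothetical.

```python
import math


def rank_by_distance(embedding, observed_targets):
    """Rank non-target nodes by average Euclidean distance to the observed targets.

    embedding: dict node -> coordinate tuple produced by any embedding method
    observed_targets: set of nodes already known to be targets
    Returns the remaining nodes, closest first.
    """
    def avg_dist(v):
        dists = [math.dist(embedding[v], embedding[t]) for t in observed_targets]
        return sum(dists) / len(dists)

    candidates = [v for v in embedding if v not in observed_targets]
    return sorted(candidates, key=avg_dist)


# Toy 2-D embedding with a single observed target at the origin.
embedding = {"t": (0.0, 0.0), "a": (1.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 0.5)}
ranking = rank_by_distance(embedding, {"t"})
```

Nodes that the embedding places close to known targets are ranked first, which is the ordering the truncation step then cuts at the top $k$ entries.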
Laplace Eigenmap Embedding (Eigenmap) (Belkin and Niyogi 2003) is a low-dimensional graph representation based on spectral properties of the Laplacian matrix of a graph. In this embedding, we represent vertices using the eigenvector corresponding to the second smallest eigenvalue. Rank is estimated by looking at the absolute value of the dot product of embedding vectors as described in Torres et al. (2020). The embedding dimension was set to 64.
Geometric Laplacian Eigenmap Embedding (GLEE) (Torres et al. 2020) is a low-dimensional graph representation based on geometric properties of the Laplacian matrix of a graph. Unlike Eigenmap, GLEE represents vertices using the eigenvector corresponding to the largest eigenvalue. Ranking is estimated in the same way as for Eigenmap. The embedding dimension was set to 64.

Fig. 5 For the Facebook networks (a, b), confidence bounds are estimated using 200 instantiations of the foreground network using 5 different parameter choices from Table 2 and varying the location in the Facebook network where we plant the foreground instance. For the Livejournal network (c), we pick 200 different nodes in the foreground network where the agent can start the exploration.

Embedding evaluation metrics
Our evaluation of the embedding algorithm is in the context of its support to NAC's policy learning component. An effective RL agent for selective harvesting would benefit from state approximations that reflect canonical states for this task (illustrated in Fig. 3). This may imply differing embedding objectives than if we analyze network embedding algorithms as standalone solutions. To this effect, we consider the following metrics for evaluating the role of the embedding algorithm.

Consistent embedding
Ideally, we would like the embedding algorithm to place probed and unprobed target nodes near each other. This property implies that NAC will have a higher chance of visiting target nodes earlier than background nodes. To capture the consistency property of the embedding algorithm $e(\cdot)$, we measure, at every discovery step $t$, the accuracy of the embedding algorithm in recovering the top $k$ target nodes:

$$\text{Accuracy}_t(e(G_t)) = \frac{\#\,\text{top } k \text{ target nodes identified by embedding}}{\#\,\text{true target nodes}}. \tag{6}$$

Compressibility of state-action space
In an ideal RL setting, the highest reward value $Q(s, a)$ for a given action space will be highly concentrated over the best action option. We can conceptualize this scenario using a Gaussian distribution with mean represented by the reward value of the best action and minimal variance. We favor node embeddings that concentrate favorable actions, for example by tightly clustering target nodes in the border set.
This entropy minimization concept is illustrated by Fig. 7. In Fig. 7a, the entropy of $Q(s = G_t, a = u)$ is higher than in Fig. 7b, causing the policy $\pi(a = u \mid s = G_t)$ to have higher variance and a lower probability of successfully selecting the "best" node.
To measure how the embedding algorithm supports this entropy minimization principle, we look at the variance over node rankings in the embedding space for each target cluster and compute the entropy over $B_t$, the set of target nodes in the border set at time step $t$.

Robustness to increasing signal complexity
A good embedding algorithm allows the discovery agent to stay robust as the strength of the signal deteriorates and its complexity increases. We consider two parameters, the strength of the background class and the strength of the target class, and vary them to explore regions of both high and low signal-to-noise ratio (SNR). To capture robustness, we examine the sensitivity to target and background model parameters of the Area Under the Curve (AUC), the accuracy metric aggregated over discovery time-steps $t$.

Learning convergence time
A useful embedding algorithm reduces the number of episodes required to learn effective discovery policies. The embedding algorithm does this by mapping network states to fewer canonical representations that aid policy learning. Here we estimate improvements in NAC's convergence rates without and with access to the embedding steps. Results are shown in Fig. 6. (8) AUC = Fig. 6 NAC convergence behaviour with and without embedding. In a we see the impact of the large state space on the convergence rate when the agent does not use an embedding of the network state. Given the size of the state space, there is low probability that the agent will observe the same state multiple times, and therefore learning is much slower and less generalizable. In contrast, we observe in b, c how the convergence rate increases as the quality of the state embedding increases with respect to the target task Fig. 7 Illustration of how quality of node embedding affects the quality of policy and reward functions; a highly uncertain reward and policy functions, b highly concentrated reward and policy functions. Ideally, our policy is distributed around the best action with minimal variance, so b is preferred
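As a small worked example of the consistency accuracy and its aggregate over discovery steps, the sketch below scores a sequence of hypothetical rankings. It assumes that the denominator is the number of true target nodes supplied by the caller and that the AUC is a plain, unnormalized sum over time steps. The function names and toy rankings are illustrative only.

```python
def accuracy_at_step(ranked_nodes, target_nodes, k):
    """Fraction of the true target nodes that appear among the top-k ranked nodes."""
    targets = set(target_nodes)
    if not targets:
        return 0.0
    hits = sum(1 for u in ranked_nodes[:k] if u in targets)
    return hits / len(targets)

def area_under_curve(accuracies):
    """Aggregate the per-step accuracies (assumed here to be a plain sum)."""
    return sum(accuracies)

# Hypothetical example: nodes 1-4 are targets, nodes 8 and 9 are background.
targets = [1, 2, 3, 4]
rankings_per_step = [
    [1, 9, 2, 8, 3],  # step 1: two targets in the top 4
    [1, 2, 3, 9, 4],  # step 2: three targets in the top 4
    [1, 2, 3, 4, 9],  # step 3: all four targets in the top 4
]
accuracies = [accuracy_at_step(r, targets, k=len(targets)) for r in rankings_per_step]
auc = area_under_curve(accuracies)
```

In this toy run the per-step accuracies are 0.5, 0.75, and 1.0, so the aggregate is 2.25. An embedding that ranks target nodes first from the earliest steps would score higher.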

Data generation for embedding analysis
We consider the following learning setting for our embedding analysis: the target class is a set of disjoint dense subgraphs within a background network. A variety of background and anomaly (target) densities are tested. For each learning step, the incomplete graph is embedded and a ranking for all observed nodes is computed and scored. An optimal policy is defined as one that completes the selective-harvesting task in the minimal number of steps. We utilize ground truth to navigate the graph optimally. For illustration, in the setting of two anomalous subnetworks with 40 nodes each, separated by 2 hops, a perfect traversal is 81 steps long (the 40 + 40 target nodes plus the one intermediate node on the 2-hop path between the subnetworks).
In our experiments, we consider the following parameters: each anomalous subnetwork has 40 vertices, and the background network consists of 2000 vertices. We use a stochastic block model (SBM) to generate background instances at various densities. Each background instance contains two communities with intra-community edge probability $p_1 = p_2 = 0.25$ and inter-community edge probability $r$ in the range {0.01, 0.025, 0.05, 0.075, 0.1}. We use the Erdős-Rényi (ER) model to generate anomalous subnetworks with edge probability $p_t$ in the range {0.25, 0.5, 0.75, 1}. For each unique set of parameters, we generate 10 graph instances, leading to a total of 200 graph instances.
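The generation procedure described above can be sketched in plain Python. Several details are assumptions made only for illustration: the two communities have equal size, the anomalous vertices are drawn from the background vertices, ER edges are added on top of the existing background edges, and the 2-hop separation between subnetworks is not enforced. All function names are hypothetical.

```python
import random
from itertools import combinations, product

R_VALUES = [0.01, 0.025, 0.05, 0.075, 0.1]  # inter-community probabilities r
PT_VALUES = [0.25, 0.5, 0.75, 1.0]          # anomaly edge probabilities p_t
INSTANCES_PER_SETTING = 10

def sbm_two_communities(n, p_intra, r_inter, rng):
    """Sample an SBM with two equal-size communities; edges are pairs (u, v), u < v."""
    half = n // 2
    edges = set()
    for u, v in combinations(range(n), 2):
        same_community = (u < half) == (v < half)
        if rng.random() < (p_intra if same_community else r_inter):
            edges.add((u, v))
    return edges

def plant_er_subnetwork(edges, nodes, p_t, rng):
    """Add ER(p_t) edges among `nodes` on top of the existing edges (assumption)."""
    for u, v in combinations(sorted(nodes), 2):
        if rng.random() < p_t:
            edges.add((u, v))

def generate_instance(n, sub_size, p_intra, r_inter, p_t, seed):
    """One background graph with two disjoint planted target subnetworks."""
    rng = random.Random(seed)
    edges = sbm_two_communities(n, p_intra, r_inter, rng)
    chosen = rng.sample(range(n), 2 * sub_size)
    targets = [chosen[:sub_size], chosen[sub_size:]]
    for group in targets:
        plant_er_subnetwork(edges, group, p_t, rng)
    return edges, targets

# Full parameter grid: 5 values of r x 4 values of p_t x 10 instances.
settings = [(r, p_t, i)
            for r, p_t in product(R_VALUES, PT_VALUES)
            for i in range(INSTANCES_PER_SETTING)]

# Reduced-size demo; the experiments above use n = 2000 and sub_size = 40.
edges, targets = generate_instance(n=200, sub_size=40, p_intra=0.25,
                                   r_inter=0.05, p_t=1.0, seed=0)
```

The grid enumerates 5 × 4 × 10 = 200 settings, matching the number of graph instances stated above.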

Empirical analysis of embedding algorithms
We summarize our empirical evaluation of a few embedding algorithms.
Consistent embedding is analyzed in Fig. 8. Within the two target subnetworks, we observe that walk-based methods, PPR and MOD, do a fairly good job in prioritizing target nodes for subsequent selection. PPR is much more robust as the strength of the anomalous subnetwork weakens relative to the background network. Node2vec, by contrast, struggles in the same regions, though it seems to do slightly better once the agent discovers the second subnetwork. It is possible that increasing the dimensionality of the feature vectors would lead to improved performance; however, this method is computationally intensive, since we compute embeddings over many learning iterations and many graph instances. Overall, across all the embeddings, we observe a strong drop in performance when transitioning between exploitation and exploration regimes. We observe similar embedding sensitivity to decreasing levels of SNR. These observations highlight the role of offline policy learning in recognizing and adapting to changing discovery regimes and sparse task-related signals.
Compressibility of state-action space is illustrated in Fig. 9. Similar to the accuracy metric, PPR and MOD appear to do the best job of quickly collapsing to a set of node positions as enough target nodes are collected. Again, node2vec appears to be a poor choice, but does exhibit some compressibility in the higher SNR cases. The graph (a.k.a. matrix) factorization approaches appear to follow the expected trend of degrading in performance with a reduction of SNR.
Robustness to increasing signal complexity is demonstrated by Fig. 10, which represents the integrated accuracy over the entire selective harvesting task. The same trends discussed in the accuracy section are illustrated here. This figure delineates, at an aggregate level, the network topology characteristics where simple embedding heuristics are sufficient to support effective selective harvesting (lighter color regions) and those topology characteristics where offline planning is required.
Learning convergence time is analyzed in Fig. 6, which shows the performance of each embedding paired with policy learning after N episodes of training. We demonstrate in Fig. 6a that without embedding, the convergence time is likely to be very long and requires a high-capacity network. We show representative embeddings from the walk-based and factorization approaches in Fig. 6b, c, and observe that they converge in a way consistent with their entropy and accuracy scores. We illustrate this by analyzing the test case described in Fig. 4a, but the behavior is consistent across all the test cases considered. Overall, we observe that embedding quality directly impacts convergence time and ultimately the ability of the discovery algorithm to achieve the downstream task objective under budget and resource constraints.
Fig. 8 The anomaly density value for each plot represents the edge density of the anomaly or target subnetwork and P represents the density of the background network. The anomalies get sparser from left to right, causing the discovery task to increase in complexity and, in general, the performance of each embedding to diminish. On the other hand, for a fixed anomaly density value, decreasing the background density P makes the discovery task easier. We observe that PPR, MOD, GLEE, and Laplacian embedding all perform well on the task, with PPR maintaining the best performance. Note that the sharp dips in the plots correspond to regions where the agent has to travel from the first anomalous subnetwork to the second and there are no target nodes in the boundary set
Across the various evaluation metrics and learning regimes, we consistently observe PPR outperforming other embedding algorithms in augmenting discovery policy learning for selective harvesting. The success of PPR across the various evaluation metrics could be explained by the characteristics shared between the selective harvesting task and the PPR algorithm. Both rely on the concept of exploring local, relatively dense neighborhoods from a seed node. The same rationale can explain the relative success of the MOD heuristic, though MOD does not have the randomness feature that allows PPR to handle sparser distributions of target nodes. The rest of the embedding approaches lack the seed-centric embedding property and therefore never match the overall performance of PPR. Our hypothesis, however, is that consideration of alternative downstream tasks might imply a different ranking of suitable embedding methods.
Fig. 9 Here we see that MOD, PPR, LAPLACE, and GLEE are able to compress the action space, which is indicated by the sharp drops in entropy before the entire anomaly has been discovered (< 40 steps). This is especially pronounced in the easier cases, e.g. denser anomalies, illustrated in the leftmost plots for each embedding. Sharp increases in entropy values correspond to the agent moving from one anomalous subnetwork to the other, where it is faced with intermediate boundary sets with no target nodes
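The seed-centric behaviour attributed to PPR above can be illustrated with a textbook power-iteration version of personalized PageRank. This is a generic sketch, not the configuration used in the experiments. The restart probability `alpha`, the iteration count, the function name, and the toy graph are all assumptions chosen for illustration.

```python
def personalized_pagerank(neighbors, seed, alpha=0.15, iterations=100):
    """Power iteration for a random walk that restarts at `seed` with probability alpha.

    `neighbors` maps each node to a list of its neighbors; every node is assumed
    to have at least one neighbor so that probability mass is conserved.
    """
    scores = {u: 0.0 for u in neighbors}
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in neighbors}
        for u, mass in scores.items():
            share = (1.0 - alpha) * mass / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        nxt[seed] += alpha  # restart mass always returns to the seed
        scores = nxt
    return scores

# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
ppr = personalized_pagerank(neighbors, seed=0)
ranking = sorted(neighbors, key=lambda u: -ppr[u])
```

Nodes 1 and 5 have the same degree, but node 1 shares a triangle with the seed and receives a higher score. This locality is the property that favors dense neighborhoods around already-discovered target nodes.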

Conclusions and future work
We introduced NAC, a deep RL framework for task-driven discovery of incomplete networks. NAC learns offline models of reward and network-discovery policies based on synthetically generated training data. NAC is able to learn effective strategies for the task of selective harvesting, especially for learning scenarios where the target class is relatively small and difficult to discriminate. We show that NAC strategies transfer well to unseen and more complex network topologies including real networks. We analyze various network embedding algorithms as mechanisms for supporting fast navigation through the large network state space. Across several metrics of evaluation, we identify personalized Pagerank as a robust network embedding strategy that best supports planning for the task of selective harvesting. We leave analysis of alternative downstream tasks and their respective suitable network embedding for future work.
Our approach opens up many interesting avenues for future research. The effectiveness and convergence of our algorithm rely on being trained on sufficiently representative training data. It is valuable to further explore and quantify the limits of transferability of synthetically generated training sets. Our current framework is flexible enough to incorporate additional discovery strategies generated from other methods as part of the offline training process. This feature can lead to more efficient discovery strategies, but we leave that careful analysis for future work. Additionally, for convenience and processing speed, we chose to encode our policy with a standard convolutional neural network. Understanding the impact of utilizing alternative neural network designs, such as graph convolutional networks, is an interesting future research direction. Finally, the NAC framework is general enough to support discovery for other network learning tasks. It is valuable to explore how a different learning objective changes the training, convergence, and generalizability requirements.
In general, we observe that PPR performs best in all scenarios presented in the "Data generation for embedding analysis" section.