A multi-armed bandit approach for exploring partially observed networks

Real-world networks such as social and communication networks are too large to be observed entirely. Such networks are often partially observed such that network size, network topology, and nodes of the original network are unknown. Analysis on partially observed data may lead to incorrect conclusions. We assume that we are given an incomplete snapshot of a large network and that additional nodes can be discovered by querying nodes in the currently observed network. The goal is to maximize the number of observed nodes within a given query budget: querying which set of nodes maximizes the size of the observed network? We formulate this problem as an exploration-exploitation problem and propose a novel nonparametric multi-armed bandit (MAB) algorithm for identifying which nodes to query. The proposed algorithm outperforms existing state-of-the-art algorithms by discovering over 40% more nodes in synthetic and real-world networks. Moreover, we provide a theoretical guarantee that the proposed algorithm has sublinear regret. Our results demonstrate that multi-armed bandit based algorithms are better suited for exploring partially observed networks than heuristic based algorithms.


Introduction
Interactions among different entities in many real-world complex systems are often represented by networks, where the entities are represented by nodes and the interactions among them are represented as links between nodes. For example, the information contained in online social networks has proved to be valuable in advertising applications such as finding influential users for targeted marketing. Usually, data acquisition is done using Application Programming Interfaces (APIs) offered by the respective social networking services. Using these APIs is often time consuming and the number of nodes (e.g., profiles) that can be queried within a given time is restricted. A poorly constructed incomplete network will lead to inaccurate findings. This highlights the importance of acquiring as much information as possible using a limited number of queries.
Here, we provide an overview of the Adaptive Graph Exploration problem. We formally define it in "Proposed bandit based probing method" section. Suppose we are given a partially observed network, for instance, a sample of a social network collected by a researcher. Since we do not know how this sample was obtained, the only way to enhance it is by acquiring data belonging to the unseen portion of the network. We use the term probing to refer to querying a node to retrieve information about it and its neighborhood. As an example, probing a node of a social network corresponds to querying a social network API to obtain information about a user profile and its friends (or followers). Continuing this process for several rounds introduces new user profiles (nodes) and their friends (neighboring nodes). If we were allowed to continue probing indefinitely, we might be able to collect information about all the users. Still, by the time we are done, there is a high chance that new users have joined the network and new friendships have been formed. In reality, we are restricted by constraints enforced by such social network APIs. For example, Twitter limits most of its API requests to a maximum of 15 requests within a 15 min time window. We introduce this constraint as a probing budget, the maximum number of times the network is allowed to be probed. Thus, our objective is to enhance the observed graph as much as possible within this probing budget.
A straightforward approach to reduce the incompleteness of a partially observed network is to assume that the network has been generated by a certain network model and use this model to predict the properties of the unobserved portion of the network. For example, Kim and Leskovec (2011) fit a Kronecker model to the observed part of the network and use this fitted model to predict unseen parts of the network. However, this is not always practical for real-world networks as such methods require more structural information (e.g., number of nodes) about the original network. Another approach is to acquire more information about the network by progressively querying the observed network, which we call probing in this paper. Existing heuristic algorithms such as maximum observed degree (MOD) probing and maxreach (Soundarajan et al. 2016) require the sample to be obtained in a certain way (e.g., uniform edge sampling). In "Experiments" section, we show that existing probing algorithms do not generalize to incomplete networks obtained by different sampling techniques. Furthermore, many real-world networks consist of communities, densely connected regions of nodes. With empirical results, we show that heuristic probing algorithms get stuck inside communities, making them worse than probing a node at random.

Contributions
A high level overview of the proposed adaptive probing algorithm is illustrated in Fig. 1. The probing pipeline consists of two major steps, obtaining a feature representation of the observed network and a model which predicts the reward a node will reveal (e.g., the true degree of that node) based on its feature vector. The key assumption of using a learning model is that nodes with similar features in the observed network will result in similar rewards. Our choice of graph features is motivated by previous work on inferring structural role (Henderson et al. 2012) and social status (Zhao et al. 2013) of nodes in social networks.
One property which makes the estimation of rewards different from a normal prediction problem is that our training data is accumulated over the process of probing. A greedy strategy of probing nodes with similar features all the time may lead to sub-optimal results. This situation is known in the reinforcement learning literature as the exploration-exploitation trade-off. Multi-armed bandits (Robbins 1952) are a generic way to approach real-world exploration-exploitation problems. In this context, exploitation corresponds to selecting the node which has the largest expected reward and exploration corresponds to selecting some other node for probing. In this paper, we express the adaptive graph exploration problem as a multi-armed bandit problem in a non-stationary environment.
In this manuscript, we extend the approach proposed in Madhawa and Murata (2018). Our contributions are listed below. We mark the contributions which are new additions to the work mentioned in the conference version (Madhawa and Murata 2018) with a *.
1. A generic approach for enhancing partially observed networks which does not require any prior knowledge about the network.
2. A novel non-parametric upper confidence bound (UCB) algorithm (iKNN-UCB) to solve the multi-armed bandit (MAB) problem when the arms are represented in a vector space.
3. We provide a proof that the regret of the proposed bandit algorithm is sublinear.*
4. Using the iKNN-UCB algorithm on synthetic networks and real-world networks from different domains, we demonstrate that our proposed method performs significantly better than existing methods.
5. With experiments on different network models, we demonstrate that certain partially observed networks correspond to non-stationary reward distributions.*

Organization
The rest of this manuscript is structured as follows. In "Background" section, we provide an extensive review of related work. "Proposed bandit based probing method" section starts with the problem definition and describes our approach in detail. "Experiments" section explains the experimental setup and the data sets being used. Then, in "Results" section we present empirical evaluations of our bandit algorithm using real-world networks as well as synthetic networks. Finally, we conclude the paper with "Conclusions" providing a brief discussion of the proposed bandit approach and a few promising directions as future work.

Network crawling and sampling
Network sampling methods are a potential approach to the problem introduced in the above section. However, common sampling techniques such as uniform node sampling and uniform edge sampling are not suitable since uniform sampling depends on access to the space of all available nodes (Ahmed et al. 2014). It is not practical to assume that we can know the number of nodes or edges of a partially observed network. In contrast, the objective of our problem is to improve a given incomplete network, and we have no knowledge of how the sample was obtained. In particular, snowball sampling (Lee et al. 2006) and random walk based sampling algorithms (Cooper et al. 2016) can be used when the information about the complete network is not accessible. However, such algorithms suffer from the same drawback as heuristic algorithms: they do not adapt as the observed information is updated. As another related problem, link prediction (Liben-Nowell and Kleinberg 2007) can be used to predict missing links in a network, but it is not capable of predicting missing regions of nodes. The observed sample can be further enhanced by iteratively querying observed nodes and adding their neighboring nodes to the sample. Avrachenkov et al. (2014) propose Maximum Expected Uncovered Degree (MEUD), a greedy algorithm for selecting which node to probe next. However, this algorithm requires the degree distribution of the original network to be known. When this requirement is not fulfilled, it reduces to the Maximum Observed Degree (MOD) algorithm, which greedily chooses the node with the largest observed degree. We use MOD as a baseline algorithm in our experiments and show that our proposed algorithm significantly outperforms MOD in synthetic and real-world networks.

Active search
Active search on graphs (Bilgic et al. 2010) is another related problem with the objective of finding as many target nodes as possible possessing a given property. Most of the previous work relating to this problem assumes that the complete graph is observable and any node can be queried to find its label (Ma et al. 2015). If only an incomplete view is available, an approach relying only on the observed information may not obtain the best possible reward. In addition to exploitation of the best option according to available information, exploration of other possible options is performed to achieve better rewards. A common approach to balancing the exploitation vs exploration trade-off is formulating it as a multi-armed bandit (MAB) problem (Mahajan and Teneketzis 2008). SN-UCB1 (Bnaya et al. 2013) and NETEXP (Singla et al. 2015) are such MAB based active search algorithms proposed for partially observed networks. NETEXP assumes that probing a node reveals its 2-hop neighborhood, which is not true for real-world social networks. If the observability is restricted to the 1-hop neighborhood of nodes, this algorithm reduces to random neighbor probing. SN-UCB1 does not provide a significant improvement over the existing heuristic methods. Recently, Soundarajan et al. (2017) proposed ε-WGX, a multi-armed bandit approach to solve the Active Edge Probing (AEP) problem in incomplete networks. Though AEP looks similar to the problem discussed in this paper, it is fundamentally different from ours as a node can be probed multiple times and only one neighboring edge is revealed in each probe. Hence, this problem can be considered as a restricted version of the problem we are dealing with in this paper.

Multi-armed bandits
Multi-armed bandits (MAB) (Robbins 1952) are a generic framework used to systematically define the exploitation vs exploration trade-off. The classic k-armed bandit problem is modeled after a gambler trying to maximize the profit by choosing which slot machines (known as bandits) to play. Playing a bandit results in a reward which is assumed to be sampled from a probability distribution specific to that bandit. In a decision problem with a discrete set of available actions, choosing an action corresponds to playing an arm in a multi-armed bandit problem. A variety of bandit algorithms are being used to solve a multitude of real-world optimization problems such as recommender systems (Li et al. 2010) and display advertising (Lu et al. 2010). Out of all existing approaches to the MAB problem, upper confidence bound (UCB) (Auer 2002) methods possess the best theoretical guarantee in maximizing the reward. UCB algorithms are based on the principle of "optimism in the face of uncertainty"; actions are chosen based on optimistic guesses of how much reward choosing a particular action may bring in. If choosing that action results in a reward which is less than expected, then the confidence placed on that action is decreased.
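As an illustration of "optimism in the face of uncertainty", the following is a minimal sketch of a UCB1-style rule on a toy problem. The Bernoulli arms, their means, the horizon and the function name `ucb1` are assumptions made for this example only; they are not part of the experiments in this paper.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run a UCB1-style rule on Bernoulli arms; return the pull count per arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of times each arm was played
    totals = [0.0] * n_arms    # sum of rewards collected from each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # play every arm once to initialise the estimates
        else:
            # optimistic score: empirical mean plus a confidence radius that
            # shrinks as the arm is played more often
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # toy environment (assumption): Bernoulli reward with the arm's mean
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


counts = ucb1([0.2, 0.5, 0.8], horizon=2000)
```

In this toy setting, the arm with the highest mean should end up with by far the most pulls, while the other arms are still tried occasionally because their confidence radius grows with time.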
Algorithms for classic MAB problems decide which action to play based only on the distribution of rewards observed by choosing that action. Since such algorithms do not take any contextual information into consideration, they are known as context-free algorithms. However, in real-world optimization problems such as movie recommendation, an action can be represented by its features. For example, in a movie recommendation problem, a movie has a multitude of features (e.g., genre, year, etc.). A variant of MAB problems, contextual bandits (Li et al. 2010), leverages the features describing an action, known as the context of an action. In addition to the reward, contextual bandits observe the context as a feature vector of each action (bandit). The context vectors are used to calculate the expected reward of each action and the action with the largest expected reward is chosen. As an example, LinUCB (Li et al. 2010) models the expected reward as a linear regression on context vectors.

Proposed bandit based probing method
We start this section with the formal definition of the problem. Then we describe the main components of this work and the multi-armed bandit algorithm in detail.

Problem definition
Suppose there is a large unweighted undirected graph G which cannot be observed fully; instead, only a partially observed subgraph of G is available. We denote the initial incomplete network as $G_0$ (at time t = 0). Our goal is to grow this network by probing any of the observed nodes at each time step. Using this notation, we denote the observed network at time t as $G_t$. Table 1 lists the notation that we will be using in this section.

Definition 1 Probing a node reveals all links incident to it and the identity of its neighboring nodes.
The number of times we are allowed to probe the network is constrained by the probing budget ($T \in \mathbb{Z}^+$). At any time step, each node of G is in one of the following three states:
1. unobserved: the existence of these nodes is not visible to the algorithm.
2. observed: these nodes exist in both $G$ and $G_t$, but have not been probed.
3. probed: the algorithm knows about these nodes and their neighboring nodes.
Figure 2 illustrates an example of an incomplete network. We use bold lines to denote observed links and dashed lines to denote unobserved links at the given moment. Even though nodes $V_1$ and $V_2$ are observed when node U is probed, the link $[V_1, V_2]$ is not observed because neither node has been probed.
We consider a node which has been probed by the algorithm as an observed node as well. Hence, all the nodes in the graph $G_t$ are referred to as observed nodes. Any observed node which has not been probed is considered a candidate for probing. Hence, we refer to such nodes as candidate nodes. In the beginning, all the nodes in the given sample are candidate nodes. Probing a candidate node reveals a reward (e.g., the true degree of the node). Our goal is to iteratively select T candidate nodes, one per probe, that maximize the cumulative reward (i.e., the number of observed nodes).
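The effect of a single probe (Definition 1) can be sketched in a few lines. This is only an illustrative sketch: the adjacency-dictionary representation, the function name `probe`, and the toy graph (loosely modeled on the description of Fig. 2, with hypothetical links from the V nodes to the X nodes) are assumptions, and the reward is taken to be the number of newly observed nodes.

```python
def probe(hidden_adj, observed_adj, probed, node):
    """Probe `node`: reveal all links incident to it and its neighbors.

    hidden_adj   -- adjacency dict of the full graph G (unknown to the prober)
    observed_adj -- adjacency dict of the observed graph G_t, updated in place
    probed       -- set of already probed nodes, updated in place
    Returns the reward, here the number of newly observed nodes.
    """
    new_nodes = 0
    for nbr in hidden_adj[node]:
        if nbr not in observed_adj:
            observed_adj[nbr] = set()
            new_nodes += 1
        observed_adj[node].add(nbr)
        observed_adj[nbr].add(node)
    probed.add(node)
    return new_nodes


# Toy hidden graph (hypothetical edges beyond U-V_i and V1-V2)
hidden = {
    "U": {"V1", "V2", "V3", "V4"},
    "V1": {"U", "V2", "X1"},
    "V2": {"U", "V1", "X2"},
    "V3": {"U", "X3"},
    "V4": {"U", "X4"},
    "X1": {"V1"}, "X2": {"V2"}, "X3": {"V3"}, "X4": {"V4"},
}
observed = {"U": set()}
probed = set()
reward_u = probe(hidden, observed, probed, "U")
candidates = set(observed) - probed          # observed but not yet probed
link_v1_v2_seen_after_u = "V2" in observed["V1"]
reward_v1 = probe(hidden, observed, probed, "V1")
```

After probing U, the four V nodes become candidate nodes but the link between V1 and V2 is still hidden; it only appears once V1 (or V2) is probed.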

Calculation of expected reward of a candidate node
Fig. 2 Example of an incomplete network. The black node U is probed and the gray nodes $V_1, \dots, V_4$ are observed. The white nodes $X_1, \dots, X_4$ exist in the original network G but are yet to be observed

Instead of using a heuristic metric to choose a candidate node for probing in each time step, we treat this problem as a learning problem. Similar to an active exploration algorithm, our proposed solution consists of three high-level steps (Pfeiffer III et al. 2014): probing, learning, and prediction. Probing a node results in additional information about the observed network. Information about the currently observed network is leveraged to learn a predictive model which predicts the expected reward of a given candidate node in the future. Our approach assumes that candidate nodes with similar structural neighborhoods will result in similar rewards.
Suppose that the feature vector of a candidate node j at time t is $x_{j,t} \in \mathbb{R}^d$. The learner probes node j at time t and observes the following reward:
$$r_{j,t} = f(x_{j,t}) + \zeta_t$$
where $f : \mathcal{X} \to \mathbb{R}$ gives the expected reward of a given node and $\zeta_t$ is sub-Gaussian white noise. Here, f can be any function which can compute the expected reward of a node given its features (e.g., linear regression).

Assumption 1 (λ-Hölder continuity of f): There exist constants $C_H > 0$ and $0 < \lambda \le 1$ such that, for all $x, x' \in \mathcal{X}$,
$$|f(x) - f(x')| \le C_H \, D(x, x')^{\lambda}$$
where D is a metric which defines the "distance" between two vectors $x$ and $x'$.
Assumption 1 expresses that the difference between the values of the regression function f at two points $x$ and $x'$ depends on the distance between the two points, $D(x, x')$. In our problem setting, if two nodes are close with respect to the distance measure D, their rewards are assumed to be similar. This is a standard smoothness condition used in regression (Chen et al. 2018; Jiang 2019). Lipschitz continuity is a special case of Hölder continuity when λ = 1. In the following section, we provide a detailed description of how we formulate this problem as a multi-armed bandit problem.

Problem setting
In the classical multi-armed bandit problem, an agent selects one of the K arms (or actions) at each time step and observes a reward depending on the chosen action. The goal of the agent is to play a sequence of actions which maximizes the cumulative reward it receives within a given number of time steps. The classical K-armed bandit problem comes with the following assumptions:
1. The set of arms K does not change over time.
2. Each arm is independent.
3. The rewards are drawn randomly from a probability distribution that is specific to each arm.
4. The environment is stationary. The reward distribution of arms does not change over time.
Selecting a node from the set of candidate nodes at time step t for probing is similar to pulling an arm in a multi-armed bandit problem. However, our problem does not satisfy the assumptions mentioned above. For example, the set of candidate nodes changes as we keep adding new nodes to the partially observed network during the exploration process. More importantly, probing a node for the second time does not reveal any additional information. Hence, playing an arm once again is a waste of time. In addition, it is not safe to assume that the reward distribution stays stationary over time. We address all of these concerns in our proposed algorithm.
As the independence assumption does not hold in our problem setting, it is more suitable to express it as a structured bandit problem, in which the reward distributions of arms are not independent, but interrelated. In a structured bandit problem, the agent deduces relationships between arms based on a d-dimensional feature vector $x_a \in \mathbb{R}^d$ belonging to each arm a.

KNN-UCB algorithm for structured bandits
The linear bandit model (Rusmevichientong and Tsitsiklis 2010; Dani et al. 2008), the simplest among such models, assumes that the reward of choosing an arm is linearly dependent on its features. In linear bandits, the expected reward of an arm is calculated as the inner product of its feature vector and a parameter vector θ. However, real-world data often exhibit more complicated relationships than a linear one. Therefore, we choose k-nearest neighbor (k-NN) regression to estimate the expected reward of arms. To introduce exploration into the solution, we extend the k-armed KNN-UCB algorithm of Guan and Jiang (2018) to the structured setting. As explained in "Multi-armed bandits" section, upper confidence bound (UCB) algorithms (Auer 2002) incorporate an exploration term by calculating an upper confidence bound for each arm and choosing the action corresponding to the largest bound.
We define the k-nearest neighbor upper confidence bound (iKNN-UCB) rule as
$$a_t = \arg\max_{i \,\in\, \text{candidate nodes}} \left[ \hat{f}_{t,k}(x_{i,t}) + \alpha \, U_{t,k}(x_{i,t}) \right]$$
where $\hat{f}_{t,k}(x)$ is the k-NN estimate of the expected reward of point x, $U_{t,k}(x)$ is the uncertainty score of point x, and α > 0 is a constant determining the amount of exploration. If α = 0, the uncertainty score is ignored and the algorithm becomes a greedy algorithm which performs exploitation all the time.
To address the issue of non-stationarity of the environment, we consider only the most recent observations within a time window of size τ. In this setting, k-NN regression considers only the τ most recently observed nodes and their rewards in computing the expected reward for a new node. This approach is motivated by the sliding window UCB (SW-UCB) algorithm proposed by Garivier and Moulines (2011). If τ ≥ T, then the resultant UCB algorithm is the same as the usual stationary UCB algorithm.

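Definition 3 (k-nearest neighbor regression (Jiang 2019)). Let the k-NN radius of $x_i$ be $r_k(x_i) = \inf\{ r : |B(x_i, r)| \ge k \}$, where $B(x_i, r) = \{ x_j \in X_t : D(x_i, x_j) \le r \}$ and $X_t$ is the set of feature vectors of the (at most τ) most recently probed nodes at time t, and let the k-NN set of $x_i$ be $N_k(x_i) = B(x_i, r_k(x_i))$. The k-NN regression estimate at $x_i$ is
$$\hat{f}_{t,k}(x_i) = \frac{1}{|N_k(x_i)|} \sum_{x_j \in N_k(x_i)} y_j$$
where $y_j$ is the observed reward for $x_j$ and $D(x_i, x_j)$ is the Euclidean distance between feature vectors $x_i$ and $x_j$.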
We define $\sigma(x)$ as the average distance to points in the k-neighborhood,
$$\sigma(x) = \frac{1}{|N_k(x)|} \sum_{x_j \in N_k(x)} D(x, x_j)$$
where $N_k(x)$ denotes the k-neighborhood of x. The uncertainty term $U_{t,k}(x_i)$ plays a role analogous to the term $T_i(t)$ in a finite-arm MAB problem, the number of times action i has been chosen by time t. If $U_{t,k}(x_i)$ is large, the k-neighborhood of node i is dispersed over a larger space. On the other hand, if $\sigma(x_i)$ is small, we have already observed nodes close to node i. Hence, the exploration term gives more weight to less-observed neighborhoods. Algorithm 1 shows how a given network is probed using the proposed iKNN-UCB algorithm.
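The score of a single candidate can be sketched as follows. This is a minimal sketch under explicit assumptions rather than the exact implementation: the uncertainty score is taken to be σ(x) itself (the average distance to the k nearest observed points), ties in distance are broken arbitrarily so that exactly k neighbors are used, and the function name `knn_ucb_score` and the toy history are hypothetical.

```python
import math


def knn_ucb_score(x, history, k, alpha, tau):
    """Score a candidate feature vector x with a k-NN UCB-style rule.

    history -- list of (feature_vector, reward) pairs in probing order
    k       -- number of nearest neighbors
    alpha   -- exploration weight
    tau     -- sliding window size: only the tau most recent pairs are used
    """
    window = history[-tau:] if tau > 0 else []
    # k nearest previously probed points under Euclidean distance
    nearest = sorted((math.dist(x, xj), yj) for xj, yj in window)[:k]
    if not nearest:
        return float("inf")  # nothing observed yet: force exploration
    mean_reward = sum(y for _, y in nearest) / len(nearest)   # k-NN regression
    avg_distance = sum(d for d, _ in nearest) / len(nearest)  # sigma(x)
    # assumption: the uncertainty score U is taken to be sigma(x)
    return mean_reward + alpha * avg_distance


history = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((5.0, 5.0), 10.0)]
greedy = knn_ucb_score((0.0, 0.0), history, k=2, alpha=0.0, tau=10)
optimistic = knn_ucb_score((0.0, 0.0), history, k=2, alpha=1.0, tau=10)
recent_only = knn_ucb_score((0.0, 0.0), history, k=2, alpha=0.0, tau=1)
```

With α = 0 the score is the plain k-NN estimate; a positive α raises the score of points whose neighbors are far away, and a small τ makes the estimate depend only on the latest observations.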

Algorithm 1: iKNN-UCB.
Input: incomplete network $G_0 = (V_0, E_0)$, probing budget $T \in \mathbb{Z}^+$, exploration parameter $\alpha$, temporal window size $\tau$, $k$, $T_0$
Output: a sequence of T nodes to probe
Initialize: candidate nodes = $V_0$
...
Add neighboring nodes $N_{a_t}$ of node $a_t$ to the incomplete network $G_{t-1}$: $G_t = G_{t-1} \cup N_{a_t}$
Remove node $a_t$ from candidate nodes
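A possible shape of the probing loop is sketched below. Several parts are assumptions made for illustration only: the single feature (observed degree), the reward (number of newly observed nodes), the use of σ(x) as the uncertainty score, and the interpretation of $T_0$ as a number of initial random probes. The function name `iknn_ucb_explore` and the toy graph are hypothetical; the sketch requires k ≥ 1 and τ ≥ 1.

```python
import math
import random


def iknn_ucb_explore(hidden_adj, observed_adj, budget, k=3, alpha=1.0,
                     tau=50, t0=2, seed=0):
    """Probe up to `budget` nodes and return the probing order."""
    rng = random.Random(seed)
    probed, order = set(), []
    history = []  # (feature vector, reward) pairs of probed nodes, oldest first

    def features(node):
        # Hypothetical one-dimensional feature: the observed degree of the node.
        return (float(len(observed_adj[node])),)

    def score(x):
        # k-NN estimate over the tau most recent observations plus alpha * sigma(x)
        nearest = sorted((math.dist(x, xj), yj) for xj, yj in history[-tau:])[:k]
        mean = sum(y for _, y in nearest) / len(nearest)
        sigma = sum(d for d, _ in nearest) / len(nearest)
        return mean + alpha * sigma

    for t in range(1, budget + 1):
        candidates = sorted(set(observed_adj) - probed)  # observed, not yet probed
        if not candidates:
            break
        if t <= t0 or not history:
            node = rng.choice(candidates)       # assumed warm-up: random probes
        else:
            node = max(candidates, key=lambda v: score(features(v)))
        x = features(node)
        n_before = len(observed_adj)
        for nbr in hidden_adj[node]:            # G_t = G_{t-1} with N_{a_t} added
            observed_adj.setdefault(nbr, set()).add(node)
            observed_adj[node].add(nbr)
        history.append((x, float(len(observed_adj) - n_before)))
        probed.add(node)                        # remove a_t from the candidates
        order.append(node)
    return order


# Toy hidden graph (assumption): a path of 20 nodes plus three chords at node 0
hidden = {i: set() for i in range(20)}
for u, v in [(i, i + 1) for i in range(19)] + [(0, 5), (0, 10), (0, 15)]:
    hidden[u].add(v)
    hidden[v].add(u)
observed = {0: {1}, 1: {0}}
order = iknn_ucb_explore(hidden, observed, budget=5)
```

Because a probed node is never a candidate again, each arm is played at most once, which is one of the ways this setting departs from the classical bandit assumptions.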

Regret
The objective of a bandit algorithm is to select arms so as to maximize the cumulative reward over time. Minimizing the total regret is an equivalent way of expressing the maximization of cumulative reward. The regret at iteration t equals the difference between the reward of the "optimal" arm and the reward of the arm that was actually played. In simple terms, regret is the loss incurred by the policy for not playing the optimal arm all the time. In T iterations, we pull arms $a_1, a_2, \dots, a_T$ and observe rewards $r_{a_1,1}, r_{a_2,2}, \dots, r_{a_T,T}$. We use the following notion of regret:
$$R_T = \sum_{t=1}^{T} \left( r_{a_t^*,t} - r_{a_t,t} \right)$$
where $a_t^*$ denotes the optimal arm at iteration t.
Theorem 1 (Sublinear regret bound). Let M > 0 be an arbitrary constant. Suppose that Assumption 1 holds. Then the regret of iKNN-UCB is sublinear in T.
Proof The regret for bandits in a continuous feature space (Eq. 4) is bounded using the k-nearest neighbor regression rates that hold when f is λ-Hölder continuous (Assumption 1) and $k = O(t^{2\lambda/(2\lambda+d)})$ (Jiang 2019). Using this result in Eq. 4 gives a bound in which $L_t$ is used as an asymptotic constant.
Hence, the regret is sub-linear.
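To see why a k-NN rate of this kind leads to sublinear regret, the following worked bound may help. It is a sketch under a simplifying assumption of ours, not a restatement of Theorem 1: writing $R_T$ for the cumulative regret after T iterations, we assume the regret incurred at iteration t is at most $C t^{-\beta}$ with $\beta = \lambda/(2\lambda + d)$, where C is a hypothetical constant and logarithmic factors are ignored.

```latex
\begin{align*}
R_T \;\le\; \sum_{t=1}^{T} C\, t^{-\beta}
    \;\le\; C\left(1 + \int_{1}^{T} s^{-\beta}\, ds\right)
    \;=\; C\,\frac{T^{1-\beta} - \beta}{1-\beta}
    \;\le\; \frac{C}{1-\beta}\, T^{1-\beta},
\qquad 1 - \beta = \frac{\lambda + d}{2\lambda + d} < 1 .
\end{align*}
```

Since the exponent $1 - \beta$ is strictly smaller than 1, $R_T / T \to 0$ as T grows, which is what sublinear regret means. The bound also shows the cost of dimensionality: as d increases, the exponent approaches 1.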

Experiments
We construct the feature vector $x_j$ of candidate node j from features of the local neighborhood of node j in the observed graph $G_t$. All these features vary over time as new nodes and edges are added to the observed graph during the probing process. These features are chosen because their effectiveness has been shown in previous work on finding structurally similar nodes (Henderson et al. 2012).

Data
We perform experiments on various synthetic networks as well as eight publicly available real-world data sets of social and information networks (Leskovec and Krevl 2014). The datasets are briefly explained below.

Synthetic networks
Real-world networks exhibit a variety of characteristics: different degree distributions, existence of community structure, etc. Before delving into real-world networks, we generate synthetic networks exhibiting such characteristics to varying degrees. This makes it easier to understand the performance of our proposed approach in terms of network properties. We use the following network generation models:
Barabási-Albert (BA) (Barabási and Albert 1999). The BA model generates scale-free networks with power-law degree distributions. A scale-free network contains a few nodes (called hubs) with unusually high degree compared to other nodes. The BA model uses a network generation process consisting of growth and preferential attachment. This process selects neighbors for a new node with probability proportional to their degree. Consequently, the higher the degree of a node, the higher its probability of receiving edges from other nodes. This phenomenon is responsible for creating hub nodes.
Lancichinetti-Fortunato-Radicchi (LFR) (Lancichinetti et al. 2008). In addition to power-law degree distributions, real-world social and communication networks possess additional phenomena such as the existence of communities (Fortunato and Castellano 2012). Groups of nodes which are more densely connected within the same group than to nodes belonging to other groups are known as communities. Modularity is a popular metric used to measure the quality of a particular division of a network into constituent communities. Most of the community detection algorithms are based on the principle of modularity maximization (Newman 2004; Blondel et al. 2008). Thus, higher modularity is an indication of the existence of community structure.
The BA network model is not capable of generating networks having community structure. We use the Lancichinetti-Fortunato-Radicchi (LFR) benchmark to generate networks with community structure. Node degrees and community sizes of networks generated by the LFR benchmark have power-law distributions with different exponents. The mixing parameter μ of the LFR model decides the probability of a node linking to a node belonging to another community. Low values of μ result in dense communities as the chance of having intra-community links (1 − μ) is higher compared to the chance of inter-community links (μ). We create LFR benchmark networks, varying the value of μ in the range [0.1, 0.5], to investigate how our proposed model performs on networks with varying degrees of community structure. The mixing parameter and the modularity of a network are inversely related.
To make it easier to compare performance across networks generated by different algorithms, we generate all the networks with the same number of nodes (N = 34,546), the number of nodes in the HepPh citation network (described in the following section). Table 2 gives a summary of the real-world network data sets we use. We use real-world networks obtained from various domains as detailed below.
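Growth with preferential attachment can be sketched with a "repeated nodes" pool, in which every node appears once per incident edge so that uniform sampling from the pool is proportional to degree. This is an illustrative sketch, not the generator used in the experiments; the function name `barabasi_albert`, the seed-node handling and the parameter values are assumptions.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Return the edge list of a BA-style graph: n nodes, m edges per new node."""
    rng = random.Random(seed)
    edges = []
    # `pool` holds every node once per incident edge, so sampling from it
    # uniformly picks a node with probability proportional to its degree
    pool = []
    targets = list(range(m))          # the first new node links to the m seed nodes
    for new in range(m, n):
        for old in targets:
            edges.append((new, old))
        pool.extend(targets)
        pool.extend([new] * m)
        # choose m distinct targets for the next node by preferential attachment
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        targets = sorted(chosen)
    return edges


edges = barabasi_albert(1000, 3)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
average_degree = sum(degree.values()) / len(degree)
max_degree = max(degree.values())
```

Comparing `max_degree` with `average_degree` on such a graph gives a quick impression of the hubs that preferential attachment creates.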

Real-world networks
Citation networks. A citation network contains an undirected edge connecting paper i and paper j if paper i cites paper j.
Co-authorship networks. Similarly, in a co-authorship network authors are represented as nodes. Two authors are connected if they have published at least one paper together.
Social networks. A social network consists of users and relationships between them. We use a network data set obtained from the social network Twitter. This network is made of 1000 ego-networks consisting of 4869 Twitter lists (Leskovec and Mcauley 2012). An ego-network is a social circle formed among a user and her friends.
Web networks. These networks represent users and links between them, similar to a social network. However, the networks we consider here, Epinions and Slashdot, represent who-trusts-whom data of users instead of the relationships or interactions among users. Hence, we categorize them as web networks. In these networks, a user tags another user as trustworthy or not. They are sparse compared to online social networks.

Impact of initial sampling method
To investigate how the sampling method used to acquire the initial sample influences the probing methods, we generate graph samples using two sampling methods. These are the methods we use:
1. Random node sampling (RN): At each step we randomly choose one neighbor of a node already in the sample.
2. Breadth-first search (BFS): Nodes are added to the sample in the order they are observed.
We induce a subgraph on a sample of nodes obtained by any of the above methods. These subgraphs are used as the initial sample for the adaptive graph exploring problem and probing algorithms are applied to acquire more information on the original network.
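The two sampling procedures and the induced subgraph can be sketched as follows. The function names, the 5 × 5 grid used as a toy graph, and the reading of RN sampling as "pick uniformly among unsampled neighbors of the current sample" are assumptions made for this illustration.

```python
import random
from collections import deque


def bfs_sample(adj, start, size):
    """Collect `size` nodes in breadth-first order starting from `start`."""
    seen, queue, sample = {start}, deque([start]), []
    while queue and len(sample) < size:
        u = queue.popleft()
        sample.append(u)
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return sample


def rn_sample(adj, start, size, seed=0):
    """Grow a sample by repeatedly adding a random neighbor of a sampled node."""
    rng = random.Random(seed)
    sample, in_sample = [start], {start}
    while len(sample) < size:
        frontier = sorted({v for u in sample for v in adj[u]} - in_sample)
        if not frontier:
            break
        v = rng.choice(frontier)
        sample.append(v)
        in_sample.add(v)
    return sample


def induced_subgraph(adj, nodes):
    """Keep only the sampled nodes and the links among them."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


# Toy graph (assumption): a 5 x 5 grid, node id = row * 5 + column
adj = {i: set() for i in range(25)}
for r in range(5):
    for c in range(5):
        u = r * 5 + c
        if c < 4:
            adj[u].add(u + 1)
            adj[u + 1].add(u)
        if r < 4:
            adj[u].add(u + 5)
            adj[u + 5].add(u)

bfs_nodes = bfs_sample(adj, 0, 6)
rn_nodes = rn_sample(adj, 0, 6)
initial_graph = induced_subgraph(adj, bfs_nodes)
```

The induced subgraph plays the role of $G_0$: links from sampled nodes to nodes outside the sample are dropped and have to be rediscovered by probing.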

Methods
We compare the performance of our algorithm against the following algorithms.

Algorithms that do not use node features
• Random node (RW). In this trivial baseline, we select one of the candidate nodes randomly for probing.
• Maximum observed degree (MOD). This greedy algorithm chooses the node having the maximum observed degree. MOD is the MEUD algorithm proposed by Avrachenkov et al. (2014) adapted to one-hop neighborhood visibility.
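Both feature-free baselines reduce to a one-line selection rule over the candidate nodes. The sketch below assumes the observed graph is stored as an adjacency dictionary; the function names and the toy example are hypothetical, and ties in MOD are broken by node label.

```python
import random


def select_random(observed_adj, probed, rng):
    """Random baseline: pick any observed node that has not been probed."""
    candidates = sorted(set(observed_adj) - probed)
    return rng.choice(candidates)


def select_mod(observed_adj, probed):
    """MOD baseline: pick the candidate with the largest observed degree."""
    candidates = sorted(set(observed_adj) - probed)
    return max(candidates, key=lambda v: len(observed_adj[v]))


# Toy observed graph (assumption): nodes a, d and e have already been probed
observed = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}
probed = {"a", "d", "e"}
mod_choice = select_mod(observed, probed)
random_choice = select_random(observed, probed, random.Random(0))
```

Neither rule uses the rewards observed so far, which is why such baselines cannot adapt when high observed degree stops being a good predictor of new nodes.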

Algorithms that use node features
• Lin-UCB. This applies the UCB algorithm of Dani et al. (2008), assuming that the reward of an arm is linearly dependent on its feature vector.
• iKNN-UCB. This is our proposed algorithm, Algorithm 1.
For all algorithms based on k-nearest neighbor regression, we limit the number of neighbors to 20 (k = 20).

Analysis on synthetic networks
We probe incomplete BA and LFR networks obtained by RN and BFS sampling for 1000 iterations (T = 1000). We perform each experiment 10 times with different initial samples and report the average in this section. The number of nodes observed in the BA network is shown in Fig. 3. For all networks generated by the Barabási-Albert (BA) model, MOD could observe more nodes than the bandit algorithm. This observation confirms the claim of Avrachenkov et al. (2014) that MOD probing can achieve the best connected network cover for networks generated by preferential attachment processes. The recent work by LaRock et al. (2018) further generalizes this claim, stating that learning and predicting the rewards is not necessary for networks generated by the BA model.
To understand how the existence of community structure impacts the probing, we evaluate the performance of all algorithms on synthetic networks generated by different configurations of the LFR benchmark model (Lancichinetti et al. 2008). We vary the mixing parameter μ from 0.1 to 0.5, keeping all other parameters of the model constant (γ = 3, β = 1.3, average degree = 25). iKNN-UCB significantly outperforms the baselines for networks with a smaller μ. The results are shown in Fig. 4. When the initial sample is obtained by breadth-first search (BFS), iKNN-UCB outperforms all baselines by a significant margin. The gap between iKNN-UCB and the baselines is larger when the mixing parameter is small, that is, when the network has significant community structure. BFS sampling results in dense network samples. This is evident from the significantly higher clustering coefficients of BFS sample networks compared to RN samples of the same size. Probing a few nodes in such a densely connected region is enough to acquire most of the information about that region. However, greedy algorithms such as MOD are not capable of learning this and keep probing nodes which do not result in high rewards.
The experimental results on synthetic networks suggest that the iKNN-UCB algorithm can adapt to incomplete networks obtained by different sampling techniques and to networks with structural properties such as community structure.

The importance of exploration
The parameter α in the proposed UCB algorithm determines the amount of exploration performed by the algorithm. When α = 0, the resultant algorithm is equivalent to the greedy algorithm which chooses the node with the largest expected reward. In this section, we perform experiments by varying the value of α; the results are shown in Fig. 5. We observe that more exploration corresponds to better results for networks with stronger community structure (smaller μ). However, exploration is less important for networks with weaker community structure (larger μ). This is evident in Fig. 4 as well. We perform all the subsequent experiments keeping the exploration coefficient constant (α = 1).

Fig. 5 Investigating the importance of exploration. Compares the number of nodes observed by our proposed algorithm with different values of the exploration coefficient α. Each graph corresponds to an LFR benchmark network generated with the corresponding mixing parameter μ. Results indicate the average of 10 independently sampled partially observed networks

Non-stationarity of the environment
In this section, we investigate the non-stationarity of the reward distribution by varying the sliding window size τ. If the reward distribution is stationary, then the variation of performance between different values of τ should be minimal. However, in Fig. 6, we observe larger variance in the cumulative reward for LFR benchmark graphs with a larger mixing parameter, especially when sampled by BFS. A larger mixing parameter corresponds to networks with weaker community structure, hence smaller modularity. Based on these observations, for network samples with low modularity and high clustering, it is desirable to run the iKNN-UCB algorithm with a smaller sliding window, relying only on the most recent observations.

Results on real-world networks
We use the eight real-world networks mentioned in Table 2 and generate RN and BFS samples containing 5% of the nodes of the original network G. Then we perform 1000 probing steps on each graph. We perform each experiment five times initialized with different random seeds and report the average number of additional nodes observed in Figs. 7 and 8. The iKNN-UCB and Lin-UCB bandit algorithms outperform all baseline methods in networks generated by both RN and BFS sampling. Even though the Lin-UCB bandit algorithm observes as many nodes as iKNN-UCB for RN samples, its performance is worse for BFS samples. This shows that the linear model in Lin-UCB is not capable of learning the relationship between observed node features and the true degree of a node if the sample is constructed by BFS.

Conclusions
In this paper, we introduced a multi-armed bandit based exploration algorithm for partially observed incomplete networks. We proposed a novel nonparametric multi-armed bandit algorithm, iKNN-UCB, with sublinear regret. Compared to existing solutions for the Adaptive Graph Exploration problem, the proposed method does not depend on a specific heuristic. Additionally, the iKNN-UCB bandit algorithm outperforms the baseline methods irrespective of how the initial incomplete network is obtained. We provided experimental evidence for our approach using synthetic networks and a variety of real-world networks. Using different configurations of LFR benchmark networks, we observed that our algorithm outperforms all other baselines significantly, especially when the network exhibits prominent community structure. Since the reward function is independent of the probing procedure, it is easy to define a new reward function to solve a different graph exploration problem (e.g., finding a particular type of node).
In this paper, we assumed that probing a node would reveal all its neighboring nodes. However, in some real-world scenarios, only a certain number of neighbors is revealed (e.g., the follower limit in the Twitter API). As future work, we intend to explore how the current approach can be extended to such different settings of the same problem.