 Research
 Open Access
A graph exploration method for identifying influential spreaders in complex networks
Applied Network Science volume 2, Article number: 26 (2017)
Abstract
The problem of identifying the influential spreaders, i.e. the important nodes, in a real-world network is of high importance due to its theoretical interest as well as its practical applications, such as the acceleration of information diffusion, the control of the spread of a disease and the improvement of the resilience of networks to external attacks. In this paper, we propose a graph exploration sampling method that accurately identifies the influential spreaders in a complex network, without any prior knowledge of the original graph apart from the collected samples/subgraphs. The method explores the graph following a deterministic selection rule and outputs a graph sample, namely the set of edges that have been crossed. The proposed method is based on a version of the Rank Degree graph sampling algorithm. We conduct extensive experiments in eight real-world networks by simulating the susceptible-infected-recovered (SIR) and susceptible-infected-susceptible (SIS) epidemic models, which serve as ground truth identifiers of the nodes' spreading efficiency. Experimentally, we show that by exploring only 20% of the network and using the degree centrality as well as the k-core measure, we are able to identify the influential spreaders with at least the same accuracy as in the full information case, namely, the case where we have access to the original graph and compute the centrality measures in that graph. Finally and more importantly, we present strong evidence that the degree centrality, i.e. the degree of nodes in the collected samples, is almost as accurate as the k-core values obtained from the original graph.
Introduction
Understanding spreading processes in real-world complex networks is a central subject in network analysis, due to the variety of applications involved, such as the control of the spread of a disease, viral marketing, and network vulnerability to external attacks. A key role in these processes is played by the nodes with high spreading efficiency, often called influential spreaders: the nodes that are most likely to spread information or a virus to a large part of the network.
Thorough research has been carried out in order to connect the topological properties of network nodes with their spreading efficiency. Kitsak et al. (2010) proposed the k-core decomposition method (Seidman 1983) as an influential-spreader identifier, showing that the k-core values constitute a more reliable measure than degree centrality and betweenness centrality. One of the core results is that the placement of a node (a global property) is more important than its degree (a local property). That is, two nodes with the same degree but different placements in the network, where one is connected to the periphery of the network and the other to the innermost core, may not have equal spreading efficiency. Thus, highly connected nodes are not always the best spreaders, while less connected nodes that are nevertheless well connected to the core of the network may strongly affect the spreading process.
In addition, Zeng and Zhang (2013) investigated the limitations of the k-core method and proposed a mixed degree decomposition procedure which performs more efficiently than the k-core approach. Chen et al. (2012) proposed the local centrality, a semi-local centrality measure, as a trade-off between the degree centrality (a local measure) and the computationally complex betweenness centrality and closeness centrality (global measures). They showed that local centrality is more effective than the degree centrality at identifying influential nodes.
Hébert-Dufresne et al. (2013) performed a very large study of local and global centrality measures, along with a semi-local one based on the notion of community structure, that is, the number of communities that the nodes belong to. In order to define this overlapping community structure, the authors used the Jaccard coefficient: two links from a given node belong to the same community if their Jaccard coefficient is above a given threshold. Then, based on this community-structure approach, they introduced the notion of structural hubs, that is, the nodes that connect a large number of communities.
Furthermore, several algorithms have been proposed, such as LeaderRank (Lü et al. 2011), a ranking algorithm for identifying influential nodes in directed social networks. LeaderRank is a parameter-free random walk algorithm analogous to PageRank (Brin and Page 1998). Moreover, Li et al. (2014) proposed a weighted variation of LeaderRank which outperforms the original. Furthermore, in Chen et al. (2013), the authors introduced ClusterRank, a local ranking algorithm for directed graphs that takes into account the nodes' clustering coefficient, and showed that ClusterRank outperforms other approaches such as LeaderRank.
In this paper, we deal with the problem of identifying the influential spreaders of a complex network when we are not able to analyze the whole network directly, either because of its large size or because of limited computational resources, which are necessary for estimating global centrality measures or other advanced node properties. Our approach is based on graph sampling: the problem of selecting a small subgraph which preserves the topological properties of the original graph.
A first preliminary study investigating the application of graph sampling to the influential spreaders identification problem was conducted in Salamanos et al. (2016), where we studied the effectiveness of Rank Degree as an influential-spreader identifier. Rank Degree is a graph exploration sampling method which can produce representative samples/subgraphs from an unknown graph, using only local information, namely the degrees of the visited nodes (Voudigari et al. 2016; Salamanos et al. 2017).
In this paper, we extend the work in Salamanos et al. (2016) on several levels. First, we employ the susceptible-infected-recovered (SIR) and susceptible-infected-susceptible (SIS) epidemic models, a common approach in the literature, in order to define a kind of "ground truth" ranking of the graph nodes with regard to their spreading efficiency. Secondly, we perform a larger-scale simulation of the sampling algorithms, on a larger collection of datasets. Finally, we study three well-known sampling methods and experimentally show that our method significantly outperforms all of them.
The experimental results for eight real-world networks demonstrate that by exploring only 20% of the network and using the degree centrality, the k-core, as well as the betweenness centrality, we are able to identify the influential spreaders with at least the accuracy achieved in the full information case, that is, the case where we have access to the original graph and compute the centrality measures in that graph. Furthermore, we present strong evidence that the degree centrality, i.e. the degree of nodes in the collected samples, is almost as accurate as the k-core measure computed in the original graph.
The Rank Degree method
The approach we follow in this paper is based on the Rank Degree sampling algorithm, which outperforms several other well-known approaches. We concentrate our analysis on one of the Rank Degree variants, which for the rest of the paper we call maxRD. The maxRD can be summarized as follows:
Given a graph G(N,E) do the following:

1. Start s parallel graph traversals from s randomly chosen nodes.
2. For each visited node i do:
3. Select its max-degree neighbor j.
4. Visit j.
5. Repeat, without crossing the same edge a second time.
6. Halt when the number of discovered nodes has reached a given target size x.
7. Output the discovered subgraph, i.e. the set of edges that have been crossed.
Algorithm 1 presents the maxRD in detail. The main characteristic of this method is that the graph traversal is based on a deterministic selection rule (Step 10): the ranking of nodes according to their degree values (local information). The only parameter of the algorithm is the number s of initial starting nodes (seeds). The algorithm, starting from s initial nodes, performs s parallel graph traversals without crossing the same edge a second time. The visited nodes remain present in the graph and only the selected edges are removed (Step 14). Thus, any selected node can be visited many times, but not via the same paths. This edge-elimination process alters the original graph and, eventually, the nodes' degrees and ranking. Hence, neither the node degrees nor the graph itself remains stable.
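As an illustration of the traversal just described, the following is a minimal Python sketch of maxRD (our own simplified rendering, not the authors' implementation; the function name and data layout are ours). It follows the deterministic max-degree rule, eliminates each crossed edge, and falls back to a random jump when all current seeds are exhausted:

```python
import random

def max_rank_degree(adj, target_size, num_seeds=1, rng=random):
    # adj: dict mapping node -> set of neighbors (undirected graph).
    # Returns the set of crossed edges, i.e. the collected sample.
    g = {u: set(vs) for u, vs in adj.items()}  # copy: edges get eliminated
    seeds = rng.sample(sorted(g), min(num_seeds, len(g)))
    crossed, discovered = set(), set(seeds)
    while len(discovered) < target_size:
        next_seeds = []
        for u in seeds:
            if not g[u]:
                continue
            # Deterministic rule: visit the max-degree neighbor of u.
            v = max(g[u], key=lambda w: len(g[w]))
            crossed.add(frozenset((u, v)))
            g[u].discard(v)   # eliminate only the crossed edge;
            g[v].discard(u)   # both endpoints stay in the graph
            discovered.add(v)
            next_seeds.append(v)
        if not next_seeds:
            # All current seeds have no remaining edges: random jump.
            remaining = [u for u in g if g[u]]
            if not remaining:
                break
            next_seeds = [rng.choice(remaining)]
            discovered.update(next_seeds)
        seeds = next_seeds
    return crossed
```

The returned edge set induces the sample subgraph; the degree of each node within it is the local measure used throughout the paper.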
A detailed analysis of the algorithm can be found in Voudigari et al. (2016) and Salamanos et al. (2017), where we have thoroughly studied the properties and the efficiency of the algorithm as well as other variations of the selection rule.
As previously mentioned, the main phase of the algorithm is deterministic; however, in the extreme case where all the current seeds have degree equal to one, the algorithm executes a random jump (Step 16) in order not to get trapped.
As we shall see in the next sections, this simple exploration algorithm can effectively identify the influential spreaders in a complex network, by using as a measure the degree of nodes in the collected subgraphs.
Methods
Sampling methods
Apart from our method, we study Forest Fire, Metropolis-Hastings Random Walk and Metropolis-Hastings, three well-known sampling methods in the literature. The first two are graph exploration algorithms, while Metropolis-Hastings is a "centralized" algorithm, which takes as input the original graph.
The Forest Fire (FF) algorithm (Leskovec et al. 2005; Leskovec and Faloutsos 2006; Leskovec et al. 2007) starts from a randomly selected node (seed); at each step, the algorithm moves from the current set of seeds to the next one, as follows: for each node w in the current set of seeds, a random number x is generated, geometrically distributed with mean p_f/(1−p_f). Then, x outgoing edges are selected randomly from node w's outgoing edges. The end nodes of the selected edges constitute the next set of seeds. At each step, the visited nodes are considered burned and are removed from the graph; hence, they cannot be traversed a second time. The process is repeated until a sample of the requested size is reached. The parameter p_f is called the forward burning probability. According to Leskovec and Faloutsos (2006), the best performance is obtained for p_f ≥ 0.6. In our experiments, we set p_f = 0.7.
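A minimal sketch of the FF procedure on an undirected graph (our illustration; the function name is ours, and the geometric draw with mean p_f/(1−p_f) is implemented by repeated Bernoulli trials):

```python
import random

def forest_fire_sample(adj, target_size, p_f=0.7, rng=random):
    # adj: dict node -> set of neighbors. Returns the set of burned nodes.
    nodes = sorted(adj)
    burned = set()
    while len(burned) < target_size:
        # Restart from a fresh random seed if the fire dies out.
        unburned = [u for u in nodes if u not in burned]
        if not unburned:
            break
        frontier = [rng.choice(unburned)]
        burned.add(frontier[0])
        while frontier and len(burned) < target_size:
            next_frontier = []
            for w in frontier:
                # x is geometric with mean p_f / (1 - p_f).
                x = 0
                while rng.random() < p_f:
                    x += 1
                candidates = [v for v in adj[w] if v not in burned]
                rng.shuffle(candidates)
                for v in candidates[:x]:
                    burned.add(v)
                    next_frontier.append(v)
            frontier = next_frontier
    return burned
```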
Metropolis-Hastings Random Walk (MHRW) (Stutzbach et al. 2009; Gjoka et al. 2011; Li et al. 2015) is an application of the Metropolis algorithm (Metropolis et al. 1953) for uniform node sampling. It modifies the Random Walk algorithm as follows: (i) select a node x uniformly at random; (ii) select a neighbor y of x uniformly at random; (iii) find the degree of y; (iv) generate a random number p uniformly between 0 and 1. If \(p \leq \frac {degree(x)}{degree(y)}\), move to y; otherwise, remain at x.
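Steps (i)-(iv) above can be sketched directly (an illustrative rendering of ours; the acceptance rule biases the walk away from high-degree nodes, which is what yields uniform node sampling):

```python
import random

def mhrw(adj, steps, rng=random):
    # adj: dict node -> set of neighbors. Returns the visited-node sequence.
    x = rng.choice(sorted(adj))
    visited = [x]
    for _ in range(steps):
        y = rng.choice(sorted(adj[x]))
        # Accept the move with probability min(1, deg(x)/deg(y)).
        if rng.random() <= len(adj[x]) / len(adj[y]):
            x = y
        visited.append(x)
    return visited
```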
Metropolis-Hastings (MH) (Hübler et al. 2008) is a sampling algorithm that is able to collect representative graph samples of small size. Given a target sample size n, the algorithm initially selects n nodes at random, whose edges form an initial graph sample. Then, by incorporating the Metropolis algorithm together with some predefined graph properties, the algorithm adds and removes nodes from the current sample until the graph properties in question are optimized. Hübler et al. (2008) proposed several sampling strategies; one of the best uses the degree distribution as the optimized property. This is the version of Metropolis-Hastings implemented in this paper.
Epidemic models
In the absence of ground truth information regarding the nodes' spreading efficiency, several approaches have been proposed in the literature, such as the basic epidemic models, as well as the linear threshold and independent cascade models (Kempe et al. 2003). In this paper, we employed the standard epidemic models susceptible-infected-recovered (SIR) and susceptible-infected-susceptible (SIS), which simulate the spreading process in a graph (Kitsak et al. 2010; Chen et al. 2012; Hébert-Dufresne et al. 2013).
The SIR model
In the SIR model, the individuals/nodes can be in three states: susceptible (S), infected (I) and recovered (R). Each infected node can transmit the disease to any susceptible node it is connected with, with probability β per unit time (infection rate), and at the same time it can recover from the disease and become immune, with probability γ per unit time (recovery rate).
The dynamics of the SIR model exhibit a phase transition, where the control parameter of the process is the ratio T = β/γ. A critical point exists, the epidemic threshold \(T^{c} = \frac {\langle k \rangle }{\langle k^{2} \rangle - \langle k \rangle }\), beyond which an epidemic outbreak ensues. Here, 〈k〉 and 〈k^2〉 are the first and second moments of the degree distribution P(k); in our case, they correspond to the average degree and the average squared degree, respectively. For T < T^c, only a limited number of individuals are infected, whereas for T ≥ T^c (epidemic phase) an epidemic outbreak occurs and the disease infects a finite fraction of the population, the total size of the outbreak, which corresponds to a large connected component (Castellano and Pastor-Satorras 2010; Newman 2010; Pastor-Satorras et al. 2015). Finally, there is a mapping between the SIR model and bond percolation, as proved in Newman (2002).
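For concreteness, the threshold T^c above can be computed directly from a degree sequence (a small helper of our own, not from the paper):

```python
def sir_threshold(degrees):
    """Epidemic threshold T^c = <k> / (<k^2> - <k>) from a degree sequence."""
    n = len(degrees)
    k1 = sum(degrees) / n              # <k>: average degree
    k2 = sum(d * d for d in degrees) / n  # <k^2>: average squared degree
    return k1 / (k2 - k1)
```

For example, on a 3-regular graph, 〈k〉 = 3 and 〈k²〉 = 9, giving T^c = 0.5.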
In our experiments, without loss of generality, we set γ = 1. Moreover, we assume that one node is initially infected and all the other nodes are susceptible to the disease.
The SIS model
In the SIS model, the nodes can be in two states, susceptible (S) and infected (I). Each infected node can infect any susceptible node it is connected with, with probability ν per unit time (infection rate), and with probability δ per unit time (recovery rate) it becomes healthy again, but at the same time susceptible to future infection. The ratio λ = ν/δ defines the effective spreading rate.
In the case of the SIS dynamics, a phase transition occurs at the critical point (epidemic threshold) \( \lambda ^{c} = \frac {\langle k \rangle }{\langle k^{2} \rangle }\) of the control parameter λ; 〈k〉 and 〈k^2〉 are as described previously. When λ < λ^c, the disease eventually vanishes from the graph, whereas for λ ≥ λ^c (epidemic phase) there is a dynamically stable equilibrium state, where the average density of infected nodes is stable, that is, infections and recoveries balance each other (Pastor-Satorras and Vespignani 2001; Castellano and Pastor-Satorras 2010; Newman 2010; Pastor-Satorras et al. 2015).
In the simulations, without loss of generality, we set δ = 0.8. We also assume that 20% of the individuals/nodes are initially infected.
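A single SIS update step can be sketched as follows (our simplified illustration; in particular, the synchronous-update choice is an assumption of this sketch, not a detail stated in the paper):

```python
import random

def sis_step(adj, infected, nu, delta, rng=random):
    # One synchronous SIS update: each infected node infects each susceptible
    # neighbor with probability nu, and recovers (becoming susceptible again,
    # not immune) with probability delta.
    newly_infected = set()
    for u in infected:
        for v in adj[u]:
            if v not in infected and rng.random() < nu:
                newly_infected.add(v)
    survivors = {u for u in infected if rng.random() >= delta}
    return survivors | newly_infected
```

Iterating this step and tracking the density of infected nodes is how the equilibrium behavior described below can be observed.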
Evaluation measures
Top-k spreading efficiency
In the SIR epidemic, the spreading efficiency of a given node i is defined by the size of the population M_i that eventually gets infected when the epidemic originates at node i. Hence, the ground truth ranking referring to the SIR model is produced by the M_i values.
The imprecision function (Kitsak et al. 2010) is given by:

\(\varepsilon_{C}(\text{top-}k) = 1 - \frac{\overline{M}_{C}(\text{top-}k)}{\overline{M}_{eff}(\text{top-}k)}\)   (1)

where \(\overline {M}_{C}(\text{top-}k) = \frac {\sum _{i \in N_{\text{top-}k}} M_{i}}{|N_{\text{top-}k}|}\) is the average of the M_i values of the top-k nodes, when the ranking is based on the nodes' centrality values C, and N_top-k is the set of nodes in the top-k. \(\overline {M}_{eff}(\text{top-}k)\) is the average of the M_i values of the top-k nodes in the ground truth ranking.
In the SIS epidemic, we define the persistence-distance, based on the notion of persistence (Kitsak et al. 2010), namely the probability ρ_i(t) that a given node i is infected at time t. At the equilibrium state, the probability ρ_i(t→∞) is independent of t (Pastor-Satorras and Vespignani 2001). The persistence is a measure of the importance of a given node during the epidemic process; it represents the frequency with which the node in question has been infected by the disease. If we consider the SIS model as a rumor-spreading process in a social network, then the persistence corresponds to the probability that a given individual gets informed about a certain rumor and, consequently, is able to pass this rumor further into the network.
We define the ground truth ranking in SIS with reference to the persistence values. Hence, the persistence-distance is defined as:

\(\varepsilon_{\rho}(\text{top-}k) = 1 - \frac{\overline{\rho}_{C}(\text{top-}k)}{\overline{\rho}_{eff}(\text{top-}k)}\)   (2)

where \(\overline {\rho }_{C}(\text{top-}k)\) is the average persistence of the nodes in the top-k, when the ranking is based on the centrality measure C, and \(\overline {\rho }_{eff}(\text{top-}k)\) is the average persistence of the nodes in the top-k according to the ground truth ranking.
In this paper, we use three centrality measures: the degree centrality, the k-core decomposition and the betweenness centrality, all widely used for the influential spreaders identification problem (Kitsak et al. 2010; Hébert-Dufresne et al. 2013).
Top-k nodes similarity
We apply OSim (Haveliwala 2003), an object similarity measure (in our case the objects are nodes), which measures the overlap between two ranking lists A and B (each of size k), without taking their ordering into account. It is defined as \(OSim(A,B)=\frac {|A \cap B|}{k}\). In our case, the lists A and B correspond to the top-k nodes of the ranking lists \(r_{eff}^{G}\) and r_C; thus, the OSim is given by:

\(OSim\left(r_{eff}^{G}(\text{top-}k),\, r_{C}(\text{top-}k)\right) = \frac{\left|r_{eff}^{G}(\text{top-}k) \cap r_{C}(\text{top-}k)\right|}{k}\)

where \(r_{eff}^{G}(\text{top-}k)\) is the set of top-k nodes of the ground truth ranking \(r_{eff}^{G}\) in the original graph, based on the nodes' spreading efficiency, a notion described previously. The r_C(top-k) is the set of top-k nodes of the subjective ranking r_C, which is based on the values of a given centrality measure C. The r_C refers either to the original graph or to the samples generated by a sampling algorithm. When we study the general effectiveness of a given centrality measure C, we compute \(r_{C}^{G}(\text{top-}k)\), the ranking of nodes based on their centrality values (in descending order) obtained from the original graph G. When we study the effectiveness of a given sampling algorithm, we have \(r_{C}^{S}(\text{top-}k)\), the ranking of nodes in terms of their centrality values obtained from the samples, i.e. the subgraphs generated by the sampling algorithm in question. We note that in the case of samples, the centrality measure is computed on the sample subgraph structure.
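Computing OSim for two top-k lists is straightforward (a one-function sketch of ours):

```python
def osim(ranking_a, ranking_b, k):
    # Fraction of common nodes between the two top-k sets, order-ignored.
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k
```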
Ranking similarity
Given a sample subgraph S and a centrality measure C, we apply the Kendall tau (Kendall 1938), the well-known rank correlation coefficient, which measures the relative ordering of all pairs in two ranking lists A and B consisting of the same elements. The list A is the \(r_{C}^{S}(\text{top-}k)\) mentioned previously in the OSim definition. The list B is the ranking of the values that the top-k sample nodes have in the ground truth ranking \(r_{eff}^{G}\), regardless of their position in \(r_{eff}^{G}\). We follow this approach because some of the nodes in \(r_{C}^{S}(\text{top-}k)\) may not appear in the top-k of the ground truth ranking \(r_{eff}^{G}\). Finally, we follow the same procedure when we study the effectiveness of a centrality measure based on the original graph (full information case). In this case, the list A corresponds to \(r_{C}^{G}(\text{top-}k)\), which has also been described previously.
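A direct O(k²) computation of the Kendall tau over two paired score lists can be sketched as follows (our illustration; tied pairs contribute zero, which corresponds to the tau-a variant, an assumption on our part since the paper does not specify the tie-handling variant):

```python
def kendall_tau(x, y):
    # x, y: equal-length score lists, paired by position (one pair per node).
    # Counts concordant minus discordant pairs over all n*(n-1)/2 pairs.
    n = len(x)
    score = sum(
        (1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else
         -1 if (x[i] - x[j]) * (y[i] - y[j]) < 0 else 0)
        for i in range(n) for j in range(i + 1, n)
    )
    return score / (n * (n - 1) / 2)
```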
Datasets
We have used eight datasets of different types (social networks, collaboration networks, a location-based social network, etc.), previously used for graph mining (Leskovec and Krevl 2014). We restrict our analysis to undirected graphs; therefore, we transform the directed graphs into undirected ones by symmetrizing each edge, after removing any self-loops.

D1 ego-Facebook: undirected graph of 4039 users' "friends lists", i.e. the ego-networks, from Facebook (McAuley and Leskovec 2012).
D2 wiki-Vote: voting network (directed graph) from Wikipedia, consisting of 7115 users (Leskovec et al. 2010).
D3 CA-CondMat: scientific collaboration network (undirected) between 23,133 authors with papers submitted to the Condensed Matter category (Leskovec et al. 2007).
D4 p2p-Gnutella30: Gnutella peer-to-peer network topology (directed) of 36,682 nodes (hosts in the Gnutella network) (Ripeanu et al. 2002; Leskovec et al. 2007).
D5 Email-Enron: email communication network (undirected) of 36,692 nodes (email addresses) (Leskovec et al. 2009).
D6 loc-Brightkite: online location-based social network (undirected) of 58,228 nodes (Cho et al. 2011).
D7 soc-Epinions1: web of trust obtained from Epinions (directed) of 75,879 nodes, members of www.epinions.com (Richardson et al. 2003).
D8 soc-Slashdot0922: social network (directed) of 82,168 nodes/users (Leskovec et al. 2009).
Simulation setup
For each dataset and each sampling method separately, we collect 100 samples at 20% sample size. For maxRD, the number of initial seeds is set to 1% of the number of nodes in the original graph. The number of initial seeds is equal to one for FF. A condition of MHRW is that the graph be connected. In this paper, we study the real-world graphs as they are; hence, they may not be well connected, and MHRW could get trapped in a small region of the graph. In order to avoid this, we also set its number of initial seeds to 1%. Furthermore, in each sampling trial, the iterations of MH are set to 20,000 and 30,000 for the datasets D1 to D5 and D6 to D8, respectively.
We study ten top-k intervals of 1%, 2%, …, 10%; namely, we study the first one percent up to the first ten percent of the graph nodes. Hence, we compute the imprecision, persistence-distance, OSim and Kendall tau for each top-k separately.
The simulation experiments have been implemented in MATLAB. The centrality values of the nodes in the samples, as well as in the original graphs have been computed using the igraph R package (Csardi and Nepusz 2006).
Results
The overall steps of analysis that we have followed during this study can be summarized as follows:
Given a graph G(N,E), a centrality measure and a sampling algorithm:

1. Ground-truth ranking: Simulate SIR and SIS epidemics in G (1000 simulations per epidemic model). Rank the nodes based on their average epidemic efficiency (see "Methods" section).
2. Samples-ranking:
   (a) Run the sampling algorithm (100 samples/subgraphs, 20% sample size).
   (b) Compute the nodes' centrality values in the samples' subgraph structure.
   (c) Rank the nodes based on these centrality values.
3. Graph-ranking: Full information case. Rank the nodes of G based on their centrality values in G.
4. Evaluation: Find the imprecision and persistence-distance values, OSim and Kendall tau, for a given top-k, based on the Samples- and Graph-rankings. Compare the Samples-ranking accuracy vs. the Graph-ranking accuracy, in terms of the identified top-k influential spreaders.
Estimating the epidemic parameters
A common approach in the literature is to perform the simulation of the epidemic models on the largest connected component of a given network. In this paper, we study the real-world networks as they are; hence, other smaller components may exist. Thus, we cannot directly apply the theoretical epidemic threshold values. Furthermore, our final goal is to produce a ground truth ranking, by which we can study the top-1% to top-10% of the most influential nodes in the graph. For this ranking to be valuable, the values of T = β/γ and λ = ν/δ have to be above the epidemic threshold, but not very far from it. In all experiments, we set γ = 1 and δ = 0.8, as mentioned previously in the "Methods" section.
For both epidemic models, we set the parameter p = 1, 1.1, 1.2, … and express T = p×T^c and λ = p×λ^c. We simulate both SIR and SIS for several p values, until a predefined criterion is roughly satisfied.
In each simulation instance of the SIR model, we measure the epidemic efficiency of each node i, namely, the size of the infected population in the long run when the epidemic is initiated at i. Thus, each simulation instance consists of N trials, where N is the number of nodes. We repeat this process for 1000 independent simulations. Then, we select the smallest p value, per dataset, for which the top-20% of the most effective nodes have an average epidemic efficiency larger than 1% of the graph, in every simulation instance. In Table 1, we present the resulting T^∗ = p×T^c per dataset. These are the values used for the rest of the analysis.
In the SIS model, in each simulation instance, the population of initially infected nodes is set to 20% of the nodes in the graph. We run the model above the epidemic threshold until the dynamically stable equilibrium state has been reached, that is, until the average density of infected nodes is stable. We verify this by tracking the difference Δ(I) = I(t+1) − I(t), where I(t) is the density of infected nodes at time period t. In the first periods, Δ(I) is always negative. When Δ(I) ≥ 0 occurs for the first time, we let the system rest for 100 time periods from that point on. Then, we store the Δ(I) = I(t+1) − I(t) values. If, for 100 or more time periods, the average of the Δ(I) values is close to zero, we assume that the equilibrium has been reached. In other words, during the last periods (more than 100), the overall gain or loss must be close to zero.
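The stopping rule just described can be sketched as a simple check on the series of infected-node densities (our illustration; the window length and tolerance are assumptions of this sketch):

```python
def reached_equilibrium(densities, window=100, tol=1e-3):
    # densities: list of I(t) values, one per time period.
    # The average of the last `window` differences Δ(I) = I(t+1) - I(t)
    # must be close to zero, i.e. overall gain/loss roughly cancels out.
    if len(densities) < window + 1:
        return False
    deltas = [b - a for a, b in zip(densities[-window - 1:], densities[-window:])]
    return abs(sum(deltas) / len(deltas)) < tol
```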
In Table 1, we present the smallest \(\widehat {\lambda } = p \times \lambda ^{c}\), per dataset, for which I_{t→∞} ≠ 0, namely, for which the density of infected nodes is nonzero in the long run. In other words, for \(\lambda \geq \widehat {\lambda }\) an epidemic phase occurs; that is, the disease does not vanish from the graph, and instead a dynamically stable equilibrium eventually emerges. For instance, we observe in Table 1 that for the dataset D2 (wiki-Vote) we have \(\widehat {\lambda } = 1.3 \times \lambda ^{c}\). This means that the theoretical λ^c value does not correspond to the real epidemic threshold of the network, since for this value we do not observe a transition to the epidemic phase.
Finally, we estimate the smallest \(\lambda ^{*} \geq \widehat {\lambda }\) for which the average density of infected nodes at equilibrium is at least 1% of the graph. We have used these λ^∗ values for the rest of the analysis.
Top-k spreading efficiency
Figures 1 and 2, as well as Table 2, present the results of the SIR and SIS epidemics with regard to the spreading efficiency of the top-k nodes (imprecision and persistence-distance), as identified by the maxRD samples. That is, we compute the nodes' centrality values in the samples' subgraph structure and then, using Eqs. 1 and 2, we compute the imprecision and persistence-distance per top-k. In addition, for the full information case, we present the imprecision and persistence-distance for the same centrality measures, but this time the centrality values are computed in the original graph G.
First, regarding the general effectiveness of the three centrality measures when they are obtained from the original graph (full information case), we observe the following. In SIR, the accuracies of degree centrality and k-core are equivalent in two out of eight datasets (see Fig. 1 c, d), whereas the k-core is clearly superior in five out of eight datasets (see Fig. 1 a, e, f, g, h). The outcome of SIS is similar: in five datasets, the k-core outperforms the degree centrality (see Fig. 2 a, e, f, g, h). Finally, comparing the values in Table 2 referring to graph G and betweenness centrality with the corresponding values of degree centrality and k-core in Figs. 1 and 2, we conclude that the betweenness centrality has the lowest accuracy among the three centrality measures. Only in p2p-Gnutella30 (D4) is the accuracy of betweenness centrality close to that of degree centrality, in both epidemic models. These results partially verify the work of Kitsak et al. (2010), namely that the k-core is superior both to degree centrality and betweenness centrality.
With respect to the maxRD efficiency, we observe that in both SIR and SIS, and for all eight datasets, the maxRD identifies the influential spreaders for all centrality measures with equal or even higher accuracy than that exhibited in the full information case.
Specifically, studying the plots in Fig. 1 (SIR epidemic), we observe that for any top-k, the imprecision values of the maxRD samples are equal to or lower than the imprecision values for the same centrality measures computed in the original graph (i.e. the full information case). Moreover, in four out of eight datasets (Email-Enron, loc-Brightkite, soc-Epinions1 and soc-Slashdot0922), the imprecision values in terms of the degree centrality of the maxRD samples are clearly lower than the imprecision values in terms of the degree centrality of the original graph (see Fig. 1 e, f, g, h).
The results for the SIS epidemic are very similar to those of SIR. From Fig. 2, we conclude that using the maxRD samples, we are able to identify the influential spreaders with at least the same accuracy as if we had used the original graph. Moreover, in five out of eight datasets, the degree centrality of the maxRD samples outperforms the degree centrality obtained from the original graph (see Fig. 2 c, e, f, g, h).
Finally, Table 2 makes clear that maxRD outperforms the full information case: the imprecision and persistence-distance values computed in the maxRD samples are substantially lower than those obtained from the betweenness values of the original graph G.
Top-k nodes similarity
In the next paragraphs, we present the results of the OSim with respect to the degree centrality and the k-core, obtained from the maxRD samples.
The OSim is an object similarity measure, that is, the percentage of common nodes between two ranking lists in a given top-k. As mentioned in the "Methods" section, the first ranking list is the ground truth ranking, while the second one is the ranking of the nodes in a given sample, according to the centrality values computed in that sample/subgraph.
Figures 3 and 4 present the OSim values in separate plots, for each top-k as well as for all datasets D1 to D8. For a given top-k, we plot the OSim values with reference to the degree centrality and the k-core, when the centrality measures have been computed in the maxRD samples/subgraphs, along with the OSim values that correspond to the original graph. In other words, we compare the accuracy of maxRD to the full information case.
In all datasets and in all top-k intervals, the accuracy of maxRD is equal to or larger than the accuracy of the full information case. Generally, the maxRD OSim values are larger than 0.6 for all datasets, except for D1 (ego-Facebook) at top-1% (Fig. 3 a). Moreover, the OSim values are always larger than 0.8 in five out of eight datasets in SIR and six out of eight in SIS, for at least one of the two centrality measures.
For instance, from Figs. 3 a and 4 a, we observe that in the top-1%, a very narrow interval, the maxRD identifies more than 80% of the influential spreaders in five datasets, D2 and D5 to D8.
An important result is that the maxRD OSim values in terms of the degree centrality are generally very close to the OSim values with respect to the k-core obtained from the original graph. This means that by sampling 20% of the graph and using only a local measure, that is, the degree centrality in the generated samples, we can approximate the accuracy of the k-core (global information) computed in the original graph. We note that the k-core is generally more accurate than the degree centrality (see Kitsak et al. 2010).
Ranking similarity
We continue the analysis by examining the Kendall tau rank correlation coefficient. For a given graph G, top-k and centrality measure, we compute the Kendall tau between the centrality-based ranking of the nodes in the maxRD samples and the positions that those nodes have in the ground truth ranking. We follow the same procedure for the top-k nodes of the full information case, that is, when the centrality values are computed in the original graph.
The Kendall tau values are computed only for the degree centrality, because each k-core value is assigned to a group of nodes and hence many ties may occur.
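For reference, a minimal tau-a implementation, counting concordant and discordant pairs with no tie correction (our own sketch; the paper follows Kendall 1938):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score sequences:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical orderings yield 1.0 and fully reversed orderings yield -1.0; note that ties lower tau-a toward zero, which is exactly why the measure is unsuitable for the grouped k-core values.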
Figures 5 and 6 present the average Kendall tau values over the 100 maxRD samples, along with the Kendall tau values for the original graph, for each top-k separately.
We observe that in four out of eight datasets (D2, D5, D7 and D8), the average Kendall tau values lie in the intervals (0.73, 0.85) and (0.65, 0.80) for SIR and SIS, respectively. Thus, there is a large positive correlation between the ordering of the top-k nodes in the samples and the ordering that these nodes have in the ground truth ranking. For instance, for every top-k, the Kendall tau value for the dataset D8 (Slashdot0922), the largest of the eight datasets, is always very close to 0.8 in both epidemics. Moreover, maxRD outperforms the full information case in every dataset and top-k interval.
Comparison of sampling methods
We continue the analysis by comparing our method with three well-known graph sampling algorithms: Forest Fire (FF), Metropolis-Hastings Random Walk (MHRW) and Metropolis-Hastings (MH). We note that MH is a “centralized” algorithm which takes the original graph as input; running for thousands of iterations, it performs a degree distribution approximation. For the sake of simplicity, only the results for SIR are shown; in the SIS epidemic, the algorithms exhibit performance similar to that in SIR.
Figures 7 and 8 present the basic evaluation measures for the SIR epidemic, namely the imprecision and OSim. We present the results for the degree centrality in separate plots for each top-k interval. Note that the imprecision and OSim values for the maxRD algorithm have already been presented in Figs. 1 and 3.
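For concreteness, the imprecision function of Kitsak et al. (2010) compares the average spreading efficiency M(p) of the pN highest-centrality nodes with the average M_eff(p) of the pN most efficient spreaders, and is defined as ε(p) = 1 − M(p)/M_eff(p). A minimal sketch (the names below are ours):

```python
def imprecision(centrality, spreading, p):
    """Imprecision epsilon(p) = 1 - M(p) / M_eff(p): M(p) is the average
    spreading efficiency of the pN highest-centrality nodes, M_eff(p) the
    average over the pN most efficient spreaders (Kitsak et al. 2010)."""
    n = max(1, int(p * len(spreading)))
    top_by_centrality = sorted(spreading, key=lambda v: centrality[v],
                               reverse=True)[:n]
    top_by_spreading = sorted(spreading, key=lambda v: spreading[v],
                              reverse=True)[:n]
    m = sum(spreading[v] for v in top_by_centrality) / n
    m_eff = sum(spreading[v] for v in top_by_spreading) / n
    return 1.0 - m / m_eff
```

A value near 0 means the centrality measure selects nodes whose spreading efficiency is nearly optimal; larger values indicate a poorer selection.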
maxRD significantly outperforms the other three methods. Specifically, in four out of eight datasets (D2, D5, D7, D8), the maxRD imprecision values lie in the interval (0.0043, 0.0140) for every top-k. On the other hand, the values of FF and MH are larger than 0.2 in at least half of the datasets (see Fig. 7). Only in the D1 dataset (ego-Facebook) does MH exhibit performance similar to maxRD. The Metropolis-Hastings Random Walk, in particular, exhibits the lowest performance.
Regarding node similarity in Fig. 8, the OSim values of FF and MH are generally significantly lower than those of maxRD. Finally, the performance of MHRW is very low, as its OSim values are always smaller than 0.3 in at least five out of eight datasets.
Effect of the size of the initial seeds
We conclude the analysis by investigating the relation between the effectiveness of maxRD and the size of the initial seed set, that is, the starting nodes of the algorithm (see Algorithm 1, Step 4). We compare the maxRD version which starts from 1% of the graph's nodes, the case analyzed in the previous sections, to the extreme case where the algorithm starts from only one node. The latter is in fact a parameter-free exploration algorithm which performs a single graph traversal, visiting one node at each time step.
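To illustrate the general idea, a greedy max-degree exploration in the spirit of the single-seed maxRD variant can be sketched as follows. This is our own hypothetical sketch, not the authors' exact Algorithm 1; it assumes the degrees of a visited node's neighbors are observable, as is often the case when crawling online social networks, and it collects the crossed edges as the sample:

```python
def max_degree_exploration(adj, start, budget):
    """Greedy exploration from a single seed: repeatedly cross the edge to the
    highest-degree unvisited neighbor; restart from the highest-degree visited
    node when the current node has no unvisited neighbors."""
    visited = {start}
    edges = []
    current = start
    while len(edges) < budget:
        candidates = [u for u in adj[current] if u not in visited]
        if not candidates:
            frontier = [v for v in visited
                        if any(u not in visited for u in adj[v])]
            if not frontier:
                break  # nothing left to explore
            current = max(frontier, key=lambda v: len(adj[v]))
            continue
        nxt = max(candidates, key=lambda u: len(adj[u]))
        edges.append((current, nxt))
        visited.add(nxt)
        current = nxt
    return edges, visited
```

The returned edge list is the graph sample; centrality measures such as degree can then be computed on the subgraph it induces.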
In Table 3, we present the OSim values for the SIR epidemic with respect to degree centrality and k-core, for both maxRD versions. The results for the SIS epidemic are similar; for simplicity of presentation, we report only the SIR case.
The results clearly indicate that the two versions are almost equivalent. In every dataset and top-k, the OSim values of both versions are almost equal, even for the k-core measure. This is an important result: the size of the initial seed set does not affect the accuracy of the method. This fact further simplifies the design of the method and increases its usability in real world applications. maxRD is an exploration algorithm that is easy to implement and can be used for crawling online social networks. In the past few years, complex networks (especially social ones) have grown considerably in size, which increases the importance of sampling in network analysis. Generally, studies of the influential spreaders identification problem implicitly assume that the whole original network is available for investigation. In real world networks this assumption is not realistic, as, due to their large size, the networks are usually not fully known.
Conclusion
In this paper, we have presented a graph exploration sampling method that accurately identifies the influential spreaders in a complex network. The proposed method is based on a simplified version of the Rank Degree graph sampling algorithm. We have performed an extensive experimental analysis on eight real world datasets of different types, using three centrality measures, as well as two well-known epidemic models, which served as ground truth identifiers of the nodes' spreading efficiency. Moreover, we have compared our method with three well-known sampling algorithms. The experimental analysis provides strong evidence for the effectiveness of the method, which identifies the influential spreaders with high accuracy by sampling only 20% of the network.
References
Brin, S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1): 107–117.
Castellano, C, Pastor-Satorras R (2010) Thresholds for epidemic spreading in networks. Phys Rev Lett 105: 218701.
Chen, D, Lu L, Shang MS, Zhang YC, Zhou T (2012) Identifying influential nodes in complex networks. Physica A Stat Mech Appl 391(4): 1777–1787.
Chen, DB, Gao H, Lü L, Zhou T (2013) Identifying influential nodes in largescale directed networks: The role of clustering. PLoS ONE 8(10): 1–10.
Cho, E, Myers SA, Leskovec J (2011) Friendship and mobility: User movement in location-based social networks In: 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’11, 1082–1090. ACM, New York.
Csardi, G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Syst 1695. http://igraph.org.
Gjoka, M, Kurant M, Butts CT, Markopoulou A (2011) Practical recommendations on crawling online social networks. IEEE J Selected Areas Commun 29(9): 1872–1892.
Haveliwala, TH (2003) Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Trans Knowl Data Eng 15(4): 784–796.
Hübler, C, Kriegel HP, Borgwardt K, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling In: 2008 Eighth IEEE International Conference on Data Mining, 283–292. IEEE, Piscataway.
Hébert-Dufresne, L, Allard A, Young JG, Dubé LJ (2013) Global efficiency of local immunization on complex networks. Sci Rep 3(2171). doi:10.1038/srep02171.
Kempe, D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network In: Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, 137–146. ACM, New York.
Kendall, MG (1938) A new measure of rank correlation. Biometrika 30(1–2): 81–93.
Kitsak, M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, Makse HA (2010) Identification of influential spreaders in complex networks. Nat Phys 6: 888–893. doi:10.1038/nphys1746.
Lü, L, Zhang YC, Yeung CH, Zhou T (2011) Leaders in social networks, the delicious case. PLoS ONE 6(6): 1–9.
Leskovec, J, Faloutsos C (2006) Sampling from large graphs In: Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 631–636. ACM, Philadelphia.
Leskovec, J, Krevl A (2014) SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. Accessed Oct 2015.
Leskovec, J, Kleinberg JM, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations In: Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 177–187. ACM, Chicago.
Leskovec, J, Kleinberg J, Faloutsos C (2007) Graph evolution: Densification and shrinking diameters. ACM Trans Knowl Discov Data 1(1). doi:10.1145/1217299.1217301.
Leskovec, J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1): 29–123.
Leskovec, J, Huttenlocher D, Kleinberg J (2010) Signed networks in social media In: SIGCHI Conference on Human Factors in Computing Systems. CHI ’10, 1361–1370. ACM, New York.
Li, Q, Zhou T, Lü L, Chen D (2014) Identifying influential spreaders by weighted LeaderRank. Physica A: Stat Mech Appl 404: 47–55.
Li, RH, Yu JX, Qin L, Mao R, Jin T (2015) On random walk based graph sampling In: 2015 IEEE 31st International Conference on Data Engineering, 927–938. doi:10.1109/ICDE.2015.7113345.
McAuley, JJ, Leskovec J (2012) Learning to discover social circles in ego networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in Neural Information Processing Systems 25 (NIPS 2012), 539–547. Curran Associates, Inc., New York.
Metropolis, N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6): 1087–1092. doi:10.1063/1.1699114.
Newman, M (2010) Networks: An Introduction. Oxford University Press, Inc., New York.
Newman, MEJ (2002) Spread of epidemic disease on networks. Phys Rev E 66: 016128.
Pastor-Satorras, R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86: 3200–3203.
Pastor-Satorras, R, Castellano C, Van Mieghem P, Vespignani A (2015) Epidemic processes in complex networks. Rev Mod Phys 87: 925–979.
Richardson, M, Agrawal R, Domingos P (2003) Trust management for the semantic web In: The Semantic Web – ISWC 2003: Second International Semantic Web Conference, 351–368. Springer Berlin Heidelberg, Sanibel Island.
Ripeanu, M, Iamnitchi A, Foster I (2002) Mapping the Gnutella Network. IEEE Internet Comput 6(1): 50–57. doi:10.1109/4236.978369.
Salamanos, N, Voudigari E, Yannakoudakis EJ (2016) Identifying influential spreaders by graph sampling In: 5th International Workshop on Complex Networks and their Applications, 111–122. Springer International Publishing, Milan.
Salamanos, N, Voudigari E, Yannakoudakis EJ (2017) Deterministic graph exploration for efficient graph sampling. Soc Netw Anal Mining 7(1): 24. doi:10.1007/s13278-017-0441-6.
Seidman, SB (1983) Network structure and minimum degree. Soc Netw 5(3): 269–287.
Stutzbach, D, Rejaie R, Duffield N, Sen S, Willinger W (2009) On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans Netw 17(2): 377–390.
Voudigari, E, Salamanos N, Papageorgiou T, Yannakoudakis EJ (2016) Rank degree: An efficient algorithm for graph sampling In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016, 120–129. IEEE, San Francisco.
Zeng, A, Zhang CJ (2013) Ranking spreaders by decomposing complex networks. Phys Lett A 377(14): 1031–1035.
Acknowledgements
We wish to thank Kyriaki Chryssaki for her valuable support throughout this research.
Author information
Contributions
All authors contributed to the design, experimentation, and documentation of the paper. Also, all authors read and approved the final manuscript.
Corresponding author
Correspondence to Nikos Salamanos.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Influential spreaders
 Complex networks
 Graph mining