Compressive Closeness in Networks

Distributed algorithms for network science applications are of great importance due to today's large real-world networks. In such algorithms, a node is allowed to interact only locally with its immediate neighbors, because the topological structure of the whole network is often unknown to each node. Recently, distributed detection of central nodes, with respect to different notions of importance, has received much attention. Closeness centrality is a prominent measure for evaluating the importance (influence) of nodes in a given network, based on their accessibility. In this paper, first, we introduce a local (ego-centric) metric that correlates well with the global closeness centrality while having very low computational complexity. Second, we propose a compressive sensing (CS)-based framework to accurately recover high closeness centrality nodes in the network utilizing the proposed local metric. Both the ego-centric metric computation and its aggregation via CS are efficient and distributed, using only local interactions between neighboring nodes. Finally, we evaluate the performance of the proposed method through extensive experiments on various synthetic and real-world networks. The results show that the proposed local metric correlates with the global closeness centrality better than existing local metrics. Moreover, the results demonstrate that the proposed CS-based method outperforms the state-of-the-art methods with notable improvement.


Introduction
Centrality measures quantify the importance of a node within a given network. Some notions of centrality consider only local properties of the network, while others reflect global properties. The appropriate quantification of importance depends on the application context. To address applications in which the reachability of a node to the entire network is of importance, researchers have introduced the closeness centrality measure. For an arbitrary node u, its closeness centrality C(u) is defined as the inverse of its average distance to the other nodes in the network. More formally:

$$C(u) = \frac{|V| - 1}{\sum_{v \in V \setminus \{u\}} d(u, v)} \tag{1}$$

where d(u, v) is the shortest distance between u and v and |V| is the number of nodes in the network. Locating public facilities over a transportation network such that they are easily accessible to everyone, or identifying people with an ideal social network location for information dissemination or network influence, are scenarios in which identifying high closeness centralities is of great interest [1][2][3]. In these scenarios, we are mainly interested in efficiently and accurately detecting the top-k high closeness centrality nodes in the network, while their exact relative order, as well as the actual closeness centrality values, are not so important.
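To make the definition concrete, the following minimal sketch computes closeness centrality on an unweighted, connected graph by running a breadth-first search from each node and then sorting to obtain the top-k nodes. The adjacency-list representation, the function names, and the 5-node path graph are illustrative assumptions, not part of the paper; this is the centralized baseline, not the proposed distributed method.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, u):
    """Inverse of the average distance from u to the other nodes.

    Assumes the graph is connected, so every other node is reachable.
    """
    dist = bfs_distances(adj, u)
    total = sum(d for v, d in dist.items() if v != u)
    return (len(adj) - 1) / total

def top_k_closeness(adj, k):
    """Centralized baseline: score every node, then keep the k best."""
    scores = {u: closeness(adj, u) for u in adj}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical example: the path graph 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
center = top_k_closeness(path, 1)
```

On the path graph, the middle node has the smallest total distance to the others and is therefore returned as the single most central node.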
A trivial approach to identify top-k closeness centrality nodes consists of the may prevent such a method from being applied on large real-world networks [4]. To address this issue, developing scalable distributed algorithms is of great importance, where each node is only interacting with its immediate neighbors [5].
To the best of our knowledge, there is no distributed and decentralized algorithm for the task of detecting top-k high closeness centrality nodes that operates while requiring each node only to have local interactions with its immediate neighbors.
However, several algorithms satisfy these properties and compute the exact or approximate closeness centrality of each node in the network. Approximation approaches compute an alternative centrality score that highly correlates with the global closeness centrality. An efficient sorting algorithm can then be utilized on top of these methods to identify top-k high closeness centrality nodes. There are two major shortcomings with such approaches: (1) They do not exploit the fact that the vector of closeness centrality values has a few large coefficients (k) and many small coefficients, so that it can be well approximated by a k-sparse vector (signal). In general, a centrality measure (e.g. closeness centrality) must have a right-skewed probability distribution to be useful in selecting important nodes. (2) They require a direct measurement (query) from each node, which is not always possible due to log-in requirements, API query limits, and the treatment of user data as proprietary.
To address these issues, we transform the problem of detecting top-k closeness central nodes into the problem of sparse recovery in networks. The key breakthrough for the sparse recovery problem is compressive sensing (also known as compressive sampling), which performs a few indirect end-to-end measurements on a signal x and recovers a good sparse approximation of that signal. However, additional requirements must be taken into account when these measurements are performed over a graph, rather than on an arbitrary signal. Creating feasible measurements that satisfy these constraints (discussed in section 2.2) has initiated the field of compressive sampling over graphs.
Our contributions in this paper are two-fold: (1) We propose a local (ego-centric) metric which can be computed in a distributed manner at each node. The computation can be carried out requiring each node to have only local knowledge of its immediate neighborhood. In section 5, we experimentally show that the suggested local metric is highly correlated with the global closeness centrality on many real-world and synthetic networks. (2) We propose a general compressive sensing framework for distributed identification of central nodes in networks based on the introduced local metric using indirect end-to-end (aggregated) measurements. We experimentally show the superiority of our approach in terms of accuracy for the prediction of high closeness central nodes compared to the best existing competing methods.
The rest of this paper is organized as follows. In section 2, we briefly explain the preliminary notations and definitions. We review the related works on distributed detection of central nodes requiring only local interactions with the neighbors from each node, in section 3. In section 4, we introduce our novel approach in detail and analyze its time and space complexity. Later in section 5, the settings and results of our experimental evaluations are presented. We conclude the paper in section 6.
A preliminary version of this paper has appeared in [6]. Here, we explain the background and the intuitions behind the idea in more detail. We also comprehensively review the related work and describe its limitations together with our corresponding solutions. Moreover, we add three different types of real datasets and several test scenarios to our extensive experimental evaluations in order to show the generalization of the proposed method.

Compressive Sampling
As an alternative to direct measurements, one can utilize sampling-based approaches. Based on the Nyquist-Shannon theorem, a general signal x can be completely recovered by sampling it at the Nyquist rate. However, sampling at the Nyquist rate can be costly or impossible due to the massive scale of many real-world networks we are facing today. If the underlying signal is sparse in a suitable basis, sampling at the Nyquist rate only to recover a relatively small fraction of nonzero elements wastes system resources and induces two sources of error: sampling (collection) error and identification (compression) error.
The state-of-the-art approach for recovery of sparse signals is Compressive Sensing/Sampling (CS), which addresses these drawbacks. In compressive sampling, one can simultaneously sample and compress a signal $x_{n \times 1}$ through a measurement matrix $A_{m \times n}$, where $m \ll n$, to acquire the following linear system:

$$y_{m \times 1} = A_{m \times n}\, x_{n \times 1} \tag{2}$$

The resulting system is under-determined and does not have a unique solution in general. $A$ is said to satisfy the 2k-restricted isometry property (RIP) if there exists $0 < \delta_{2k} < 1$ such that for all 2k-sparse signals $x$ it holds:

$$(1 - \delta_{2k}) \|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_{2k}) \|x\|_2^2 \tag{3}$$

In case the measurement matrix satisfies the 2k-RIP, one can prove uniqueness of a k-sparse solution to the above linear system ($y = Ax$). To see this, assume $x_1$ and $x_2$ are both k-sparse signals and $Ax_1 = Ax_2$, so the vector $x = x_1 - x_2$ is a 2k-sparse signal (has at most 2k non-zero entries). Since $A$ satisfies the 2k-RIP, Equation (3) can be rewritten for some $0 < \delta_{2k} < 1$, which ensures $x_1 = x_2$, as:

$$(1 - \delta_{2k}) \|x_1 - x_2\|_2^2 \le \|A(x_1 - x_2)\|_2^2 = 0 \tag{4}$$

Let $x^*$ be an arbitrary k-sparse vector, and $A$ be an arbitrary measurement matrix that satisfies the 2k-RIP. Then, given what we have discussed so far, it is easy to see that $x^*$ can be recovered by solving:

$$\min_x \|x\|_0 \quad \text{subject to} \quad y = Ax \tag{5}$$

where $\|x\|_0$ indicates the number of non-zero entries in $x$. Unfortunately, solving this optimization problem is NP-hard. Thus the following relaxation is considered, which utilizes the sparsity-inducing $\ell_1$-norm and is referred to as Basis Pursuit (BP):

$$\min_x \|x\|_1 \quad \text{subject to} \quad y = Ax \tag{6}$$

It has been shown that when the 2k-restricted isometry property is satisfied for $A$, the solution of BP is $x^*$. In this case, thanks to the convexity of BP, the recovery is efficient and computationally fast. Note that the strict condition $y = Ax$ within the Basis Pursuit formulation is very sensitive to imperfect sparsity or noise. The following formulation, known as LASSO, addresses this by removing the exact constraint and penalizing its violation:

$$\min_x \frac{1}{2} \|y - Ax\|_2^2 + \lambda \|x\|_1 \tag{7}$$

where $\lambda > 0$ balances the fit to the measurements against the sparsity of the solution. This objective has extremely fast distributed numerical solvers and will be utilized for the optimization step in this paper.
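As an illustration of how a LASSO objective of the form 0.5 * ||y - Ax||_2^2 + lam * ||x||_1 can be minimized, the sketch below uses iterative soft-thresholding (ISTA) in plain Python. ISTA is swapped in here for readability; it is not the GPU-based distributed solver used in the experiments, and the function names, the step-size rule, and the iteration count are assumptions made for this example.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the l1-norm: shrink `value` towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_ista(A, y, lam, iterations=500):
    """Minimise 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by ISTA.

    A is a list of rows, y a list of measurements. The step size is 1/L,
    where L is the squared Frobenius norm of A, an upper bound on the
    largest eigenvalue of A^T A (safe, though conservative).
    """
    m, n = len(A), len(A[0])
    lipschitz = sum(a * a for row in A for a in row)
    x = [0.0] * n
    if lipschitz == 0:
        return x
    step = 1.0 / lipschitz
    for _ in range(iterations):
        # Gradient of the quadratic term: A^T (A x - y).
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        gradient = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(x[j] - step * gradient[j], step * lam) for j in range(n)]
    return x
```

When A is the identity matrix, the minimizer is simply the soft-thresholded measurement vector, which gives a quick sanity check for the routine. In practice a tighter step size (for example from a power-iteration estimate of the largest eigenvalue) would converge faster.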

Compressive Sensing over Networks
In case the signal to be recovered is defined over a graph (network), three additional constraints must be taken into account [7,8] in CS problems: (1) Each element $A_{i,j}$ is 1 if node j is visited by measurement i and 0 otherwise; (2) The nodes visited by a measurement must correspond to a connected induced sub-graph [9][10][11][12]; (3) The signal $x$, which contains a graph property defined for each node, is almost always non-negative ($x \ge 0$).
Based on the compressive sensing framework, we would like to efficiently recover the k highest closeness centrality nodes from m indirect end-to-end measurements, with $m \ll n$. In the linear system $y_{m \times 1} = A_{m \times n}\, x_{n \times 1}$, let $A$ be an $m \times n$ measurement matrix whose i-th row corresponds to the i-th feasible measurement. For $i = 1, ..., m$ and $j = 1, ..., n$, $A_{ij} = 1$ if and only if node j is visited by the i-th measurement; otherwise $A_{ij} = 0$. Let $x$ be an $n \times 1$ non-negative vector whose j-th entry is the value of a certain type of network characteristic (e.g. a global/local centrality metric) over node $j \in V$, and let $y \in \mathbb{R}^m$ denote the measurement vector whose i-th entry is the additive aggregation of the values of the nodes in the i-th row of the measurement matrix $A$, which induce a connected sub-graph over G. Note that this way of constructing measurements already satisfies the network topological constraints of the feasibility conditions mentioned at the beginning of this section.
For the example network shown in Figure 1 with n = 10 nodes and |E| = 11 links, each of the two measurements $m_1$ and $m_2$ includes a different subset of connected nodes.
The corresponding feasible measurement matrix A with these measurements is: To understand how the additive aggregation over connected induced sub-graphs is motivated for each measurement in practice, we mention an example from [13].
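The construction of a feasible measurement matrix can be sketched as follows: each measurement is a node set that must induce a connected sub-graph, its row in A is the 0/1 indicator of that set, and the measurement value is the sum of the node values it covers. The 6-node graph below is a hypothetical example (it is not the network of Figure 1), node degrees are used as stand-in values for x, and all function names are assumptions.

```python
from collections import deque

def induces_connected_subgraph(adj, nodes):
    """True if the sub-graph induced by `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes

def measurement_row(adj, nodes, order):
    """0/1 row of A for one feasible measurement; rejects infeasible node sets."""
    if not induces_connected_subgraph(adj, nodes):
        raise ValueError("a measurement must induce a connected sub-graph")
    visited = set(nodes)
    return [1 if j in visited else 0 for j in order]

def aggregate(A, x):
    """Additive aggregation y = A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical graph, not the one in Figure 1.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
order = sorted(adj)
A = [measurement_row(adj, {0, 1, 3}, order),
     measurement_row(adj, {3, 4, 5}, order)]
x = [len(adj[j]) for j in order]   # node degrees as an example signal
y = aggregate(A, x)
```

The non-negativity constraint on x holds trivially here because degrees are non-negative.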
Consider a network where the nodes represent sensors, and the links represent communications between sensors. For the set T of active nodes within an arbitrary feasible measurement that induce a connected sub-graph, a node u ∈ T monitors the total values corresponding to nodes in T . Every node in T obtains values from its children, if any, and aggregates them with its value on the spanning tree rooted at u, then sends the sum to its parent. After that, the fusion center can obtain the sum of values corresponding to all the nodes in T by only communicating with u.
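The aggregation scheme described above can be sketched as a convergecast over a spanning tree of the active set T rooted at u: each node adds the sums received from its children to its own value and forwards the result to its parent. The sketch below simulates this centrally; the BFS spanning tree, the function name, and the example graph and values are assumptions made for illustration.

```python
from collections import deque

def tree_aggregate(adj, active, root, values):
    """Sum the values of the active set T at `root` via a BFS spanning tree."""
    active = set(active)
    parent = {root: None}
    order = [root]                      # BFS order: parents before children
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in active and v not in parent:
                parent[v] = u
                order.append(v)
                queue.append(v)
    if set(parent) != active:
        raise ValueError("active set must contain the root and induce a connected sub-graph")
    partial = {v: values[v] for v in active}
    # Visit nodes deepest-first so each child reports before its parent does.
    for v in reversed(order):
        if parent[v] is not None:
            partial[parent[v]] += partial[v]
    return partial[root]

# Hypothetical sensor network and readings.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
values = {0: 2, 1: 2, 2: 2, 3: 3, 4: 2, 5: 1}
total_at_root = tree_aggregate(adj, {1, 3, 4, 5}, 3, values)
```

Only the root needs to be contacted by the fusion center, since it ends up holding the sum over all of T.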
The explained paradigm in data acquisition and aggregation is highly utilized within the wireless sensor network literature for applications such as air quality monitoring, volcanic activity detection, and object localization [14]. Some recent work has applied a similar acquisition and aggregation paradigm in network tomography [8], community detection [10] and finding key actors in social networks [15][16][17].
Based on the above idea, a straightforward approach used in practice to construct measurement matrices satisfying these properties is to create a correspondence between every single measurement and a random walk on the graph. Each random walk additively aggregates the values computed by the nodes visited during the walk.
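A single random-walk measurement of this kind might look like the following sketch. It uses a simple uniform random walk and counts each distinct node once so that the corresponding row of A stays binary; both choices, along with the function name and the example data, are assumptions rather than the strategy of any particular cited method.

```python
import random

def random_walk_measurement(adj, values, start, steps, rng):
    """One measurement: walk `steps` hops from `start` and sum the values
    of the distinct nodes visited (each node counted once)."""
    visited = [start]
    seen = {start}
    current = start
    for _ in range(steps):
        current = rng.choice(adj[current])   # uniform choice among neighbors
        if current not in seen:
            seen.add(current)
            visited.append(current)
    row = {v: 1 for v in visited}            # sparse representation of a row of A
    y = sum(values[v] for v in visited)      # additive aggregation
    return row, y

# Hypothetical graph and node values.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
values = {0: 2, 1: 2, 2: 2, 3: 3, 4: 2, 5: 1}
row, y = random_walk_measurement(adj, values, 0, 5, random.Random(7))
```

Because every newly visited node is adjacent to the previous one, the visited set always induces a connected sub-graph, which is what makes the measurement feasible.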
The random walk strategy and the values computed by the nodes are what separate a method from the others. Performance of these methods and RIP satisfaction can then be verified theoretically or experimentally [6,7,11,15]. An alternative approach [18] employs a well-known randomized method in compressive sensing literature which satisfies the restricted isometry property with very high probability and makes deriving theoretical recovery guarantees straightforward. Also, it is possible to show that each constructed measurement will almost surely correspond to an induced connected sub-graph.

Related Work
In this section, we first review local metrics that highly correlate with the global closeness centrality and can be computed in a distributed manner relying only on interactions of neighboring nodes. After that, we review compressive sensing (CS)-based methods that can be utilized to recover top-k central nodes, using the mentioned local metrics by constructing a feasible measurement matrix.

Local Closeness Metrics
Dist-Exact Weight-Vol [20]: This work extended the metric in DACCER, based on two simple observations. First, nodes closer to a given node contribute more than farther nodes to the dissemination of that node's information. Second, nodes with low clustering coefficients are hubs linking neighboring parts of the network.

RW [7]: This work is one of the state-of-the-art methods in compressive sensing over graphs; it constructs random-walk-based measurements. Each measurement in the measurement matrix can be used to additively aggregate a metric of choice.
TopCent [15]: This method constructs a measurement matrix to recover top-k degree central nodes in networks. Since degree centrality is highly correlated with the closeness centrality in some real-world networks, this method is expected to perform well for the task of detecting closeness centralities, as well.

Proposed Method
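In this section, we introduce the proposed framework in the following steps: (1) defining a new ego-centric centrality measure; (2) computing this measure at each node in a distributed manner (score computation subroutine); (3) aggregating the computed scores through compressive sensing measurements (score aggregation subroutine).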

Proposed Local Metric
We introduce the h-hop ego-centric (local) closeness centrality of node v as: where $B_\tau(v)$ indicates the set of nodes whose shortest distance from node v is exactly $\tau$. The intuition behind this metric is that nodes farther from v have less effect on the dissemination of goods (e.g. information) originating from it.

Score Computation Subroutine
The computation of the sets $B_\tau(v)$ for $\tau \le h$, $\forall v \in V$, can be done by executing a breadth-first search (BFS) process at each node in parallel, with an exploration radius of h. This requires a computational cost of at most $O(\Delta^h)$, where $\Delta$ is the maximum degree of the network. The required memory storage at each node is also $O(\Delta^h)$.
The computed sets can be utilized to evaluate the ego closeness centrality at each node in a distributed and decentralized manner, with O(1) computational and storage cost per node. Thus the ego-closeness computation consists of the following steps: (i) Each node $v \in V$ runs a BFS with exploration radius h to obtain the sets $B_i(v)$ for i ranging from 1 to h. (ii) Once $B_i(v)$ is available for each node $v \in V$, i ranging from 1 to h, one can easily compute the ego-closeness centrality metric based on Equation (9). This step can also be executed in a decentralized fashion for each node independently. The pseudo-code for this subroutine is in Algorithm 2.
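The two steps can be sketched as follows: a radius-limited BFS collects the layers of nodes at each exact distance from v, and the layer sizes are then combined into a score. Since the exact weighting of Equation (9) is not reproduced in this sketch, the `weight` argument is a hypothetical placeholder; the inverse-distance weight used below is an assumption chosen only because it decreases with distance, in line with the stated intuition.

```python
from collections import deque

def bfs_layers(adj, v, h):
    """Return {tau: set of nodes at distance exactly tau from v} for tau = 1..h."""
    dist = {v: 0}
    layers = {tau: set() for tau in range(1, h + 1)}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue                      # do not explore beyond radius h
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                layers[dist[w]].add(w)
                queue.append(w)
    return layers

def ego_score(adj, v, h, weight):
    """Combine layer sizes with a distance-dependent weight.

    `weight` is a placeholder for the weighting in Equation (9).
    """
    layers = bfs_layers(adj, v, h)
    return sum(weight(tau) * len(nodes) for tau, nodes in layers.items())

def inverse_distance(tau):
    """Hypothetical weight, NOT taken from the paper."""
    return 1.0 / tau

# Hypothetical example: the path graph 0-1-2-3-4 with h = 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = {v: ego_score(path, v, 2, inverse_distance) for v in path}
```

Each node only touches its own h-hop neighborhood, so the per-node loops can run independently, which is what allows the computation to be decentralized.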

Score Aggregation Subroutine
The proposed compressive sensing-based method for aggregating the computed ego-centric metric is depicted in Algorithm 3, which contains the following steps: (i) The first node $v_{first}$ is added to the visited set S and all of its neighbors are added to the neighbor set N(S).
(ii) The next node is selected from the nodes in N(S) relative to $egoC_h(v_{next})$; these scores were already computed in the previous subroutine.
(iii) The selected next node is added to the visited set S and it is removed from the neighbor set N (S), then its neighbors are added to the neighbor set N (S).
(iv) Steps (i)-(iii) are carried out l times, where l is the length of a measurement, to generate a new row of the matrix A and a new entry of the vector y.
(v) Step (iv) is repeated m times (in parallel) to construct a feasible measurement matrix A with m measurements and the corresponding measurement vector y.
(vi) To find the sparse approximation $\hat{x}$ of $x$, we optimize the LASSO objective function subject to the linear sketch $y = Ax$, based on Equation (7).
In this algorithm, we have m parallel aggregation processes, where each is to be started from a node selected uniformly at random from V . The random seeds to Moreover and Facebook is about 5000, that is much smaller than their network size [11]. This shows that our approach is practically efficient and scalable on real-world networks.
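The construction of A and y in steps (i)-(v) can be sketched as below. The sketch makes several assumptions that go beyond the text: "selected relative to the ego score" is interpreted as sampling a frontier node with probability proportional to its (positive) score, a measurement stops once l distinct nodes are visited, the m measurements are generated sequentially rather than in parallel, and the graph has no self-loops. All names and the example data are hypothetical.

```python
import random

def build_measurement(adj, scores, length, rng):
    """Grow a connected visited set S of up to `length` nodes."""
    first = rng.choice(sorted(adj))            # start node chosen uniformly at random
    visited = [first]
    in_visited = {first}
    frontier = set(adj[first])                 # the neighbor set N(S)
    while len(visited) < length and frontier:
        candidates = sorted(frontier)
        weights = [scores[v] for v in candidates]
        # Assumed rule: pick the next node with probability proportional to its score.
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        frontier.discard(nxt)
        visited.append(nxt)
        in_visited.add(nxt)
        frontier.update(w for w in adj[nxt] if w not in in_visited)
    return visited

def build_system(adj, scores, m, length, seed=0):
    """Return the measurement matrix A (list of 0/1 rows) and the vector y."""
    rng = random.Random(seed)
    order = sorted(adj)
    index = {v: j for j, v in enumerate(order)}
    A, y = [], []
    for _ in range(m):
        visited = build_measurement(adj, scores, length, rng)
        row = [0] * len(order)
        for v in visited:
            row[index[v]] = 1
        A.append(row)
        y.append(sum(scores[v] for v in visited))   # additive aggregation
    return A, y

# Hypothetical graph and precomputed local scores.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
scores = {0: 2, 1: 2, 2: 2, 3: 3, 4: 2, 5: 1}
A, y = build_system(adj, scores, m=4, length=3, seed=1)
```

Every node added to S is adjacent to a node already in S, so each row corresponds to a connected induced sub-graph. The resulting pair (A, y) is what a LASSO solver would take as input in step (vi).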

Experimental Evaluation
In this section, we experimentally evaluate the performance of the proposed method in various scenarios over both synthetic and real-world networks. We first introduce the networks used for the evaluation. Then, we explain the settings of the experiments. Finally, the achieved results for each test scenario and their analyses are presented.

Datasets
For the evaluation of the proposed method, we considered both synthetic and real networks. We summarize the properties of the real-world networks used in the experiments in Table 1. The four notations $\langle deg \rangle$, $\langle C \rangle$, $D$, and $\delta_{0.9}$ represent the "average degree", "average clustering coefficient", "network diameter", and "90-percentile effective diameter", respectively. In the case of a disconnected network, we extracted the largest (strongly) connected component.
We also considered three well-known models (i.e. Barabási-Albert (BA), Erdős-Rényi (ER), and Watts-Strogatz (SW)) for generating synthetic networks. We have summarized these networks in Table 2. In the ER network, the link existence probability p = 0.01 ensures that the generated network is connected, as $p > \frac{\ln |V|}{|V|}$ is a sharp threshold for the connectedness of ER networks with |V| vertices.
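The threshold condition can be checked numerically with a few lines of Python. The network sizes below are hypothetical (the sizes in Table 2 are not reproduced here), and because the threshold is an asymptotic statement, such a check indicates rather than guarantees that a particular sampled graph is connected.

```python
import math

def er_connectivity_threshold(num_nodes):
    """The value ln|V| / |V| that p should exceed."""
    return math.log(num_nodes) / num_nodes

def above_threshold(p, num_nodes):
    """True if the link probability p lies above the connectivity threshold."""
    return p > er_connectivity_threshold(num_nodes)

# Hypothetical network sizes, checked against p = 0.01.
check_large = above_threshold(0.01, 1000)   # ln(1000)/1000 is about 0.0069
check_small = above_threshold(0.01, 500)    # ln(500)/500 is about 0.0124
```

With p = 0.01, the condition holds for a 1000-node network but not for a 500-node one, so the choice of p has to be read together with the network size.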

Settings
To evaluate the accuracy of the proposed method (CS-HiClose) compared to the competing methods in identifying top-k closeness centrality nodes, we measured the  (7)) as an objective function, and is extremely quick by leveraging the power of GPUs. For example [31], it can solve the LASSO objective on a graph of 100,000 nodes with 10,000 measurements in only 21s on a single Nvidia K40 GPU. For computations of the global closeness centrality in Equation (1), we used available tools in Python-iGraph package.

Correlation between Our ego-Closeness and the Global Closeness
We experimentally analyzed the correlation between the proposed ego-centric (local) centrality metric and the global closeness centrality over several synthetic and real-world networks. To compare these two centrality metrics, we used the Pearson product-moment correlation coefficient (ρ), which measures the strength of a linear association between two variables and is defined as [32]:

$$\rho(X, Y) = \frac{\sum_{i=1}^{|V|} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{|V|} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{|V|} (y_i - \bar{y})^2}}$$

where |V| is the number of network nodes, $x_i$ and $y_i$ are the two scores of node i, and $\bar{x}$ and $\bar{y}$ are their means. A value of 0 shows that there is not any association, a value greater than 0 indicates a positive association, and a value less than 0 indicates a negative association.

Tables 3 and 4 report the correlation coefficients between the compared local metrics, including the proposed ego-centric centrality measure, all with h = 2, and the global closeness centrality on synthetic and real-world networks. In this experiment, we mainly focus on high sparsity levels k = {0.1|V|, 0.2|V|, 0.3|V|, 0.4|V|}. After implementing DistEst [19], we found that the values computed for this metric critically depend on the initialization of its parameters (e.g. each node should have an estimate of its closeness value, which is an unrealistic assumption). Moreover, this metric needs a very large number of message-passing iterations to converge. To have a fair comparison, we set the same number of iterations as for our metric, but its correlation coefficients were around 0, so the results for this metric were excluded.
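For reference, the Pearson product-moment coefficient used in this comparison can be computed in plain Python as in the sketch below. The function name and the toy score vectors are assumptions; in practice the two inputs would be the local and global centrality scores of the same nodes, listed in the same order.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: a perfectly increasing and a perfectly decreasing relationship.
rho_pos = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
rho_neg = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The same helper can be applied to the logarithms of positive scores when a monotone but non-linear relationship is suspected.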
The results show that Dist-Exact for h = 2 has a linear correlation, but a negative association, with the closeness centrality in networks with various levels of sparsity. One can observe that our proposed metric almost always has the best correlation coefficient compared to the other metrics. Another interesting observation in Tables 3 and 4 is that our ego-centric metric has a lower correlation coefficient with the global closeness centrality on the networks (i.e. ca-CondMat, ca-HepTh, and DBLP) with relatively small average degree, small average clustering coefficient, and large network diameter (both full and 90-percentile).
To further analyze the correlation between the proposed ego-centric (local) metric and the global closeness centrality, Figure 2 shows scatter plots of all nodes' ranks under one metric versus the other, on various networks. Each point in the figure corresponds to a node's rank under these two metrics. Based on the results of the previous test cases, we calculated our local measure for h = 2 to have low computational complexity, yet high accuracy. One can easily observe the linear correlation and positive association (as the rank with respect to the local metric increases, so does the rank with respect to the global metric), especially for the top-k nodes' ranks, which are the target of this paper. As in Tables 3 and 4, our metric has a relatively lower correlation with the global closeness centrality on the ca-CondMat, ca-HepTh, and DBLP networks, which share properties such as small average degree, small clustering coefficient, and large network diameter.
Although the Pearson product-moment correlation coefficient is the most common and almost exclusively used measure for correlation studies of centrality indices, it does not adequately capture non-linear dependencies. Moreover, assuming only a linear correlation between two scores is a strong assumption and may not be realistic.
A common workaround to depict some of the existing non-linear dependencies is to apply the Pearson correlation to the logarithms of the original scores; it is mainly used for illustrative purposes [33]. Table 5 is similar to Table 3, but shows the Pearson correlation between the logarithms of the proposed ego-closeness (with h = 2) and the global closeness scores. The result suggests that our proposed ego-centric metric not only has a high positive linear association (as inferred from Table 3) but also demonstrates a very high positive non-linear association with the global closeness centrality.

Running Time Comparison
Table 6 compares the running times of the local metrics. Note that in the distributed and decentralized setting that we considered here, each node in the network begins executing a process to compute its corresponding local metric based on its visible neighborhood radius. Each node's process runs independently of the other nodes' processes. The distributed running time that we report for a metric on a network is equal to the longest execution time among all network nodes' processes for the computation of the desired local metric. Table 6 shows that our proposed metric is the fastest local measure to compute in a decentralized manner over all synthetic networks.

Effect of Sparsity Level k on Accuracy:
are the same. The higher the F-measure, the stronger the agreement between the top-k nodes identified by a method and the top-k nodes with respect to the global closeness centrality.
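A top-k F-measure of this kind can be computed as in the sketch below, which compares the k highest-scoring nodes under a method with the k highest-scoring nodes under the global closeness centrality. The function names, the tie-breaking by sort order, and the toy scores are assumptions for illustration.

```python
def top_k(scores, k):
    """Set of the k nodes with the highest scores (ties broken by sort order)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def f_measure(predicted, actual):
    """Harmonic mean of precision and recall between two node sets."""
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)

# Toy scores: the method and the ground truth agree on two of the top three nodes.
estimated = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
global_closeness = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 0.7}
score = f_measure(top_k(estimated, 3), top_k(global_closeness, 3))
```

When both sets contain exactly k nodes, precision and recall coincide, so the F-measure equals the fraction of true top-k nodes that were recovered.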

Effect of Number of Measurements m on Accuracy:
The accuracy of CS-HiClose is compared to the existing CS-based methods in terms of F-measure for a varying number of measurements, with the measurement length (l) set to 0.25|V| and the sparsity (k) set to 0.15|V| in a network with |V| nodes. For DICeNod, l is determined based on m and k. Figure 4 shows that CS-HiClose outperforms the competing methods, achieving a higher F-measure for almost all numbers of measurements. Moreover, our method is more accurate even with a small number of measurements. This improvement can be very important in situations where performing measurements has a high computational cost [16,34].

Figure 5 shows the measurement length l divided by the total number of network nodes |V| (i.e. l/|V|). This experiment is performed over a network with |V| nodes, where the number of measurements is set to m = 0.4|V| and the sparsity level is set to k = 0.2|V| for all methods. We repeated each test 10 times to reduce the effect of the methods' randomness, and the points in the figures show the mean value over these repetitions. In Figure 5, we can observe an increasing trend in the F-measure of CS-HiClose as the measurement length increases.

Conclusion
Closeness centrality has been utilized as a primary metric to measure the relative importance/influence of nodes in a given network. In this paper, we introduced a new ego-centric metric which has very low computational cost and correlates well with the global closeness centrality. Then, we proposed a compressive sensing framework for distributed detection of top-k central nodes based on the ego-closeness metric using only indirect measurements. Extensive experimental evaluations on both synthetic and real networks demonstrated that the proposed method outperforms the best existing methods in efficiently detecting high closeness centrality nodes, in terms of having a high F-measure and low complexity.