Dynamic network sampling for community detection
Applied Network Science volume 8, Article number: 5 (2023)
Abstract
We propose a dynamic network sampling scheme to optimize block recovery for the stochastic blockmodel in the case where it is prohibitively expensive to observe the entire graph. Theoretically, we justify our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance of our method, in terms of block recovery, on several real datasets from different domains. Both the theoretical and the practical results suggest that our method can identify the vertices that have the most impact on block structure, so that one need only check whether there are edges between them, saving significant resources while still recovering the block structure.
Introduction
In network inference applications, it is important to detect community structure, i.e., to cluster vertices into potential blocks. However, in many cases it can be prohibitively expensive to observe the entire graph, especially for large graphs. For example, consider a network where vertices represent landline phones and edges represent whether there is a call between two landline phones. Depending on the size of the network, in terms of the number of vertices, it can be extremely expensive to check whether there is a call for every pair of landline phones. Suppose instead that one can use the information carried by a partially observed graph, in which only a small number of landline phone pairs are verified, to identify the landline phones that may play a more important role in forming communities. Then, given limited resources, one can choose to check only whether there are calls between those landline phone pairs to achieve the goal of detecting potential block structure. It thus becomes essential to identify the vertices that have the most impact on block structure and to check only whether there are edges between them, saving significant resources while still recovering the block structure.
Many classical methods only consider the adjacency or Laplacian matrices for community detection (Fortunato and Hric 2016). By contrast, vertex covariates can also be taken into consideration for the inference. These covariate-aware methods rely on either variational methods (Choi et al. 2012; Roy et al. 2019; Sweet 2015) or spectral approaches (Binkiewicz et al. 2017; Huang and Feng 2018; Mele et al. 2022; Mu et al. 2022). However, none of them focus on the problem of clustering vertices for partially observed graphs. To address this issue, existing methods propose different types of random and adaptive sampling strategies to minimize the information loss from the data reduction (Yun and Proutiere 2014; Purohit et al. 2017).
We propose a dynamic network sampling scheme to optimize block recovery for the stochastic blockmodel (SBM) when we only have limited resources to check whether there are edges between certain selected vertices. The innovation of our approach is the application of Chernoff information. To our knowledge, this is the first time that it has been applied to network sampling problems. Motivated by the Chernoff analysis, we not only propose a dynamic network sampling scheme to optimize block recovery, but also provide the framework and justification for using Chernoff information in subsequent inference for graphs.
The structure of this article is summarized as follows. Section 2 reviews relevant models for random graphs and the basic idea of spectral methods. Section 3 introduces the notion of Chernoff analysis for analytically measuring the performance of block recovery. Section 4 includes our dynamic network sampling scheme and theoretical results. Section 5 provides simulations and real data experiments to measure the algorithms’ performance in terms of actual block recovery results. Section 6 discusses the findings and presents some open questions for further investigation. Appendix provides technical details for our theoretical results.
Models and spectral methods
In this work, we are interested in the inference task of block recovery (community detection). To model the block structure in edge-independent random graphs, we focus on the SBM and the generalized random dot product graph (GRDPG).
Definition 1
(Generalized Random Dot Product Graph Rubin-Delanchy et al. 2022) Let \({\textbf{I}}_{d_+ d_-} = {\textbf{I}}_{d_+} \bigoplus \left( -{\textbf{I}}_{d_-} \right)\) with \(d_+ \ge 1\) and \(d_- \ge 0\). Let F be a d-dimensional inner product distribution with \(d = d_+ + d_-\) on \({\mathcal {X}} \subset {\mathbb {R}}^d\) satisfying \({\textbf{x}}^\top {\textbf{I}}_{d_+ d_-} {\textbf{y}} \in [0, 1]\) for all \({\textbf{x}}, {\textbf{y}} \in {\mathcal {X}}\). Let \({\textbf{A}}\) be an adjacency matrix and \({\textbf{X}} = [{\textbf{X}}_1, \cdots , {\textbf{X}}_n]^\top \in {\mathbb {R}}^{n \times d}\) where \({\textbf{X}}_i \sim F\), i.i.d. for all \(i \in \{ 1, \cdots , n \}\). Then we say \(({\textbf{A}}, {\textbf{X}}) \sim \text {GRDPG}(n, F, d_+, d_-)\) if for any \(i, j \in \{ 1, \cdots , n \}\)
\[ {\textbf{A}}_{ij} \sim \text {Bernoulli}\left( {\textbf{X}}_i^\top {\textbf{I}}_{d_+ d_-} {\textbf{X}}_j \right) . \tag{1} \]
Definition 2
(K-block Stochastic Blockmodel Graph Holland et al. 1983) The K-block stochastic blockmodel (SBM) graph is an edge-independent random graph with each vertex belonging to one of K blocks. It can be parametrized by a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) summing to unity. Let \({\textbf{A}}\) be an adjacency matrix and \(\varvec{\tau }\) be a vector of block assignments with \(\tau _i = k\) if vertex i is in block k (occurring with probability \(\pi _k\)). We say \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) if for any \(i, j \in \{ 1, \cdots , n \}\)
\[ {\textbf{A}}_{ij} \sim \text {Bernoulli}\left( {\textbf{B}}_{\tau _i \tau _j} \right) . \tag{2} \]
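As a rough illustration of Definition 2, the following sketch draws one undirected, loop-free SBM graph with NumPy. The function name `sample_sbm` and the 2-block matrix `B_demo` are hypothetical and chosen only for this example; they do not come from the paper.

```python
import numpy as np

def sample_sbm(n, B, pi, rng):
    """Draw (A, tau) from a K-block SBM (undirected, no self-loops)."""
    B = np.asarray(B, dtype=float)
    K = B.shape[0]
    # tau[i] = block of vertex i, drawn with probabilities pi
    tau = rng.choice(K, size=n, p=pi)
    # P[i, j] = B[tau_i, tau_j] is the edge probability for the pair (i, j)
    P = B[tau][:, tau]
    # Independent Bernoulli draws for i < j, then symmetrize
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(int)
    return A, tau

rng = np.random.default_rng(0)
B_demo = np.array([[0.5, 0.2],
                   [0.2, 0.4]])   # hypothetical connectivity matrix
A, tau = sample_sbm(200, B_demo, [0.5, 0.5], rng)
```

The dense \(n \times n\) draw is only practical for small n; a sparse sampler would be needed at the graph sizes used in the experiments.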
Remark 1
The SBM is a special case of the GRDPG model. Let \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) as in Definition 2 where \({\textbf{B}} \in (0, 1)^{K \times K}\) with \(d_+\) strictly positive eigenvalues and \(d_-\) strictly negative eigenvalues. To represent this SBM in the GRDPG model, we can choose \(\varvec{\nu }_1, \cdots , \varvec{\nu }_K \in {\mathbb {R}}^d\) where \(d = d_+ + d_-\) such that \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell }\) for all \(k, \ell \in \{ 1, \cdots , K \}\). For example, we can take \(\varvec{\nu } = {\textbf{U}}_B |{\textbf{S}}_B|^{1/2}\) where \({\textbf{B}} = {\textbf{U}}_B {\textbf{S}}_B {\textbf{U}}_B^\top\) is the spectral decomposition of \({\textbf{B}}\) after re-ordering. Then we have the latent position of vertex i as \({\textbf{X}}_i = \varvec{\nu }_k\) if \(\tau _i = k\).
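The construction in Remark 1 can be checked numerically. The sketch below assumes a hypothetical 2-block matrix `B_demo` with one positive and one negative eigenvalue, and the helper name `latent_positions` is ours; it builds \(\varvec{\nu }\) from the spectral decomposition and verifies \(\varvec{\nu } {\textbf{I}}_{d_+ d_-} \varvec{\nu }^\top = {\textbf{B}}\).

```python
import numpy as np

def latent_positions(B, tol=1e-10):
    """Return (nu, I_pq) with nu @ I_pq @ nu.T == B, rows of nu being nu_k."""
    B = np.asarray(B, dtype=float)
    evals, evecs = np.linalg.eigh(B)
    # Drop numerically zero eigenvalues, so d = d_plus + d_minus
    keep = np.abs(evals) > tol
    evals, evecs = evals[keep], evecs[:, keep]
    # Re-order: positive eigenvalues first, negative ones last
    order = np.argsort(-evals)
    evals, evecs = evals[order], evecs[:, order]
    d_plus = int((evals > 0).sum())
    d_minus = int((evals < 0).sum())
    nu = evecs * np.sqrt(np.abs(evals))          # U_B |S_B|^{1/2}
    I_pq = np.diag(np.r_[np.ones(d_plus), -np.ones(d_minus)])
    return nu, I_pq

B_demo = np.array([[0.2, 0.5],
                   [0.5, 0.2]])   # hypothetical; eigenvalues 0.7 and -0.3
nu, I_pq = latent_positions(B_demo)
reconstructed = nu @ I_pq @ nu.T
```

Here `reconstructed` agrees with `B_demo` up to floating-point error, and `I_pq` is `diag(1, -1)` because this example has \(d_+ = d_- = 1\).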
The parameters of the models can be estimated via spectral methods (Von Luxburg 2007), which have been widely used in random graph models for community detection (Lyzinski et al. 2014, 2016; McSherry 2001; Rohe et al. 2011). Two particular spectral embedding methods, adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE), are popular since they enjoy nice properties including consistency (Sussman et al. 2012) and asymptotic normality (Athreya et al. 2016; Tang and Priebe 2018).
Definition 3
(Adjacency Spectral Embedding) Let \({\textbf{A}} \in \{0, 1 \}^{n \times n}\) be an adjacency matrix with eigendecomposition \({\textbf{A}} = \sum _{i=1}^{n} \lambda _i {\textbf{u}}_i {\textbf{u}}_i^\top\) where \(|\lambda _1| \ge \cdots \ge |\lambda _n|\) are the magnitude-ordered eigenvalues and \({\textbf{u}}_1, \cdots , {\textbf{u}}_n\) are the corresponding orthonormal eigenvectors. Given the embedding dimension \(d < n\), the adjacency spectral embedding (ASE) of \({\textbf{A}}\) into \({\mathbb {R}}^d\) is the \(n \times d\) matrix \(\mathbf {{\widehat{X}}} = {\textbf{U}}_A |{\textbf{S}}_A|^{1/2}\) where \({\textbf{S}}_A = \text {diag}(\lambda _1, \ldots , \lambda _d)\) and \({\textbf{U}}_A = [{\textbf{u}}_1 | \cdots | {\textbf{u}}_d]\).
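A minimal sketch of Definition 3 using a dense eigendecomposition is given below. The function name is ours, and the rank-one matrix `P_demo` is a hypothetical noiseless stand-in for an adjacency matrix, used only so that the algebra can be checked exactly.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """ASE of a symmetric matrix A into R^d: U_A |S_A|^{1/2}."""
    A = np.asarray(A, dtype=float)
    evals, evecs = np.linalg.eigh(A)
    # Keep the d eigenvalues of largest magnitude (not largest value)
    top = np.argsort(-np.abs(evals))[:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

# Hypothetical rank-one "expected adjacency" matrix x x^T
x = np.array([0.5, 0.5, 0.8, 0.8])
P_demo = np.outer(x, x)
X_hat = adjacency_spectral_embedding(P_demo, d=1)
```

For this noiseless input, `X_hat @ X_hat.T` recovers `P_demo`. For large sparse graphs a truncated solver such as `scipy.sparse.linalg.eigsh` would normally replace the dense `eigh` call.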
Remark 2
There are different methods for choosing the embedding dimension (Hastie et al. 2009; Jolliffe and Cadima 2016); we adopt the simple and efficient profile likelihood method (Zhu and Ghodsi 2006) to automatically identify the “elbow”, which is the cut-off between the signal dimensions and the noise dimensions in the scree plot.
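The idea behind a profile likelihood elbow can be sketched as follows. This is a simplified variant written for illustration, not necessarily the exact procedure used in the paper: for each candidate cut q, the ordered values are split into two groups modelled as Gaussians with separate means and a pooled variance, and the cut with the largest log-likelihood is returned. The function name and the demo values are hypothetical.

```python
import numpy as np

def profile_likelihood_elbow(values):
    """Return q such that the q largest values are treated as signal."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]   # decreasing order
    p = x.size
    if p < 3:
        raise ValueError("need at least three values")
    best_q, best_ll = 1, -np.inf
    for q in range(1, p):
        left, right = x[:q], x[q:]
        # Within-group sum of squares around each group's own mean
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        var = max(ss / (p - 2), 1e-12)                   # pooled variance
        ll = -0.5 * p * np.log(2 * np.pi * var) - 0.5 * ss / var
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

# Hypothetical scree values with a clear gap after the third entry
elbow = profile_likelihood_elbow([10.0, 9.5, 9.0, 1.0, 0.9, 0.8, 0.7])
```

In this example the cut falls after the third value, so `elbow` is 3.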
Chernoff analysis
To analytically measure the performance of algorithms for block recovery, we consider the notion of Chernoff information among other possible metrics. Chernoff information has the advantage of being independent of the clustering procedure, i.e., it can be derived no matter which clustering method is used, and it is intrinsically related to the Bayes risk (Tang and Priebe 2018; Athreya et al. 2017; Karrer and Newman 2011).
Definition 4
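(Chernoff Information Chernoff 1952, 1956) Let \(F_1\) and \(F_2\) be two continuous multivariate distributions on \({\mathbb {R}}^d\) with density functions \(f_1\) and \(f_2\). The Chernoff information is defined as
\[ C(F_1, F_2) = - \log \left[ \inf _{t \in (0, 1)} \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) \, d{\textbf{x}} \right] = \sup _{t \in (0, 1)} \left[ - \log \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) \, d{\textbf{x}} \right] . \tag{3} \]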
Remark 3
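Consider the special case where we take \(F_1 = {\mathcal {N}}(\varvec{\mu }_1, \varvec{\Sigma }_1)\) and \(F_2 = {\mathcal {N}}(\varvec{\mu }_2, \varvec{\Sigma }_2)\); then the corresponding Chernoff information is
\[ C(F_1, F_2) = \sup _{t \in (0, 1)} \left[ \frac{t(1-t)}{2} (\varvec{\mu }_1 - \varvec{\mu }_2)^\top \varvec{\Sigma }_t^{-1} (\varvec{\mu }_1 - \varvec{\mu }_2) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_t|}{|\varvec{\Sigma }_1|^t |\varvec{\Sigma }_2|^{1-t}} \right] , \tag{4} \]
where \(\varvec{\Sigma }_t = t \varvec{\Sigma }_1 + (1-t) \varvec{\Sigma }_2\).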
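The Gaussian case in Remark 3 can be evaluated numerically. The sketch below assumes the standard closed form for two Gaussians, namely the supremum over \(t \in (0, 1)\) of \(\frac{t(1-t)}{2} (\varvec{\mu }_1 - \varvec{\mu }_2)^\top \varvec{\Sigma }_t^{-1} (\varvec{\mu }_1 - \varvec{\mu }_2) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_t|}{|\varvec{\Sigma }_1|^t |\varvec{\Sigma }_2|^{1-t}}\), and approximates the supremum by a grid search. The function name, the grid size, and the demo inputs are our own choices.

```python
import numpy as np

def chernoff_gaussian(mu1, S1, mu2, S2, grid=999):
    """Grid approximation of the Chernoff information between two Gaussians."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    diff = mu1 - mu2
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    best = -np.inf
    # Interior grid points of (0, 1); t = 0.5 is included when grid is odd
    for t in np.linspace(0.0, 1.0, grid + 2)[1:-1]:
        St = t * S1 + (1 - t) * S2
        quad = 0.5 * t * (1 - t) * diff @ np.linalg.solve(St, diff)
        _, logdet_t = np.linalg.slogdet(St)
        log_term = 0.5 * (logdet_t - t * logdet1 - (1 - t) * logdet2)
        best = max(best, quad + log_term)
    return best

# Equal covariances: the log term vanishes and the optimum is at t = 1/2
c_demo = chernoff_gaussian([0.0, 0.0], np.eye(2), [2.0, 0.0], np.eye(2))
```

With equal covariances the value reduces to one eighth of the squared Mahalanobis distance between the means, so `c_demo` equals 0.5 up to floating-point error.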
The comparison of block recovery via Chernoff information is based on the statistical information between the limiting distributions of the blocks: smaller statistical information implies less information with which to discriminate between different blocks of the SBM. To that end, we also review the limiting results of ASE for the SBM, which are essential for investigating Chernoff information.
Theorem 1
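(CLT of ASE for SBM Rubin-Delanchy et al. 2022) Let \(({\textbf{A}}^{(n)}, {\textbf{X}}^{(n)}) \sim \text {GRDPG}(n, F, d_+, d_-)\) be a sequence of adjacency matrices and associated latent positions of a d-dimensional GRDPG as in Definition 1 from an inner product distribution F where F is a mixture of K point masses in \({\mathbb {R}}^d\), i.e.,
\[ F = \sum _{k=1}^{K} \pi _k \delta _{\varvec{\nu }_k} \quad \text {with} \quad \pi _k > 0 \; \text {for all } k \quad \text {and} \quad \sum _{k=1}^{K} \pi _k = 1 , \tag{5} \]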
where \(\delta _{\varvec{\nu }_k}\) is the Dirac delta measure at \(\varvec{\nu }_k\). Let \(\Phi ({\textbf{z}}, \varvec{\Sigma })\) denote the cumulative distribution function (CDF) of a multivariate Gaussian distribution with mean \({\varvec{0}}\) and covariance matrix \(\varvec{\Sigma }\), evaluated at \({\textbf{z}} \in {\mathbb {R}}^d\). Let \(\mathbf {{\widehat{X}}}^{(n)}\) be the ASE of \({\textbf{A}}^{(n)}\) with \(\mathbf {{\widehat{X}}}^{(n)}_i\) as the i-th row (same for \({\textbf{X}}^{(n)}_i\)). Then there exists a sequence of matrices \({\textbf{M}}_n \in {\mathbb {R}}^{d \times d}\) satisfying \({\textbf{M}}_n {\textbf{I}}_{d_+ d_-} {\textbf{M}}_n^\top = {\textbf{I}}_{d_+ d_-}\) such that for all \({\textbf{z}} \in {\mathbb {R}}^d\) and fixed index i,
\[ {\mathbb {P}}\left\{ \sqrt{n} \left( {\textbf{M}}_n \mathbf {{\widehat{X}}}^{(n)}_i - {\textbf{X}}^{(n)}_i \right) \le {\textbf{z}} \; \Big | \; {\textbf{X}}^{(n)}_i = \varvec{\nu }_k \right\} \rightarrow \Phi ({\textbf{z}}, \varvec{\Sigma }(\varvec{\nu }_k)) , \tag{6} \]
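where for \(\varvec{\nu } \sim F\)
\[ \varvec{\Sigma }(\varvec{\nu }_k) = {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} {\mathbb {E}}\left[ \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \left( 1 - \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \varvec{\nu } \varvec{\nu }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \tag{7} \]
with
\[ \varvec{\Delta } = {\mathbb {E}}\left[ \varvec{\nu } \varvec{\nu }^\top \right] . \tag{8} \]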
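For a K-block SBM, let \({\textbf{B}} \in (0, 1)^{K \times K}\) be the block connectivity probability matrix and \(\varvec{\pi } \in (0, 1)^K\) be the vector of block assignment probabilities. Given an n vertex instantiation of the SBM parameterized by \({\textbf{B}}\) and \(\varvec{\pi }\), for sufficiently large n, the large sample optimal error rate for estimating the block assignments using ASE can be measured via Chernoff information as (Tang and Priebe 2018; Athreya et al. 2017)
\[ \rho = \min _{k \ne \ell } \sup _{t \in (0, 1)} \left[ \frac{n t (1-t)}{2} (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_{k\ell }(t)|}{|\varvec{\Sigma }_k|^t |\varvec{\Sigma }_\ell |^{1-t}} \right] , \tag{9} \]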
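where \(\varvec{\Sigma }_{k\ell }(t) = t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell\), \(\varvec{\Sigma }_k = \varvec{\Sigma }(\varvec{\nu }_k)\) and \(\varvec{\Sigma }_\ell = \varvec{\Sigma }(\varvec{\nu }_\ell )\) are defined as in Eq. (7). Also note that as \(n \rightarrow \infty\), the logarithm term in Eq. (9) will be dominated by the other term. Then we have the approximate Chernoff information as
\[ \rho \approx \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) , \tag{10} \]
where
\[ C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) = \sup _{t \in (0, 1)} \left[ t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) \right] . \tag{11} \]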
We also introduce the following two notions, which will be used when we describe our dynamic network sampling scheme.
Definition 5
(Chernoff-active Blocks) For a K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), the Chernoff-active blocks \((k^*, \ell ^*)\) are defined as
\[ (k^*, \ell ^*) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) , \tag{12} \]
where \(C_{k ,\ell }({\textbf{B}}, \varvec{\pi })\) is defined as in Eq. (10).
Definition 6
(Chernoff Superiority) For K-block SBMs, consider two block connectivity probability matrices \({\textbf{B}}, {\textbf{B}}^\prime \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\). Let \(\rho _B\) and \(\rho _{B^\prime }\) denote the Chernoff information obtained as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively. We say that \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), denoted as \({\textbf{B}} \succ {\textbf{B}}^\prime\), if \(\rho _B > \rho _{B^\prime }\).
Remark 4
If \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), then we can have a better block recovery from \({\textbf{B}}\) than \({\textbf{B}}^\prime\). In addition, Chernoff superiority is transitive, which is straightforward from the definition.
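The quantities in this section can be computed directly from \({\textbf{B}}\) and \(\varvec{\pi }\). The sketch below makes several assumptions that should be checked against the cited sources: it takes the block covariance to be \(\varvec{\Sigma }(\varvec{\nu }_k) = {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} {\mathbb {E}}[(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu })(1 - \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }) \varvec{\nu } \varvec{\nu }^\top ] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-}\) with \(\varvec{\Delta } = {\mathbb {E}}[\varvec{\nu } \varvec{\nu }^\top ]\), takes \(C_{k ,\ell }\) to be the supremum over t of \(t(1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell )\), and replaces the supremum by a grid search. All function names and the 3-block matrix `B_demo` are hypothetical.

```python
import itertools
import numpy as np

def block_covariances(B, pi):
    """Latent positions nu_k (rows) and limiting covariances Sigma(nu_k)."""
    B, pi = np.asarray(B, dtype=float), np.asarray(pi, dtype=float)
    evals, evecs = np.linalg.eigh(B)
    keep = np.abs(evals) > 1e-10
    evals, evecs = evals[keep], evecs[:, keep]
    nu = evecs * np.sqrt(np.abs(evals))
    I_pq = np.diag(np.sign(evals))          # signature matching the columns of nu
    delta_inv = np.linalg.inv((nu.T * pi) @ nu)     # inverse of E[nu nu^T]
    covs = []
    for k in range(len(pi)):
        # nu_k^T I nu_l = B[k, l], so the expectation is a weighted sum over blocks
        w = pi * B[k] * (1 - B[k])
        covs.append(I_pq @ delta_inv @ ((nu.T * w) @ nu) @ delta_inv @ I_pq)
    return nu, covs

def pairwise_chernoff(B, pi, grid=199):
    """Grid approximation of C_{k,l}(B, pi) for every pair k < l."""
    nu, covs = block_covariances(B, pi)
    ts = np.linspace(0.0, 1.0, grid + 2)[1:-1]
    out = {}
    for k, l in itertools.combinations(range(len(pi)), 2):
        diff = nu[k] - nu[l]
        out[(k, l)] = max(
            t * (1 - t) * diff @ np.linalg.solve(t * covs[k] + (1 - t) * covs[l], diff)
            for t in ts
        )
    return out

def chernoff_active_blocks(B, pi):
    """Pair of blocks attaining the minimum, and the minimum itself (rho)."""
    C = pairwise_chernoff(B, pi)
    pair = min(C, key=C.get)
    return pair, C[pair]

B_demo = np.array([[0.6, 0.2, 0.1],
                   [0.2, 0.5, 0.1],
                   [0.1, 0.1, 0.4]])        # hypothetical values
pi_demo = np.array([1 / 3, 1 / 3, 1 / 3])
active_pair, rho_B = chernoff_active_blocks(B_demo, pi_demo)
_, rho_half = chernoff_active_blocks(0.5 * B_demo, pi_demo)
b_is_superior = rho_B > rho_half            # Chernoff superiority of B over 0.5 B
```

The final comparison illustrates Definition 6 on one pair of matrices; the zero-based indices in `active_pair` correspond to blocks \(k^*\) and \(\ell ^*\) shifted by one.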
Dynamic network sampling
We start our analysis with the unobserved block connectivity probability matrix \({\textbf{B}}\) for SBM and then illustrate how to migrate the proposed methods for real applications when we have the observed adjacency matrix \({\textbf{A}}\).
Consider the K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) with \(K > 2\). Given the initial sampling parameter \(p_0 \in (0, 1)\), the initial sampling is uniform at random, i.e.,
\[ {\textbf{B}}_0 = p_0 {\textbf{B}} . \tag{13} \]
This initial sampling simulates the case where one observes only a partial graph with a small portion of the edges instead of the entire graph with all existing edges.
Theorem 2
For K-block SBMs, given two block connectivity probability matrices \({\textbf{B}}, p{\textbf{B}} \in (0, 1)^{K \times K}\) with \(p \in (0, 1)\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ p {\textbf{B}}\).
The proof of Theorem 2 can be found in Appendix. As an illustration, consider a 4-block SBM parametrized by block connectivity probability matrix \({\textbf{B}}\) as
Figure 1 shows Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14) and \(p {\textbf{B}}\) for \(p \in (0, 1)\). In addition, Fig. 1a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 1b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). As suggested by Theorem 2, for any \(p \in (0, 1)\) we have \(\rho _{B} > \rho _{pB}\) and thus \({\textbf{B}} \succ p {\textbf{B}}\).
Now given the dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\), the baseline sampling scheme can proceed uniformly at random again, i.e.,
\[ {\textbf{B}}_1 = {\textbf{B}}_0 + p_1 {\textbf{B}} = (p_0 + p_1) {\textbf{B}} . \tag{15} \]
This dynamic network sampling simulates the situation where one is given limited resources to sample some extra edges after observing the partial graph with only a small portion of the edges. Since we have only a limited budget to sample another small portion of the edges, we would benefit from identifying the vertex pairs that have the most influence on the community structure. In other words, the baseline sampling scheme chooses vertex pairs at random without using the information from the initially observed graph, and our goal is to design an alternative scheme that optimizes this dynamic network sampling procedure, so that one obtains a better block recovery even with limited resources that allow observing only a partial graph with a small portion of the edges.
Corollary 1
For K-block SBMs, given a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\) where \({\textbf{B}}_0\) is defined as in Eq. (13) with \(p_0 \in (0, 1)\) and \({\textbf{B}}_1\) is defined as in Eq. (15) with \(p_1 \in (0, 1-p_0)\).
The proof of Corollary 1 can be found in the Appendix. This corollary implies that we can have a better block recovery from \({\textbf{B}}_1\) than \({\textbf{B}}_0\).
Assumption 1
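The Chernoff-active blocks after initial sampling are unique, i.e., there exists a unique pair \(\left( k_0^*, \ell _0^* \right) \in \{(k, \ell ) \; | \; 1 \le k < \ell \le K \}\) such that
\[ \left( k_0^*, \ell _0^* \right) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}_0, \varvec{\pi }) , \tag{16} \]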
where \({\textbf{B}}_0\) is defined as in Eq. (13) and \(\varvec{\pi }\) is the vector of block assignment probabilities.
To improve this baseline sampling scheme, we concentrate on the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\) after initial sampling, assuming Assumption 1 holds. Instead of sampling from the entire block connectivity probability matrix \({\textbf{B}}\) like the baseline sampling scheme as in Eq. (15), we only sample the entries associated with the Chernoff-active blocks. As a competitor to \({\textbf{B}}_1\), our Chernoff-optimal dynamic network sampling scheme is then given by
\[ \widetilde{{\textbf{B}}}_1 = {\textbf{B}}_0 + \frac{p_1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2} {\textbf{B}} \circ {\textbf{1}}_* , \tag{17} \]
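where \(\circ\) denotes the Hadamard product, \(\pi _{k_0^*}\) and \(\pi _{\ell _0^*}\) denote the block assignment probabilities for block \(k_0^*\) and \(\ell _0^*\) respectively, and \({\textbf{1}}_*\) is the \(K \times K\) binary matrix with 0’s everywhere except for 1’s associated with the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\), i.e., for any \(i, j \in \{1, \cdots , K \}\)
\[ {\textbf{1}}_*[i, j] = \begin{cases} 1 & \text {if } (i, j) \in \left\{ (k_0^*, k_0^*), (k_0^*, \ell _0^*), (\ell _0^*, k_0^*), (\ell _0^*, \ell _0^*) \right\} , \\ 0 & \text {otherwise.} \end{cases} \tag{18} \]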
Note that the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1 {\textbf{B}} \circ {\textbf{1}}_*\) ensures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., to ensure \(\widetilde{{\textbf{B}}}_1[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require
\[ p_1 \le p_1^{\text {max}} = (1 - p_0) \left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 . \tag{19} \]
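The two competing schemes can be compared at the level of block matrices. The sketch below assumes \({\textbf{B}}_0 = p_0 {\textbf{B}}\), \({\textbf{B}}_1 = (p_0 + p_1) {\textbf{B}}\), and \(\widetilde{{\textbf{B}}}_1 = {\textbf{B}}_0 + \frac{p_1}{(\pi _{k_0^*} + \pi _{\ell _0^*})^2} {\textbf{B}} \circ {\textbf{1}}_*\), with \({\textbf{1}}_*\) covering the four entries indexed by the two active blocks; the bound `p1_max` follows from requiring \(\widetilde{{\textbf{B}}}_1 \le {\textbf{B}}\) entrywise. The active pair is passed in by hand, and the matrix `B_demo` and all names are hypothetical.

```python
import numpy as np

def active_mask(K, k, l):
    """K x K binary matrix with ones on the entries indexed by blocks k and l."""
    mask = np.zeros((K, K))
    mask[np.ix_([k, l], [k, l])] = 1.0
    return mask

def sampling_schemes(B, pi, p0, p1, active):
    """Return B0, baseline B1, Chernoff-focused B1_tilde, mask and p1_max."""
    B, pi = np.asarray(B, dtype=float), np.asarray(pi, dtype=float)
    k, l = active
    mass = (pi[k] + pi[l]) ** 2            # fraction of vertex pairs in the active blocks
    p1_max = (1 - p0) * mass               # keeps B1_tilde <= B entrywise
    if not 0 < p1 <= p1_max:
        raise ValueError("p1 must lie in (0, p1_max]")
    mask = active_mask(B.shape[0], k, l)
    B0 = p0 * B
    B1 = B0 + p1 * B
    B1_tilde = B0 + (p1 / mass) * B * mask
    return B0, B1, B1_tilde, mask, p1_max

B_demo = np.full((4, 4), 0.1) + 0.2 * np.eye(4)   # hypothetical 4-block matrix
pi_demo = np.full(4, 0.25)
p0_demo, p1_demo = 0.01, 0.1
B0, B1, B1_tilde, mask, p1_max = sampling_schemes(
    B_demo, pi_demo, p0_demo, p1_demo, active=(0, 1))

# Expected fraction of vertex pairs queried in the second stage under each scheme
budget_baseline = p1_demo * pi_demo @ np.ones((4, 4)) @ pi_demo
budget_focused = (p1_demo / (pi_demo[0] + pi_demo[1]) ** 2) * pi_demo @ mask @ pi_demo
```

Both budgets equal \(p_1\), which is the sense in which the multiplier keeps the number of sampled potential edges the same across the two schemes.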
Assumption 2
For K-block SBMs, consider a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\). Let \(p_1^* \in (0, p_1^{\text {max}}]\) be the smallest positive \(p_1 \le p_1^{\text {max}}\) such that
\[ \arg \min _{k \ne \ell } C_{k ,\ell }(\widetilde{{\textbf{B}}}_1, \varvec{\pi }) \tag{20} \]
is not unique, where \(p_1^{\text {max}}\) is defined as in Eq. (19) and \(\widetilde{{\textbf{B}}}_1\) is defined as in Eq. (17). If the arg min is always unique, let \(p_1^* = p_1^{\text {max}}\).
For any \(p_1 \in (0, p_1^*)\), we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1\) than \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is better than the baseline sampling scheme in terms of block recovery.
As an illustration, consider the 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 2 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) with dynamic network sampling parameter \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2. In addition, Fig. 2a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 2b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in (0, p_1^*)\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1 \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, in terms of Chernoff information, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme can yield better block recovery results. In other words, to reach the same level of performance in terms of Chernoff information, the proposed Chernoff-optimal dynamic network sampling scheme needs fewer resources.
As described earlier, it may be the case that \(p_1^* < p_1^{\text {max}}\), at which point the Chernoff-active blocks change to \((k_1^*, \ell _1^*)\). This potential non-uniqueness of the Chernoff argmin is a consequence of our dynamic network sampling scheme. In the case of \(p_1 > p_1^*\), our Chernoff-optimal dynamic network sampling scheme is adopted as
Similarly, the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1^* {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}\) assures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1^*\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., \(\widetilde{{\textbf{B}}}_1^*[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require
For any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1^*\) than \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is again better than the baseline sampling scheme in terms of block recovery.
As an illustration, consider a 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 3 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) with dynamic network sampling parameter \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where \(p_1^*\) is defined as in Assumption 2 and \(p_{11}^{\text {max}}\) is defined as in Eq. (22). In addition, Fig. 3a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 3b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1^*}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1^* \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, the adopted Chernoff-optimal dynamic network sampling scheme can still yield better block recovery results, in terms of Chernoff information, given the same amount of resources.
Now we illustrate how the proposed Chernoff-optimal dynamic network sampling scheme can be migrated to real applications. We summarize the uniform dynamic sampling scheme (baseline) as Algorithm 1 and our Chernoff-optimal dynamic network sampling scheme as Algorithm 2. Recall that, given the potential edge set E and the initial sampling parameter \(p_0 \in (0, 1)\), we have the initial edge set \(E_0 \subset E\) with \(|E_0 |= p_0 |E |\). The goal is to dynamically sample new edges from the potential edge set so that we can have a better block recovery given limited resources.
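The listings of Algorithms 1 and 2 are not reproduced in this text, so the following is only our reading of the edge-selection step, not the authors' implementation. It contrasts drawing new pairs uniformly from the not-yet-observed potential edge set with drawing them only among vertices currently assigned to the two Chernoff-active blocks. The block labels would come from an initial clustering of the graph on \(E_0\); here they are supplied by hand, and every name and value is hypothetical.

```python
import numpy as np

def remaining_pairs(potential, observed):
    """Potential pairs that have not been observed yet."""
    return [pair for pair in potential if pair not in observed]

def sample_uniform(pairs, budget, rng):
    """Baseline step: draw up to `budget` pairs uniformly without replacement."""
    if not pairs:
        return []
    idx = rng.choice(len(pairs), size=min(budget, len(pairs)), replace=False)
    return [pairs[i] for i in idx]

def sample_active(pairs, budget, labels, active_blocks, rng):
    """Focused step: draw only pairs whose endpoints both lie in the active blocks."""
    in_active = np.isin(labels, active_blocks)
    focused = [(i, j) for (i, j) in pairs if in_active[i] and in_active[j]]
    return sample_uniform(focused, budget, rng)

rng = np.random.default_rng(1)
n = 12
labels = np.repeat([0, 1, 2, 3], 3)                 # stand-in for estimated blocks
potential = [(i, j) for i in range(n) for j in range(i + 1, n)]
observed = set(potential[::5])                      # stand-in for the initial set E_0
pool = remaining_pairs(potential, observed)
baseline_new = sample_uniform(pool, 6, rng)
chernoff_new = sample_active(pool, 6, labels, (0, 1), rng)
```

After either step, the new pairs would be added to the observed set and the graph re-embedded and re-clustered; that part is omitted here.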
Experiments
Simulations
In addition to Chernoff analysis, we also evaluate our Chernoff-optimal dynamic network sampling scheme via simulations. In particular, consider the 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) and dynamic network sampling parameter \(p_1 \in (0, p_{11}^{\text {max}}]\) where \(p_{11}^{\text {max}}\) is defined as in Eq. (22). We fix initial sampling parameter \(p_0 = 0.01\). For each \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2, we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) respectively. For each \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) respectively. In addition, Fig. 4a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 3000 vertices in each block, and Fig. 4b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 1500 vertices in two of the blocks and 4500 vertices in the other two blocks. We then apply ASE \(\circ\) GMM (Steps 3 and 4 in Algorithm 1) to recover block assignments and adopt the adjusted Rand index (ARI) to measure the performance. Figure 4 shows ARI (mean\(\pm\)stderr) associated with \({\textbf{B}}_1\) for \(p_1 \in (0, p_{11}^{\text {max}}]\), \(\widetilde{{\textbf{B}}}_1\) for \(p_1 \in (0, p_1^*)\), and \(\widetilde{{\textbf{B}}}_1^*\) for \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where the dashed lines denote \(p_1^*\). Note that we can have a better block recovery from \(\widetilde{{\textbf{B}}}_1\) and \(\widetilde{{\textbf{B}}}_1^*\) than \({\textbf{B}}_1\), which agrees with our results from Chernoff analysis.
Now we compare the performance of Algorithms 1 and 2 by actual block recovery results. In particular, we start with the 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). We consider dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\) where \(p_0\) is the initial sampling parameter. For each \(p_1\), we simulate 50 adjacency matrices with \(n = 4000\) vertices and retrieve the associated potential edge sets. We fix initial sampling parameter \(p_0 = 0.15\) and randomly sample initial edge sets. We then apply both algorithms to estimate the block assignments and adopt ARI to measure the performance. Figure 5 shows ARI (mean\(\pm\)stderr) of the two algorithms for \(p_1 \in (0, 0.85)\) where Fig. 5a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 1000 vertices in each block, and Fig. 5b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 500 vertices in two of the blocks and 1500 vertices in the other two blocks. Note that both algorithms tend to perform better as \(p_1\) increases, i.e., as we sample more edges, and Algorithm 2 always recovers a more accurate block structure than Algorithm 1. That is, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme can yield better block recovery results. In other words, to reach the same level of performance in terms of the empirical clustering results, the proposed Chernoff-optimal dynamic network sampling scheme needs fewer resources.
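For reference, the ARI used as the performance measure can be computed from the contingency table of two labelings. The sketch below is a plain NumPy version of the standard formula; the function name is ours, and a library routine such as scikit-learn's `adjusted_rand_score` would normally be used instead.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same vertices."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    # Contingency table: table[a, b] = number of vertices with labels (a, b)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def comb2(x):
        return x * (x - 1) / 2.0           # number of unordered pairs

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:              # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # relabeled, identical partition
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that cut across each other
```

The first call gives 1.0 because ARI is invariant to relabeling, and the second gives -0.5, illustrating that ARI can fall below zero when two partitions agree less than expected by chance.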
Real data
We also evaluate the performance of Algorithms 1 and 2 in real applications. We conduct real data experiments on a diffusion MRI connectome dataset (Priebe et al. 2019). There are 114 graphs (connectomes) estimated by the NDMG pipeline (Kiar et al. 2018) in this dataset. Each vertex in these graphs (the number of vertices n varies from 23728 to 42022) has a {Left, Right} hemisphere label and a {Gray, White} tissue label. We consider the potential 4 blocks as {LG, LW, RG, RW} where L and R denote the Left and Right hemisphere labels, and G and W denote the Gray and White tissue labels. Here we consider initial sampling parameter \(p_0 = 0.25\) and dynamic network sampling parameter \(p_1 = 0.25\). Let \(\Delta = \text {ARI(Algo2)} - \text {ARI(Algo1)}\) where ARI(Algo1) and ARI(Algo2) denote the ARI when we apply Algorithms 1 and 2 respectively. The following hypothesis test yields a p-value of 0.0184. Figure 6 shows the algorithms’ comparative performance via boxplot and histogram.
Furthermore, we test our algorithms on a Microsoft Bing entity dataset (Agterberg et al. 2020). There are 2 graphs in this dataset, each with 13535 vertices. We treat block assignments estimated from the complete graph as ground truth. We consider initial sampling parameter \(p_0 \in \left\{ 0.2, \; 0.3 \right\}\) and dynamic network sampling parameter \(p_1 \in \left\{ 0, \; 0.05, \; 0.1, \; 0.15, \; 0.2 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 7 shows the results where ARI is reported as mean(±stderr).
We also conduct real data experiments with 2 social network datasets.
- LastFM Asia social network data set (Leskovec and Krevl 2014; Rozemberczki and Sarkar 2020): Vertices (the number of vertices \(n = 7624\)) represent LastFM users from Asian countries and edges (the number of edges \(e = 27806\)) represent mutual follower relationships. We treat the 18 different user locations, which are derived from the country field for each user, as the potential blocks.
- Facebook large page-page network data set (Leskovec and Krevl 2014; Rozemberczki et al. 2019): Vertices (the number of vertices \(n = 22470\)) represent official Facebook pages and edges (the number of edges \(e = 171002\)) represent mutual likes. We treat the 4 page types {Politician, Governmental Organization, Television Show, Company}, which are defined by Facebook, as the potential blocks.
We consider initial sampling parameter \(p_0 \in \left\{ 0.15, \; 0.35 \right\}\) and dynamic network sampling parameter \(p_1 \in \left\{ 0.05, \; 0.1, \; 0.15, \; 0.2, \; 0.25 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 8 shows the results where ARI is reported as mean(±stderr). Again, the results suggest that given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme can yield better block recovery results. In other words, to reach the same level of performance in terms of the empirical clustering results, the proposed Chernoff-optimal dynamic network sampling scheme needs fewer resources.
Discussion
We propose a dynamic network sampling scheme to optimize block recovery for the SBM when we only have a limited budget to observe a partial graph. Theoretically, we justify our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance of our method, in terms of block recovery (community detection), on several real datasets including a diffusion MRI connectome dataset, a Microsoft Bing entity graph transitions dataset and social network datasets. Both the theoretical and the practical results suggest that our method can identify the vertices that have the most impact on block structure, so that one need only check whether there are edges between them, saving significant resources while still recovering the block structure.
The Chernoff-optimal dynamic sampling scheme depends on the initial clustering results to identify the Chernoff-active blocks and construct the dynamic edge set, so its performance could suffer if the initial clustering is poor. One future direction is to design a strategy that reduces this dependency so that the proposed scheme is more robust.
Availability of data and materials
Social network datasets are available at https://www.snap.stanford.edu/data/.
Abbreviations
- SBM: Stochastic blockmodel
- GRDPG: Generalized random dot product graph
- ASE: Adjacency spectral embedding
- LSE: Laplacian spectral embedding
- GMM: Gaussian mixture modeling
- BIC: Bayesian information criterion
- ARI: Adjusted Rand index
- stderr: Standard error
- NDMG: NeuroData’s magnetic resonance imaging to graphs
References
Agterberg J, Park Y, Larson J, White C, Priebe CE, Lyzinski V (2020) Vertex nomination, consistent estimation, and adversarial modification. Electron J Stat 14(2):3230–3267
Athreya A, Priebe CE, Tang M, Lyzinski V, Marchette DJ, Sussman DL (2016) A limit theorem for scaled eigenvectors of random dot product graphs. Sankhya A 78(1):1–18
Athreya A, Fishkind DE, Tang M, Priebe CE, Park Y, Vogelstein JT, Levin K, Lyzinski V, Qin Y (2017) Statistical inference on random dot product graphs: a survey. J Mach Learn Res 18(1):8393–8484
Binkiewicz N, Vogelstein JT, Rohe K (2017) Covariate-assisted spectral clustering. Biometrika 104(2):361–377
Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann Math Stat 23(4):493–507
Chernoff H (1956) Large-sample theory: parametric case. Ann Math Stat 27(1):1–22
Choi DS, Wolfe PJ, Airoldi EM (2012) Stochastic blockmodels with a growing number of classes. Biometrika 99(2):273–284
Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44
Gallagher I, Bertiger A, Priebe C, Rubin-Delanchy P (2019) Spectral clustering in the weighted stochastic block model. arXiv:1910.05534
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109–137
Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, New York
Huang S, Feng Y (2018) Pairwise covariates-adjusted block model for community detection. arXiv:1807.03469
Jolliffe IT, Cadima J (2016) Principal component analysis: a review and recent developments. Philos Trans R Soc A Math Phys Eng Sci 374(2065):20150202
Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107
Kiar G, Bridgeford EW, Gray Roncal WR, Chandrashekhar V, Mhembere D, Ryman S, Zuo X-N, Margulies DS, Craddock RC, Priebe CE, Jung R, Calhoun VD, Caffo B, Burns R, Milham MP, Vogelstein JT (2018) A high-throughput pipeline identifies robust connectomes but troublesome variability. bioRxiv, 188706
Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Lyzinski V, Sussman DL, Tang M, Athreya A, Priebe CE (2014) Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding. Electron J Stat 8(2):2905–2922
Lyzinski V, Tang M, Athreya A, Park Y, Priebe CE (2016) Community detection and classification in hierarchical stochastic blockmodels. IEEE Trans Netw Sci Eng 4(1):13–26
McSherry F (2001) Spectral partitioning of random graphs. In: Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp 529–537. IEEE
Mele A, Hao L, Cape J, Priebe CE (2022) Spectral inference for large stochastic blockmodels with nodal covariates. J Bus Econ Stat
Mu C, Mele A, Hao L, Cape J, Athreya A, Priebe CE (2022) On spectral algorithms for community detection in stochastic blockmodel graphs with vertex covariates. IEEE Trans Netw Sci Eng
Priebe CE, Park Y, Vogelstein JT, Conroy JM, Lyzinski V, Tang M, Athreya A, Cape J, Bridgeford E (2019) On a two-truths phenomenon in spectral graph clustering. Proc Natl Acad Sci 116(13):5995–6000
Purohit S, Choudhury S, Holder LB (2017) Application-specific graph sampling for frequent subgraph mining and community detection. In: 2017 IEEE International Conference on Big Data (Big Data), pp 1000–1005. IEEE
Rohe K, Chatterjee S, Yu B (2011) Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat 39(4):1878–1915
Roy S, Atchadé Y, Michailidis G (2019) Likelihood inference for large scale stochastic blockmodels with covariates based on a divide-and-conquer parallelizable algorithm with communication. J Comput Graph Stat 28(3):609–619
Rozemberczki B, Allen C, Sarkar R (2019) Multi-scale attributed node embedding. arXiv:1909.13021
Rozemberczki B, Sarkar R (2020) Characteristic functions on graphs: birds of a feather, from statistical descriptors to parametric models. In: Proceedings of the 29th ACM International conference on information and knowledge management (CIKM ’20), pp 1325–1334. ACM
Rubin-Delanchy P, Priebe CE, Tang M, Cape J (2022) A statistical interpretation of spectral embedding: the generalised random dot product graph. J R Stat Soc
Sussman DL, Tang M, Fishkind DE, Priebe CE (2012) A consistent adjacency spectral embedding for stochastic blockmodel graphs. J Am Stat Assoc 107(499):1119–1128
Sweet TM (2015) Incorporating covariates into stochastic blockmodels. J Educ Behav Stat 40(6):635–664
Tang M, Priebe CE (2018) Limit theorems for eigenvectors of the normalized laplacian for random graphs. Ann Stat 46(5):2360–2415
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
Yun S-Y, Proutiere A (2014) Community detection via random and adaptive sampling. In: Conference on Learning Theory, pp 138–175. PMLR
Zhu M, Ghodsi A (2006) Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput Stat Data Anal 51(2):918–930
Acknowledgements
This problem was posed to us by Adam Cardinal-Stakenas and Kevin Hoover.
Funding
Cong Mu’s work is partially supported by the Johns Hopkins Mathematical Institute for Data Science (MINDS) Data Science Fellowship.
Author information
Contributions
CM developed the theory, designed and implemented the methods, conducted the experiments, and wrote the manuscript. YP implemented the methods, conducted the experiments, and edited the manuscript. CEP formulated the problem, designed the methods, developed the theory, and edited the manuscript. All authors read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Theorem 2
Let \({\textbf{B}} = {\textbf{U}} {\textbf{S}} {\textbf{U}}^\top\) be the spectral decomposition of \({\textbf{B}}\) and \({\textbf{B}}^\prime = p {\textbf{B}}\) with \(p \in (0, 1)\). Then we have \({\textbf{B}}^\prime = {\textbf{U}} (p {\textbf{S}}) {\textbf{U}}^\top\).
By Remark 1, to represent these two SBMs parametrized by two block connectivity matrices \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively (with the same block assignment probability vector \(\varvec{\pi }\)) in the GRDPG models, we can take
Then for any \(k \in \{1, \ldots , K \}\), we have \(\varvec{\nu }_k^\prime = \sqrt{p} \varvec{\nu }_k \in {\mathbb {R}}^{d}\). By Theorem 1, we have
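The representation step can be checked numerically. The following is a minimal sketch, assuming that Remark 1 corresponds to the usual construction \(\varvec{\nu } = {\textbf{U}} |{\textbf{S}}|^{1/2}\) with signature matrix \({\textbf{I}}_{d_+ d_-}\) and that \({\textbf{B}}\) has full rank; the matrix `B` and the factor `p` below are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical block connectivity matrix and scaling factor (illustrative only).
B = np.array([[0.2, 0.6],
              [0.6, 0.3]])   # indefinite: one positive and one negative eigenvalue
p = 0.4                      # any value in (0, 1)

# Spectral decomposition B = U S U^T, ordered with positive eigenvalues first.
S, U = np.linalg.eigh(B)
order = np.argsort(-S)
S, U = S[order], U[:, order]

# Assumed construction: nu = U |S|^{1/2} (row k is nu_k), I_pq = diag(sign(S)).
nu = U * np.sqrt(np.abs(S))
I_pq = np.diag(np.sign(S))

# Scaling B by p > 0 rescales the eigenvalues but keeps their signs,
# so the same I_pq serves both B and B' = p B, with nu' = sqrt(p) * nu.
S_prime = np.sort(np.linalg.eigvalsh(p * B))[::-1]
nu_prime = np.sqrt(p) * nu

print(np.allclose(nu @ I_pq @ nu.T, B))
print(np.allclose(nu_prime @ I_pq @ nu_prime.T, p * B))
print(np.allclose(S_prime, p * S), np.array_equal(np.sign(S_prime), np.sign(S)))
```

An indefinite `B` is used on purpose so that the signature matrix is not simply the identity.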
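Note that the eigenvalues of \({\textbf{B}}^\prime = p {\textbf{B}}\) are those of \({\textbf{B}}\) scaled by \(p > 0\), so \({\textbf{B}}\) and \({\textbf{B}}^\prime\) have the same numbers of positive and negative eigenvalues, and thus \({\textbf{I}}_{d_+ d_-} = {\textbf{I}}_{d_+ d_-}^\prime\). See also Lemma 2 of Gallagher et al. (2019). Then for \(k \in \{1, \ldots , K \}\), we have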
where
Recall that by Remark 1, we have \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell } \in (0, 1)\) for all \(k, \ell \in \{ 1, \ldots , K \}\). Then \({\textbf{D}}_k(p)\) is positive-definite for any \(k \in \{1, \ldots , K \}\) and \(p \in (0, 1)\). For \(k, \ell \in \{1, \ldots , K \}\) and \(t \in (0, 1)\), let \(\varvec{\Sigma }_{k\ell }(t)\) and \(\varvec{\Sigma }_{k\ell }^{\prime }(t)\) denote the matrices as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively, i.e.,
where
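One consistent reading of the covariance decomposition used in this part of the proof is sketched below. It assumes that Theorem 1 gives the standard GRDPG limiting covariance and that Eq. (10) uses \(\varvec{\Sigma }_{k\ell }(t) = t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell\). The explicit form of \({\textbf{D}}_k(p)\) is an assumption chosen to agree with the positive-definiteness claim above, not a quotation of the original display.

```latex
% Assumed (standard GRDPG) form of the limiting covariance in Theorem 1:
%   \Delta   = \sum_\ell \pi_\ell \nu_\ell \nu_\ell^\top
%   \Sigma_k = I_{d_+ d_-} \Delta^{-1}
%              [ \sum_\ell \pi_\ell B_{k\ell} (1 - B_{k\ell}) \nu_\ell \nu_\ell^\top ]
%              \Delta^{-1} I_{d_+ d_-}
% Substituting \nu' = \sqrt{p}\,\nu and B' = pB:
\begin{align*}
\boldsymbol{\Delta}' &= p\,\boldsymbol{\Delta}, \\
\boldsymbol{\Sigma}_k'
  &= \mathbf{I}_{d_+ d_-} (p\boldsymbol{\Delta})^{-1}
     \Big[ \sum_{\ell} \pi_\ell \, p\mathbf{B}_{k\ell} (1 - p\mathbf{B}_{k\ell}) \,
     p\,\boldsymbol{\nu}_\ell \boldsymbol{\nu}_\ell^\top \Big]
     (p\boldsymbol{\Delta})^{-1} \mathbf{I}_{d_+ d_-} \\
  &= \mathbf{I}_{d_+ d_-} \boldsymbol{\Delta}^{-1}
     \Big[ \sum_{\ell} \pi_\ell \, \mathbf{B}_{k\ell} (1 - p\mathbf{B}_{k\ell}) \,
     \boldsymbol{\nu}_\ell \boldsymbol{\nu}_\ell^\top \Big]
     \boldsymbol{\Delta}^{-1} \mathbf{I}_{d_+ d_-}
   = \boldsymbol{\Sigma}_k + \mathbf{D}_k(p), \\
\mathbf{D}_k(p)
  &= (1 - p)\, \mathbf{I}_{d_+ d_-} \boldsymbol{\Delta}^{-1}
     \Big[ \sum_{\ell} \pi_\ell \, \mathbf{B}_{k\ell}^2 \,
     \boldsymbol{\nu}_\ell \boldsymbol{\nu}_\ell^\top \Big]
     \boldsymbol{\Delta}^{-1} \mathbf{I}_{d_+ d_-}, \\
\boldsymbol{\Sigma}_{k\ell}'(t)
  &= t\,\boldsymbol{\Sigma}_k' + (1 - t)\,\boldsymbol{\Sigma}_\ell'
   = \boldsymbol{\Sigma}_{k\ell}(t) + \mathbf{D}_{k\ell}(p, t),
  \qquad
  \mathbf{D}_{k\ell}(p, t) = t\,\mathbf{D}_k(p) + (1 - t)\,\mathbf{D}_\ell(p).
\end{align*}
```

The third line uses \({\textbf{B}}_{k\ell }(1 - p {\textbf{B}}_{k\ell }) = {\textbf{B}}_{k\ell }(1 - {\textbf{B}}_{k\ell }) + (1-p) {\textbf{B}}_{k\ell }^2\). Under this form, \({\textbf{D}}_k(p)\) is positive-definite whenever \({\textbf{B}}_{k\ell } \in (0, 1)\), \(p \in (0, 1)\), and the \(\varvec{\nu }_\ell\) span \({\mathbb {R}}^{d}\).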
Recall that \({\textbf{D}}_k(p)\) and \({\textbf{D}}_\ell (p)\) are both positive-definite for any \(k, \ell \in \{1, \ldots , K \}\) and \(p \in (0, 1)\), thus \({\textbf{D}}_{k \ell }(p, t)\) is also positive-definite for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\). Now by the Sherman-Morrison-Woodbury formula (Horn and Johnson 2012), we have
where
Recall that for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\), \({\textbf{D}}_{k \ell }(p, t)\) and \(\varvec{\Sigma }_{k\ell }(t)\) are both positive-definite, thus \({\textbf{M}}_{k \ell }(p, t)\) is also positive-definite. Then for any \(k, \ell \in \{1, \ldots , K \}\) and \(p,t \in (0, 1)\), we have
where
Recall that for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\), \({\textbf{M}}_{k \ell }(p, t)\) is positive-definite, thus we have \(h_{k \ell }(p, t) > 0\). Together with Eq. (33), we have
Thus for any \(k, \ell \in \{1, \ldots , K \}\), we have
Let \(\rho _B\) and \(\rho _{B^\prime }\) denote the Chernoff information obtained as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively (with the same block assignment probability vector \(\varvec{\pi }\)). Then we have
Thus we have \({\textbf{B}} \succ {\textbf{B}}^\prime = p {\textbf{B}}\) for \(p \in (0, 1)\). \(\square\)
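The Sherman-Morrison-Woodbury step in the proof above can be sanity-checked numerically. This is a minimal sketch, assuming \({\textbf{M}}_{k \ell }(p, t)\) has the form \(\varvec{\Sigma }^{-1} ({\textbf{D}}^{-1} + \varvec{\Sigma }^{-1})^{-1} \varvec{\Sigma }^{-1}\) and that \(h_{k \ell }(p, t)\) is the quadratic form of \(\varvec{\nu }_k - \varvec{\nu }_\ell\) in \({\textbf{M}}_{k \ell }(p, t)\). The matrices are random positive-definite stand-ins, not quantities computed from any dataset in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """Random symmetric positive-definite matrix (illustrative stand-in)."""
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

d = 3
p = 0.4                       # hypothetical scaling factor in (0, 1)
Sigma = random_spd(d)         # stands in for Sigma_{kl}(t)
D = random_spd(d)             # stands in for D_{kl}(p, t)
x = rng.standard_normal(d)    # stands in for nu_k - nu_l

Sigma_inv = np.linalg.inv(Sigma)

# Assumed form of M_{kl}(p, t), from the Woodbury identity with U = V = I:
# (Sigma + D)^{-1} = Sigma^{-1} - Sigma^{-1} (D^{-1} + Sigma^{-1})^{-1} Sigma^{-1}
M = Sigma_inv @ np.linalg.inv(np.linalg.inv(D) + Sigma_inv) @ Sigma_inv

lhs = np.linalg.inv(Sigma + D)
rhs = Sigma_inv - M

h = x @ M @ x                 # assumed form of h_{kl}(p, t)
q = x @ Sigma_inv @ x         # quadratic form under B
q_prime = p * (x @ lhs @ x)   # quadratic form under B' = pB, since nu' = sqrt(p) * nu

print(np.allclose(lhs, rhs), h > 0, q_prime < q)
```

Because \(h > 0\) and \(p < 1\), the quadratic form under \({\textbf{B}}^\prime\) is strictly smaller than the one under \({\textbf{B}}\) for every \(t\), which is the inequality the Chernoff information comparison relies on.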
Proof of Corollary 1
By Eq. (13) and Eq. (15), we have
Recall that \(p_0 \in (0, 1)\) and \(p_1 \in (0, 1-p_0)\). Then by Theorem 2, we have \({\textbf{B}} \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). \(\square\)
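As a worked reading of this argument, assume that Eq. (13) and Eq. (15) take the forms \({\textbf{B}}_0 = p_0 {\textbf{B}}\) and \({\textbf{B}}_1 = (p_0 + p_1) {\textbf{B}}\); these forms are an assumption made for illustration. Theorem 2 is then applied twice, each time with a scaling factor in \((0, 1)\):

```latex
% Assumption (for illustration): B_0 = p_0 B and B_1 = (p_0 + p_1) B,
% with p_0 in (0, 1) and p_1 in (0, 1 - p_0).
\begin{align*}
\mathbf{B}_1 &= (p_0 + p_1)\,\mathbf{B},
  & p_0 + p_1 &\in (0, 1)
  & &\Longrightarrow \quad \mathbf{B} \succ \mathbf{B}_1, \\
\mathbf{B}_0 &= p_0\,\mathbf{B} = \frac{p_0}{p_0 + p_1}\,\mathbf{B}_1,
  & \frac{p_0}{p_0 + p_1} &\in (0, 1)
  & &\Longrightarrow \quad \mathbf{B}_1 \succ \mathbf{B}_0.
\end{align*}
```

The condition \(p_1 \in (0, 1 - p_0)\) keeps \(p_0 + p_1\) below 1, so the entries of \({\textbf{B}}_1\) remain in \((0, 1)\) and Theorem 2 can be applied with \({\textbf{B}}_1\) in the role of \({\textbf{B}}\).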
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mu, C., Park, Y. & Priebe, C.E. Dynamic network sampling for community detection. Appl Netw Sci 8, 5 (2023). https://doi.org/10.1007/s41109-022-00528-1
DOI: https://doi.org/10.1007/s41109-022-00528-1