Dynamic network sampling for community detection

Abstract

We propose a dynamic network sampling scheme to optimize block recovery for stochastic blockmodels in the case where it is prohibitively expensive to observe the entire graph. Theoretically, we provide justification of our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance, in terms of block recovery, of our method on several real datasets from different domains. Both theoretical and practical results suggest that our method can identify vertices that have the most impact on block structure, so that one need only check whether there are edges between them to save significant resources while still recovering the block structure.

Introduction

In network inference applications, it is important to detect community structure, i.e., to cluster vertices into potential blocks. However, in many cases it can be prohibitively expensive to observe the entire graph, especially for large graphs. Consider, for example, a network where vertices represent landline phones and edges represent whether there is a call between two phones. Depending on the size of the network, in terms of the number of vertices, it can be extremely expensive to check whether there is a call for every pair of landline phones. If one can instead utilize the information carried by a partially observed graph, in which only a small number of phone pairs are verified, to identify the phones that play a more important role in forming communities, then given limited resources one can choose to check only whether there are calls between those phone pairs and still detect the potential block structure. It thus becomes essential to identify the vertices that have the most impact on block structure and check only whether there are edges between them, saving significant resources while still recovering the block structure.

Many classical methods consider only the adjacency or Laplacian matrices for community detection (Fortunato and Hric 2016). By contrast, covariate-aware methods also take vertex covariates into consideration for the inference; these rely on either variational methods (Choi et al. 2012; Roy et al. 2019; Sweet 2015) or spectral approaches (Binkiewicz et al. 2017; Huang and Feng 2018; Mele et al. 2022; Mu et al. 2022). However, none of these focus on the problem of clustering vertices in partially observed graphs. To address this issue, existing methods propose different types of random and adaptive sampling strategies to minimize the information loss from the data reduction (Yun and Proutiere 2014; Purohit et al. 2017).

We propose a dynamic network sampling scheme to optimize block recovery for stochastic blockmodel (SBM) when we only have limited resources to check whether there are edges between certain selected vertices. The innovation of our approach is the application of Chernoff information. To our knowledge, this is the first time that it has been applied to network sampling problems. Motivated by the Chernoff analysis, we not only propose a dynamic network sampling scheme to optimize block recovery, but also provide the framework and justification for using Chernoff information in subsequent inference for graphs.

The structure of this article is summarized as follows. Section 2 reviews relevant models for random graphs and the basic idea of spectral methods. Section 3 introduces the notion of Chernoff analysis for analytically measuring the performance of block recovery. Section 4 includes our dynamic network sampling scheme and theoretical results. Section 5 provides simulations and real data experiments to measure the algorithms’ performance in terms of actual block recovery results. Section 6 discusses the findings and presents some open questions for further investigation. Appendix provides technical details for our theoretical results.

Models and spectral methods

In this work, we are interested in the inference task of block recovery (community detection). To model the block structure in edge-independent random graphs, we focus on the SBM and the generalized random dot product graph (GRDPG).

Definition 1

(Generalized Random Dot Product Graph Rubin-Delanchy et al. 2022) Let \({\textbf{I}}_{d_+ d_-} = {\textbf{I}}_{d_+} \bigoplus \left( -{\textbf{I}}_{d_-} \right)\) with \(d_+ \ge 1\) and \(d_- \ge 0\). Let F be a d-dimensional inner product distribution with \(d = d_+ + d_-\) on \({\mathcal {X}} \subset {\mathbb {R}}^d\) satisfying \({\textbf{x}}^\top {\textbf{I}}_{d_+ d_-} {\textbf{y}} \in [0, 1]\) for all \({\textbf{x}}, {\textbf{y}} \in {\mathcal {X}}\). Let \({\textbf{A}}\) be an adjacency matrix and \({\textbf{X}} = [{\textbf{X}}_1, \cdots , {\textbf{X}}_n]^\top \in {\mathbb {R}}^{n \times d}\) where \({\textbf{X}}_i \sim F\), i.i.d. for all \(i \in \{ 1, \cdots , n \}\). Then we say \(({\textbf{A}}, {\textbf{X}}) \sim \text {GRDPG}(n, F, d_+, d_-)\) if for any \(i, j \in \{ 1, \cdots , n \}\)

$$\begin{aligned} {\textbf{A}}_{ij} \sim \text {Bernoulli}({\textbf{P}}_{ij}) \qquad \text {where} \qquad {\textbf{P}}_{ij} = {\textbf{X}}_{i}^\top {\textbf{I}}_{d_+ d_-} {\textbf{X}}_j. \end{aligned}$$
(1)

Definition 2

(K-block Stochastic Blockmodel Graph Holland et al. 1983) The K-block stochastic blockmodel (SBM) graph is an edge-independent random graph with each vertex belonging to one of K blocks. It can be parametrized by a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) summing to unity. Let \({\textbf{A}}\) be an adjacency matrix and \(\varvec{\tau }\) be a vector of block assignments with \(\tau _i = k\) if vertex i is in block k (occurring with probability \(\pi _k\)). We say \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) if for any \(i, j \in \{ 1, \cdots , n \}\)

$$\begin{aligned} {\textbf{A}}_{ij} \sim \text {Bernoulli}({\textbf{P}}_{ij}) \qquad \text {where} \qquad {\textbf{P}}_{ij} = {\textbf{B}}_{\tau _i \tau _j}. \end{aligned}$$
(2)

Remark 1

The SBM is a special case of the GRDPG model. Let \(({\textbf{A}}, \varvec{\tau }) \sim \text {SBM}(n, {\textbf{B}}, \varvec{\pi })\) as in Definition 2 where \({\textbf{B}} \in (0, 1)^{K \times K}\) with \(d_+\) strictly positive eigenvalues and \(d_-\) strictly negative eigenvalues. To represent this SBM in the GRDPG model, we can choose \(\varvec{\nu }_1, \cdots , \varvec{\nu }_K \in {\mathbb {R}}^d\) where \(d = d_+ + d_-\) such that \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell }\) for all \(k, \ell \in \{ 1, \cdots , K \}\). For example, we can take \(\varvec{\nu } = {\textbf{U}}_B |{\textbf{S}}_B|^{1/2}\) where \({\textbf{B}} = {\textbf{U}}_B {\textbf{S}}_B {\textbf{U}}_B^\top\) is the spectral decomposition of \({\textbf{B}}\) after re-ordering. Then we have the latent position of vertex i as \({\textbf{X}}_i = \varvec{\nu }_k\) if \(\tau _i = k\).

The parameters of these models can be estimated via spectral methods (Von Luxburg 2007), which have been widely used in random graph models for community detection (Lyzinski et al. 2014, 2016; McSherry 2001; Rohe et al. 2011). Two particular spectral embedding methods, adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE), are popular since they enjoy nice properties including consistency (Sussman et al. 2012) and asymptotic normality (Athreya et al. 2016; Tang and Priebe 2018).

Definition 3

(Adjacency Spectral Embedding) Let \({\textbf{A}} \in \{0, 1 \}^{n \times n}\) be an adjacency matrix with eigendecomposition \({\textbf{A}} = \sum _{i=1}^{n} \lambda _i {\textbf{u}}_i {\textbf{u}}_i^\top\) where \(|\lambda _1| \ge \cdots \ge |\lambda _n|\) are the magnitude-ordered eigenvalues and \({\textbf{u}}_1, \cdots , {\textbf{u}}_n\) are the corresponding orthonormal eigenvectors. Given the embedding dimension \(d < n\), the adjacency spectral embedding (ASE) of \({\textbf{A}}\) into \({\mathbb {R}}^d\) is the \(n \times d\) matrix \(\mathbf {{\widehat{X}}} = {\textbf{U}}_A |{\textbf{S}}_A|^{1/2}\) where \({\textbf{S}}_A = \text {diag}(\lambda _1, \ldots , \lambda _d)\) and \({\textbf{U}}_A = [{\textbf{u}}_1 | \cdots | {\textbf{u}}_d]\).
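
A minimal numpy sketch of Definition 3 may help fix ideas; the function name is ours, and `numpy.linalg.eigh` assumes the graph is undirected so that \({\textbf{A}}\) is symmetric.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding (Definition 3): scale the top-d
    eigenvectors of A, ordered by eigenvalue magnitude."""
    evals, evecs = np.linalg.eigh(A)              # eigh: symmetric input
    top = np.argsort(np.abs(evals))[::-1][:d]     # |lambda_1| >= ... >= |lambda_d|
    S, U = evals[top], evecs[:, top]
    return U * np.sqrt(np.abs(S))                 # Xhat = U_A |S_A|^{1/2}, shape (n, d)
```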

Remark 2

There are different methods for choosing the embedding dimension (Hastie et al. 2009; Jolliffe and Cadima 2016); we adopt the simple and efficient profile likelihood method (Zhu and Ghodsi 2006) to automatically identify the “elbow”, i.e., the cut-off between the signal dimensions and the noise dimensions in the scree plot.
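
A simplified sketch of the Zhu and Ghodsi (2006) profile-likelihood criterion, under the assumption that the signal and noise groups of magnitude-sorted scree values share a common variance; the details of the original implementation may differ.

```python
import numpy as np
from scipy.stats import norm

def elbow(eigenvalues):
    """Return the split index q maximizing the profile log-likelihood of a
    two-group (signal/noise) Gaussian model with pooled variance."""
    x = np.sort(np.abs(eigenvalues))[::-1]
    n = len(x)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        s1, s2 = x[:q], x[q:]
        var = (np.sum((s1 - s1.mean()) ** 2) + np.sum((s2 - s2.mean()) ** 2)) / n
        sd = max(np.sqrt(var), 1e-12)             # guard against zero variance
        ll = norm.logpdf(s1, s1.mean(), sd).sum() + norm.logpdf(s2, s2.mean(), sd).sum()
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q
```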

Chernoff analysis

To analytically measure the performance of algorithms for block recovery, we consider the notion of Chernoff information among other possible metrics. Chernoff information enjoys the advantages of being independent of the clustering procedure, i.e., it can be derived no matter which clustering method is used, and of being intrinsically related to the Bayes risk (Tang and Priebe 2018; Athreya et al. 2017; Karrer and Newman 2011).

Definition 4

(Chernoff Information Chernoff 1952, 1956) Let \(F_1\) and \(F_2\) be two continuous multivariate distributions on \({\mathbb {R}}^d\) with density functions \(f_1\) and \(f_2\). The Chernoff information is defined as

$$\begin{aligned} \begin{aligned} C(F_1, F_2)&= - \log \left[ \inf _{t \in (0,1)} \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) d{\textbf{x}} \right] \\&= \sup _{t \in (0, 1)} \left[ - \log \int _{{\mathbb {R}}^d} f_1^t({\textbf{x}}) f_2^{1-t}({\textbf{x}}) d{\textbf{x}} \right] . \end{aligned} \end{aligned}$$
(3)

Remark 3

Consider the special case where we take \(F_1 = {\mathcal {N}}(\varvec{\mu }_1, \varvec{\Sigma }_1)\) and \(F_2 = {\mathcal {N}}(\varvec{\mu }_2, \varvec{\Sigma }_2)\); then the corresponding Chernoff information is

$$\begin{aligned} C(F_1, F_2) = \sup _{t \in (0, 1)} \left[ \frac{1}{2} t (1-t) (\varvec{\mu }_1 - \varvec{\mu }_2)^\top \varvec{\Sigma }_t^{-1} (\varvec{\mu }_1 - \varvec{\mu }_2) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_t |}{|\varvec{\Sigma }_1 |^t |\varvec{\Sigma }_2 |^{1-t}} \right] , \end{aligned}$$
(4)

where \(\varvec{\Sigma }_t = t \varvec{\Sigma }_1 + (1-t) \varvec{\Sigma }_2\).
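
Equation (4) is straightforward to evaluate numerically; here is a sketch (names ours) that maximizes the bracketed expression over \(t \in (0, 1)\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_gaussian(mu1, Sigma1, mu2, Sigma2):
    """Chernoff information between two multivariate Gaussians, Eq. (4)."""
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)

    def neg_obj(t):
        St = t * Sigma1 + (1 - t) * Sigma2
        quad = 0.5 * t * (1 - t) * diff @ np.linalg.solve(St, diff)
        logdets = [np.linalg.slogdet(M)[1] for M in (St, Sigma1, Sigma2)]
        log_term = 0.5 * (logdets[0] - t * logdets[1] - (1 - t) * logdets[2])
        return -(quad + log_term)

    res = minimize_scalar(neg_obj, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun
```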

The comparison of block recovery via Chernoff information is based on the statistical information between the limiting distributions of the blocks: smaller statistical information implies less ability to discriminate between the corresponding blocks of the SBM. To that end, we also review the limiting results of ASE for the SBM, which are essential for investigating Chernoff information.

Theorem 1

(CLT of ASE for SBM Rubin-Delanchy et al. 2022) Let \(({\textbf{A}}^{(n)}, {\textbf{X}}^{(n)}) \sim \text {GRDPG}(n, F, d_+, d_-)\) be a sequence of adjacency matrices and associated latent positions of a d-dimensional GRDPG as in Definition 1 from an inner product distribution F where F is a mixture of K point masses in \({\mathbb {R}}^d\), i.e.,

$$\begin{aligned} F = \sum _{k=1}^{K} \pi _k \delta _{\varvec{\nu }_k} \qquad \text {with} \qquad \forall k, \; \pi _k > 0 \quad \text {and} \quad \sum _{k=1}^{K} \pi _k = 1, \end{aligned}$$
(5)

where \(\delta _{\varvec{\nu }_k}\) is the Dirac delta measure at \(\varvec{\nu }_k\). Let \(\Phi ({\textbf{z}}, \varvec{\Sigma })\) denote the cumulative distribution function (CDF) of a multivariate Gaussian distribution with mean \({\varvec{0}}\) and covariance matrix \(\varvec{\Sigma }\), evaluated at \({\textbf{z}} \in {\mathbb {R}}^d\). Let \(\mathbf {{\widehat{X}}}^{(n)}\) be the ASE of \({\textbf{A}}^{(n)}\) with \(\mathbf {{\widehat{X}}}^{(n)}_i\) as its i-th row (and similarly for \({\textbf{X}}^{(n)}_i\)). Then there exists a sequence of matrices \({\textbf{M}}_n \in {\mathbb {R}}^{d \times d}\) satisfying \({\textbf{M}}_n {\textbf{I}}_{d_+ d_-} {\textbf{M}}_n^\top = {\textbf{I}}_{d_+ d_-}\) such that for all \({\textbf{z}} \in {\mathbb {R}}^d\) and fixed index i,

$$\begin{aligned} {\mathbb {P}} \left\{ \sqrt{n} \left( {\textbf{M}}_n \mathbf {{\widehat{X}}}^{(n)}_i - {\textbf{X}}^{(n)}_i \right) \le {\textbf{z}} \; \big | \; {\textbf{X}}^{(n)}_i = \varvec{\nu }_k \right\} \rightarrow \Phi ({\textbf{z}}, \varvec{\Sigma }_k), \end{aligned}$$
(6)

where for \(\varvec{\nu } \sim F\)

$$\begin{aligned} \varvec{\Sigma }_k = \varvec{\Sigma }(\varvec{\nu }_k) = {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} {\mathbb {E}} \left[ \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \left( 1-\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \varvec{\nu } \varvec{\nu }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-}, \end{aligned}$$
(7)

with

$$\begin{aligned} \varvec{\Delta } = {\mathbb {E}} \left[ \varvec{\nu } \varvec{\nu }^\top \right] . \end{aligned}$$
(8)

For a K-block SBM, let \({\textbf{B}} \in (0, 1)^{K \times K}\) be the block connectivity probability matrix and \(\varvec{\pi } \in (0, 1)^K\) be the vector of block assignment probabilities. Given an n vertex instantiation of the SBM parameterized by \({\textbf{B}}\) and \(\varvec{\pi }\), for sufficiently large n, the large sample optimal error rate for estimating the block assignments using ASE can be measured via Chernoff information as (Tang and Priebe 2018; Athreya et al. 2017)

$$\begin{aligned} \rho = \min _{k \ne \ell } \sup _{t \in (0, 1)} \left[ \frac{1}{2} n t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) + \frac{1}{2} \log \frac{|\varvec{\Sigma }_{k \ell }(t) |}{|\varvec{\Sigma }_k |^t |\varvec{\Sigma }_\ell |^{1-t}} \right] , \end{aligned}$$
(9)

where \(\varvec{\Sigma }_{k\ell }(t) = t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell\), \(\varvec{\Sigma }_k = \varvec{\Sigma }(\varvec{\nu }_k)\) and \(\varvec{\Sigma }_\ell = \varvec{\Sigma }(\varvec{\nu }_\ell )\) are defined as in Eq. (7). Also note that as \(n \rightarrow \infty\), the logarithm term in Eq. (9) will be dominated by the other term. Then we have the approximate Chernoff information as

$$\begin{aligned} \rho \approx \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }), \end{aligned}$$
(10)

where

$$\begin{aligned} C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) =\sup _{t \in (0, 1)} \left[ t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) \right] . \end{aligned}$$
(11)

We also introduce the following two notions, which will be used when we describe our dynamic network sampling scheme.

Definition 5

(Chernoff-active Blocks) For a K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), the Chernoff-active blocks \((k^*, \ell ^*)\) are defined as

$$\begin{aligned} (k^*, \ell ^*) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }), \end{aligned}$$
(12)

where \(C_{k ,\ell }({\textbf{B}}, \varvec{\pi })\) is defined as in Eq. (11).
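
To make these quantities concrete, here is a numpy/scipy sketch (all function names ours) computing the latent positions of Remark 1, the limiting covariances of Eq. (7), the pairwise quantities \(C_{k,\ell}\) of Eq. (11), and the Chernoff-active blocks of Definition 5; it assumes \({\textbf{B}}\) has full rank so that \(d = K\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def latent_positions(B):
    """Remark 1: nu = U_B |S_B|^{1/2}, plus the GRDPG signature matrix."""
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(np.abs(evals))[::-1]       # magnitude-ordered
    S, U = evals[order], evecs[:, order]
    return U * np.sqrt(np.abs(S)), np.diag(np.sign(S))   # (nu, I_{d+ d-})

def block_covariances(B, pi):
    """Limiting covariances Sigma_k of Eq. (7), using nu_k' I nu_l = B_{kl}."""
    nu, Ipq = latent_positions(B)
    K = len(pi)
    Delta = sum(pi[k] * np.outer(nu[k], nu[k]) for k in range(K))   # Eq. (8)
    Dinv = np.linalg.inv(Delta)
    Sigmas = []
    for k in range(K):
        E = sum(pi[l] * B[k, l] * (1 - B[k, l]) * np.outer(nu[l], nu[l])
                for l in range(K))
        Sigmas.append(Ipq @ Dinv @ E @ Dinv @ Ipq)
    return nu, Sigmas

def chernoff_active_blocks(B, pi):
    """Return ((k*, l*), rho): the argmin pair of Eq. (12) and the
    approximate Chernoff information of Eq. (10)."""
    nu, Sigmas = block_covariances(B, pi)
    best_pair, rho = None, np.inf
    for k in range(len(pi)):
        for l in range(k + 1, len(pi)):
            d = nu[k] - nu[l]
            neg_obj = lambda t: -t * (1 - t) * d @ np.linalg.solve(
                t * Sigmas[k] + (1 - t) * Sigmas[l], d)          # Eq. (11)
            C = -minimize_scalar(neg_obj, bounds=(1e-6, 1 - 1e-6),
                                 method="bounded").fun
            if C < rho:
                best_pair, rho = (k, l), C
    return best_pair, rho
```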

Definition 6

(Chernoff Superiority) For K-block SBMs, let \({\textbf{B}}, {\textbf{B}}^\prime \in (0, 1)^{K \times K}\) be two block connectivity probability matrices with a common vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), and let \(\rho _B\) and \(\rho _{B^\prime }\) denote the Chernoff information obtained as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively. We say that \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), denoted \({\textbf{B}} \succ {\textbf{B}}^\prime\), if \(\rho _B > \rho _{B^\prime }\).

Remark 4

If \({\textbf{B}}\) is Chernoff superior to \({\textbf{B}}^\prime\), then we can have better block recovery from \({\textbf{B}}\) than from \({\textbf{B}}^\prime\). In addition, Chernoff superiority is transitive, which is straightforward from the definition.

Dynamic network sampling

We start our analysis with the unobserved block connectivity probability matrix \({\textbf{B}}\) for the SBM and then illustrate how to carry the proposed methods over to real applications where we instead have the observed adjacency matrix \({\textbf{A}}\).

Consider the K-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and the vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\) with \(K > 2\). Given an initial sampling parameter \(p_0 \in (0, 1)\), the initial sampling is uniform at random, i.e.,

$$\begin{aligned} {\textbf{B}}_0 = p_0 {\textbf{B}}. \end{aligned}$$
(13)

This initial sampling simulates the case where one only observes a partial graph with a small portion of the edges instead of the entire graph with all existing edges.

Theorem 2

For K-block SBMs, given two block connectivity probability matrices \({\textbf{B}}, p{\textbf{B}} \in (0, 1)^{K \times K}\) with \(p \in (0, 1)\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ p {\textbf{B}}\).

The proof of Theorem 2 can be found in the Appendix. As an illustration, consider a 4-block SBM parametrized by the block connectivity probability matrix \({\textbf{B}}\) as

$$\begin{aligned} {\textbf{B}} = \begin{bmatrix} 0.04 &{} 0.08 &{} 0.10 &{} 0.18 \\ 0.08 &{} 0.16 &{} 0.20 &{} 0.36 \\ 0.10 &{} 0.20 &{} 0.25 &{} 0.45 \\ 0.18 &{} 0.36 &{} 0.45 &{} 0.81 \end{bmatrix}. \end{aligned}$$
(14)

Figure 1 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14) and \(p {\textbf{B}}\) for \(p \in (0, 1)\). In addition, Fig. 1a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 1b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). As suggested by Theorem 2, for any \(p \in (0, 1)\) we have \(\rho _{B} > \rho _{pB}\) and thus \({\textbf{B}} \succ p {\textbf{B}}\).

Fig. 1

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14) and \(p {\textbf{B}}\) for \(p \in (0, 1)\)
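
Combining the sketch above with the matrix in Eq. (14), the superiority claim of Theorem 2 can be spot-checked numerically (a hedged sketch reusing `chernoff_active_blocks`, not the code behind Fig. 1):

```python
import numpy as np

B = np.array([[0.04, 0.08, 0.10, 0.18],
              [0.08, 0.16, 0.20, 0.36],
              [0.10, 0.20, 0.25, 0.45],
              [0.18, 0.36, 0.45, 0.81]])       # Eq. (14)
pi = np.full(4, 0.25)                          # the balanced case of Fig. 1a

_, rho_B = chernoff_active_blocks(B, pi)
for p in (0.25, 0.50, 0.75):
    _, rho_pB = chernoff_active_blocks(p * B, pi)
    assert rho_pB < rho_B                      # Theorem 2: B is Chernoff superior to pB
```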

Now given a dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\), the baseline sampling scheme proceeds uniformly at random again, i.e.,

$$\begin{aligned} {\textbf{B}}_1 = {\textbf{B}}_0 + p_1 {\textbf{B}} = (p_0 + p_1) {\textbf{B}}. \end{aligned}$$
(15)

This dynamic network sampling simulates the situation where one is given limited resources to sample some extra edges after observing the partial graph with only a small portion of the edges. Since we only have a limited budget to sample another small portion of edges, one would benefit from identifying the vertex pairs that have the most influence on the community structure. In other words, the baseline sampling scheme just randomly chooses vertex pairs without using the information from the initially observed graph; our goal is to design an alternative scheme that optimizes this dynamic network sampling procedure, so that one can have better block recovery even with limited resources to observe only a partial graph with a small portion of the edges.

Corollary 1

For K-block SBMs, given a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), we have \({\textbf{B}} \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\) where \({\textbf{B}}_0\) is defined as in Eq. (13) with \(p_0 \in (0, 1)\) and \({\textbf{B}}_1\) is defined as in Eq. (15) with \(p_1 \in (0, 1-p_0)\).

The proof of Corollary 1 can be found in the Appendix. This corollary implies that we can have better block recovery from \({\textbf{B}}_1\) than from \({\textbf{B}}_0\).

Assumption 1

The Chernoff-active blocks after initial sampling are unique, i.e., there exists a unique pair \(\left( k_0^*, \ell _0^* \right) \in \{(k, \ell ) \; | \; 1 \le k < \ell \le K \}\) such that

$$\begin{aligned} \left( k_0^*, \ell _0^* \right) = \arg \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}_0, \varvec{\pi }), \end{aligned}$$
(16)

where \({\textbf{B}}_0\) is defined as in Eq. (13) and \(\varvec{\pi }\) is the vector of block assignment probabilities.

To improve on this baseline sampling scheme, we concentrate on the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\) after initial sampling, assuming Assumption 1 holds. Instead of sampling from the entire block connectivity probability matrix \({\textbf{B}}\) as the baseline sampling scheme in Eq. (15) does, we only sample the entries associated with the Chernoff-active blocks. As a competitor to \({\textbf{B}}_1\), our Chernoff-optimal dynamic network sampling scheme is then given by

$$\begin{aligned} \widetilde{{\textbf{B}}}_1 = {\textbf{B}}_0 + \frac{p_1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}, \end{aligned}$$
(17)

where \(\circ\) denotes the Hadamard product, \(\pi _{k_0^*}\) and \(\pi _{\ell _0^*}\) denote the block assignment probabilities for blocks \(k_0^*\) and \(\ell _0^*\) respectively, and \({\textbf{1}}_{k_0^*, \ell _0^*}\) is the \(K \times K\) binary matrix with 0’s everywhere except for 1’s associated with the Chernoff-active blocks \(\left( k_0^*, \ell _0^* \right)\), i.e., for any \(i, j \in \{1, \cdots , K \}\)

$$\begin{aligned} {\textbf{1}}_{k_0^*, \ell _0^*}[i, j] = {\left\{ \begin{array}{ll} 1 &{} \text {if} \;\; (i, j) \in \left\{ \left( k_0^*, k_0^* \right) , \; \left( k_0^*, \ell _0^* \right) , \; \left( \ell _0^*, k_0^* \right) , \; \left( \ell _0^*, \ell _0^* \right) \right\} \\ 0 &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$
(18)

Note that the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1 {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}\) assures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., to ensure \(\widetilde{{\textbf{B}}}_1[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require

$$\begin{aligned} p_1 \le p_1^{\text {max}} = \left( 1 - p_0 \right) \left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2. \end{aligned}$$
(19)
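
A sketch of the construction in Eqs. (13) and (17)-(19), reusing `chernoff_active_blocks` from above; the function name and error handling are ours.

```python
import numpy as np

def chernoff_optimal_B1(B, pi, p0, p1):
    """Concentrate the dynamic budget p1 on the Chernoff-active blocks of
    the initially sampled B0 = p0 * B; valid for p1 <= p1_max of Eq. (19)."""
    B0 = p0 * B                                     # Eq. (13)
    (k, l), _ = chernoff_active_blocks(B0, pi)      # (k0*, l0*), Assumption 1
    p1_max = (1 - p0) * (pi[k] + pi[l]) ** 2        # Eq. (19)
    if p1 > p1_max:
        raise ValueError("p1 over-samples B; use the adapted scheme, Eq. (21)")
    mask = np.zeros_like(B)
    mask[np.ix_([k, l], [k, l])] = 1.0              # indicator of Eq. (18)
    return B0 + p1 / (pi[k] + pi[l]) ** 2 * B * mask   # Eq. (17)
```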

Assumption 2

For K-block SBMs, given a block connectivity probability matrix \({\textbf{B}} \in (0, 1)^{K \times K}\) and a vector of block assignment probabilities \(\varvec{\pi } \in (0, 1)^K\), let \(p_1^* \in (0, p_1^{\text {max}}]\) be the smallest positive \(p_1 \le p_1^{\text {max}}\) such that

$$\begin{aligned} \arg \min _{k \ne \ell } C_{k ,\ell }(\widetilde{{\textbf{B}}}_1, \varvec{\pi }) \end{aligned}$$
(20)

is not unique where \(p_1^{\text {max}}\) is defined as in Eq. (19) and \(\widetilde{{\textbf{B}}}_1\) is defined as in Eq. (17). If the arg min is always unique, let \(p_1^* = p_1^{\text {max}}\).

For any \(p_1 \in (0, p_1^*)\), we can have better block recovery from \(\widetilde{{\textbf{B}}}_1\) than from \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is better than the baseline sampling scheme in terms of block recovery.

As an illustration, consider the 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 2 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) with dynamic network sampling parameter \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2. In addition, Fig. 2a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 2b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in (0, p_1^*)\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1 \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, in terms of Chernoff information, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery; in other words, to reach the same level of performance, in terms of Chernoff information, it needs fewer resources.

Fig. 2

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2

As described earlier, it may be the case that \(p_1^* < p_1^{\text {max}}\), at which point the Chernoff-active blocks change to \((k_1^*, \ell _1^*)\). This potential non-uniqueness of the Chernoff argmin is a consequence of our dynamic network sampling scheme. In the case of \(p_1 > p_1^*\), our Chernoff-optimal dynamic network sampling scheme is adapted as

$$\begin{aligned} \widetilde{{\textbf{B}}}_1^* = {\textbf{B}}_0 + \left( p_1 - p_1^* \right) {\textbf{B}} + \frac{p_1^*}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}. \end{aligned}$$
(21)

Similarly, the multiplier \(\frac{1}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2}\) on \(p_1^* {\textbf{B}} \circ {\textbf{1}}_{k_0^*, \ell _0^*}\) assures that we sample the same number of potential edges with \(\widetilde{{\textbf{B}}}_1^*\) as we do with \({\textbf{B}}_1\) in the baseline sampling scheme. In addition, to avoid over-sampling with respect to \({\textbf{B}}\), i.e., to ensure \(\widetilde{{\textbf{B}}}_1^*[i, j] \le {\textbf{B}}[i, j]\) for any \(i, j \in \{1, \cdots , K \}\), we require

$$\begin{aligned} p_1 \le p_{11}^{\text {max}} = 1 - p_0 - \frac{p_1^*}{\left( \pi _{k_0^*} + \pi _{\ell _0^*}\right) ^2 } + p_1^*. \end{aligned}$$
(22)
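
Continuing the sketch for the regime \(p_1 > p_1^*\) as in Eqs. (21)-(22); here \(p_1^*\) is supplied by the caller, since locating it (Assumption 2) requires scanning \(p_1\) for ties in the Chernoff argmin.

```python
import numpy as np

def chernoff_optimal_B1_adapted(B, pi, p0, p1, p1_star):
    """Spend p1* on the original Chernoff-active blocks and spread the
    remaining budget p1 - p1* uniformly over B, per Eq. (21)."""
    B0 = p0 * B
    (k, l), _ = chernoff_active_blocks(B0, pi)
    p11_max = 1 - p0 - p1_star / (pi[k] + pi[l]) ** 2 + p1_star   # Eq. (22)
    assert p1_star <= p1 <= p11_max, "p1 outside [p1*, p11_max]"
    mask = np.zeros_like(B)
    mask[np.ix_([k, l], [k, l])] = 1.0
    return B0 + (p1 - p1_star) * B + p1_star / (pi[k] + pi[l]) ** 2 * B * mask
```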

For any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we can have better block recovery from \(\widetilde{{\textbf{B}}}_1^*\) than from \({\textbf{B}}_1\), i.e., our Chernoff-optimal dynamic network sampling scheme is again better than the baseline sampling scheme in terms of block recovery.

As an illustration, consider a 4-block SBM with initial sampling parameter \(p_0 = 0.01\) and block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). Figure 3 shows the Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) with dynamic network sampling parameter \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where \(p_1^*\) is defined as in Assumption 2 and \(p_{11}^{\text {max}}\) is defined as in Eq. (22). In addition, Fig. 3a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\) and Fig. 3b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\). Note that for any \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) we have \(\rho _{B}> \rho _{{\widetilde{B}}_1^*}> \rho _{B_1} > \rho _{B_0}\) and thus \({\textbf{B}} \succ \widetilde{{\textbf{B}}}_1^* \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). That is, the adapted Chernoff-optimal dynamic network sampling scheme still yields better block recovery results, in terms of Chernoff information, given the same amount of resources.

Fig. 3

Chernoff information \(\rho\) as in Eq. (10) corresponding to \({\textbf{B}}\) as in Eq. (14), \({\textbf{B}}_0\) as in Eq. (13), \({\textbf{B}}_1\) as in Eq. (15), and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\) where \(p_1^*\) is defined as in Assumption 2 and \(p_{11}^{\text {max}}\) is defined as in Eq. (22)

Now we illustrate how the proposed Chernoff-optimal dynamic network sampling scheme can be carried over to real applications. We summarize the uniform dynamic sampling scheme (baseline) as Algorithm 1 and our Chernoff-optimal dynamic network sampling scheme as Algorithm 2. Recall that given a potential edge set E and an initial sampling parameter \(p_0 \in (0, 1)\), we have the initial edge set \(E_0 \subset E\) with \(|E_0 |= p_0 |E |\). The goal is to dynamically sample new edges from the potential edge set so that we can have better block recovery given limited resources.

Algorithm 1: Uniform dynamic network sampling (baseline)
Algorithm 2: Chernoff-optimal dynamic network sampling
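
The exact pseudocode lives in the original algorithm figures; below is a hedged Python sketch of the spirit of Algorithm 2 on an observed graph, reusing the `ase` and `chernoff_active_blocks` sketches from earlier. Here `E` is the potential edge set and `E0` the initially checked subset, both Python sets of vertex-index pairs; all names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster(A_obs, K):
    """ASE followed by GMM (Steps 3-4 of Algorithm 1)."""
    return GaussianMixture(n_components=K, n_init=5).fit_predict(ase(A_obs, d=K))

def chernoff_optimal_sampling(A_obs, E, E0, K, p1, rng=None):
    """Estimate blocks and (B, pi) from the initially observed graph, find
    the Chernoff-active pair, and spend the dynamic budget on unchecked
    vertex pairs within those two blocks."""
    rng = np.random.default_rng() if rng is None else rng
    tau = cluster(A_obs, K)
    pi = np.bincount(tau, minlength=K) / len(tau)
    Bhat = np.zeros((K, K))                       # plug-in estimate of B
    for k in range(K):                            # assumes every block non-empty
        for l in range(K):
            Bhat[k, l] = A_obs[np.ix_(tau == k, tau == l)].mean()
    (k0, l0), _ = chernoff_active_blocks(Bhat, pi)
    candidates = [e for e in E - E0 if {tau[e[0]], tau[e[1]]} <= {k0, l0}]
    budget = min(int(p1 * len(E)), len(candidates))
    chosen = rng.choice(len(candidates), size=budget, replace=False)
    return E0 | {candidates[i] for i in chosen}   # augmented set of pairs to check
```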

Experiments

Simulations

In addition to the Chernoff analysis, we also evaluate our Chernoff-optimal dynamic network sampling scheme via simulations. In particular, consider the 4-block SBM parameterized by the block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) and dynamic network sampling parameter \(p_1 \in (0, p_{11}^{\text {max}}]\) where \(p_{11}^{\text {max}}\) is defined as in Eq. (22). We fix the initial sampling parameter \(p_0 = 0.01\). For each \(p_1 \in (0, p_1^*)\) where \(p_1^*\) is defined as in Assumption 2, we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1\) as in Eq. (17) respectively. For each \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), we simulate 50 adjacency matrices with \(n = 12000\) vertices from \({\textbf{B}}_1\) as in Eq. (15) and \(\widetilde{{\textbf{B}}}_1^*\) as in Eq. (21) respectively. In addition, Fig. 4a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 3000 vertices in each block, and Fig. 4b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 1500 vertices in two of the blocks and 4500 vertices in the other two blocks. We then apply ASE \(\circ\) GMM (Steps 3 and 4 in Algorithm 1) to recover block assignments and adopt the adjusted Rand index (ARI) to measure the performance. Figure 4 shows ARI (mean \(\pm\) stderr) associated with \({\textbf{B}}_1\) for \(p_1 \in (0, p_{11}^{\text {max}}]\), \(\widetilde{{\textbf{B}}}_1\) for \(p_1 \in (0, p_1^*)\), and \(\widetilde{{\textbf{B}}}_1^*\) for \(p_1 \in [p_1^*, p_{11}^{\text {max}}]\), where the dashed lines denote \(p_1^*\). Note that we have better block recovery from \(\widetilde{{\textbf{B}}}_1\) and \(\widetilde{{\textbf{B}}}_1^*\) than from \({\textbf{B}}_1\), which agrees with our results from the Chernoff analysis.
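
A condensed sketch of one arm of this simulation, reusing `B`, `cluster`, and `chernoff_optimal_B1` from the earlier sketches; we shrink n so the sketch runs quickly, so its numbers will not match Fig. 4.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def sample_sbm(B_eff, tau, rng):
    """Symmetric, hollow adjacency matrix with edge probabilities B_eff[tau_i, tau_j]."""
    P = B_eff[np.ix_(tau, tau)]
    A = np.triu((rng.random(P.shape) < P).astype(float), k=1)
    return A + A.T

rng = np.random.default_rng(0)
n, K, p0, p1 = 1200, 4, 0.01, 0.05               # p1 assumed below p1* here
pi = np.full(K, 0.25)
tau = rng.choice(K, size=n, p=pi)
B1_tilde = chernoff_optimal_B1(B, pi, p0, p1)    # Eq. (17)
scores = [adjusted_rand_score(tau, cluster(sample_sbm(B1_tilde, tau, rng), K))
          for _ in range(50)]
print(np.mean(scores), np.std(scores) / np.sqrt(len(scores)))   # mean, stderr
```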

Fig. 4

Simulations for 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) with initial sampling parameter \(p_0 = 0.01\) and dynamic network sampling parameter \(p_1 \in (0, p_{11}^{\text {max}}]\) where \(p_{11}^{\text {max}}\) is defined as in Eq. (22). The dashed lines denote \(p_1^*\) which is defined as in Assumption 2

Now we compare the performance of Algorithms 1 and 2 by actual block recovery results. In particular, we start with the 4-block SBM parameterized by the block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14). We consider dynamic network sampling parameter \(p_1 \in (0, 1-p_0)\) where \(p_0\) is the initial sampling parameter. For each \(p_1\), we simulate 50 adjacency matrices with \(n = 4000\) vertices and retrieve the associated potential edge sets. We fix the initial sampling parameter \(p_0 = 0.15\) and randomly sample initial edge sets. We then apply both algorithms to estimate the block assignments and adopt ARI to measure the performance. Figure 5 shows ARI (mean \(\pm\) stderr) of the two algorithms for \(p_1 \in (0, 0.85)\), where Fig. 5a assumes \(\varvec{\pi } = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\), i.e., 1000 vertices in each block, and Fig. 5b assumes \(\varvec{\pi } = (\frac{1}{8}, \frac{1}{8}, \frac{3}{8}, \frac{3}{8})\), i.e., 500 vertices in two of the blocks and 1500 vertices in the other two blocks. Note that both algorithms tend to perform better as \(p_1\) increases, i.e., as we sample more edges, and Algorithm 2 consistently recovers the block structure more accurately than Algorithm 1. That is, given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery results; in other words, to reach the same level of performance, in terms of the empirical clustering results, it needs fewer resources.

Fig. 5

Simulations for 4-block SBM parameterized by block connectivity probability matrix \({\textbf{B}}\) as in Eq. (14) with initial sampling parameter \(p_0 = 0.15\) and dynamic network sampling parameter \(p_1 \in (0, 0.85)\)

Real data

We also evaluate the performance of Algorithms 1 and 2 in real applications. We conduct real data experiments on a diffusion MRI connectome dataset (Priebe et al. 2019). There are 114 graphs (connectomes) estimated by the NDMG pipeline (Kiar et al. 2018) in this dataset. Each vertex in these graphs (the number of vertices n varies from 23728 to 42022) has a {Left, Right} hemisphere label and a {Gray, White} tissue label. We consider the potential 4 blocks as {LG, LW, RG, RW} where L and R denote the Left and Right hemisphere labels and G and W denote the Gray and White tissue labels. Here we consider initial sampling parameter \(p_0 = 0.25\) and dynamic network sampling parameter \(p_1 = 0.25\). Let \(\Delta = \text {ARI(Algo2)} - \text {ARI(Algo1)}\) where ARI(Algo1) and ARI(Algo2) denote the ARI when we apply Algorithms 1 and 2 respectively. The following hypothesis test yields a p-value of 0.0184. Figure 6 shows the algorithms’ comparative performance via boxplot and histogram.

$$\begin{aligned} H_0: \; \text {median}(\Delta ) \le 0 \qquad \text {vs.} \qquad H_A: \; \text {median}(\Delta ) > 0. \end{aligned}$$
(23)
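
The article reports the p-value without naming the test; one natural choice consistent with the one-sided hypotheses in Eq. (23) is a Wilcoxon signed-rank test, sketched here with stand-in data (the real \(\Delta\) values come from the 114 connectomes).

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
delta = rng.normal(0.01, 0.05, size=114)              # stand-in for ARI(Algo2) - ARI(Algo1)
stat, pval = wilcoxon(delta, alternative="greater")   # H_A: median(Delta) > 0
```
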
Fig. 6

Algorithms’ comparative performance on diffusion MRI connectome data via ARI with initial sampling parameter \(p_0 = 0.25\) and dynamic network sampling parameter \(p_1 = 0.25\)

Furthermore, we test our algorithms on a Microsoft Bing entity dataset (Agterberg et al. 2020). There are 2 graphs in this dataset, each with 13535 vertices. We treat the block assignments estimated from the complete graph as ground truth. We consider initial sampling parameters \(p_0 \in \left\{ 0.2, \; 0.3 \right\}\) and dynamic network sampling parameters \(p_1 \in \left\{ 0, \; 0.05, \; 0.1, \; 0.15, \; 0.2 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 7 shows the results, where ARI is reported as mean (±stderr).

Fig. 7

Algorithms’ comparative performance on Microsoft Bing entity data via ARI with different initial sampling parameters \(p_0\) and dynamic network sampling parameters \(p_1\)

We also conduct real data experiments on 2 social network datasets.

  • LastFM Asia social network dataset (Leskovec and Krevl 2014; Rozemberczki and Sarkar 2020): Vertices (the number of vertices \(n = 7624\)) represent LastFM users from Asian countries and edges (the number of edges \(e = 27806\)) represent mutual follower relationships. We treat the 18 different user locations, derived from the country field for each user, as the potential blocks.

  • Facebook large page-page network dataset (Leskovec and Krevl 2014; Rozemberczki et al. 2019): Vertices (the number of vertices \(n = 22470\)) represent official Facebook pages and edges (the number of edges \(e = 171002\)) represent mutual likes. We treat the 4 page types {Politician, Governmental Organization, Television Show, Company}, which are defined by Facebook, as the potential blocks.

We consider initial sampling parameters \(p_0 \in \left\{ 0.15, \; 0.35 \right\}\) and dynamic network sampling parameters \(p_1 \in \left\{ 0.05, \; 0.1, \; 0.15, \; 0.2, \; 0.25 \right\}\). For each \(p_1\), we sample 100 times and compare the overall performance of Algorithms 1 and 2. Figure 8 shows the results, where ARI is reported as mean (±stderr). Again, the results suggest that given the same amount of resources, the proposed Chernoff-optimal dynamic network sampling scheme yields better block recovery; in other words, to reach the same level of performance, in terms of the empirical clustering results, it needs fewer resources.

Fig. 8

Algorithms’ comparative performance on social network data via ARI with different initial sampling parameter \(p_0\) and dynamic network sampling parameter \(p_1\)

Discussion

We propose a dynamic network sampling scheme to optimize block recovery for the SBM when we only have a limited budget to observe a partial graph. Theoretically, we provide justification of our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance, in terms of block recovery (community detection), of our method on several real datasets including a diffusion MRI connectome dataset, a Microsoft Bing entity graph transitions dataset, and social network datasets. Both theoretical and practical results suggest that our method can identify vertices that have the most impact on block structure and check only whether there are edges between them, saving significant resources while still recovering the block structure.

The Chernoff-optimal dynamic sampling scheme depends on the initial clustering results to identify the Chernoff-active blocks and construct the dynamic edge set, so its performance could be degraded if the initial clustering is far from ideal. One future direction is to design a strategy that reduces this dependency so that the proposed scheme is more robust.

Availability of data and materials

Social network datasets are available at https://www.snap.stanford.edu/data/.

Abbreviations

SBM:

Stochastic Blockmodel

GRDPG:

Generalized random dot product graph

ASE:

Adjacency spectral embedding

LSE:

Laplacian spectral embedding

GMM:

Gaussian mixture modeling

BIC:

Bayesian information criterion

ARI:

Adjusted Rand index

stderr:

Standard error

NDMG:

NeuroData’s magnetic resonance imaging to graphs

References

  • Agterberg J, Park Y, Larson J, White C, Priebe CE, Lyzinski V (2020) Vertex nomination, consistent estimation, and adversarial modification. Electron J Stat 14(2):3230–3267

  • Athreya A, Priebe CE, Tang M, Lyzinski V, Marchette DJ, Sussman DL (2016) A limit theorem for scaled eigenvectors of random dot product graphs. Sankhya A 78(1):1–18

  • Athreya A, Fishkind DE, Tang M, Priebe CE, Park Y, Vogelstein JT, Levin K, Lyzinski V, Qin Y (2017) Statistical inference on random dot product graphs: a survey. J Mach Learn Res 18(1):8393–8484

  • Binkiewicz N, Vogelstein JT, Rohe K (2017) Covariate-assisted spectral clustering. Biometrika 104(2):361–377

  • Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann Math Stat 23(4):493–507

  • Chernoff H (1956) Large-sample theory: parametric case. Ann Math Stat 27(1):1–22

  • Choi DS, Wolfe PJ, Airoldi EM (2012) Stochastic blockmodels with a growing number of classes. Biometrika 99(2):273–284

  • Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44

  • Gallagher I, Bertiger A, Priebe C, Rubin-Delanchy P (2019) Spectral clustering in the weighted stochastic block model. arXiv:1910.05534

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York

  • Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109–137

  • Horn RA, Johnson CR (2012) Matrix Analysis. Cambridge University Press, New York

  • Huang S, Feng Y (2018) Pairwise covariates-adjusted block model for community detection. arXiv:1807.03469

  • Jolliffe IT, Cadima J (2016) Principal component analysis: a review and recent developments. Philos Trans R Soc A Math Phys Eng Sci 374(2065):20150202

  • Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107

  • Kiar G, Bridgeford EW, Gray Roncal WR, Chandrashekhar V, Mhembere D, Ryman S, Zuo X-N, Margulies DS, Craddock RC, Priebe CE, Jung R, Calhoun VD, Caffo B, Burns R, Milham MP, Vogelstein JT (2018) A high-throughput pipeline identifies robust connectomes but troublesome variability. bioRxiv, 188706

  • Leskovec J, Krevl A (2014) SNAP datasets: stanford large network dataset collection. http://snap.stanford.edu/data

  • Lyzinski V, Sussman DL, Tang M, Athreya A, Priebe CE (2014) Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding. Electron J Stat 8(2):2905–2922

  • Lyzinski V, Tang M, Athreya A, Park Y, Priebe CE (2016) Community detection and classification in hierarchical stochastic blockmodels. IEEE Trans Netw Sci Eng 4(1):13–26

  • McSherry F (2001) Spectral partitioning of random graphs. In: Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp 529–537. IEEE

  • Mele A, Hao L, Cape J, Priebe CE (2022) Spectral inference for large stochastic blockmodels with nodal covariates. J Bus Econ Stat

  • Mu C, Mele A, Hao L, Cape J, Athreya A, Priebe CE (2022) On spectral algorithms for community detection in stochastic blockmodel graphs with vertex covariates. IEEE Trans Netw Sci Eng

  • Priebe CE, Park Y, Vogelstein JT, Conroy JM, Lyzinski V, Tang M, Athreya A, Cape J, Bridgeford E (2019) On a two-truths phenomenon in spectral graph clustering. Proc Natl Acad Sci 116(13):5995–6000

  • Purohit S, Choudhury S, Holder LB (2017) Application-specific graph sampling for frequent subgraph mining and community detection. In: 2017 IEEE International Conference on Big Data (Big Data), pp 1000–1005. IEEE

  • Rohe K, Chatterjee S, Yu B (2011) Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat 39(4):1878–1915

  • Roy S, Atchadé Y, Michailidis G (2019) Likelihood inference for large scale stochastic blockmodels with covariates based on a divide-and-conquer parallelizable algorithm with communication. J Comput Graph Stat 28(3):609–619

  • Rozemberczki B, Allen C, Sarkar R (2019) Multi-scale attributed node embedding. arXiv:1909.13021

  • Rozemberczki B, Sarkar R (2020) Characteristic functions on graphs: birds of a feather, from statistical descriptors to parametric models. In: Proceedings of the 29th ACM International conference on information and knowledge management (CIKM ’20), pp 1325–1334. ACM

  • Rubin-Delanchy P, Priebe CE, Tang M, Cape J (2022) A statistical interpretation of spectral embedding: the generalised random dot product graph. J R Stat Soc

  • Sussman DL, Tang M, Fishkind DE, Priebe CE (2012) A consistent adjacency spectral embedding for stochastic blockmodel graphs. J Am Stat Assoc 107(499):1119–1128

  • Sweet TM (2015) Incorporating covariates into stochastic blockmodels. J Educ Behav Stat 40(6):635–664

  • Tang M, Priebe CE (2018) Limit theorems for eigenvectors of the normalized laplacian for random graphs. Ann Stat 46(5):2360–2415

  • Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416

  • Yun S-Y, Proutiere A (2014) Community detection via random and adaptive sampling. In: Conference on Learning Theory, pp 138–175. PMLR

  • Zhu M, Ghodsi A (2006) Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput Stat Data Anal 51(2):918–930

Acknowledgements

This problem was posed to us by Adam Cardinal-Stakenas and Kevin Hoover.

Funding

Cong Mu’s work is partially supported by the Johns Hopkins Mathematical Institute for Data Science (MINDS) Data Science Fellowship.

Author information

Contributions

CM developed the theory, designed and implemented the methods, conducted the experiments, and wrote the manuscript. YP implemented the methods, conducted the experiments, and edited the manuscript. CEP formulated the problem, designed the methods, developed the theory, and edited the manuscript. All authors read and approved the manuscript.

Corresponding author

Correspondence to Cong Mu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Theorem 2

Let \({\textbf{B}} = {\textbf{U}} {\textbf{S}} {\textbf{U}}^\top\) be the spectral decomposition of \({\textbf{B}}\) and \({\textbf{B}}^\prime = p {\textbf{B}}\) with \(p \in (0, 1)\). Then we have

$$\begin{aligned} {\textbf{B}}^\prime = {\textbf{U}}^\prime {\textbf{S}} \left( {\textbf{U}}^\prime \right) ^\top \qquad \text {where} \qquad {\textbf{U}}^\prime = \sqrt{p} {\textbf{U}}. \end{aligned}$$
(24)

By Remark 1, to represent these two SBMs parametrized by two block connectivity matrices \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively (with the same block assignment probability vector \(\varvec{\pi }\)) in the GRDPG models, we can take

$$\begin{aligned} \begin{aligned} \varvec{\nu }&= \begin{bmatrix} \varvec{\nu }_1&\cdots&\varvec{\nu }_K \end{bmatrix}^\top = {\textbf{U}} |{\textbf{S}}|^{1/2} \in {\mathbb {R}}^{K \times d}, \\ \varvec{\nu }^\prime&= \begin{bmatrix} \varvec{\nu }_1^\prime&\cdots&\varvec{\nu }_K^\prime \end{bmatrix}^\top = {\textbf{U}}^\prime |{\textbf{S}}|^{1/2} = \sqrt{p} {\textbf{U}} |{\textbf{S}}|^{1/2} = \sqrt{p} \varvec{\nu } \in {\mathbb {R}}^{K \times d}. \end{aligned} \end{aligned}$$
(25)

Then for any \(k \in \{1, \cdots , K \}\), we have \(\varvec{\nu }_k^\prime = \sqrt{p} \varvec{\nu }_k \in {\mathbb {R}}^{d}\). By Theorem 1, we have

$$\begin{aligned} \begin{aligned} \varvec{\Delta }&= \sum _{k=1}^{K} \pi _k \varvec{\nu }_k \varvec{\nu }_k^\top \in {\mathbb {R}}^{d \times d}, \\ \varvec{\Delta }^\prime&= \sum _{k=1}^{K} \pi _k \varvec{\nu }_k^\prime \left( \varvec{\nu }_k^\prime \right) ^\top = p \sum _{k=1}^{K} \pi _k \varvec{\nu }_k \varvec{\nu }_k^\top = p \varvec{\Delta } \in {\mathbb {R}}^{d \times d}. \end{aligned} \end{aligned}$$
(26)

Note that \({\textbf{B}}\) and \({\textbf{B}}^\prime = p{\textbf{B}}\) have eigenvalues with the same signs, thus we have \({\textbf{I}}_{d_+ d_-} = {\textbf{I}}_{d_+ d_-}^\prime\). See also Lemma 2 of Gallagher et al. (2019). Then for \(k \in \{1, \cdots , K \}\), we have

$$\begin{aligned} \begin{aligned} \varvec{\Sigma }_k&= {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} {\mathbb {E}} \left[ \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \left( 1-\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu } \right) \varvec{\nu } \varvec{\nu }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \\&= {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} \left[ \sum _{\ell =1}^{K} \pi _{\ell } \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \left( 1-\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \varvec{\nu }_{\ell } \varvec{\nu }_{\ell }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \in {\mathbb {R}}^{d \times d}, \\[1em] \varvec{\Sigma }_k^{\prime }&= \frac{1}{p^2} {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} \left[ p^2 \sum _{\ell =1}^{K} \pi _{\ell } \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \left( 1-p \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \varvec{\nu }_{\ell } \varvec{\nu }_{\ell }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \\&= {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} \left[ p \sum _{\ell =1}^{K} \pi _{\ell } \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \left( 1-\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \varvec{\nu }_{\ell } \varvec{\nu }_{\ell }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \\&\quad + {\textbf{I}}_{d_+ d_-} \varvec{\Delta }^{-1} \left[ (1-p) \sum _{\ell =1}^{K} \pi _{\ell } \left( \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_{\ell } \right) \varvec{\nu }_{\ell } \varvec{\nu }_{\ell }^\top \right] \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \\&= p \varvec{\Sigma }_k + {\textbf{V}}^\top {\textbf{D}}_k(p) {\textbf{V}} \in {\mathbb {R}}^{d \times d}, \end{aligned} \end{aligned}$$
(27)

where

$$\begin{aligned} \begin{aligned} {\textbf{V}}&= \varvec{\nu } \varvec{\Delta }^{-1} {\textbf{I}}_{d_+ d_-} \in {\mathbb {R}}^{K \times d}, \\ {\textbf{D}}_k(p)&= (1-p) \text {diag} \left( \pi _1 \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_1, \cdots , \pi _K \varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_K \right) \in (0, 1)^{K \times K}. \end{aligned} \end{aligned}$$
(28)

Recall that by Remark 1, we have \(\varvec{\nu }_k^\top {\textbf{I}}_{d_+ d_-} \varvec{\nu }_\ell = {\textbf{B}}_{k \ell } \in (0, 1)\) for all \(k, \ell \in \{ 1, \cdots , K \}\). Then \({\textbf{D}}_k(p)\) is positive-definite for any \(k \in \{1, \ldots , K \}\) and \(p \in (0, 1)\). For \(k, \ell \in \{1, \ldots , K \}\) and \(t \in (0, 1)\), let \(\varvec{\Sigma }_{k\ell }(t)\) and \(\varvec{\Sigma }_{k\ell }^{\prime }(t)\) denote the matrices as in Eq. (11) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively, i.e.,

$$\begin{aligned} \begin{aligned} \varvec{\Sigma }_{k\ell }(t)&= t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell \in {\mathbb {R}}^{d \times d}, \\[1em] \varvec{\Sigma }_{k\ell }^{\prime }(t)&= t \varvec{\Sigma }_k^{\prime } + (1-t) \varvec{\Sigma }_\ell ^{\prime } \\&= t \left[ p \varvec{\Sigma }_k + {\textbf{V}}^\top {\textbf{D}}_k(p) {\textbf{V}} \right] + (1-t) \left[ p \varvec{\Sigma }_\ell + {\textbf{V}}^\top {\textbf{D}}_\ell (p) {\textbf{V}} \right] \\&= p \left[ t \varvec{\Sigma }_k + (1-t) \varvec{\Sigma }_\ell \right] + {\textbf{V}}^\top \left[ t {\textbf{D}}_k(p) + (1-t) {\textbf{D}}_\ell (p) \right] {\textbf{V}} \\&= p \varvec{\Sigma }_{k\ell }(t) + {\textbf{V}}^\top {\textbf{D}}_{k \ell }(p, t) {\textbf{V}} \in {\mathbb {R}}^{d \times d}, \end{aligned} \end{aligned}$$
(29)

where

$$\begin{aligned} {\textbf{D}}_{k \ell }(p, t) = t {\textbf{D}}_k(p) + (1-t) {\textbf{D}}_\ell (p) \in {\mathbb {R}}_+^{K \times K}. \end{aligned}$$
(30)

Recall that \({\textbf{D}}_k(p)\) and \({\textbf{D}}_\ell (p)\) are both positive-definite for any \(k, \ell \in \{1, \ldots , K \}\) and \(p \in (0, 1)\), thus \({\textbf{D}}_{k \ell }(p, t)\) is also positive-definite for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\). Now by the Sherman-Morrison-Woodbury formula (Horn and Johnson 2012), we have

$$\begin{aligned} \begin{aligned} \left[ \varvec{\Sigma }_{k\ell }^{\prime }(t) \right] ^{-1}&= \left[ p \varvec{\Sigma }_{k\ell }(t) + {\textbf{V}}^\top {\textbf{D}}_{k \ell }(p, t) {\textbf{V}} \right] ^{-1} \\&= \frac{1}{p} \varvec{\Sigma }_{k\ell }^{-1}(t) - \frac{1}{p^2} \varvec{\Sigma }_{k\ell }^{-1}(t) {\textbf{V}}^\top \left[ {\textbf{D}}_{k \ell }^{-1}(p, t) + \frac{1}{p} {\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) {\textbf{V}}^\top \right] ^{-1} {\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) \\&= \frac{1}{p} \varvec{\Sigma }_{k\ell }^{-1}(t) - \frac{1}{p^2} \varvec{\Sigma }_{k\ell }^{-1}(t) {\textbf{V}}^\top {\textbf{M}}_{k \ell }^{-1}(p, t){\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) \in {\mathbb {R}}^{d \times d}, \end{aligned} \end{aligned}$$
(31)

where

$$\begin{aligned} {\textbf{M}}_{k \ell }(p, t) = {\textbf{D}}_{k \ell }^{-1}(p, t) + \frac{1}{p} {\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) {\textbf{V}}^\top \in {\mathbb {R}}^{K \times K}. \end{aligned}$$
(32)

Recall that for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\), \({\textbf{D}}_{k \ell }(p, t)\) and \(\varvec{\Sigma }_{k\ell }(t)\) are both positive-definite, thus \({\textbf{M}}_{k \ell }(p, t)\) is also positive-definite. Then for any \(k, \ell \in \{1, \ldots , K \}\) and \(p,t \in (0, 1)\), we have

$$\begin{aligned} \begin{aligned} (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime )^\top \left[ \varvec{\Sigma }_{k\ell }^{\prime }(t) \right] ^{-1} (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime )&= p (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \left[ \frac{1}{p} \varvec{\Sigma }_{k\ell }^{-1}(t) - \frac{1}{p^2} \varvec{\Sigma }_{k\ell }^{-1}(t) {\textbf{V}}^\top {\textbf{M}}_{k \ell }^{-1}(p, t) {\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) \right] (\varvec{\nu }_k - \varvec{\nu }_\ell ) \\&= (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) - \frac{1}{p} {\textbf{x}}^\top {\textbf{M}}_{k \ell }^{-1}(p, t) {\textbf{x}} \\&= (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) - h_{k \ell }(p, t), \end{aligned} \end{aligned}$$
(33)

where

$$\begin{aligned} \begin{aligned} {\textbf{x}}&= {\textbf{V}} \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) \in {\mathbb {R}}^K, \\ h_{k \ell }(p, t)&= \frac{1}{p} {\textbf{x}}^\top {\textbf{M}}_{k \ell }^{-1}(p, t) {\textbf{x}}. \end{aligned} \end{aligned}$$
(34)

Recall that for any \(k, \ell \in \{1, \ldots , K \}\) and \(p, t \in (0, 1)\), \({\textbf{M}}_{k \ell }(p, t)\) is positive-definite, thus we have \(h_{k \ell }(p, t) > 0\). Together with Eq. (33), we have

$$\begin{aligned} t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) > t (1-t) (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime )^\top \left[ \varvec{\Sigma }_{k\ell }^{\prime }(t) \right] ^{-1} (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime ). \end{aligned}$$
(35)

Thus for any \(k, \ell \in \{1, \ldots , K \}\), we have

$$\begin{aligned} \begin{aligned} C_{k ,\ell }({\textbf{B}}, \varvec{\pi })&=\sup _{t \in (0, 1)} \left[ t (1-t) (\varvec{\nu }_k - \varvec{\nu }_\ell )^\top \varvec{\Sigma }_{k\ell }^{-1}(t) (\varvec{\nu }_k - \varvec{\nu }_\ell ) \right] , \\&> \sup _{t \in (0, 1)} \left[ t (1-t) (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime )^\top \left[ \varvec{\Sigma }_{k\ell }^{\prime }(t) \right] ^{-1} (\varvec{\nu }_k^\prime - \varvec{\nu }_\ell ^\prime ) \right] \\&= C_{k ,\ell }({\textbf{B}}^\prime , \varvec{\pi }). \end{aligned} \end{aligned}$$
(36)

Let \(\rho _B\) and \(\rho _{B^\prime }\) denote the Chernoff information obtained as in Eq. (10) corresponding to \({\textbf{B}}\) and \({\textbf{B}}^\prime\) respectively (with the same block assignment probability vector \(\varvec{\pi }\)). Then we have

$$\begin{aligned} \rho _{B} \approx \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}, \varvec{\pi }) > \min _{k \ne \ell } C_{k ,\ell }({\textbf{B}}^\prime , \varvec{\pi }) \approx \rho _{B^\prime }. \end{aligned}$$
(37)

Thus we have \({\textbf{B}} \succ {\textbf{B}}^\prime = p {\textbf{B}}\) for \(p \in (0, 1)\). \(\square\)

Proof of Corollary 1

By Eq. (13) and Eq. (15), we have

$$\begin{aligned} \begin{aligned} {\textbf{B}}_0&= \frac{p_0}{p_0+p_1} {\textbf{B}}_1, \\ {\textbf{B}}_1&= (p_0 + p_1) {\textbf{B}}. \end{aligned} \end{aligned}$$
(38)

Recall that \(p_0 \in (0, 1)\) and \(p_1 \in (0, 1-p_0)\). Then by Theorem 2, we have \({\textbf{B}} \succ {\textbf{B}}_1 \succ {\textbf{B}}_0\). \(\square\)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mu, C., Park, Y. & Priebe, C.E. Dynamic network sampling for community detection. Appl Netw Sci 8, 5 (2023). https://doi.org/10.1007/s41109-022-00528-1
