Selecting a significance level in sequential testing procedures for community detection

While there have been numerous sequential algorithms developed to estimate community structure in networks, there is little available guidance and study of what significance level or stopping parameter to use in these sequential testing procedures. Most algorithms rely on prespecifiying the number of communities or use an arbitrary stopping rule. We provide a principled approach to selecting a nominal significance level for sequential community detection procedures by controlling the tolerance ratio, defined as the ratio of underfitting and overfitting probability of estimating the number of clusters in fitting a network. We introduce an algorithm for specifying this significance level from a user-specified tolerance ratio, and demonstrate its utility with a sequential modularity maximization approach in a stochastic block model framework. We evaluate the performance of the proposed algorithm through extensive simulations and demonstrate its utility in controlling the tolerance ratio in single-cell RNA sequencing clustering by cell type and by clustering a congressional voting network.


Introduction
In the last few decades, there has been an increasing interest among physicists, computer and social scientists to study network data.Identifying community structure in a networks has gained particular attention: the vertices in networks are often found to cluster into related groups where vertices within a community are more likely to be connected [see, e.g., Newman & Girvan (2004), Newman (2006)].The ability to detect such communities is crucial to understand the relationship between the structure and function of networks, such as the modeling of networks (Cheng et al., 2009), the evolution of networks (G.-Q.Zhang et al., 2008;Shen & Cheng, 2010), the resilience of networks (Albert et al., 1999;Cheng et al., 2010), and the capacity of networks (G.-Q.Zhang et al., 2007).The stochastic block model (Holland et al., 1983) is a popular model for community structures in network data where edge probabilities between and within communities are constant conditional on community membership.
Many community detection methods begin with a null model of no community structure.Historically, the most common approach involving a null model is the use of a node partition score that is large when nodes within a partition are highly interconnected, relative to what is expected under the null of no structure (Newman, 2006;Fortunato, 2010).Many sequential community detection algorithms perform this task by first dividing the network into two communities, and subsequently subdividing each community hierarchically, ideally terminating when the true number of communities, K, has been reached.One such algorithm that is widely used in literature is based on modularity maximization proposed by Newman (2006) and its different variants including fast greedy modularity optimization (Clauset et al., 2004), exhaustive modularity optimization via simulated annealing (Guimera et al. (2004), Massen & Doye (2005), Medus et al. (2005), Guimera & Amaral (2005)), fast modularity optimization (Blondel et al., 2008).Parallel community detection algorithms have garnered some attention over the last decade that modify existing algorithms to make them more suitable for the analysis of large networks.Riedy et al. (2011) modified the agglomerative community detection algorithm by choosing multiple contraction edges simultaneously as opposed to sequential contraction that is commonly done.Yang et al. (2016) compare several state-of-the-art algorithms on artificial networks in terms of accuracy and computing time.Que et al. (2015) proposed a parallel community detection algorithm derived from Louvain modularity maximization method using a novel graph mapping and data representation.A hypothesis testing framework based on modularity-based community detection has been studied by J. Zhang & Chen (2017) where they introduced a hypothesis testing procedure to determine the significance of the partitions obtained from maximizing the modularity function starting from a null model with no graph structure.However, this neglects the sequential nature of the test, and ignores correlations among test statistics which we incorporate in our approach.Bickel & Sarkar (2016) provides an algorithm for finding the number of clusters in a stochastic block framework using the Tracy-Widom distribution as the limiting distribution of the highest eigenvalue of the adjacency matrix, and therefore is not suitable for the small or moderate sized networks.To make a sequential community detection algorithm effective, the significance level for rejecting the null hypothesis needs to be specified for each test given by H 0 : K = j community against H a : K > j starting with j = 1 and incrementing j over the integers until the test fails to reject H 0 .The standard practice of setting the significance level arbitrarily to to 0.05 or 0.01 has drawbacks because it is susceptible to multiple testing leading to increased Type I error due to the repeated sequential tests.
To circumvent the multiple testing problem in sequential community detection procedures, analogous to controlling family-wise error rate, specifying a nominal significance level accounting for multiple tests is necessary.We aim to instead control for the underfitting (overfitting) probability, defined as the probability that the estimated number of communities obtained by a sequential testing procedure is less than (greater than) the true number of communities K present in the network.Any given contexts specific tolerance for overfitting and underfitting probabilities ultimately dictates the nominal significance level that should be used.We address the problem of finding the nominal significance level and aim to provide an algorithm to determine it aligns with a context-specific user-specified tolerance ratio, defined as the ratio of underfitting probability to overfitting probability in a generic sequential testing framework.Our algorithm hinges on finding a suitable estimate of the number of communities at a significance level that preserves the prespecified tolerance ratio.
The rest of this article is organized as follows.In Section 2, we first describe sequential community detection procedures and subsequently introduce our algorithm to choose a significance level guided by a pre-specified tolerance ratio.In Section 3, we provide an example of our approach applied to Newman's modularity maximization for sequential community detection to select an appropriate significance level.Section 4 describes the performance of our algorithm through extensive simulation studies in stochastic block model frameworks.We derive appropriate significance levels in two real applications in Section 5. Finally Section 6 concludes with a discussion of limitations and next directions for our approach.

Sequential community detection
In this section, we first describe a generalizable sequential testing procedure to detect the number of communities in a network.Secondly, we describe the estimation of the tolerance ratio by deriving the expressions of underfitting and overfitting probabilities using an estimate of the number of communities.This tolerance ratio estimate is a function of the nominal significance level, which we can then solve for to arrive at a desired prespecified level.

Sequential testing procedure
Assuming a network of size n, the sequential testing procedure can be described by the following hypotheses: for each integer j ≥ 1 until a test fails to reject.

Significance level from tolerance ratio
A common problem faced in community detection is the choice of an appropriate significance level α.Analogous to multiple testing problem (Benjamini & Hochberg, 1995) where the goal is to control the family-wise error rate (FWER) through some procedures such as Bonferroni correction, Tukey's range test etc., we focus on sequential community detection algorithms, where tests of the null hypothesis H 0 : K = j against the alternative H A : K > j are performed sequentially for j = 1, 2, . . .until a test fails to reject H 0 .We let p(j) be the p-value of the jth such test T (j; α) defined as: Using this sequential procedure, the estimated number of communities is: is a non-decreasing (step) function of α.
We define the underfitting probability to be pr( K(α) < K) = η u and the overfitting probability to be pr( K(α) > K) = η o .The tolerance ratio is defined as γ = η u /η o , where K is the true number of communities.One can note that γ ∈ [0, ∞).In particular, γ = 1 implies underfitting and overfitting probabilities are equally likely.For unknown K, this also suggests one approach to estimate K that is independent of α: select K to be the value of K that results from the widest subinterval of α in [0, 1].We call this α-free estimator K * .
We propose the following iterative procedure to identify the correct marginal significance level α to use from the user-specified tolerance ratio γ.

Input:
The original or estimated adjacency matrix A of a graph and user-specified tolerance γ 1.For a given α, perform sequential community detection to obtain K(α): For each k ∈ {1, 2, • • • , n}, we simulate SBM of size n and cluster k and use B bootstrap samples to compute the test statistic at k-th stage .This will give us empirical null distribution of the test statistic.Next we compare it with the observed value of the test statistic and find K(α) according to (2).
2. Determining K * : K(α) is a non-decreasing step function of α which can take integer values between 1 and n.Let Define K * = K(α M ).In other words, K * is the longest step when K(α) is plotted with respect to α.
The above algorithm yields an α which can be repeatedly seeded back into Step 1 until convergence in α is achieved.
Below we present a brief proof of the convergence of the algorithm which assumes the Lipschitz condition on γ, and exploits some key characteristics about the change of underfitting probability with respect to α.
Theorem 1. Suppose the target value of the significance is α 0 that corresponds to the tolerance ratio γ 0 .Also assume that the function γ(α) satisfies the Lipschitz condition where c > 0 is the Lipschitz constant, α, α * ∈ (0, 1), and B tends to ∞.Then if we terminate the algorithm with precision for the significance α, then the precision of the tolerance ratio is c .
Proof.First note that γ(α) = P ( K(α) < K * )/P ( K(α) > K * ) is an increasing function of the underfitting probability P ( K(α) < K * ), which in turn is a decreasing function of α.This implies γ(α) is a decreasing function of α.So, there exists a c (could be very large) so that the Lipschitz condition holds.Therefore, if we use the precision for α, the precision for γ is c .Two remarks are in order.
Remark 1.In Theorem 1, we assume that γ(α) is a continuous decreasing function of α.However, in Step 3 of our algorithm, γ(α) is guaranteed to be a non-increasing step function of α because we are estimating it empirically and K(b) (α) can take finitely many values in {1, 2, ..., n}.Therefore the difference |γ(α) − γ(α * )| can range over the entire real line, taking only finite values.The difference |α − α * | can range in the interval (0, 1).It is instructive to note that when the difference in α is zero or small, the corresponding difference in γ is also zero.Therefore, one can always pick c > 0 so that the Lipschitz condition is satisfied.
Remark 2. Our algorithm takes the user-specified tolerance as input and is expected to return a significance level is close to the tolerance ratio as possible.Like other iterative algorithms, we need some pre-defined precision that dictates the stopping criterion.This task is accomplished by fixing beforehand so that our algorithm returns a significance that lies within the -neighborhood of the optimal significance.Theorem 1 provides the stopping criterion in Step 4 based on the precision of α.

Example sequential community detection algorithm
While our approach for identifying an α that corresponds with a prespecfied tolerance ratio is agnostic to which sequential community detection algorithm is used, we detail one example use case here.Aside various community detection algorithms such as spectral clustering (White & Smyth, 2005; S. Zhang et al., 2007), random walks (Pons & Latapy, 2005), a popular approach to community detection is based on the idea of optimizing modularity.Modularity metrics were introduced by Newman & Girvan (2004), and the idea of detecting communities by optimizing a modularity function was proposed by Newman (2004) Nowadays, there are many variants of the modularity-based community detection approach to deal with directed or weighted networks (Leicht & Newman, 2008).Also, some variants of the modularity-based community detection approach use modularity functions with a somewhat modified mathematical structure (Reichardt & Bornholdt, 2006;Waltman et al., 2010;Traag et al., 2011).
Here we revisit Newman's sequential algorithm (Newman, 2006) of community detection which begins by first dividing the network into two communities and then subdividing into further communities by maximizing additional modularity; and we implement our approach to selecting an appropriate significance level in this context.
For a network with n vertices, let A denote the n × n adjacency matrix and s = (s 1 , s 2 , • • • , s n ) ∈ {−1, 1} n where s i = 1 if the i-th vertex belongs to group 1 and -1 otherwise.Let k i denote the degree of vertex i and m = n i=1 k i /2 be the total number of edges in the network.Then the modularity of the network is defined as where the matrix B = (B uv ) is defined as B uv = A uv − kukv 2m , a symmetric matrix of order n.
For further dividing a group j of size n j , the additional contribution to the modularity is is maximized in the similar way for Q in (3), where B (j) uv = B uv − δ uv l∈j B ul , and δ uv is the Kronecker δ-symbol.
If the total modularity of the network after splitting the network into j communities is Q (j) , then the gain in the modularity is defined by ∆Q (j) = Q (j+1) − Q (j) .Again, while we use this quantity ∆Q (j) as our test statistic for the jth step (H 0 : K = j vs H A : k > j), we stress that any sequential community detection algorithm can be adopted to this framework.

Simulations
We perform extensive simulation study in various directions to assess the performance of the proposed algorithm.In each set-up, networks of size n and 2n are simulated through SBM with K 0 number of balanced communities of size n/K 0 .We vary n ∈ {100, 200} corresponding to K 0 = 5, 10 respectively for symmetric edge probability matrix P of dimension K 0 of the form so that the diagonal and off-diagonal entries of P are 0.5 + and 0.5 − respectively implying that the difference between edge probability within and between community is 2 .We vary = 0.195, 0.010 to represent two cases of (S) strong and (W) weak community structure, respectively.

Estimated number of communities ( Kα )
For a fixed α, we simulate 1000 parametric bootstrap sample values of the null test statistic and calculate p-values by comparing them with the observed test statistic value.We start from the number of communities K = 1, and proceed by incrementing Kuntil the p-value is greater than α.We replicate the procedure 100 times and finally report the value of the estimated number of communities Kα by taking the mode of the 100 replications.
Next, we vary α ∈ {0.01, 0.05, 0.10, 0.20} and the corresponding Kα s are reported in Table 1.Further, the estimates of pr( Kα = K 0 ) are reported in Table 2 by taking the proportion of times Kα is equal to K 0 over 100 replications.It is instructive to note that entries in Table 1 is less than the significance level α.It can be shown by straightforward calculation that for a given α, pr( Kα = K 0 ) < α.
One can note that in the presence of strong differences in communities, estimated communities are close to the true number for α = 0.01, ..., 0.2.For weak signals, the number of communities is under estimated for the aforementioned α.However, the number of communities is over estimated for larger value of the significance level.This indicates that the choice of α can greatly influence K, which provides further incentivize for developing a rigorous approach to selecting an appropriate α.
Table 1: Mode of 100 independent replications ( Kα ), and proportion of times true number of communities correctly estimated pr( K = K 0 ) for different choice of α, P , and n.
Table 3: Choice of α for different choices of tolerance ratio γ and network size n(n = 2n) We apply our algorithm to the scRNA-seq data generated from the retina cells of two healthy adult donors using the 10X Genomics Chromium TM system.We should expect some clustering by cell type in networks derived from this data.Detailed preprocessing and donor characteristics of the scRNA-seq data can be found in Lyu et al. (2019).The data consists of 33694 genes sequenced over 92385 cells.The sequencing data were initially analyzed with R package Seurat (Satija et al., 2015) and each of the cells was identified as a particular cell-type.The virtual representation of the data in the t-SNE plot is given in Figure 1.Among different clusters in Seurat, we consider the data pertaining to five hierarchical clusters: "Astrocytes", "Endothelium", "Ganglion", "Horizontal", "Pericytes".Before we perform the analysis, we process the data in three steps.First, genes whose variability was less than the 50th quantile are filtered out, and then cells whose total cell counts across all genes are less than 500 and greater than 2500 are also filtered out.Second, we compute the normalized score (row wise) and perform a log transformation (log 2 (1 + x/10000)) as done in Booeshaghi & Pachter (2021) to convert the data into a continuous scale.The rationale behind such a transformation is that that different genes have different variances implying that genes that are highly expressed will have high variance whereas the genes that are barely expressed at all, will have almost zero variance.The transformed data is now used to compute correlations between the cells.Finally, for each cluster, we randomly select 100 cells ensuring that the within and between cluster correlations do not differ by more than 0.1 from those of the composite data.We use the correlation threshold (τ ) to construct an adjacency matrix A, and vary τ ∈ {0.3, 0.5, 0.7}, and report the significance level along with estimated number of communities in Table 4.We observe that estimated number of communities is larger as we increase the value of τ which gives rise to a denser network.The significance level (α) ranges over [0.01, 0.05] depending on the tolerance ratio.Also, for each choice of τ , the estimated number of communities is increasing with α.
It can also be noted from Table 4 that different values of α lead to different number of estimated communities.If one were to arbitrarily pick α as, say, 0.05, this choice can have a large impact on the analysis.For example, corresponding to τ = 0.3, α changes from 0.05 to 0.01 leading to different value of K. Thus the choice of α is an impactful decision, and the tolerance ratio presents an intuitive measure that allows the practitioner to place a value of overfitting relative to underfitting when performing community detection.

United States House Votes 1984 (USHV) data
In this example, we consider a data set of 267 democrats and 167 republican congressmen who has voted in 16 issues in 1984 in the United States of America.The data contains yes/no answer for each congressman on 16 different questions with some missing values.After removing the congressman who has not voted in more than three of the sixteen questions, the data is represented by a 417 × 17 matrix where the first column represents the political affiliation-republican and democrat.The adjacency matrix A is calculated by thresholding the correlations among congressmen by τ , i.e., if the correlation of voting between two congressmen is as high as τ , we assume they are connected by an edge and hence the corresponding entry of the adjacency matrix 1, and 0 otherwise.Finally we vary τ ∈ {0.3, 0.5, 0.7}.It is instructive to note that smaller values of τ leads to a more dense network.
In this data, the number of distinct communities is not expected to go below 2 because of the two party affiliations.However, in our analysis, the estimated number of communities varies in {3, 4, 5} (depending on the desired tolerance ratio) implying potential further subdivisions among political parties.Here too, the significance level has a large impact on the analysis.For example, when τ = 0.7, changing α from 0.06 to 0.04 drops the number of communities from 5 to 4. Therefore, a judicious choice of the significance level is necessary, and the tolerance ratio again provides a means of guiding this choice in an intuitive manner.

Discussion
We have proposed an algorithm to provide guidance to the practitioner in order to obtain a nominal significance level that matches their desired balance between overfitting and underfitting probabilities.Traditional approaches to estimate the number of communities often arbitrarily set the significance level, and the tolerance ratio presents a intuitive alternative.
To construct the test statistic in a sequential testing framework, we demonstrated the approach with Newman's modularity maximization method, although the procedure is general and can be applied equally to any sequential community detection approach.
Although here we have assumed a stochastic block model, a feasible extension of this approach would be to apply it to dynamic stochastic block models Matias & Miele (2015) in order to allow a time varying network structure.It is instructive to note that we proposed the solution using the sequential tests, and implemented the algorithm via bootstrap due to the lack of the analytic expression of the test statistic.A potential bottleneck that the proposed algorithm will face is when the network size is very large because bootstrapping will be computationally expensive.However, in case an analytic expression of the test statistic is available in closed form, the algorithm can be adapted trivially to use it in place of bootstrapping.This would further increase algorithmic stability by removing stochasticity introduced through the bootstrap.

Table 4 :Figure 1 :
Figure 1: Virtual representation of the estimated number of clusters of the analyzed scRNAseq data of human retina cells in the t-SNE plot obtained by selecting Seurat classified cell types namely: Astrocytes, Endothelium, Ganglion, Horizaontal, Pericytes in an equal manner of roughly 100 cells per cell type.

Table 5 :
Choice of α and corresponding estimated number of communities Kα for different values of tolerance ratio η across various choices of correlation threshold τ for the USVA data