Cross-validation of correlation networks using modular structure

Correlation networks derived from multivariate data appear in many applications across the sciences. These networks are usually dense and require sparsification to detect meaningful structure. However, current methods for sparsifying correlation networks struggle with balancing overfitting and underfitting. We propose a module-based cross-validation procedure to threshold these networks, making modular structure an integral part of the thresholding. We illustrate our approach using synthetic and real data and find that its ability to recover a planted partition has a step-like dependence on the number of data samples. The reward for sampling more varies non-linearly with the number of samples, with minimal gains after a critical point. A comparison with the well-established WGCNA method shows that our approach allows for revealing more modular structure in the data used here.


Introduction
Many important applications deal with networks derived from correlations in multivariate data, such as gene co-expression networks [1,2], fMRI scans of brain activity [3] and ecological co-occurrence networks [4]. Correlation networks are dense, and finding an appropriate threshold to make the networks sparser and separate signal from noise is a critical first step in revealing meaningful structure. Researchers have developed a suite of methods to this end, including weighted gene co-expression network analysis (WGCNA) [5] that imposes soft thresholding to obtain a heuristically motivated scale-free network, hard thresholding [4,6] and consensus approaches [2], while other researchers suggest that thresholding should be avoided [7]. Methods not tied to a particular application include network coarse-graining and filtering techniques [8,9,10], and regularisation methods such as the graphical lasso [11] and neighbourhood selection [12], which seek sparse estimates of the precision matrix. While these methods have particular advantages, they ignore whether the resulting pruned networks represent robust and informative structures. Since the overall objective when representing data as networks is often downstream analysis of network structure, existing thresholding approaches can limit our ability to reveal valuable patterns in the data.
Network communities -groups of densely connected nodes -are one of the most relevant network structures. They play a crucial role in the function of disparate complex systems, from metabolic networks [13] to ecological communities [14]. When considering the community structure, selecting a threshold in a correlation network presents the researcher with a model selection problem with the usual pitfalls of overfitting and underfitting. When increasing the threshold of which links to include, the network becomes sparser and more Figure 1: Thresholding correlation networks. The researcher must solve a model selection problem when thresholding a correlation network, here exemplified by gene co-expression data from the plant Arabidopsis thaliana. As the threshold increases, more communities appear, potentially leading to overfitting (left). Including too many links (low threshold) can lead to underfitting (middle). Our module-based cross-validation method guides the researcher to the most parsimonious model (right). communities appear (Fig. 1). If only strong links remain, the community structure can be rich and potentially informative about the modular structure of the underlying data. However, removing too many links leads to overfitting since spurious communities appear in the overly sparse network [15]. Including too many weak and noisy links, on the other hand, leads to underfitting, resulting in a dense network with few or no communities (see Fig. 1). The researcher must then find the best balance between underfitting and overfitting using standard procedures such as cross-validation. In the present work we propose a cross-validation method that focuses on communities, where the model complexity corresponds to the modular structure, in this way integrating thresholding and network community structure into a single step.
We use the modular compression as quantified by the community-detection objective function known as the map equation [16] combined with cross-validation to find the threshold that gives the best balance between over-and underfitting network communities. We illustrate the proposed module-based cross-validation method using synthetic data and gene co-expression data of the plant Arabidopsis thaliana. We also use the code length savings measured by the map equation to assess the number of samples needed to discern structure in experimental data. We show that the code length savings plateau with a step-like relation to the number of samples. This behaviour is essential to the practitioner since the reward for sampling effort varies strongly with the number of samples. We also compare our approach to the widely used WGCNA method and see that this method underfits the model to the data, possibly failing to identify relevant network structure.

Results
A correlation network is constructed from data having the same structure across scientific disciplines. Denoting the data X ∈ R p×n , we have n nodes or features that vary across p experiments, samples or time instances. We construct a correlation network G(Σ) from the empirical correlation matrixΣ whose elementsΣ i,j = |ρ(X i , X j )| are given by the absolute value of the correlation ρ between measurements at node i and j. For a threshold τ , the correlation matrix elements are given bŷ

Module-based cross-validation
To find the best balance between over-and underfitting in a correlation network, we propose a module-based cross-validation procedure. We split the data into a training and a test set X train ∈ R p ×n and X test ∈ R (p−p )×n , and construct the corresponding networks G(Σ τ,train ) and G(Σ τ,test ) using a specific threshold τ . To assess the support for a modular structure in the data, we use the map equation framework [16,17], which exploits the minimum description length principle: the modular structure with the shortest description best explains the data. The map equation measures the per-step average code length L required to encode a random walk on a network with a given partition M . The greedy optimisation algorithm Infomap [18] seeks the partition that minimises the code length and reveals the most modular structure of the network with respect to the random-walk process on the network. Denoting the code length for a given partition M of the training network G(Σ τ,train ) with threshold τ as L τ,train (M ), Infomap seeks to solve the optimisation problem where M τ,train is the optimal partition of the training network with threshold τ . This partition is a model of the modular structure in the training data, and the map equation framework can quantify how well this model fits the test data.
If the modular structure in the training network is present in the test network, M τ,train will also compress the modular description of the test network, decreasing the code length when applied to the test network. We quantify this compression by evaluating the relative code length savings, which we denote l and define as for a particular value of the threshold τ . L test (1) is the code length of the one-level uncompressed partition with all nodes in the same module. If l τ > 0, the training modules are present also in the test network. The optimal threshold choice τ * maximises the relative code length savings. Equivalently, the optimal threshold minimises the code length for the test network when using the optimal partition of the training network, The partition M τ * ,train for the optimal τ * best generalises to the independent test data set, balancing over-and underfitting. This module-based cross-validation allows us to select the optimal threshold and identify reliable modules.

Synthetic data
To illustrate the method using synthetic data we sample from a multivariate normal distribution with a planted partition. The covariance matrix elements are where we use M (n i ) to denote the module of node n i . The covariance matrix is block diagonal with the blocks corresponding to the planted partition with within-module covariance c ∈ (0, 1], ensuring that the matrix is positive definite. The empirical correlation matrixΣ can then be calculated from samples X ∼ N (0, Σ). In the following we use Spearman's rank correlation, but other correlation measures can be used as well.  The ability to separate signal from noise varies with the parameters in the sampled data. We therefore establish a baseline case with 8 modules with 30 nodes in each module, 100 samples and covariance c = 0.3, and explore how the distribution of empirical correlations changes when we vary these parameters. With increasing number of samples the planted correlations are more easily distinguished from noise-level correlations (Fig. 2). The number of modules has an opposite effect, with a clear peak of planted correlations with few modules. This can be explained considering that the fraction of within-module links and total links decreases as the inverse of the number of modules, meaning that the within-module links cover a larger part of the covariance matrix with fewer modules, thus increasing the signalto-noise ratio. The number of nodes per module has little effect on the distribution of empirical correlations, because adding more nodes to a module increases the support for the corresponding correlation in the data. And if the support is already sufficient, adding more nodes will not change the distribution. Having only small modules can, however, change the distribution, and increases the sensitivity to stochastic variations. Hence, there is no simple relation between the number of samples and the number of nodes, but more nodes does not necessarily require more samples to recover a planted partition. Obviously, the planted within-module covariance c affects the distribution, with a higher value leading to the planted correlations being farther away from the noise-level. This result shows the importance of including varying conditions in experiments.
Among the four variables expected to affect the signal-to-noise ratio, the practitioner can often control the number of samples, but there are often economic and other constraints that limit the availability of samples. Based on this, and given the analysis above showing the importance of the number of samples, we focus on how our proposed method performs when varying the number of samples, and how this affects the possibility to recover a planted partition. Figure 3 shows the code length savings l for our baseline case with varying number of samples. With 90 samples, l peaks for a specific value of the threshold, corresponding to the solution τ * of the optimisation problem in Eq. 4. With fewer samples, however, the code length savings never exceed zero, irrespective of the threshold. The intervals with zero code length savings correspond to a dense network with no modular structure. The optimisation problem must therefore be stated as where m τ denotes the number of modules. The added constraints only allow solutions with a modular structure present in both the training and test networks. Given the data with 50 and 70 samples in Fig. 3, we do not find a solution: a proper threshold without overor underfitting is not feasible with this method. These results show the need for sufficient sampling to reliably detect structure in the underlying data. More importantly, the results also suggest that the proposed method can be used to estimate a sufficient number of samples.
To further investigate the role of the number of samples, we calculate the code length savings for the optimal threshold τ * when varying the number of samples for different values of the within-module covariance c. The code length savings increase rapidly when there are enough samples to obtain a signal of the modules in the train network also in the test network (Fig. 4). When this increase happens and how sharp the increase is depends on the within-module covariance, with a covariance c = 0.1 requiring many samples. The code length savings flatten out as the number of samples is increased further, indicating that there is less reward, in terms of finding modular structure, after a certain point. This means that the method we propose can be used to estimate an optimal sampling effort, defined as the number of samples where the code length savings start to plateau. Figure 5 illustrates this effect further: For a large enough number of samples we recover the planted partition fairly well, with adjusted mutual information (AMI) larger than 0.8. Further increasing the number if samples only gives small gains in AMI. These results resemble those of Decelle et al. [19] that identify a phase transition in the detection of partitions planted using the stochastic block model when going from a random graph to separated modules, which is their way of increasing the signal-to-noise ratio. Our results are also based on modular structure and show that the detection of planted partitions undergo a process resembling a phase transition when increasing the number of samples.

Gene co-expression data
As an example using real data, we choose gene co-expression data from the plant Arabidopsis thaliana (see Methods for details). The objective in this application is to see how genes are co-expressed when the plant is subject to stress such as heat or cold. When we apply our proposed method to these data we see that the code length savings in the training network for different thresholds τ increase as the threshold increases (Fig. 6). This increase is to be expected since the network becomes sparser as the threshold increases, resulting in a larger modular compression of the network. Contrary to this, the code length savings l τ in the test network increase up to a certain point τ * ≈ 0.75 and then start to decrease. The optimal partition of the training network maximises the modular compression of the test network at τ * . This threshold gives the most parsimonious model of the data when considering the modular structure. The networks in Fig. 1 are based on these data, with the right-most one corresponding to τ * = 0.75.  Figure 6: Gene co-expression data. Applying the proposed method to gene co-expression data from the plant Arabidopsis thaliana shows that the code length savings in the test network has a maximum at threshold τ * ≈ 0.75 (left). This is the threshold value giving the most parsimonious model of the data. The right figure shows that the code length savings at the optimal threshold do not increase linearly with the number of samples, indicating in accordance with the results based on synthetic data that the reward for sampling effort decreases as the number of samples increases. Figure 6 also shows the maximum code length savings as the number of samples increases, using these same data but randomly undersampling the number of samples. The dependence on samples is non-linear, resembling the behaviour for synthetic data.
Gene co-expression data are often analysed using weighted gene co-expression network analysis (WGCNA) [5]. WGCNA employs soft thresholding by introducing a parameter β such that the network link weights are |ρ| β for a correlation measure ρ. We use WGCNA to assess whether this method over-or underfits the modular structure studied in this work, and find that WGCNA underfits the model to the data. For computational reasons, we select the 4000 genes with the largest variance and find that β = 8 is a good choice (see Methods for details). Figure 7 shows an alluvial diagram with network modules obtained using Infomap for networks corresponding to different β values and for the network corresponding to the threshold τ * . The threshold τ * that best balances between over-and underfitting gives a more modular network than WGCNA, meaning that WGCNA underfits in this sense.

Conclusions
We have suggested a way of solving the common and general problem of thresholding correlation networks by integrating modular structure in the model selection process. This approach avoids ad-hoc methods to sparsify networks that risk over-or underfitting modular structure. Our method quantifies the partition quality by means of the code length in the map equation, which we use to measure how a learned community structure fits unseen data. Experiments on synthetic and real data show promising results. The applicability of the method depends on the signal-to-noise ratio, which we showed depends on a number of factors, including within-module covariance, module size and the number of samples. There is no obvious relation between the number of nodes and the number of samples, but we showed that more nodes does not necessarily require more samples.
The reward for sampling effort varies strongly as the number of samples increases, and the ability to recover a partition undergoes what resembles a phase transition. Given the ubiquity of the problem and the substantial economic cost of sampling, our results indicating that increased sampling effort can have little reward could potentially lead to reduced spending and better allocation of resources across scientific fields working with correlations in multivariate data.
The comparison with WGCNA showed that there is a risk of underfitting when using WGCNA, which can lead to meaningful network structure being overlooked. Figure 7: WGCNA underfits modules. The allluvial diagram shows Infomap modules and reveals that the networks derived using WGCNA with different β values give less modular structure than the network corresponding to threshold τ * = 0.75. This shows that WGCNA underfits the modular structure in the data.

Optimization method
The objective function in the optimization problem (4) that we solve is the relative code length savings in the test network, L test (M τ,train ), that depend on the threshold τ . This objective function is stochastic for two reasons. The first and main reason is the two-fold splitting that we use here in the cross-validation. The second reason is the inherent randomness in Infomap and the non-convex properties of the map equation [20]. To overcome this we do a sample average approximation and average over 10 runs, such that L test (M τ,train ) = 1 10 10 i=1 L testi (M τ,traini ). In this work we use a simple approach to find the optimal τ by testing a set of thresholds, such as τ ∈ [0.1, 0.2, . . . , 0.9], and choosing the one that maximizes the code length savings and thus minimizes L test (M τ,train ). Regarding the two-fold splitting, we believe that this is the current best option to build both a training and a test network from a data set. There are however recent developments in regularization methods for link prediction [15] that can potentially provide better options.

Gene expression data and analysis
Gene expression data for the co-expression network were retrieved from the Sequence Read Archive (SRA). We searched SRA to identify all available RNA-Seq samples relating to cold stress in Arabidopsis thaliana ecotype Columbia-0, but limited ourselves to leaf tissue and excluded any genetic variants. The selected data sets include both control and treated samples and were retrieved in April 2021. A full list of included samples can be found in the supplementary material. The data were quantified using salmon version 1.2.1 [21] against the Araport 11 release of the Arabidopsis thaliana genome. Pre-processing and normalization were done in R using the variance stabilizing transform available in DESeq2 [22]. If the expression value is constant across samples for a gene, due to the sub sampling of experiments, then the correlation coefficient for this gene and the corresponding network links are undefined, and the gene is disconnected from the co-expression network.
In WGCNA the parameter β is chosen such that the co-expression network is approximately scale-free. This is done through linear regression of the distribution of weighted node degree on a log-log scale, where the smallest β such that the coefficient of determination R 2 > 0.8 is chosen. Figure 8 shows the dependence of R 2 on β, indicating that β = 8 is a good choice for our data.