Inter-chromosomal connectivity varies differently between healthy and cancer GCNs
Previous works have used GCNs to reveal topological and functional differences between cancer and healthy genetic profiles (Espinal-Enriquez et al. 2017; Alcalá-Corona et al. 2017; de Anda-Jáuregui et al. 2019). The work by Teschendorff and collaborators (West et al. 2012; Teschendorff and Severini 2010), for instance, has shown how information-theoretical entropies based on local probability measures may be able to discern some functional features. Their results show that cancer-related networks are likely characterized by higher information-theoretical entropies, possibly due to graph-structure configurational contributions. One remarkable topological difference found between healthy and breast cancer GCNs lies in their inter/intra-chromosomal connectivity (Espinal-Enriquez et al. 2017; García-Cortés et al. 2020; de Anda-Jáuregui et al. 2019; de Anda-Jáuregui et al. 2019). Edges in breast cancer GCNs seldom connect nodes on two different chromosomes, while edges in the healthy GCN mostly link nodes on different chromosomes.
As previously mentioned, inter-chromosomal connectivity can be measured using the attribute assortativity coefficient (AAC), where a value of 1 is found in networks never connecting nodes in two different chromosomes, -1 in networks when only nodes in different chromosomes are connected, and 0 when the mixing is random (see Methods for a formal definition of assortativity). Single-layer breast cancer GCN has a very low inter-chromosomal connectivity (AAC close to 1) while single-layer healthy GCN has an expected or random inter-chromosomal connectivity (AAC close to 0).
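As a minimal sketch of how the AAC behaves at its two extremes, the following pure-Python function computes Newman's assortativity coefficient for a categorical node attribute. The function name, the toy gene labels and the chromosome assignments are illustrative assumptions and not the pipeline used in this work (see Methods for the formal definition).

```python
from collections import defaultdict

def attribute_assortativity(edges, attr):
    """Newman's assortativity coefficient for a categorical node attribute.

    edges: iterable of (u, v) pairs of an undirected graph.
    attr:  dict mapping each node to its category (here, a chromosome label).
    Note: undefined (division by zero) if all edge ends share one category.
    """
    e = defaultdict(float)  # e[(i, j)]: number of edge ends joining category i to j
    m = 0
    for u, v in edges:
        # Count each undirected edge in both directions so that e is symmetric.
        e[(attr[u], attr[v])] += 1
        e[(attr[v], attr[u])] += 1
        m += 2
    a = defaultdict(float)  # a[i]: fraction of edge ends attached to category i
    trace = 0.0             # fraction of edge ends joining nodes of the same category
    for (i, j), count in e.items():
        a[i] += count / m
        if i == j:
            trace += count / m
    expected = sum(x * x for x in a.values())  # same-category fraction under random mixing
    return (trace - expected) / (1 - expected)

# Toy data (hypothetical genes and chromosomes).
chrom = {"g1": "chr1", "g2": "chr1", "g3": "chr2", "g4": "chr2"}
intra_only = [("g1", "g2"), ("g3", "g4")]  # edges never cross chromosomes
inter_only = [("g1", "g3"), ("g2", "g4")]  # edges always cross chromosomes

aac_intra = attribute_assortativity(intra_only, chrom)  # 1.0
aac_inter = attribute_assortativity(inter_only, chrom)  # -1.0
```

For real networks, the `attribute_assortativity_coefficient` function of the networkx library computes the same quantity on a graph object.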
As we can see in Fig. 2, the GCNs belonging to Basal (breast cancer) and healthy tissues exhibit different inter-chromosomal connectivity patterns across co-expression layers. We see a steady decrease in inter-chromosomal connectivity in the GCN of Basal tissue (as shown by a steady increase in AAC), while healthy GCN presents an almost constant close to random (0-valued AAC) inter-chromosomal connectivity throughout the layers.
On average, AAC equals 0.004 and 0.05 in healthy and Basal GCNs, respectively, i.e., the Basal GCN presents a much weaker inter-chromosomal connectivity than healthy tissue. To discern the extent of their AAC deviation from random mixing, we calculated the mean square error (MSE) from 0 of AAC values. The healthy GCN has a very small MSE (3×10^−5), which places it close to a random network in terms of inter-chromosomal connectivity. Despite its almost constant form in healthy tissue, the AAC slightly increases when approaching the top co-expression layers, meaning that inter-chromosomal connectivity decreases by a small amount there: the MSE of the healthy GCN in the top 6 layers (3×10^−4) is an order of magnitude greater than the overall MSE.
In breast cancer on the other hand, the inter-chromosomal connectivity drops abruptly when reaching the last few layers, going from an overall MSE of 0.018 to 0.3 in the top 6 layers. One of the advantages of modeling the co-expression program as a multi-layer GCN is that we can think of a co-expression program as a discrete process, where each layer is a co-expression ‘picture’ representing a co-expression range ending in the top co-expressed layer.
Under this framework, we can state that layers in both networks lose inter-chromosomal edges as they approach the final layer, but at neither the same rate nor the same magnitude. Indeed, the deviation of inter-chromosomal connectivity from a random state in the healthy GCN is subtle but clear in the last few layers, while that of the Basal GCN has a fluctuation layer (around layer 95) ending in the disconnection of chromosomes in the network (right panel of Fig. 2). The bottom layer, close to the median of the co-expression distribution, reinforces the assumption of noise in most of the co-expression program, as both GCNs have an AAC close to 0 here, in agreement with the fact that only a handful of all possible gene pairs actually interact in a biological regulation task (Consortium 2004).
If co-expression is thought of as a discrete process with the top co-expressed layer as the endpoint, the underlying structural evolution mechanism that operates across layers in the GCN becomes a central question for understanding the genetic co-expression program. A first step in this direction is to evaluate the structural or topological evolution of the layers in both GCNs: first, to see whether this mechanism is related to the evolution of inter/intra-chromosomal connectivity across the co-expression layers; second, to evaluate this structural process with the help of known network-generating mechanisms.
Structural evolution of healthy GCN absent in breast cancer
A first observation from the structural evolution of the breast cancer GCN is its almost constant structural nature. All structural measures stagnate at a fixed value for most layers, only to explode abruptly in the last few (Fig. 3a). Its degree assortativity coefficient (DAC) is 0 for most layers, implying random mixing in terms of the degree of nodes. In the case of the healthy GCN, the layer with a DAC closest to 0 is the noisy bottom layer, with a value of −3×10^−3.
To quantify the extent of the variation from noisy co-expression in both GCNs, we computed, for each structural measure, the absolute sum of the differences between the layers’ values and the value of layer 0 (Fig. 3b). Structurally, the healthy GCN deviates considerably more from the network structure of noise co-expression (layer 0 in the GCN) than breast cancer does. This contrasts with its intra-chromosomal connectivity, where the AAC of the healthy GCN stays close to random mixing across most layers (Fig. 2), with a slight increase in the top co-expression layers. The fact that the structural sensitivity to co-expression changes seen in the healthy GCN is lost in breast cancer could suggest two independent/different structural development mechanisms underlying their co-expression programs.
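The deviation-from-noise score described above can be sketched as follows. This is only one reading of the text: it assumes the score is the sum of absolute differences to the layer-0 value (the phrase could also mean the absolute value of the summed differences), and the function name and toy values are hypothetical.

```python
def deviation_from_layer0(values):
    """Sum of absolute differences between each layer's value and the layer-0 value.

    values: list indexed by layer, with values[0] being the noise (bottom) layer.
    """
    baseline = values[0]
    return sum(abs(v - baseline) for v in values)

# Toy structural measure across four layers (made-up numbers, not results).
stagnant = [0.0, 0.0, 0.0, 0.5]  # flat, then an abrupt jump in the last layer
steady = [0.0, 0.1, 0.2, 0.3]    # gradual drift away from the noise layer

score_stagnant = deviation_from_layer0(stagnant)
score_steady = deviation_from_layer0(steady)
```

Under this reading, a measure that drifts steadily away from its layer-0 value accumulates a larger score than one that stays flat and only jumps at the end, even when the final jump is larger.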
The above discussion indicates that the difference in structural sensitivity is best depicted by the relation between chromosomal and degree assortativity, AAC and DAC respectively. This relation is close to linear in both GCNs (Fig. 3c) and is well fitted by a simple linear regression model as defined in James et al. (2013). This claim is supported by the small mean square errors of both models (1×10^−4 for the healthy GCN and 1.5×10^−5 for the Basal GCN). Both models have an intercept close to 0, congruent with the fact that a random network has random mixing in terms of both chromosomes and degrees.
Moreover, the slope varies greatly between both models: in the healthy case, the slope is 27.5 times greater than in breast cancer (11.3 and 0.4 respectively), suggesting a different structural sensitivity (both in terms of network structure and chromosome rearrangement) and an almost linear relation in each GCN between these two. In other words, the structural changes happening across the layers of the healthy GCN are steady but do not seem to be related to inter-chromosomal connectivity, as they are in breast cancer. The mechanism underlying the healthy co-expression program begins, from an early layer, a structural differentiation from noise that is non-existent in breast cancer, which follows an almost exclusively noisy path. This structural difference could be a lead into an organizational principle that may be necessary for a healthy co-expression program.
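The linear fits relating AAC and DAC in this section can be illustrated with a minimal ordinary least squares sketch. Which of the two coefficients acts as the predictor is not stated here, so taking AAC as x and DAC as y is an assumption; the function name and the toy values are likewise hypothetical and do not reproduce the reported slopes.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, mse), where mse is the mean squared residual.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    mse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / n
    return slope, intercept, mse

# Toy values only: an exactly linear relation with slope 2 and intercept 0.
aac = [0.0, 0.01, 0.02, 0.03]
dac = [0.0, 0.02, 0.04, 0.06]
slope, intercept, mse = fit_line(aac, dac)
```

An intercept near 0 together with a small MSE is what the text uses as evidence that the relation is close to linear and passes through the random-mixing point.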
Preferential attachment-like evolution in healthy GCN is lost in breast cancer
A popular model for the generation of networks with heavy-tailed degree distributions is preferential attachment (Barabási and Albert 1999; Price 1965). Roughly, preferential attachment happens when nodes with higher degree (many neighbors) gain edges faster than less connected nodes. This phenomenon of network creation appears in several different networks, such as the internet, scientific citation networks, and movie actor networks, among others (Price 1976; Lehmann et al. 2004; Newman 2001; 2003; Jeong et al. 2003). The model was conceived around the concept of a scale-free network, whose degree distribution function follows a power law, Pr(x)=x^α.
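The growth rule described above can be sketched with a small simulation in the spirit of the Barabási and Albert model. It is an illustration of the general mechanism only, not part of the analysis in this work; the function name, the parameters `n_nodes`, `m` and `seed`, and the choice of a small clique as the starting network are all assumptions.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, m, seed=0):
    """Grow a network where each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a clique on m + 1 nodes so that every node has degree m.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `ends` once per edge end, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in sorted(targets):
            edges.append((new, t))
            ends.extend((new, t))
    return edges

edges = preferential_attachment(n_nodes=200, m=2)
degree = Counter(node for edge in edges for node in edge)
```

Because well-connected nodes occupy more slots in the list of edge ends, they are drawn more often and keep accumulating links, which is what produces a heavy-tailed degree distribution.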
Commonly, a power-law distribution is characterized on a log-log scale as a linear shape (Jeong et al. 2003). This linear shape is found in the degree distribution of the top layer of the healthy GCN (Fig. 4b). Interestingly, the evolution of the degree distributions of the healthy layers starts at least 50 layers downstream and gradually takes the linear shape seen at the top co-expression layer (Fig. 4a). This supports the hypothesis of the inclusion of an organizational factor in the structure of the healthy GCN across layers. As we will see, such a factor seems to be based on a preferential attachment-like mechanism.
It is worth mentioning that the ubiquity and even the existence of power laws in networks (and therefore their relation with preferential attachment) remains an active (and somewhat polemic) topic of research (Broido and Clauset 2019). In our case, only the tail of the degree distribution of the top layer follows a power law after statistical testing (Alstott et al. 2014). The rest of the distribution is better fitted by a log-normal law than by a power law. Although the true statistical nature of the heavy-tailed distribution is out of the scope of this paper, here we make use of the concept of an emergent, as opposed to statistical, scale-free power-law network (Holme 2019).
In breast cancer, the degree distribution of layers shows a different evolution from the healthy GCN: most layers have a degree distribution similar to that of layer 0, representing a random network (Fig. 4a). There is no gradual development towards the top co-expression layer, only an abrupt change in the degree distribution around layer 95. The top layer has a degree distribution that is also heavy tailed but without the power law form of the one of the healthy GCN (Fig. 4b). With this in consideration, the network generator model of preferential attachment does not appear suited for the breast cancer co-expression program.
Top layer hubs and core nodes are structurally relevant in the healthy GCN but not in cancer
Taken as a general concept, preferential attachment relates to the fact that an exclusive group of nodes gains connections faster than the rest. Here, we do not refer to the mathematical model of preferential attachment (as coined in Barabási and Albert (1999)), but to the general concept of network generation. To see whether the preferential attachment concept can hold in the healthy GCN, we consider two different sets of highly connected nodes: hub genes and core genes. The idea is to take a highly connected group of nodes in the top layer and look at their degree evolution from noise to the highest co-expression layer.
Hub nodes, or simply hubs, are nodes that have significantly more neighbors in the network than the average node (Barabási et al. 2016). The degree threshold for hubs is 69.6 in the healthy GCN and 53.8 in the Basal GCN, which defines sets of 533 and 534 nodes, respectively. Their degree distributions are displayed in the rightmost panel of Fig. 6a, in red. To compare their distributions to the rest of the network, we constructed a control set of nodes, generated by taking at random (100 times) a set of nodes with the same size as the hubs (see Methods). The degree distribution of this control group is displayed in blue across layers. In the top layer, the mean degree in the healthy GCN is 17.5 for the control group and 111.5 for the hubs. In the case of the Basal GCN these numbers are 18.9 and 63.9, respectively.
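The construction of the control group can be sketched as repeated random sampling of same-size node sets. The exact procedure is given in Methods; in this illustration the sampling pool is assumed to be all nodes of the layer, and the function name, the seed and the toy degrees are hypothetical. The toy hubs are picked by hand and do not use the degree threshold of the paper.

```python
import random

def control_mean_degrees(degrees, group_size, n_draws=100, seed=0):
    """Mean degree of n_draws random node sets, each of size group_size.

    degrees: dict mapping node -> degree in a given layer.
    """
    rng = random.Random(seed)
    nodes = sorted(degrees)  # fixed order so that the seed gives reproducible draws
    means = []
    for _ in range(n_draws):
        sample = rng.sample(nodes, group_size)  # sampling without replacement
        means.append(sum(degrees[n] for n in sample) / group_size)
    return means

# Toy layer: 5 highly connected nodes and 45 weakly connected ones (made-up degrees).
degrees = {f"g{i}": (20 if i < 5 else 2) for i in range(50)}
hubs = [n for n, d in degrees.items() if d == 20]
hub_mean = sum(degrees[n] for n in hubs) / len(hubs)
control_means = control_mean_degrees(degrees, group_size=len(hubs))
control_mean = sum(control_means) / len(control_means)
```

Comparing `hub_mean` with the distribution of `control_means` in each layer mirrors the red versus blue comparison of Fig. 6a.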
For each tissue, we test the hypothesis H0 that the degrees of top-layer hubs are comparable to those of a same-size random group, across 6 layers (layers 0, 50, 70, 90, 95, and 100). In the case of the healthy tissue, H0 is rejected, not surprisingly, in layer 100, where the hubs were taken from. In layers 90 and 95 H0 is also rejected with probability p=0.95, in layer 70 with p=0.93, and in layer 50 with p=0.87. Finally, H0 is not rejected in layer 0, showing that the degrees of top-layer hubs cannot be told apart from those of random nodes at the median of the co-expression distribution. In the Basal-like GCN, H0 is only rejected in layer 100, providing evidence of the random-like structure of the network from layer 95 downward.
Interestingly, the degree evolution of hubs seems to separate from the control group early on in the healthy co-expression program (top panel Fig. 6a). The higher the layer, the larger the difference in mean degree between the two groups. Furthermore, the mean degree of the control group takes values that are close to the mean degree of layer 0 (16.5) across all the layers. The top layer hubs, on the other hand, are more connected on average than the control group from at least 50 co-expression layers downstream. These results support a preferential attachment-like mechanism in the healthy GCN: the group of hubs in the top-layer network gradually gains more connections across layers, and increasingly faster, than the average node.
In the case of breast cancer, this mechanism is completely lost (bottom panel Fig. 6a). Indeed, the degree distribution of the top layer hubs closely resembles that of the control group for most layers. There is no gradual gain of connections by the hubs at an early layer, suggesting that top layer hubs are average nodes in the rest of the layers. This is congruent with the overall degree distribution of the Basal GCN shown previously, where most layers seem to have a distribution similar to the one found in the noise co-expression layer. Top layer hubs in breast cancer only start to differentiate from the control group around layer 96 (supplementary Figure S2), suggesting two structural mechanisms across its layers: one random and constant from layer 0 to 95, and a different one starting at layer 96, which is strongly dictated by the intra-chromosomal connectivity.
In terms of spreading information on a network, hubs may not be the ideal set of nodes to consider; the nodes in the main core of the network may be better suited (Kitsak et al. 2010). In the context of a gene co-expression network, spreader genes could be an important part of the formation of functional pathways within a cell and are thus important to consider in this study. The core of a network is a concept from graph theory: the k-core of a network is the largest subnetwork H=(V_k,E_k) such that every node in V_k has degree at least k within H. The core or main core of a network is a k-core such that the (k+1)-core does not exist (Seidman 1983).
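The definition of the main core can be made concrete with the usual peeling procedure: repeatedly remove a node of minimum remaining degree, and record the largest minimum degree seen so far as that node's core number. The sketch below is a plain, unoptimized version written for illustration; the function names and the toy graph are assumptions, not the implementation used in this work.

```python
def core_numbers(adj):
    """Core number of every node, by repeatedly peeling a minimum-degree node.

    adj: dict mapping node -> set of neighbours (undirected, no self-loops).
    """
    degree = {n: len(nb) for n, nb in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        n = min(remaining, key=degree.get)  # a node of minimum remaining degree
        k = max(k, degree[n])               # core numbers never decrease while peeling
        core[n] = k
        remaining.remove(n)
        for nb in adj[n]:
            if nb in remaining:
                degree[nb] -= 1
    return core

def main_core(adj):
    """Return (k, nodes) for the k-core with the largest k that is non-empty."""
    core = core_numbers(adj)
    k_max = max(core.values())
    return k_max, {n for n, c in core.items() if c == k_max}

# Toy graph: a 4-clique (a, b, c, d) with a short tail a - e - f.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("a", "e"), ("e", "f")]
adj = {}
for u, v in toy_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

k, members = main_core(adj)  # k == 3, members == {"a", "b", "c", "d"}
```

Note that node a has the highest degree in the toy graph, yet the tail nodes e and f are excluded from the main core: membership depends on connections within the core, not on raw degree. For large networks, the `core_number` and `k_core` functions of networkx are more efficient alternatives.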
The core is a strongly inter-connected subnetwork: each node in the top layer core of the healthy GCN shares at least 58 connections with other nodes in the core (it is a 58-core), and 47 in the case of the Basal GCN. The total number of core nodes is 141 and 42 in the healthy and Basal GCN, respectively (Fig. 5). As can be observed in the top panel of Fig. 6b, the degree evolution of the core in the healthy GCN is similar to that of hubs: its nodes gradually gain connections faster than the rest, starting at an early co-expression layer. The mean degree of the healthy top layer core is 142.9, while the mean degree of the control group is 17.5, a greater difference than in the case of hubs. In this case, H0 is rejected with p=0.95 for layers 100, 95 and 90, with p=0.94 for layer 70 and p=0.89 for layer 50. It is not rejected in layer 0.
In a similar fashion, the degree evolution of the top layer core in the breast cancer GCN is stagnant and close to the control group in 5 of the 6 layers, similarly to the top layer hubs (bottom panel of Fig. 6b). The differentiation from the control group happens even later than for top layer hubs, in layer 98 (supplementary Figure S3). As in the case of hubs, H0 is only rejected in layer 100. In general, the nodes in the top layer core have a degree evolution parallel to that of top layer hubs, with a slightly stronger differentiation from control nodes in the case of the healthy GCN and, conversely, a closer resemblance to the control group in the breast cancer GCN.
Organizational principles in the healthy co-expression program are lost in cancer
A preferential attachment-like mechanism in the healthy gene co-expression program could imply the existence of an underlying organizational principle in the co-expression program, where a set of genes could act as an interface of regulatory functional pathways. So far, we have seen that two groups of nodes convey such a mechanism: the top layer core nodes and the hubs. A high conservation rate of these nodes throughout the layers of the healthy GCN could indicate that such an organizational principle exists and is dependent upon them.
Results show that a similar principle is not likely to be found in breast cancer given the structural properties of its co-expression program. The random-like nature of the structure of the Basal GCN in most layers suggests that an order of any kind is not to be expected, with the exception of the top five layers. We also highlight the existence of another non-random mechanism in breast cancer (first seen in a single-layer network in Espinal-Enriquez et al. (2017)) that appears to be strongly related to a molecular phenomenon leading to changes in inter-chromosomal connectivity. This other mechanism is apparent only within the top five co-expression layers, where structural values deviate from random, and where highly connected nodes in the top layer gain connections faster than the average nodes (Figs. 4 and 6). Further experimental and theoretical research is needed to identify the causes of this change.
The cumulative conservation rate (ccr, see Methods) of nodes belonging to the top layer hubs and core nodes of the healthy GCN across layers is shown in Fig. 7. In sum, ccr measures how well a class of top layer nodes is conserved in the same class across the rest of the layers. Figure 7 displays the ccr decay for healthy and Basal GCNs using a radial plot where the co-expression layers are displayed on the circumference, starting at layer 100 at the left and finishing at the same position with layer 0. The radius of the plot represents the value of ccr at any given layer. Values at the center of the circle are equal to 0 while values on the circumference are equal to 1. The ccr values of both hubs and nodes in the core are represented, as well as the values of a control core group (see ccr in Methods).
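The formal definition of ccr is given in Methods and is not reproduced in this section. As a hedged illustration only, the sketch below assumes that the ccr at a given layer is the fraction of reference-layer class members (hubs or core nodes) that belong to the same class in every layer from the reference down to that layer. The function name, this reading of the definition and the toy sets are all assumptions.

```python
def cumulative_conservation(class_by_layer, reference):
    """Assumed ccr: share of reference-layer members present in the class of
    every layer from the reference layer down to the current one.

    class_by_layer: dict mapping layer index -> set of nodes in the class.
    reference: layer index used as the starting point (e.g. the top layer).
    """
    conserved = set(class_by_layer[reference])
    size = len(conserved)
    ccr = {}
    # Walk from the reference layer towards layer 0.
    for layer in sorted((l for l in class_by_layer if l <= reference), reverse=True):
        conserved &= class_by_layer[layer]  # keep only nodes conserved so far
        ccr[layer] = len(conserved) / size
    return ccr

# Toy cores for four layers (hypothetical node names).
cores = {
    3: {"a", "b", "c", "d"},
    2: {"a", "b", "c", "x"},
    1: {"a", "b", "y"},
    0: {"a", "c", "z"},
}
ccr = cumulative_conservation(cores, reference=3)
# ccr == {3: 1.0, 2: 0.75, 1: 0.5, 0: 0.25}
```

Under this assumption the ccr equals 1 at the reference layer and can only decrease towards layer 0, which matches the decaying curves described for Fig. 7.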
Hubs have been shown to be relevant in protein-protein networks in terms of both lethality and evolutionary adaptability of a cell (Jeong et al. 2001; Helsen et al. 2019). However, in both GCNs, hubs are not especially well conserved across co-expression layers (Fig. 7). Even if they are preferentially connected, as we have seen, a large number of neighbors alone appears not to be a sufficient property for conservation at different levels of co-expression. In the case of the Basal GCN, the conservation is almost null, with non-null values found only above layer 96. The little conservation there is happens in the top 4 layers, congruent with the double structural mechanism divided at around layer 95 in the Basal GCN. The healthy GCN shows a greater conservation of the top layer hubs, but its decay is fast: ccr_99=0.84 and ccr_81=0.03; only 3% of the hubs in the top layer are conserved in the preceding 20 layers.
The control core group of Fig. 7 is obtained, for each layer, by producing one thousand random networks with the same inter/intra-chromosomal connectivity as said layer (as in a stochastic block model, see Methods), and then obtaining its main core. The ccr values of the control group are then computed for the complete set of one thousand iterations, and averaged. This way, we have a random ccr for reference to compare the significance of the conservation rate of core nodes in the real GCNs.
A striking result is the high ccr of nodes in the core of the top layer of the healthy GCN (Fig. 7). There is a good conservation rate from the top layer core nodes down to the last layer (ccr_1=0.32); moreover, the decay of conservation is slow, with ccr_81=0.93. In the control core GCN, these values are 0.001 and 0.06, respectively. The cumulative conservation of core nodes, starting at the top layer and across the subsequent co-expression-layer cores, is in general much greater than what would be expected at random. Indeed, of the 141 genes comprising the top-layer core of the healthy GCN, 90% are conserved in the cores of the subsequent 30 layers. This holds even though the pattern of interconnectedness differs between any two cores, as edges do not repeat across layers.
The different conservation rates in core nodes between cancer and healthy tissues hold true when considering layer 94 as the reference to compute ccr (supplementary Figure S4). In this case, the conservation ratio of the cores in the healthy network is lower than when layer 100 was the reference, indicating that the set of cores at the top 6 layers maximizes conservation and that cores in subsequent layers add a substantial number of irrelevant genes. However, the difference in ccr with respect to the breast cancer GCN is still significant. The conservation ratio ten layers below the reference (ccr_84) is 0.6 and 0.42 for healthy and cancer GCNs, respectively. Ten further layers downstream these values are 0.46 and 0.12; finally, ccr_64 values are 0.37 and 0.05 in healthy and Basal GCNs, respectively. More generally, avoiding the top 5 layers shows a close relation between the conservation ratio of breast cancer and that of the control group, congruent with the random-like behavior found in layer 94 and below in breast cancer.
In both cases, the ccr values of the healthy tissue are unlikely to be obtained from a random control group. The average mean square error (MSE) between the control ccr values and those of 1000 random networks is equal to 8×10^−5, which is roughly four orders of magnitude smaller than 0.5, the MSE between control and real ccr values. In the case of breast cancer, the MSE between the random group and the real values is 0; specifically, ccr_99=1 in both cases, meaning that the vertiginous increase in intra-chromosomal connections from layer 99 to layer 100 is responsible for this MSE.
The results above are an effect of taking layer 100 as the reference for ccr: when we instead take layer 94 as the reference, the MSE between control and breast cancer ccr is 6×10^−3, suggesting a close relation between the structure of random networks and that of the breast cancer GCN for layers below 94.
Nodes in the core of networks have been shown to be good information spreaders (Al-garadi et al. 2017; Liu et al. 2015). One possible reason for this conservation could be that genes in this set take part in several different functional pathways, each one with a complementary set of other genes. However, further research is needed to identify the functional relevance of this gene set.
The strong difference between the structural properties of healthy and Basal breast cancer GCNs in terms of conservation leads us to argue that the organizational principle that determines the co-expression landscape in a healthy phenotype is abolished during the oncogenic process. While it is already known that co-expression between physically close genes is in general higher than between distant gene pairs (Hurst et al. 2004; Wang et al. 2011; Hurst 2017), in this work we have observed that distant, inter-chromosomal interactions between genes are central in the process that shapes the network landscape in the healthy phenotype, even if strong interactions between close genes also appear.
The loss of inter-chromosomal co-expression at the top layers in the cancer GCN apparently leaves the intra-chromosomal interactions as the main (if not the only) mechanism for shaping the co-expression landscape. As intra-chromosomal interactions are the strongest and most abundant in the top layers, the lower layers have fewer of them and are connected similarly to a random network. The separation of the layers into inter- and intra-chromosomal layers reflects the two different structural mechanisms found in the breast cancer GCN in this investigation.
Functional implications of the loss of inter-chromosomal co-expression in basal breast cancer
The biological implications of this phenomenon may have an impact on the way we understand gene expression and co-expression in cancer. In the healthy GCN, we observe a far-from-random degree distribution from the lowest layers. Conversely, almost all the structure in the Basal breast cancer GCN is similar to a randomly generated network, except for the top layers, in which we observe a high intra-chromosomal connectivity. This effect in the cancer GCN might be attributed to a loss of those mechanisms that orchestrate the gene co-expression landscape.
After a disruption in the organizational principles involved in maintaining the co-expression landscape, the possible remaining solution for a damaged cancer cell could be an elevated co-expression between close genes. The regulatory elements of gene transcription may behave in an operon-like fashion: the RNA polymerase complex may transcribe large sections of a given chromosome (perhaps due to an incorrectly open 3D structure of DNA). These sections could be flanked similarly in several breast cancer patients (probably due to methylation marks, CTCF binding sites, or stop signals of transcription). The resulting sections will have similar expression patterns, thus allowing high co-expression values in a large portion of the cancer genome.
Gene somatic copy number alterations (SCNAs) are among the most documented genomic modifications in breast cancer. SCNAs are also known to affect gene expression over clusters of genes (Zhou et al. 2003; Inaki et al. 2014; Menghi et al. 2016). A specific breast cancer molecular subtype, HER2+, is actually defined by an amplification (located at Chr17q12) and protein over-expression of ERBB2 and neighboring genes (Slamon et al. 1987; Sørlie et al. 2001). Another breast cancer-associated amplification is located at region 17q25.3. This alteration occurs in BRCA1-mutated triple negative breast cancer, HER2+, or Luminal B breast cancer subtypes (Toffoli et al. 2014).
Recently (García-Cortés et al. 2020), it was reported that in the GCNs of breast cancer subtypes, highly dense intra-cytoband hotspots coincide with commonly amplified regions. Intra-cytoband clusters of genes such as the aforementioned region Chr17q12 or Chr8q24.3 form highly dense connected components. The presence of co-expression clusters in cancer GCNs allows us to suggest that, in addition to CNAs, several other mechanisms may influence the co-expression landscape in breast cancer: non-coding RNAs, epigenetic modifications, the 3D structure of DNA, alterations of CTCF binding sites, etc. To establish the validity of the previous lines, it is necessary to analyze and integrate data from other -omic technologies (high-throughput genetic data) in a global framework. This will allow us to define the structures responsible for the correct maintenance of the genome-wide co-expression landscape.