Analyzing breast cancer comorbidities: a network approach using community detection algorithms

Permana, Angga A.; Yaputra, Reynard M.

doi:10.1007/s41109-024-00644-0

Research
Open access
Published: 08 July 2024

Applied Network Science volume 9, Article number: 31 (2024)

Abstract

Breast cancer is a prominent global health concern: data from the International Agency for Research on Cancer (IARC) show that it was the cancer type with the most new cases in 2020 and among the top five cancer types by number of deaths. To help medical personnel identify breast cancer comorbidities and, ultimately, lower the number of deaths associated with them, this research aims to discover breast cancer comorbidity communities, evaluate them by modularity and similarity, suggest the best semantic similarity measurement and threshold value, and validate the discovered comorbidities against data from several research papers. The Wang algorithm, with a threshold value of 0.5, is chosen to build the network. Leiden, Louvain, RBER Pots, RB Pots, and Walktrap are the best five community detection algorithms. Similarity measurements with the best three fitness functions (edges inside, scaled density, and size) suggest that Leiden–Louvain and RBER Pots–RB Pots are two pairs of algorithms with similar results. Further similarity measurements with the V-measure heatmap suggest that the results of Louvain–Leiden (0.99), RB Pots–Leiden (0.97), and RB Pots–RBER Pots (0.96) are similar. Comorbidity is then evaluated using the best five community detection algorithms and four centrality algorithms. As a result, fourteen diseases are agreed upon by all five community detection algorithms, five diseases by four algorithms, two diseases by three algorithms, one disease by two algorithms, and ten diseases by a single algorithm.

Introduction

Many theories address comorbidity and its definition. In 1996, the work of Akker et al. was the first to distinguish between comorbidity and multimorbidity. Comorbidity is defined as any medical condition linked to a given index (primary) disease. Multimorbidity, in contrast, is defined as the co-occurrence of multiple medical disorders or conditions without reference to any given disease (Swain et al. 2020). Discovering comorbidities and their impacts may help prioritize treatment actions for the primary disease and help explain the biological basis of the disease (Russell et al. 2023). Hence, information about comorbidity is crucial: it captures an important relationship between diseases and may lead to essential discoveries in the future.

Nowadays, cancer is widely regarded as a deadly disease. This perception is not unfounded: according to data from the International Agency for Research on Cancer (IARC), nearly 10 million deaths were caused by cancer in 2020, making it a leading cause of death worldwide. The World Health Organization describes cancer as the abnormal or uncontrolled proliferation of cells in part(s) of the body (Cancer 2022). Of those nearly 10 million deaths, breast cancer is among the top five cancer types by number of deaths, with 0.69 million deaths, preceded by lung cancer with 1.80 million deaths, colon and rectum cancer with 0.92 million deaths, liver cancer with 0.83 million deaths, and stomach cancer with 0.77 million deaths. According to other IARC data on the cancer types with the most new cases in 2020, breast cancer leads with 2.26 million cases, followed by lung cancer with 2.21 million cases, colon and rectum cancer with 1.93 million cases, prostate cancer with 1.41 million cases, and skin cancer with 1.20 million cases (Ferlay et al. 2018).

These data show that breast cancer stands out as a prominent global health concern. Any information that helps prevent further impact will benefit healthcare, and, as stated above, the comorbidities of breast cancer are one such piece of information. Several institutions and studies have reported breast cancer comorbidities. According to Danish National Patient Register (DNPR) data in the study by Ewertz et al. (2018), there are twelve breast cancer comorbidities, all of which significantly affect the all-cause mortality rate. In another study, conducted by Sharma et al. (2015) on 134 patients, twenty-eight breast cancer comorbidities were assessed by self-report and verified by medical record review and the Charlson Comorbidity Index. One conclusion of that study is that comorbidities decrease breast cancer survivors’ quality of life (Fu et al. 2015). Research by Sharma et al. (2015) also reported nine breast cancer comorbidities. Returning to the Danish National Patient Register, a different study by Ording et al. (2013) lists nineteen diseases as breast cancer comorbidities. Lastly, Hong et al. (2015) compile breast cancer comorbidities from many research papers.

Network analysis, or graph analytics, is a robust method designed to infer insights from data in graph form. One of its subfields is community detection, which groups the nodes of a network or graph into clusters (communities) under the assumption that connections are sparse. In some applications, these clusters reveal meaningful information that is hard or even impossible to obtain from a regular data representation (Li et al. 2022). One such application is inferring comorbidity, that is, the connection or relationship between diseases. Several algorithms can be used for community detection, such as Leiden, Louvain, RB Pots, Belief, Girvan-Newman, and many more.

Comorbidity discovery with community detection algorithms is discussed in several research papers. Rustamaji et al. (2022), in “A Network Analysis to Identify Lung Cancer Comorbid Disease,” discuss algorithms for disease ontology (DO) similarity, the best algorithms for revealing lung cancer comorbidities, the calculation of fitness scores, and several significant lung cancer comorbidities. Mu et al. (2020) discuss hepatocellular carcinoma comorbidity risk in “Patterns of Comorbidity in Hepatocellular Carcinoma: A Network Perspective”. Levinson et al. (2018) use a network analysis approach to conceptualize eating disorder and social anxiety disorder comorbidity in “Social anxiety and eating disorder comorbidity and underlying vulnerabilities: Using network analysis to conceptualize comorbidity”. Khan et al. (2018) reveal the comorbidities and conditions in a comorbidity network that are most attributable to diabetes. Kaiser et al. (2021) analyze the network structure of depression and anxiety symptoms. Other applications of community detection algorithms for comorbidity information discovery are discussed by Baggio et al. (2018), Das (2020, 2021), Vilela et al. (2022), and Chatterjee and Sanjeev (2022).

This research presents an in-silico effort to discover breast cancer’s comorbidities with community detection algorithms. It also suggests the best gene ontology-based semantic similarity measurement (out of the Wang, Jiang, Lin, Resnik, and Rel algorithms) and the optimal threshold value for deriving the network’s adjacency matrix from the similarity matrix. Moreover, it compares all the community detection algorithms and selects the best five for the breast cancer comorbidity network based on the network’s visual, density, and modularity measurements. The results of the best five algorithms are then assessed with fitness functions applied to each algorithm’s result, and with a heatmap of the pairwise similarity between the algorithms’ results. Lastly, the resulting comorbidity list is visualized with the disease ontology’s directed acyclic graph (DAG) and validated against breast cancer comorbidity data from various sources.

Research methodology

Research steps

This research is divided into six steps: data gathering, data preprocessing, network formation, community detection, community evaluation (similarity and modularity), and comorbidity discovery. The flowchart of this research’s steps is shown in Fig. 1.

The details of each step are as follows.

1. Data gathering. The PubTator Central (PTC) website (https://www.ncbi.nlm.nih.gov/research/pubtator/) is scraped. PTC was chosen because it provides comprehensive biological information on cell lines, chemicals, diseases, genes, mutations, and species. PTC mining systems annotate 29 million abstracts in PubMed and 3 million full-text articles on PMC in various formats (Wei et al. 2019). An example of the articles scraped from PubTator Central is shown in Table 1. The results can be retrieved through an application programming interface (API) provided by the website’s developers. From all the manuscripts related to “comorbid breast cancer”, the disease annotations are collected, since this research focuses on breast cancer comorbidities.
2. Data preprocessing. After the manuscripts are retrieved and all disease-related words are extracted, the Disease Ontology (DO, https://disease-ontology.org/do) is used to look up each disease’s Disease Ontology Identifier (DOID). DO is a platform that curates the semantic correlation between diseases and integrates biological concepts, such as their genes and environmental drivers and attributes. DO has become a reliable and robust platform for acquiring disease ontology data and is trusted by many stakeholders (Baron et al. 2023). DO is chosen because it plays a key role in linking data with real-world data. DO represents its data as a directed acyclic graph (DAG), which makes such complex data easier to maintain and track (Rustamaji et al. 2022). An example of the disease information held in a Disease Ontology entry is shown in Table 2. Alongside the DOID lookup, words unrelated to disease, synonyms of diseases, synonyms of breast cancer, and diseases that cannot be found in the ontology are eliminated. After all the DOIDs are acquired, a list of unique DOIDs is generated.
3. Network formation. A similarity matrix between the unique DOIDs is computed with the Disease Ontology Semantic and Enrichment Analysis (DOSE) library, downloaded from Bioconductor, within an R program (Yu et al. 2015). The doSim function of the DOSE library includes five gene ontology-based semantic similarity algorithms: Wang (Wang et al. 2007), Jiang (Jiang and Conrath 1997), Lin (Lin 1998), Resnik (Resnik 1995), and Rel (Schlicker et al. 2006). Wang’s algorithm uses the position of the disease in the directed acyclic graph (DAG) and its parent nodes to convert the term’s semantics into a numeric value (Wang et al. 2007). Jiang’s algorithm utilizes words, concepts, and the lexical taxonomy structure, combined with corpus statistical information (Jiang and Conrath 1997). Lin’s algorithm derives a universal, information-theoretic definition of similarity from a set of assumptions (Lin 1998). Resnik’s algorithm is based on information content (Resnik 1995). The Rel algorithm compares gene ontology (GO) terms and scores the similarity of gene products (Schlicker et al. 2006). Each value in the similarity matrix lies in the range [0, 1], where zero indicates that two diseases have no similarity at all and one indicates that two diseases are identical. Null values are then cleaned, and the similarity matrix is transformed into an adjacency matrix using a threshold: every value in the similarity matrix that is below the threshold is set to zero, which disconnects the two nodes. In this research, Wang’s algorithm is chosen because it scores pairs of similar diseases accurately and yields a good distribution of edge weights. A threshold value of 0.5 is then chosen to balance false positives against the modularity of the network (Rustamaji et al. 2022).
With the adjacency matrix, the network is constructed in a Python program using the NetworkX library. Following the analysis, the connected subgraph with the most nodes, that is, the giant component of the graph, is taken for further calculation. The Jupyter Widget (ipycytoscape) and Cytoscape applications are used to visualize the network and the giant component.
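The thresholding and giant-component steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses plain Python instead of NetworkX to stay dependency-free, the function names `threshold_adjacency` and `giant_component` are hypothetical, the 5x5 similarity values are invented, and dropping self-loops is an assumption that the text does not state.

```python
from collections import deque


def threshold_adjacency(similarity, threshold=0.5):
    """Set every similarity below the threshold to zero.

    The diagonal is also zeroed (no self-loops); this is an assumption.
    """
    n = len(similarity)
    return [
        [
            similarity[i][j] if i != j and similarity[i][j] >= threshold else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def giant_component(adjacency):
    """Return the sorted node indices of the largest connected component."""
    n = len(adjacency)
    seen = set()
    best = []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in range(n):
                if adjacency[node][neighbour] > 0 and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return sorted(best)


# Hypothetical similarity matrix for five diseases (values are invented).
similarity = [
    [1.0, 0.8, 0.6, 0.1, 0.2],
    [0.8, 1.0, 0.4, 0.3, 0.1],
    [0.6, 0.4, 1.0, 0.2, 0.0],
    [0.1, 0.3, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.0, 0.7, 1.0],
]
adjacency = threshold_adjacency(similarity, threshold=0.5)
largest = giant_component(adjacency)
print(largest)
```

In this toy matrix only three pairs reach the 0.5 threshold, so the graph splits into a three-node and a two-node component, and the three-node one is kept.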
4. Communities detection. The Python library cdlib is used in this research to compute the communities produced by the community detection algorithms. Python is chosen because its functionality and libraries are sufficient for data science research and analysis (Shruthi et al. 2020). Seventeen community detection algorithms are used in this research: Belief (Zhang and Moore 2014), Constant Potts Model (CPM) (Traag et al. 2011), Chinese Whispers (Ustalov et al. 2019), Diffusion Entropy Reducer (DER) (Kozdoba and Mannor 2015), Eigenvector (Newman 2006), Genetic Algorithm (GA) (Pizzuti 2008), Greedy Modularity (Clauset et al. 2004), Label Propagation (Cordasco and Gargano 2011), Leiden (Traag et al. 2018), Louvain (Blondel et al. 2008), Markov Clustering (Enright et al. 2002), RB Pots (Leicht and Newman 2008), RBER Pots (Reichardt and Bornholdt 2006), Significance Community (Traag et al. 2013), Spinglass (Reichardt and Bornholdt 2006), Surprise Community (Traag et al. 2015), and Walktrap (Pons and Latapy 2005). The description of all the community detection algorithms used in this research can be found in Additional file 1.
5. Community evaluation (similarity and modularity). Modularity is measured on the results of the seventeen community detection algorithms. Five modularity measurements are used to evaluate the communities: Newman–Girvan modularity (Newman and Girvan 2004), Erdos–Renyi modularity (Erdos and Rényi 2011), link modularity (Nicosia et al. 2009), density modularity (Zhang et al. 2010), and Z modularity (Miyauchi and Kawase 2016). All five are benefit-type measures: a higher modularity value indicates a better and denser community or cluster, and vice versa. The formula of each modularity is shown in Formulas 1–5.
$$Q(S)_{\text{Newman-Girvan}} = \frac{1}{m}\sum_{c \in S}\left( m_{c} - \frac{(2m_{c} + l_{c})^{2}}{4m}\right)$$
(1)
$$Q(S)_{\text{Erdos-Renyi}} = \frac{1}{m}\sum_{c \in S}\left( m_{c} - \frac{m\,n_{c}(n_{c}-1)}{n(n-1)}\right)$$
(2)
$$Q(S)_{\text{Link}} = \frac{1}{2m}\sum_{i,j \in V}\left[ A_{ij} - \frac{k_{i}k_{j}}{2m}\right] \delta (c_{i}, c_{j})$$
(3)
$$Q(S)_{\text{Density}} = \sum_{C \in S}\frac{1}{n_{C}}\left( \sum_{i \in C} 2 \lambda k^{in}_{iC} - \sum_{i \in C} 2 \lambda k^{out}_{iC}\right)$$
(4)
$$Z(C)_{Z} = \frac{\sum_{c \in C}\frac{m_{c}}{m}-\sum_{c \in C}\left(\frac{D_{c}}{2m}\right)^{2}}{\sqrt{\sum_{c \in C}\left(\frac{D_{c}}{2m}\right)^{2}\left(1-\sum_{c \in C}\left(\frac{D_{c}}{2m}\right)^{2}\right)}}$$
(5)
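As a worked instance of Formula 1, the sketch below evaluates the Newman–Girvan modularity on a toy graph. The formulas above do not define their symbols, so the reading used here is an assumption taken from the usual convention for this formula: m is the total number of edges, the first per-community term is the number of edges inside the community, and the second is the number of edges leaving it. The function name and the graph are hypothetical.

```python
def newman_girvan_modularity(edges, communities):
    """Evaluate Formula 1 for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    communities: list of node collections forming a partition.
    """
    m = len(edges)  # total number of edges in the graph
    total = 0.0
    for community in communities:
        members = set(community)
        # Edges with both endpoints inside the community.
        inside = sum(1 for u, v in edges if u in members and v in members)
        # Edges with exactly one endpoint inside the community.
        leaving = sum(1 for u, v in edges if (u in members) != (v in members))
        total += inside - (2 * inside + leaving) ** 2 / (4 * m)
    return total / m


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = newman_girvan_modularity(edges, [[0, 1, 2], [3, 4, 5]])
q_merged = newman_girvan_modularity(edges, [[0, 1, 2, 3, 4, 5]])
print(q_split, q_merged)
```

Here m = 7, and each triangle has 3 internal edges and 1 leaving edge, so the split partition scores (2 × (3 − 49/28)) / 7 = 5/14 ≈ 0.357, while placing every node in one community scores 0. The higher value for the split partition matches the benefit-type reading of modularity.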
After all the modularity values are obtained, normalization is applied to bring every modularity measurement into the same range of [0, 1], and principal component analysis (PCA) is performed to reduce the dimensionality and convert the values into a single comparable number. Each algorithm’s result is then ranked based on its modularity and eigenvector characteristics. The results are also compared by their similarity to one another. For this purpose, fitness scores are calculated for each algorithm’s result. The fitness scores applied are average internal degree, edges inside, expansion, internal edge density (Radicchi et al. 2004), conductance, normalized cut (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), max out-degree fraction (max ODF), average out-degree fraction (average ODF), flake out-degree fraction (flake ODF) (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019). The description of all the fitness functions used in this research can be found in Additional file 2. Normalization and PCA are again applied, for the same reason as in the modularity measurement. The three fitness functions with the strongest correlation with the outcomes are then chosen, applied to the community detection algorithms, and visualized to show the similarity between the best five community detection algorithms. Lastly, to assess cluster similarity, the V-measure is applied, and the similarity scores are depicted in a heatmap. The V-measure is a cluster similarity measurement that is independent of the absolute values of the labels: its value does not change under a permutation of the class or cluster labels. The formula for the V-measure is shown in Formula 6 (Rosenberg and Hirschberg 2007).
$$\begin{aligned} v &= \frac{(1 + \beta) \times \text{homogeneity} \times \text{completeness}}{\beta \times \text{homogeneity} + \text{completeness}} \\ \text{homogeneity} &= 1 - \frac{H(Y_{true} \mid Y_{pred})}{H(Y_{true})} \\ \text{completeness} &= 1 - \frac{H(Y_{pred} \mid Y_{true})}{H(Y_{pred})} \end{aligned}$$
(6)
In Formula 6, beta (β) is the weight coefficient that adjusts the relative impact of homogeneity and completeness, H(Y) denotes the entropy of a labeling Y, and H(A | B) denotes the conditional entropy of A given B. By default, the beta value is 1.
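Formula 6 can be computed directly from label counts, as in the following minimal sketch. It is an illustration rather than the implementation used in this research: the function names are hypothetical, the labelings are invented, and treating a zero-entropy labeling as perfectly homogeneous or complete is a common convention assumed here. When two algorithms’ partitions are compared, one of them simply takes the role of Y_true.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(Y) of a labeling, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) of two equal-length labelings."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (count / n) * log(count / given_counts[g])
        for (_, g), count in joint.items()
    )


def v_measure(labels_true, labels_pred, beta=1.0):
    """V-measure as in Formula 6; beta weights homogeneity vs completeness."""
    h_true = entropy(labels_true)
    h_pred = entropy(labels_pred)
    # Convention (assumed): a zero-entropy labeling gives a score of 1.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(labels_true, labels_pred) / h_true
    )
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - conditional_entropy(labels_pred, labels_true) / h_pred
    )
    denominator = beta * homogeneity + completeness
    if denominator == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denominator


# Hypothetical partitions of six nodes.
partition_a = [0, 0, 1, 1, 2, 2]
partition_b = [1, 1, 2, 2, 0, 0]  # same grouping, labels permuted
partition_c = [0, 0, 0, 0, 0, 0]  # everything in one cluster

v_same = v_measure(partition_a, partition_b)
v_one_cluster = v_measure(partition_a, partition_c)
print(v_same, v_one_cluster)
```

The permuted labeling scores 1, which illustrates the label-independence described above, while collapsing all nodes into one cluster gives zero homogeneity and therefore a V-measure of 0.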
6. Comorbidities discovery. Following the evaluation of the communities produced by each community detection algorithm, the comorbidities are determined from the consensus of the centrality algorithms within each cluster of each algorithm’s result. The NetworkX library provides the functions to evaluate the centrality of each cluster, and the cdlib library provides the function to write out the communities formed by each algorithm. The common centrality algorithms in network analysis studies are degree, closeness, betweenness, and eigenvector centrality (Permana et al. 2023). These centrality algorithms are also chosen because they show successful results in other research (Collins and Houghten 2020). Degree centrality uses the nodes’ degrees and normalizes the value. Betweenness centrality counts centrality based on the shortest paths between every pair of nodes (Freeman 1977). Closeness centrality is likewise based on shortest paths but uses a different formula, which is not guaranteed to hold if the edge weights are decimal values (Freeman 1978). Eigenvector centrality derives a node’s centrality from the centralities of the nodes linked to it (Wei 1952). Hence, all four are used. A cluster’s central disease(s) is defined as the node or nodes with the highest centrality among all the nodes in the cluster. Within a community, a centrality algorithm may suggest more than one central disease, and the consensus may agree on several diseases as comorbidities. In that case, all of those diseases are considered central and are counted as comorbidities. Hence, n communities do not necessarily yield exactly n comorbidities. After all the comorbidities are discovered, the breast cancer comorbidities reported in several studies are listed and traced to their ancestors in the ontology, and the comorbidity list discovered in this research is validated against them.
The goal is to compare this research’s results with real-world data.
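The consensus step in the comorbidity discovery can be sketched as below. Several simplifications are assumptions made only for illustration: the sketch uses plain Python instead of NetworkX, it implements only two of the four centrality measures (degree and unweighted closeness), and it reads "consensus" as "the node(s) nominated as top-scoring by the largest number of measures", which is one possible interpretation. The node names and edges are invented.

```python
from collections import Counter, deque
from math import isclose


def build_adjacency(edges):
    """Undirected adjacency as a dict mapping node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_subgraph(adj, members):
    """Restrict the graph to one community."""
    members = set(members)
    return {node: adj[node] & members for node in members}


def degree_centrality(adj):
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}


def closeness_centrality(adj):
    """Unweighted closeness: reachable nodes / sum of hop distances."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neigh in adj[node]:
                if neigh not in dist:
                    dist[neigh] = dist[node] + 1
                    queue.append(neigh)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def central_nodes(adj, measures):
    """Nodes nominated as top-scoring by the most measures (ties kept)."""
    votes = Counter()
    for measure in measures:
        scores = measure(adj)
        top = max(scores.values())
        votes.update(n for n, s in scores.items() if isclose(s, top))
    most = max(votes.values())
    return sorted(n for n, v in votes.items() if v == most)


# Hypothetical network: a star around "a", a path e-f-g, and a bridge d-e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f"), ("f", "g"), ("d", "e")]
communities = [["a", "b", "c", "d"], ["e", "f", "g"]]

adj = build_adjacency(edges)
measures = [degree_centrality, closeness_centrality]
candidates = []
for community in communities:
    candidates.extend(central_nodes(induced_subgraph(adj, community), measures))
candidates.sort()
print(candidates)
```

Because ties are kept, a community can contribute more than one central node, which is consistent with the remark above that n communities do not necessarily yield n comorbidities.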

Table 1 Example of articles scraped in pubtator central
