Analyzing breast cancer comorbidities: a network approach using community detection algorithms
Applied Network Science volume 9, Article number: 31 (2024)
Abstract
Breast cancer is a prominent global health concern: data from the International Agency for Research on Cancer (IARC) show that it was the cancer type with the most new cases in 2020 and among the top five cancer types by number of deaths. To help medical personnel identify breast cancer comorbidities and, ultimately, reduce deaths related to those comorbidities, this research aims to discover breast cancer comorbidity communities, evaluate them by modularity and similarity, suggest the best semantic similarity measurement and threshold value, and validate the resulting comorbidities against data from several research papers. The Wang algorithm, with a threshold value of 0.5, is chosen to build the network. Leiden, Louvain, RBER Pots, RB Pots, and Walktrap are the best five community detection algorithms. Similarity measurements with the best three fitness functions (edges inside, scaled density, and size) suggest that Leiden and Louvain, and RBER Pots and RB Pots, are two pairs of algorithms with similar results. A further similarity measurement with a V-measure heatmap suggests that the Louvain–Leiden (0.99), RB Pots–Leiden (0.97), and RB Pots–RBER Pots (0.96) results are similar. Comorbidity is then evaluated using the best five community detection algorithms and four centrality algorithms. As a result, fourteen diseases are agreed upon by all five community detection algorithms, five diseases by four algorithms, two diseases by three algorithms, one disease by two algorithms, and ten diseases are identified by a single algorithm.
Introduction
Comorbidity has been defined in several ways. In 1996, Akker et al. were the first to propose a definition that distinguishes comorbidity from multimorbidity. Comorbidity is defined as any medical condition linked to a given index (primary) disease. Multimorbidity, on the other hand, is defined as the co-occurrence of multiple medical disorders or conditions without reference to any index disease (Swain et al. 2020). Discovering comorbidities and their impacts may help prioritize treatment actions that address the primary disease and clarify the biological basis of the disease (Russell et al. 2023). Hence, information about comorbidity is crucial, as comorbidity is an important relationship between diseases and may lead to essential discoveries in the future.
Cancer is widely regarded as a deadly disease. This perception is not unfounded: according to data from the International Agency for Research on Cancer (IARC), cancer caused nearly 10 million deaths in 2020, making it a leading cause of death worldwide. The World Health Organization defines cancer as the abnormal, uncontrolled proliferation of cells in part(s) of the body (Cancer 2022). Of these 10 million deaths, breast cancer is among the top five cancer types by number of deaths, with 0.69 million deaths, preceded by lung cancer with 1.80 million deaths, colon and rectum cancer with 0.92 million deaths, liver cancer with 0.83 million deaths, and stomach cancer with 0.77 million deaths. According to IARC data on the cancer types with the most new cases in 2020, breast cancer leads with 2.26 million cases, followed by lung cancer with 2.21 million cases, colon and rectum cancer with 1.93 million cases, prostate cancer with 1.41 million cases, and skin cancer with 1.20 million cases (Ferlay et al. 2018).
These data show that breast cancer stands out as a prominent global health concern. Any information that helps prevent further impact will benefit healthcare. As stated before, the comorbidities of breast cancer are one such important piece of information, and several studies have reported them. According to the Danish National Patient Register (DNPR) data in the Ewertz et al. (2018) study, there are twelve breast cancer comorbidities, all of which significantly affect the all-cause mortality rate. In another study conducted by Sharma et al. (2015), twenty-eight breast cancer comorbidities were found among 134 patients, assessed by self-report and verified by medical record review and the Charlson Comorbidity Index. One conclusion of this study is that comorbidities decrease breast cancer survivors’ quality of life (Fu et al. 2015). Research by Sharma et al. (2015) also lists nine breast cancer comorbidities. Returning to the Danish National Patient Register, a different study conducted by Ording et al. (2013) reports nineteen diseases related to breast cancer comorbidity. Lastly, research from Hong et al. (2015) compiles breast cancer comorbidities from many research papers.
Network analysis, or graph analytics, is a robust method for inferring insights from data in graph form. One subfield of network analysis is community detection, which groups the nodes of a network or graph into clusters (communities) under the assumption that connections between clusters are sparse. In some applications, these clusters reveal meaningful information that is hard or even impossible to obtain from a regular data representation (Li et al. 2022). One application is inferring comorbidity, which is the connection or relationship between diseases. Several algorithms can be used for community detection, such as Leiden, Louvain, RB Pots, Belief, Girvan-Newman, and many more.
Comorbidity discovery with community detection algorithms is discussed in several research papers. Algorithms for disease ontology (DO) similarity, the best algorithms for revealing lung cancer comorbidities, the calculation of fitness scores, and several significant lung cancer comorbidities are discussed by Rustamaji et al. (2022) in the paper titled “A Network Analysis to Identify Lung Cancer Comorbid Disease.” Hepatocellular carcinoma comorbidity risk is discussed by Mu et al. (2020) in the paper titled “Patterns of Comorbidity in Hepatocellular Carcinoma: A Network Perspective”. A network analysis approach is used to conceptualize eating disorder and social anxiety disorder comorbidity in the paper titled “Social anxiety and eating disorder comorbidity and underlying vulnerabilities: Using network analysis to conceptualize comorbidity” (Levinson et al. 2018). The comorbidities and conditions in a comorbidity network that are most attributed to diabetes are revealed by Khan et al. (2018). The network structure of depression and anxiety symptoms is analyzed by Kaiser et al. (2021). Other applications of community detection algorithms for comorbidity discovery are discussed in the papers by Baggio et al. (2018), Das (2020, 2021), Vilela et al. (2022), and Chatterjee and Sanjeev (2022).
This research discusses an in-silico effort to discover breast cancer comorbidities with community detection algorithms. It also suggests the best gene ontology-based semantic similarity measurement (out of the Wang, Jiang, Lin, Resnik, and Rel algorithms) and the optimal threshold value for deriving the network’s adjacency matrix from the similarity matrix. Moreover, it compares all the community detection algorithms to select the best five for the breast cancer comorbidity network, based on the network’s visual, density, and modularity measurements. The results of the best five algorithms are then evaluated with several fitness functions applied to each algorithm’s result, as well as a heatmap of the pairwise similarity between the algorithms’ results. Lastly, the resulting comorbidity list is visualized with the disease ontology’s directed acyclic graph (DAG) and validated against data on breast cancer comorbidities from various sources.
Research methodology
Research steps
This research is divided into six steps: data gathering, data preprocessing, network formation, community detection, community evaluation (similarity and modularity), and comorbidity discovery. The flowchart of this research’s steps is shown in Fig. 1.
The details of each step are as follows.
1. Data gathering. The PubTator Central (PTC) website (https://www.ncbi.nlm.nih.gov/research/pubtator/) is scraped. PTC was chosen because it provides comprehensive biological annotations of cell lines, chemicals, diseases, genes, mutations, and species. PTC’s text-mining systems annotate 29 million abstracts in PubMed and 3 million full-text articles in PMC, available in various formats (Wei et al. 2019). An example of the articles scraped from PubTator Central is shown in Table 1. The results are retrieved through an application programming interface (API) provided by the website developer. From all the manuscripts related to ’comorbid breast cancer’, disease annotations are collected, since this research focuses on breast cancer comorbidities.
2. Data preprocessing. After the manuscripts are retrieved and all disease-related words are extracted, the Disease Ontology (https://disease-ontology.org/do) is used to look up each disease’s Disease Ontology Identifier (DOID). Disease Ontology (DO) is a platform that curates the semantic relationships between diseases and integrates biological concepts, such as their genes, environmental drivers, and attributes. DO has become a reliable and robust platform for acquiring disease ontology data and is trusted by many stakeholders (Baron et al. 2023). DO is chosen because it plays a key role in linking data with real-world data. DO represents its data as a directed acyclic graph (DAG), which makes such complex data easier to maintain and track (Rustamaji et al. 2022). An example of a disease entry in Disease Ontology is shown in Table 2. Alongside the DOID lookup, words unrelated to disease, synonyms of diseases, synonyms of breast cancer, and diseases that cannot be found in the ontology are eliminated. After all the DOIDs are acquired, a list of unique DOIDs is generated.
3. Network formation. A similarity matrix between unique DOIDs is computed with the Disease Ontology Semantic and Enrichment Analysis (DOSE) library, downloaded from Bioconductor, within an R program (Yu et al. 2015). Wang et al. (2007), Jiang and Conrath (1997), Lin (1998), Resnik (1995), and Schlicker et al. (2006) are the five gene ontology-based semantic similarity algorithms included in the doSim function of the DOSE library. Wang’s algorithm uses the position of the disease in the directed acyclic graph (DAG) and its ancestor nodes to convert the term’s semantics into a numeric value (Wang et al. 2007). Jiang’s algorithm combines words, concepts, and lexical taxonomy structure with corpus statistical information (Jiang and Conrath 1997). Lin’s algorithm starts from a set of assumptions and derives a universal definition of similarity (Lin 1998). Resnik’s algorithm works with information content (Resnik 1995). The Rel algorithm compares gene ontology (GO) terms and scores the similarity of gene products (Schlicker et al. 2006). Each value in the similarity matrix lies in the range [0, 1], where zero indicates that two diseases have no similarity at all and one indicates that two diseases are identical (perfectly similar). Null values are then cleaned, and the similarity matrix is transformed into an adjacency matrix by choosing a threshold: each value in the similarity matrix that is below the threshold is set to zero, which disconnects the two nodes. In this research, Wang’s algorithm is chosen due to its accuracy for pairs of similar diseases and its well-distributed edge weights. A threshold value of 0.5 is chosen to balance the false-positive rate and the modularity of the network (Rustamaji et al. 2022).
With the adjacency matrix, the network is constructed in a Python program using the NetworkX library. Following the analysis, the connected subgraph with the most nodes, or the giant component of the graph, is taken for further calculation. The Jupyter Widget (ipycytoscape) and the Cytoscape application are used to visualize the network and the giant component.
4. Communities detection. The Python library cdlib is used in this research to compute the communities produced by the community detection algorithms. Python is chosen because its functionality and libraries are sufficient for data science research and analysis (Shruthi et al. 2020). Belief (Zhang and Moore 2014), Constant Potts Model (CPM) (Traag et al. 2011), Chinese Whispers (Ustalov et al. 2019), Diffusion Entropy Reducer (DER) (Kozdoba and Mannor 2015), Eigenvector (Newman 2006), Genetic Algorithm (GA) (Pizzuti 2008), Greedy Modularity (Clauset et al. 2004), Label Propagation (Cordasco and Gargano 2011), Leiden (Traag et al. 2018), Louvain (Blondel et al. 2008), Markov Clustering (Enright et al. 2002), RB Pots (Leicht and Newman 2008), RBER Pots (Reichardt and Bornholdt 2006), Significance Community (Traag et al. 2013), Spinglass (Reichardt and Bornholdt 2006), Surprise Community (Traag et al. 2015), and Walktrap (Pons and Latapy 2005) are the seventeen community detection algorithms used in this research. The description of all the community detection algorithms used in this research can be found in Additional file 1.
5. Community evaluation (similarity and modularity). Modularity is measured on the results of the seventeen community detection algorithms. Newman–Girvan modularity (Newman and Girvan 2004), Erdos–Renyi modularity (Erdos and Rényi 2011), link modularity (Nicosia et al. 2009), density modularity (Zhang et al. 2010), and Z modularity (Miyauchi and Kawase 2016) are the five modularity measurements used to evaluate the communities. All of them are benefit-type measures, which means that a higher modularity value represents a better and denser community or cluster, and vice versa. The formula of each modularity is shown in Formulas 1, 2, 3, 4, and 5.
$$Q(S)_{Newman-Girvan} = \frac{1}{m}\sum_{c \in S}\left( m_{c} - \frac{(2m_{c} + l_{c})^{2}}{4m}\right) \tag{1}$$
$$Q(S)_{ErdosRenyi} = \frac{1}{m}\sum_{c \in S}\left( m_{c} - \frac{m\,n_{c}(n_{c}-1)}{n(n-1)}\right) \tag{2}$$
$$Q(S)_{Link} = \frac{1}{2m}\sum_{i,j \in V}\left[ A_{ij} - \frac{k_{i}k_{j}}{2m}\right] \delta (c_{i}, c_{j}) \tag{3}$$
$$Q(S)_{Density} = \sum_{c \in S}\frac{1}{n_{c}}\left( \sum_{i \in c} 2 \lambda k^{in}_{ic} - \sum_{i \in c} 2 \lambda k^{out}_{ic}\right) \tag{4}$$
$$Z(S) = \frac{\sum_{c \in S}\frac{m_{c}}{m}-\sum_{c \in S}\left(\frac{D_{c}}{2m}\right)^{2}}{\sqrt{\sum_{c \in S}\left(\frac{D_{c}}{2m}\right)^{2}\left(1-\sum_{c \in S}\left(\frac{D_{c}}{2m}\right)^{2}\right)}} \tag{5}$$
Here S is the partition and c a community in S; m and n are the numbers of edges and nodes in the graph; m_c, l_c, and n_c are the number of edges inside c, the number of edges leaving c, and the number of nodes in c; A is the adjacency matrix, k_i the degree of node i, and δ the Kronecker delta; k^{in}_{ic} and k^{out}_{ic} count the edges from node i to nodes inside and outside c, λ is a weighting parameter, and D_c is the sum of the degrees of the nodes in c. After all the modularity values are obtained, normalization is applied to bring all the modularity measurements into the same range of [0, 1], and principal component analysis (PCA) is performed to reduce the dimensionality and combine the values into a single comparable number. Each algorithm’s result is then ranked based on its modularity score and the characteristics of the eigenvector. Another measurement of the algorithms’ results is the similarity between them. For this purpose, the fitness scores of each algorithm’s result are calculated. The fitness functions applied to the algorithms are average internal degree, edges inside, expansion, internal edge density (Radicchi et al. 2004), conductance, normalized cut (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), max out-degree fraction (max ODF), average out-degree fraction (average ODF), flake out-degree fraction (flake ODF) (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019).
The description of all the fitness functions used in this research can be found in Additional file 2. Normalization and PCA are then applied for the same reason as in the modularity measurement. The three fitness functions most strongly correlated with the outcomes are chosen, applied to the community detection algorithms, and visualized to show the similarity between the best five community detection algorithms. Lastly, to assess cluster similarity, the V-measure is applied, and the similarity scores are depicted in a heatmap. V-measure is a cluster similarity measurement that is independent of the absolute values of the labels: its value does not change when classes are permuted or when a cluster is given a different label. The formula for V-measure is shown in Formula 6 (Rosenberg and Hirschberg 2007).
$$\begin{aligned} v &= \frac{(1 + \beta) \cdot homogeneity \cdot completeness}{\beta \cdot homogeneity + completeness} \\ homogeneity &= 1 - \frac{H(Y_{true} \mid Y_{pred})}{H(Y_{true})} \\ completeness &= 1 - \frac{H(Y_{pred} \mid Y_{true})}{H(Y_{pred})} \end{aligned} \tag{6}$$
In Formula 6, \(\beta\) is the weight that adjusts the relative impact of homogeneity and completeness, \(H(\cdot)\) denotes the entropy of a labeling, and \(H(\cdot \mid \cdot)\) denotes the conditional entropy of one labeling given the other. By default, \(\beta\) is 1.
6. Comorbidities discovery. Following the evaluation of the communities from each community detection algorithm, the comorbidities are determined from the consensus of the centrality algorithms for each cluster in each algorithm’s result. The NetworkX library provides the functions to evaluate the centrality within each cluster, and the cdlib library provides the function to write out the communities formed by each algorithm. Common centrality algorithms in network analysis studies are degree, closeness, betweenness, and eigenvector centrality (Permana et al. 2023). These centrality algorithms are also chosen because they show successful results in other research (Collins and Houghten 2020). Degree centrality uses the degree of each node and normalizes the value. Betweenness centrality scores a node by how often it lies on the shortest paths between pairs of other nodes (Freeman 1977). Closeness centrality is also based on shortest paths, but uses a different formula that is not guaranteed to hold if the edge weights are in decimal form (Freeman 1978). Eigenvector centrality derives the centrality of a node from the centrality of its neighboring nodes (Wei 1952). Hence, all four are used. A cluster’s central disease(s) is defined as the node or nodes that have the highest centrality among all the nodes in the cluster. Within a community, a centrality algorithm may suggest more than one disease as central, and the consensus may agree on several diseases as comorbidities. In that case, all of those diseases are considered central and treated as comorbidities. Hence, n communities do not necessarily yield exactly n comorbidities. After all the comorbidities are discovered, the breast cancer comorbidities reported in several studies are listed and traced to their ancestors in the ontology, and the comorbidity list discovered by this research is validated against them. The goal is to compare the results of this research with real-world data.
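The consensus step described in step 6 can be sketched as follows. This is a minimal illustration and not the authors' code: it implements only degree and closeness centrality on an unweighted graph (the study also uses betweenness and eigenvector centrality through NetworkX), and it reads "consensus" as a majority vote over the top-ranked nodes of each measure, which is an assumption. The function names and the toy community are hypothetical.

```python
from collections import Counter, deque


def degree_centrality(adj):
    """Degree of each node divided by (n - 1)."""
    n = len(adj)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(reachable nodes) / (sum of BFS distances), treating edges as unweighted."""
    scores = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[src] = (len(dist) - 1) / total if total else 0.0
    return scores


def top_nodes(scores, tol=1e-12):
    """All nodes tied for the highest score."""
    best = max(scores.values())
    return {v for v, s in scores.items() if best - s <= tol}


def central_diseases(adj, measures):
    """Nodes that are top-ranked by the largest number of measures (assumed consensus rule)."""
    votes = Counter()
    for measure in measures:
        votes.update(top_nodes(measure(adj)))
    most = max(votes.values())
    return sorted(v for v, count in votes.items() if count == most)


# Hypothetical community: adjacency sets of one detected cluster.
community = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}
measures = [degree_centrality, closeness_centrality]
central = central_diseases(community, measures)
print(central)
```

When several nodes tie, all of them are returned, which matches the statement that a community may contribute more than one comorbidity.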
System specification
This research uses the following computational resources. The hardware is as follows:
1. Lenovo Ideapad Gaming 3 Laptop with the following specifications:
   - Operating System: Windows 10 Home Single Language 64-bit;
   - Processor: Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (12 CPUs), 2.6 GHz;
   - RAM: 16384 MB;
   - Storage: 512 GB;
   - DirectX Version: DirectX 12; and
   - Graphic Card: NVIDIA GeForce GTX 1650.
2. Input Device: Mouse and keyboard.
3. Output Device: Monitor.
Besides the hardware, software is also important to carry out the computation. The list of software used in this research is as follows:
1. Python 3.9.12.
2. Jupyter Notebook.
3. Cytoscape v3.10.1.
4. Microsoft Office Excel 2010.
5. Python libraries: beautifulsoup4 4.11.1, bs4 0.0.1, numpy 1.25.1, pandas 1.5.3, pip 23.0.1, requests 2.28.1, scikit-learn 1.1.3, scipy 1.10.0, NetworkX 2.8.4, matplotlib 3.7.0, ipycytoscape 1.3.3, cdlib 0.2.6, and seaborn 0.12.2.
6. Browser to scrape and browse websites, such as Microsoft Edge.
7. Websites, such as PubTator Central and Disease Ontology.
Result and discussion
Data gathering
Articles indexed in PubMed and related to breast cancer are scraped from the PubTator Central (PTC) website through the provided API. As of April 20, 2023, this process yielded 4860 manuscripts related to the keyword “comorbid breast cancer.” From these 4860 manuscripts, 17607 non-unique disease names were acquired. After duplicated disease names were eliminated, 3762 unique disease names remained for the next step.
Data preprocessing
The DOID of each unique disease name is looked up in the DO database, resulting in 1006 unique DOIDs as of May 20, 2023. The number of data points decreases because words unrelated to disease, synonyms of breast cancer, synonyms of other diseases, and diseases that cannot be found in DO are eliminated. With these 1006 unique DOIDs, the doSim function calculates the similarity matrix by determining the similarity of each pair of DOIDs (Yu et al. 2015). Five algorithms are used in this process, namely Wang, Jiang, Lin, Resnik, and Rel. Each algorithm gives a square matrix of order 1006 \(\times\) 1006, and each element of this similarity matrix is in the range [0, 1]. The most visible characteristic that separates the Wang, Jiang, and Lin algorithms from Resnik and Rel is that the diagonal values of the first three are all 1. Refer to Tables 3, 4, and 5 for examples of the similarity matrix computed with the Wang, Jiang, and Lin algorithms, respectively.
With the Resnik and Rel algorithms, however, the diagonal values vary and are mostly not 1. In the Rel algorithm, the diagonal value is close to but not exactly 1, for example, 0.997 for DOID.10283, 0.999 for DOID.3008, and 0.991 for DOID.2394. Refer to Table 6 for the similarity matrix computed with the Rel algorithm.
In the Resnik algorithm, the diagonal values deviate even further from 1. For example, the self-similarity values of DOID.10283, DOID.3008, and DOID.2394 under the Resnik algorithm are 0.62, 0.86, and 0.49, respectively. Refer to Table 7 for the similarity matrix computed with the Resnik algorithm.
Hence, those two algorithms are not selected for further calculation. After cleaning DOIDs with null values in the similarity matrix, 845 DOIDs are selected, leaving 161 DOIDs with no value, and the similarity matrix order becomes 845 \(\times\) 845.
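To make the Wang measure more concrete, the sketch below computes it on a toy ontology. It is an illustration under stated assumptions, not the DOSE implementation: the hierarchy is a hypothetical simplification (not the real DO structure), every edge is treated as an is_a relation, and the contribution factor w = 0.8 is the value Wang et al. (2007) report for is_a edges; whether doSim uses the same value is not checked here. The function names are hypothetical.

```python
def s_values(term, parents, w=0.8):
    """Semantic contribution of `term` and each of its ancestors to `term`.

    S(term) = 1, and each ancestor receives w times the largest
    contribution among its children that lie on a path to `term`.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent in parents.get(child, ()):
            candidate = w * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def wang_similarity(a, b, parents, w=0.8):
    """Shared semantic contribution divided by the total semantic value of both terms."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Toy hierarchy (child -> parents); a simplification for illustration only.
parents = {
    "breast cancer": ["cancer"],
    "lung cancer": ["cancer"],
    "cancer": ["disease"],
    "depression": ["disease of mental health"],
    "disease of mental health": ["disease"],
    "disease": [],
}

self_sim = wang_similarity("breast cancer", "breast cancer", parents)
sibling_sim = wang_similarity("breast cancer", "lung cancer", parents)
distant_sim = wang_similarity("breast cancer", "depression", parents)
print(round(self_sim, 3), round(sibling_sim, 3), round(distant_sim, 3))
```

In this toy case a term compared with itself scores 1, which is the diagonal property noted above for the Wang matrix, and only the sibling pair would survive a threshold of 0.5.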
Network formation
A graph or network is then created from the adjacency matrix, which is the similarity matrix with a threshold applied to it: specifically, every value below or equal to the threshold is set to 0. Two nodes connected by an edge with a weight above the threshold stay connected by that edge. The threshold lies in the range [0, 1], with threshold = 0 meaning that all the edges of the similarity matrix are kept, and threshold = 1 meaning that no nodes are connected, or, in other words, a null graph. The gene ontology-based semantic similarity algorithm is selected from among the Wang, Jiang, and Lin algorithms, and the threshold value is chosen from the range [0, 1]. First, to select the similarity algorithm, each algorithm’s edge count over different thresholds is plotted in Fig. 2.
As shown in the graph, the Jiang and Lin algorithms are too sensitive to slight changes in the threshold, which means that many similarity values are concentrated in a narrow range; this is not a sign of a good or reliable algorithm. Following this analysis, Wang’s algorithm is chosen, since its transition over the threshold is smooth and its values are much more evenly distributed than those of the other two algorithms. For the threshold, a value of 0.5 is chosen. The main reason for this choice is to balance density and the false-positive rate. With fewer edges, the network will be low in modularity, even though the false-positive rate (the rate at which diseases that should not be breast cancer comorbidities are included in this research’s result) is lower. With more edges, false positives are more likely, even though the network is denser. In similar research, a threshold of 0.5 was also chosen for this calculation (Rustamaji et al. 2022).
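The edge-count curve used to compare similarity algorithms can be sketched as below. This is a minimal illustration on a hypothetical 4 × 4 similarity matrix, not the study's data; it keeps an edge only when the similarity is strictly above the threshold, following the rule stated in this section, and `count_edges` is a hypothetical helper name.

```python
def count_edges(sim, threshold):
    """Count unordered node pairs whose similarity is strictly above the threshold."""
    n = len(sim)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] > threshold
    )


# Hypothetical symmetric similarity matrix with a unit diagonal.
sim = [
    [1.0, 0.9, 0.55, 0.2],
    [0.9, 1.0, 0.45, 0.1],
    [0.55, 0.45, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]

thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
curve = {t: count_edges(sim, t) for t in thresholds}
print(curve)  # {0.0: 6, 0.25: 4, 0.5: 3, 0.75: 1, 1.0: 0}
```

A measure whose values cluster in a narrow band would show a sudden drop between two neighboring thresholds, whereas a well-spread measure declines gradually, as in this toy curve.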
To gain a better understanding of the network formed by the Wang algorithm with 0.5 as the threshold, the network degree distribution needs to be examined. Figure 3 shows the degree distribution for the network formed with the Wang algorithm and threshold = 0.5.
As the graph in Fig. 3 shows, the degree distribution is heavily concentrated at low degrees, so the network contains many low-degree nodes. Including those nodes in further calculation would significantly affect the network density. Thus, the giant component is extracted to focus only on the significant comorbidities. The overall network has 845 nodes and 4000 edges. The giant component of this network is pictured with ipycytoscape-CytoscapeWidget in Figs. 4 and 5, and with Cytoscape in Fig. 6.
The giant component contains 323 nodes (diseases) and 1401 edges, and it is the network used in the subsequent calculations.
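The thresholding and giant-component steps can be sketched without any third-party library as follows. The study performs them with NetworkX; this standard-library version is only an illustration, and the labels, matrix values, and function names are hypothetical.

```python
from collections import deque


def build_adjacency(sim, labels, threshold=0.5):
    """Connect two nodes when their similarity is strictly above the threshold."""
    adj = {label: set() for label in labels}
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                adj[labels[i]].add(labels[j])
                adj[labels[j]].add(labels[i])
    return adj


def giant_component(adj):
    """Return the node set of the largest connected component (BFS per component)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical similarity matrix over five placeholder disease labels.
labels = ["D1", "D2", "D3", "D4", "D5"]
sim = [
    [1.0, 0.8, 0.6, 0.1, 0.1],
    [0.8, 1.0, 0.4, 0.1, 0.1],
    [0.6, 0.4, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]

adj = build_adjacency(sim, labels, threshold=0.5)
giant = giant_component(adj)
giant_edges = sum(len(adj[v] & giant) for v in giant) // 2
print(sorted(giant), giant_edges)
```

Here D4 and D5 form a smaller component and are dropped, in the same way that low-degree fringe nodes are excluded from the study's 323-node giant component.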
Communities detection
Communities were discovered with all seventeen algorithms available in the cdlib library. Figure 7 shows the communities formed by the best five community detection algorithms ranked by modularity; the modularity calculation is explained in the next subsection.
Figure 8 shows the result of the unique community formed with the other algorithms.
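Before the evaluation that follows, it may help to see how a single partition is scored. The sketch below evaluates the community-sum form of Newman–Girvan modularity given in Formula 1, with m the total number of edges, and, per community, the number of internal edges and the number of edges leaving the community. The study computes this through cdlib; the function name and the toy graph here are hypothetical.

```python
def newman_girvan_modularity(edges, communities):
    """Formula 1: (1/m) * sum over communities of (m_c - (2*m_c + l_c)^2 / (4*m))."""
    m = len(edges)
    membership = {
        node: idx for idx, community in enumerate(communities) for node in community
    }
    internal = [0] * len(communities)  # m_c: edges with both ends inside community c
    leaving = [0] * len(communities)   # l_c: edges with exactly one end inside c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            internal[cu] += 1
        else:
            leaving[cu] += 1
            leaving[cv] += 1
    total = sum(
        m_c - (2 * m_c + l_c) ** 2 / (4 * m)
        for m_c, l_c in zip(internal, leaving)
    )
    return total / m


# Toy graph: two triangles joined by a single bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_split = newman_girvan_modularity(edges, [{1, 2, 3}, {4, 5, 6}])
q_single = newman_girvan_modularity(edges, [{1, 2, 3, 4, 5, 6}])
print(round(q_split, 4), round(q_single, 4))
```

Splitting the two triangles scores higher than keeping all nodes in one community, which is the "higher is better" behavior the evaluation relies on.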
Community evaluation (modularity and similarity)
Modularity was measured with several modularity measures, namely Newman–Girvan modularity, Erdos–Renyi modularity, link modularity, density modularity, and Z modularity. Each of these is a benefit-type measure, which means that a higher value indicates a better community structure, while a lower value indicates a worse community structure. After normalization and PCA are applied, PCA produces an eigenvalue of 3.65046443, and the eigenvector components for the five modularity measures are [−0.49600251, −0.49625634, −0.50703837, −0.49746022, 0.05618339], respectively. Since normalization has already been applied (which means that all the values are in the same range) and most of the eigenvector elements are negative, the algorithms are ranked by looking for the lowest value after PCA is applied. Table 8 shows the modularity measurement result for each algorithm before normalization and PCA are applied.
The final scores after normalization and PCA are shown in Table 9.
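The normalize-then-project procedure can be sketched as below. This is a rough illustration under several assumptions, not the authors' pipeline: the scores are made up, columns are min-max scaled and mean-centered (the text does not say whether centering was used), and the first principal component is found by power iteration instead of a library PCA. The sign of a principal component is arbitrary in general; power iteration started from a positive vector yields mostly positive loadings here, so a higher projected score is better, which mirrors the "lower is better" reading the text applies to mostly negative loadings.

```python
import math


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become all zeros."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(x - lo) / span if span else 0.0 for x in col])
    return [list(r) for r in zip(*scaled_cols)]


def first_principal_component(rows, iterations=500):
    """Leading eigenpair of the covariance matrix plus each row's projection."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
        for a in range(k)
    ]
    vec = [1.0 / math.sqrt(k)] * k  # positive start vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))  # assumes the data are not constant
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[a] * sum(cov[a][b] * vec[b] for b in range(k)) for a in range(k)
    )
    scores = [sum(c[j] * vec[j] for j in range(k)) for c in centered]
    return eigenvalue, vec, scores


# Made-up scores: rows are hypothetical algorithms, columns are three benefit-type measures.
names = ["algo_A", "algo_B", "algo_C", "algo_D"]
raw = [
    [0.80, 0.75, 0.60],
    [0.78, 0.74, 0.58],
    [0.50, 0.40, 0.30],
    [0.20, 0.10, 0.05],
]

eigenvalue, loadings, scores = first_principal_component(min_max_normalize(raw))
ranking = [name for _, name in sorted(zip(scores, names), reverse=True)]
print(ranking)
```

If a PCA routine returns the same component with the opposite sign, the ranking direction flips accordingly, which is why the sign of the loadings has to be inspected before sorting.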
As shown in Table 9, Leiden, Louvain, RBER Pots, RB Pots, and Walktrap are the best five community detection algorithms based on their modularity. Following this information, the fitness function for all five of the best community detection algorithms is calculated. Fitness functions used in this research are average internal degree, edges inside, expansion, internal edge density (Radicchi et al. 2004), conductance, normalized cut (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), max out-degree fraction (max ODF), average out-degree fraction (average ODF), flake out-degree fraction (flake ODF) (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019). The fitness score for each algorithm and fitness function is shown in Table 10.
Following the same procedure used to reduce the modularity measurements to a single comparable number, normalization and PCA are applied. Fitness functions are also benefit-type measures. The eigenvalue after applying PCA is 5.27650015, and the eigenvector after applying PCA is [0.44947171, 0.44925215, 0.44043878, 0.44758673, 0.44925215]. Since all the eigenvector values are positive, a higher value is preferable. Table 11 shows the final sorted score for each fitness function.
As shown in Table 11, edges inside, scaled density, and size are the most strongly correlated fitness functions. Edges inside is the number of edges inside a community; size is the number of nodes inside a community; and scaled density is the ratio of the community density to the overall network density. To characterize the algorithms, Figs. 9, 10, and 11 show the similarity across the algorithms for these three fitness functions.
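The three selected fitness functions are simple enough to compute by hand. The sketch below follows the definitions given above for one community of an undirected graph; the study obtains these values from cdlib, and the helper names and toy graph are hypothetical.

```python
def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def community_fitness(edges, community):
    """Edges inside, size, and scaled density of one community."""
    nodes = {v for edge in edges for v in edge}
    inside = sum(1 for u, v in edges if u in community and v in community)
    size = len(community)
    scaled = density(size, inside) / density(len(nodes), len(edges))
    return {"edges_inside": inside, "size": size, "scaled_density": scaled}


# Toy graph: two triangles joined by a single bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
fitness = community_fitness(edges, {1, 2, 3})
print(fitness)
```

A scaled density above 1 means the community is denser than the network as a whole; here the triangle is fully connected while the whole graph has 7 of 15 possible edges.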
As shown in Figs. 9, 10, and 11, all three fitness functions show that the Leiden and Louvain results are similar, that the RBER Pots and RB Pots results are similar, and that the Walktrap algorithm produces a distinct result. Another similarity measurement was carried out with the V-measure over the communities detected by each algorithm. The heatmap of the V-measure values between the algorithms’ results is depicted in Fig. 12.
The V-measure suggests that Leiden and Louvain have a similarity of 0.99, Leiden and RB Pots have a similarity of 0.97, and RBER Pots and RB Pots have a similarity of 0.96. All the pairwise similarities are also above 90%, which means that the communities detected by the best five community detection algorithms are relatively similar to one another.
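The pairwise comparison behind such a heatmap can be sketched directly from Formula 6, with H taken as entropy and conditional entropy. The study lists scikit-learn among its libraries and may have used its implementation; the version below uses only the standard library for illustration, and the label vectors are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two labelings of the same nodes."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def v_measure(y_true, y_pred, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness (Formula 6)."""
    h_true, h_pred = entropy(y_true), entropy(y_pred)
    homogeneity = 1.0 if h_true == 0 else 1 - conditional_entropy(y_true, y_pred) / h_true
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(y_pred, y_true) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


# Made-up community labels for six nodes from three hypothetical algorithms.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = ["x", "x", "z", "z", "y", "y"]  # same partition as labels_a, renamed
labels_c = [0, 0, 0, 1, 1, 1]              # a coarser, different partition

same = v_measure(labels_a, labels_b)
different = v_measure(labels_a, labels_c)
print(same, round(different, 3))
```

Renaming the communities leaves the score at 1, which is the label-permutation invariance that makes the measure suitable for comparing algorithms, and with beta = 1 the score is symmetric in its two arguments.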
Comorbidities discovery
For each algorithm and each cluster it forms, the central disease(s) with the highest centrality are determined, and the comorbidities are taken from the consensus of the chosen centrality algorithms. The centrality algorithms applied in this research are betweenness, degree, closeness, and eigenvector centrality. Tables 12, 13, 14, 15, and 16 show the results of the centrality measurements for the best five community detection algorithms.
From Tables 12, 13, 14, 15, and 16, the comorbidities found by each algorithm are summarized in Table 17.
Table 18 and Fig. 13 (made with InteractiVenn; Heberle et al. 2015) summarize these findings.
For the full explanation and information on comorbidities in Disease Ontology, refer to Additional file 3. To visualize the correlation between breast cancers and comorbidities listed in this research, the disease ontology directed acyclic graph for breast cancer and comorbidities agreed upon by all five algorithms is shown in Fig. 14.
From Fig. 14, breast cancers and all the comorbidities, except for the disease of mental health, genetic disease, and its derivative, are diseases of the anatomical entity. This directed acyclic graph depicted all fourteen comorbidities of breast cancer discovered by this research, namely nervous system disease, cardiovascular system disease, gastrointestinal system disease, disease of mental health, retinal degeneration, lung disease, lower respiratory tract disease, autoimmune disease, dermatitis, bone disease, kidney disease, autosomal dominant disease, thyroid gland disease, and genetic disease.
To validate the breast cancer comorbidities discovered in this research, real-world data from several research papers are used. The comorbidity lists from five research papers, by Ewertz et al. (2018), Fu et al. (2015), Sharma et al. (2015), Ording et al. (2013), and Hong et al. (2015), are combined into a single list of unique breast cancer comorbidities. The list of breast cancer comorbidities from those five research papers is shown in Table 19.
The compiled list of breast cancer comorbidities from those five research papers and their ancestors in the Disease Ontology is shown in Table 20.
As shown in Table 20, of the roughly 43 diseases curated from the five research papers, only five are not included in this research result, namely secondary cancers, anemia, leukemia, obesity, and acquired immunodeficiency syndrome (AIDS). For secondary cancers, treating cancer as a comorbidity of cancer is not applicable here, and the elimination procedure in data preprocessing drops all synonyms of cancer. The other diseases might not be covered by the data, or might not be in the giant component, so further analysis or research on them might be conducted.
Based on the results and discussion above, this research aligns with previous breast cancer comorbidity research because the ancestors of most of the reported comorbidities are found in this research result, while some comorbidities might need further examination. In several of its steps, this research also aligns with previous network analysis research on disease comorbidities, while being distinctive in validating its results against several breast cancer comorbidity research papers. This research provides a list of breast cancer comorbidities based on community detection approaches. The result might benefit medical personnel in identifying or validating current or new breast cancer comorbidities. Future research on breast cancer comorbidities can also use other techniques, with or without a computer science approach, to validate, strengthen, or find other breast cancer comorbidities. Ultimately, this can lower the number of breast cancer deaths caused by comorbidity and the overall number of deaths.
Conclusion
In this research, a breast cancer comorbidity network has been built. The network is based on disease similarity: a similarity matrix is constructed with a gene ontology-based semantic similarity function, and that matrix is transformed into an adjacency matrix with a chosen threshold. The Wang algorithm with a threshold of 0.5 is chosen based on the result distribution, as it balances the problems of modularity and false-positive rate.
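The thresholding step can be illustrated with a short sketch. It assumes that an edge is kept when the similarity is greater than or equal to the threshold and that self-loops are discarded; the source does not state either detail. The 4 x 4 similarity matrix holds made-up values, not Wang similarity scores from this research. The second function extracts the largest connected component, on which the later analysis is run.

```python
def threshold_adjacency(similarity, threshold=0.5):
    """Binary adjacency matrix: 1 where similarity >= threshold, no self-loops."""
    n = len(similarity)
    return [
        [1 if i != j and similarity[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def giant_component(adjacency):
    """Indices of the largest connected component (iterative depth-first search)."""
    n = len(adjacency)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if adjacency[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(component) > len(best):
            best = component
    return sorted(best)


# Hypothetical symmetric similarity matrix for four diseases.
similarity = [
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.5, 0.3],
    [0.2, 0.5, 1.0, 0.4],
    [0.1, 0.3, 0.4, 1.0],
]
adjacency = threshold_adjacency(similarity, threshold=0.5)
print(giant_component(adjacency))
```

Raising the threshold removes edges, which tends to shrink the giant component, while lowering it admits weaker and possibly spurious links. This is the trade-off that the choice of 0.5 addresses.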
Community detection algorithms are then applied to discover the communities within the network’s giant component. The results are evaluated with modularity and similarity measurements. The modularity evaluation using five modularity measurements suggests the best five community detection algorithms, namely Leiden, Louvain, RBER Pots, RB Pots, and Walktrap. Similarity measurements with the best three fitness functions (edges inside, scaled density, and size) suggest that Leiden–Louvain and RBER Pots–RB Pots are two pairs of algorithms with similar results, leaving Walktrap as an algorithm with different community results. A further similarity measurement with V-measure, visualized as a heatmap, suggests that the Louvain–Leiden (0.99), RB Pots–Leiden (0.97), and RB Pots–RBER Pots (0.96) results are quite similar to each other.
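As an illustration of a modularity measurement, the sketch below computes the Newman–Girvan modularity (Newman and Girvan 2004) of a partition of an undirected, unweighted graph. It is shown as one widely used definition; it is an assumption that this score matches any of the five measurements applied here. The edge list and partitions are toy data.

```python
def newman_girvan_modularity(edges, communities):
    """Q = sum over communities of L_c / m - (d_c / (2 m))^2, undirected graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    label = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)    # L_c: edges with both ends in community c
    degree_sum = [0] * len(communities)  # d_c: summed degree of nodes in community c
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(
        inside / m - (degree / (2 * m)) ** 2
        for inside, degree in zip(internal, degree_sum)
    )


# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = [{0, 1, 2}, {3, 4, 5}]
one_group = [{0, 1, 2, 3, 4, 5}]
print(newman_girvan_modularity(edges, two_groups))
print(newman_girvan_modularity(edges, one_group))
```

Splitting the toy graph along the bridge scores higher than keeping all nodes in one community, which is the behaviour a modularity-based ranking of algorithms relies on.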
Lastly, comorbidity is evaluated using the best five community detection algorithms and four centrality algorithms. The Leiden algorithm produces 16 clusters and 21 diseases. The Louvain algorithm produces 14 clusters and 16 diseases. The RBER Pots algorithm produces 24 clusters and 31 diseases. The RB Pots algorithm produces 16 clusters and 21 diseases. Walktrap produces 15 clusters and 19 diseases. Fourteen diseases are agreed upon by all of the best five community detection algorithms; five diseases are agreed upon by four algorithms; two diseases are agreed upon by three algorithms; one disease is agreed upon by two algorithms; and ten diseases are found by only one algorithm.
With these research findings, the ultimate goal is to lower the number of breast cancer deaths. The findings might help medical personnel identify or validate current or new breast cancer comorbidities. Further research can also be conducted to validate or strengthen the discovery of breast cancer comorbidities.
Availability of data and materials
Not applicable.
References
Baggio S, Sapin M, Khazaal Y, Studer J, Wolff H, Gmel G (2018) Comorbidity of symptoms of alcohol and cannabis use disorders among a population-based sample of simultaneous users. Insight from a network perspective. Int J Environ Res Public Health 15:2893. https://doi.org/10.3390/IJERPH15122893
Baron JA, Johnson CSB, Schor MA, Olley D, Nickel L, Felix V, Munro JB, Bello SM, Bearer C, Lichenstein R, Bisordi K (2023) The do-kb knowledgebase: a 20-year journey developing the disease open science ecosystem. Nucleic Acids Res. https://doi.org/10.1093/NAR/GKAD1051
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory and Exp. https://doi.org/10.1088/1742-5468/2008/10/p10008
Cancer (2022). https://www.who.int/news-room/fact-sheets/detail/cancer
Chatterjee S, Sanjeev BS (2022) Network-based community detection of comorbidities and their association with SARS-COV-2 virus during Covid-19 pathogenesis
Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 70:6. https://doi.org/10.1103/PHYSREVE.70.066111
Collins TK, Houghten S (2020) A centrality based multi-objective approach to disease gene association. Biosystems 193–194:104133. https://doi.org/10.1016/J.BIOSYSTEMS.2020.104133
Cordasco G, Gargano L (2011) Community detection via semi-synchronous label propagation algorithms. In: 2010 IEEE international workshop on business applications of social network analysis, BASNA 2010. https://doi.org/10.1504/..045103
Das AB (2020) Lung disease network reveals the impact of comorbidity on SARS-COV-2 infection. bioRxiv 2020.05.13.092577. https://doi.org/10.1101/2020.05.13.092577
Das AB (2021) Lung disease network reveals impact of comorbidity on SARS-COV-2 infection and opportunities of drug repurposing. BMC Med Genom 14:1–14. https://doi.org/10.1186/S12920-021-01079-7
Enright AJ, Dongen SV, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584. https://doi.org/10.1093/NAR/30.7.1575
Erdos P, Rényi A (2011) On the evolution of random graphs. Struct Dyn Netw 9781400841356:38–82. https://doi.org/10.1515/9781400841356.38
Ewertz M, Land LH, Dalton SO, Cronin-Fenton D, Jensen MB (2018) Influence of specific comorbidities on survival after early-stage breast cancer. Acta Oncol 57:129–134. https://doi.org/10.1080/0284186X.2017.1407496
Ferlay J, Ervik M, Lam F, Colombet M, Mery L, Piñeros M, Soerjomataram I, Znaor A, Bray F (2018) Global cancer observatory: cancer today. https://gco.iarc.fr/today/home
Flake GW, Lawrence S, Giles CL (2000) Efficient identification of web communities. In: Proceeding of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 150–160. https://doi.org/10.1145/347090.347121
Fortunato S (2010) Community detection in graphs. Phys Rep 486:75–174. https://doi.org/10.1016/J.PHYSREP.2009.11.002
Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40:35. https://doi.org/10.2307/3033543
Freeman LC (1978) Centrality in social networks conceptual clarification. Soc Netw 1:215–239. https://doi.org/10.1016/0378-8733(78)90021-7
Fu MR, Axelrod D, Guth AA, Clel CM, Ryan CE, Weaver KR, Qiu JM, Kleinman R, Scagliola J, Palamar JJ, Melkus GD (2015) Comorbidities and quality of life among breast cancer survivors: a prospective study. J Personal Med 5(5):229–242. https://doi.org/10.3390/JPM5030229
Heberle H, Meirelles VG, Silva FR, Telles GP, Minghim R (2015) Interactivenn: a web-based tool for the analysis of sets through Venn diagrams. BMC Bioinform 16:1–7. https://doi.org/10.1186/S12859-015-0611-3
Hong CC, Ambrosone CB, Goodwin PJ (2015) Comorbidities and their management: potential impact on breast cancer outcomes. Adv Exp Med Biol 862:155–175. https://doi.org/10.1007/978-3-319-16366-6_11
Jiang JJ, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. https://aclanthology.org/O97-1002
Kaiser T, Herzog P, Voderholzer U, Brakemeier EL (2021) Unraveling the comorbidity of depression and anxiety in a large inpatient sample: network analysis to examine bridge symptoms. Depress Anxiety 38:307–317. https://doi.org/10.1002/DA.23136
Khan A, Uddin S, Srinivasan U (2018) Comorbidity network for chronic disease: a novel approach to understand type 2 diabetes progression. Int J Med Inform 115:1–9. https://doi.org/10.1016/J.IJMEDINF.2018.04.001
Kozdoba M, Mannor S (2015) Community detection via measure space embedding. In: Advances in neural information processing systems, vol 28
Leicht EA, Newman MEJ (2008) Community structure in directed networks. Phys Rev Lett. https://doi.org/10.1103/PHYSREVLETT.100.118703
Levinson CA, Brosof LC, Vanzhula I, Christian C, Jones P, Rodebaugh TL, Langer JK, White EK, Warren C, Weeks JW, Menatti A, Lim MH, Fernandez KC (2018) Social anxiety and eating disorder comorbidity and underlying vulnerabilities: using network analysis to conceptualize comorbidity. Int J Eat Disord 51:693–709. https://doi.org/10.1002/EAT.22890
Li T, Lei L, Bhattacharyya S, Berge KV, Sarkar P, Bickel PJ, Levina E (2022) Hierarchical community detection by recursive partitioning. J Am Stat Assoc 117:951–968. https://doi.org/10.1080/01621459.2020.1833888
Lin D (1998) An information-theoretic definition of similarity. In: International conference on machine learning
Miyauchi A, Kawase Y (2016) Z-score-based modularity for community detection in networks. PLoS ONE 11:0147805. https://doi.org/10.1371/JOURNAL.PONE.0147805
Mu XM, Wang W, Jiang YY, Feng J (2020) Patterns of comorbidity in hepatocellular carcinoma: A network perspective. Int J Environ Res Public Health 17:3108. https://doi.org/10.3390/IJERPH17093108
Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E Stat Nonlinear Soft Matter Phys. https://doi.org/10.1103/PhysRevE.74.036104
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113. https://doi.org/10.1103/PhysRevE.69.026113
Nicosia V, Mangioni G, Carchiolo V, Malgeri M (2009) Extending the definition of modularity to directed graphs with overlapping communities. J Stat Mech Theory Exp 2009:03024. https://doi.org/10.1088/1742-5468/2009/03/P03024
Ording AG, Garne JP, Nyström PMW, Frøslev T, Sørensen HT, Lash TL (2013) Comorbid diseases interact with breast cancer to affect mortality in the first year after diagnosis: a Danish nationwide matched cohort study. PLoS ONE 8:76013. https://doi.org/10.1371/JOURNAL.PONE.0076013
Permana AA, Romdendine MF, Perdana AT (2023) Graph analysis for the discovery of key proteins in type 2 diabetes mellitus. Indones J Electron Electromed Eng Med Inform 5:201–209. https://doi.org/10.35882/IJEEEMI.V5I4.335
Pizzuti C (2008) Ga-net: a genetic algorithm for community detection in social networks. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 5199 LNCS, pp 1081–1090. https://doi.org/10.1007/978-3-540-87700-4_107
Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 3733 LNCS, pp 284–293. https://doi.org/10.1007/11569596_31
Radicchi F, Castellano C, Cecconi F, Loreto V, Paris D (2004) Defining and identifying communities in networks. Proc Natl Acad Sci USA 101:2658–2663. https://doi.org/10.1073/PNAS.0400054101
Reichardt J, Bornholdt S (2006) Statistical mechanics of community detection. Phys Rev E Stat Nonlinear Soft Matter Phys 74:016110. https://doi.org/10.1103/PHYSREVE.74.016110
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy
Rosenberg A, Hirschberg J (2007) V-measure: a conditional entropy-based external cluster evaluation measure, pp 410–420
Rossetti G, Milli L, Cazabet R (2019) Cdlib: a python library to extract, compare and evaluate communities from complex networks. Appl Netw Sci 4:1–26. https://doi.org/10.1007/S41109-019-0165-9
Russell CD, Lone NI, Baillie JK (2023) Comorbidities, multimorbidity and Covid-19. Nat Med 29:334–343. https://doi.org/10.1038/s41591-022-02156-9
Rustamaji HC, Suharini YS, Permana AA, Kusuma WA, Nurdiati S, Batubara I, Djatna T (2022) A network analysis to identify lung cancer comorbid diseases. Appl Netw Sci 7:1–23. https://doi.org/10.1007/S41109-022-00466-Y
Schlicker A, Domingues FS, Rahnenführer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:1–16. https://doi.org/10.1186/1471-2105-7-302
Sharma N, Narayan S, Sharma R, Kapoor A, Kumar N, Nirban R (2015) Association of comorbidities with breast cancer: an observational study. Trop J Med Res 19:168
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22:888–905. https://doi.org/10.1109/34.868688
Shruthi S, Binu Xavier F, Ravi Kumar A, Yeshwanth S, Mandi MV (2020) Breast cancer classification using python programming in machine learning. Int J Eng Res. https://doi.org/10.17577/IJERTV9IS080359
Swain S, Sarmanova A, Coupland C, Doherty M, Zhang W (2020) Comorbidities in osteoarthritis: a systematic review and meta-analysis of observational studies. Arthritis Care Res 72:991–1000. https://doi.org/10.1002/ACR.24008
Traag VA, Dooren PV, Nesterov Y (2011) Narrow scope for resolution-limit-free community detection. Phys Rev E Stat Nonlinear Soft Matter Phys 84:016114. https://doi.org/10.1103/PHYSREVE.84.016114
Traag VA, Krings G, Dooren PV (2013) Significant scales in community structure. Sci Rep 3:1–10. https://doi.org/10.1038/srep02930
Traag VA, Aldecoa R, Delvenne JC (2015) Detecting communities using asymptotical surprise. Phys Rev E Stat Nonlinear Soft Matter Phys 92:022816. https://doi.org/10.1103/PHYSREVE.92.022816
Traag V, Waltman L, Eck NJ (2018) From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. https://doi.org/10.1038/s41598-019-41695-z
Ustalov D, Panchenko A, Biemann C, Ponzetto SP (2019) Watset: local-global graph clustering with applications in sense and frame induction. Comput Linguist 45:423–479. https://doi.org/10.1162/COLI_A_00354
Vilela J, Martiniano H, Marques AR, Santos JX, Rasga C, Oliveira G, Vicente AM (2022) Disease similarity network analysis of autism spectrum disorder and comorbid brain disorders. Front Mol Neurosci 15:932305. https://doi.org/10.3389/FNMOL.2022.932305
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF (2007) A new method to measure the semantic similarity of go terms. Bioinformatics 23:1274–1281. https://doi.org/10.1093/BIOINFORMATICS/BTM087
Wei T-H (1952) Algebraic foundations of ranking theory. https://doi.org/10.17863/CAM.96653
Wei CH, Allot A, Leaman R, Lu Z (2019) Pubtator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 47:587–593. https://doi.org/10.1093/NAR/GKZ389
Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42:181–213. https://doi.org/10.1007/S10115-013-0693-Z
Yu G, Wang LG, Yan GR, He QY (2015) Dose: an r/bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31:608–609. https://doi.org/10.1093/BIOINFORMATICS/BTU684
Zhang P, Moore C (2014) Scalable detection of statistically significant communities and hierarchies, using message passing for modularity. Proc Natl Acad Sci USA 111:18144–18149. https://doi.org/10.1073/PNAS.1409770111
Zhang S, Ning X-M, Ding C, Zhang X-S (2010) Determining modular organization of protein interaction networks by maximizing modularity density. BMC Syst Biol. https://doi.org/10.1186/1752-0509-4-S2-S10
Acknowledgements
Thank you to Universitas Multimedia Nusantara for the continuous support of this research. Furthermore, thank you to Bonifasius Ariesto Adrian Finantyo for the help in data gathering and labeling, and Nehemia Gueldi for the help in data labeling.
Funding
Not applicable.
Author information
Contributions
Conceptualization, RMY, AAP; methodology, RMY, AAP; software, RMY; resources, RMY, AAP; data labeling and preprocessing, RMY, AAP; articles proof reading, RMY; writing paper draft, RMY; paper review and editing, RMY, AAP; visualization, RMY; supervision, AAP; project administration, RMY, AAP. All authors involved in this research have perused and agreed that the submitted and published versions of the manuscript are updated and reflect the authors’ best knowledge until publication day. All authors have made significant contributions to the research reported in this article. All authors listed in this manuscript had notable contributions to the research, and all mentioned authors had contributions to this research’s success. All authors wrote, read, and approved the submitted manuscript. All authors read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Permana, A.A., Yaputra, R.M. Analyzing breast cancer comorbidities: a network approach using community detection algorithms. Appl Netw Sci 9, 31 (2024). https://doi.org/10.1007/s41109-024-00644-0
DOI: https://doi.org/10.1007/s41109-024-00644-0