
Analyzing breast cancer comorbidities: a network approach using community detection algorithms

Abstract

Breast cancer is a prominent global health concern: data from the International Agency for Research on Cancer (IARC) show that it was the cancer type with the most new cases in 2020 and among the top five cancer types by number of deaths. To help medical personnel identify breast cancer comorbidities and, ultimately, lower the number of deaths associated with them, this research aims to discover breast cancer comorbidity communities, evaluate them by modularity and similarity, suggest the best semantic similarity measurement and threshold value, and validate the resulting comorbidities against data from several research papers. The Wang algorithm, with a threshold value of 0.5, is chosen to build the network. Leiden, Louvain, RBER Pots, RB Pots, and Walktrap are the best five community detection algorithms. Similarity measurements with the best three fitness functions (edges inside, scaled density, and size) suggest that Leiden and Louvain, and RBER Pots and RB Pots, are two pairs of algorithms with similar results. A further similarity measurement with the V-measure heatmap suggests that the results of Louvain–Leiden (0.99), RB Pots–Leiden (0.97), and RB Pots–RBER Pots (0.96) are similar. Comorbidity is then evaluated using the best five community detection algorithms and four centrality algorithms. As a result, fourteen diseases are agreed upon by all five community detection algorithms, five diseases by four algorithms, two diseases by three algorithms, one disease by two algorithms, and ten diseases by only one algorithm.

Introduction

There are many theories about comorbidity and its definition. In 1996, Akker et al. were the first to propose a definition that distinguishes comorbidity from multimorbidity. Comorbidity is defined as any medical condition linked to a given index (primary) disease. Multimorbidity, in contrast, is defined as any co-occurrence of multiple medical disorders or conditions without a given index disease (Swain et al. 2020). The discovery of comorbidities and their impacts may help prioritize treatment actions to address the primary disease and underpin the biological basis of the disease (Russell et al. 2023). Hence, information about comorbidity is crucial: comorbidity is an important relationship between diseases and may lead to essential discoveries in the future.

Cancer is often regarded as a deadly disease. This perception is not misplaced: according to data from the International Agency for Research on Cancer (IARC), nearly 10 million deaths were caused by cancer in 2020, making cancer a leading cause of death worldwide. The World Health Organization defines cancer as the abnormal or uncontrolled proliferation of cells in one or more parts of the body (Cancer 2022). Of these 10 million deaths, breast cancer is among the top five cancer types by number of deaths, with 0.69 million deaths, preceded by lung cancer with 1.80 million deaths, colon and rectum cancer with 0.92 million deaths, liver cancer with 0.83 million deaths, and stomach cancer with 0.77 million deaths. According to other IARC data on the cancer types with the most new cases in 2020, breast cancer leads with 2.26 million cases, followed by lung cancer with 2.21 million cases, colon and rectum cancer with 1.93 million cases, prostate cancer with 1.41 million cases, and skin cancer with 1.20 million cases (Ferlay et al. 2018).

These data show that breast cancer stands out as a prominent global health concern. Any information that helps prevent further impacts will benefit healthcare, and, as stated before, the comorbidities of breast cancer are one such important piece of information. Several studies have reported breast cancer comorbidities. According to the Danish National Patient Register (DNPR) data in the Ewertz et al. (2018) study, there are twelve breast cancer comorbidities, all of which significantly affect the all-cause mortality rate. In another study, conducted by Sharma et al. (2015) on 134 patients, twenty-eight breast cancer comorbidities were assessed by self-report and verified by medical record review and the Charlson Comorbidity Index. One conclusion of this study is that comorbidities decrease breast cancer survivors’ quality of life (Fu et al. 2015). Research by Sharma et al. (2015) also reported nine breast cancer comorbidities. In different research based on the Danish National Patient Register, conducted by Ording et al. (2013), nineteen diseases are related to breast cancer comorbidity. Lastly, research from Hong et al. (2015) compiles breast cancer comorbidities from many research papers.

Network analysis, or graph analytics, is a robust method designed to infer insights from data in graph form. One of its sub-fields is community detection, which groups the nodes of a network or graph into clusters (communities) under the assumption that connections are sparse. In some applications, these clusters reveal meaningful information that is hard or even impossible to obtain from a regular data representation (Li et al. 2022). One such application is inferring comorbidity, which is the connection or relationship between diseases. Several algorithms can be used for community detection, such as Leiden, Louvain, RB Pots, Belief, Girvan–Newman, and many more.

Comorbidity discovery with community detection algorithms is discussed in several research papers. Algorithms for disease ontology (DO) similarity, the best algorithms for revealing lung cancer comorbidities, the calculation of fitness scores, and several significant lung cancer comorbidities were discussed by Rustamaji et al. (2022) in the paper titled “A Network Analysis to Identify Lung Cancer Comorbid Disease.” Hepatocellular carcinoma comorbidity risk is discussed by Mu et al. (2020) in the paper titled “Patterns of Comorbidity in Hepatocellular Carcinoma: A Network Perspective”. The network analysis approach is used to conceptualize eating disorder and social anxiety disorder comorbidity in the paper titled “Social anxiety and eating disorder comorbidity and underlying vulnerabilities: Using network analysis to conceptualize comorbidity” (Levinson et al. 2018). The comorbidities and conditions in a comorbidity network that are most attributed to diabetes are revealed by Khan et al. (2018). An analysis of the network structure of depression and anxiety symptoms is given by Kaiser et al. (2021). Other applications of community detection algorithms for comorbidity information discovery are discussed by Baggio et al. (2018), Das (2020, 2021), Vilela et al. (2022), and Chatterjee and Sanjeev (2022).

This research discusses an in-silico effort to discover breast cancer’s comorbidities with community detection algorithms. In addition, it suggests the best gene ontology-based semantic similarity measurement (out of the Wang, Jiang, Lin, Resnik, and Rel algorithms) and the optimal threshold value for deriving the network’s adjacency matrix from the similarity matrix. Moreover, it compares all the community detection algorithms to select the best five for community detection on the breast cancer comorbidity network, based on the network’s visual, density, and modularity measurements. The results of the best five algorithms are then measured with several fitness functions applied to each algorithm’s result, as well as with a heatmap of the pairwise similarity between the algorithms’ results. Lastly, the resulting comorbidity list is visualized with the disease ontology’s directed acyclic graph (DAG) and validated against data on breast cancer’s comorbidities from various sources.

Research methodology

Research steps

This research is divided into six steps: data gathering, data preprocessing, network formation, community detection, community evaluation (similarity and modularity), and comorbidity discovery. The flowchart of this research’s steps is shown in Fig. 1.

Fig. 1 Flowchart of the research steps

The details of each step are as follows.

  1. Data gathering

    The PubTator Central (PTC) website (https://www.ncbi.nlm.nih.gov/research/pubtator/) is scraped. PTC was chosen because it provides comprehensive biological information on cell lines, chemicals, diseases, genes, mutations, and species. PTC mining systems annotate 29 million abstracts in PubMed and 3 million full-text articles on PMC in various formats (Wei et al. 2019). An example of the articles scraped from PubTator Central is shown in Table 1. The results can be retrieved through an application programming interface (API) provided by the website developer. From all the manuscripts related to “comorbid breast cancer”, disease annotations are collected, since this research focuses on breast cancer comorbidities.

  2. Data preprocessing

    After the manuscripts are retrieved and all disease-related terms are extracted, the Disease Ontology (DO, https://disease-ontology.org/do) is used to obtain the Disease Ontology Identifier (DOID) of each disease. DO is a platform that curates the semantic relationships between diseases and integrates biological concepts such as their genes, environmental drivers, and attributes. DO has become a reliable and robust platform for acquiring disease ontology data and is trusted by many stakeholders (Baron et al. 2023). DO is chosen because it plays a key role in linking data with real-world data. DO represents its data as a directed acyclic graph (DAG), which makes such complex data easier to maintain and track (Rustamaji et al. 2022). An example of a disease entry in the Disease Ontology is shown in Table 2. Alongside the DOID lookup, words unrelated to disease, synonyms of diseases, synonyms of breast cancer, and diseases that cannot be found in the ontology are eliminated. After all the DOIDs are acquired, a list of unique DOIDs is generated.

  3. Network formation

    A similarity matrix between the unique DOIDs is computed with the Disease Ontology Semantic and Enrichment Analysis (DOSE) library, downloaded from Bioconductor, within an R program (Yu et al. 2015). Wang et al. (2007), Jiang and Conrath (1997), Lin (1998), Resnik (1995), and Schlicker et al. (2006) are the five gene ontology-based semantic similarity algorithms included in the doSim function of the DOSE library. Wang’s algorithm uses the position of the disease in the directed acyclic graph (DAG) and its parent nodes to convert the term’s semantics into a numeric value (Wang et al. 2007). Jiang’s algorithm uses words, concepts, and the lexical taxonomy structure, combined with corpus statistical information (Jiang and Conrath 1997). Lin’s algorithm starts from a set of assumptions and derives a universal definition of similarity (Lin 1998). Resnik’s algorithm works with information content (Resnik 1995). The Rel algorithm compares gene ontology (GO) terms and scores the similarity of gene products (Schlicker et al. 2006). Each value in the similarity matrix lies in the range [0, 1], where zero indicates that two diseases have no similarity at all and one indicates that two diseases are identical (perfectly similar). Null values are then cleaned, and the similarity matrix is transformed into an adjacency matrix using a threshold: each value in the similarity matrix that is below the threshold is set to zero, which disconnects the two nodes. In this research, Wang’s algorithm is chosen for its accuracy on pairs of similar diseases and its well-distributed edge weights, and 0.5 is chosen as the threshold value to balance the false positives and the modularity of the network (Rustamaji et al. 2022). With the adjacency matrix, the network is constructed in a Python program with the NetworkX library. The connected subgraph with the most nodes, that is, the giant component of the graph, is then taken for further calculation. The Jupyter widget ipycytoscape and the Cytoscape application are used to visualize the network and its giant component.

  4. Communities detection

    The Python library cdlib is used in this research to compute the communities produced by the community detection algorithms. Python is chosen because its functionality and libraries are sufficient for data science research and analysis (Shruthi et al. 2020). Belief (Zhang and Moore 2014), Constant Potts Model (CPM) (Traag et al. 2011), Chinese Whispers (Ustalov et al. 2019), Diffusion Entropy Reducer (DER) (Kozdoba and Mannor 2015), Eigenvector (Newman 2006), Genetic Algorithm (GA) (Pizzuti 2008), Greedy Modularity (Clauset et al. 2004), Label Propagation (Cordasco and Gargano 2011), Leiden (Traag et al. 2018), Louvain (Blondel et al. 2008), Markov Clustering (Enright et al. 2002), RB Pots (Leicht and Newman 2008), RBER Pots (Reichardt and Bornholdt 2006), Significance Community (Traag et al. 2013), Spinglass (Reichardt and Bornholdt 2006), Surprise Community (Traag et al. 2015), and Walktrap (Pons and Latapy 2005) are the seventeen community detection algorithms used in this research. Descriptions of all of them can be found in Additional file 1.

  5. Community evaluation (similarity and modularity)

    Modularity is measured on the results of the seventeen community detection algorithms. Newman–Girvan modularity (Newman and Girvan 2004), Erdos–Renyi modularity (Erdos and Rényi 2011), link modularity (Nicosia et al. 2009), density modularity (Zhang et al. 2010), and Z modularity (Miyauchi and Kawase 2016) are the five modularity measurements used to evaluate the communities. All of them are benefit-type measures: a higher modularity value represents a better and denser community or cluster, and a lower value a worse one. The formula of each modularity is shown in Formulas 1, 2, 3, 4, and 5.

    $$\begin{aligned} Q(S)_{\text{Newman-Girvan}} = \frac{1}{m}\sum _{c \in S}\left( m_{c} - \frac{(2m_{c} + l_{c})^{2}}{4m}\right) \end{aligned}$$
    (1)
    $$\begin{aligned} Q(S)_{\text{Erdos-Renyi}} = \frac{1}{m}\sum _{c \in S}\left( m_{c} - \frac{m\,n_{c}(n_{c}-1)}{n(n-1)}\right) \end{aligned}$$
    (2)
    $$\begin{aligned} Q(S)_{\text{Link}} = \frac{1}{2m}\sum _{i,j \in V}\left[ A_{ij} - \frac{k_{i}k_{j}}{2m}\right] \delta (c_{i}, c_{j}) \end{aligned}$$
    (3)
    $$\begin{aligned} Q(S)_{\text{Density}} = \sum _{c \in S}\frac{1}{n_{c}}\left( \sum _{i \in c} 2 \lambda k^{in}_{ic} - \sum _{i \in c} 2 \lambda k^{out}_{ic}\right) \end{aligned}$$
    (4)
    $$\begin{aligned} Z(S)_{\text{Z}} = \frac{\sum _{c \in S}\frac{m_{c}}{m}-\sum _{c \in S}\left( \frac{D_{c}}{2m}\right) ^{2}}{\sqrt{\sum _{c \in S}\left( \frac{D_{c}}{2m}\right) ^{2}\left( 1-\sum _{c \in S}\left( \frac{D_{c}}{2m}\right) ^{2}\right) }} \end{aligned}$$
    (5)

    After all the modularity values are obtained, normalization is applied to bring all the modularity measurements into the same range of [0, 1], and principal component analysis (PCA) is performed to reduce the dimensionality and convert the measurements into a single comparable number. The algorithms’ results are then ranked based on their modularity and the eigenvector characteristics. The results are also measured by the similarity between them. For this, the fitness scores of each algorithm’s result are calculated. The fitness scores applied are average internal degree, edges inside, expansion, internal edge density (Radicchi et al. 2004), conductance, normalized cut (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), max out-degree fraction (max ODF), average out-degree fraction (average ODF), flake out-degree fraction (flake ODF) (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019). Descriptions of all the fitness functions used in this research can be found in Additional file 2. Normalization and PCA are then applied for the same reason as in the modularity measurement. The three fitness functions most strongly correlated with the outcomes are chosen, applied to the community detection algorithms, and visualized to show the similarity between the best five community detection algorithms. Lastly, to assess cluster similarity, the V-measure is applied, and the similarity scores are depicted in a heatmap. V-measure is a cluster similarity measurement that is independent of the absolute values of the labels: its value does not change when the classes are permuted or when a cluster is labeled differently. The formula for V-measure is shown in Formula 6 (Rosenberg and Hirschberg 2007).

    $$\begin{aligned} \begin{gathered} v = \frac{(1 + \beta ) \cdot \text{homogeneity} \cdot \text{completeness}}{\beta \cdot \text{homogeneity} + \text{completeness}} \\ \text{homogeneity} = 1 - \frac{H(Y_{true} | Y_{pred})}{H(Y_{true})} \\ \text{completeness} = 1 - \frac{H(Y_{pred} | Y_{true})}{H(Y_{pred})} \end{gathered} \end{aligned}$$
    (6)

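    In Formula 6, beta is the weighting coefficient that balances homogeneity and completeness: a value above 1 gives more weight to completeness, and a value below 1 gives more weight to homogeneity. \(H(Y)\) denotes the entropy of a labeling \(Y\), and \(H(Y_{true} | Y_{pred})\) denotes the conditional entropy of one labeling given the other. By default, beta is 1.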

  6. Comorbidities discovery

    Following the evaluation of the communities from each community detection algorithm, the comorbidities are determined from the consensus of the centrality algorithms for each cluster in each algorithm’s result. The NetworkX library provides the functions to evaluate the centrality of each cluster, and the cdlib library provides the function to write out the communities formed by each algorithm. Common centrality algorithms in network analysis studies are degree, closeness, betweenness, and eigenvector centrality (Permana et al. 2023). These centrality algorithms are also chosen because they have shown successful results in other research (Collins and Houghten 2020). Degree centrality uses the degree of each node and normalizes the value. Betweenness centrality counts centrality based on the shortest paths between every pair of nodes (Freeman 1977). Closeness centrality is also based on shortest paths but uses a different formula, which is not guaranteed to hold when edge weights are decimal values (Freeman 1978). Eigenvector centrality computes the centrality of a node by adding up the centrality of its parent nodes (Wei 1952). Hence, all four will be used. A cluster’s central disease(s) is defined as the node or nodes with the highest centrality among all the nodes in the respective cluster. Within a community, a centrality algorithm may suggest more than one disease as central, and the consensus itself may agree on several diseases as comorbidities. In that case, all of those diseases are considered central and are treated as comorbidities. Hence, n communities do not necessarily yield n comorbidities. After all the comorbidities are discovered, the breast cancer comorbidities reported in several studies are listed and traced to their ancestors in the ontology, and the comorbidity list discovered in this research is validated against them. The goal is to compare the results of this research with real-world data.
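The consensus step in the last item can be sketched in a few lines. The block below is a dependency-free illustration, not the authors' NetworkX code: it implements only degree and closeness centrality on an unweighted community, and the voting rule (keep the nodes named most often as top-ranked across the measures) is an assumption, since the text does not spell out how the consensus is formed. The node labels `d1` to `d5` are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node divided by (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(reachable - 1) / sum of BFS distances, for an unweighted graph."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

def top_nodes(scores, tol=1e-12):
    """All nodes tied for the highest score (a measure may name several)."""
    best = max(scores.values())
    return {v for v, s in scores.items() if best - s <= tol}

def central_by_consensus(adj, measures):
    """Assumed rule: keep the nodes that most measures rank first."""
    votes = {}
    for measure in measures:
        for v in top_nodes(measure(adj)):
            votes[v] = votes.get(v, 0) + 1
    most = max(votes.values())
    return sorted(v for v, count in votes.items() if count == most)

# Hypothetical community: d1 is a hub, d4 bridges to d5.
community = {
    "d1": ["d2", "d3", "d4"],
    "d2": ["d1"],
    "d3": ["d1"],
    "d4": ["d1", "d5"],
    "d5": ["d4"],
}
central = central_by_consensus(community, [degree_centrality, closeness_centrality])
```

Because `top_nodes` returns every node tied for first place, a single community can contribute more than one central disease, which matches the remark that n communities need not yield n comorbidities.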

Table 1 Example of articles scraped from PubTator Central
Table 2 Example of a Disease Ontology entry

System specification

This research uses several computational resources. The hardware is as follows:

  1. Lenovo Ideapad Gaming 3 laptop with the following specifications:

    • Operating System: Windows 10 Home Single Language 64-bit;

    • Processor: Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (12 CPUs), 2.6 GHz;

    • RAM: 16384 MB;

    • Memory: 512 GB;

    • DirectX Version: DirectX 12; and

    • Graphic Card: NVIDIA GeForce GTX 1650.

  2. Input device: mouse and keyboard.

  3. Output device: monitor.

Besides the hardware, software is also important to carry out the computation. The list of software used in this research is as follows:

  1. Python 3.9.12.

  2. Jupyter Notebook.

  3. Cytoscape v3.10.1.

  4. Microsoft Office Excel 2010.

  5. Python libraries: beautifulsoup4 4.11.1, bs4 0.0.1, numpy 1.25.1, pandas 1.5.3, pip 23.0.1, requests 2.28.1, scikit-learn 1.1.3, scipy 1.10.0, NetworkX 2.8.4, matplotlib 3.7.0, ipycytoscape 1.3.3, cdlib 0.2.6, and seaborn 0.12.2.

  6. A browser to scrape and browse websites, such as Microsoft Edge.

  7. Websites: PubTator Central and Disease Ontology.

Result and discussion

Data gathering

Articles registered in PubMed that relate to breast cancer are scraped from the PubTator Central (PTC) website with the provided API. As of April 20, 2023, this process resulted in 4860 manuscripts related to the keyword “comorbid breast cancer.” From these 4860 manuscripts, 17607 non-unique disease names were acquired. After duplicated disease names are eliminated, 3762 unique disease names remain for the next calculation.

Data preprocessing

The DOID of each unique disease name is looked up in the DO database, resulting in 1006 unique DOIDs as of May 20, 2023. The number of data points decreases because words unrelated to disease, synonyms of breast cancer, synonyms of other diseases, and diseases that cannot be found in DO are eliminated. With these 1006 unique DOIDs, the doSim function calculates the similarity matrix by determining the similarity of each pair of DOIDs (Yu et al. 2015). Five algorithms are used in this process, namely Wang, Jiang, Lin, Resnik, and Rel. Each algorithm gives a square matrix of order 1006 \(\times\) 1006, and each element of the similarity matrix is in the range [0, 1]. The most visible characteristic that separates the Wang, Jiang, and Lin algorithms from Resnik and Rel is that the diagonal values of those three algorithms are all 1. Refer to Tables 3, 4, and 5 for examples of the similarity matrix computed with the Wang, Jiang, and Lin algorithms, respectively.

Table 3 Wang algorithm similarity matrix result (raw)
Table 4 Jiang algorithm similarity matrix result (raw)
Table 5 Lin algorithm similarity matrix result (raw)

However, with the Resnik and Rel algorithms, the diagonal values vary and are mostly not 1. In the Rel algorithm, the diagonal value is close to but not exactly 1, for example, 0.997 for DOID.10283, 0.999 for DOID.3008, and 0.991 for DOID.2394. Refer to Table 6 for the similarity matrix computed with the Rel algorithm.

Table 6 Rel algorithm similarity matrix result (raw)

In the Resnik algorithm, the diagonal values deviate from 1 even further. For example, the self-similarity values of DOID.10283, DOID.3008, and DOID.2394 under the Resnik algorithm are 0.62, 0.86, and 0.49, respectively. Refer to Table 7 for the similarity matrix computed with the Resnik algorithm.

Table 7 Resnik algorithm similarity matrix result (raw)

Hence, those two algorithms are not selected for further calculation. After the DOIDs with null values in the similarity matrix are removed, 845 DOIDs remain (161 DOIDs had no value), and the order of the similarity matrix becomes 845 \(\times\) 845.
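To make the diagonal-of-1 property of Wang's measure concrete, the following is a minimal sketch of the measure as described by Wang et al. (2007), written from the published definition rather than taken from the DOSE source. The ontology `toy_parents` and its term names are hypothetical, and the semantic contribution factor of 0.8 is the value Wang et al. use for "is_a" edges; the weight DOSE applies to Disease Ontology terms may differ.

```python
def s_values(term, parents, weight=0.8):
    """Semantic contribution (S-value) of `term` and each of its ancestors.

    `parents` maps a term to the list of its direct parents in the DAG.
    The term itself contributes 1.0; each step up multiplies by `weight`,
    and an ancestor reachable by several paths keeps the largest value.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            candidate = weight * s[node]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s

def wang_similarity(a, b, parents, weight=0.8):
    """Shared S-values of a and b divided by the total S-values of both."""
    sa = s_values(a, parents, weight)
    sb = s_values(b, parents, weight)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))

# Hypothetical four-term ontology: B and C are siblings under A.
toy_parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"]}

self_similarity = wang_similarity("B", "B", toy_parents)
sibling_similarity = wang_similarity("B", "C", toy_parents)
```

When a term is compared with itself, every ancestor is shared, so the numerator equals the denominator and the result is 1 up to floating-point rounding. For the siblings, the shared ancestors contribute 2.88 of a total 4.88, giving a similarity of about 0.59.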

Network formation

A graph or network is then created from the adjacency matrix, which is the similarity matrix with a threshold applied to it: every value below or equal to the threshold is changed to 0. If two nodes are connected by an edge with a weight above the threshold, they stay connected by that edge. The threshold lies in the range [0, 1], with threshold = 0 meaning that every pair of nodes with a nonzero similarity stays connected, and threshold = 1 meaning that no nodes are connected, or, in other words, a null graph. The gene ontology-based semantic similarity algorithm is selected from among the Wang, Jiang, and Lin algorithms, and the threshold value is chosen from the range [0, 1]. First, to select the similarity algorithm, each algorithm’s edge count over different thresholds is plotted in Fig. 2.

Fig. 2 Comparison of the Wang, Jiang, and Lin algorithms over different thresholds
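The thresholding rule and the edge-count curve of Fig. 2 can be sketched as follows. This is a plain-Python illustration on a hypothetical 4 × 4 similarity matrix (`toy_sim`), not the authors' code; dropping the diagonal so that no self-loops are created is an assumption.

```python
def adjacency_from_similarity(sim, threshold):
    """Keep entries strictly above the threshold; zero everything else.

    The diagonal is zeroed as well (assumption: no self-loops).
    """
    n = len(sim)
    return [
        [sim[i][j] if i != j and sim[i][j] > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]

def edge_count(adj):
    """Count undirected edges using the upper triangle only."""
    n = len(adj)
    return sum(1 for i in range(n) for j in range(i + 1, n) if adj[i][j] > 0)

# Hypothetical symmetric similarity matrix for four diseases.
toy_sim = [
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.6, 0.2],
    [0.5, 0.6, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

# Edge count as a function of the threshold, as plotted per algorithm in Fig. 2.
counts = {
    t: edge_count(adjacency_from_similarity(toy_sim, t))
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

On this toy matrix the count falls from 6 edges at threshold 0 to 0 edges at threshold 1. Note that the pair with similarity exactly 0.5 is dropped at threshold 0.5, because values equal to the threshold are set to zero.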

As shown in the graph, the Jiang and Lin algorithms are too sensitive to slight changes in the threshold: many similarity values are concentrated in a narrow range, which is not a sign of a good or reliable algorithm. Following this analysis, Wang’s algorithm is chosen, since its transition over the threshold is smooth and its values are much more evenly distributed than those of the other two algorithms. For the threshold, threshold = 0.5 is chosen. The main reason for this choice is to balance the density and the false-positive rate. If there are fewer edges, the network formed will be low in modularity, even though the false-positive rate (the rate at which diseases that should not be breast cancer comorbidities are included in this research’s result) is lower. If there are more edges, there will be a higher chance of false positives, even though the network is denser. In similar research, a threshold of 0.5 was also chosen for this calculation (Rustamaji et al. 2022).

To gain a better understanding of the network formed by the Wang algorithm with 0.5 as the threshold, the network degree distribution needs to be examined. Figure 3 shows the degree distribution for the network formed with the Wang algorithm and threshold = 0.5.

Fig. 3 Degree distribution with Wang algorithm and threshold = 0.5

As shown in Fig. 3, many nodes have a low degree, so the distribution is concentrated at the low-degree end. Including those nodes in further calculation would significantly affect the network density. Thus, the giant component is determined so that the analysis focuses only on the significant comorbidities. The overall network has 845 nodes and 4000 edges. The giant component of this network is shown with the ipycytoscape CytoscapeWidget in Figs. 4 and 5, and with Cytoscape in Fig. 6.

Fig. 4 Giant component within the whole network, drawn with ipycytoscape

Fig. 5 Giant component by itself, drawn with ipycytoscape

Fig. 6 Giant component drawn with the Cytoscape application

The giant component contains 323 nodes (diseases) and 1401 edges. It is used for all subsequent calculations.
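Extracting the giant component amounts to finding the largest connected component and restricting the graph to it. The sketch below does this with a breadth-first search over an adjacency-list dictionary; it is a dependency-free illustration on a hypothetical six-node graph (`toy_graph`), whereas the research itself builds the network with NetworkX.

```python
from collections import deque

def connected_components(adj):
    """Return the node lists of all connected components (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [start], deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.append(w)
                    queue.append(w)
        components.append(component)
    return components

def giant_component(adj):
    """Subgraph induced by the largest connected component."""
    keep = set(max(connected_components(adj), key=len))
    return {v: [w for w in adj[v] if w in keep] for v in adj if v in keep}

# Hypothetical graph: a triangle, a separate pair, and an isolated node.
toy_graph = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4], 6: []}

giant = giant_component(toy_graph)
giant_nodes = sorted(giant)
giant_edges = sum(len(nbrs) for nbrs in giant.values()) // 2
```

Here the triangle is the giant component, so the pair and the isolated node are discarded, in the same way that low-degree nodes outside the main component are set aside in the research network.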

Communities detection

Communities were discovered with all seventeen algorithms available in the cdlib library. Figure 7 shows the unique communities formed by the best five community detection algorithms ranked by modularity; the modularity calculation is explained in the next subsection.

Fig. 7 Communities formed by each of the top five algorithms by modularity

Figure 8 shows the result of the unique community formed with the other algorithms.

Fig. 8 Communities formed by the other algorithms

Community evaluation (modularity and similarity)

Modularity was measured with five modularity measurements: Newman–Girvan modularity, Erdos–Renyi modularity, link modularity, density modularity, and Z modularity. Each of these is a benefit-type measure, which means that a higher value indicates a better community structure and a lower value a worse one. After normalization, PCA produces an eigenvalue of 3.65046443, and the eigenvector over the five modularity measurements is [−0.49600251, −0.49625634, −0.50703837, −0.49746022, 0.05618339], respectively. Since normalization is already applied (so all the values are in the same range) and most of the eigenvector elements are negative, the algorithms are ranked so that a lower value after PCA is better. Table 8 shows the modularity measurement result for each algorithm before normalization and PCA are applied.

Table 8 Modularity measurement result from each community detection algorithm
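As a worked instance of Formula 1, the sketch below computes Newman–Girvan modularity directly from an edge list. It reads the two per-community terms of the formula as the number of edges inside a community and the number of edges leaving it; that reading, the function name, and the toy graph are assumptions made for illustration, and the research itself relies on cdlib for these scores.

```python
def newman_girvan_modularity(edges, communities):
    """Formula 1: (1/m) * sum over communities of
    (internal edges - (2 * internal + boundary)^2 / (4m))."""
    m = len(edges)
    membership = {v: idx for idx, comm in enumerate(communities) for v in comm}
    internal = [0] * len(communities)
    boundary = [0] * len(communities)
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            internal[cu] += 1
        else:
            # A crossing edge leaves both of the communities it touches.
            boundary[cu] += 1
            boundary[cv] += 1
    total = sum(
        mc - (2 * mc + lc) ** 2 / (4 * m) for mc, lc in zip(internal, boundary)
    )
    return total / m

# Hypothetical graph: two triangles joined by a single edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

q_split = newman_girvan_modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_single = newman_girvan_modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
```

Splitting the graph into its two triangles gives 5/14 (about 0.357), while placing all nodes in one community gives 0, which illustrates why a higher value is read as a better partition.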

The final scores after normalization and PCA are shown in Table 9.

Table 9 Final modularity score and algorithms rank
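The normalize-then-PCA scoring can be illustrated with a small, dependency-free sketch. It assumes min-max normalization per measure and takes the first principal component of the covariance matrix by power iteration; the source does not state these details, so both are assumptions, and the input rows are hypothetical. In practice a library routine such as scikit-learn's PCA would normally be used.

```python
import math

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column becomes all zeros."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def first_principal_component(rows, iterations=500):
    """Leading eigenvalue/eigenvector of the covariance matrix, plus scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; the uneven start vector avoids the obvious symmetric
    # cases where the start would be orthogonal to the leading eigenvector.
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return eigenvalue, vec, scores

# Hypothetical input: three algorithms (rows) scored on two measures (columns).
toy_rows = [[1, 10], [2, 20], [3, 30]]
eigenvalue, eigenvector, scores = first_principal_component(min_max_normalize(toy_rows))
```

The sign of a principal axis is arbitrary, so whether a higher or a lower projected score is better has to be read from the signs of the eigenvector elements, as the text does for Table 9.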

As shown in Table 9, Leiden, Louvain, RBER Pots, RB Pots, and Walktrap are the best five community detection algorithms based on their modularity. Following this information, the fitness function for all five of the best community detection algorithms is calculated. Fitness functions used in this research are average internal degree, edges inside, expansion, internal edge density (Radicchi et al. 2004), conductance, normalized cut (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), max out-degree fraction (max ODF), average out-degree fraction (average ODF), flake out-degree fraction (flake ODF) (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019). The fitness score for each algorithm and fitness function is shown in Table 10.

Table 10 Fitness function score for best five community detection algorithms

Following the same procedure used to reduce the modularity measurements to a single comparable number, normalization and PCA are applied. Fitness functions are also benefit-type measures. After PCA, the eigenvalue is 5.27650015 and the eigenvector is [0.44947171, 0.44925215, 0.44043878, 0.44758673, 0.44925215]. Since all the eigenvector values are positive, a higher value is preferable. Table 11 shows the final sorted score for each fitness function.

Table 11 Final fitness function score after normalization and PCA

As shown in Table 11, edges inside, scaled density, and size are the most strongly correlated fitness functions. Edges inside is the count of edges inside a community; size is the count of nodes inside a community; and scaled density is the ratio of the community density to the overall network density. To examine the characteristics of the algorithms, Figs. 9, 10, and 11 show the similarity across the algorithms.

Fig. 9 Swarm plot by size fitness function

Fig. 10 Swarm plot by edges inside fitness function

Fig. 11 Swarm plot by scaled density fitness function
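The three selected fitness functions are simple enough to compute by hand. The sketch below follows the definitions given in the text (scaled density as community density divided by whole-network density); cdlib's own implementation may normalize differently, and the graph and function names here are hypothetical.

```python
def size(community):
    """Number of nodes in the community."""
    return len(community)

def edges_inside(edges, community):
    """Number of edges with both endpoints in the community."""
    members = set(community)
    return sum(1 for u, v in edges if u in members and v in members)

def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible if possible else 0.0

def scaled_density(edges, nodes, community):
    """Community density relative to the density of the whole network."""
    community_density = density(size(community), edges_inside(edges, community))
    return community_density / density(len(nodes), len(edges))

# Hypothetical graph: two triangles joined by a single edge.
toy_nodes = [1, 2, 3, 4, 5, 6]
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
triangle = {1, 2, 3}

triangle_size = size(triangle)
triangle_edges = edges_inside(toy_edges, triangle)
triangle_scaled = scaled_density(toy_edges, toy_nodes, triangle)
```

For the triangle, all three possible internal edges are present, so its density is 1, while the whole graph has 7 of 15 possible edges; the scaled density is therefore 15/7, about 2.14. Computing these values per community for each algorithm gives the points shown in the swarm plots.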

As shown in Figs. 9, 10, and 11, all three fitness functions visualize the similarity of the Leiden and Louvain algorithms, and RBER Pots and RB Pots, and distinguish the Walktrap algorithm as an algorithm that has a unique result. Another similarity measurement was carried out with V-measure metrics over the community detected with each algorithm. The heatmap of these V-measure metrics from algorithm results is depicted in Fig. 12.

Fig. 12 Heatmap of V-measure similarity between the algorithms’ results

The V-measure suggests that Leiden and Louvain have a similarity of 0.99, Leiden and RB Pots have a similarity of 0.97, and RBER Pots and RB Pots have a similarity of 0.96. All pairwise similarities are also above 0.90 (90%), which means that the communities detected by the best five community detection algorithms are relatively similar to one another.
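For reference, the V-measure is the harmonic mean of homogeneity and completeness, both derived from conditional entropies of one labeling given the other. The sketch below implements that standard definition in plain Python. It assumes each node belongs to exactly one community in each partition, and the label lists are hypothetical.

```python
import math
from collections import Counter

def v_measure(labels_a, labels_b):
    """V-measure between two partitions given as parallel label lists."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    # Conditional entropies H(A|B) and H(B|A) from the joint counts.
    h_a_given_b = -sum(c / n * math.log(c / count_b[b])
                       for (a, b), c in joint.items())
    h_b_given_a = -sum(c / n * math.log(c / count_a[a])
                       for (a, b), c in joint.items())
    # Clamp at zero to absorb floating-point rounding.
    homogeneity = 1.0 if h_a == 0 else max(0.0, 1.0 - h_a_given_b / h_a)
    completeness = 1.0 if h_b == 0 else max(0.0, 1.0 - h_b_given_a / h_b)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Same grouping under different label names: a perfect match.
same = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that carry no information about each other.
unrelated = v_measure([0, 0, 1, 1], [0, 1, 0, 1])
# Partial agreement.
partial = v_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

The score depends only on how nodes are grouped, not on the label values, so two algorithms that number their communities differently can still score 1.0.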

Comorbidities discovery

For each algorithm and each cluster it forms, the central disease(s) with the highest centrality are identified, and the comorbidities are taken by consensus among the chosen centrality algorithms. The centrality algorithms applied in this research are betweenness, degree, closeness, and eigenvector centrality. Tables 12, 13, 14, 15, and 16 show the results of centrality measurements over the best five community detection algorithms.

Table 12 DOID of Comorbidities from Leiden Algorithm
Table 13 DOID of comorbidities from Louvain algorithm
Table 14 DOID of comorbidities from RBER Pots algorithm
Table 15 DOID of comorbidities from RB Pots algorithm
Table 16 DOID of comorbidities from walktrap algorithm
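One possible reading of the consensus step is a vote: each centrality measure nominates its top-scoring node(s) in a cluster, and a node is kept when enough measures agree. The text does not spell out the exact rule, so the vote threshold, the function names, and the scores below are all assumptions made for illustration.

```python
from collections import Counter

def top_nodes(scores):
    """Nodes that share the highest score under one centrality measure."""
    best = max(scores.values())
    return {node for node, value in scores.items() if value == best}

def consensus_central_nodes(centralities, min_votes):
    """Nodes nominated as most central by at least `min_votes` measures."""
    votes = Counter()
    for scores in centralities.values():
        votes.update(top_nodes(scores))
    return sorted(node for node, count in votes.items() if count >= min_votes)

# Hypothetical centrality scores for one cluster of three diseases.
cluster_scores = {
    "degree":      {"disease_A": 0.8, "disease_B": 0.6,  "disease_C": 0.2},
    "betweenness": {"disease_A": 0.5, "disease_B": 0.5,  "disease_C": 0.0},
    "closeness":   {"disease_A": 0.7, "disease_B": 0.75, "disease_C": 0.4},
    "eigenvector": {"disease_A": 0.6, "disease_B": 0.55, "disease_C": 0.2},
}

central = consensus_central_nodes(cluster_scores, min_votes=3)
```

Here disease_A is nominated by three measures (degree, betweenness, eigenvector) and disease_B by two, so only disease_A passes a threshold of three votes. Ties within a measure count as a nomination for every tied node.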

The comorbidities found by each algorithm in Tables 12, 13, 14, 15, and 16 are summarized in Table 17.

Table 17 Comorbidities summary from the best five community detection algorithms

Table 18 and Fig. 13, made with InteractiVenn (Heberle et al. 2015), summarize these findings.

Table 18 Comorbidities grouped by algorithms result
Fig. 13 Venn diagram of comorbidities result count
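The grouping behind the table and the Venn diagram amounts to counting, for each disease, how many algorithms report it. A minimal sketch follows. The disease identifiers and result sets are placeholders, not the research data.

```python
from collections import Counter, defaultdict

def group_by_agreement(results):
    """Map 'number of algorithms that agree' to a sorted list of diseases."""
    votes = Counter()
    for diseases in results.values():
        votes.update(set(diseases))  # each algorithm votes once per disease
    groups = defaultdict(list)
    for disease, count in votes.items():
        groups[count].append(disease)
    return {count: sorted(names)
            for count, names in sorted(groups.items(), reverse=True)}

# Placeholder result sets for three algorithms.
toy_results = {
    "Leiden":   {"d1", "d2", "d3"},
    "Louvain":  {"d1", "d2"},
    "Walktrap": {"d1", "d4"},
}

agreement = group_by_agreement(toy_results)
```

In this toy case d1 is reported by all three algorithms, d2 by two, and d3 and d4 by one each, which mirrors how the comorbidities are binned by the number of agreeing algorithms.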

For the full explanation and information on comorbidities in Disease Ontology, refer to Additional file 3. To visualize the correlation between breast cancers and comorbidities listed in this research, the disease ontology directed acyclic graph for breast cancer and comorbidities agreed upon by all five algorithms is shown in Fig. 14.

Fig. 14 Disease ontology directed acyclic graph for breast cancer and common comorbidities found by five algorithms

Fig. 14 shows that breast cancer and all the comorbidities, except for disease of mental health, genetic disease, and its derivatives, are diseases of anatomical entity. This directed acyclic graph depicts all fourteen comorbidities of breast cancer discovered by this research, namely nervous system disease, cardiovascular system disease, gastrointestinal system disease, disease of mental health, retinal degeneration, lung disease, lower respiratory tract disease, autoimmune disease, dermatitis, bone disease, kidney disease, autosomal dominant disease, thyroid gland disease, and genetic disease.

To validate the breast cancer comorbidities discovered in this research, real-world data from several research papers is used. The breast cancer comorbidity lists from five research papers, by Ewertz et al. (2018), Fu et al. (2015), Sharma et al. (2015), Ording et al. (2013), and Hong et al. (2015), are combined into a single list of unique breast cancer comorbidities. The list of breast cancer comorbidities from those five research papers is shown in Table 19.

Table 19 Comorbidity list from several papers

The compiled list of breast cancer comorbidities from those five research papers, together with their ancestors in the Disease Ontology, is shown in Table 20.

Table 20 Compiled breast cancer comorbidities list by several papers and its ancestor in the disease ontology

As shown in Table 20, of the roughly 43 diseases curated from the five research papers, only five are not covered by the results of this research, namely secondary cancers, anemia, leukemia, obesity, and acquired immunodeficiency syndrome (AIDS). Secondary cancers are excluded by design: treating cancer as a comorbidity of cancer is not applicable here, and the elimination procedure in data preprocessing drops all synonyms of cancer. The other diseases might not be covered by the data, or might lie outside the giant component, so further analysis or research may be warranted on this matter.
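The ancestor-based check can be sketched as a walk up the ontology: a disease reported in the literature counts as covered when it, or one of its ancestors, appears among the discovered comorbidities. The parent map below is a small hand-written toy hierarchy for illustration. It is not extracted from the Disease Ontology, and the function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_covered(term, parents, discovered):
    """True when the term or any of its ancestors was discovered."""
    return term in discovered or bool(ancestors(term, parents) & discovered)

# Toy hierarchy (child -> parents); illustrative only, not real ontology data.
toy_parents = {
    "hypertension": ["vascular disease"],
    "vascular disease": ["cardiovascular system disease"],
    "cardiovascular system disease": ["disease of anatomical entity"],
    "anemia": ["hematopoietic system disease"],
    "hematopoietic system disease": ["disease of anatomical entity"],
}
toy_discovered = {"cardiovascular system disease", "kidney disease"}

covered = {name: is_covered(name, toy_parents, toy_discovered)
           for name in ["hypertension", "anemia", "kidney disease"]}
```

Because a term can have several parents in a directed acyclic graph, the walk tracks visited terms so that shared ancestors are processed only once.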

In line with the results and discussion above, this research is consistent with previous breast cancer comorbidity research, since the ancestors of most reported comorbidities are found in its results, while some comorbidities might need further examination. Several of its steps also follow previous network analysis research on disease comorbidities, and its distinguishing feature is the validation against several breast cancer comorbidity research papers. This research provides a list of breast cancer comorbidities based on community detection approaches. The results might benefit medical personnel in identifying or validating current or new breast cancer comorbidities. Future research on breast cancer comorbidities can also use other techniques, with or without a computer science approach, to validate, strengthen, or find other breast cancer comorbidities. Ultimately, this could help lower the number of breast cancer deaths caused by comorbidity, as well as overall deaths.

Conclusion

In this research, a breast cancer comorbidity network has been built. The network is built from disease similarity: a similarity matrix is constructed with a gene ontology-based semantic similarity function, and that matrix is transformed into an adjacency matrix using a threshold. The Wang algorithm with a threshold of 0.5 is chosen based on the result distribution, as it balances modularity against the false-positive rate.
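The thresholding step can be illustrated with a few lines of code. This is a sketch under assumptions: it treats a similarity at or above the threshold as an edge and ignores self-similarity on the diagonal, neither of which is stated explicitly in the text. The similarity values are made up.

```python
def adjacency_from_similarity(similarity, threshold=0.5):
    """Binary adjacency matrix from a symmetric similarity matrix.

    Assumptions: an edge exists when similarity >= threshold, and the
    diagonal (self-similarity) never produces an edge.
    """
    n = len(similarity)
    return [[1 if i != j and similarity[i][j] >= threshold else 0
             for j in range(n)]
            for i in range(n)]

# Made-up pairwise similarities between three diseases.
toy_similarity = [
    [1.0, 0.7, 0.2],
    [0.7, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

toy_adjacency = adjacency_from_similarity(toy_similarity, threshold=0.5)
```

Raising the threshold removes weaker links and makes the network sparser, while lowering it admits more edges, which is the trade-off that the choice of 0.5 is meant to balance.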

Community detection algorithms are then applied to discover the communities within the network’s giant component. The results are evaluated with modularity and similarity measurements. The modularity evaluation using five modularity measurements suggests the best five community detection algorithms, namely Leiden, Louvain, RBER Pots, RB Pots, and Walktrap. Similarity measurements with the best three fitness functions (edges inside, scaled density, and size) suggest that Leiden–Louvain and RBER Pots–RB Pots are two pairs of algorithms with similar results, leaving Walktrap as an algorithm with different community results. A further similarity measurement with the V-measure, visualized as a heatmap, suggests that the Louvain–Leiden (0.99), RB Pots–Leiden (0.97), and RB Pots–RBER Pots (0.96) results are quite similar to each other.
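The giant component is the largest connected component of the network. A minimal sketch of extracting it with breadth-first search is shown below. The adjacency structure and node names are hypothetical, and ties between equally large components are broken arbitrarily.

```python
from collections import deque

def giant_component(adjacency):
    """Largest connected component of an undirected graph.

    `adjacency` maps each node to the set of its neighbors.
    """
    unvisited = set(adjacency)
    largest = set()
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        unvisited -= component
        if len(component) > len(largest):
            largest = component
    return largest

# Toy network: a path a-b-c, a separate pair d-e, and an isolated node f.
toy_network = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}

giant = giant_component(toy_network)
```

Nodes outside the returned set, such as d, e, and f here, would not take part in community detection, which is one way a disease can be absent from the final results.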

Lastly, comorbidity is evaluated using the best five community detection algorithms and four centrality algorithms. The Leiden algorithm produces 16 clusters and 21 diseases. The Louvain algorithm produces 14 clusters and 16 diseases. The RBER Pots algorithm produces 24 clusters and 31 diseases. The RB Pots algorithm produces 16 clusters and 21 diseases. Walktrap produces 15 clusters and 19 diseases. Fourteen diseases are agreed upon by all of the best five community detection algorithms; five diseases are agreed upon by four algorithms; two diseases are agreed upon by three algorithms; one disease is agreed upon by two algorithms; and ten diseases are each found by only one algorithm.

With these findings, the ultimate goal is to lower the number of breast cancer deaths. The findings might help medical personnel identify or validate current or new breast cancer comorbidities. Further research can also be conducted to validate or strengthen the discovery of breast cancer comorbidities.

Availability of data and materials

Not applicable.

References


Acknowledgements

Thank you to Universitas Multimedia Nusantara for the continuous support of this research. Furthermore, thank you to Bonifasius Ariesto Adrian Finantyo for the help in data gathering and labeling, and Nehemia Gueldi for the help in data labeling.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, RMY, AAP; methodology, RMY, AAP; software, RMY; resources, RMY, AAP; data labeling and preprocessing, RMY, AAP; article proofreading, RMY; writing paper draft, RMY; paper review and editing, RMY, AAP; visualization, RMY; supervision, AAP; project administration, RMY, AAP. All authors involved in this research have perused and agreed that the submitted and published versions of the manuscript are updated and reflect the authors’ best knowledge until publication day. All authors have made significant contributions to the research reported in this article. All authors listed in this manuscript had notable contributions to the research, and all mentioned authors had contributions to this research’s success. All authors wrote, read, and approved the submitted manuscript. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Angga A. Permana.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Permana, A.A., Yaputra, R.M. Analyzing breast cancer comorbidities: a network approach using community detection algorithms. Appl Netw Sci 9, 31 (2024). https://doi.org/10.1007/s41109-024-00644-0


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s41109-024-00644-0

Keywords