A network analysis to identify lung cancer comorbid diseases

Cancer patients with comorbidities face a range of problems, including higher health costs and reduced quality of life. Determining comorbid diseases would therefore significantly affect the treatment of cancer patients. Because cancer is a very complex disease, the relationship between cancer and its comorbidities can be represented as a network. Moreover, because cancer and its comorbidities form a community, determining comorbidities can be treated as a community detection problem in network analysis. This study investigates which community detection algorithms are more appropriate for determining the comorbidities of cancer. Given the different community findings, we analyzed the modularity generated by each algorithm to decide the significant comorbid diseases. We retrieved lung cancer comorbidity data by text mining manuscripts in PubMed, searched for the diseases in disease ontologies, and calculated disease similarity. We investigated 20 algorithms using five modularity metrics and 16 fitness function evaluations to determine the significant comorbid diseases. The five algorithms with the best modularity are label propagation, spinglass, Chinese whispers, Louvain, and RB Pots. These five algorithms found the following significant comorbidities: blood vessel, immune system, bone, pancreatic, and metabolic disorders, atrial cardiac septal defect, atrial fibrillation, respiratory system disease, interstitial lung disease, and diabetes mellitus. The fitness functions support the results of the community algorithms; those with a significant effect are average internal degree, size, and edges inside. This study contributes to more comprehensive knowledge and management of diseases in the healthcare context.


Introduction
In complex systems such as protein and disease networks, it is challenging to derive collective behavior from knowledge of the system's components. We will never fully comprehend complex systems unless we understand the networks that underpin them (Barabasi 2016). A network conceptualizes system interactions as vertices (nodes) and edges (links) between pairs of nodes (Loe and Jensen 2015). The nodes represent the elements that make up the system, and the links describe their interactions. In a disease network, the nodes represent diseases, and the links represent similarities between the corresponding diseases, constructed via the Disease Ontology (DO). DO, an authoritative disease curation resource, coordinates disease representation across biomedical resources (Schriml and Mitraka 2015). DO enables researchers to analyze disease similarity through semantic similarity measures, expanding our understanding of the relationships between various diseases and helping to classify them (Li et al. 2011).
Network analysis reveals the network's core features, allowing complex relationships and network structure to be estimated (Hevey 2018). The network analysis also identifies groups of nodes strongly connected to the rest of the network. These interrelated groups characterize communities (Yang et al. 2016). A community is a local subgraph densely connected in a network. The community detection aims to expose the community structure attached to the network. Most of the techniques do not specify the number and size of communities (Barabasi 2016).
Communities are essential in the medical field, where diseases as complex as cancer have several comorbidities. Comorbidity refers to the existence of a long-term health condition in the presence of a primary disease of interest. Having comorbidities may influence the patient's prognosis for primary diseases such as cancer (Fowler et al. 2020). According to the World Health Organization, cancer was the first or second leading cause of death before the age of 70 in 112 of 183 countries worldwide in 2019. Lung cancer remained the leading cause of cancer death, with an estimated 1.8 million deaths (18%) (Sung et al. 2021). Various studies have revealed that many cancer patients have comorbidities. Chronic obstructive pulmonary and cardiovascular disorders are the most common comorbidities among patients with lung cancer (Pavia et al. 2007). Other comorbid diseases are immune system disease (Jacob et al. 2020), bone diseases (Kuchuk et al. 2013), pancreatic disease (Bang et al. 2014), metabolic disease, atrial cardiac septal defect (Inafuku et al. 2016), interstitial lung disease (Margaritopoulos et al. 2017), familial atrial fibrillation (Bandyopadhyay et al. 2019), respiratory system disease (Leduc et al. 2017), diabetes mellitus (Hatlen et al. 2011), and hyperlipidemia (Huang et al. 2016). Patients with lung cancer may also have multiple overlapping conditions (Sigel et al. 2017). Comorbidities can affect the stage of cancer. Patients with comorbidities face a poorer quality of life and higher healthcare costs, and have shorter survival (Sarfati et al. 2016). Thus, understanding the comorbidities that coexist with lung cancer is necessary for screening and disease management.
Disease comorbidity is a complex system because it involves various components of the body. Behind this complex system is a network that defines the interactions between components. With a network representation, the structure of the relationships among diseases can be identified and analyzed. Exploring the network structure is an efficient approach to understanding complex disease networks by identifying highly connected individual nodes and specific node communities (Barabási et al. 2011; Mu et al. 2020). Chen and Xu (Chen et al. 2015) explored and analyzed comorbidity patterns in colorectal cancer. A relationship between hepatocellular carcinoma and medical comorbidities has also been reported on the basis of community detection in a comorbidity network. Comorbidity networks were grouped using a community detection algorithm and evaluated using disease-gene associations. The disease comorbidity network shows a genetic link between colorectal cancer and metabolic disorders. Community detection in a network has also been used to determine cancer subtypes from multi-omics data (Nguyen et al. 2020). Human diseases frequently arise from protein dysfunction and can be expressed as communities. Tripathi et al. (2019) comprehensively assessed many classical community detection algorithms for biological networks to recognize non-overlapping communities and proposed a heuristic algorithm to identify structurally small and well-defined communities. The network and tree approach is a tool for inference in decision support systems related to comorbidities because diagnosis and treatment involve uncertainty (Capobianco and Liò 2015).
There are many community detection algorithms; however, not every algorithm is suitable for community clustering in all fields. Some algorithms give optimal results in one area but perform less well when applied to other problem areas. An efficient way to measure the quality of a community structure is modularity (Newman 2006). This paper investigates which community detection algorithms are more suitable for the health sector, especially for determining comorbidities. The network communities are evaluated on the basis of their modularity. Besides modularity, it is essential to evaluate various characteristics of the set of communities, as reflected in the fitness functions.
This study consists of five stages: (1) preprocessing the data, (2) developing a network based on the calculated similarity between diseases, (3) determining communities using various algorithms and measuring modularity, (4) determining significant comorbidities based on the communities, and (5) determining the fitness functions that correlate with cluster formation. The contribution of our findings is to provide an alternative use of networks in biomedical problems, especially in determining the comorbidities of lung cancer. We hope that this study will aid in the better understanding and management of diseases in the clinical context.

Data acquisition
We compiled the list of diseases through text mining of manuscripts in PubMed via Pubtator Central (PTC), https://www.ncbi.nlm.nih.gov/research/pubtator/. PTC performs automatic annotation to provide six bioconcepts: disease, gene, species, mutation, chemical, and cell line (Wei et al. 2019). We focus on the disease concept since our goal is to obtain lung cancer comorbidities.

Data preprocessing
We cleaned the data and identified the comorbidities. The first cleaning step was to remove the terms death, mortality, and lung cancer, because they are not comorbid diseases of lung cancer. Then, for each disease found in the text mining stage, we searched the disease ontology at https://disease-ontology.org/ to find its DOID. Disease Ontology (DO) is a framework for describing gene products from a disease perspective. It is critical for supporting functional genomics in disease contexts. Accurate disease descriptions can lead to the discovery of novel links between genes and disease, as well as new functions for previously unknown genes and alleles. DO is structured as a directed acyclic graph, which lays the groundwork for quantitative disease knowledge computing. For instance, pneumonia is a disease with DOID:552, which has the synonym acute pneumonia (Table 1). The aim was to determine the disease term; we also searched for names based on their synonyms. Finally, we eliminated diseases that could not be traced through the disease ontology, ending up with 395 lung cancer comorbid diseases identified by DOID.
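The cleaning and mapping steps above can be illustrated with a minimal sketch. The lookup table is a hypothetical stand-in for a Disease Ontology query; only the pneumonia entry (DOID:552, synonym acute pneumonia) comes from the text, and the function name and input list are assumptions made for illustration.

```python
# Hypothetical lookup: disease name or synonym (lower case) -> DOID.
# Only the pneumonia entry is taken from the text; a real run would
# query the Disease Ontology instead of a hard-coded dictionary.
DOID_LOOKUP = {
    "pneumonia": "DOID:552",
    "acute pneumonia": "DOID:552",  # synonym of pneumonia
}

# Terms removed in the first cleaning step (not comorbid diseases).
EXCLUDED_TERMS = {"death", "mortality", "lung cancer"}


def clean_and_map(disease_names):
    """Return {DOID: sorted matching names} for names that resolve to a DOID."""
    mapped = {}
    for name in disease_names:
        key = name.strip().lower()
        if key in EXCLUDED_TERMS:
            continue  # excluded term, not a comorbid disease
        doid = DOID_LOOKUP.get(key)
        if doid is None:
            continue  # cannot be traced through the ontology, so it is dropped
        mapped.setdefault(doid, set()).add(key)
    return {doid: sorted(names) for doid, names in mapped.items()}


# Toy input standing in for the annotated disease names.
raw_names = ["Pneumonia", "acute pneumonia", "death", "Lung cancer", "unmapped term"]
comorbid = clean_and_map(raw_names)
print(comorbid)
```

Grouping by DOID rather than by name means that a disease and its synonyms collapse into a single node candidate, which is the purpose of the synonym search.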

Develop a network based on the calculation of similarity between diseases
We calculated the similarity on the basis of the DO terms obtained in the previous stage. We then created a similarity matrix in R using the doSim function in the DOSE library from Bioconductor (Yu et al. 2015). The doSim function offers five calculation methods, developed by Wang (Wang et al. 2007), Jiang (Jiang and Conrath 1997), Lin (Lin 1998), Resnik (Grabowski 1995), and Rel (Schlicker et al. 2006). The Wang method computes the semantic similarity of two DO terms from their positions in the DO directed acyclic graph and their relationships to their ancestor terms. The other four methods are based on information content, that is, on the frequencies of the two DO terms and of their closest common ancestor term in a corpus of DO annotations. The information content of a term is the negative log likelihood of the term occurring in the DO corpus. The similarity value indicates that two comorbid disease terms share semantic relationships, phenotype characteristics, gene-disease relationships, and related medical vocabulary disease concepts. The result of this stage was the comorbidity similarity matrix. The matrix elements ranged from 0 to 1, with 0 indicating that the two comorbidities were not identical and 1 indicating that they were. Then, a threshold was applied to the matrix. Finally, we constructed a network from the thresholded disease similarity matrix and built it in the Cytoscape application (Shannon et al. 2003).
We calculated and compared the modularity of the communities formed by each algorithm. The higher the modularity, the better the community structure. We used five modularity measures: Newman Girvan (Newman and Girvan 2004), Erdos Renyi modularity (Erdos and Rényi 2011), link modularity (Nicosia et al. 2009), modularity density (Zhang et al. 2010), and Z modularity (Miyauchi and Kawase 2016). Then, we performed principal component analysis (PCA) to calculate the eigenvalues and eigenvectors, which give the weight of each modularity measure in the overall modularity (Ramadhani et al. 2021). Note that PCA is more commonly used for dimensionality reduction (Ahmadi et al. 2021). We ranked the algorithms, selected the five best, and compared their results. At this stage, we also prepared a clustering heatmap.
where $m$ is the number of graph edges, $m_s$ is the number of community edges, $l_s$ is the number of edges from nodes in $S$ to nodes outside $S$, $n_c$ is the number of nodes in $C$, $K^{int}_{iC}$ is the degree of node $i$ within $C$, $K^{out}_{iC}$ is the degree of node $i$ outside $C$, and the remaining parameter allows the resolution of the measure to be tuned.
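As a worked illustration of how these quantities combine, the sketch below computes Newman Girvan modularity in its commonly used form, Q = (1/m) Σ_S [m_S − (2 m_S + l_S)² / (4m)]. This form is assumed here to match the notation above; the function name and the toy graph (two triangles joined by one edge) are hypothetical and serve only as an example.

```python
def newman_girvan_modularity(edges, communities):
    """Q = (1/m) * sum over communities S of [m_S - (2*m_S + l_S)**2 / (4*m)].

    edges: list of undirected edges (u, v) without duplicates.
    communities: list of disjoint node sets covering every node.
    """
    m = len(edges)
    membership = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    inside = [0] * len(communities)   # m_S: edges with both ends in S
    leaving = [0] * len(communities)  # l_S: edges with exactly one end in S
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            inside[cu] += 1
        else:
            leaving[cu] += 1
            leaving[cv] += 1
    total = sum(m_s - (2 * m_s + l_s) ** 2 / (4 * m)
                for m_s, l_s in zip(inside, leaving))
    return total / m


# Toy graph: two triangles connected by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = [{0, 1, 2}, {3, 4, 5}]
one_group = [{0, 1, 2, 3, 4, 5}]

q_split = newman_girvan_modularity(toy_edges, two_groups)
q_merged = newman_girvan_modularity(toy_edges, one_group)
print(round(q_split, 4), round(q_merged, 4))
```

Splitting the toy graph at the bridge gives Q = 5/14, whereas placing all nodes in one community gives Q = 0, which illustrates why a higher value is read as a better partition.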

Determine significant comorbidities based on communities
The list of significant lung cancer comorbidities in each community was determined from four centrality measures computed within each community: betweenness, degree, closeness, and eigenvector centrality. The diseases found were relatively consistent across the communities formed by algorithms such as label propagation, spinglass, Chinese whispers, Louvain, and RB Pots.
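The idea of ranking diseases by centrality within a community can be sketched for two of the four measures, degree and closeness, using their standard definitions on an unweighted, connected subgraph. The node labels and the star-shaped toy community are hypothetical; betweenness and eigenvector centrality are omitted to keep the example short.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def closeness_centrality(adj):
    """Number of reachable nodes divided by the sum of distances to them."""
    result = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical community: node "A" is linked to every other node (a star).
community = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}

degree = degree_centrality(community)
closeness = closeness_centrality(community)
most_central = max(closeness, key=closeness.get)
print(most_central, degree[most_central], closeness[most_central])
```

In this toy community both measures single out the same node, which mirrors the expectation that a significant disease ranks highly on several centralities at once.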

Determine various fitness functions that correlate with cluster formation
We compared the proximity of the five algorithms by calculating fitness scores: average internal degree, internal edge density, edges inside, expansion (Radicchi et al. 2004), conductance (Shi and Malik 2000), cut ratio (Fortunato 2010), fraction over median degree, triangle participation ratio (Yang and Leskovec 2015), normalized cut (Shi and Malik 2000), max ODF, avg ODF, flake ODF (Flake et al. 2000), average embeddedness, average transitivity, scaled density, and size (Rossetti et al. 2019). Information on each fitness function can be found in Additional file 2. Additionally, we used PCA to determine the eigenvector (Gan and Djauhari 2012), which represents the weight assigned to each fitness function when computing the overall fitness, and picked the fitness functions most strongly related to the findings of the community algorithms.
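Several of these fitness functions depend only on the number of nodes in a community, the number of edges inside it, and the number of edges leaving it. The sketch below computes a subset of them using their commonly cited definitions, which are assumed here and may differ in detail from those in Additional file 2; the function name and toy graph are hypothetical.

```python
def community_fitness(edges, community, n_nodes):
    """Compute a few fitness scores for one community of an undirected graph.

    edges: list of undirected edges (u, v); community: set of nodes;
    n_nodes: total number of nodes in the graph.
    """
    community = set(community)
    n_s = len(community)
    # m_s: edges with both endpoints inside the community.
    m_s = sum(1 for u, v in edges if u in community and v in community)
    # c_s: edges with exactly one endpoint inside the community.
    c_s = sum(1 for u, v in edges if (u in community) != (v in community))
    pairs_inside = n_s * (n_s - 1) / 2
    pairs_across = n_s * (n_nodes - n_s)
    volume = 2 * m_s + c_s
    return {
        "size": n_s,
        "edges_inside": m_s,
        "average_internal_degree": 2 * m_s / n_s,
        "internal_edge_density": m_s / pairs_inside if pairs_inside else 0.0,
        "expansion": c_s / n_s,
        "cut_ratio": c_s / pairs_across if pairs_across else 0.0,
        "conductance": c_s / volume if volume else 0.0,
    }


# Toy graph: two triangles connected by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
fitness = community_fitness(toy_edges, {0, 1, 2}, n_nodes=6)
for name, value in fitness.items():
    print(f"{name}: {value:.3f}")
```

For the triangle {0, 1, 2}, the internal scores are high (average internal degree 2, internal edge density 1) and the boundary scores are low (conductance 1/7), the pattern expected of a well-separated community.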

Data acquisition
We compiled the list of diseases by text mining PubMed publications using Pubtator Central (PTC), https://www.ncbi.nlm.nih.gov/research/pubtator/. Searching PTC with the keywords "comorbid lung cancer" yielded 150 manuscripts when filtered to full text and 551 manuscripts when searching abstracts.
Each of these manuscripts mentions lung cancer together with other accompanying diseases; we took the comorbid diseases from the 551 manuscripts. We found 7183 disease names, of which 1151 were unique. For example, the PTC automatic annotation of the manuscript with PMID 34439135 yielded three disease names (Table 2).

Network based on the calculation of similarity between diseases
For the 395 comorbid diseases with known DOID, the doSim function in the DOSE library (Yu et al. 2015) calculated the DO similarity matrix. There are five calculation methods in doSim: Wang, Jiang, Lin, Resnik, and Rel. According to the number of diseases, the calculation gives a symmetric matrix of size 395 × 395. Each matrix element has a value between 0 and 1, indicating the level of similarity. The higher the value, the more similar the pair of diseases, and vice versa. During the matrix calculation, 41 diseases did not obtain a matrix value and were therefore removed. A pair of identical diseases should have a similarity value of 1. The results of the calculations using Wang, Jiang, and Lin show that the diagonal of the matrix is 1. Table 3 presents an example of the data for the Wang method. However, this is not the case for the Rel and Resnik methods; neither method assigns a value of 1 to a pair of identical diseases. In the Rel method, the diagonal contains values close to 1, e.g., for DOID 14667 (0.951) and 50117 (0.911). The Resnik method gives results even further from 1, for instance, for DOID 14667 (0.312), 50117 (0.250), 50127 (0.656), 50156 (0.886), and 9970 (0.799). On the basis of these considerations, the Rel and Resnik methods were excluded from further calculations.
Table 2 (excerpt) Mutation: -; Species: patients (4), honeycomb (1); Cell line: -
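The different diagonal values can be understood from the textbook forms of the information-content measures. The sketch below assumes those forms (Resnik scaled by the maximum information content, Lin, Schlicker's Rel, and a Jiang similarity of the form 1 − min(1, distance)); the DOSE implementation may differ in detail, and the term probabilities are hypothetical. For a term compared with itself, the most informative common ancestor is the term itself.

```python
import math


def information_content(p):
    """Information content of a term with annotation probability p."""
    return -math.log(p)


def sim_resnik(ic_mica, ic_max):
    """Resnik similarity scaled to [0, 1] by the largest IC (assumed form)."""
    return ic_mica / ic_max


def sim_lin(ic_a, ic_b, ic_mica):
    """Lin similarity: shared information relative to total information."""
    return 2 * ic_mica / (ic_a + ic_b)


def sim_rel(ic_a, ic_b, ic_mica, p_mica):
    """Rel similarity: Lin weighted by how specific the common ancestor is."""
    return sim_lin(ic_a, ic_b, ic_mica) * (1 - p_mica)


def sim_jiang(ic_a, ic_b, ic_mica):
    """Jiang similarity as one minus the capped Jiang-Conrath distance."""
    return 1 - min(1.0, ic_a + ic_b - 2 * ic_mica)


# Hypothetical probabilities: the term of interest and the rarest term.
p_term, p_rarest = 0.05, 0.001
ic_term = information_content(p_term)
ic_max = information_content(p_rarest)

# Self-comparison: both terms and their common ancestor are the same term.
self_lin = sim_lin(ic_term, ic_term, ic_term)
self_jiang = sim_jiang(ic_term, ic_term, ic_term)
self_rel = sim_rel(ic_term, ic_term, ic_term, p_term)
self_resnik = sim_resnik(ic_term, ic_max)
print(self_lin, self_jiang, round(self_rel, 3), round(self_resnik, 3))
```

Under these assumptions, Lin and Jiang return exactly 1 for identical terms, Rel returns 1 − p (close to 1 for rare terms), and Resnik returns the term's information content relative to the maximum, which can be far below 1. This matches the qualitative pattern of the diagonal values reported above.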
A graph/network was created on the basis of a threshold applied to the similarity matrix. The threshold, which ranges from 0 to 1, determines whether two nodes are connected: if the matrix element is above the threshold, the two nodes are connected; otherwise, they are not. A threshold of 0 means that every node connects to every other node, giving n(n − 1)/2 links, where n is the number of nodes. A threshold of 1 causes all nodes to be disconnected and forms a null graph. Figure 1 compares the number of links obtained with the Jiang, Lin, and Wang methods. Among the three approaches, Wang's method seems the most feasible because it forms a smooth, unbroken curve: at small thresholds, many pairs of nodes are connected by a link. In the Jiang and Lin methods, by contrast, the number of links drops abruptly and the curve breaks with a slight increase in the threshold.
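The effect of the threshold on the number of links can be illustrated with a small sketch. The 4 × 4 similarity matrix is hypothetical, with every off-diagonal value strictly between 0 and 1, and a link is counted when the similarity is strictly above the threshold, as described above.

```python
def count_links(similarity, threshold):
    """Count node pairs (i < j) whose similarity is strictly above the threshold."""
    n = len(similarity)
    return sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if similarity[i][j] > threshold)


# Hypothetical symmetric similarity matrix for four diseases.
toy_similarity = [
    [1.0, 0.2, 0.4, 0.6],
    [0.2, 1.0, 0.8, 0.5],
    [0.4, 0.8, 1.0, 0.3],
    [0.6, 0.5, 0.3, 1.0],
]

n = len(toy_similarity)
links_at_0 = count_links(toy_similarity, 0.0)    # complete graph
links_at_05 = count_links(toy_similarity, 0.5)   # moderate threshold
links_at_1 = count_links(toy_similarity, 1.0)    # null graph
print(links_at_0, links_at_05, links_at_1)
```

Sweeping the threshold from 0 to 1 and plotting the count for each similarity method would produce curves of the kind compared in Figure 1.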
Disease similarity is measured on the basis of functional associations between genes, and it serves as a data source for building biomedical databases. In the terminology graph, the similarity of two diseases is represented by a link that connects their nodes; an edge connects two nodes because the diseases are similar according to the disease ontology. The greater the similarity between two diseases, the more closely related they are and the more information they share (Su et al. 2019). Conversely, the smaller the similarity value, the less similar the two diseases. A threshold value determines whether a link exists. A lower threshold results in a denser network, whereas a high threshold forms fewer node clusters and gives lower modularity. With these considerations, a moderate threshold of 0.5 is used (Additional file 3). Figure 2 shows the degree distribution at a threshold of 0.5, which is used in constructing the community networks. The threshold minimizes the proportions of false positives and false negatives (Bettembourg et al. 2015). A similarity threshold of 0.5 filters out low disease similarities that do not represent a link in the network well (Zhao and Wang 2018). The threshold changes the network structure significantly: at a threshold of 0 the network is a complete graph, whereas at a threshold of 1 it is a null graph. A network was developed from the disease similarity matrix calculated using the Wang method. The network has 338 nodes and 1609 edges; the average number of neighbors is 13.639, and the clustering coefficient is 0.796. There are 23 connected components (subgraphs in which every pair of nodes is connected by a path), containing 144, 35, 29, 25, 11, 11, 11, 8, 8, 7, 6, 6, 6, 5, 5, 5, 4, 2, 2, 2, 2, 2, and 2 nodes, respectively (Fig. 3).
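Finding the connected components of the thresholded network and ordering them by size can be sketched with a breadth-first search. The edge list is a hypothetical toy graph, not the disease network itself.

```python
from collections import deque


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph with three components of sizes 4, 2, and 2.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6), (7, 8)]
components = connected_components(toy_edges)
sizes = [len(component) for component in components]
largest = components[0]
print(sizes, sorted(largest))
```

Applied to the disease network, the first element of the sorted list would correspond to the 144-node component, and the list of sizes to the component sizes reported above.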
In this study, we selected the largest connected component, containing 144 nodes. The graph's largest connected component has a distinct community structure, as opposed to the second or third. This is accomplished by grouping the nodes of the largest component into nonoverlapping cohesive subgroups. Most identified groups are strong because each node collaborates with nodes from its own group more frequently than with nodes from other groups (Savić et al. 2015). The largest component has the highest modularity, and the fewer nodes a component has, the lower its modularity value (Additional file 4). The disease group in the largest component is heterogeneous, whereas the second and third largest components are clustered into groups such as cancers other than lung cancer and psychological disorders. [...] Table 4). The modularity calculation formulas are expressed in Eqs. (1)-(5).
The calculation and sorting based on each modularity formula produced a ranking of the 20 community algorithms, together with the overall modularity calculated by PCA. The calculation gives an eigenvalue of 4.322 and an eigenvector component for each of the modularity measures (Newman Girvan, Erdos Renyi, link modularity, modularity density, and Z modularity). In Fig. 4, the left side shows the complete nodes in each community, and the right side illustrates the relationships between the clusters. The results of the five best algorithms are found in Additional file 5. The label propagation algorithm divides the community into nine clusters associated with respiratory system disease, vascular disease, immune system disease, bone disease, metabolism disease, atrial heart septal defect, pancreatic disease, familial atrial fibrillation, and persistent generalized lymphadenopathy. The spinglass algorithm divides the community into eight clusters associated with interstitial lung disease, vascular disease, immune system disease, metabolism disease, bone disease, atrial heart septal defect, pancreatic disease, and familial atrial fibrillation. The Louvain and RB Pots algorithms both produced seven similar clusters associated with interstitial lung disease, vascular disease, immune system disease, bone disease, metabolism disease, atrial heart septal defect, and pancreatic disease. The essential diseases are diabetes mellitus, vascular disease, respiratory system disease, immune system disease, bone disease, pancreatic disease, and familial hyperlipidemia. Cluster 2, which relates to vascular disease, is central among the lung cancer comorbid diseases in every community algorithm applied.

List of comorbid diseases in each community
Our aim is to find the diseases considered significant, namely those with the greatest centrality in each community. The list of significant lung cancer comorbidities in each community is determined from the betweenness, degree, closeness, and eigenvector centrality computed within each community. The diseases found were relatively consistent across the communities formed by algorithms such as label propagation, spinglass, Chinese whispers, Louvain, and RB Pots. Across the different community algorithms and centrality measures, vascular disease, immune system disease, and pancreatic disease are commonly encountered (Table 5). These five algorithms find relatively similar community patterns when identifying significant comorbid diseases. Diseases that occur in the same community are similar to each other. For example, communities with respiratory/interstitial system diseases group primary lung disease with associated respiratory disease, emphysema, and pneumonia (Fig. 5). The disease groups formed in each cluster can be investigated further for their relationships, especially regarding the similarity of symptoms, anatomy, cells, genes, phenotypes, and potential for treatment. This is important for precision medicine in cancer patients with comorbidities, especially to improve diagnosis and ensure safe therapy.

Determine various fitness functions
We calculated the fitness scores, consisting of average embeddedness, average internal degree, average transitivity, conductance, cut ratio, edges inside, expansion, fraction over median degree, internal edge density, normalized cut, max ODF, avg ODF, flake ODF, scaled density, size, and triangle participation ratio (Table 6). The higher the fitness score, the better the result. The fitness functions chosen are those with a significant relationship to the community. The PCA calculation gives an eigenvalue of 5.333 with eigenvector components of 0.447169, 0.447227, 0.447224, 0.447224, and 0.447224. On the basis of these results, the fitness functions that correlate most significantly with community formation are average internal degree, size, and edges inside.
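The use of PCA to derive weights, here for the fitness functions and earlier for the modularity measures, can be sketched as follows. This is a minimal illustration that assumes PCA on the correlation matrix and obtains the leading eigenpair by power iteration; the exact PCA variant used in the study is not specified in the text, and the score table and function names are hypothetical.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of a row-major table."""
    n, k = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
            for j in range(k)]
    z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(k)]
            for a in range(k)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector by power iteration.

    Starts from a positive vector, which suits positively correlated columns.
    """
    k = len(matrix)
    vector = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vector[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    value = sum(vector[a] * sum(matrix[a][b] * vector[b] for b in range(k))
                for a in range(k))
    return value, vector


# Hypothetical scores: one row per algorithm, one column per measure.
toy_scores = {
    "algorithm_1": [0.80, 0.75, 0.30],
    "algorithm_2": [0.70, 0.72, 0.25],
    "algorithm_3": [0.60, 0.55, 0.28],
    "algorithm_4": [0.50, 0.52, 0.15],
}
eigenvalue, weights = leading_eigenpair(correlation_matrix(list(toy_scores.values())))

# Overall score of each algorithm: eigenvector-weighted sum of its measures.
overall = {name: sum(w * x for w, x in zip(weights, row))
           for name, row in toy_scores.items()}
ranking = sorted(overall, key=overall.get, reverse=True)
print(round(eigenvalue, 3), ranking)
```

In this sketch, k perfectly correlated columns give an eigenvalue of k and equal eigenvector components of 1/√k, which is about 0.447 for five columns.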

Community algorithms comparison
The heatmap of the community algorithms reveals the closeness between algorithms (Fig. 6).

Intersection result of comorbid disease among algorithms
The results are summarized in a Venn diagram drawn with InteractiVenn (Heberle et al. 2015) from the lists of comorbidities, using the ensemble majority vote method. The numbers in the Venn diagram are the numbers of comorbid diseases produced by each community algorithm. The results are shown in Fig. 7, and the disease details are presented in Table 7. Every row of this table shows the diseases identified by a particular combination of algorithms.
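A majority vote over the comorbidity lists can be sketched as follows. The algorithm names come from the text, but the disease labels are placeholders and the rule (a disease is kept when more than half of the algorithms report it) is an assumption about how the ensemble vote is applied.

```python
from collections import Counter


def majority_vote(results):
    """Return diseases reported by more than half of the algorithms.

    results: {algorithm name: iterable of disease names}.
    """
    votes = Counter(disease
                    for diseases in results.values()
                    for disease in set(diseases))  # one vote per algorithm
    needed = len(results) // 2 + 1
    return sorted(disease for disease, count in votes.items() if count >= needed)


def found_by_all(results):
    """Return diseases reported by every algorithm (the central Venn region)."""
    lists = [set(diseases) for diseases in results.values()]
    return sorted(set.intersection(*lists))


# Placeholder disease labels, not the study's actual lists.
toy_results = {
    "label propagation": ["disease A", "disease B", "disease C"],
    "spinglass": ["disease A", "disease B", "disease D"],
    "Chinese whispers": ["disease A", "disease C"],
    "Louvain": ["disease A", "disease B"],
    "RB Pots": ["disease A", "disease B"],
}

majority = majority_vote(toy_results)
unanimous = found_by_all(toy_results)
print(majority, unanimous)
```

The unanimous set corresponds to the region of the Venn diagram shared by all five algorithms, while the majority set is larger and also admits diseases missed by one or two algorithms.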
For example, the first row shows that all of the community algorithms found vascular disease, immune system disease, bone disease, and pancreatic disease to be significant lung cancer comorbidities. According to the existing references, these algorithms have succeeded in detecting various significant comorbidities. The results can also be viewed through the DOID hierarchical structure, which organizes diseases into upper-level groups (Additional file 6). The major comorbidity in patients with lung cancer is cardiovascular disease, at approximately 23% (Pavia et al. 2007). The immune system dysregulation associated with autoimmune diseases increases the risk of cancer; standardized incidence, standardized mortality, and hazard ratios indicated an increased risk of lung cancer (Hemminki et al. 2012). Bone metastases are common in patients with lung cancer and are associated with shorter overall survival (Kuchuk et al. 2013). Pancreatic metastases are found in advanced small cell lung cancer. In autopsy studies, pancreatic metastasis occurs in between 1.6 and 10.6% of cases. The primary tumor is usually in the left lung, and 15% of patients have pancreatic metastasis (Gonlugur et al. 2014). There is a lipid metabolism disorder in lung cancer (Merino Salvador et al. 2017). When patients with lung cancer have comorbid interstitial lung disease, the average survival at diagnosis is worse than for those without comorbidities (Margaritopoulos et al. 2017). Among 159,615 patients diagnosed with lung cancer in 2016, 10,050 (6.29%) had a concurrent diagnosis of atrial fibrillation (Bandyopadhyay et al. 2019). These patients frequently have tobacco-related illnesses (e.g., respiratory diseases) because of the much higher incidence of lung cancer in smokers and ex-smokers (Leduc et al. 2017).
Figure 7 Similarity matrix among five algorithms denoted via heatmap. The lighter the color, the more similar. All of the methods have high similarity (more than 0.900)
Conversely, patients with lung cancer who have diabetes mellitus have a higher survival rate than those without (Hatlen et al. 2011). Additionally, comorbid hyperlipidemia is associated with a significant reduction in mortality in patients with lung cancer (Lazzarini et al. 2016). Within the respiratory system disease group, COPD is a disease often found as a cancer comorbidity. A limitation of the community approach used here is that it does not take into account the prevalence and severity of comorbid diseases, which could be attached as node weights in the network. The directed acyclic graph structure of vascular disease, immune system disease, bone disease, pancreas disease, and lung cancer can be described in terms of their relationships in the disease ontology, as shown in Fig. 8. All of these diseases belong to the group of diseases of anatomical entity. Finally, the fitness functions average internal degree, size, and edges inside correlate with the grouping of the community algorithms and support the results. In the swarm plot, the Louvain and RB Pots algorithms have similar results for edges inside, while label propagation and spinglass have identical results (Fig. 9). The size comparison of the five algorithms shows that the Chinese whispers algorithm differs significantly from the other algorithms (Fig. 10). Nevertheless, the Chinese whispers algorithm is close to the Louvain and RB Pots algorithms (Fig. 11).
Fig. 8 Diseases ontology on vascular disease, immune system disease, bone disease, pancreas disease, and lung cancer
Fig. 9 Comparison of the size of the edges inside the five algorithms
Network analysis was used by Folino et al. (2010), together with association rules, to predict the risk of comorbid diseases in patients and to reveal the network of comorbidity occurrence. Ljubic et al. (2020) also conducted a network analysis to obtain genes that are associated with colorectal cancer and its comorbidities. Chmiel et al. (2014) studied the spreading of diseases through comorbidity networks across life and gender. Although these three studies used network analysis, they did not use community detection and therefore could not reveal groups of closely related diseases, which is the hallmark of this study. Table 8 provides a comparison of these studies.